Monday, October 4, 2010 – you suck - work in progress

Disclaimer - these words are my own thoughts - they do not convey the message from either of the companies I own or part own nor do they convey the message of anyone I deal with in any business matters. This is my personal blog.

Another year, another rush for tickets. Another year of getting hyped up for the main event of the year. The doom and gloom of winter and October brings the positively amazing ability to pay your £50 deposit (or pay outright) for your Glasto ticket for the following year. What should be a smooth and faultless process (given it is 2010) has always been a complete sham – for as long as the responsibility of graceful execution of this process has lain at the foot of  at least.

This isn’t just another rant about another failed ticket attempt. I’ve been wanting to write this for years, I just never thought they’d drag it out this long to actually give me the motivation to write this. I don’t accept that there is any excuse in this day and age to run a commerce website which behaves so horrifically when it is asked to do what it was (or at least bloody well should have been) designed to do. I don’t understand why those great guys at Glasto put up with it and I don’t know why the developers and managers at don’t pull their fingers out and fix it.

What qualifies me to say this? Before I start breaking this issue down and eventually offering some constructive advice for the guys at some background on me might help. I’m Wayne. I’m a contract software developer. For those who don’t know what that is, I generally get called in for one of the following reasons:

·      To work on projects nearing a very tight deadline and the shit’s about to hit the fan.
·      To work on projects where the in-house developers have brought a project to its knees, the staff need guidance and the project needs straightening out.
·      To work on projects the in-house developers don’t have either the time or the skills to carry out.

In all honesty it’s generally the third option. However I regularly give architectural advice and need to know a broad spectrum of technologies intimately. The projects I work on are in the £10,000,000 range.

I’m currently working on an e-commerce integration project for a large DVD/games website. Dealing with massive throughput and huge figures in terms of credit card transactions. We’re integrating with IBM WebSphere Commerce Server – the total bundle license for the software we’re using runs into the millions. This is important – I’ll tell you later on.

Back to

What’s unacceptable about their performance this weekend? And last year? And the year before that? You can’t run a website in this day and age that fails so horrifically. While I was trying to use the site yesterday all I got was page not found errors. From 9:00am all the way through to 13:30 – that’s 4.5 hours – the site was completely screwed, bar the lucky few who did get in. No page apologizing for not being able to serve the payment page, nothing. Users who did get in and managed to enter their card details (I suspect – this is based on previous years’ horrors) probably received a long wait and another 404 page not found error after clicking the submit payment button. Did they take the money? Was I just a victim of some horrific scam? That’s what they’ll be asking themselves. Did it only take the money once (this happened to us once, where the payment was taken twice) – will I be able to afford my food shopping this week if they did take it more than once?!

Perhaps the good guys at don’t have over £1,000,000 to spend on the latest and greatest commerce offering from IBM with safe enterprise messaging and clusters of DB servers to offer fault tolerance and redundancy.

You don’t need it – and that’s the joke.

Make the most of what the open source community offers. It’s what a lot of the big guys do; it’s what a lot of the little guys do. Blowing many millions on a software license is a preference – not a necessity.  Let’s have a look at what could do to improve their service – on a budget – that will instantly provide a better service. I’m sure they don’t need to operate on a budget – they must be making a mint – but let’s do this on a budget anyway.

There are a lot of issues involved here. It’s not a simple problem – but that’s not to say it’s hard. It’s not to say they’re the first people to go through growing pains – although most people outgrow the growing pains pretty quickly or they sink. The first and main issue they face is the lack of ability to ramp their hardware capacity from a trickle of users to a burst of (probably) somewhere between 300,000 and 500,000 users. Being able to provide the necessary redundancy to cope with this was traditionally very, very expensive – not any more. Hardware is cheap – very, very cheap.

But the problem isn’t a simple hardware issue. More hardware will get you some way towards easing the congestion (I remember reading about 5 years ago that the website was run off a single Windows 2000 machine – I hope that’s not the case anymore!). Imagine the machine as a baggage handling depot at Heathrow. Imagine if the depot was just one guy and one machine responsible for handling everyone’s baggage. Imagine he’s OK if there’s just one two-seater plane landing at any one time and one two-seater taking off. Simple. He can handle that. Now imagine if that, all of a sudden, turned into the Heathrow

Well firstly – he can’t process the thousands of mounting bags at his feet – not in a  million years.  What can we do to help our man out?

To start – he can’t be responsible for taking the bags, processing them and handing the luggage to departures. Why don’t we split that out? We have one area where new baggage arrives, one area where all the processing goes on and one where people can pick up the baggage.

That’s made it simpler but the person responsible for taking the baggage is now under the gun, with thousands of people trying to get their luggage accepted all at once – it’s still not right. Well, we can just put more staff at the front desk – that way we can have 10 or 20 people accepting new baggage. How does this relate to an e-commerce website? Think of the guys on the front desk as web servers and the guys processing the baggage as application servers. If a site comes under huge strain we can do a thing called load balancing, where we split all the traffic and send it to many servers. Of course, this isn't the silver bullet fix for all of the issues, but again, it's going to get them some way to solving the problem - and I suspect this is where is now - or at least I hope.
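To make the front desk idea a bit more concrete, here's a rough sketch of the simplest form of load balancing (round robin), where each new request just goes to the next server in the rotation. It's only an illustration: the class name and the server names are made up, and a real site would use a proper load balancer rather than a few lines of Python.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next web server in turn."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self._servers = cycle(servers)

    def route(self, request_id):
        # Pick the next server in the rotation for this request.
        server = next(self._servers)
        return server, request_id


# Hypothetical server names, purely for illustration.
balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
assignments = [balancer.route(n)[0] for n in range(6)]
print(assignments)
```

Six requests end up spread evenly across the three servers, two each. Adding more front desk staff is just a longer list.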

What about then, if we have too many people accepting the baggage and we get a massive pile of unhandled baggage building up inside the depot? Same thing - we're stuffed. A pile of baggage is a pile of baggage, whether it's on the front desk or behind the front desk. Well, luckily we can pull the same trick - we can add more baggage processing machines and staff.

Now we're talking. We have redundancy, so that when things are slow we have 50% of our baggage handling machines sitting idle, but when things go potty we can ramp up to handle the traffic. But baggage handling machines are still expensive to run and maintain - and having loads of machines sat around doing nothing is wasteful.

What can we do now? 

Well - how about managing the capacity of throughput without having to scale to unlimited potential? How can we do that?

Let's think about the problem at hand. What actually happens? The customers give the baggage to the check-in desk, the check-in desk does some minimal processing and hands that baggage to the baggage handlers, who do all the magic and process the baggage. The customer doesn't need to wait until the baggage has reached its destination before moving on. They just want to go away in the knowledge that when the time comes to get their baggage - it will just be there. Where they expect it. They get a little bit of paper that says - we've taken your baggage and we're going to do what you'd expect with it - don't you worry about a thing - go grab a beer and enjoy your flight. There's an agreement about the level of service, and the expected outcome can be based upon this. Your baggage is added to a queue on a conveyor belt and processed in time.

Note: in software terms - the message queue isn't something an end user is even aware of. It simply (in terms of a ticketing system) allows the system to take the details of however many people they have tickets for and then release a page saying there are no more tickets. The users are oblivious to the queue. If the form which the user fills out is being hammered then the ticketing system should use a circuit breaker and release a static page with an explanation (described below).
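As a rough sketch of that note, assuming a fixed allocation of tickets: orders are put on a queue until the allocation is spoken for, and after that every visitor gets the static "no more tickets" page. The class name, the page strings and the numbers are all invented for the example.

```python
from collections import deque


class TicketSale:
    """Accepts orders onto a queue until every ticket is spoken for."""

    def __init__(self, tickets_available):
        self.tickets_left = tickets_available
        self.pending = deque()  # orders accepted but not yet processed

    def submit(self, customer, quantity):
        # Not enough tickets left for this order: show the static page.
        if quantity > self.tickets_left:
            return "sold-out page"
        # Reserve the tickets now; payment is handled later from the queue.
        self.tickets_left -= quantity
        self.pending.append((customer, quantity))
        return "order received page"


sale = TicketSale(tickets_available=3)
results = [
    sale.submit("anna", 2),
    sale.submit("ben", 1),
    sale.submit("cara", 1),
]
print(results, len(sale.pending))
```

The third customer gets the sold-out page straight away, and the two accepted orders sit in the queue waiting to be processed whenever the back end gets to them.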

In web terms this would mean that when you enter your credit card details - we don't need to take the payment right now, we just add your request to a queue and process it in time. That way if the application server that processes all of the payments gets bogged down, the front end that handles the website doesn't fail and die. In software this is called a message bus. Whereas for my current project the license for the IBM enterprise service bus cost in excess of £1,000,000, there are (of course) very capable and very stable open source/free or at the very least very, very cheap offerings:
  • NServiceBus (free)
  • Rhino.ETL (free)
  • MSMQ (free)
  • AppFabric - part of Windows Azure cloud services (cheap)
  • Amazon ESB (cheap)
  • MassTransit (free)
What this does is effectively allow us to queue up all of the requests in a way that releases the pressure from both the web servers and the application servers. By definition this type of design also implies another pattern called bulkhead (as in - if Titanic used bulkheads - we'd still be sailing the good ship Titanic). This is where you accept that failure is a reality and you design to deal with it, by breaking up the system into components that don't bring down the rest of the system when they fail. If the card processing component can't keep up with the requests, should the whole system fail? NO - of course not!
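Here's a minimal sketch of the queue-plus-bulkhead idea using nothing but Python's standard library. It is not how any of the products listed above work internally; it just shows the shape of it. The front end drops each order onto a bounded queue and returns immediately, a separate worker charges cards at its own pace, and if the queue fills up the front end says "busy" rather than falling over. The function names, the queue size of 100 and the order dictionaries are all assumptions for the example.

```python
import queue
import threading

# Bounded queue: the "bulkhead" between the web front end and payments.
payment_queue = queue.Queue(maxsize=100)
processed = []


def accept_payment_request(order):
    """Front end: queue the order and return straight away."""
    try:
        payment_queue.put_nowait(order)
        return "accepted"
    except queue.Full:
        # Payment side is backed up; the front end stays alive and says so.
        return "busy, try again shortly"


def payment_worker():
    """Back end: work through queued orders at its own pace."""
    while True:
        order = payment_queue.get()
        if order is None:  # sentinel telling the worker to stop
            payment_queue.task_done()
            break
        processed.append(order)  # stand-in for actually charging the card
        payment_queue.task_done()


worker = threading.Thread(target=payment_worker)
worker.start()
responses = [accept_payment_request({"order_id": n}) for n in range(5)]
payment_queue.put(None)  # no more orders in this demo
payment_queue.join()
worker.join()
print(responses, len(processed))
```

The point of the bounded queue is that a slow payment processor can only ever cost you a full queue. The web side carries on answering, which is a lot better than a 404 after you've typed your card number in.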

Next up we have a thing called self-monitoring systems. Sounds simple - and pretty much is. You code the system to monitor itself by creating requests to itself - if the processing time of any of those requests begins to lag beyond an acceptable threshold you create a cutoff and provide a static HTML page offering an apology, explanation and alternative action.
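A rough sketch of what that could look like, assuming the "request to itself" is just a function we can call and time. The class name, the threshold and the rule of three slow probes in a row are all made up for illustration; a real system would also want a way to switch itself back on once things calm down, which is left out here.

```python
import time


class SelfMonitor:
    """Trips a cutoff when probe requests take longer than a threshold."""

    def __init__(self, probe, threshold_seconds, max_slow_probes=3):
        self.probe = probe  # callable that makes a request to our own site
        self.threshold_seconds = threshold_seconds
        self.max_slow_probes = max_slow_probes
        self.slow_probes = 0
        self.tripped = False

    def check(self):
        # Time one probe request and count it if it was too slow.
        started = time.monotonic()
        self.probe()
        elapsed = time.monotonic() - started
        if elapsed > self.threshold_seconds:
            self.slow_probes += 1
        else:
            self.slow_probes = 0  # a healthy probe resets the streak
        if self.slow_probes >= self.max_slow_probes:
            self.tripped = True

    def page_for_visitor(self):
        # Once tripped, serve the cheap static apology instead of the form.
        return "static apology page" if self.tripped else "payment form"


# Simulate a site that has started to lag: each probe takes about 50ms.
monitor = SelfMonitor(probe=lambda: time.sleep(0.05), threshold_seconds=0.01)
for _ in range(3):
    monitor.check()
print(monitor.page_for_visitor())
```

Serving a static page costs next to nothing, so even a site on its knees can manage an apology and an explanation instead of a page not found error.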

There's a million and one things we could do to make more effective when trying to deal with the annual wrath of Glasto fans. I've detailed some here which would turn their site into a more capable system overnight and at no real extra cost. The devs just need to use their loaves, remember why they got into software in the first place and pull their fingers out.

Finally - last year they were running ASP3 (a technology no self-respecting company would have used out of choice for the last 10 years - in tech terms that's about a thousand human years) - I doubt that will have changed this year. It is susceptible to SQL injection attacks, probably old buffer overrun attacks and god only knows what else. It's not efficient, secure, reliable, scalable, robust or well designed, and it has none of the features of a modern framework. In the context of data integrity, user experience and the public image of Glastonbury, it's not responsible of them to continue to use such crap and unsuitable technology - especially when you consider sorting out the mess is free and is only a case of effort on their part.

Note: ASP3 is good for beginners starting out programming (in my opinion it isn't academic enough and could teach some anti-patterns and bad habits, aside from the point that the newer .NET iteration is just as easy to learn, but I accept the point). It's simple and quick to turn around. That doesn't mean it should be used in any large scale or high throughput systems. My point is that it's much, much harder to protect against huge security flaws such as SQL injection attacks. It's much, much harder to test, so any guarantees a company gives to its client about stability can't be upheld. Any company that deals with a company who uses ASP3 would not be able to offer any sort of high end guarantee regarding service level agreements. There simply is no reason to use it.
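For anyone wondering what a SQL injection attack actually looks like, here's a small sketch. It uses Python and an in-memory SQLite database purely as stand-ins (not ASP3, and not whatever the ticket site really runs), and the table and data are invented. The first query glues user input into the SQL text; the second passes it as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, card_last4 TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("alice", "1111"), ("bob", "2222")],
)

# What an attacker might type into a "name" box on a form.
malicious = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, so it rewrites the query.
unsafe_sql = "SELECT name FROM customers WHERE name = '" + malicious + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the input is passed as a parameter and treated purely as data.
safe_rows = conn.execute(
    "SELECT name FROM customers WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The glued-together query hands back every customer in the table, while the parameterised one correctly finds nobody. Frameworks that make the second style the default are a big part of why this class of bug is easier to avoid on a modern stack.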

If on the other hand - can't introduce these measures because they are committed to using their archaic code simply because they have invested so much time and effort into it and refuse to let that go - then I feel sorry for them. The monopoly won't always be such an easy one to take. When the new kid comes on the block and proves to be more agile, capable and willing to embrace more modern, industry standard, efficient and scalable technology - they will not be able to keep up.

This advice is free. It's free because you can google most of these issues and the results will be instantaneously presented for free. Any n00b developer can provide this advice for free and the means to implement it are free. The system I'm working on right now is tested to handle a minimum of 2000 requests per second - these are complex queries too. This is on a 2 server setup. In the test environment.

We're busy creating a new site for our users. Rest assured we won't be suffering the same issues suffer (should it become as popular as we hope!) - and we do it on zero budget. I'll blog about that in a few days - we're gonna need a bit of help from our followers in terms of ideas and testing but we're proper buzzed about it - we hope you are too. It seems to us that websites in the festival/music events space are making the whole thing look painful and are just there to rinse the money - that needs to change.