Performance Matters

The very definition of software performance will vary depending on whom you ask. If you asked the end user, he would immediately mention the “speed” of the application he has to work with. If you asked the CIO, he would probably define performance as “throughput”, measured in transactions per second. Finally, if you asked the IT guy who has to deal with the hardware end of the system, he would say that he needs scalability, so that his duties are limited to provisioning more hardware when demand increases. All of these (response time, throughput and scalability) are desired components of software performance.

I have spent the last 12 months working pretty much continuously on performance optimisation, and James Saul asked me to share some of my findings with a wider audience. To start somewhere, I went to dig up some resources on Wikipedia and came across an interesting article on performance engineering. According to the article, one of the objectives of this discipline is to “Increase business revenue by ensuring the system can process transactions within the requisite timeframe”. In other words, performance is money, and there is probably no better example of how it is lost than the total meltdown of the Debenhams website which took place just before last Christmas. I have to admit that I have no idea what went wrong at Debenhams, but I can easily imagine a number of ways to build a software product which breaks under heavy load. As they say, there is more than one way to skin a cat (and to build poor quality software), but this time round I will focus primarily on “process” issues rather than particular technical aspects.

Small database syndrome (aka SDS)

Personally I think that SDS is the major contributor towards building poorly performing programmes: if the development team works against a tiny database, they are very likely to get into serious trouble further down the line, and there are a number of reasons for it. The most obvious is the fact that production will hold more data (surprise, surprise), so naturally more work will be required to get whatever you want out of the database. Secondly, query plans will be turned upside down in light of larger tables, and the distribution of the data will influence them heavily as well. And last but not least, when working against a small dataset it is impossible to spot any potential performance problems, as everything will (or at least should) execute rather quickly.
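To make the query plan point concrete, here is a minimal sketch using SQLite (table, column and index names are all invented for illustration), showing how the plan for the very same query changes once a suitable index exists:

```python
import sqlite3

# Illustrative sketch only: the "orders" schema below is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, i * 1.5) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reveals whether SQLite will scan the whole table
    # or seek through an index to answer the query.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer_id = 42"
print(plan(query))  # full table scan: tolerable on 1 000 rows, painful on 10 million

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan(query))  # the very same query now does an index search
```

On a development database of a few hundred rows both plans finish in microseconds, which is exactly why SDS hides the difference until production data arrives.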

The best example of a spectacular “volume related” failure I have witnessed came not so long ago: an application which, when fired for the first time against a fully populated database, executed 40 000 SQL queries during its start-up, and the whole procedure took the better part of 40 minutes. To add insult to injury, some of the tables involved in the queries were missing rather crucial indexes while others were indexed incorrectly (not that it matters a lot when you execute 40 000 queries to start one instance of the app). This potential fiasco made everyone involved in the project somewhat embarrassed, and steps were taken to avoid such mishaps in the future. Luckily for the team, the accident happened early enough in the project lifecycle that fixing it was relatively cheap and easy. But as you can hopefully see from this example, SDS is a serious risk, and I find it somewhat difficult to understand that people oftentimes try to find all possible excuses not to use a properly sized database for development and/or testing. The one I hear most often is related to cost, measured in terms of either time or money: resources which someone has to spend to produce the data. But given the availability of data generation tools like the one provided by Redgate, this is indeed a very poor explanation. It is even worse considering that the cost of maintaining such a dataset is just a fraction of the total cost of the project.
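I obviously cannot reproduce that application here, but the anti-pattern behind most “thousands of queries at start-up” stories is easy to sketch: issuing one query per row (the so-called N+1 problem) instead of a single set-based query. A minimal illustration with an invented schema, using SQLite's trace callback to count the statements actually executed:

```python
import sqlite3

# Invented schema, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, 10.0) for i in range(1000)])

executed = []
conn.set_trace_callback(executed.append)  # record every statement SQLite runs

# Anti-pattern: N+1 round trips, one query per customer.
for (cid,) in conn.execute("SELECT id FROM customers").fetchall():
    conn.execute("SELECT SUM(total) FROM orders WHERE customer_id = ?",
                 (cid,)).fetchone()
n_plus_one = len(executed)

executed.clear()
# Set-based alternative: one query returns the same information.
rows = conn.execute("""
    SELECT c.id, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()
batched = len(executed)

print(n_plus_one, batched)  # 101 statements vs 1
```

Scale the loop from 100 customers to a production-sized table and you are well on the way to a 40 000 query start-up.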

“We will have better hardware in production”

This is another one of my favourites, which I hear a lot when people testing an application realise that something is not quite right performance-wise. Accepting that the app is sluggish usually means that someone has to admit to a failure of sorts, and nobody likes that. So people usually go into denial and try to find excuses not to tackle the problem there and then. If you consider that most of us developers work on single processor machines, it is not hard to see how people fall into this trap, but even so, basic calculations often prove that hoping to kill the problem with hardware may be nothing more than wishful thinking. Let me illustrate with an example: let's consider a sample operation which takes 10 seconds on a modern single processor PC with plenty of RAM. It is easy to imagine that production hardware may be 20 times more powerful, leading to the false conclusion that in production the same process will take 1/20th of 10 seconds, i.e. 500 ms. Job done. The flaw in such reasoning is, first of all, the assumption that the production hardware will be serving one user at a time, or that concurrent user load will have no influence on performance. Secondly, the more powerful hardware may indeed have 20 times the capacity of the PC, but this capacity will be available only when you are able to parallelise the algorithm! If the original process is sequential (single-threaded) in nature, adding more processors to the server will not change response time at all. So the only conclusion we can draw from running software on inferior hardware is that if it works on a PC, there is a chance that it will work on a big server, provided of course that the software is scalable. On the other hand, if it does not work well on your PC, the chances that it will ever work anywhere else under substantially heavier user load are close to zero.
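The back-of-the-envelope reasoning above is just Amdahl's law. A quick sketch with the numbers from the example (the 90% figure below is an invented illustration, not a measurement):

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: overall speedup when only a fraction of the work
    can be spread across extra processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

base_seconds = 10.0  # the sample operation from the text

# A purely sequential process gains nothing from a 20-way machine...
print(base_seconds / amdahl_speedup(0.0, 20))  # still 10.0 seconds

# ...and even a 90% parallelisable one falls far short of the hoped-for 0.5 s.
print(base_seconds / amdahl_speedup(0.9, 20))  # roughly 1.45 seconds
```

Only a process that is 100% parallelisable would actually hit the naive 500 ms estimate, and that is before any concurrent user load is taken into account.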

“We have no [time|requirement|money|resources] for performance testing”

Some wise people say that if you have no time for testing, then you had better have time for fixing last minute bugs and patching the app. The same is pretty much true when it comes to performance. When building systems which will potentially face high user load, it is absolutely imperative that load and stress testing be executed, unless you want to face a fate similar to the website I mentioned earlier. I may be biased here because I like to load test software, but load testing the app is probably the best way to make sure that it actually works. Let me give you an example: about 18 months ago I participated in a POC at Microsoft's, working on a website for an airline. Together with another guy from EMC, we were responsible for the back end of the system: the database and the WCF based app server. As we finished our job earlier than expected, I decided that it would not be a bad idea to make use of the available resources, take the thing for a spin and see what it could do. The app server was running on an 8-way 64 bit machine with ample amounts of RAM, so I whacked together some unit tests simulating users' journeys through the website, plugged the whole lot into a VSTS load testing machine and pressed the green button. As soon as I pressed it, we discovered that the whole thing grinds to a rather embarrassing halt within several seconds of being started… After a bit of head scratching, we decided that it is a rather good idea to close server connections once you are done with them, and repeated the test, scoring a rather measly result of 100 method invocations per second.
To cut a long story short, during the next few days we discovered that from a performance point of view it is actually wiser to use ADO.NET rather than LINQ to SQL, that when building high performance systems it is better to have network cables which work at the full capacity of the switch they are connected to, and that SQL Server 2008 rocks: it could take the load from 3 app servers before becoming fully saturated. In the meantime our load testing machine ran out of puff and we needed two more 4-way boxes in order to generate enough load to saturate the app server. The end result of this exercise was an 18-fold (sic!) increase in system performance, not to mention the fact that the system was happy to work for hours on end. And when it came to the presentation of the finished website, everyone was raving about how quick the whole thing was. The moral of the story, however, is that things will inevitably break under heavy load. If you load test them before handing them over to the users, chances are that the users will get a much more robust system and you, as the application developer, will save yourself potentially huge embarrassment.
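The WCF-specific fix from the story does not translate directly here, but the underlying lesson (release server connections deterministically the moment you are done with them, rather than leaving them to the garbage collector) can be sketched in Python, with SQLite standing in for any server connection. Names and schema are invented for the demo:

```python
import os
import sqlite3
import tempfile
from contextlib import closing

def count_rows(db_path, table):
    # `closing` guarantees the connection is released even if the query
    # raises: the "close it once you are done with it" fix from the story.
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Demo setup: a throwaway database file with a handful of rows.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(path)) as conn:
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(5)])
    conn.commit()

print(count_rows(path, "orders"))  # 5
```

Under a load test the difference is stark: code that leaks connections exhausts the pool within seconds, exactly as our first run did.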

PS: I know that this post is barely technical, but I promise to improve next time round :)

August 8 2009