How we used the Chrome DevTools to optimise our HTML5 game

We wanted to experiment and see whether we could build a simple cross-platform game using the latest web tech. The idea was that it would run across the mobile platforms and be easy to develop using our existing skills and tools. For some background, we are using CocoonJS to hardware-accelerate canvas drawing, and melonJS, an open-source game engine which runs easily on Cocoon.

Our initial attempt ran smoothly at 60fps in our powerful desktop browsers; however, I was getting half that on my Galaxy Nexus. Given how simple the game was, we were concerned and set out to find out why. We develop using Chrome Canary, which ships with the latest developer tools.

CPU Profile

This was the first place we looked to see what was happening.

[Image: CPU profile of the game]

The trace tells us that we spend the majority of our time rendering rather than executing game logic. The sprites we are drawing are basic and we’ve made them as small as possible, so this was the first surprise.

[Image: flame chart]

The flame chart puts into perspective how much idle time we have on a desktop machine and shows that each frame is rendered in approximately 5ms. Initially that sounded good, but given the lack of complexity in our graphics this performance is disappointing and is a good indicator of why a mobile device might be struggling.

This still wasn’t enough to tell us what to fix, and for that we used the Canvas Debugger.

Canvas Debugger

This is an experimental Chrome feature, which means you will need to enable it manually. I used the learningthreejs blog, which has a good video explanation, but if you prefer something more textual you can follow the guide at html5rocks.

With the ability to inspect each canvas call, both at the API level and visually, we could track down where we were losing performance. Below is a GIF animation cycling through each draw call shown in the debugger:

[Animation: stepping through draw calls before the fix]

With that visual representation it became quite obvious where the extra draw calls were coming from: the background gradient is a 1px-wide repeating image! Ironically, we chose to do it this way for performance, thinking that loading a smaller image would be lighter on resources.

We were able to fix this easily in the map editor, and it resulted in a big reduction in draw calls.

[Animation: stepping through draw calls after the fix]

The GIF animation also highlights the draw call for each tile imported from the map editor; this could be a further avenue to investigate if we want to target even lower-powered devices.

[Image: flame chart after the fix]

Each frame now takes between 2ms and 3ms to complete and, more importantly, the draw portion of that has been greatly reduced. Rendering now takes only 1ms on the desktop, and the game code is finally visible in the profile.

[Image: CPU profile after the fix]

These changes were not only enough to run the game at 60fps on our mobiles, but have also allowed us to increase the animation and visual fidelity while keeping the frame rate smooth. If you are working with canvases, whether for game development or for visualisations like d3, I recommend you grab the latest Chrome tools and give them a go.

@naeemkhedarun

Getting started with Redis on Windows and a look at performance for logging

There are a few scenarios in a few different projects where I need a fast dumping ground for data. In my deployments I need somewhere quick to pump logging messages to. These messages do not need to be written to disk or read back immediately, so they can be processed shortly after the fact, out of process. I have the same requirements when rendering machine performance on a UI: I need a fast and non-intrusive way to collect data from different machines and processes which can be read quickly later.

Redis is something which has been coming up frequently when talking Linux, and much has been said about its performance, so I wanted to take a quick look at it in the context of logging. But first, let’s get it up and running. I’ll be using PowerShell, Chocolatey and npm, so get those set up first.

Let’s begin by using Chocolatey to install Redis.

cinst redis

The package should have installed a Windows service running Redis; let’s double-check using PowerShell.

C:\> get-service *redis*

Status   Name               DisplayName
------   ----               -----------
Running  redis              Redis Server

Great. Now it’s time to get a UI to make using Redis a little easier; this time the package is on npm.

C:\> npm install -g redis-commander
C:\> redis-commander
No config found.
Using default configuration.
path.existsSync is now called `fs.existsSync`.
listening on  8081
Redis Connection 127.0.0.1:6379 Using Redis DB #0

The manager runs on port 8081 by default; let's take a look in a browser (use another shell, as redis-commander will block your current one).

start http://localhost:8081

If you can see the UI then you are all set up and it’s time to push some data in and run some performance tests. Our build and deployment framework logs to the filesystem. This worked great for builds and simple deployments, but for scaled-out environments and production, which run on many machines, it is problematic. To avoid deadlocks and waits we manage a log file per machine and cat the tail into the main log when we can. We could aggregate these files post-deployment, but it’s time to rethink how we do our logging. For now I just want to check the raw speed and basic features of a few different technologies; today it’s Redis.

We’ll use Measure-Command to find out how quickly we can log a bunch of messages to a file and then to Redis. Using our framework to deploy to an environment for performance testing generates around 100K lines of logs across 123 log files and 27 machines. These get aggregated onto the deployment server for convenience, but it’s still a lot of logging written during the deployment.

measure-command { 1..100000 | %{ Add-Content -Path C:\log1.txt -Value $_ } }
TotalSeconds : 319.9256509

This command is analogous to what we do in our PowerShell framework: since we do not keep a handle open on any one file, we have to open it before every write. It takes over 5 minutes to write that many lines of logs to the file system using Add-Content. This is a long time to spend logging, and as we only write numbers in the benchmark it involves less IO than real deployment messages would. We will benchmark Redis for comparison, using the PowerRedis module.

We will make a single connection to Redis, as the cmdlet takes care of the global state of the connection.

Connect-RedisServer

And the benchmark:

measure-command {1..100000 | %{ Add-RedisListItem -Name "deployment:log" -ListItem $_ } }
TotalSeconds      : 14.0009325

A fantastic improvement, and the ability to log from multiple machines at the same time makes it a good option to consider. I can’t help but wonder how fast we could log to a file if we did it more responsibly, though.

C:\> $writer = New-Object System.IO.StreamWriter C:\log3.log
C:\> measure-command { 1..100000 | %{ $writer.WriteLine($_); } }
TotalSeconds      : 2.5816333
C:\> $writer.close()

So there are a few lessons to learn here. Avoid Add-Content if you’re writing more than a few messages. Always measure your code so you do not waste minutes(!) logging to files. And Redis is certainly fast enough for logging.
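For completeness, here is a rough sketch of what the out-of-process consumer could look like from .NET. The post only shows the PowerShell producer side, so the choice of the StackExchange.Redis client here is an assumption; the "deployment:log" key matches the benchmark above.

// A minimal sketch of an out-of-process consumer for the logged messages.
// Assumes the StackExchange.Redis client package; only the PowerShell
// producer side appears in the original post.
using System;
using StackExchange.Redis;

class LogDrain
{
    static void Main()
    {
        // Connect to the local Redis instance installed by the Chocolatey package.
        var redis = ConnectionMultiplexer.Connect("localhost:6379");
        var db = redis.GetDatabase();

        // Pop entries off the "deployment:log" list until it is empty.
        RedisValue entry;
        while (!(entry = db.ListLeftPop("deployment:log")).IsNull)
        {
            Console.WriteLine(entry);
        }
    }
}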

Performance Optimization: Rules of Engagement

On my way to work from London’s Waterloo Station one day I noticed a building on Southwark St which got me intrigued: “Kirkaldy’s Testing and Experimenting Works”. As it turns out there is a rather fascinating industrial history behind the building, but it would not be worth mentioning here if it were not for the motto above the entrance: “Facts Not Opinions”. I can hardly imagine an area of software development where the motto would be more applicable than performance engineering. You see, far too many problems with performance come from the fact that we spend our time and resources “optimizing” code in areas which do not need optimization at all. We oftentimes do it because “everybody knows that you should do XYZ” or because we want to mitigate perceived performance risks by taking “proactive” action (aka premature optimization). If we were to follow the mantra of Mr Kirkaldy, we could avoid all of the above by doing just one thing: testing and measuring (and perhaps experimenting).

So if you were to stop reading just now, please take this no. 1 rule of performance optimization with you: measure first. Measuring is not only important when fixing code: it is also vital if you want to evaluate the risk of a potential design approach. So instead of doing “XYZ because everybody knows we should”, whack together a quick prototype and take it for a spin in a profiler.


One of my favourite performance myths is that you should “always cache WCF service proxies because they are expensive to create” (and of course everybody knows that). As I have heard this technique mentioned specifically in the context of an ASP.NET web app running in IIS, I could immediately hear alarm bells ringing for miles… The problems with sharing proxies between IIS sessions/threads are numerous, but I will not bother you with the details here; my main doubt was whether a WCF proxy can be efficiently shared between multiple threads using it (executing methods on it) at the same time. So I created a simple WCF service with one method simulating a 5-second wait. I set the instance mode to “per call” and then started calling the service from 5 threads on the client side, using the same proxy shared between all of them. I used a ManualResetEvent to start the threads simultaneously and expected them to finish 5 seconds later (give or take a millisecond or two). Guess what: they did not, as they blocked each other on some of the WCF internals and the whole process took 20 seconds instead of 5. So now imagine what would have happened if you used this approach on a busy website: your “clients” would effectively be queuing to get access to the WCF service and you would end up with a potentially massive scalability issue. To make things worse, creating WCF proxies is nowadays relatively cheap (provided that you know how to do it efficiently). The moral of the story is simple: when in doubt, measure. Do not apply performance “optimisations” blindly simply because everybody knows that you should…
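For illustration, here is a minimal sketch of that client-side experiment. The SlowServiceClient proxy and its Wait() operation are stand-ins I have made up, since the original test code is not shown.

// A rough reconstruction of the experiment, not the original code.
// Assumes a per-call WCF service whose Wait() operation sleeps for 5 seconds
// server-side, and a generated client proxy called SlowServiceClient.
using System;
using System.Diagnostics;
using System.Threading;

class SharedProxyTest
{
    static void Main()
    {
        var proxy = new SlowServiceClient();   // hypothetical generated proxy
        var start = new ManualResetEvent(false);
        var threads = new Thread[5];

        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                start.WaitOne();               // wait for the starting gun
                proxy.Wait();                  // 5-second call on the shared proxy
            });
            threads[i].Start();
        }

        var stopwatch = Stopwatch.StartNew();
        start.Set();                           // release all threads at once
        foreach (var thread in threads) thread.Join();

        // Roughly 5 seconds if the calls really run in parallel;
        // the result described above was about 20 seconds.
        Console.WriteLine("Elapsed: {0}", stopwatch.Elapsed);
    }
}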

As good and beneficial as performance “measuring” can be, when doing so you may often come across a phenomenon known in quantum physics as the paradox of Schrödinger’s Cat. To put it simply, by measuring you may (and most likely will) influence the value being measured. It is important to mention this here, as profiling a live system may become infeasible simply because it would slow it down to an unacceptable level. The level of performance degradation may vary from several percent (in the case of tracing the SQL being executed using SQL Profiler) to several hundred percent when using a code profiler. Keep that in mind when testing your software, as this once again illustrates that it is far better to do performance testing in development rather than fight problems in production, when your ability to measure may be seriously hampered.

One of the funniest performance bugs I have ever come across was caused by a “tracing infrastructure” which, strangely enough, took an extreme amount of time to do its job. As it turned out, someone decided that it would be great to produce output in XML so that it could be processed later in a more structured way than plain text. The only problem was that the XmlSerializer used to create this output was created every time anyone tried to produce some trace output. In comparison with WCF proxies, XmlSerializers are extremely expensive to create, and this obviously had a detrimental impact on an application which used tracing extensively. I find it rather amusing, as tracing is one of the basic tools which can help you measure performance, as long of course as it does not influence it too much… :)
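The fix implied by the story is simply to create the serializer once and reuse it. A rough sketch follows; the TraceEntry type and Trace method are made up for illustration, as the original tracing code is not shown.

// Illustration only; the real tracing infrastructure is not shown in the post.
// The expensive part is constructing the XmlSerializer, so it is created once
// and reused for every trace call.
using System;
using System.IO;
using System.Xml.Serialization;

public class TraceEntry
{
    public DateTime Timestamp { get; set; }
    public string Message { get; set; }
}

public static class XmlTracer
{
    // Created once per type, not once per trace call.
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(TraceEntry));

    public static void Trace(TextWriter output, string message)
    {
        var entry = new TraceEntry { Timestamp = DateTime.UtcNow, Message = message };
        Serializer.Serialize(output, entry);
    }
}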

If there is one thing which is certain about software performance, though, it is the fact that you can take pretty much any piece of code and make it run faster. For starters, if you do not like managed code and the overheads of JIT and garbage collection, you can go unmanaged and rewrite the piece in, say, C/C++. You could take it further and perhaps go down to assembler. Still not fast enough? How about assembler optimised for a particular processor, making use of its unique features? Or maybe offload some of the workload to the GPU? I could go along these lines for quite a while, but the truth is that every step you take along this route is exponentially more expensive, and at a certain point you will make very little progress for a lot of someone’s money. So the next golden rule of performance optimisation is: make it only as fast as it needs to be (keep it cheap). This rule eliminates vague performance requirements along the lines of “the site is slow” or “make the app faster please”. In order to tackle any performance problem, the statement of it has to be more precise, more along the lines of “the process of submitting an order is taking 15 seconds server-side and we want it to take no more than 3 seconds under an average load of 250 orders/minute”. In other words, you have to know exactly what the problem is and what the acceptance criteria are before you start any work.

I have to admit here that oftentimes I am tasked with “just sorting this out” when the performance of a particular part of the application becomes simply unacceptable from the user’s perspective. Lack of clear performance expectations in such cases is perhaps understandable: it is quite difficult to expect the end user to state that “opening a document should take 1547ms per MB of content”. Beyond that, acceptability will depend on how often the task has to be performed, how quickly the user needs it done, and so on. So sometimes you just have to take them through an iterative process which stops when they say “yeah, that’s pretty good, I can live with that”.

So say that you have a clear problem statement and agreed expectations, you fire up a profiler, and method X() comes up at the top of the list consuming 90% of the time. It would be easy to assume that all we have to do now is somehow optimise X(), but surprisingly this would probably be… a mistake! Rule no. 4 of code optimisation is to fully understand the call stack before you start optimising anything. Way too many times I have seen developers “jump into action” and try to optimise the code of a method which could be completely eliminated! Elimination is by far the cheapest option: deleting code does not cost much and you immediately improve performance by an almost infinite number of percent (I’ll leave it to you to provide a proof for the latter statement:). It may seem as if I am not being serious here, but you would be surprised how many times I have seen an application execute a piece of code just to discard the results immediately.

And last but not least, as developers we sometimes fall into the trap of gold plating: it is often tempting to fix issues you spot here and there while profiling, but the first and foremost question you should be asking is: what will be the benefit? A method may seem inefficient (by the looks of the code), say a sequential search which could be replaced with a more efficient dictionary-type lookup, but if the profiler indicates that the code is responsible for 1% of overall execution time, my advice is simple: do not bother. I have fallen into this trap in the past, and before you know it you end up with “small” changes in 50 different source files and suddenly none of the unit tests seem to work. So the last rule is: go for maximum results with minimum changes, even if it means that you have to leave behind some ugly code which you would love to fix. Once your bottleneck has been eliminated, sure as hell another one will pop up its ugly head, so keep tackling them one by one until you reach acceptable results. And when you reach a situation where making one thing faster slows something else down, as often happens in database optimisation, it means that you are “herding the cats”, as we call it on my project, and you probably have to apply a major refactoring exercise.

My current project has a dialog box with a tree view which used to take several seconds to open. On closer investigation we realised that the problem lay in how the child elements of each tree node are retrieved: the algorithm used a sequential search through a list of all elements stored in memory, along the lines of var myChildren = allElements.Where(element => element.ParentID == this.ID).ToList(). As the dialog used WPF with a hierarchical data template, each element in the list had to perform a sequential search for its children, which gives a not-so-nice O(n²) type of algorithm. The performance was bad with ~1000 elements, but when the number of elements increased overnight to 4000, the resulting 16-fold increase in execution time was unacceptable. You may think that the solution would be to rework the algorithm, and this was indeed considered for a while. But in line with the “measure”, “keep it cheap” and “make it only as fast as it needs to be” rules, the fix proved to be very simple. As it turned out, the major problem was not the algorithm as such but the fact that the ParentID property was expensive to evaluate, and even more so when it had to be invoked 16,000,000 times. The final solution was a new method, IsChildOf(int parentID), three lines of code long, which reduced the execution time by a factor of 60. Now that is what I call a result: a 6000% improvement for 3 lines of code.
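The original element class is not shown, so the following is only a sketch of what such a fix might look like; the internals (in particular the simulated expensive lookup) are assumptions. The idea is that IsChildOf pays for the expensive evaluation once per element rather than once per comparison.

// Illustration only: the real element class from the post is not shown, so the
// internals here (especially the simulated slow lookup) are assumptions.
using System;
using System.Threading;

public class Element
{
    private readonly int rawParentId;
    private int? cachedParentId;          // evaluated at most once per element

    public Element(int id, int parentId)
    {
        ID = id;
        rawParentId = parentId;
    }

    public int ID { get; private set; }

    // Expensive in the original code: re-evaluated on every read.
    public int ParentID
    {
        get { return ResolveParentIdExpensively(); }
    }

    // The cheap replacement: pay for the expensive evaluation once per element,
    // then answer every subsequent comparison from the cached value.
    public bool IsChildOf(int candidateParentId)
    {
        if (cachedParentId == null) cachedParentId = ResolveParentIdExpensively();
        return cachedParentId == candidateParentId;
    }

    private int ResolveParentIdExpensively()
    {
        Thread.Sleep(1);                  // stand-in for the real, slow lookup
        return rawParentId;
    }
}

The hierarchical lookup then stays essentially the same, something like allElements.Where(element => element.IsChildOf(this.ID)).ToList(), but no longer evaluates the expensive property sixteen million times.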

August 17 2009

Performance Matters

The very definition of software performance will vary depending on whom you ask. If you asked the end user, he would immediately mention the “speed” of the application he has to work with. If you asked the CIO, he would probably define performance as “throughput” measured in transactions per second. Finally, if you asked the IT guy who has to deal with the hardware end of the system, he would say that he needs scalability, so that his duties are limited to provisioning more hardware when demand increases. All of these elements (response time, throughput and scalability) are desired components of software performance.

I have spent the last 12 months working pretty much continuously on performance optimisation, and James Saul asked me to share some of my findings with a wider audience. To start somewhere I went to dig up some resources on Wikipedia and came across an interesting article on performance engineering. According to the article, one of the objectives of this discipline is to “Increase business revenue by ensuring the system can process transactions within the requisite timeframe”. In other words, performance is money, and there is probably no better example of how it is lost than the total meltdown of the Debenhams website which took place just before last Christmas. I have to admit that I have no idea what went wrong at Debenhams, but I can easily imagine a number of ways to build a software product which breaks under heavy load. As they say, there is more than one way to skin a cat and build poor-quality software, but this time round I will focus primarily on the “process” issues rather than particular technical aspects.

Small database syndrome (aka SDS)

Personally I think that SDS is the major contributor to building poorly performing programmes: if the development team works against a tiny database, they are very likely to get into serious trouble further down the line, and there are a number of reasons for it. The most obvious is the fact that there will be more data (surprise, surprise), so naturally more work will be required to get whatever you want out of the database. Secondly, query plans will be turned upside down in light of larger tables, and the distribution of data will influence them heavily as well. And last but not least, when working against a small dataset it is impossible to spot potential performance problems, as everything will (or at least should) execute rather quickly.

The best example of a spectacular “volume-related” failure I witnessed not so long ago is an application which, when fired for the first time against a fully populated database, executed 40,000 SQL queries during its start-up; the whole procedure took the better part of 40 minutes. To add insult to injury, some of the tables involved in the queries were missing rather crucial indexes while others were indexed incorrectly (not that it matters a lot when you execute 40,000 queries to start one instance of the app). This potential fiasco made everyone involved in the project somewhat embarrassed, and steps were taken to avoid such mishaps in the future. Luckily for the team this accident happened early enough in the project lifecycle and fixing it was relatively cheap and easy. But as you can hopefully see from this example, SDS is a serious risk, and I find it somewhat difficult to understand that people oftentimes try to find every possible excuse not to use a properly sized database for development and/or testing. The one I hear most often is related to cost, measured in terms of either time or money: resources which someone has to spend to produce the data. But given the availability of data generation tools like the one provided by Redgate, this is a very poor explanation. It is even worse considering that the cost of maintaining such a dataset is just a fraction of the total cost of the project.

“We will have better hardware in production”

This is another one of my favourites, which I hear a lot when people testing an application realise that something is not quite right performance-wise. Accepting that the app is sluggish usually means that someone has to admit to a failure of sorts, and nobody likes that. So people usually go into denial and try to find excuses not to tackle the problem there and then. If you consider that most of us developers work on single-processor machines, it is not hard to see how people may fall into this trap, but even so, basic calculations often prove that hoping to kill the problem with hardware may be nothing more than wishful thinking.

Let me illustrate it with an example: let’s consider a sample operation which takes 10 seconds on a modern single-processor PC with plenty of RAM. It is easy to imagine that production hardware may be 20 times more powerful, leading to the false conclusion that in production the same process will take 1/20th of 10 seconds, i.e. 500ms. Job done. The flaw in such reasoning is, first of all, the assumption that the production hardware will be serving one user at a time, or that concurrent user load will have no influence on performance. Secondly, the more powerful hardware may indeed have 20 times the capacity of the PC, but this capacity will be available only when you are able to parallelise the algorithm! If the original process is sequential (single-threaded) in nature, adding more processors to the server will not change the response time at all. So the only conclusion we can draw from running software on inferior hardware is that if it works on a PC, there is a chance that it will work on a big server, provided of course that the software is scalable. On the other hand, if it does not work well on your PC, the chances that it will ever work anywhere else under substantially heavier user load are close to zero.
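This intuition is essentially Amdahl’s law: if only a fraction p of a task can be parallelised, the best possible speedup on N processors is

S(N) = 1 / ((1 - p) + p / N)

so a purely sequential task (p = 0) gains nothing from the extra processors, and even a task that is 50% parallelisable can never run more than twice as fast, no matter how big the production box is.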

“We have no [time|requirement|money|resources] for performance testing”

Some wise people say that if you have no time for testing, then you had better have time for fixing last-minute bugs and patching the app. The same is pretty much true when it comes to performance. When building systems which will potentially face high user load, it is absolutely imperative that load and stress testing are carried out, unless you want to face a similar fate to the website I mentioned earlier. I may be biased here because I like to load test software, but load testing the app is probably the best way to make sure that it actually works.

Let me give you an example: about 18 months ago I participated in a POC at Microsoft, working on a website for an airline. Together with another guy from EMC, we were responsible for the back end of the system: the database and the WCF-based app server. As we finished our job earlier than expected, I decided that it would not be a bad idea to make use of the available resources and take the thing for a spin to see what it could do. The app server was running on an 8-way 64-bit machine with ample amounts of RAM, so I whacked together some unit tests simulating users’ journeys through the website, plugged the whole lot into a VSTS load-testing machine and pressed the green button. As soon as I pressed it we discovered that the whole thing ground to a rather embarrassing halt within several seconds of being started… After a bit of head scratching we decided that it is a rather good idea to close server connections once you are done with them, and repeated the test, scoring a rather measly result of 100 method invocations per second. To cut a long story short, over the next few days we discovered that from a performance point of view it is actually wiser to use ADO.NET rather than LINQ to SQL, that when building high-performance systems it is better to have network cables which work at the full capacity of the switch they are connected to, and that SQL Server 2008 rocks: it would take the load from 3 app servers before it became fully saturated. In the meantime our load-testing machine ran out of puff and we needed two more 4-way boxes in order to generate enough load to saturate the app server. The end result of this exercise was an 18-fold (sic!) increase in system performance, not to mention the fact that it was happy working for hours on end. And when it came to the presentation of the finished website, everyone was raving about how quick the whole thing was. The moral of the story, however, is that things will inevitably break under heavy load. If you load test them before handing them to the users, chances are that they will get a much more robust system and you, as the application developer, will save yourself a potentially huge embarrassment.

PS: I know that this post is barely technical, but I promise to improve next time round :)

August 8 2009