AOP, Performance Counters and The Big Red Button

A friend of mine who was once involved in a test role at play.com recently told me a very interesting story. In case you did not know them, play.com sell a lot of stuff over the internet. At the time my friend worked there, play.com had in their dev centre a big button displayed on an even bigger screen indicating the number of orders they were getting every second. As play.com never sleeps, they knew that when the number of orders/sec dropped below a certain threshold and the button went red, alarms had to be raised, SOS emails had to be sent out and the fat controller had to be woken up, as something was clearly going wrong. (OK, I made up the fat controller, but that's a different story.)

I have no idea how the whole system was implemented, but it clearly illustrates a very good use of software instrumentation. If the techies at play.com were as clever as I think they were, the simplest way to get the "button" to work would have been to use bog-standard performance counters, as they serve several purposes:

  • As illustrated by the creative folk at play.com, performance counters can act as an early warning system. Excessive values can raise automatic alerts through monitoring tools, which is usually highly appreciated by the "ops" people responsible for the day-to-day operation of the software.
  • Performance counters allow us and our clients to trace execution times, throughput etc., which may be the only way to verify whether a third party provides the appropriate level of service (SLA). This comes in especially handy when your app relies heavily on services provided by others.
  • As with all things related to instrumentation, performance counters are often immensely helpful in diagnosing software problems when something goes, or is about to go, wrong.
  • Performance counters give us a true picture of the behaviour of our software: how many logins, saves and loads does the application perform? How does the workload differ from the requirements and/or expectations?

Knowing that most of us find instrumentation as exciting as having a root canal done, I came up with a plan to use PostSharp AOP to ease some, if not all, of this pain. If you are interested in publishing performance data, read on: you may find that the entire process can be made very simple with a couple of AOP tricks.

Anatomy of performance counters

Before we start publishing any performance data we need to "register" our performance counter(s) with the operating system. This can be done in a number of ways, but personally I prefer to do it programmatically (if possible). The following snippet illustrates how:

Code Snippet
var counterCreationDataCollection = new CounterCreationDataCollection();
var valueCounter = new CounterCreationData("Counter name", "Help text if any?",
                                           PerformanceCounterType.NumberOfItems32);
counterCreationDataCollection.Add(valueCounter);
PerformanceCounterCategory.Create("Name of category",
                                  "Sample performance counter category.",
                                  PerformanceCounterCategoryType.SingleInstance,
                                  counterCreationDataCollection);

The problem with the programmatic approach is security: in order to execute the code we not only need admin privileges, but on some operating systems (Vista and Windows 7) the program has to be "Run as administrator", which may be a bit tricky. To get around the pesky UAC issues you may need to modify the application's manifest file, which is described very nicely here. Alternatively, performance counters can be installed by a setup program, with a bit of help from a custom installer class derived from PerformanceCounterInstaller.

Unfortunately, once created, the PerformanceCounterCategory hosting our counters cannot be modified, so if new counters are to be added the entire category has to be recreated. For this very reason, during development I usually have a piece of code which reinstalls all required performance counters on application start-up, as over the lifecycle of the project both the number and the types of performance counters are very likely to change. When the application reaches the final stages of development, the task of creating performance counters can be delegated to a setup program.
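A development-only helper along these lines does the trick (category and counter names are made up for illustration, and you still need admin rights to run it):

Code Snippet

using System.Diagnostics;

public static class CounterInstaller
{
    /// <summary>
    /// Development-time helper: wipes and recreates the category on start-up so that
    /// newly added counters always make it into the OS.
    /// </summary>
    public static void ReinstallCounters()
    {
        const string category = "MyApp";

        if (PerformanceCounterCategory.Exists(category))
        {
            PerformanceCounterCategory.Delete(category);
        }

        var counters = new CounterCreationDataCollection
        {
            new CounterCreationData("Total logins", "Number of successful logins",
                                    PerformanceCounterType.NumberOfItems32),
            new CounterCreationData("Logins/sec", "Rate of logins",
                                    PerformanceCounterType.RateOfCountsPerSecond32)
        };

        PerformanceCounterCategory.Create(category, "Sample counters for MyApp",
                                          PerformanceCounterCategoryType.SingleInstance, counters);
    }
}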

Usage

Once we have our counters in place we have a couple of ways to interact with them, but the choice is relatively simple: we can either use the thread-safe Increment/Decrement/IncrementBy methods or directly manipulate the counter's RawValue, as the following sample illustrates. According to MSDN, the thread-safe approach is about 5 times slower than directly accessing RawValue, but you have to be sensible about it: if you are instrumenting a particularly "hot" piece of code, direct access may indeed be an option, but I would not go down this route unless a profiler indicates that it provides some tangible benefit (measure first!).

Code Snippet
var counter = new PerformanceCounter("Category", "Counter name", false);
// We can increment the value in a thread safe manner...
counter.Increment();
// ... or set the value directly
counter.RawValue = 123;

Once you start digging into the System.Diagnostics documentation you will surely discover that there are 28 different types of performance counters. The list available on MSDN may seem very daunting at first, but from the publishing perspective counters really fall into two categories:

  • Performance counters representing a "number of items". Examples include "total number of logins", "sales transactions processed" etc. When publishing this sort of data we usually just have to capture the number of invocations of a particular operation and call the PerformanceCounter.Increment() method. If the number may go down as well as up, we can either use the Decrement() method or set the RawValue property directly.
  • Performance counters related to execution time. The time has to be expressed in terms of "timer ticks": we have to measure how long the operation took and then simply call PerformanceCounter.IncrementBy(executionTime), as sketched just below.
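In raw form (no AOP yet) the two flavours boil down to something like this; the counter and method names are invented and the counters are assumed to have been registered as shown earlier:

Code Snippet

// "Number of items" style counter: just bump it whenever the operation happens
var totalLogins = new PerformanceCounter("MyApp", "Total logins", false);
totalLogins.Increment();

// "Time of execution" style counter: add the elapsed time expressed in Stopwatch ticks
var executionTime = new PerformanceCounter("MyApp", "Execution time", false);
long start = Stopwatch.GetTimestamp();
ProcessOrder();                                  // the operation being measured (hypothetical)
executionTime.IncrementBy(Stopwatch.GetTimestamp() - start);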

Now you may ask how on Earth it is possible to get 28 types of performance counters given that we only ever set either a number or an execution time? The answer is simple: the majority of the counters use some sort of a magic formula involving one or two counters to come up with the final value. The best example is a performance counter of type RateOfCountsPerSecond32. Every time the performance monitor samples its value, it checks how much the value has changed since the last sample was collected and then divides the difference by the time elapsed since the last sample. This gives a nice rate/sec, and all we have to do to publish it is increment the counter every time an operation takes place.
A slightly more complex scenario utilises the values of two different performance counters; imagine that we want to expose a "cache hit ratio" from a caching component. In order to achieve this we have to create two counters, of types RawFraction and RawBase respectively. Whenever we have a "cache hit" we increment the RawFraction counter, and every operation (whether it is a hit or a miss) increments the RawBase counter. This way the performance monitor will calculate the hit ratio as RawFraction/RawBase and we're done.
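A minimal sketch of such a component might look like this (category and counter names are invented; the two counters are assumed to be registered with the RawBase entry immediately after its RawFraction companion, which is what the API expects):

Code Snippet

public class InstrumentedCache
{
    private readonly Dictionary<string, object> _items = new Dictionary<string, object>();

    // Writable counter instances: "Cache hit ratio" is the RawFraction counter and
    // "Cache hit ratio base" is its RawBase companion.
    private readonly PerformanceCounter _hits =
        new PerformanceCounter("MyApp Cache", "Cache hit ratio", false);
    private readonly PerformanceCounter _lookups =
        new PerformanceCounter("MyApp Cache", "Cache hit ratio base", false);

    public bool TryGet(string key, out object value)
    {
        bool hit = _items.TryGetValue(key, out value);
        if (hit)
        {
            _hits.Increment();      // numerator: hits only
        }
        _lookups.Increment();       // denominator: every lookup, hit or miss
        return hit;
    }

    public void Put(string key, object value)
    {
        _items[key] = value;
    }
}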
More interesting things happen when we start dealing with times; we can, for example, expose a counter which represents the number of threads executing a given method at any time simply by publishing a counter of type CounterTimer. Here is how it works: imagine that an operation takes 200ms and the performance monitor checks the value of the counter every second. If between two samples we have executed the operation 20 times, each time incrementing the value of the counter by 200ms, then at the end of our sampling interval the total accumulated value will have increased by 20 executions times 200ms = 4000ms. Given this increase, the performance monitor divides the delta by the sampling interval, which tells us that, on average, during our sampling interval we were executing 4000ms/1000ms = 4 operations simultaneously. Genius! The full list of supported counter types is obviously available on MSDN so I won't repeat it here; the only thing you have to pay special attention to is whether you are required to increment the counter value by the number of executions (or other items) or by the time of execution.

Performance of performance counters

If you have read my previous posts you may remember the paradox of Schrödinger's cat: by measuring a value we may affect it. This is very true of performance counters, as there is indeed an overhead associated with using them. Having said that, results captured by my favourite profiler indicate that unless we have hundreds of threads executing a method thousands of times per second, there is nothing to worry about. This of course does not mean that every single method of every single class has to be instrumented! As with any tool, it makes sense only when used wisely: monitoring critical operations, 3rd party services etc.

Putting it all together

Knowing that we can cover 90% of performance counter scenarios by capturing either the execution time or the number of executions, we can wrap it all up in two simple PostSharp aspects. If you are new to PostSharp I would strongly suggest that you have a look at the website, but the idea is simply to decorate a method with a couple of attributes and let the PostSharp interception mechanism do the magic, as the following code illustrates:

Code Snippet
[CountingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.NUMBER_OF_CALLS)]
[CountingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.CALL_RATE)]
[CountingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.EXECUTION_TIME_BASE)]
[TimingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.ACTIVE_THREADS)]
[TimingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.EXECUTION_TIME)]
private void InstrumentedMethod()
{
    Thread.Sleep(_pause);
}

As you can imagine there is no rocket science behind the aspects themselves. Here is the code of the TimingPerformanceCounter aspect, which is slightly more involved than the other one:

Code Snippet
using System;
using System.Diagnostics;
using PostSharp.Laos;

namespace EMC.PostSharp.PerformanceCounters
{
    /// <summary>
    /// Implements an aspect publishing a method's execution time through performance counters
    /// </summary>
    [Serializable]
    public class TimingPerformanceCounterAttribute : PerformanceCounterAttributeBase
    {
        /// <summary>
        /// Initializes a new instance of the <see cref="TimingPerformanceCounterAttribute"/> class.
        /// </summary>
        /// <param name="categoryName">Name of the category.</param>
        /// <param name="performanceCounterName">Name of the performance counter.</param>
        public TimingPerformanceCounterAttribute(string categoryName, string performanceCounterName)
            : base(categoryName, performanceCounterName)
        {
        }

        /// <summary>
        /// Method executed <b>before</b> the body of methods to which this aspect is applied.
        /// </summary>
        /// <param name="eventArgs">Event arguments specifying which method
        /// is being executed, what its arguments are, and how the execution should continue
        /// after the execution of <see cref="M:PostSharp.Laos.IOnMethodBoundaryAspect.OnEntry(PostSharp.Laos.MethodExecutionEventArgs)"/>.</param>
        /// <remarks>
        /// If the aspect is applied to a constructor, the current method is invoked
        /// after the <b>this</b> pointer has been initialized, that is, after
        /// the base constructor has been called.
        /// </remarks>
        public override void OnEntry(MethodExecutionEventArgs eventArgs)
        {
            // Record the start time here
            eventArgs.MethodExecutionTag = Stopwatch.GetTimestamp();
            base.OnEntry(eventArgs);
        }

        /// <summary>
        /// Method executed <b>after</b> the body of methods to which this aspect is applied,
        /// even when the method exits with an exception (this method is invoked from
        /// the <b>finally</b> block).
        /// </summary>
        /// <param name="eventArgs">Event arguments specifying which method
        /// is being executed and what its arguments are.</param>
        public override void OnExit(MethodExecutionEventArgs eventArgs)
        {
            base.OnExit(eventArgs);
            long start = (long) eventArgs.MethodExecutionTag;
            // Increment the counter by the value of time elapsed
            PerformanceCounter.IncrementBy(Stopwatch.GetTimestamp() - start);
        }
    }
}
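
If you are wondering what the other two pieces look like, the base class simply holds on to a writable counter instance and the counting aspect just calls Increment() on exit. A simplified sketch (not the exact code from the attachment) could look like this:

Code Snippet

using System;
using System.Diagnostics;
using PostSharp.Laos;

namespace EMC.PostSharp.PerformanceCounters
{
    /// <summary>
    /// Common plumbing for the performance counter aspects: stores the counter
    /// coordinates at compile time and creates a writable counter instance at run time.
    /// </summary>
    [Serializable]
    public abstract class PerformanceCounterAttributeBase : OnMethodBoundaryAspect
    {
        private readonly string _categoryName;
        private readonly string _counterName;

        [NonSerialized]
        private PerformanceCounter _performanceCounter;

        protected PerformanceCounterAttributeBase(string categoryName, string counterName)
        {
            _categoryName = categoryName;
            _counterName = counterName;
        }

        /// <summary>Writable counter, created lazily on first use.</summary>
        protected PerformanceCounter PerformanceCounter
        {
            get
            {
                if (_performanceCounter == null)
                {
                    _performanceCounter = new PerformanceCounter(_categoryName, _counterName, false);
                }
                return _performanceCounter;
            }
        }
    }

    /// <summary>
    /// Publishes the number of invocations of the decorated method.
    /// </summary>
    [Serializable]
    public class CountingPerformanceCounterAttribute : PerformanceCounterAttributeBase
    {
        public CountingPerformanceCounterAttribute(string categoryName, string counterName)
            : base(categoryName, counterName)
        {
        }

        public override void OnExit(MethodExecutionEventArgs eventArgs)
        {
            base.OnExit(eventArgs);
            PerformanceCounter.Increment();   // one more completed invocation
        }
    }
}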

In case you want to take the whole lot for a spin, the file attached to this post contains all the necessary ingredients, including some unit tests. To build the app you will need the latest version of PostSharp installed on your machine. The sample app uses multiple threads and throttling to expose some typical scenarios of performance counter utilisation (calls/sec, active threads, total calls and average execution time).


October 12 2009

Performance Optimization: Rules of Engagement

On my way to work from London's Waterloo Station one day I noticed a building on Southwark St which got me intrigued: "Kirkaldy's Testing and Experimenting Works". As it turns out there is a rather fascinating industrial history behind the building, but it would not be worth mentioning here were it not for the motto above the entrance: "Facts Not Opinions". I can hardly imagine an area of software development where the motto would be more applicable than performance engineering. You see, far too many problems with performance come from the fact that we spend our time and resources "optimizing" code in areas which do not need optimization at all. We oftentimes do it because "everybody knows that you should do XYZ" or because we want to mitigate perceived performance risks by taking "proactive" action (aka premature optimization). If we were to follow the mantra of Mr Kirkaldy, we could avoid all of the above by doing just one thing: testing and measuring (and perhaps experimenting). So if you were to stop reading just now, please take rule no. 1 of performance optimization with you: measure first. Measuring is not only important when fixing code: it is also vital if you want to evaluate the risk of a potential design approach. So instead of doing "XYZ because everybody knows we should", whack a quick prototype together and take it for a spin in a profiler.


One of my favourite performance myths is that you should "always cache WCF service proxies because they are expensive to create" (and of course everybody knows that). As I have heard this technique specifically mentioned in the context of an ASP.NET web app running in IIS, I could immediately hear alarm bells ringing for miles… The problems with sharing proxies between IIS sessions/threads are numerous, but I will not bother you with the details here; my main doubt was whether a WCF proxy can be efficiently shared between multiple threads using it (executing methods on it) at the same time. So I created a simple WCF service with one method simulating a 5 second wait. I set the instance mode to "per call" and then started calling the service from 5 threads on the client side, using the same proxy shared between all of them. I used a ManualResetEvent to start the threads simultaneously and expected them to finish 5 seconds later (give or take a millisecond or two). Guess what: they did not, as they blocked each other on some of the WCF internals and the whole process took 20 seconds instead of 5. So now imagine what would have happened if you had used this approach on a busy website: your "clients" would effectively be queuing to get access to the WCF service and you would end up with a potentially massive scalability issue. To make things worse, creating WCF proxies is nowadays relatively cheap (provided that you know how to do it efficiently). The moral of the story is simple: when in doubt, measure. Do not apply performance "optimisations" blindly simply because everybody knows that you should…
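The client side of such an experiment takes only a few lines; something along these lines (the proxy and operation names are invented, not the actual test code):

Code Snippet

// One proxy instance shared by five threads, all released at the same time by a
// ManualResetEvent; TakeFiveSeconds() stands in for the simulated 5 second operation.
var proxy = new SlowServiceClient();
var startSignal = new ManualResetEvent(false);
var threads = new List<Thread>();

for (int i = 0; i < 5; i++)
{
    var thread = new Thread(() =>
    {
        startSignal.WaitOne();
        proxy.TakeFiveSeconds();
    });
    thread.Start();
    threads.Add(thread);
}

var stopwatch = Stopwatch.StartNew();
startSignal.Set();
threads.ForEach(t => t.Join());

// If the proxy really were safe to share you would expect roughly 5 seconds here;
// in my test the calls serialized on WCF internals and it took about 20 seconds.
Console.WriteLine("Elapsed: {0:0.0}s", stopwatch.Elapsed.TotalSeconds);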

As good and beneficial as performance "measuring" can be, when doing so you may often come across the phenomenon known in quantum physics as the paradox of Schrödinger's cat: to put it simply, by measuring you may (and most likely will) influence the value being measured. It is important to mention it here as profiling a live system may become infeasible simply because it would slow it down to an unacceptable level. The level of performance degradation may vary from several percent (in the case of tracing the SQL being executed using SQL Profiler) to several hundred percent when using a code profiler. Keep that in mind when testing your software, as this once again illustrates that it is far better to do performance testing in development rather than fight problems in production, when your ability to measure may be seriously hampered.

One of the funniest performance bugs I have ever come across was caused by a "tracing infrastructure" which, strangely enough, took an extreme amount of time to do its job. As it turned out, someone decided that it would be great to produce the output in XML so that it could be processed later in a more structured way than plain text. The only problem was that the XmlSerializer used to create this output was created every time anyone tried to produce some trace output. In comparison with WCF proxies, XmlSerializers are extremely expensive to create and this obviously had a detrimental impact on an application using tracing extensively. I find it rather amusing, as tracing is one of the basic tools which can help you measure performance, as long of course as it does not influence it too much… :)
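The fix is usually trivial: create the serializer once and reuse it, as XmlSerializer instances are safe to share between threads; it is their construction that hurts. A sketch (TraceEntry is a made-up stand-in for whatever the tracing code writes out):

Code Snippet

public class TraceEntry
{
    public DateTime Timestamp { get; set; }
    public string Message { get; set; }
}

public static class XmlTraceWriter
{
    // Created once per AppDomain instead of once per trace call.
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(TraceEntry));

    public static void Write(TextWriter output, TraceEntry entry)
    {
        Serializer.Serialize(output, entry);
    }
}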

If there is one thing which is certain about software performance though, it is the fact that you can take pretty much any piece of code and make it run faster. For starters, if you do not like managed code and the overheads of JIT and garbage collection, you can go unmanaged and rewrite the piece in, say, C/C++. You could take it further and perhaps go down to assembler. Still not fast enough? How about assembler optimised for a particular processor, making use of its unique features? Or maybe offload some of the workload to the GPU? I could go on along these lines for quite a while, but the truth is that every step you take along this route is exponentially more expensive, and at a certain point you will make very little progress for a lot of someone's money. So the next golden rule of performance optimisation is: make it only as fast as it needs to be (keep it cheap). This rule pretty much eliminates vague performance requirements along the lines of "the site is slow" or "make the app faster please". In order to tackle any performance problem, the statement of it has to be more precise, more along the lines of "the process of submitting an order is taking 15 seconds server-side and we want it to take no more than 3 seconds under an average load of 250 orders/minute". In other words, you have to know exactly what the problem is and what the acceptance criteria are before you start any work.

I have to admit here that oftentimes I am tasked with "just sorting this out" when the performance of a particular part of the application becomes simply unacceptable from the user's perspective. Lack of clear performance expectations in such cases is perhaps understandable: it is quite difficult to expect the end user to state that "opening a document should take 1547ms per MB of content". Beyond that, acceptability will depend on how often the task has to be performed, how quickly the user needs it done and so on. So sometimes you just have to take him/her through an iterative process which stops when you hear "yeah, that's pretty good, I can live with that".

So say that you have a clear problem statement and agreed expectations, you fire up a profiler and method X() comes up at the top of the list, consuming 90% of the time. It would be easy to assume that all we have to do now is somehow optimise X(), but surprisingly this would probably be… a mistake! Rule no. 4 of code optimisation is to fully understand the call stack before you start optimising anything. Way too many times I have seen developers "jump into action" and try to optimise the code of a method which could be completely eliminated! Elimination is by far the cheapest option: deleting code does not cost much and you immediately improve performance by an almost infinite number of percent (I'll leave it to you to provide a proof for the latter statement :). It may seem as if I am not being serious here, but you would be surprised how many times I have seen an application execute a piece of code just to discard the results immediately.

And last but not least, as developers we sometimes fall into the trap of gold plating: it is often tempting to fix issues you spot here and there while profiling, but the first and foremost question you should be asking is: what will be the benefit? A method may seem inefficient (by the looks of the code), say a sequential search which could be replaced with a more efficient dictionary-type lookup, but if the profiler indicates that the code is responsible for 1% of overall execution time, my advice is simple: do not bother. I have fallen into this trap in the past, and before you know it you end up with "small" changes in 50 different source files and suddenly none of the unit tests seem to work. So the last rule is: go for maximum results with minimum changes, even if it means that you have to leave behind some ugly code which you would love to fix. Once your bottleneck has been eliminated, sure as hell another one will pop up its ugly head, so keep tackling them one by one until you reach acceptable results. And when you reach a situation where making one thing faster slows something else down, as often happens in database optimization, it means that you are "herding cats", as we call it on my project, and you probably have to apply a major refactoring exercise.

My current project has a dialog box with a tree view which used to take several seconds to open. On closer investigation we realised that the problem lay in how the child elements of each tree node were retrieved: the algorithm used a sequential search through a list of all elements stored in memory, along the lines of var myChildren = allElements.Where(element => element.ParentID == this.ID).ToList(). As the dialog used WPF with a hierarchical data template, each element in the list had to perform a sequential search for its children, which gives a not-so-nice O(n²) type of algorithm. The performance was bad with ~1000 elements, but when the number of elements increased overnight to 4000, the resulting 16-fold increase in execution time was unacceptable. You may think that the solution would be to rework the algorithm, and this indeed was considered for a while. But in line with the "measure", "keep it cheap" and "make it only as fast as it needs to be" rules, the fix proved to be very simple. As it turned out, the major problem was not the algorithm as such but the fact that the ParentID property was expensive to evaluate, and even more so when it had to be invoked 16,000,000 times. The final solution was a new, 3-line-long method IsChildOf(int parentID), which reduced the execution time by a factor of 60. Now that is what I call a result: a 6000% improvement for 3 lines of code.

August 17 2009

Integration Testing WCF Services with TransactionScope

Integration testing WCF services can be a right pain, and I experienced it firsthand on my current project. If you imagine a hypothetical person repository service exposing 4 CRUD operations, it would make sense to test all four of them. If the operations were to be tested in random order, it would be perfectly feasible for a test of update running after a delete to fail because the object to be updated had been deleted by a previous test. The same obviously applies to reads following updates etc. In other words, the tests are dependent on each other, and this dependency is really evil for a number of reasons: first of all, it usually means that you cannot run tests in isolation as they may depend on modifications made by other tests. Secondly, one failing test may cause a number of failures in the tests which follow, and thirdly, by the end of the test run the underlying database is in a right mess as it contains modified data, forcing you to redeploy it should you wish to re-run the tests. Considering the fact that running integration tests is usually a time-consuming exercise, this vicious circle of deploy/run/fix becomes extremely expensive as the project goes on.

TransactionScope to the rescue

Fortunately for us, WCF supports distributed transactions, and if there is one place where they make perfect sense it is integration testing. Imagine a test class written along the following lines:

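Something along these lines, assuming MSTest and a generated PersonRepositoryClient proxy (both names are mine, not necessarily those used in the attached sample):

Code Snippet

public abstract class TransactionalTestBase
{
    private TransactionScope _scope;

    [TestInitialize]
    public void OpenTransaction()
    {
        // Every test runs inside its own ambient transaction...
        _scope = new TransactionScope();
    }

    [TestCleanup]
    public void RollbackTransaction()
    {
        // ...which is disposed without Complete(), so everything is rolled back.
        _scope.Dispose();
    }
}

[TestClass]
public class PersonRepositoryTests : TransactionalTestBase
{
    [TestMethod]
    public void DeletedPersonCannotBeRead()
    {
        var client = new PersonRepositoryClient();
        client.DeletePerson(123);
        Assert.IsNull(client.GetPerson(123));
        client.Close();
    }
}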

The idea behind it is that whenever a test starts, a new transaction gets initiated. When the test completes, regardless of its outcome, the changes are rolled back, leaving the underlying database in pristine condition. This means that we can break the dependency between tests, run them in any order and re-run the whole lot without the need to redeploy the database. The holy grail of integration testing :) To make it work, however, the service needs to support distributed transactions (which is usually not a bad idea anyhow). Having said that, you have to be aware of various and potentially serious gotchas, which I will cover later.

To make a service "transaction aware" the following changes have to be made (I assume a default, out-of-the-box WCF project): first of all, the service has to expose an endpoint which uses a binding which in turn supports distributed transactions (e.g. WSHttpBinding), and the binding has to be configured to allow transaction flow. This configuration has to be applied on both the client (unit test project) and the server side:

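In the config files this boils down to setting transactionFlow="true" on the binding; the programmatic equivalent, which is easier to show here, looks roughly like this (the address and type names are made up):

Code Snippet

// WSHttpBinding supports distributed transactions but has transaction flow
// switched off by default, so it has to be enabled on both sides of the wire.
var binding = new WSHttpBinding();
binding.TransactionFlow = true;

// Hypothetical self-hosted set-up for illustration only.
var host = new ServiceHost(typeof(PersonRepository),
                           new Uri("http://localhost:8731/PersonRepository"));
host.AddServiceEndpoint(typeof(IPersonRepository), binding, string.Empty);
host.Open();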

Secondly, all operations which are supposed to participate in a transaction have to be marked as such in the service contract:

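For the person repository the contract would end up looking something like this (Person being the data contract used by the sample):

Code Snippet

[ServiceContract]
public interface IPersonRepository
{
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    Person GetPerson(int id);

    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void DeletePerson(int id);
}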

The TransactionFlowOption enumeration includes NotAllowed, Allowed and Mandatory values, which I hope are self-explanatory. Using the Allowed option is usually the safest bet, as the operation will then accept calls made both with and without a transaction scope. Making the service transaction aware as illustrated above is usually enough to make the whole idea work.

The third change, which is optional but which I highly recommend, is to decorate all methods which accept inbound transactions with [OperationBehavior(TransactionScopeRequired = true, TransactionAutoComplete = true)]. By doing so we state that regardless of whether the client "flows" a transaction or not, the method will execute within a transaction scope. If the scope is not provided by the client, WCF will simply create one for us, which means that the code remains identical whether or not a client-side transaction is provided. The TransactionAutoComplete option means that unless the method throws an exception, the transaction will commit. This also means that we do not have to worry about making calls to BeginTransaction/Commit/Rollback any more. The default for TransactionAutoComplete is true, so strictly speaking it is not necessary to set it, but I did it here for illustration purposes.

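On the implementation side this looks roughly as follows (PersonDataAccess is a made-up stand-in for the real data access code):

Code Snippet

public class PersonRepository : IPersonRepository
{
    [OperationBehavior(TransactionScopeRequired = true, TransactionAutoComplete = true)]
    public Person GetPerson(int id)
    {
        return PersonDataAccess.Load(id);
    }

    [OperationBehavior(TransactionScopeRequired = true, TransactionAutoComplete = true)]
    public void DeletePerson(int id)
    {
        // Runs inside a transaction scope whether or not the client flowed one;
        // the transaction completes automatically unless an exception is thrown.
        PersonDataAccess.Delete(id);
    }
}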

The attached sample solution contains a working example of a person repository and may be useful to get you started.

The small print

An important feature of WCF is the default isolation level for distributed transactions, which is Serializable. This means that, more often than not, your service is likely to suffer badly from scalability problems should the isolation level remain set to the default value. Luckily for us, WCF allows us to adjust it; the service implementation simply has to specify the required level using the ServiceBehaviorAttribute. Unless you know exactly what you are doing, I would strongly recommend setting the isolation level to ReadCommitted. This is the default isolation level in most SQL Server installations and it also gives you some interesting options.

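In code it is a single attribute on the service class (a sketch, with the operations as in the previous snippet):

Code Snippet

[ServiceBehavior(TransactionIsolationLevel = IsolationLevel.ReadCommitted)]
public class PersonRepository : IPersonRepository
{
    [OperationBehavior(TransactionScopeRequired = true)]
    public Person GetPerson(int id)   { return PersonDataAccess.Load(id); }

    [OperationBehavior(TransactionScopeRequired = true)]
    public void DeletePerson(int id)  { PersonDataAccess.Delete(id); }
}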

Having done this, the caller has to explicitly specify the required isolation level as well when constructing the transaction scope.

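Going back to the test base class sketched earlier, its scope set-up would then become something like:

Code Snippet

[TestInitialize]
public void OpenTransaction()
{
    // Match the isolation level declared on the service, otherwise WCF will reject
    // the flowed transaction (the TransactionScope default is Serializable).
    var options = new TransactionOptions { IsolationLevel = IsolationLevel.ReadCommitted };
    _scope = new TransactionScope(TransactionScopeOption.Required, options);
}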

An interesting "feature" of using transaction scope, in testing in particular, is the fact that your test may deadlock on itself if not all operations being executed within the transaction scope participate in it. The main reason for which this may happen is lack of TransactionFlowAttribute decorating the operation in service contract. In the test below if the GetPerson operation was not supporting transactions, yet the DeletePerson was, then an attempt to read the value deleted by another transaction would cause a deadlock. Feel free to modify the code and try it for yourself. 

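A test demonstrating the problem could look like this (it assumes the TransactionalTestBase sketched earlier and a contract where GetPerson is not decorated with TransactionFlow):

Code Snippet

[TestMethod]
public void GetAfterDeleteDeadlocksWhenGetDoesNotFlowTheTransaction()
{
    var client = new PersonRepositoryClient();

    // Runs inside the ambient test transaction and locks the row.
    client.DeletePerson(123);

    // Without [TransactionFlow(TransactionFlowOption.Allowed)] on GetPerson this call
    // runs in its own server-side transaction, blocks on the locked row and the test
    // deadlocks until one of the transactions times out.
    client.GetPerson(123);

    client.Close();
}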

Distributed transactions require MSDTC to be running on all machines participating in the transaction, i.e. the client, the WCF server and the database server. This is usually the first stumbling block, as MSDTC may be disabled or may be configured in a way which prevents it from accepting distributed transactions. To configure MSDTC you will have to use the "Administrative Tools\Component Services" applet from the control panel. The MSDTC configuration is hidden in the context menu of "My Computer\Properties". Once you activate this option you will have to navigate to the MSDTC tab and make sure that the security settings allow "Network access" as well as "Inbound/Outbound transactions".


Performance

One issue which people usually raise with regard to distributed transactions is performance: these concerns are absolutely valid and have to be given some serious consideration. The first problem is the fact that if the service has to involve transaction managers (MSDTC) in order to get the job done, it usually means some overhead. Luckily, a transaction initiated in a TransactionScope does not always need to use MSDTC. Microsoft provides the Lightweight Transaction Manager, which will be used by default as long as the transaction meets some specific criteria: transactions involving just one database will remain local, incurring almost no overhead (~1% according to my load tests). As soon as your transaction involves other resources (databases or WCF services) it will be promoted to a distributed one and will take a performance hit (in my test case a 25% decrease in performance, but your mileage may vary). To check whether a method executes within a local or a distributed transaction you can inspect Transaction.Current.TransactionInformation.DistributedIdentifier: a value equal to Guid.Empty means that the transaction is local. The second issue affecting performance is the fact that transactions will usually take longer to commit/roll back, meaning that database locks will be held for longer. In the case of WCF services the commit happens once the results have been serialized back to the client, which can introduce serious scalability issues due to locking. This problem can usually be alleviated by using the ReadCommitted isolation level and row versioning in the database.
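The check itself is a one-liner:

Code Snippet

// Guid.Empty means the Lightweight Transaction Manager is still in charge;
// anything else means the transaction has been promoted to MSDTC.
bool isDistributed =
    Transaction.Current.TransactionInformation.DistributedIdentifier != Guid.Empty;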

Parting shots

The project I am currently working on contains some 2500 integration tests, 600 of which test our WCF repository. In order to make sure that every test obeys the same rules with regard to transactions, we have a unit test in place which inspects all test classes in the project and makes sure that all of them derive from the common base class responsible for setting up and cleaning up the transaction. I would strongly recommend following this approach in any non-trivial project, as otherwise you may end up with some misbehaving tests breaking the whole concept.
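The guard test itself is just a few lines of reflection; a sketch, again assuming MSTest and the TransactionalTestBase class from earlier (the names are mine):

Code Snippet

[TestMethod]
public void AllIntegrationTestsDeriveFromTransactionalTestBase()
{
    var offenders = typeof(PersonRepositoryTests).Assembly.GetTypes()
        .Where(t => t.GetCustomAttributes(typeof(TestClassAttribute), true).Length > 0)
        .Where(t => !typeof(TransactionalTestBase).IsAssignableFrom(t))
        .Select(t => t.Name)
        .ToList();

    Assert.AreEqual(0, offenders.Count,
                    "Test classes not deriving from TransactionalTestBase: " +
                    string.Join(", ", offenders.ToArray()));
}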

Happy testing!

November 16 2008