Perverting Agile

A long, long time ago I bought myself a book about taekwondo: to my disappointment, the book clearly stated that it is not possible to teach yourself a martial art just by reading a book. It is probably true of any human activity that it helps to practise it with others, and better yet with people who actually know what they are doing. Agile is no different in this respect, and it seems to me that far too many people picked up agile from a book or the internet, somehow missing the small print in the process. Here are some examples from the last couple of weeks…

An hour-long stand-up

As the name implies, the daily morning team meeting should be done while standing, and the reason for doing it this way is to stop the meeting from taking too long. If there are issues which need to be discussed, they are raised during the stand-up and the follow-up happens later. It is also important that only “pigs” have the right to say anything during the stand-up. Chickens, management etc. may by all means attend, but they should remain silent. The stand-up is not a management reporting meeting!

Marriage programming

Pair programming is an extremely valuable tool in the agile toolkit, but it has to be used when and as required. And when it is used, the idea is that pairs change over the course of the project, the day or whatever. Putting people into pairs which never change completely misses the point of cross-pollination of skills! So if you do want your team to do pair programming, do not force your developers into contractual marriages.

Roll-call retrospective

The purpose of the retrospective is to gather feedback from the team on what went well and what needs improvement. If the retrospective turns into a roll-call type meeting where everyone says what they think in a predetermined order, I doubt it will inspire anyone. Retrospectives have to be a bit chaotic and provoke discussion (or even better an argument) to be productive in any sense. Teams paralyzed by fear and disciplined like a small army have little chance of coming up with any genuinely interesting insight.

“We don’t do design ‘cause we’re agile”

Complete nonsense, if you ask me. If you can get away with a bit of scribbling on the back of an envelope or a whiteboard, by all means do so. But just because simple problems can be tackled this way does not mean that when you come across a substantially more complex problem or requirement you should not use more formal design approaches. Inspect and adapt: agile is, among other things, about using the right tools at the right times. The back of an envelope is a good “tool”, but sadly not one suitable for all classes of problems.

March 16 2010

WCF Bindings, Security and Bandwidth Utilisation

Good things come to those who measure! I was recently asked to investigate ways to reduce the bandwidth utilisation of a WCF service and thought it would not be a bad idea to share the results. The test was a relatively simple method call (a search operation) performed by the client which returned ~500 objects from the server side. I tested the app using various WCF bindings, and the table below contains the payload sizes for the bindings I tested (the results were obtained using Wireshark):

Binding                                                    Payload size (bytes)   Relative size
WS HTTP with message security                              2747082                100%
NET-TCP with message security                              2068024                75%
NET-TCP with transport-level security and encryption       393205                 14%
NET-TCP with transport-level security and no encryption    392948                 14%

Your mileage will obviously vary; this example, however, illustrates that you can get a substantial reduction in payload size by changing a couple of configuration files. This is what I call a result! :)
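
For reference, switching between the configurations above boils down to the choice of binding and its security mode. The sketch below is a hedged, programmatic equivalent of those config changes (endpoint details omitted, variable names made up; using directives for System.ServiceModel and System.Net.Security are assumed):

Code Snippet
// WS HTTP with message security (the 100% baseline above)
var wsHttpMessage = new WSHttpBinding(SecurityMode.Message);

// NET-TCP with message security
var netTcpMessage = new NetTcpBinding(SecurityMode.Message);

// NET-TCP with transport-level security and encryption
var netTcpEncrypted = new NetTcpBinding(SecurityMode.Transport);
netTcpEncrypted.Security.Transport.ProtectionLevel = ProtectionLevel.EncryptAndSign;

// NET-TCP with transport-level security and no encryption (signing only)
var netTcpSigned = new NetTcpBinding(SecurityMode.Transport);
netTcpSigned.Security.Transport.ProtectionLevel = ProtectionLevel.Sign;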

November 13 2009

AOP, Performance Counters and The Big Red Button

A friend of mine who was once involved in a test role at play.com recently told me a very interesting story. In case you did not know them, play.com sell a lot of stuff over the internet. At the time my friend worked there, play.com had in their dev centre a big button displayed on an even bigger screen, indicating the number of orders they were getting every second. As play.com never sleeps, they knew that when the number of orders/sec dropped below a certain threshold and the button went red, alarms had to be raised, SOS emails had to be sent out and the fat controller had to be woken up, as something was clearly going wrong. (OK, I made up the fat controller but that’s a different story.)

I have no idea how the whole system was implemented, but it clearly illustrates a very good use of software instrumentation. If the techies at play.com were as clever as I think they were, the simplest way to get the “button” to work would have been to use bog-standard performance counters, as they can serve several purposes:

  • As illustrated by the creative folk at play.com, performance counters may act as an early warning system. Excessive values can raise automatic alerts through monitoring tools, which is usually highly appreciated by the "ops" people responsible for the day-to-day operation of the software.
  • Performance counters allow us and our clients to track execution times, throughput etc., which may be the only way to verify whether a third party provides the appropriate level of service (SLA). This comes in especially handy when your app relies heavily on services provided by others.
  • As with all things related to instrumentation, performance counters are often immensely helpful in diagnosing software problems when something goes, or is about to go, wrong.
  • Performance counters give us a true picture of the behaviour of our software: how many logins, saves and loads does the application perform? How does the workload differ from the requirements and/or expectations?

Knowing that most of us find instrumentation about as exciting as having a root canal done, I came up with a plan to use PostSharp AOP to ease some, if not all, of this pain. If you are interested in publishing performance data, read on: you may find that the entire process can be made very simple with a couple of AOP tricks.

Anatomy of performance counters

Before we start publishing any performance data we need to "register" our performance counter(s) in the operating system: this can be done in a number of ways but personally I prefer to do it programmatically (if possible). The following snippet illustrates how to do it:

Code Snippet
var counterCreationDataCollection = new CounterCreationDataCollection();
var valueCounter = new CounterCreationData("Counter name", "Help text if any?",
                                           PerformanceCounterType.NumberOfItems32);
counterCreationDataCollection.Add(valueCounter);
PerformanceCounterCategory.Create("Name of category",
                                  "Sample performance counter category.",
                                  PerformanceCounterCategoryType.SingleInstance,
                                  counterCreationDataCollection);

The problem with the programmatic approach is security: in order to execute this code we not only need admin privileges, but on some operating systems (Vista and Windows 7) the program has to be “run as admin”, which may be a bit tricky. To get around the pesky UAC issues you may need to modify the application’s manifest file, which is described very nicely here. Other than the programmatic approach, performance counters can be installed by a setup program with a bit of help from a custom installer class derived from PerformanceCounterInstaller. Unfortunately, once created, the PerformanceCounterCategory hosting our counters cannot be modified, so if new counters are to be added the entire category has to be recreated. For this very reason, during development I usually have a piece of code which reinstalls all required performance counters on application start-up, as over the lifecycle of the project both the number and the types of performance counters are very likely to change. When the application reaches its final stages of development, the task of creating the performance counters can be delegated to a setup program.
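
Since recreating a category wipes out all of its counters, the start-up reinstall I mentioned is nothing more than "delete if it exists, then create". A minimal sketch, assuming made-up category and counter names:

Code Snippet
// Recreate the whole category on start-up (development convenience only; requires admin rights)
const string category = "MyApp Counters";
if (PerformanceCounterCategory.Exists(category))
{
    PerformanceCounterCategory.Delete(category);
}

var counters = new CounterCreationDataCollection
{
    new CounterCreationData("Total calls", "Total number of calls.", PerformanceCounterType.NumberOfItems32),
    new CounterCreationData("Calls / sec", "Rate of calls per second.", PerformanceCounterType.RateOfCountsPerSecond32)
};

PerformanceCounterCategory.Create(category, "Counters recreated on application start-up.",
                                  PerformanceCounterCategoryType.SingleInstance, counters);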

Usage

Once we have our counters in place we have a couple of ways to interact with them, but the choice is relatively simple: we can either use the thread-safe Increment/Decrement/IncrementBy methods or directly manipulate the counter's RawValue, as the following sample illustrates. According to MSDN, the thread-safe approach is about five times slower than accessing RawValue directly, but you have to be sensible about it: if you are instrumenting a particularly “hot” piece of code, direct access may indeed be an option, but I would not go down this route unless a profiler indicates that it provides some tangible benefit (measure first!).

Code Snippet
var counter = new PerformanceCounter("Category", "Counter name", false);
// We can increment the value in a thread safe manner...
counter.Increment();
// ... or set the value directly
counter.RawValue = 123;

Once you start digging into the System.Diagnostics documentation you will surely discover that there are 28 different types of performance counters. The list available on MSDN may seem daunting at first, but from the publishing perspective counters really fall into two categories:

  • Performance counters representing a "number of items". Examples include "total number of logins", "sales transactions processed" etc. When publishing this sort of data we usually just have to capture the number of invocations of a particular operation and use the PerformanceCounter.Increment() method. If the number may go down as well as up, we can either use the Decrement() method or set the RawValue property directly.
  • Performance counters related to execution time. The time has to be expressed in "timer ticks": we measure how long the operation took and then simply call PerformanceCounter.IncrementBy(executionTime).

Now you may ask how on Earth it is possible to get 28 types of performance counters given that we only ever set either the number of items or the time of execution? The answer is simple: the majority of the counters use some sort of magic formula involving one or two counters to come up with the final value. The best example is a performance counter of type RateOfCountsPerSecond32. Every time the performance monitor samples its value, it checks how much the value has changed since the last sample was collected and then divides the difference by the time elapsed since that sample. This gives a nice rate/sec, and all we have to do to publish it is simply increment the counter every time an operation takes place.
A slightly more complex scenario utilises the values of two different performance counters; imagine that we want to expose a "cache hit ratio" from a caching component. In order to achieve this we have to create two counters, of types RawFraction and RawBase respectively. Whenever we have a "cache hit" we increment the RawFraction counter, while every operation (whether it is a hit or a miss) increments RawBase. The performance monitor will then calculate the hit ratio as RawFraction/RawBase, and we're done.
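
Here is a minimal sketch of that cache hit ratio pair (category and counter names are made up, and cacheHit stands for whatever your caching component knows about the outcome of a lookup); note that the RawBase counter is expected to immediately follow its RawFraction counter when the category is created:

Code Snippet
// Registration: RawFraction first, its RawBase immediately after
var counters = new CounterCreationDataCollection
{
    new CounterCreationData("Cache hit ratio", "Hits as a fraction of lookups.", PerformanceCounterType.RawFraction),
    new CounterCreationData("Cache hit ratio base", "Total lookups.", PerformanceCounterType.RawBase)
};
PerformanceCounterCategory.Create("My Cache", "Cache statistics.",
                                  PerformanceCounterCategoryType.SingleInstance, counters);

// Publishing, on every lookup:
var hits = new PerformanceCounter("My Cache", "Cache hit ratio", false);
var lookups = new PerformanceCounter("My Cache", "Cache hit ratio base", false);
if (cacheHit)
{
    hits.Increment();    // numerator: cache hits only
}
lookups.Increment();     // denominator: every lookup, hit or miss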
More interesting things happen when we start dealing with time; we can, for example, expose a counter which represents the number of threads executing a given method at any time simply by publishing a counter of type CounterTimer. Here is how it works: imagine that an operation takes 200 ms and the performance monitor checks the value of the counter every second. If between two samples we have executed the operation 20 times, each time incrementing the value of the counter by 200 ms, then by the end of our sampling interval the total accumulated value will have increased by 20 executions × 200 ms = 4000 ms. Given this increase, the performance monitor divides the delta by the sampling interval, which tells us that, on average, during our sampling interval we were executing 4000 ms / 1000 ms = 4 operations simultaneously. Genius! The full list of supported counter types is obviously available on MSDN so I won’t repeat it here; the only thing you have to pay special attention to is whether you are required to increment the counter value by the number of executions (or other items) or by the execution time.

Performance of performance counters

If you have read my previous posts you may remember the paradox of Schrödinger's cat: by measuring a value we may affect it. This is very true of performance counters, as there is indeed an overhead associated with using them. Having said that, results captured by my favourite profiler indicate that unless you have hundreds of threads executing a method thousands of times per second, there is nothing to worry about. This of course does not mean that every single method of every single class has to be instrumented! As with any tool, it makes sense only when used wisely: monitoring critical operations, third-party services etc.

Putting it all together

Knowing that we can cover 90% of performance counter scenarios by capturing either the execution time or the number of executions, we can wrap it all up in two simple PostSharp aspects. If you are new to PostSharp I would strongly suggest that you have a look at the website, but the idea is simply to decorate a method with a couple of attributes and let the PostSharp interception mechanism do the magic, as the following code illustrates:

Code Snippet
[CountingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.NUMBER_OF_CALLS)]
[CountingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.CALL_RATE)]
[CountingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.EXECUTION_TIME_BASE)]
[TimingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.ACTIVE_THREADS)]
[TimingPerformanceCounter(PerformanceCounters.CATEGORY, PerformanceCounters.EXECUTION_TIME)]
private void InstrumentedMethod()
{
    Thread.Sleep(_pause);
}

As you can imagine, there is no rocket science behind the aspects themselves. Here is the code of the TimingPerformanceCounter aspect, which is slightly more involved than the other one:

Code Snippet
using System;
using System.Diagnostics;
using PostSharp.Laos;

namespace EMC.PostSharp.PerformanceCounters
{
    /// <summary>
    /// Implements an aspect publishing a method's execution time through performance counters
    /// </summary>
    [Serializable]
    public class TimingPerformanceCounterAttribute : PerformanceCounterAttributeBase
    {
        /// <summary>
        /// Initializes a new instance of the <see cref="TimingPerformanceCounterAttribute"/> class.
        /// </summary>
        /// <param name="categoryName">Name of the category.</param>
        /// <param name="performanceCounterName">Name of the performance counter.</param>
        public TimingPerformanceCounterAttribute(string categoryName, string performanceCounterName)
            : base(categoryName, performanceCounterName)
        {
        }

        /// <summary>
        /// Method executed <b>before</b> the body of methods to which this aspect is applied.
        /// </summary>
        /// <param name="eventArgs">Event arguments specifying which method
        /// is being executed, which are its arguments, and how should the execution continue
        /// after the execution of <see cref="M:PostSharp.Laos.IOnMethodBoundaryAspect.OnEntry(PostSharp.Laos.MethodExecutionEventArgs)"/>.</param>
        /// <remarks>
        /// If the aspect is applied to a constructor, the current method is invoked
        /// after the <b>this</b> pointer has been initialized, that is, after
        /// the base constructor has been called.
        /// </remarks>
        public override void OnEntry(MethodExecutionEventArgs eventArgs)
        {
            // Record the start time here
            eventArgs.MethodExecutionTag = Stopwatch.GetTimestamp();
            base.OnEntry(eventArgs);
        }

        /// <summary>
        /// Method executed <b>after</b> the body of methods to which this aspect is applied,
        /// even when the method exits with an exception (this method is invoked from
        /// the <b>finally</b> block).
        /// </summary>
        /// <param name="eventArgs">Event arguments specifying which method
        /// is being executed and which are its arguments.</param>
        public override void OnExit(MethodExecutionEventArgs eventArgs)
        {
            base.OnExit(eventArgs);
            long start = (long) eventArgs.MethodExecutionTag;
            // Increment the counter by the value of time elapsed
            PerformanceCounter.IncrementBy(Stopwatch.GetTimestamp() - start);
        }
    }
}
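
The CountingPerformanceCounter aspect is not reproduced here, but there really is nothing to it. For completeness, a hedged sketch of what it might look like, assuming (as the code above suggests) that PerformanceCounterAttributeBase resolves the PerformanceCounter instance from the category and counter names:

Code Snippet
/// <summary>
/// Implements an aspect publishing the number of method calls through performance counters
/// </summary>
[Serializable]
public class CountingPerformanceCounterAttribute : PerformanceCounterAttributeBase
{
    public CountingPerformanceCounterAttribute(string categoryName, string performanceCounterName)
        : base(categoryName, performanceCounterName)
    {
    }

    public override void OnExit(MethodExecutionEventArgs eventArgs)
    {
        base.OnExit(eventArgs);
        // "Number of items" style counters simply get bumped once per completed call
        PerformanceCounter.Increment();
    }
}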

In case you want to take the whole lot for a spin, the file attached to this post contains all the necessary ingredients, including some unit tests. To build the app you will need the latest version of PostSharp installed on your machine. The sample app uses multiple threads and throttling to exercise some typical scenarios of performance counter utilisation (calls per second, active threads, total calls and average execution time).


October 12 2009

Performance Optimization: Rules of Engagement

On my way to work from London’s Waterloo Station one day I noticed a building on Southwark St which intrigued me: “Kirkaldy’s Testing and Experimenting Works”. As it turns out, there is a rather fascinating industrial history behind the building, but it would not be worth mentioning here if it were not for the motto above the entrance: “Facts Not Opinions”. I can hardly imagine an area of software development where that motto is more applicable than performance engineering. You see, far too many performance problems come from the fact that we spend our time and resources “optimizing” code in areas which do not need optimization at all. We often do it because “everybody knows that you should do XYZ”, or because we want to mitigate perceived performance risks by taking “proactive” action (aka premature optimization). If we were to follow the mantra of Mr Kirkaldy, we could avoid all of the above by doing just one thing: testing and measuring (and perhaps experimenting). So if you were to stop reading just now, please take this number one rule of performance optimization with you: measure first. Measuring is not only important when fixing code: it is also vital if you want to evaluate the risk of a potential design approach. So instead of doing “XYZ because everybody knows we should”, whack up a quick prototype and take it for a spin in a profiler.


One of my favourite performance myths is that you should “always cache WCF service proxies because they are expensive to create” (and of course everybody knows that). As I have heard this technique specifically mentioned in the context of an ASP.NET web app running in IIS, I could immediately hear alarm bells ringing for miles… The problems with sharing proxies between IIS sessions/threads are numerous, but I will not bother you with the details here; my main doubt was whether a WCF proxy can be efficiently shared between multiple threads using it (executing methods on it) at the same time. So I created a simple WCF service with one method simulating a 5 second wait. I set the instance mode to “per call” and then started calling the service from 5 threads on the client side, using the same proxy shared between all of them. I used a ManualResetEvent to start the threads simultaneously and expected them to finish 5 seconds later (give or take a millisecond or two). Guess what: they did not, as they blocked each other on some of the WCF internals and the whole process took 20 seconds instead of 5. So now imagine what would have happened if you had used this approach on a busy website: your “clients” would effectively be queuing to get access to the WCF service and you would end up with a potentially massive scalability issue. To make things worse, creating WCF proxies is nowadays relatively cheap (provided that you know how to do it efficiently). The moral of the story is simple: when in doubt, measure. Do not apply performance “optimisations” blindly simply because everybody knows that you should…
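
For the curious, the client side of that experiment boiled down to something like the sketch below (SearchServiceClient and LongRunningOperation are illustrative names for the generated proxy and the 5-second service operation; System.Threading, System.Diagnostics and System.Linq are assumed):

Code Snippet
var proxy = new SearchServiceClient();            // one proxy deliberately shared by all threads
var startSignal = new ManualResetEvent(false);

var threads = Enumerable.Range(0, 5)
                        .Select(i => new Thread(() =>
                        {
                            startSignal.WaitOne();          // release all calls at the same time
                            proxy.LongRunningOperation();   // server side sleeps for 5 seconds
                        }))
                        .ToList();

threads.ForEach(t => t.Start());
var stopwatch = Stopwatch.StartNew();
startSignal.Set();
threads.ForEach(t => t.Join());

// With no contention this should report roughly 5000 ms; the shared proxy pushed it to ~20000 ms
Console.WriteLine("Total time: {0} ms", stopwatch.ElapsedMilliseconds);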

As good and beneficial as performance “measuring” can be, when doing so you may often come across a phenomenon known in quantum physics as the paradox of Schrödinger's cat. To put it simply, by measuring a value you may (and most likely will) influence it. It is important to mention this here because profiling a live system may become infeasible simply because it would slow the system down to an unacceptable level. The level of performance degradation may vary from several percent (in the case of tracing the SQL being executed using SQL Profiler) to several hundred percent when using a code profiler. Keep that in mind when testing your software, as this once again illustrates that it is far better to do performance testing in development than to fight problems in production, where your ability to measure may be seriously hampered.

One of the funniest performance bugs I have ever come across was caused by a “tracing infrastructure” which, strangely enough, took an extreme amount of time to do its job. As it turned out, someone had decided that it would be great to produce the output in XML so that it could be processed later in a more structured way than plain text. The only problem was that the XmlSerializer used to create this output was constructed every time anyone tried to produce some trace output. In comparison with WCF proxies, XmlSerializers are extremely expensive to create, and this obviously had a detrimental impact on an application which used tracing extensively. I find it rather amusing, as tracing is one of the basic tools which can help you measure performance, as long of course as it does not influence it too much… :)
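
The fix in such cases is usually trivial: create the serializer once and reuse it. A hedged sketch (TraceEntry is a made-up type standing in for whatever was being serialised; System.Xml.Serialization and System.IO are assumed):

Code Snippet
// Expensive when done per trace call: depending on the constructor overload used,
// each new XmlSerializer may have to generate and load a brand new dynamic assembly.
// var serializer = new XmlSerializer(typeof(TraceEntry), new XmlRootAttribute("trace"));

// Cheap: create the serializer once and reuse it (XmlSerializer is thread safe for Serialize/Deserialize).
private static readonly XmlSerializer TraceSerializer = new XmlSerializer(typeof(TraceEntry));

private static void WriteTrace(TraceEntry entry, TextWriter output)
{
    TraceSerializer.Serialize(output, entry);
}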

If there is one thing which is certain about software performance, though, it is the fact that you can take pretty much any piece of code and make it run faster. For starters, if you do not like managed code and the overheads of JIT and garbage collection, you can go unmanaged and rewrite the piece in, say, C/C++. You could take it further and perhaps go down to assembler. Still not fast enough? How about assembler optimised for a particular processor, making use of its unique features? Or maybe offload some of the workload to the GPU? I could go on along these lines for quite a while, but the truth is that every step along this route is exponentially more expensive, and at a certain point you will be making very little progress for a lot of someone’s money. So the next golden rule of performance optimisation is: make it only as fast as it needs to be (keep it cheap). This rule pretty much rules out vague performance requirements along the lines of “the site is slow” or “make the app faster please”. In order to tackle any performance problem, the statement of it has to be more precise, more along the lines of “the process of submitting an order takes 15 seconds server-side and we want it to take no more than 3 seconds under an average load of 250 orders/minute”. In other words, you have to know exactly what the problem is and what the acceptance criteria are before you start any work.

I have to admit that I am oftentimes tasked with “just sorting this out” when the performance of a particular part of the application becomes simply unacceptable from the user’s perspective. The lack of clear performance expectations in such cases is perhaps understandable: it is difficult to expect the end user to state that “opening a document should take 1547 ms per MB of content”. Beyond that, acceptability will depend on how often the task has to be performed, how quickly the user needs it done and so on. So sometimes you just have to take them through an iterative process which stops when they say “yeah, that’s pretty good, I can live with that”.

So say you have a clear problem statement and agreed expectations, you fire up a profiler and method X() comes up at the top of the list, consuming 90% of the time. It would be easy to assume that all we have to do now is somehow optimise X(), but surprisingly this would probably be… a mistake! Rule number four of code optimisation is to fully understand the call stack before you start optimising anything. Way too many times I have seen developers “jump into action” and try to optimise the code of a method which could have been completely eliminated! Elimination is by far the cheapest option: deleting code does not cost much and you immediately improve performance by an almost infinite number of percent (I’ll leave it to you to provide a proof of the latter statement :). It may seem as if I am not being serious here, but you would be surprised how many times I have seen an application execute a piece of code just to discard the results immediately.

And last but not least, as developers we sometimes fall into the trap of gold plating: it is often tempting to fix issues you spot here and there while profiling, but the first and foremost question you should be asking is: what will the benefit be? A method may look inefficient (judging by the code), say a sequential search which could be replaced with a more efficient dictionary-type lookup, but if the profiler indicates that the code is responsible for 1% of the overall execution time, my advice is simple: do not bother. I have fallen into this trap in the past, and before you know it you end up with “small” changes in 50 different source files and suddenly none of the unit tests seem to work. So the last rule is: go for maximum results with minimum changes, even if it means that you have to leave behind some ugly code which you would love to fix. Once your bottleneck has been eliminated, sure as hell another one will pop up its ugly head, so keep tackling them one by one until you reach acceptable results. And when you get to the point where making one thing faster slows something else down, as often happens in database optimization, it means that you are “herding cats”, as we call it on my project, and you probably need a major refactoring exercise.

My current project has a dialog box with a tree view which used to take several seconds to open. On closer investigation we realised that the problem lay in how the child elements of each tree node were retrieved: the algorithm used a sequential search through a list of all elements stored in memory, along the lines of var myChildren = allElements.Where(element => element.ParentID == this.ID).ToList(). As the dialog used WPF with a hierarchical data template, each element in the list had to perform a sequential search for its children, which gives a not-so-nice O(n²) type of algorithm. The performance was bad with ~1000 elements, but when the number of elements increased overnight to 4000, the resulting 16-fold increase in execution time was unacceptable. You may think that the solution would be to rework the algorithm, and this was indeed considered for a while. But in line with the “measure”, “keep it cheap” and “make it only as fast as it needs to be” rules, the fix proved to be very simple. As it turned out, the major problem was not the algorithm as such, but the fact that the ParentID property was expensive to evaluate, and even more so when it had to be invoked 16,000,000 times. The final solution was a new, three-line-long method IsChildOf(int parentID) which reduced the execution time by a factor of 60. Now that is what I call a result: a 6000% improvement for 3 lines of code.
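
I obviously cannot reproduce the original code here, but the shape of the fix was roughly as follows (field and property names are illustrative): rather than forcing the expensive ParentID property to be evaluated for every comparison, IsChildOf compares the candidate ID directly against the cheap value the property is derived from.

Code Snippet
// Before: ParentID is expensive to evaluate and the sequential search evaluates it once per comparison
// var myChildren = allElements.Where(element => element.ParentID == this.ID).ToList();

// After: the call site goes through the new method...
var myChildren = allElements.Where(element => element.IsChildOf(this.ID)).ToList();

// ...and the method on the element class compares the underlying value directly
public bool IsChildOf(int parentID)
{
    // _parentId is the raw value the expensive ParentID property is computed from (illustrative)
    return _parentId == parentID;
}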

August 17 2009