Booo SRW Locks – Fooling Around

Windows Vista introduced new synchronization API: slim reader/writer (SRW) locks. Being already armed with critical sections, one perhaps would not desperately need an SRW lock, but still it offers a great option to provide both exclusive (critical section alike) mode and shared mode when 2+ threads can enter protected section simultaneously. Some time earlier, I already touched SRW lock recursion issue.

This time it is more about performance. Let us suppose we have a code fragment:

static const SIZE_T g_nCount = 100000000;
for(SIZE_T nIndex = 0; nIndex < g_nCount; nIndex++)
{
    AcquireSRWLockShared(&m_Lock);
    ReleaseSRWLockShared(&m_Lock);
}

How fast is this? Provided that execution took 1.6 ms on an idle lock variable, how fast it is going to be with another shared “reader” on another thread who acquired once access in shared mode? This part comes up confusing: it appears almost twice as slow: 2.9 ms, with also a number of contentions (and context switches) on the way.

SRW lock is advertised as lightweight and fast API, no recursion, no extra features, no fool proof checks. So it could be assumed to get you top performance, but 50 lines of code class can definitely outperform it.

A perhaps simplest SRW lock can be backed on a LONG (or LONGLONG) volatile variable accessed with interlocked API functions. The rules are simple:

initial value of zero means idle state
shared “reader” enters incrementing by one, positive values indicates one ore more readers
exclusive “writer” enters decrementing by a hundred (well, this needs to be any value big enough to be less than minus maximal amount of simultaneous concurrent readers)
releasing acquired state the value is decremented (incremented) back
in case of contention thread yields execution to continue later with possibly better luck with shared resource (option: implement spic count for several attempts in a row, which is better suitable for multi-processor systems)

It is clear that concurrent readers touch value only once with a single increment only, leaving no opportunity to yield execution due to contention. How fast it can be?

static const SIZE_T g_nCount = 100000000;
for(SIZE_T nIndex = 0; nIndex < g_nCount; nIndex++)
{
    while(InterlockedIncrement(&m_nNativeLock) < 0)
    {
        InterlockedDecrement(&m_nNativeLock);
        SwitchToThread();
    }
    InterlockedDecrement(&m_nNativeLock);
}

It is 1.3 ms regardless whether there are any shared readers on concurrent threads. It appears that a simple custom SRW lock class is going to be superior to the API:

faster due to zero API overhead (inline compiled code versus WINAPI convention functions)
faster in shared mode due to not being subject to additional overhead and contentions
flexible spin counts possible to tune performance for specific use
extensibility:
- recursion is allowed shared mode
- easy to implement upgrades and downgrades between shared and exclusive modes

Sample code is Visual Studio 2010 C++ project accessible from SVN repository.

Update: Critical Section Test. This is how SRW lock API performance compares to entry/leaving critical section (provided that critical section is never locked at entry time). Critical section is about 15% slower to enter and leave.

C:\Projects\...\SrwLockTest02\x64\Release>SrwLockTest02.exe
API: 100M iterations, 1575 ms
API: 100M iterations, 2839 ms
Native: 100M iterations, 1311 ms
Native: 100M iterations, 1326 ms
Critical Section: 100M iterations, 1841 ms

Leave a Reply Cancel reply