Tag Archives: synchronization

Booo SRW Locks

Windows Vista introduced new synchronization API: slim reader/writer (SRW) locks. Being already armed with critical sections, one perhaps would not desperately need an SRW lock, but still it offers a great option to provide both exclusive (critical section alike) mode and shared mode when 2+ threads can enter protected section simultaneously. Some time earlier, I already touched SRW lock recursion issue.

This time it is more about performance. Let us suppose we have a code fragment:

static const SIZE_T g_nCount = 100000000;
for(SIZE_T nIndex = 0; nIndex < g_nCount; nIndex++)
{
    AcquireSRWLockShared(&m_Lock);
    ReleaseSRWLockShared(&m_Lock);
}

How fast is this? Provided that execution took 1.6 ms on an idle lock variable, how fast it is going to be with another shared “reader” on another thread who acquired once access in shared mode? This part comes up confusing: it appears almost twice as slow: 2.9 ms, with also a number of contentions (and context switches) on the way.

SRW lock is advertised as lightweight and fast API, no recursion, no extra features, no fool proof checks. So it could be assumed to get you top performance, but 50 lines of code class can definitely outperform it.

A perhaps simplest SRW lock can be backed on a LONG (or LONGLONG) volatile variable accessed with interlocked API functions. The rules are simple:

  • initial value of zero means idle state
  • shared “reader” enters incrementing by one, positive values indicates one ore more readers
  • exclusive “writer” enters decrementing by a hundred (well, this needs to be any value big enough to be less than minus maximal amount of simultaneous concurrent readers)
  • releasing acquired state the value is decremented (incremented) back
  • in case of contention thread yields execution to continue later with possibly better luck with shared resource (option: implement spic count for several attempts in a row, which is better suitable for multi-processor systems)

It is clear that concurrent readers touch value only once with a single increment only, leaving no opportunity to yield execution due to contention. How fast it can be?

static const SIZE_T g_nCount = 100000000;
for(SIZE_T nIndex = 0; nIndex < g_nCount; nIndex++)
{
    while(InterlockedIncrement(&m_nNativeLock) < 0)
    {
        InterlockedDecrement(&m_nNativeLock);
        SwitchToThread();
    }
    InterlockedDecrement(&m_nNativeLock);
}

It is 1.3 ms regardless whether there are any shared readers on concurrent threads. It appears that a simple custom SRW lock class is going to be superior to the API:

  • faster due to zero API overhead (inline compiled code versus WINAPI convention functions)
  • faster in shared mode due to not being subject to additional overhead and contentions
  • flexible spin counts possible to tune performance for specific use
  • extensibility:
    • recursion is allowed shared mode
    • easy to implement upgrades and downgrades between shared and exclusive modes

Sample code is Visual Studio 2010 C++ project accessible from SVN repository.

Update: Critical Section Test. This is how SRW lock API performance compares to entry/leaving critical section (provided that critical section is never locked at entry time). Critical section is about 15% slower to enter and leave.

C:\Projects\...\SrwLockTest02\x64\Release>SrwLockTest02.exe
API: 100M iterations, 1575 ms
API: 100M iterations, 2839 ms
Native: 100M iterations, 1311 ms
Native: 100M iterations, 1326 ms
Critical Section: 100M iterations, 1841 ms

Thread synchronization and context switches

A basic task in thread synchronization is putting something on one thread and getting it out on another thread for further processing. Two or more threads of execution are accessing certain data, and in order to keep data consistent and solid the access is split into atomic operations which are only allowed for one thread at a time. Before one thread completes its thing, another is not allowed to touch stuff, such as waiting for so called wait state. This is what synchronization objects and critical sections in particular for. Furthermore, a thread which is waiting for stuff to be available has nothing to do, so it uses one of the wait functions to not waste CPU time, and both threads are using event or similar objects to notify and receive notifications waking up from wait state.

Let us see what is the cost of doing things not quite right. Let  us take a send thread which is generating data/events which is locking shared resource and setting an event when something is done and requires another receive thread to wake up and take over. Send thread might be doing something like:

CComCritSecLock<CComAutoCriticalSection> DataLock(m_CriticalSection);
m_nSendCount++;
ATLVERIFY(m_AvailabilityEvent.Set());

And receive thread will wait and take over like this:

CComCritSecLock<CComAutoCriticalSection> DataLock(m_CriticalSection);
m_nReceiveCount++;
ATLVERIFY(m_AvailabilityEvent.Reset());

Let us have three send threads and one receive thread running in parallel:

The simplicity is tempting and having run this the result over 60 seconds is:

Read more »

Recursive SRW Locks

Windows Vista added new synchronization API, Slim Reader/Writer (SRW) Locks, which is a powerful alternative to critical sections. The detailed description is, as always on MSDN, and what makes it really cool is simple:

  • unlike critical sections, SRW Locks provide reader and writer access synchronization making it possible for 2 and more reader to not block one another
  • SRW Locks do not reference any resources and have size of a pointer, which is the simplest possible scenario; as a result, they don’t need a destructor and their initialization is simple zeroing of memory/variable (for which you however should use InitializeSRWLock API

Being lightweight they are efficient. To understand how at all they can work, one can imagine that a reader might be trying to InterlockedIncrement a synchronization variable. If result is positive, then it’s OK to go. Otherwise, reader should decrement it back, wait and retry. A writer, instead, does InterlockedAdd with an argument of -0×1000 and checks that result of the operation is exactly -0×1000.

This post is about a trap one cat enter into by neglecting one of the SRW lock warnings:

… so SRW locks cannot be acquired recursively. In addition, a thread that owns an SRW lock in shared mode cannot upgrade its ownership of the lock to exclusive mode.

SRW locks cannot be acquired recursively, but it is very easy to make a mistake. If you attempt to recursively acquire, you are likely to succeed, without a warning, error code, exception or assertion failure. You pass this point and you can write quite some code before you realize something is wrong.

It can be as simple as this:

Read more »

XviD video encoder and window messages from a worker thread

While encoding video, Xvid Video Encoder provides an optional status window displaying information on encoding progress.

Xvid Encoder Status Window

Since streaming typically takes place in a worker thread, with possibly no windows at all, and no message pump, the window is to be created on GUI thread and the encoder needs to synchronize progress updates with the window.

Such synchronization is a typical point where a deadlock can occur. Carelessly a window message may be sent in a blocking way so that worker thread is waiting while the message is processed on the window thread. If this sending happens at the moment of GUI thread trying to stop worker thread, a deadlock takes place. This problem often comes up when message sending takes place indirectly, e.g. being wrapped by an API call, such as SetWindowText, or as it can be seen below IsDlgButtonChecked.

In order to avoid deadlocks one should never SendMessage from a worker thread. Instead there should be a PostMessage with possibly a wait function which waits for either synchronization event (set by window) or worker thread termination request:

{
  CLock Lock(m_CriticalSection);
  m_sText = sText;
  m_SynchronizationEvent.Reset();
}
PostMessage(...);
HANDLE phObjects[] = { ThreadTerminationEvent, m_SynchronizationEvent };
const DWORD nWaitResult =  WaitForMultipleObjects(..., phObjects, ...);
if(nWaitResult == WAIT_OBJECT_0 + 1 ) // m_SynchronizationEvent
{
  CLock Lock(m_CriticalSection);
  // ...
}

Back to Xvid encoder: trying to abort encoding process, the application deadlocks. Thread checking shows worker thread state:

Xvid Video Encoder Deadlocked Thread Call Stack

That is, a described deadlock in action.

While it is already made this way letting deadlock occur, is there any workaround to avoid locking? Yes, there is. Stopping encoding, GUI thread should keep pumping messages while waiting for worker thread termination. Before closing worker thread handle, one needs to signal the thread to terminate and wait using thread handle with MsgWaitForMultipleObjects, so that messages possibly sent while terminating the thread are dispatched to target windows.