libx264 illustrated

libx264 has so many presets and tunes that I was curious how they relate to one another when encoding video into H.264. I was mostly interested in single-pass encoding for live video, so all measurements are for this mode of operation, with the encoder running in CRF mode (constant rate factor, X264_RC_CRF).

So I took Robotica_1080.wmv, an HD video in 1440×1080 resolution, and batch-transcoded it into H.264 using libx264 (build 128) in various modes of operation:

  • Presets: “ultrafast”, “superfast”, “veryfast”, “faster”, “fast”, “medium”, “slow”, “slower”, “veryslow”
  • Tunes: “film”, “animation”, “grain”, “stillimage”, “psnr”, “ssim”, “fastdecode”, “zerolatency”, “touhou”
  • CRFs: 14, 17, 20, 23, 26
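The batch above can be pictured as a grid of runs. The following is a minimal sketch, assuming every preset was combined with every tune and every CRF value (the post does not state this explicitly); `EncodeRun` and `BuildRunGrid` are hypothetical names used only for illustration, and the libx264 calls are shown in a comment rather than executed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One encoder configuration of the benchmark grid (hypothetical helper type).
struct EncodeRun {
    std::string preset;
    std::string tune;
    int crf;
};

// Builds the Cartesian product preset x tune x CRF, assuming the full grid was run.
std::vector<EncodeRun> BuildRunGrid() {
    const std::vector<std::string> presets = {
        "ultrafast", "superfast", "veryfast", "faster", "fast",
        "medium", "slow", "slower", "veryslow"};
    const std::vector<std::string> tunes = {
        "film", "animation", "grain", "stillimage", "psnr",
        "ssim", "fastdecode", "zerolatency", "touhou"};
    const std::vector<int> crfs = {14, 17, 20, 23, 26};

    std::vector<EncodeRun> runs;
    runs.reserve(presets.size() * tunes.size() * crfs.size());
    for (const std::string& preset : presets) {
        for (const std::string& tune : tunes) {
            for (int crf : crfs) {
                // With libx264, one run would roughly be configured as:
                //   x264_param_default_preset(&param, preset, tune);
                //   param.rc.i_rc_method = X264_RC_CRF;
                //   param.rc.f_rf_constant = crf;
                runs.push_back(EncodeRun{preset, tune, crf});
            }
        }
    }
    assert(runs.size() == presets.size() * tunes.size() * crfs.size());
    return runs;
}
```

Under that assumption the grid has 9 × 9 × 5 = 405 entries, one per transcoding run.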

It is worth mentioning that libx264 does an excellent job in terms of transcoding performance. The transcoding operation was a DirectShow graph of the following topology:

Some measurements are obviously not quite accurate, because they include not only encoding time but also WMV decoding time and other overhead. Still, this should give a good idea of how the modes compare side by side.

For every transcoding run I have the following values (Excel spreadsheet attached below):

  • Processor Time: number of processor-milliseconds spent on the transcoding; I was measuring on an 8-core system, so at 100% load Processor Time could be up to eight times higher than Elapsed Time (below), provided that all cores were fully used
  • Elapsed Time: wall-clock milliseconds spent on the transcoding, regardless of how many cores were actually in use; because the original clip is 20 seconds long, anything below 20,000 ms is faster than real-time processing
  • Output File Size: size of the resulting video-only MP4 file; some headers count as well, but it is obviously mostly payload data; for a 20-second clip, 20 MB corresponds to a bitrate of 8 Mbit/s

Another, derived value is:

  • Processor Time/Elapsed Time: shows how fully the multicore system is used; some modes are clearly not using all available cores, while others do
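The arithmetic behind these values is simple enough to spell out. This is a small sketch with hypothetical helper names; it assumes decimal megabytes (1 MB = 10^6 bytes), which is what makes 20 MB over 20 seconds come out at 8 Mbit/s.

```cpp
#include <cassert>

// Average bitrate in Mbit/s, assuming decimal megabytes (1 MB = 10^6 bytes).
double AverageBitrateMbps(double fileSizeMegabytes, double clipSeconds) {
    assert(clipSeconds > 0);
    return fileSizeMegabytes * 8.0 / clipSeconds;
}

// Processor Time / Elapsed Time: about 1 means one busy core,
// about 8 means all eight cores of the test machine were busy.
double AverageBusyCores(double processorTimeMs, double elapsedTimeMs) {
    assert(elapsedTimeMs > 0);
    return processorTimeMs / elapsedTimeMs;
}

// True when the transcoding took less wall-clock time than the clip lasts.
bool IsFasterThanRealtime(double elapsedTimeMs, double clipSeconds) {
    return elapsedTimeMs < clipSeconds * 1000.0;
}
```

For example, `AverageBitrateMbps(20, 20)` gives 8, and a run with 80,000 ms of Processor Time over 10,000 ms of Elapsed Time kept all eight cores busy.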

Let us start watching pictures.

Average Elapsed Time for Preset/Tune (covers runs with different rate factors) shows that the slow and slower presets take exponentially more time to encode. The psnr and ssim tunes transcode slightly faster, while the zerolatency tune is the most expensive.

ultrafast and superfast presets produced significantly larger files, about 2x as large as other presets.

Once again, an exponential scale for Elapsed Time, and a similar Processor Time chart:

It is worth mentioning that the fastest presets are not using all CPU cores. Apart from being faster on their own, they leave some CPU time for other processing, which can be useful for live encoding applications and for those processing multiple streams at once.

And finally, a detailed chart of how file size depends on preset and CRF. As we already discovered, ultrafast and superfast produce larger streams, while the output of the other modes does not differ much (within a few percent, mostly on the slowest end). Each increase of the rate factor by three reduces the amount of produced bytes to about 0.7x.
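The 0.7x observation can be turned into a rough rule of thumb. The sketch below assumes the same factor applies uniformly across the whole measured CRF range, which is an extrapolation from the chart rather than something measured separately; `EstimatedSizeRatio` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Estimated output size at toCrf relative to the size at fromCrf,
// assuming every +3 in CRF multiplies the size by about 0.7.
double EstimatedSizeRatio(int fromCrf, int toCrf) {
    assert(fromCrf >= 0 && toCrf >= 0);  // CRF values are non-negative
    const double ratioPerThreeSteps = 0.7;
    return std::pow(ratioPerThreeSteps, (toCrf - fromCrf) / 3.0);
}
```

Going from CRF 14 to CRF 26 is four such steps, so under this assumption the file shrinks to roughly 0.7^4, about a quarter of its size.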

More fun charts can be obtained from the attached .XLS file.


Obtaining number of thread context switches programmatically

The previous post on thread synchronization and context switches used the number of thread context switches as one of the performance indicators. One might have a hard time getting this number from the operating system though.

The only well-documented way to get the number of context switches seems to be through the corresponding performance counters. The Thread performance object lists the available thread instances, and the counter “Thread(<process-name>/<thread-number>)/Context Switches/sec” provides the context switch rate per second.

Performance counters are far from the most convenient API, and to process the data programmatically one would really prefer the absolute number of switches rather than a rate per second (which is still good for interactive monitoring).

A gate into the kernel world to grab the data of interest is provided by the NtQuerySystemInformation function. Although it is mentioned in the documentation, it is marked as unreliable for use, and the Windows SDK static library is missing it, so one has to obtain it explicitly using GetModuleHandle/GetProcAddress.
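One reason the absolute number is preferable: a rate over any interval of interest can always be derived from two absolute samples, while the reverse is not possible. A minimal sketch, with a hypothetical `ContextSwitchSample` structure that is not part of any Windows API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sample: absolute context switch count of a thread
// together with the time (in milliseconds) at which it was read.
struct ContextSwitchSample {
    std::uint64_t switchCount;
    std::uint64_t timestampMs;
};

// Average context switch rate between two samples of the same thread.
double ContextSwitchesPerSecond(const ContextSwitchSample& earlier,
                                const ContextSwitchSample& later) {
    assert(later.timestampMs > earlier.timestampMs);
    assert(later.switchCount >= earlier.switchCount);
    const double switches =
        static_cast<double>(later.switchCount - earlier.switchCount);
    const double seconds =
        static_cast<double>(later.timestampMs - earlier.timestampMs) / 1000.0;
    return switches / seconds;
}
```

With samples of 1000 switches at 0 ms and 4000 switches at 1500 ms, the thread averaged 2000 context switches per second over that interval.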

typedef NTSTATUS (WINAPI *NTQUERYSYSTEMINFORMATION)(SYSTEM_INFORMATION_CLASS SystemInformationClass, PVOID SystemInformation, ULONG SystemInformationLength, PULONG ReturnLength);
NTQUERYSYSTEMINFORMATION NtQuerySystemInformation = (NTQUERYSYSTEMINFORMATION) GetProcAddress(GetModuleHandle(_T("ntdll.dll")), "NtQuerySystemInformation");
ATLVERIFY(NtQuerySystemInformation(...) == 0);

With this done, the function can provide SystemProcessInformation/SYSTEM_PROCESS_INFORMATION data about running processes.


Thread synchronization and context switches

A basic task in thread synchronization is putting something on one thread and getting it out on another thread for further processing. Two or more threads of execution access certain data, and in order to keep the data consistent the access is split into atomic operations, each of which is allowed for only one thread at a time. Until one thread completes its operation, another is not allowed to touch the data and has to wait. This is what synchronization objects, and critical sections in particular, are for. Furthermore, a thread which is waiting for data to become available has nothing to do, so it uses one of the wait functions to avoid wasting CPU time, and the threads use events or similar objects to send notifications and to wake up from the wait state on receiving them.

Let us see what the cost of doing things not quite right is. Let us take a send thread which generates data/events, locks a shared resource, and sets an event when something is done and another, receive thread needs to wake up and take over. The send thread might be doing something like:
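The basic pattern described above can be sketched in portable C++. This is an analogy, not the code measured in this post: `std::mutex` stands in for the critical section, `std::condition_variable` for the event object, and `RunSendReceive` is a hypothetical name.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Runs senderCount send threads and one receive thread; returns how many
// items the receive thread took out of the shared queue.
std::size_t RunSendReceive(std::size_t senderCount, std::size_t itemsPerSender) {
    std::mutex dataMutex;                   // guards everything declared below
    std::condition_variable dataAvailable;  // wakes up the receive thread
    std::deque<int> queue;
    std::size_t activeSenders = senderCount;
    std::size_t receivedCount = 0;

    std::thread receiver([&] {
        std::unique_lock<std::mutex> lock(dataMutex);
        for (;;) {
            // Sleep without burning CPU until there is data or nothing more will come.
            dataAvailable.wait(lock, [&] { return !queue.empty() || activeSenders == 0; });
            while (!queue.empty()) {
                queue.pop_front();
                ++receivedCount;
            }
            if (activeSenders == 0)
                break;
        }
    });

    std::vector<std::thread> senders;
    for (std::size_t i = 0; i < senderCount; ++i) {
        senders.emplace_back([&, i] {
            for (std::size_t n = 0; n < itemsPerSender; ++n) {
                {
                    // Only one thread at a time may touch the queue.
                    std::lock_guard<std::mutex> lock(dataMutex);
                    queue.push_back(static_cast<int>(i));
                }
                dataAvailable.notify_one();
            }
            {
                std::lock_guard<std::mutex> lock(dataMutex);
                --activeSenders;
            }
            dataAvailable.notify_one();
        });
    }

    for (std::thread& sender : senders)
        sender.join();
    receiver.join();
    assert(queue.empty());
    return receivedCount;
}
```

Calling `RunSendReceive(3, 1000)` runs three send threads against one receive thread and returns the number of items received.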

CComCritSecLock<CComAutoCriticalSection> DataLock(m_CriticalSection);

And receive thread will wait and take over like this:

CComCritSecLock<CComAutoCriticalSection> DataLock(m_CriticalSection);

Let us have three send threads and one receive thread running in parallel:

The simplicity is tempting and having run this the result over 60 seconds is:


ProcessSnapshot: Create process minidump for post-mortem debugging

ProcessSnapshot is a utility to take a snapshot of process call stacks, and the snapshot taken is written into a human friendly text file.

ProcessSnapshot is taking process minidump files

In addition to this, the utility can now create process minidump files on user request. The minidump files can be opened in a debugger to analyze the context of the process in a feature-rich debug environment, especially Microsoft Visual Studio. To create a minidump for a process, check the corresponding box and press the “Take a Dump” button. A file named “<process-image-name> – <date> <time>.dmp” will be created in the directory of the utility executable.


A binary [Win32, x64] and partial Visual C++ .NET 2008 source code are available from SVN.

Sharing Memory Allocators while at the same time Handling Dynamic Media Type Changes

Sharing memory allocators between input and output pins is an important technique for keeping a filter graph performant. Unlike the more frequent scenario with different allocators, a filter (referred to as “middle filter” below) which has equal media types on its input and output pins can avoid a memory-to-memory copy for every frame processed, by delivering downstream the very buffer obtained from the upstream filter. With high-resolution video at a high frame rate, and multiple streams running simultaneously, this is an expense one would try to avoid for performance reasons.

Memory allocators are (or can be) shared by well known filters, such as Sample Grabber Filter, Infinite Tee Pin Filter and in-place transformation base filters (CTransInPlaceFilter Class).

Still, when handling Dynamic Format Changes (not only those from the video renderer filter), filters that share memory allocators may run into a problem with being notified of the media type change. Because the allocator is typically owned by another filter (e.g. Video Mixing Renderer Filter) and its buffers are requested directly by the upstream filter, the upstream filter obtains an allocated buffer independently of the middle filter that shares the allocator. If the buffer has a media type attached (see AM_SAMPLE2_PROPERTIES::pMediaType) but the upstream filter decides never to deliver this buffer, the middle filter has no way to learn that the dynamic format change has taken place.

As a workaround for handling Format Changes from the Video Renderer, when the resolution is unchanged and only the stride might be extended, the middle filter can check the data size in the lActual field and learn about the change from an increase in this value.

To be reliably notified of a media type change, the middle filter has to take extra measures while sharing the allocator. Instead of using the raw allocator obtained from one pin on another pin (typically the output pin’s allocator used on the input pin), the middle filter may use an internal proxy object which implements the IMemAllocator interface and forwards calls to the original IMemAllocator. In addition, the proxy can check for an attached media type on every buffer taken from the allocator. Once a change is noticed, at the moment the upstream filter requests the buffer, the proxy has a timely chance to remember the new media type, so that the following IMemInputPin::Receive call can check for it in case the upstream filter decided not to deliver the buffer with the attached media type.
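A minimal sketch of this check, assuming a packed single-plane format such as RGB32 where the data size is simply stride times height (planar formats would need different arithmetic); both function names are hypothetical, and `actualSizeBytes` stands for the value of the lActual field.

```cpp
#include <cassert>
#include <cstddef>

// True when the sample carries more data than a tightly packed frame would,
// which hints that the renderer extended the stride.
bool StrideWasExtended(std::size_t actualSizeBytes, std::size_t widthPixels,
                       std::size_t heightRows, std::size_t bytesPerPixel) {
    const std::size_t packedSizeBytes = widthPixels * bytesPerPixel * heightRows;
    return actualSizeBytes > packedSizeBytes;
}

// Row stride in bytes implied by the data size, or 0 when the size is not
// a whole number of rows and the assumption does not hold.
std::size_t StrideFromActualSize(std::size_t actualSizeBytes, std::size_t heightRows) {
    assert(heightRows > 0);
    if (actualSizeBytes % heightRows != 0)
        return 0;
    return actualSizeBytes / heightRows;
}
```

For a 640×480 RGB32 frame the packed size is 1,228,800 bytes; a sample of 1,351,680 bytes implies a stride of 2,816 bytes, that is 704 pixels per row.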

    // ...
    ATLASSERT((InputMediaSampleProperties.pMediaType != NULL) ^ !(InputMediaSampleProperties.dwSampleFlags & AM_SAMPLE_TYPECHANGED));
        CRoCriticalSectionLock DataLock(GetDataCriticalSection());
        const CObjectPtr<CProxyMemAllocator>& pInputProxyMemAllocator = m_pInputPin->GetProxyMemAllocatorReference();
        CMediaType pMediaType;
        if(pInputProxyMemAllocator && pInputProxyMemAllocator->GetDynamicallyChangedMediaType(pMediaType, TRUE))
            // ...
        // ...
    DeliverMediaSample(pMemInputPin, pInputMediaSample);
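The proxy idea can be illustrated outside of DirectShow. The sketch below uses simplified stand-ins (`Sample`, `Allocator`, `ScriptedAllocator`), which are not the real IMediaSample/IMemAllocator interfaces. The method name GetDynamicallyChangedMediaType is borrowed from the excerpt above, but its body and the meaning of the second argument ("reset after reading") are assumptions, and locking is omitted for brevity.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for a media sample; NOT the DirectShow IMediaSample.
struct Sample {
    bool hasMediaType;      // true when a new media type is attached
    std::string mediaType;  // textual placeholder for AM_MEDIA_TYPE
};

// Simplified stand-in for IMemAllocator.
class Allocator {
public:
    virtual ~Allocator() = default;
    virtual Sample GetBuffer() = 0;
};

// Forwards to the real allocator and remembers any media type it sees attached.
class ProxyAllocator : public Allocator {
public:
    explicit ProxyAllocator(std::shared_ptr<Allocator> inner) : m_inner(std::move(inner)) {
        assert(m_inner);
    }
    Sample GetBuffer() override {
        Sample sample = m_inner->GetBuffer();
        if (sample.hasMediaType) {  // the upstream filter is being handed a format change
            m_changed = true;
            m_changedMediaType = sample.mediaType;
        }
        return sample;
    }
    // Returns true and the remembered type if a change was seen; optionally forgets it.
    bool GetDynamicallyChangedMediaType(std::string& mediaType, bool reset) {
        if (!m_changed)
            return false;
        mediaType = m_changedMediaType;
        if (reset)
            m_changed = false;
        return true;
    }
private:
    std::shared_ptr<Allocator> m_inner;
    bool m_changed = false;
    std::string m_changedMediaType;
};

// Demo allocator (hypothetical): attaches a media type on the N-th GetBuffer call.
class ScriptedAllocator : public Allocator {
public:
    ScriptedAllocator(int changeOnCall, std::string mediaType)
        : m_changeOnCall(changeOnCall), m_mediaType(std::move(mediaType)) {}
    Sample GetBuffer() override {
        ++m_calls;
        if (m_calls == m_changeOnCall)
            return Sample{true, m_mediaType};
        return Sample{false, std::string()};
    }
private:
    int m_changeOnCall;
    std::string m_mediaType;
    int m_calls = 0;
};
```

The point of the design is that the proxy sees every buffer at the moment it is requested, so the middle filter can still pick up the new media type in its receive path even if the upstream filter drops the buffer that carried it.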

ProcessSnapshot to take a snapshot of process modules, threads, stacks and performance

While troubleshooting a released application on a remote production site, it is very useful to capture the state of the process for further analysis. In several scenarios the following information about the process state is helpful:

  • modules (DLLs) loaded into process and their versions
  • threads and their call stacks
  • process and thread performance

The ProcessSnapshot utility takes advantage of the Debugging Tools API (dbghelp.dll; note that the dialog also displays the DLL version in the bottom right corner) and writes this information to a text file. It can also take a sequence of snapshots so that thread performance and/or call stacks can be compared and the differences checked.

The generated file is in the directory of the utility application and looks like:

  System Time: 10/14/2008 8:46:33 PM
  Local Time: 10/14/2008 11:46:33 PM

  Creation System Time: 10/14/2008 8:46:28 PM
  Kernel Time: 0.094 s
  User Time: 0.031 s


  Module: ProcessSnapshot.exe @00400000
    Base Address: 0x00400000
    Base Size: 0x0005b000 (372736)
    Name: ProcessSnapshot.exe
    Path: D:\Projects\Utilities\ProcessSnapshot\Release\ProcessSnapshot.exe
    Product Version:
    File Version:

  Module: ntdll.dll @7c900000
    Base Address: 0x7c900000
    Base Size: 0x000af000 (716800)
    Name: ntdll.dll
    Path: C:\WINDOWS\system32\ntdll.dll
    Product Version: 5.1.2600.5512
    File Version: 5.1.2600.5512


  Thread: 3824
    Base Priority: 8
    Creation System Time: 10/14/2008 8:46:57 PM
    Kernel Time: 0.063 s
    User Time: 0.031 s
    Call Stack
      ntdll!7c90e4f4 KiFastSystemCallRet (+ 0) @7c900000
      USER32!7e4249c4 GetCursorFrameInfo (+ 460) @7e410000
      USER32!7e424a06 DialogBoxIndirectParamAorW (+ 54) @7e410000
      USER32!7e4247ea DialogBoxParamW (+ 63) @7e410000
      ProcessSnapshot!00403f45 ATL::CDialogImpl<CMainDialog,ATL::CWindow>::DoModal (+ 67) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlwin.h, 3478] (+ 28) @00400000
      ProcessSnapshot!00403b6f CProcessSnapshotModule::RunMessageLoop (+ 74) [d:\projects\utilities\processsnapshot\processsnapshot.cpp, 67] (+ 0) @00400000
      ProcessSnapshot!004049b9 ATL::CAtlExeModuleT<CProcessSnapshotModule>::Run (+ 17) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlbase.h, 3552] (+ 0) @00400000
      ProcessSnapshot!004041c3 ATL::CAtlExeModuleT<CProcessSnapshotModule>::WinMain (+ 48) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlbase.h, 3364] (+ 5) @00400000
      ProcessSnapshot!00434477 wWinMain (+ 5) [*d:\projects\utilities\processsnapshot\release\processsnapshot.inj:5, 14] (+ 0) @00400000
      ProcessSnapshot!00415058 __tmainCRTStartup (+ 274) [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c, 263] (+ 27) @00400000



> How do you test performance?

I don’t. I just believe in it.

This is actually how it is here, and still we have managed to deliver software that gives more frames per second than its rivals. Why? Hopefully because we knew what we were doing in the first place. According to one of our partner hardware vendors, there are only two software packages which can render multiple megapixel video feeds at the rates the cameras can provide: ours, and another one whose trail leads to sources in Eastern Europe…