Published by Roman on 16 Jul 2009

Sharing Memory Allocators while at the same time Handling Dynamic Media Type Changes

Sharing memory allocators between input and output pins is an important concept to keep performance of filter graph. Unlike more frequent scenario with different allocators, a filter (referred to as “middle filter” below) which has equal media types on input and output pins has an advantage to avoid memory-to-memory copy operation for every frame processed, by delivering downstream the buffer obtained from an upstream filter. With a high resolution video, at high rate, multiple streams running simultaneously this is the expense one would try to avoid for performance reasons.

Memory allocators are (or can be) shared by well known filters, such as Sample Grabber Filter, Infinite Tee Pin Filter and in-place transformation base filters (CTransInPlaceFilter Class).

Still handling Dynamic Format Changes (not only from video renderer filter) filters that share memory allocators may run into the problem of being notified of media type change. Because allocator are typically owned by another filter (e.g. Video Mixing Renderer Filter) and originally its buffer is queried by an upstream filter, the upstream filter obtains allocated buffer independently from the middle filter that shares memory allocators. If the upstream filter decides to never deliver this buffer, however the buffer has a media type attached (see AM_SAMPLE2_PROPERTIES::pMediaType), there is no way for the middle filter to learn about dynamic format change completed.

As a workaround for handling Format Changes from the Video Renderer, when resolution is not changed and it is only stride which might be extended, middle filter might be checking data size in lActual field and learn about the change from an increase in this value.

To be reliably notified on media type change the middle filter is to take extra measures while sharing the allocator. Instead using raw allocator obtained from one pin on another pin (typically output pin’s allocator to be used on an input pin), middle filter may be using an internal proxy object, which implements IMemAllocator interface and forward calls to internal IMemAllocator, obtained originally. Additionally to that, the proxy can check for attached media types on every buffer taken from the allocator, and once the change is noticed – at the moment upstream filter is requesting the buffer – the proxy has a timely chance to remember the new media type so that in the following IMemInputPin::Receive call this media type can be checked for the case upstream buffer decided to not deliver the buffer with attached media type.

if(IsSharingMemAllocators())
{
    // ...
    ATLASSERT((InputMediaSampleProperties.pMediaType != NULL) ^ !(InputMediaSampleProperties.dwSampleFlags & AM_SAMPLE_TYPECHANGED));
    {
        CRoCriticalSectionLock DataLock(GetDataCriticalSection());
        const CObjectPtr<CProxyMemAllocator>& pInputProxyMemAllocator = m_pInputPin->GetProxyMemAllocatorReference();
        CMediaType pMediaType;
        if(pInputProxyMemAllocator && pInputProxyMemAllocator->GetDynamicallyChangedMediaType(pMediaType, TRUE))
        {
            m_pInputPin->SetMediaType(pMediaType);
            m_pOutputPin->SetMediaType(pMediaType);
            // ...
        }
    }
    if(InputMediaSampleProperties.pMediaType)
    {
        m_pInputPin->SetMediaType(InputMediaSampleProperties.pMediaType);
        m_pOutputPin->SetMediaType(InputMediaSampleProperties.pMediaType);
        // ...
    }
    DeliverMediaSample(pMemInputPin, pInputMediaSample);
}

Published by Roman on 23 Dec 2008

ProcessSnapshot to take a snapshot of process modules, threads, stacks and performance

While troubleshooting released application on remote production site, it is very useful to grasp a state of the process for further analysis. There are several scenarios in which the following information about process state is helpful:

  • modules (DLLs) loaded into process and their versions
  • threads and their call stacks
  • process and thread performance

An utility ProcessSnapshot takes advantage of Debugging Tools API (dbghelp.dll – note the dialog also displays DLL version in the right bottom corner) and writes this helpful information to text file and it can also take a sequence of the snapshots to compare thread performance and/or stacks and check the difference.

The generated file is in the directory of the utility application and looks like:

Snapshot
  System Time: 10/14/2008 8:46:33 PM
  Local Time: 10/14/2008 11:46:33 PM

Performance
  Creation System Time: 10/14/2008 8:46:28 PM
  Kernel Time: 0.094 s
  User Time: 0.031 s

Modules

  Module: ProcessSnapshot.exe @00400000
    Base Address: 0x00400000
    Base Size: 0x0005b000 (372736)
    Name: ProcessSnapshot.exe
    Path: D:\Projects\Utilities\ProcessSnapshot\Release\ProcessSnapshot.exe
    Product Version: 1.0.0.1
    File Version: 1.0.0.125

  Module: ntdll.dll @7c900000
    Base Address: 0x7c900000
    Base Size: 0x000af000 (716800)
    Name: ntdll.dll
    Path: C:\WINDOWS\system32\ntdll.dll
    Product Version: 5.1.2600.5512
    File Version: 5.1.2600.5512
[...]

Threads

  Thread: 3824
    Base Priority: 8
    Creation System Time: 10/14/2008 8:46:57 PM
    Kernel Time: 0.063 s
    User Time: 0.031 s
    Call Stack
      ntdll!7c90e4f4 KiFastSystemCallRet (+ 0) @7c900000
      USER32!7e4249c4 GetCursorFrameInfo (+ 460) @7e410000
      USER32!7e424a06 DialogBoxIndirectParamAorW (+ 54) @7e410000
      USER32!7e4247ea DialogBoxParamW (+ 63) @7e410000
      ProcessSnapshot!00403f45 ATL::CDialogImpl<CMainDialog,ATL::CWindow>::DoModal (+ 67) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlwin.h, 3478] (+ 28) @00400000
      ProcessSnapshot!00403b6f CProcessSnapshotModule::RunMessageLoop (+ 74) [d:\projects\utilities\processsnapshot\processsnapshot.cpp, 67] (+ 0) @00400000
      ProcessSnapshot!004049b9 ATL::CAtlExeModuleT<CProcessSnapshotModule>::Run (+ 17) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlbase.h, 3552] (+ 0) @00400000
      ProcessSnapshot!004041c3 ATL::CAtlExeModuleT<CProcessSnapshotModule>::WinMain (+ 48) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlbase.h, 3364] (+ 5) @00400000
      ProcessSnapshot!00434477 wWinMain (+ 5) [*d:\projects\utilities\processsnapshot\release\processsnapshot.inj:5, 14] (+ 0) @00400000
      ProcessSnapshot!00415058 __tmainCRTStartup (+ 274) [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c, 263] (+ 27) @00400000
      !00360033

Continue Reading »

Published by Roman on 19 Nov 2008

Performance

> How do you test performance?

I don’t. I just believe in it.

This is actually what we have here but still we have managed to deliver software that gives more frames per second than rivals. Why? We hopefully knew what we did in first place. According to one of our partner hardware vendors, there are only two software packages which could render multiple megapixel video feeds at the rates cameras can provide, the ours one and another one with the track leading to sources in Eastern Europe…

Published by Roman on 18 Aug 2008

An effect of excessive RGB conversion onto video streaming perofrmance (continued)

This continues the topic raised by previous post. As fairly noticed by The March Hare, video renderer is using hardware overlay and the benchmark is incorrect if we are to extrapolate the performance to scenario with multiple video renderers.

So, an updated test application creates 16 video renderers with 16 threads pumping two meida samples through each of the 16 filter graphs.

The screen shot shows that there is only one video overlay in use (which image was not captured and blackness is shown instead), so results may be inaccurate for one of the graph among 16. In this simple test I disregard this.

Here go the results (in all tests CPU usage is maxed out):

  • YUY2 Source -> VMR: 3,480 fps
  • YUY2 Source -> AVI Decompressor (converts to 24-bit RGB) -> Sample Grabber (without processing) -> Color Space Converter (converts to 32-bit RGB) -> VMR: 560 fps
  • YUY2 Source -> AVI Decompressor (converts to 32-bit RGB) -> Color Space Converter -> VMR: 390 fps

Continue Reading »

Published by Roman on 17 Aug 2008

An effect of excessive RGB conversion onto video streaming perofrmance

Started here: How can I overlay timestamp on the image? on microsoft.public.win32.programmer.directx.video

Let us see if RGB conversion adds any noticeable effect on streaming YUY2 video, typical output of video decompressor.

As a reference I am taking a simple YUY2 source -> Video Mixing Render Filter (VMR) graph, where source filter streams the same pre-allocated and pre-initialized data in an infinite loop:

while(WaitForSingleObject(TerminationEvent, 0) == WAIT_TIMEOUT)
{
	ATLENSURE_SUCCEEDED(m_pSourceFilter->InjectMediaSample(m_pnData, m_nDataSize));
	CRoCriticalSectionLock DataLock(m_DataCriticalSection);
	m_pnInjectedFrameCounts[0]++;
}

Video resolution is 640×480 pixels.

What is actually consuming CPU resources here is data copy into VMR’s media sample buffer and actually streaming. VMR might be blocking control waiting on rendering completion, I am leaving this for default VMR to decide (it might be hardware dependent etc).

Running at full pace, the application is rendering 510 frames per second consuming virtually no CPU. That is VMR is waiting until meida sample is rendered, this only allows streaming mentioned number of media samples per second, however rendering process does not take CPU resource, just waiting for video hardware to complete.

Continue Reading »