Continuous realloc()

A colleague raised a question that realloc does better than free + malloc because allocated memory block is never being actually shrunk and reallocations to smaller size following by reallocations to larger (but still not larger than one of the previous) do not lead to heap locks and actual underlying heap memory block reallocations.

While this is technically possible within the contract declared by the API, it does not seem to be likely that the runtime will stay reluctant to release unused memory. And what is also highly probable, that heap managers implement advanced tricks to decrease impact of heap locks while doing memory allocations. In the same time, realloc must move the payload data in full to the new memory location in case the reallocated block is moved itself. If this is not required and the block is large, there is an unwanted performance impact to take place.

The details of the API operation are likely to be described somewhere, and another related question might be how to do the measurement programmatically and get a hint of what is going on internally.

PSAPI offers GetProcessMemoryInfo function to obtain process memory metrics, and returned PROCESS_MEMORY_COUNTERS_EX::PrivateUsage field is showing private memory in use. malloc allocated memory is eventually mapped onto process private memory, so the API is good for seeing approximate (because of fragmentation, process memory use is always higher than sum of actually allocated block sizes) memory usage.

If we are going to allocate 1 MB blocks, then reallocate to 1 KB, then allocate additional memory, observing the process private memory usage we will be able to see if realloc does release unused memory.

The code is as simple as:

VOID* ppvItemsA[256];
static const SIZE_T g_nSizeA1 = 1 << 20; // 1 MB
_tprintf(_T("Allocating %d MB\n"), (_countof(ppvItemsA) * g_nSizeA1) >> 20);
for(SIZE_T nIndex = 0; nIndex < _countof(ppvItemsA); nIndex++)
    ppvItemsA[nIndex] = malloc(g_nSizeA1);
static const SIZE_T g_nSizeA2 = 4 << 10; // 4 KB
_tprintf(_T("Reallocating to %d MB\n"), (_countof(ppvItemsA) * g_nSizeA2) >> 20);
for(SIZE_T nIndex = 0; nIndex < _countof(ppvItemsA); nIndex++)
    ppvItemsA[nIndex] = realloc(ppvItemsA[nIndex], g_nSizeA2);
VOID* ppvItemsB[256];
static const SIZE_T g_nSizeB1 = 16 << 10; // 16 MB
_tprintf(_T("Allocating %d MB more\n"), (_countof(ppvItemsB) * g_nSizeB1) >> 20);
for(SIZE_T nIndex = 0; nIndex < _countof(ppvItemsB); nIndex++)
    ppvItemsB[nIndex] = malloc(g_nSizeB1);

And the output is:

PrivateUsage: 0 MB
Allocating 256 MB
PrivateUsage: 258 MB
Reallocating to 1 MB
PrivateUsage: 3 MB // <<--- (*)
Allocating 4 MB more
PrivateUsage: 7 MB

Which shows that reallocating to smaller size involves freeing unused space.

Download links:

Hardware assisted memory corruption detection

So you got a memory corruption issue with a piece of software. It comes in a unique scenario along the line of having a huge pile of weird code running well most of the time and then, right out of the blue, a corruption takes place followed by unexpected code execution and unstable software state in general.

The biggest problem with memory corruption is that a fragment of code is modifying a memory block which it does not own, and it has no idea who actually is the owner of the block, while the real owner has no timely way to detect the modification. You only face the consequences being unable to capture the modification moment in first place.

To get back to the original cause, an engineer has to drop into a time machine, turn back time and step back to where the trouble took originally place. As developers are not actually given state-of-the-art time machines, the time turning step is speculative.

CVirtualHeapPtr Class: Memory with Exception-on-Write access mode

At the same time a Windows platform developer is or might be aware of virtual memory API which among other things provides user mode application with capabilities to define memory protection modes. Having this on hands opens unique opportunity to apply read-only protection (PAGE_READONLY) onto a memory block and have exception raised at the very moment of unexpected memory modification, having call stack showing up a source of the problem. I refer to this mode of operation as “hardware assisted” because the access violation exception/condition would be generated purely in hardware without any need to additionally do any address comparison in code.

Needless to say that this way is completely convenient for the developer as he does not need to patch the monstrous application all around in order to compare access addresses against read-only fragment. Instead, a block defined as read-only will be immediately available as such for the whole process almost without any performance overhead.

As ATL provides a set of memory allocator templates (CHeapPtr for heap backed memory blocks, allocated with CCRTAllocator, alternate options include CComHeapPtr with CComAllocator wrapping CoTaskMemAlloc/CoTaskMemFree API), let us make an alternate allocator option that mimic well-known class interface and would facilitate corruption detection.

Because virtual memory allocation unit is a page, and protection mode is defined for the whole page, this would be the allocation granularity. For a single allocated byte we would need to request SYSTEM_INFO::dwPageSize bytes of virtual memory. Unlike normal memory heap manager, we have no way to share pages between allocations as we would be unable to effectively apply protection modes. This would definitely increase application pressure onto virtual memory, but is still acceptable for the sacred task of troubleshooting.

We define a CVirtualAllocator class to be compatible with ATL’s CCRTAllocator, however based on VirtualAlloc/VirtualFree API. The smart pointer class over memory pointer would be defined as follows:

template <typename T>
class CVirtualHeapPtr :
    public CHeapPtr<T, CVirtualAllocator>
// CVirtualHeapPtr
    CVirtualHeapPtr() throw();
    explicit CVirtualHeapPtr(_In_ T* pData) throw();
    VOID SetProtection(DWORD nProtection)
        // TODO: ...

The SetProtection method is to define memory protection for the memory block. Full code for the classes is available on Trac here (lines 9-132):

  • CGlobalVirtualAllocator class is a singleton querying operating system for virtual memory page size, and provides alignment method
  • CVirtualAllocator class is a CCRTAllocator-compatible allocator class
  • CVirtualHeapPtr class is smart template class wrapping a pointer to allocated memory

Use case code will be as follows. “SetProtection(PAGE_READONLY)” enables protection on memory block and turns on exception generation at the moment memory block modification attempt. “SetProtection(PAGE_READWRITE)” would restore normal mode of memory operation.

CVirtualHeapPtr<BYTE> p;
p[1] = 0x01;
// NOTE: Compile with /EHa on order to catch the exception
    p[1] = 0x02;
    // NOTE: We never reach here due to exception
    // NOTE: Catching the access violation for now to be able to continue execution
p[1] = 0x03;

Given the information what data gets corrupt, the pointer allocator provides an efficient opportunity to detect the violation attempt. The only thing remained is to keep memory read-only, and temporarily revert to write access when the “legal” memory modification code is about to be executed.

Continue reading →

DirectShow Spy: Intelligent Connect Trace, Selective Process Black Listing

DirectShow Spy is updated with a few new features:

  • retroactive Intelligent Connect trace
  • log output for IAMGraphBuilderCallback-related activity
  • process name based black list to selectively exclude processes from spying

Intelligent Connect Trace

The utility received a capability to read back from its own log file (DirectShowSpy.log, located typically in C:\ProgramData directory) and reconstruct graph building sequence, including steps taken by DirectShow Intelligent Connect.

In order to activate the Intelligent Connect Trace property sheet, one needs to call exported function “DoGraphBuilderCallbackPropertySheetModal“, such as using runndll32 tool:

C:\DirectShowSpy>rundll32 DirectShowSpy.dll,DoGraphBuilderCallbackPropertySheetModal

The upper part of the property page displays recently created DirectShow fitler graphs, newest to older. For a selected graph, the lower part displays events in chronological order. The events include:

The latter two methods also show “Application Result” column and values, which are HRESULT values returned by IAMGraphBuilderCallback callback provided by the application. Typically, a failure HRESULT code indicates that the application rejected the filter.

The trace log is good to expose all DirectShow junk installed in the system. For example,

Continue reading →

DirectShow Spy: Memory Allocator Properties

A small update to the DirectShow Spy today: DirectShow Filter Graph Spy prints memory allocator properties as a part of graph topology trace on transition to running state. Why is that and what it is for? Filters normally agree on allocator properties (ALLOCATOR_PROPERTIES, obtained from IMemAllocator, obtained from IMemInputPin) themselves without interference from controlling application. Sometimes an undesired choice of buffers can cause sick runtime behavior, including but not limited to the following:

  1. live audio capture buffers are too long, and introduce significant latency, including from live video capture stream taking place in parallel; controlling application might need to take advantage of IAMBufferNegotiation::SuggestAllocatorProperties and request shorter buffers.
  2. a filter, such as for example DMO Wrapper Filter, may default to 1 buffer on allocator, which means that if a buffer reference is for some reason held for future reference (e.g. a video filter might be holding a reference to be able to re-push a video sample if an application is requesting video update), the entire streaming might lock.
  3. some filters are requesting unreasonably many/large buffers and consume vast spaces of RAM, such as MainConcept MXF Demultiplexer requesting 200 buffers x 640 kilobytes each (128 MB in total out of sudden)
  4. some filters are requesting unreasonably few/small buffers resulting in inability to pre-buffer data

In a chase for answers to questions “Where is my memory?”, “Why is it so choppy?”, “I would really appreciate a nice lipsync” and to troubleshoot the mentioned scenarios it is helpful to understand buffering configuration. DirectShow Filter Spy is here to deliver this information. Once the graph is put into running state, spy prints out topology data into log file (which is in most cases C:\ProgramData\DirectShowSpy.log):

Pin 2: Name "Input 01", Direction "Input", Peer "Tee 0x087A5AF0.Output2"
 Connection media type:
 majortype {73646976-0000-0010-8000-00AA00389B71}, subtype {31435641-0000-0010-8000-00AA00389B71}, pUnk 0x00000000
 bFixedSizeSamples 0, bTemporalCompression 0, lSampleSize 1
 formattype {E06D80E3-DB46-11CF-B4D1-00805F6CBBEA}, cbFormat 170, pbFormat 0x07c46fc0
 rcSource { 0, 0, 0, 0 ), rcTarget { 0, 0, 0, 0 }
 dwBitRate 0, dwBitErrorRate 0, AvgTimePerFrame 0
 dwInterlaceFlags 0x0, dwCopyProtectFlags 0x0, dwPictAspectRatioX 16, dwPictAspectRatioY 9, dwControlFlags 0x0
 bmiHeader.biSize 40, bmiHeader.biWidth 1280, bmiHeader.biHeight 720, bmiHeader.biPlanes 1, bmiHeader.biBitCount 24, bmiHeader.biCompression avc1
 bmiHeader.biSizeImage 0, bmiHeader.biXPelsPerMeter 1, bmiHeader.biYPelsPerMeter 1, bmiHeader.biClrUsed 0, bmiHeader.biClrImportant 0
 dwStartTimeCode 0x00000000, cbSequenceHeader 38, dwProfile 100, dwLevel 31, dwFlags 0x4
 [0x0000] 00 1D 67 64 00 1F AC 24 88 05 00 5B BF F0 00 10
 [0x0010] 00 11 00 00 03 03 E8 00 00 E9 BA 0F 18 32 A0 00
 [0x0020] 05 68 EE 32 C8 B0
 Memory Allocator: 1 buffers, 1,024 bytes each (1,024 bytes total), align 1, prefix 0

Partial Visual C++ .NET 2008 source code is available from SVN, release binary included (Win32, x64); installation instructions are in another post.

Sharing Memory Allocators while at the same time Handling Dynamic Media Type Changes

Sharing memory allocators between input and output pins is an important concept to keep performance of filter graph. Unlike more frequent scenario with different allocators, a filter (referred to as “middle filter” below) which has equal media types on input and output pins has an advantage to avoid memory-to-memory copy operation for every frame processed, by delivering downstream the buffer obtained from an upstream filter. With a high resolution video, at high rate, multiple streams running simultaneously this is the expense one would try to avoid for performance reasons.

Memory allocators are (or can be) shared by well known filters, such as Sample Grabber Filter, Infinite Tee Pin Filter and in-place transformation base filters (CTransInPlaceFilter Class).

Still handling Dynamic Format Changes (not only from video renderer filter) filters that share memory allocators may run into the problem of being notified of media type change. Because allocator are typically owned by another filter (e.g. Video Mixing Renderer Filter) and originally its buffer is queried by an upstream filter, the upstream filter obtains allocated buffer independently from the middle filter that shares memory allocators. If the upstream filter decides to never deliver this buffer, however the buffer has a media type attached (see AM_SAMPLE2_PROPERTIES::pMediaType), there is no way for the middle filter to learn about dynamic format change completed.

As a workaround for handling Format Changes from the Video Renderer, when resolution is not changed and it is only stride which might be extended, middle filter might be checking data size in lActual field and learn about the change from an increase in this value.

To be reliably notified on media type change the middle filter is to take extra measures while sharing the allocator. Instead using raw allocator obtained from one pin on another pin (typically output pin’s allocator to be used on an input pin), middle filter may be using an internal proxy object, which implements IMemAllocator interface and forward calls to internal IMemAllocator, obtained originally. Additionally to that, the proxy can check for attached media types on every buffer taken from the allocator, and once the change is noticed – at the moment upstream filter is requesting the buffer – the proxy has a timely chance to remember the new media type so that in the following IMemInputPin::Receive call this media type can be checked for the case upstream buffer decided to not deliver the buffer with attached media type.

    // ...
    ATLASSERT((InputMediaSampleProperties.pMediaType != NULL) ^ !(InputMediaSampleProperties.dwSampleFlags & AM_SAMPLE_TYPECHANGED));
        CRoCriticalSectionLock DataLock(GetDataCriticalSection());
        const CObjectPtr<CProxyMemAllocator>& pInputProxyMemAllocator = m_pInputPin->GetProxyMemAllocatorReference();
        CMediaType pMediaType;
        if(pInputProxyMemAllocator && pInputProxyMemAllocator->GetDynamicallyChangedMediaType(pMediaType, TRUE))
            // ...
        // ...
    DeliverMediaSample(pMemInputPin, pInputMediaSample);

Ahead Nero’s NeResize DirectShow Filter

Another example of a negligence with a cost of incompatibility and enormous amount of support time. Ahead Nero installs a number of DirectShow filters into $(Program Files)\Common Files\Ahead\DSFilter directory. One of the files is and it hosts a Nero Resize filter. Let us take a closer look:

CLSID: {30002E0C-C574-481E-A5DE-90AE54A79E10} (note that Nero 8 ships the same buggy stuff with new CLSID of {3D0A27C9-B4D6-487B-AFE4-E3CABD4B81F9} – 11.05.2010)
Merit: 0x00400000 (MERIT_UNLIKELY)
Input Pin’s Media Type: major type GUID_NULL, subtype GUID_NULL
Output Pin’s Media Type: major type GUID_NULL, subtype GUID_NULL

The filter is clearly a video filter:

Ahead Nero Resize Filter's Property Page

So the filter register itself under a merit that allows taking it during Intelligent Connect, it registers using media type wildcard which is clearly widely than the filter can affectively operate with and the most interesting part is: with certain video media types the filter crashes (memory access violation) during pin connection negotiation process. That is, inaccurate filter may be crashing third party software it has nothing to deal with at all.

*** Unhandled Exception
Process: 0x000001d4, Thread: 0x00000ce4, Date: 6/29/2009, Time: 11:20:56 AM, Application: C:\Program Files\...
Module: C:\..., Product Version:, File Version:, File Time: 23.06.2009, 19:02
Code: 0xc0000005, Flags: 0x00000000, Address: 0x05fc6c65
Parameters: 0x00000001, 0x15be9030

** Call Stack
NeResize!05fc6c65 DllUnregisterServer +21909 @05fc0000
NeResize!05fc7888 DllUnregisterServer +25016 @05fc0000
NeResize!05fc7204 DllUnregisterServer +23348 @05fc0000

Additionally to that the filter does not allow its insertion in debugging environment, and it seems even with Visual Studio running without a debugging session active. Which means that developer may be unaware of issues until incompatibility comes up at a later stage such as testing, or at production site.

It is not the first Nero filter which is bringing real problems. Basically any user who want to keep his system far from issues while still having Nero installed, needs to do find $(Program Files)\Common Files\Ahead\DSFilter directory and immediately rename it to some ~DSFilter in order to invalidate all Nero filters registration.

A few quotes from Guidelines for Registering Filters:

Avoid specifying MEDIATYPE_None, MEDIASUBTYPE_None, or GUID_NULL in the AMOVIESETUP_MEDIATYPE information for a pin. IFilterMapper2 treats these as wildcards, which can slow the graph-building process.

Nero Resize does specify and obviously slows the system down.

Choose the lowest merit value possible. Here are some guidelines:

Special purpose filter; any filter that is created directly by the application: MERIT_DO_NOT_USE

Nero Resize uses higher value and thus affects proper applications.

Software developers will be safer to prevent from DirectShow Filter Graph Manager considering the buggy filter to be used during Intelligent Connect by implementing IAMGraphBuilderCallback interface.