Incorrect breaking #import behavior in recent (e.g. 16.6.2) MSVC versions

Yesterday’s bug is not the only “news”. Some time ago I already noticed weird broken behavior when rebuilding DirectShowSpy.dll with the current version of Visual Studio and MSVC.

Now the problem is getting clearer.

Here is some interface:

[
    object,
    uuid(6CE45967-F228-4F7B-8B93-83DC599618CA),
    //dual,
    //oleautomation,
    nonextensible,
    pointer_default(unique)
]
interface IMuxFilter : IUnknown
{
    HRESULT IsTemporaryIndexFileEnabled();
    HRESULT SetTemporaryIndexFileEnabled([in] BOOL bTemporaryIndexFileEnabled);
    HRESULT GetAlignTrackStartTimeDisabled();
    HRESULT SetAlignTrackStartTimeDisabled([in] BOOL bAlignTrackStartTimeDisabled);
    HRESULT GetMinimalMovieDuration([out] LONGLONG* pnMinimalMovieDuration);
    HRESULT SetMinimalMovieDuration([in] LONGLONG nMinimalMovieDuration);
};

Compiled into a type library, it looks okay. The Windows SDK 10.0.18362 COM/OLE Object Viewer shows the correct definition obtained from the type library:

[
    odl,
    uuid(6CE45967-F228-4F7B-8B93-83DC599618CA),
    nonextensible
]
interface IMuxFilter : IUnknown {
    HRESULT _stdcall IsTemporaryIndexFileEnabled();
    HRESULT _stdcall SetTemporaryIndexFileEnabled([in] long bTemporaryIndexFileEnabled);
    HRESULT _stdcall GetAlignTrackStartTimeDisabled();
    HRESULT _stdcall SetAlignTrackStartTimeDisabled([in] long bAlignTrackStartTimeDisabled);
    HRESULT _stdcall GetMinimalMovieDuration([out] int64* pnMinimalMovieDuration);
    HRESULT _stdcall SetMinimalMovieDuration([in] int64 nMinimalMovieDuration);
};

Now here is what MSVC #import turns it into in Win32 (32-bit) code:

struct __declspec(uuid("6ce45967-f228-4f7b-8b93-83dc599618ca"))
IMuxFilter : IUnknown
{
    //
    // Raw methods provided by interface
    //

      virtual HRESULT __stdcall IsTemporaryIndexFileEnabled ( ) = 0;
    virtual HRESULT _VtblGapPlaceholder1( ) { return E_NOTIMPL; }
      virtual HRESULT __stdcall SetTemporaryIndexFileEnabled (
        /*[in]*/ long bTemporaryIndexFileEnabled ) = 0;
    virtual HRESULT _VtblGapPlaceholder2( ) { return E_NOTIMPL; }
      virtual HRESULT __stdcall GetAlignTrackStartTimeDisabled ( ) = 0;
    virtual HRESULT _VtblGapPlaceholder3( ) { return E_NOTIMPL; }
      virtual HRESULT __stdcall SetAlignTrackStartTimeDisabled (
        /*[in]*/ long bAlignTrackStartTimeDisabled ) = 0;
    virtual HRESULT _VtblGapPlaceholder4( ) { return E_NOTIMPL; }
      virtual HRESULT __stdcall GetMinimalMovieDuration (
        /*[out]*/ __int64 * pnMinimalMovieDuration ) = 0;
    virtual HRESULT _VtblGapPlaceholder5( ) { return E_NOTIMPL; }
      virtual HRESULT __stdcall SetMinimalMovieDuration (
        /*[in]*/ __int64 nMinimalMovieDuration ) = 0;
    virtual HRESULT _VtblGapPlaceholder6( ) { return E_NOTIMPL; }
};

WTF _VtblGapPlaceholder1??? That was uncalled for!

It looks like some 32/64 bullshit added by MSVC at some point (a cross-compilation issue?) for no good reason. A sort of gentle reminder that one should get rid of #import in C++ code.

Please have it fixed; 32-bit code is still very much in use.

#import of Microsoft’s own quartz.dll, for example, has the same invalid gap insertion:

struct __declspec(uuid("56a868bc-0ad4-11ce-b03a-0020af0ba770"))
IMediaTypeInfo : IDispatch
{
    //
    // Raw methods provided by interface
    //

      virtual HRESULT __stdcall get_Type (
        /*[out,retval]*/ BSTR * strType ) = 0;
    virtual HRESULT _VtblGapPlaceholder1( ) { return E_NOTIMPL; }
      virtual HRESULT __stdcall get_Subtype (
        /*[out,retval]*/ BSTR * strType ) = 0;
    virtual HRESULT _VtblGapPlaceholder2( ) { return E_NOTIMPL; }
};
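
Until this is fixed, one way to sidestep the problem is to avoid #import for such interfaces and declare them by hand (or take the MIDL-generated header), keeping the vtable layout exactly as in the IDL. A hand-written sketch for the IMuxFilter definition above, not compiler output:

#include <windows.h>

struct __declspec(uuid("6CE45967-F228-4F7B-8B93-83DC599618CA")) __declspec(novtable)
IMuxFilter : IUnknown
{
    // Same vtable order and calling convention as in the IDL, with no gap placeholders
    virtual HRESULT STDMETHODCALLTYPE IsTemporaryIndexFileEnabled() = 0;
    virtual HRESULT STDMETHODCALLTYPE SetTemporaryIndexFileEnabled(BOOL bTemporaryIndexFileEnabled) = 0;
    virtual HRESULT STDMETHODCALLTYPE GetAlignTrackStartTimeDisabled() = 0;
    virtual HRESULT STDMETHODCALLTYPE SetAlignTrackStartTimeDisabled(BOOL bAlignTrackStartTimeDisabled) = 0;
    virtual HRESULT STDMETHODCALLTYPE GetMinimalMovieDuration(LONGLONG* pnMinimalMovieDuration) = 0;
    virtual HRESULT STDMETHODCALLTYPE SetMinimalMovieDuration(LONGLONG nMinimalMovieDuration) = 0;
};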

Something got broken around version 16.6.1 of the Visual C++ compiler

An ancient piece of code started giving trouble:

template <typename T>
class ATL_NO_VTABLE CMediaControlT :
    ...
{
    ...
    STDMETHOD(Run)()
    {
        ...
        T* MSVC_FIX_VOLATILE pT = static_cast<T*>(this); // <<--------------------------------
        CRoCriticalSectionLock GraphLock(pT->m_GraphCriticalSection);
        pT->FilterGraphNeeded();
        __D(pT->GetMediaControl(), E_NOINTERFACE);
        pT->PrepareCue();
        pT->DoRun();
        __if_exists(T::Fire_Running)
        {
            pT->Fire_Running();
        }
    ...
}

When MSVC_FIX_VOLATILE expands to nothing, it appears that the optimizing compiler forgets pT and just uses some variation of an adjusted this, which makes sense overall because a static cast between the two can be resolved at compile time.

However, the problem is that the resulting value of this is wrong, and we end up in an undefined behavior scenario.

If I make MSVC_FIX_VOLATILE expand to volatile, making the variable pT somewhat “heavier”, the optimizing compiler forgets this and uses pT directly, and everything works as expected.

The problem still exists in the current 16.6.2.
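
For reference, here is a sketch of how MSVC_FIX_VOLATILE could be defined; the _MSC_VER gating below is an assumption for illustration (16.6 corresponds to _MSC_VER 1926), not necessarily the exact condition used in the real code:

// Make pT "heavier" only for the compiler versions where the optimizer misbehaves;
// everywhere else the macro expands to nothing and the code stays as it was
#if defined(_MSC_VER) && _MSC_VER >= 1926 // Visual Studio 2019 16.6 and later
    #define MSVC_FIX_VOLATILE volatile
#else
    #define MSVC_FIX_VOLATILE
#endif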

Intel Media SDK H.264 encoder buffer and target bitrate management

I might have mentioned that Intel Media SDK has a ridiculously eclectic design and is a software package for the brave. Something to stay well clear of for as long as you possibly can.

To be on par on the customer support side, Intel did something that got my Intel Developer Zone account blocked. Over time I made a few attempts to restore the account, and only once, out of the blue, did someone follow up with a surprising response: “You do not have the enterprise login account”. That’s unbelievable: I was able to register the account, I could post like this, I can still request password reset links and receive them, but the problem is that I don’t have an “enterprise account”.

Back to Intel Media SDK, where things are designed to work about as obviously and reliably as their forums. A bit of code from a very basic tutorial:

    //5. Initialize the Media SDK encoder
    sts = mfxENC.Init(&mfxEncParams);
    MSDK_IGNORE_MFX_STS(sts, MFX_WRN_PARTIAL_ACCELERATION);
    MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);

    // Retrieve video parameters selected by encoder.
    // - BufferSizeInKB parameter is required to set bit stream buffer size
    mfxVideoParam par;
    memset(&par, 0, sizeof(par));
    sts = mfxENC.GetVideoParam(&par);
    MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);

    //6. Prepare Media SDK bit stream buffer
    mfxBitstream mfxBS;
    memset(&mfxBS, 0, sizeof(mfxBS));
    mfxBS.MaxLength = par.mfx.BufferSizeInKB * 1000;
    mfxBS.Data = new mfxU8[mfxBS.MaxLength];
    MSDK_CHECK_POINTER(mfxBS.Data, MFX_ERR_MEMORY_ALLOC);

Setting aside the question of which genius decided to measure buffers in kilobytes, the snippet makes great sense. You ask the SDK for the required buffer size and then you provide the space. I myself am even more generous than that: I grant 1024 bytes for every kilobyte in question.

The thing is that the hardware encoder still hits scenarios where it is unable to fit the data into space sized the mentioned way. What happens when the encoder has more data on its hands? Maybe it emits a warning, “Well, I just screwed things up, be aware”? A buffer overflow error? A buffer reallocation request? Oh no, the SDK makes it smarter: it fills the buffer completely, trimming the excess, making the bitstream non-compliant and triggering frame corruption later on the decoder end. Then the encoder continues as if nothing important had happened.
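
Since the encoder itself never reports the condition, the best I can suggest is a defensive check of my own (a sketch continuing the tutorial snippet above and reusing its par variable; the heuristic is mine, not an official Media SDK recipe):

    //6. Prepare Media SDK bit stream buffer, with generous headroom over BufferSizeInKB
    mfxBitstream mfxBS;
    memset(&mfxBS, 0, sizeof(mfxBS));
    mfxBS.MaxLength = par.mfx.BufferSizeInKB * 1024 * 2; // 1024 bytes per kilobyte, then doubled
    mfxBS.Data = new mfxU8[mfxBS.MaxLength];
    MSDK_CHECK_POINTER(mfxBS.Data, MFX_ERR_MEMORY_ALLOC);

    // ... later, after EncodeFrameAsync and SyncOperation for a frame:
    if(mfxBS.DataOffset + mfxBS.DataLength >= mfxBS.MaxLength)
    {
        // Suspicious: the output fills the buffer to the brim, so the bitstream might
        // have been silently trimmed and is likely no longer compliant
    }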

There is an absolute rule in software technology: if a thing is designed in a way that lets it break in a certain aspect, sooner or later a consumer will be hit by the flaw. Maybe just this once the Intel guys thought it would not be the case.

Heterogeneous Media Foundation pipelines

Just a small test here to feature the use of multiple GPUs within a single Media Foundation pipeline. The initial idea is pretty simple: quite a few systems are equipped with multiple GPUs; some have a “free” onboard Intel GPU idling in the presence of a regular video card, and others have an integrated “iGPU” and a discrete “dGPU” seamlessly blended by DXGI.

The Media Foundation API does not bring any specific feature set for leveraging multiple GPUs at a time, but it is surely possible to take advantage of them.

The application creates 20 second long video clips by combining GPUs: one GPU is used for video rendering and another hosts hardware H.264 video encoding. No system memory is used for uncompressed video; system memory first jumps in to receive the encoded H.264 bitstream. The Media Foundation pipeline hence is:

  • Media Source generating video frames off its video stream using the first GPU
  • Transform to combine multiple GPUs
  • H.264 video encoder transform specific to second GPU
  • Stock MP4 Media Sink

The pipeline runs in a single media session pretty much like a normal pipeline. Media Foundation is designed so that primitives do not have to be aligned in their GPU usage with the pipeline. Surely they have to share devices and textures so that they can all operate together, but the pipeline itself does not impose many limitations there.
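
The per-GPU setup boils down to giving each adapter its own D3D11 device wrapped into its own Media Foundation DXGI device manager, which is then handed to the primitives running on that GPU. A rough sketch of that step (CreateDeviceManagerForAdapter is a name made up for illustration, not code from the application):

#include <dxgi1_2.h>
#include <d3d10.h> // ID3D10Multithread
#include <d3d11.h>
#include <mfapi.h>
#include <atlbase.h>

// Creates a D3D11 device on the given adapter and wraps it into a DXGI device manager
// suitable for sharing with Media Foundation primitives
HRESULT CreateDeviceManagerForAdapter(IDXGIAdapter1* pAdapter, IMFDXGIDeviceManager** ppDeviceManager)
{
    CComPtr<ID3D11Device> pDevice;
    HRESULT nResult = D3D11CreateDevice(pAdapter, D3D_DRIVER_TYPE_UNKNOWN, nullptr,
        D3D11_CREATE_DEVICE_VIDEO_SUPPORT, nullptr, 0, D3D11_SDK_VERSION, &pDevice, nullptr, nullptr);
    if(FAILED(nResult))
        return nResult;
    CComQIPtr<ID3D10Multithread> pMultithread = pDevice;
    if(pMultithread)
        pMultithread->SetMultithreadProtected(TRUE); // needed when the device is shared across MF threads
    UINT nResetToken;
    CComPtr<IMFDXGIDeviceManager> pDeviceManager;
    nResult = MFCreateDXGIDeviceManager(&nResetToken, &pDeviceManager);
    if(FAILED(nResult))
        return nResult;
    nResult = pDeviceManager->ResetDevice(pDevice, nResetToken);
    if(FAILED(nResult))
        return nResult;
    *ppDeviceManager = pDeviceManager.Detach();
    return S_OK;
}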

Microsoft Windows [Version 10.0.18363.815]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Temp>HeterogeneousRecordFile.exe
HeterogeneousRecordFile.exe 20200502.1-11-gd3d16d5 (Release)
d3d16d51e7f2a098c5765d445714f14051c7a68d
HEAD -> master, origin/master
2020-05-09 23:51:54 +0300
--
Found 2 DXGI adapters
Trying heterogeneous configurations…
Using NVIDIA GeForce GTX 1650 to render video and Intel(R) HD Graphics 4600 to encode the content
Using filename HeterogeneousRecordFile-20200509-235406.mp4 for recording
Using Intel(R) HD Graphics 4600 to render video and NVIDIA GeForce GTX 1650 to encode the content
Using filename HeterogeneousRecordFile-20200509-235411.mp4 for recording
Trying trivial configuration with loopback data transfer…
Using NVIDIA GeForce GTX 1650 to render video and NVIDIA GeForce GTX 1650 to encode the content
Using filename HeterogeneousRecordFile-20200509-235416.mp4 for recording
Using Intel(R) HD Graphics 4600 to render video and Intel(R) HD Graphics 4600 to encode the content
Using filename HeterogeneousRecordFile-20200509-235419.mp4 for recording

This is just a simple use case; I believe there can be others: GPUs are pretty powerful for certain specific tasks, and they are also equipped with video-specific ASICs.

Download links

Binaries:

A readable version of HelloDirectML sample

So it came to my attention that there is a new API in the DirectX family: Direct Machine Learning (DirectML).

Direct Machine Learning (DirectML) is a low-level API for machine learning. It has a familiar (native C++, nano-COM) programming interface and workflow in the style of DirectX 12. You can integrate machine learning inferencing workloads into your game, engine, middleware, backend, or other application. DirectML is supported by all DirectX 12-compatible hardware.

You might want to check out this introduction video if you are interested:

I updated the HelloDirectML code and restructured it to be readable and easy to comprehend. In my variant I have two operators, addition and multiplication, following one another with a UAV resource barrier in between. The code does (1.5 * 2) ^ 2 math in tensor space.
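
For illustration, the UAV barrier between the two operator dispatches boils down to something like this (a sketch with placeholder names, not the identifiers used in the fork); it makes sure the second operator reads the fully written output of the first one:

#include <d3d12.h>

void InsertUavBarrier(ID3D12GraphicsCommandList* pCommandList, ID3D12Resource* pIntermediateTensorBuffer)
{
    D3D12_RESOURCE_BARRIER Barrier { };
    Barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
    Barrier.UAV.pResource = pIntermediateTensorBuffer; // output of the first operator, input of the second
    pCommandList->ResourceBarrier(1, &Barrier);
}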

Here is my fork with the updated HelloDirectML, with the top-level tensor math code in less than 150 lines, starting here. If you are a fan of spaghetti style (according to Wikipedia, it appears what I prefer is referred to as “ravioli code”), the original sample is there.

Hardware video encoding latency with NVIDIA GeForce GTX 1080 Ti

To complete the set of posts [1, 2, 3] on hardware video encoding at the lowest latency settings, I am sharing the juiciest part and the application for NVIDIA NVENC. I did not have a 20 series card at hand to run the measurements, and I hope the table below for the GeForce GTX 1080 Ti is eloquent enough.

It is sort of awkward to put the GTX 1080 Ti numbers (those are latencies in milliseconds for every video frame sent to encoding) side by side with those of AMD products, at least the ones I had a chance to check out, so here we go with GeForce GTX 1080 Ti vs. GeForce GTX 1650:

Well, that’s fast, and the GeForce 10 series was released back in 2016.

The numbers show that NVIDIA cards are powerful enough for game experience remoting (what you use Rainway for) in a wide range of video modes, including high frame rates of 144 and up.

I also added 640×360@260 just because I have a real camera (an inexpensive one, with a USB 2.0 connection) operating in this mode with high frame rate capture: the numbers suggest that it is generally possible to remote a high frame rate video signal at a blink-of-an-eye speed.

There might be many aspects to compare when choosing between AMD and NVIDIA products, but when it comes to video streaming, low latency video compression and hardware-assisted video compression in general, the situation is pretty clear: just grab an NVIDIA thing and do not do what I did when I put an AMD Radeon RX 570 Series video card into my primary development system. I thought that maybe at the time AMD had something cool.

So, here goes the app for NVIDIA hardware.

Download links

Binaries:

  • 64-bit: NvcEncode.exe (in .ZIP archive)
  • License: This software is free to use

AMD Radeon RX 570 Series video encoders compared to a couple of NVIDIA pieces of hardware

In continuation of the previous AMD AMF encoder latency at faster settings posts, here is a side-by-side comparison with NVIDIA encoders.

The numbers show how different the encoders are even though they are doing something similar. The NVIDIA cards are not high end: the GTX 1650 is literally the cheapest item in the Turing 16xx series, and the GeForce 700 series was released in 2013 (OK, the GTX 750 was midrange at the time).

The numbers are milliseconds of encoding latency per video frame.

In 2013 an NVIDIA card was already capable of real-time NVENC hardware encoding of 4K video content at 60 frames per second, and four years later the RX 570 was released with a significantly less powerful encoder.

The GTX 1650 encoder is much slower than the one in the GTX 1080 Ti (application and data to come later), but it is still powerful enough to cover a wide range of video streams, including quite computation- and bandwidth-intensive ones.

The GTX 750 vs. GTX 1650 comparison also shows that even though the two belong to very different microarchitectures (Kepler and Turing respectively, with Maxwell and Pascal in between), it is not the case that the much newer one is superior to the older in every way. When it comes to real-time performance, the vendors design their hardware to be just good enough.