On efficiency of hardware-assisted JPEG decoding (AMD MFT MJPEG Decoder)

The previous post focused on problems with the hardware MFT decoder provided as part of the video driver package. This time I am going to share some data on how the inefficiency affects video capture performance, using a high frame rate 260 FPS camera as a test stand. The effect is more visible at high frame rates because CPU and GPU hardware is already fast enough to process less demanding signals without the problem showing up.

There is already some interest on AMD's end (why this is exceptional in itself deserves a separate post), and some bug fixes are already under way.

The performance problem is not obvious because the decoder overall performs without fatal issues and provides the expected output: no failures, no error codes, no deadlocks, and neither the CPU nor the GPU engine is maxed out, so things look more or less fine at first glance… The test application uses Media Foundation and the Source Reader API to read textures in hardware MFT enabled mode, then discards the textures and just prints out the frame rate.
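The measuring part of such a test application can be sketched in portable C++. This is a minimal sketch and not the code of the actual test application: the FrameRateCounter class and its reporting window are assumptions made for illustration, and the real application would call AddSample from its Source Reader sample callback.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not taken from the test application): counts captured
// samples and reports the average rate each time a reporting window elapses.
class FrameRateCounter
{
public:
	using Clock = std::chrono::steady_clock;

	explicit FrameRateCounter(std::chrono::milliseconds Window) :
		m_Window(Window)
	{
		assert(Window.count() > 0);
	}
	// Registers one sample; returns true and fills Rate (samples per second)
	// when the current window has elapsed, false otherwise.
	bool AddSample(Clock::time_point Time, double& Rate)
	{
		if(!m_Started)
		{
			// The first sample only opens the window
			m_Started = true;
			m_WindowStart = Time;
			return false;
		}
		m_Count++;
		const auto Elapsed = Time - m_WindowStart;
		if(Elapsed < m_Window)
			return false;
		Rate = m_Count / std::chrono::duration<double>(Elapsed).count();
		// The current sample opens the next window
		m_WindowStart = Time;
		m_Count = 0;
		return true;
	}

private:
	std::chrono::milliseconds m_Window;
	Clock::time_point m_WindowStart;
	bool m_Started = false;
	unsigned int m_Count = 0;
};
```

The half-sample granularity of the printed values (134.000, 134.500 and so on) would be consistent with a two-second window, but that is a guess from the output and not something the application states.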

AMD MFT MJPEG Decoder

C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
 Using camera HD USB Camera
 Using adapter Radeon RX 570 Series
 Using video capture format 640x360@260.004 MFVideoFormat_MJPG
 Using hardware decoder MFT AMD MFT MJPEG Decoder
 Using video frame format 640x384@260.004 MFVideoFormat_YUY2
 72.500 video samples per second captured
 134.000 video samples per second captured
 135.000 video samples per second captured
 134.500 video samples per second captured
 135.500 video samples per second captured
 134.000 video samples per second captured
 134.000 video samples per second captured
 135.000 video samples per second captured
 134.500 video samples per second captured
 133.500 video samples per second captured
 134.000 video samples per second captured

With no sign of hitting a bottleneck, the reader process produces only ~134 FPS from the video capture device.

Alax.Info MJPG Video Decoder for AMD Hardware

My replacement for the hardware decoder MFT decodes the same signal and, generally, shares a lot with AMD’s own decoder: both MFTs are built on top of the Advanced Media Framework (AMF) SDK. The driver package installs the runtime for this SDK and also installs a decoder MFT which is linked against a copy of the runtime (according to an AMD representative, the statically linked copy shares the same codebase).

C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
 Using camera HD USB Camera
 Using adapter Radeon RX 570 Series
 Using video capture format 640x360@260.004 MFVideoFormat_MJPG
 Using substitute decoder Alax.Info MJPG Video Decoder for AMD Hardware
 Using video frame format 640x360@260.004 MFVideoFormat_YUY2
 74.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 260.500 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 260.500 video samples per second captured

Similar CPU and GPU utilization levels, with a higher frame rate. Actually, with the expected frame rate, because it is the rate the camera is supposed to operate at.


Hardware accelerated JPEG video decoder MFT from AMD

GPU vendors (AMD, Intel, NVIDIA) ship their hardware with drivers, which in turn provide a hardware-assisted decoder for JPEG video (also known as MJPG, MJPEG and Motion JPEG) in the form of a Media Foundation Transform (MFT).

JPEG is not included in the DirectX Video Acceleration (DXVA) 2.0 specification; however, the hardware does carry an implementation of the decoder. A separate additional MFT is a natural way to provide OS integration.

AMD’s decoder is named “AMD MFT MJPEG Decoder” and looks weird from the start. It is marked as MFT_ENUM_FLAG_HARDWARE, which is good, but this normally implies that the MFT is also marked as MFT_ENUM_FLAG_ASYNCMFT, and the MFT lacks this flag. Another AMD decoder MFT, “AMD D3D11 Hardware MFT Playback Decoder”, has the same problem though.

Hardware MFTs must use the new asynchronous processing model…

Presumably the MFT behaves as a normal asynchronous MFT; however, as long as the missing flag has no side effects with Microsoft’s software, AMD does not care about the confusion it causes for others.
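The inconsistency is easy to express as a check on the registration flags. The following is a minimal sketch under stated assumptions: the constants are redeclared under hypothetical names with the values of the corresponding MFT_ENUM_FLAG_* members from mfapi.h so that the snippet builds without the Windows SDK, and the function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Values mirror MFT_ENUM_FLAG_SYNCMFT, MFT_ENUM_FLAG_ASYNCMFT and
// MFT_ENUM_FLAG_HARDWARE from mfapi.h; the names here are hypothetical
constexpr std::uint32_t EnumFlagSyncMft = 0x00000001;
constexpr std::uint32_t EnumFlagAsyncMft = 0x00000002;
constexpr std::uint32_t EnumFlagHardware = 0x00000004;

// Returns false for the combination discussed above: an MFT registered as
// hardware but not as asynchronous
constexpr bool IsHardwareRegistrationConsistent(std::uint32_t Flags)
{
	return (Flags & EnumFlagHardware) == 0 || (Flags & EnumFlagAsyncMft) != 0;
}

static_assert(!IsHardwareRegistrationConsistent(EnumFlagHardware), "hardware without async is flagged");
static_assert(IsHardwareRegistrationConsistent(EnumFlagHardware | EnumFlagAsyncMft), "hardware with async is fine");
```

In a real application the flags would come from the MFT registration or enumeration results rather than from literals.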

Furthermore, the registration information for this decoder suggests that it can decode into the MFVideoFormat_NV12 video format, and sadly this is again an inaccurate promise. Despite the claim, the capability is missing, and Microsoft’s Video Processor MFT jumps in as needed to perform the format conversion.

These were just minor things, more or less easy to tolerate. However, a rule of thumb is that the Media Foundation glue layer provided by technology partners such as GPU vendors only satisfies minimal certification requirements, and beyond that it causes suffering and pain to anyone who wants to use it in real-world scenarios.

AMD’s take on making developers feel miserable is the way hardware-assisted JPEG decoding actually takes place.

The thread 0xc880 has exited with code 0 (0x0).
The thread 0x593c has exited with code 0 (0x0).
The thread 0xa10 has exited with code 0 (0x0).
The thread 0x92c4 has exited with code 0 (0x0).
The thread 0x9c14 has exited with code 0 (0x0).
The thread 0xa094 has exited with code 0 (0x0).
The thread 0x609c has exited with code 0 (0x0).
The thread 0x47f8 has exited with code 0 (0x0).
The thread 0xe1ec has exited with code 0 (0x0).
The thread 0x6cd4 has exited with code 0 (0x0).
The thread 0x21f4 has exited with code 0 (0x0).
The thread 0xd8f8 has exited with code 0 (0x0).
The thread 0xf80 has exited with code 0 (0x0).
The thread 0x8a90 has exited with code 0 (0x0).
The thread 0x103a4 has exited with code 0 (0x0).
The thread 0xa16c has exited with code 0 (0x0).
The thread 0x6754 has exited with code 0 (0x0).
The thread 0x9054 has exited with code 0 (0x0).
The thread 0x9fe4 has exited with code 0 (0x0).
The thread 0x12360 has exited with code 0 (0x0).
The thread 0x31f8 has exited with code 0 (0x0).
The thread 0x3214 has exited with code 0 (0x0).
The thread 0x7968 has exited with code 0 (0x0).
The thread 0xbe84 has exited with code 0 (0x0).
The thread 0x11720 has exited with code 0 (0x0).
The thread 0xde10 has exited with code 0 (0x0).
The thread 0x5848 has exited with code 0 (0x0).
The thread 0x107fc has exited with code 0 (0x0).
The thread 0x6e04 has exited with code 0 (0x0).
The thread 0x6e90 has exited with code 0 (0x0).
The thread 0x2b18 has exited with code 0 (0x0).
The thread 0xa8c0 has exited with code 0 (0x0).
The thread 0xbd08 has exited with code 0 (0x0).
The thread 0x1262c has exited with code 0 (0x0).
The thread 0x12140 has exited with code 0 (0x0).
The thread 0x8044 has exited with code 0 (0x0).
The thread 0x6208 has exited with code 0 (0x0).
The thread 0x83f8 has exited with code 0 (0x0).
The thread 0x10734 has exited with code 0 (0x0).

For whatever reason, they create a thread for every processed video frame, or close to this… Resource utilization and performance are affected accordingly. Imagine you are processing a video feed from a high frame rate camera. The decoder itself, including its AMF runtime overhead, decodes images in a millisecond or less, but they spoiled it with absurd threading topped with other bugs.
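To illustrate the difference in approach, here is a minimal portable sketch that contrasts a thread per frame with a single long-lived worker fed from a queue. It is not AMD's code and does not claim to reproduce what their MFT does internally; the function names and the Handler callback standing in for the actual decoding are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Pattern suggested by the log above: a new thread per frame.
// Returns the number of threads created.
std::size_t ProcessWithThreadPerFrame(int FrameCount, const std::function<void(int)>& Handler)
{
	assert(FrameCount >= 0);
	std::size_t ThreadCount = 0;
	for(int FrameIndex = 0; FrameIndex < FrameCount; FrameIndex++)
	{
		std::thread Thread(Handler, FrameIndex); // thread creation cost paid per frame
		ThreadCount++;
		Thread.join();
	}
	return ThreadCount;
}

// Alternative: one worker thread that lives as long as the stream does.
// Returns the number of threads created, which is always one.
std::size_t ProcessWithSingleWorker(int FrameCount, const std::function<void(int)>& Handler)
{
	assert(FrameCount >= 0);
	std::mutex Mutex;
	std::condition_variable Condition;
	std::queue<int> Queue;
	bool Closed = false;
	std::thread Worker([&]
	{
		for(; ; )
		{
			std::unique_lock<std::mutex> Lock(Mutex);
			Condition.wait(Lock, [&] { return Closed || !Queue.empty(); });
			if(Queue.empty())
				return; // closed and drained
			const int FrameIndex = Queue.front();
			Queue.pop();
			Lock.unlock();
			Handler(FrameIndex);
		}
	});
	for(int FrameIndex = 0; FrameIndex < FrameCount; FrameIndex++)
	{
		{
			std::lock_guard<std::mutex> Lock(Mutex);
			Queue.push(FrameIndex);
		}
		Condition.notify_one();
	}
	{
		std::lock_guard<std::mutex> Lock(Mutex);
		Closed = true;
	}
	Condition.notify_one();
	Worker.join();
	return 1;
}
```

Both functions deliver every frame to the handler; the first one pays for thread creation and teardown on every frame, which at 260 frames per second adds up quickly.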

However, AMD video cards still have the hardware implementation of the codec, and this capability is also exposed via their AMF SDK.

 AMFVideoDecoderUVD_MJPEG

 Acceleration Type: AMF_ACCEL_HARDWARE
 AMF_VIDEO_DECODER_CAP_NUM_OF_STREAMS: 16 

 CodecId    AMF_VARIANT_INT64   7
 DPBSize    AMF_VARIANT_INT64   1

 NumOfStreams    AMF_VARIANT_INT64   16

 Input
 Width Range: 32 - 7,680
 Height Range: 32 - 4,320
 Vertical Alignment: 32
 Format Count: 0
 Memory Type Count: 1
 Memory Type: AMF_MEMORY_HOST Native
 Interlace Support: 1 

 Output
 Width Range: 32 - 7,680
 Height Range: 32 - 4,320
 Vertical Alignment: 32
 Format Count: 4
 Format: AMF_SURFACE_YUY2 
 Format: AMF_SURFACE_NV12 Native
 Format: AMF_SURFACE_BGRA 
 Format: AMF_SURFACE_RGBA 
 Memory Type Count: 1
 Memory Type: AMF_MEMORY_DX11 Native
 Interlace Support: 1 

I guess they stop harassing developers once those switch from the out-of-the-box MFT to the SDK interface of their decoder. “AMD MFT MJPEG Decoder” is highly likely just a wrapper over the AMF interface; my guess is that the problematic part is exactly the abandoned wrapper and not the core functionality.
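A side note on the capability dump: it lists a vertical alignment of 32, and in the test run at the top of this page the AMD MFT reported a 640x384 frame format for a 640x360 capture format while the substitute decoder reported 640x360. One plausible reading, not confirmed by AMD, is that the MFT exposes the padded surface height. The arithmetic matches, as this small sketch shows (AlignUp is a hypothetical helper, not an AMF function):

```cpp
#include <cassert>
#include <cstdint>

// Rounds Value up to the nearest multiple of Alignment (Alignment must be positive)
constexpr std::uint32_t AlignUp(std::uint32_t Value, std::uint32_t Alignment)
{
	return (Value + Alignment - 1) / Alignment * Alignment;
}

static_assert(AlignUp(360, 32) == 384, "360 rows pad to 384 with 32-row alignment");
static_assert(AlignUp(384, 32) == 384, "already aligned values are unchanged");
```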

Best JSON and Base64 libraries for C++ and Windows

The application in the previous post, Is my system a hybrid (switchable) graphics system?, is implemented in C++/WinRT, which I finally decided to give a try.

For authoring and consuming Windows Runtime APIs using C++, there is C++/WinRT. This is Microsoft’s recommended replacement for the C++/CX language projection, and the Windows Runtime C++ Template Library (WRL).

The real question here is this: has WRL been adopted at all yet? And yet it is already superseded by a new thing!

What is really cool about C++/WinRT though, and this is what I like most about it, is that good old-fashioned apps like desktop and even console ones can seamlessly consume UWP APIs. And this is where new and revised APIs now appear first.

If you, for instance, need a JSON library for C++ code, the Windows.Data.Json API is at your service. You no longer need a third-party library for this and many other routine tasks. Here is how you would handle Base64 and parse and stringify JSON in plain C++:

#include <cstring>
#include <string>
#include <windows.h>
#include <winrt/Windows.Data.Json.h>
#include <winrt/Windows.Security.Cryptography.h>
#include <winrt/Windows.Storage.Streams.h>

// NOTE: ToMultiByte and FromMultiByte are string conversion helpers defined elsewhere in the application (not shown here)

// Encodes a block of bytes as a Base64 string
std::wstring EncodeBase64(const BYTE* Data, SIZE_T DataSize)
{
	winrt::Windows::Storage::Streams::Buffer Buffer(static_cast<UINT32>(DataSize));
	std::memcpy(Buffer.data(), Data, DataSize);
	Buffer.Length(static_cast<UINT32>(DataSize));
	const winrt::hstring Text = winrt::Windows::Security::Cryptography::CryptographicBuffer::EncodeToBase64String(Buffer);
	return static_cast<std::wstring>(Text);
}
// Encodes the bytes of a narrow string
std::wstring EncodeBase64(const std::string& Value)
{
	return EncodeBase64(reinterpret_cast<const BYTE*>(Value.data()), Value.size());
}
// Encodes a wide string, converted to UTF-8 first
std::wstring EncodeBase64(const std::wstring& Value)
{
	return EncodeBase64(ToMultiByte(Value, CP_UTF8));
}
// Encodes a JSON object in its stringified form
std::wstring EncodeBase64(const winrt::Windows::Data::Json::JsonObject& Value)
{
	return EncodeBase64(static_cast<std::wstring>(Value.Stringify()));
}

// Decodes a Base64 string back into text
std::wstring DecodeBase64(const std::wstring& Value)
{
	winrt::Windows::Storage::Streams::IBuffer Buffer { winrt::Windows::Security::Cryptography::CryptographicBuffer::DecodeFromBase64String(Value) };
	const UINT32 Length = Buffer.Length();
	std::string Result(Length, 0);
	std::memcpy(Result.data(), Buffer.data(), Length);
	return FromMultiByte(Result);
}
// Decodes a Base64 string and parses the resulting text as a JSON object
winrt::Windows::Data::Json::JsonObject DecodeBase64AsJsonObject(const std::wstring& Value)
{
	return { winrt::Windows::Data::Json::JsonObject::Parse(DecodeBase64(Value)) };
}

This thing can cope with XAML too, and if you get the hump just go read up on C++/WinRT use of co_await and friends for built-in support for UWP IAsyncAction.

Is my system a hybrid (switchable) graphics system?

There are a number of systems out there equipped with multiple GPUs which work cooperatively. The technology itself started with integrated graphics processing units, which brought a “free” GPU into the system. Such a system, when equipped with an additional discrete card, had two GPUs at once, and from a certain point the challenge was not just to choose between the two but also to run them concurrently and utilize the capacity of both.

These GPUs are typically quite different, and there are rational reasons to prefer one over the other in certain scenarios. Integrated graphics (iGPU) is typically slower but power efficient, while discrete graphics (dGPU) is a powerful, fully featured unit offering “performance over power saving” capabilities.

At a certain point of development, the seamless operation of two GPUs received the name hybrid graphics.

By the original definition, “The discrete GPU is a render-only device, and no display outputs are connected to it.”, and so it was for quite some time, when systems like laptops were given two GPUs with an option to choose the GPU for an application to run on. The cooperative operation of the GPUs went like this: “when the discrete GPU is handling all the rendering duties, the final image output to the display is still handled by the Intel integrated graphics processor (IGP). In effect, the IGP is only being used as a simple display controller, resulting in a seamless, flicker-free experience with no need to reboot.”

That is, the principal feature of hybrid graphics technology is the ability to transfer data between GPUs efficiently, so that computationally intensive rendering can happen on the performance GPU with the results transferred to the other GPU, which has physical wiring to a monitor.

We leverage this hardware capability in Rainway game streaming to offer a seamless experience of low latency game streaming using any hardware encoder present in the system, not necessarily one belonging to the piece of hardware where the video originates.

The Microsoft Windows operating system, and its DirectX Graphics Infrastructure (DXGI) in particular, stepped in to hide the details of switchable graphics from applications. Depending on settings, which can be defined per application, an application sees a different enumeration order of adapters, and the operating system either indicates the “true” adapter as the host of the connected monitor, or indicates a different GPU and transfers the rendering results between the GPUs behind the scenes, such as during the desktop composition process.

Side effects of this seamless operation of GPUs, and of misreporting which GPU has the monitor connection, are that in certain cases the Desktop Duplication API fails with non-descriptive error codes (Error generated when Desktop Duplication API-capable application is run against discrete GPU) or Output Protection Manager API communication reports wrong security certificates.

Recent updates of Windows introduced GPU preference right in the OS settings:

Starting with Windows 10 build 17093, Microsoft is introducing a new Graphics settings page for Multi-GPU systems that allows you to manage the graphics performance preference of your apps. You may be familiar with similar graphics control panels from AMD and Nvidia, and you can continue to use those control panels. When you set an application preference in the Windows Graphics settings, that will take precedence over the other control panel settings.

However, these ongoing updates actually extended the boundaries of what a hybrid system is. While the original hybrid system was defined as a system with a primary iGPU with a monitor connected and an additional, render-only, secondary powerful dGPU, recent versions of Microsoft Windows can run multi-GPU systems with a full-featured discrete graphics adapter with a monitor connected to it, and a secondary iGPU still being a part of the heterogeneous setup. In a certain sense this update invalidated previous technical information and definitions which assumed that it is the iGPU which has physical wiring to the monitor in hybrid systems.

Even though there is a seemingly seamless operation of multiple GPUs, the GPUs still remain in master/slave relation: the operating system is responsible for composition of final image on the GPU with monitor (DXGI output) connection.

I developed a simple application (see download link at the bottom of the post) that discovers properties of hybrid systems and identifies the “main” GPU with the output connection. The application shows whether the operating system can trick applications and report another GPU following GPU preference settings, and marks “main” GPUs with an asterisk.

Here are some of the results:

Continue reading →

AMD PowerXpress/Enduro switchable graphics DXGI issue

Yesterday I wrote about the NVIDIA Optimus problem with DXGI, where the Output Protection Manager API reported an Intel certificate for the NVIDIA DXGI adapter.

AMD hybrid graphics known as “PowerXpress” (and most recently as “Enduro”) exhibits the same behavior:

Adapters
 AMD Radeon HD 8850M
 Intel(R) HD Graphics Family 

Adapter: AMD Radeon HD 8850M
 Vendor Identifier: 0x1002
 [...]
 Output: \\.\DISPLAY1
 [...]

Output Protection Manager (OPM Semantics)
 Certificate Subject: IntelVpgOpm2011
 OPM_GET_OUTPUT_ID: OutputId 0x0000000000040F04
 OPM_GET_ADAPTER_BUS_TYPE: ulInformation OPM_BUS_TYPE_OTHER | OPM_BUS_IMPLEMENTATION_MODIFIER_INSIDE_OF_CHIPSET | OPM_COPP_COMPATIBLE_BUS_TYPE_INTEGRATED
 OPM_GET_CONNECTOR_TYPE: ulInformation OPM_CONNECTOR_TYPE_DISPLAYPORT_EMBEDDED
 [...]

Output Duplication
 Exception: The specified device interface or feature level is not supported on this system (0x887A0004; DXGI_ERROR_UNSUPPORTED); https://support.microsoft.com/en-ie/help/3019314/error-generated-when-desktop-duplication-api-capable-application-is-ru 

Interestingly, it seems that NVIDIA somehow managed to resolve this problem in their most recent hybrid graphics solutions (I have no suitable device handy to check), while AMD in turn seems to have discontinued this product line altogether (community forum questions look like a desperate self-help service). Or I just failed to find accurate information on the topic.

MediaFoundationDxgiCapabilities: GPU preference & Hybrid GPU systems

I added a section that enumerates DXGI adapters with the help of IDXGIFactory6::EnumAdapterByGpuPreference; this is included in the produced output.

Unfortunately the method does not distinguish between dual GPU systems, such as those with a discrete GPU and an additional CPU-integrated Intel GPU…

DXGI Capabilities
 NOTE: Baseline capabilities are corresponding to DXGI 1.1
 Windowed Stereo: 0
 DXGI_FEATURE_PRESENT_ALLOW_TEARING: 1
 Adapters by Preference:
 DXGI_GPU_PREFERENCE_UNSPECIFIED: Radeon RX 570 Series (0.0x0000D18A), Intel(R) UHD Graphics 630 (0.0x8E94827B), Microsoft Basic Render Driver (0.0x0000D163)
 DXGI_GPU_PREFERENCE_MINIMUM_POWER: Intel(R) UHD Graphics 630 (0.0x8E94827B), Radeon RX 570 Series (0.0x0000D18A), Microsoft Basic Render Driver (0.0x0000D163)
 DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE: Radeon RX 570 Series (0.0x0000D18A), Intel(R) UHD Graphics 630 (0.0x8E94827B), Microsoft Basic Render Driver (0.0x0000D163)
 Factory Interfaces: IDXGIObject, IDXGIFactory, IDXGIFactory1, IDXGIFactory2, IDXGIFactory3, IDXGIFactory4, IDXGIFactory5, IDXGIFactory6, IDXGIFactory7, IDXGIDisplayControl 

Adapters
 Radeon RX 570 Series
 Intel(R) UHD Graphics 630 

…and hybrid systems such as NVIDIA Optimus and AMD PowerXpress. Those from AMD seem to be discontinued, and AMD traditionally provides zero helpful feedback to developers (although NVIDIA is not any better).

However, this time I might have spotted something interesting. On an NVIDIA hybrid system, when an app is executed on the iGPU, the output is this (expected):

DXGI Capabilities
 NOTE: Baseline capabilities are corresponding to DXGI 1.1
 Windowed Stereo: 0
 DXGI_FEATURE_PRESENT_ALLOW_TEARING: 1
 Adapters by Preference:
 DXGI_GPU_PREFERENCE_UNSPECIFIED: Intel(R) HD Graphics 520 (0.0x00010765), NVIDIA GeForce 940MX (0.0x00010A9D), Microsoft Basic Render Driver (0.0x00010A66)
 DXGI_GPU_PREFERENCE_MINIMUM_POWER: Intel(R) HD Graphics 520 (0.0x00010765), NVIDIA GeForce 940MX (0.0x00010A9D), Microsoft Basic Render Driver (0.0x00010A66)
 DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE: NVIDIA GeForce 940MX (0.0x00010A9D), Intel(R) HD Graphics 520 (0.0x00010765), Microsoft Basic Render Driver (0.0x00010A66)
 Factory Interfaces: IDXGIObject, IDXGIFactory, IDXGIFactory1, IDXGIFactory2, IDXGIFactory3, IDXGIFactory4, IDXGIFactory5, IDXGIFactory6, IDXGIFactory7, IDXGIDisplayControl 

Adapters
 Intel(R) HD Graphics 520
 NVIDIA GeForce 940MX 

[...]

Output Protection Manager (OPM Semantics)
 Certificate Subject: IntelVpgOpm2011
 OPM_GET_OUTPUT_ID: OutputId 0x0000000000040F04
 OPM_GET_ADAPTER_BUS_TYPE: ulInformation OPM_BUS_TYPE_OTHER | OPM_BUS_IMPLEMENTATION_MODIFIER_INSIDE_OF_CHIPSET | OPM_COPP_COMPATIBLE_BUS_TYPE_INTEGRATED
 OPM_GET_CONNECTOR_TYPE: ulInformation OPM_CONNECTOR_TYPE_DISPLAYPORT_EMBEDDED 

[...]
 
Output Duplication
 Direct3D 11 Feature Level: D3D_FEATURE_LEVEL_11_1; https://msdn.microsoft.com/en-us/library/windows/desktop/ff476876#Overview
 Mode Description:
 Width: 1 920
 Height: 1 080
 Refresh Rate: 138 500 000/2 310 880 (59,934)
 Format: DXGI_FORMAT_B8G8R8A8_UNORM
 Scanline Ordering: DXGI_MODE_SCANLINE_ORDER_PROGRESSIVE
 Scaling: DXGI_MODE_SCALING_UNSPECIFIED
 Rotation: DXGI_MODE_ROTATION_IDENTITY
 Desktop Image In System Memory: 0 

However, when the app is started on the dGPU, the adapter enumeration order changes as expected, and the Desktop Duplication API is also dysfunctional, as documented here: Error generated when Desktop Duplication API-capable application is run against discrete GPU, but…

DXGI Capabilities
 NOTE: Baseline capabilities are corresponding to DXGI 1.1
 Windowed Stereo: 0
 DXGI_FEATURE_PRESENT_ALLOW_TEARING: 1
 Adapters by Preference:
 DXGI_GPU_PREFERENCE_UNSPECIFIED: NVIDIA GeForce 940MX (0.0x00010A9D), Intel(R) HD Graphics 520 (0.0x00010765), Microsoft Basic Render Driver (0.0x00010A66)
 DXGI_GPU_PREFERENCE_MINIMUM_POWER: Intel(R) HD Graphics 520 (0.0x00010765), NVIDIA GeForce 940MX (0.0x00010A9D), Microsoft Basic Render Driver (0.0x00010A66)
 DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE: NVIDIA GeForce 940MX (0.0x00010A9D), Intel(R) HD Graphics 520 (0.0x00010765), Microsoft Basic Render Driver (0.0x00010A66)
 Factory Interfaces: IDXGIObject, IDXGIFactory, IDXGIFactory1, IDXGIFactory2, IDXGIFactory3, IDXGIFactory4, IDXGIFactory5, IDXGIFactory6, IDXGIFactory7, IDXGIDisplayControl 

Adapters
 NVIDIA GeForce 940MX
 Intel(R) HD Graphics 520 

[...]

Output Protection Manager (OPM Semantics)
 Certificate Subject: IntelVpgOpm2011
 OPM_GET_OUTPUT_ID: OutputId 0x0000000000040F04
 OPM_GET_ADAPTER_BUS_TYPE: ulInformation OPM_BUS_TYPE_OTHER | OPM_BUS_IMPLEMENTATION_MODIFIER_INSIDE_OF_CHIPSET | OPM_COPP_COMPATIBLE_BUS_TYPE_INTEGRATED
 OPM_GET_CONNECTOR_TYPE: ulInformation OPM_CONNECTOR_TYPE_DISPLAYPORT_EMBEDDED

[...]
 
Output Duplication
 Exception: The specified device interface or feature level is not supported on this system (0x887A0004; DXGI_ERROR_UNSUPPORTED); https://support.microsoft.com/en-ie/help/3019314/error-generated-when-desktop-duplication-api-capable-application-is-ru 

…here is the news: the NVIDIA DXGI adapter is enumerated with an Intel OPM certificate… What a find!
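The change in the default enumeration order between the two runs can also be detected programmatically by comparing the three preference lists. The sketch below is portable and works on adapter names only; the ClassifyDefaultOrder function and the DefaultOrder enumeration are hypothetical names, and in the real tool the lists would be filled from IDXGIFactory6::EnumAdapterByGpuPreference results.

```cpp
#include <cassert>
#include <string>
#include <vector>

using AdapterOrder = std::vector<std::string>;

enum class DefaultOrder
{
	MatchesMinimumPower,
	MatchesHighPerformance,
	Ambiguous, // both preferences give the same order, nothing to tell apart
	Other,
};

// Tells which explicit preference the default (unspecified) order coincides with
DefaultOrder ClassifyDefaultOrder(const AdapterOrder& Unspecified, const AdapterOrder& MinimumPower, const AdapterOrder& HighPerformance)
{
	// All three lists enumerate the same set of adapters
	assert(Unspecified.size() == MinimumPower.size());
	assert(Unspecified.size() == HighPerformance.size());
	if(MinimumPower == HighPerformance)
		return Unspecified == MinimumPower ? DefaultOrder::Ambiguous : DefaultOrder::Other;
	if(Unspecified == MinimumPower)
		return DefaultOrder::MatchesMinimumPower;
	if(Unspecified == HighPerformance)
		return DefaultOrder::MatchesHighPerformance;
	return DefaultOrder::Other;
}
```

Fed with the lists from the two runs above, the first run classifies as matching the minimum power order and the second as matching the high performance order.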


Low latency video streaming

Bits of Rainway Pulsar technology at work in this test run. The big monitor is part of a desktop system, and by the way not a high-end one: a Radeon RX 570 with two monitors, one of which is encoded into H.264 and streamed over the network to two pieces of hardware:

  1. Intel NUC with a 2K monitor (small monitor on the left) connected to onboard Intel® Iris® Plus Graphics 640 & its HDMI connector
  2. Xbox One X (in the right bottom corner)

The high FPS 3D thing (by the way, it is “The Universe Within” by BigWIngs) is running on the desktop PC, and its image is delivered to the other two systems simultaneously via H.264 encoding at 5 Mbit/s.

Ordinary hardware, yet streaming is made right and shows what can be squeezed out.

I will have to check faster hardware, and I think the latency can be cut in half. Interestingly, at some point it might become harder to deliver audio remotely at such low latency.