State of video remoting continued

Comparing time codes is one method; getting a feel for latency by actually driving is another. Here, the Rainway Xbox One UWP application acts as a thin client to a game running on a desktop PC.

And we are happy that people love it:

Whoever the engineers who wrote the core technology, the minimal-latency streaming code - wow, I am so impressed by what they've created! It's SO quick, like I'm streaming now from two computers to remote platforms, and everything is all over WiFi and latency is 9ms or less. This is giving life to some old hardware, and it's enabling me to use my computer anywhere.

UPD. A few days later:

Telegram bot to extract contents of H.264 parameter set NAL units

Continuing the previous post about C++/WinRT and Telegram, here we are with @ParameterSetAnalyzeBot: “Your buddy to extract H.264 parameter set NAL data”. In a chat, it expects to be sent an MP4 file with an H.264 video track. It then extracts the data from the sample description box and deciphers it into readable form:

It literally feeds the MP4 file to the Media Foundation Source Reader API, pulls the MF_MT_MPEG_SEQUENCE_HEADER attribute and pipes the data to the h264_analyze tool (my fork of it adds a Visual Studio 2019 project and the ability to take input from stdin for piping needs).

It will probably not be online forever, but it’s live. Be aware that Telegram limits file transmissions to 20 MB per file at the moment.

Continue reading →

So yes, C++/WinRT is how C++ development is to be done on Windows

“Modern” C++/WinRT is the way to write rather powerful things in a compact and readable way, mixing everything you can think of together: classic C++ and libraries, UWP APIs including HTTP client and JSON, COM, the ability to put code into console/desktop applications, the async API model and C++20 coroutines.

For example, here is a fragment of Telegram bot code that echoes a message back, written against the bare Windows 10 SDK API set without external libraries:

for(auto&& UpdateValue: UpdateArray) // https://core.telegram.org/bots/api#update
{
	JsonObject Update = UpdateValue.GetObject();
	const UINT64 UpdateIdentifier = static_cast<UINT64>(Update.GetNamedNumber(L"update_id"));
	m_Context.m_NextUpdateIdentifier = UpdateIdentifier + 1;
	if(Update.HasKey(L"message"))
	{
		JsonObject Message = Update.GetNamedObject(L"message");
		m_Journal.Write(
		{ 
			L"Message",
			static_cast<std::wstring>(Message.Stringify()),
		});
		const UINT64 MessageIdentifier = static_cast<UINT64>(Message.GetNamedNumber(L"message_id"));
		JsonObject FromUser = Message.GetNamedObject(L"from");
		const UINT64 FromUserIdentifier = static_cast<UINT64>(FromUser.GetNamedNumber(L"id"));
		std::wstring FromUserUsername = static_cast<std::wstring>(FromUser.GetNamedString(L"username"));
		#pragma region ACK
		JsonObject Chat = Message.GetNamedObject(L"chat");
		const UINT64 ChatIdentifier = static_cast<UINT64>(Chat.GetNamedNumber(L"id"));
		{
			std::wstring Text = Format(L"Hey, *@%ls*, I confirm message _%llu_\\. Send me a file now\\!", FromUserUsername.c_str(), MessageIdentifier);
			Uri RequestUri(static_cast<winrt::hstring>(Format(L"https://api.telegram.org/bot%ls/sendMessage", m_Configuration.m_Token.c_str())));
			JsonObject Request;
			Request.Insert(L"chat_id", JsonValue::CreateNumberValue(static_cast<DOUBLE>(ChatIdentifier)));
			Request.Insert(L"text", JsonValue::CreateStringValue(static_cast<winrt::hstring>(Text)));
			Request.Insert(L"parse_mode", JsonValue::CreateStringValue(L"MarkdownV2"));
			m_Journal.Write(
			{ 
				L"sendMessage",
				L"Request",
				static_cast<std::wstring>(Request.Stringify()),
			});
			HttpStringContent Content(Request.Stringify(), UnicodeEncoding::Utf8);
			Content.Headers().ContentType(Headers::HttpMediaTypeHeaderValue(L"application/json"));
			HttpResponseMessage ResponseMessage = Client.PostAsync(RequestUri, Content).get();
			JsonObject Response = JsonObject::Parse(ResponseMessage.Content().ReadAsStringAsync().get());
			m_Journal.Write(
			{ 
				L"sendMessage",
				L"Response",
				static_cast<std::wstring>(Response.Stringify()),
			});
			__D(Response.GetNamedBoolean(L"ok"), E_UNNAMED);
		}
		#pragma endregion
	}
}

Please count me as a fan of this.

On efficiency of hardware-assisted JPEG decoding (AMD MFT MJPEG Decoder)

The previous post focused on problems with the hardware MFT decoder provided as part of the video driver package. This time I am going to share some data on how the inefficiency affects video capture performance, using a high frame rate 260 FPS camera as a test stand. The effect is more visible at high frame rates because CPU and GPU hardware is already fast enough to process a less complicated signal.

There is already some interest from AMD’s end (why this is exceptional deserves a separate post on its own), and some bug fixes are already under way.

The performance problem is not immediately visible because the decoder overall performs without fatal issues and provides the expected output: no failures, no error codes, no deadlocks, and neither the CPU nor a GPU engine is maxed out, so things look more or less fine at first glance… The test application uses the Media Foundation Source Reader API to read textures in hardware MFT enabled mode and discards them, just printing out the frame rate.

AMD MFT MJPEG Decoder

C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
 Using camera HD USB Camera
 Using adapter Radeon RX 570 Series
 Using video capture format 640x360@260.004 MFVideoFormat_MJPG
 Using hardware decoder MFT AMD MFT MJPEG Decoder
 Using video frame format 640x384@260.004 MFVideoFormat_YUY2
 72.500 video samples per second captured
 134.000 video samples per second captured
 135.000 video samples per second captured
 134.500 video samples per second captured
 135.500 video samples per second captured
 134.000 video samples per second captured
 134.000 video samples per second captured
 135.000 video samples per second captured
 134.500 video samples per second captured
 133.500 video samples per second captured
 134.000 video samples per second captured

With no sign of hitting a bottleneck the reader process produces ~134 FPS from the video capture device.

Alax.Info MJPG Video Decoder for AMD Hardware

My replacement for the hardware decoder MFT decodes the same signal and generally shares a lot with AMD’s own decoder: both MFTs are built on top of the Advanced Media Framework (AMF) SDK. The driver package installs a runtime for this SDK along with a decoder MFT linked against a copy of the runtime (according to an AMD representative, the statically linked copy shares the same codebase).

C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
 Using camera HD USB Camera
 Using adapter Radeon RX 570 Series
 Using video capture format 640x360@260.004 MFVideoFormat_MJPG
 Using substitute decoder Alax.Info MJPG Video Decoder for AMD Hardware
 Using video frame format 640x360@260.004 MFVideoFormat_YUY2
 74.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 260.500 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 260.500 video samples per second captured

Similar CPU and GPU utilization levels, but at a higher frame rate. Actually, at the expected frame rate, because this is the rate the camera is supposed to operate at.

Continue reading →

Hardware accelerated JPEG video decoder MFT from AMD

Video GPU vendors (AMD, Intel, NVIDIA) ship their hardware with drivers, which in turn provide a hardware-assisted decoder for JPEG (also known as MJPG, MJPEG, and Motion JPEG) video in the form factor of a Media Foundation Transform (MFT).

JPEG is not included in the DirectX Video Acceleration (DXVA) 2.0 specification; however, the hardware carries an implementation of the decoder. A separate additional MFT is a natural way to provide OS integration.

AMD’s decoder is named “AMD MFT MJPEG Decoder” and looks weird from the start. It is marked as MFT_ENUM_FLAG_HARDWARE, which is good, but this normally implies that the MFT is also MFT_ENUM_FLAG_ASYNCMFT, and that markup is missing. Another AMD decoder MFT, “AMD D3D11 Hardware MFT Playback Decoder”, has the same problem.

Hardware MFTs must use the new asynchronous processing model…

Presumably the MFT behaves as a normal asynchronous MFT; as long as the missing markup has no side effects with Microsoft’s software, AMD does not care about the confusion it causes others.

Furthermore, the registration information for this decoder suggests that it can decode into the MFVideoFormat_NV12 video format, and sadly this is again an inaccurate promise. Despite the claim, the capability is missing, and Microsoft’s Video Processor MFT jumps in as needed to satisfy such a format conversion.

These were just minor things, more or less easy to tolerate. However, a rule of thumb is that the Media Foundation glue layer provided by technology partners such as GPU vendors only satisfies minimal certification requirements, and beyond that it causes suffering and pain to anyone who wants to use it in real-world scenarios.

AMD’s take on making developers feel miserable is the way hardware-assisted JPEG decoding actually takes place.

The thread 0xc880 has exited with code 0 (0x0).
The thread 0x593c has exited with code 0 (0x0).
The thread 0xa10 has exited with code 0 (0x0).
The thread 0x92c4 has exited with code 0 (0x0).
The thread 0x9c14 has exited with code 0 (0x0).
The thread 0xa094 has exited with code 0 (0x0).
The thread 0x609c has exited with code 0 (0x0).
The thread 0x47f8 has exited with code 0 (0x0).
The thread 0xe1ec has exited with code 0 (0x0).
The thread 0x6cd4 has exited with code 0 (0x0).
The thread 0x21f4 has exited with code 0 (0x0).
The thread 0xd8f8 has exited with code 0 (0x0).
The thread 0xf80 has exited with code 0 (0x0).
The thread 0x8a90 has exited with code 0 (0x0).
The thread 0x103a4 has exited with code 0 (0x0).
The thread 0xa16c has exited with code 0 (0x0).
The thread 0x6754 has exited with code 0 (0x0).
The thread 0x9054 has exited with code 0 (0x0).
The thread 0x9fe4 has exited with code 0 (0x0).
The thread 0x12360 has exited with code 0 (0x0).
The thread 0x31f8 has exited with code 0 (0x0).
The thread 0x3214 has exited with code 0 (0x0).
The thread 0x7968 has exited with code 0 (0x0).
The thread 0xbe84 has exited with code 0 (0x0).
The thread 0x11720 has exited with code 0 (0x0).
The thread 0xde10 has exited with code 0 (0x0).
The thread 0x5848 has exited with code 0 (0x0).
The thread 0x107fc has exited with code 0 (0x0).
The thread 0x6e04 has exited with code 0 (0x0).
The thread 0x6e90 has exited with code 0 (0x0).
The thread 0x2b18 has exited with code 0 (0x0).
The thread 0xa8c0 has exited with code 0 (0x0).
The thread 0xbd08 has exited with code 0 (0x0).
The thread 0x1262c has exited with code 0 (0x0).
The thread 0x12140 has exited with code 0 (0x0).
The thread 0x8044 has exited with code 0 (0x0).
The thread 0x6208 has exited with code 0 (0x0).
The thread 0x83f8 has exited with code 0 (0x0).
The thread 0x10734 has exited with code 0 (0x0).

For whatever reason, they create a thread for every processed video frame, or close to that… Resource utilization and performance suffer accordingly. Imagine processing a video feed from a high frame rate camera. The decoder itself, including its AMF runtime overhead, decodes images in a millisecond or less, but they spoiled it with absurd threading topped with other bugs.
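
To see why thread-per-frame is pure overhead, compare the two topologies in a portable C++ sketch (the names and the trivial `ProcessFrame` stand-in are mine, not AMD’s): both produce identical results, but the first pays for a kernel thread create/join/destroy cycle on every single frame.

```cpp
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in for the real per-frame work (hardware JPEG decode)
uint64_t ProcessFrame(uint64_t Frame)
{
	return Frame * Frame;
}

// Anti-pattern resembling the observed behavior: a new thread per frame;
// every frame pays for kernel thread creation and destruction
uint64_t ProcessWithThreadPerFrame(const std::vector<uint64_t>& Frames)
{
	uint64_t Sum = 0;
	for (uint64_t Frame : Frames)
	{
		uint64_t Result = 0;
		std::thread Thread([&] { Result = ProcessFrame(Frame); });
		Thread.join();
		Sum += Result;
	}
	return Sum;
}

// Sane topology: one long-lived worker thread processes the whole feed
uint64_t ProcessWithWorker(const std::vector<uint64_t>& Frames)
{
	uint64_t Sum = 0;
	std::thread Worker([&]
	{
		for (uint64_t Frame : Frames)
			Sum += ProcessFrame(Frame);
	});
	Worker.join();
	return Sum;
}
```

At 260 frames per second, the per-frame thread churn alone eats a noticeable slice of the millisecond budget each frame has.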

However, AMD video cards still have the hardware implementation of the codec, and this capability is also exposed via their AMF SDK.

 AMFVideoDecoderUVD_MJPEG

 Acceleration Type: AMF_ACCEL_HARDWARE
 AMF_VIDEO_DECODER_CAP_NUM_OF_STREAMS: 16 

 CodecId    AMF_VARIANT_INT64   7
 DPBSize    AMF_VARIANT_INT64   1

 NumOfStreams    AMF_VARIANT_INT64   16

 Input
 Width Range: 32 - 7,680
 Height Range: 32 - 4,320
 Vertical Alignment: 32
 Format Count: 0
 Memory Type Count: 1
 Memory Type: AMF_MEMORY_HOST Native
 Interlace Support: 1 

 Output
 Width Range: 32 - 7,680
 Height Range: 32 - 4,320
 Vertical Alignment: 32
 Format Count: 4
 Format: AMF_SURFACE_YUY2 
 Format: AMF_SURFACE_NV12 Native
 Format: AMF_SURFACE_BGRA 
 Format: AMF_SURFACE_RGBA 
 Memory Type Count: 1
 Memory Type: AMF_MEMORY_DX11 Native
 Interlace Support: 1 

I guess the harassment stops once developers switch from the out-of-the-box MFT to the SDK interface into their decoder. “AMD MFT MJPEG Decoder” is highly likely just a wrapper over the AMF interface; my guess is that the problematic part is exactly this abandoned wrapper and not the core functionality.

Best JSON and Base64 libraries for C++ and Windows

The application in the previous post Is my system a hybrid (switchable) graphics system? is implemented in C++/WinRT, which I finally decided to give a try.

For authoring and consuming Windows Runtime APIs using C++, there is C++/WinRT. This is Microsoft’s recommended replacement for the C++/CX language projection, and the Windows Runtime C++ Template Library (WRL).

The real question here is: has WRL even been widely adopted yet? And it is already superseded by a new thing!

What is really cool about C++/WinRT though – and this is what I like most about it – is that good old-fashioned apps, desktop and even console ones, can seamlessly consume UWP APIs. And this is where new and revised APIs now appear in the first place.

If, for instance, you need a JSON library for C++ code, the Windows.Data.Json API is at your service. You no longer need an external library for this and many other routine tasks. Here is how you can combine JSON and Base64 handling in plain C++:

std::wstring EncodeBase64(const BYTE* Data, SIZE_T DataSize)
{
	winrt::Windows::Storage::Streams::Buffer Buffer(static_cast<UINT32>(DataSize));
	std::memcpy(Buffer.data(), Data, DataSize);
	Buffer.Length(static_cast<UINT32>(DataSize));
	const winrt::hstring Text = winrt::Windows::Security::Cryptography::CryptographicBuffer::EncodeToBase64String(Buffer);
	return static_cast<std::wstring>(Text);
}
std::wstring EncodeBase64(const std::string& Value)
{
	return EncodeBase64(reinterpret_cast<const BYTE*>(Value.data()), Value.size());
}
std::wstring EncodeBase64(const std::wstring& Value)
{
	return EncodeBase64(ToMultiByte(Value, CP_UTF8));
}
std::wstring EncodeBase64(const winrt::Windows::Data::Json::JsonObject& Value)
{
	return EncodeBase64(static_cast<std::wstring>(Value.Stringify()));
}

std::wstring DecodeBase64(const std::wstring& Value)
{
	winrt::Windows::Storage::Streams::IBuffer Buffer { winrt::Windows::Security::Cryptography::CryptographicBuffer::DecodeFromBase64String(Value) };
	const UINT32 Length = Buffer.Length();
	std::string Result(Length, 0);
	std::memcpy(Result.data(), Buffer.data(), Length);
	return FromMultiByte(Result);
}
winrt::Windows::Data::Json::JsonObject DecodeBase64AsJsonObject(const std::wstring& Value)
{
	return { winrt::Windows::Data::Json::JsonObject::Parse(DecodeBase64(Value)) };
}
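
For comparison, this is roughly what CryptographicBuffer::EncodeToBase64String does under the hood: a minimal portable sketch of RFC 4648 Base64 encoding (`EncodeBase64Portable` is a hypothetical name of mine, not part of any SDK):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// RFC 4648 Base64 encoding, the same alphabet and padding CryptographicBuffer uses
std::string EncodeBase64Portable(const std::vector<uint8_t>& Data)
{
	static const char Alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	std::string Result;
	size_t Index = 0;
	// Each full 3-byte group maps to four 6-bit alphabet indices
	for (; Index + 3 <= Data.size(); Index += 3)
	{
		const uint32_t Value = (Data[Index] << 16) | (Data[Index + 1] << 8) | Data[Index + 2];
		Result += Alphabet[(Value >> 18) & 0x3F];
		Result += Alphabet[(Value >> 12) & 0x3F];
		Result += Alphabet[(Value >> 6) & 0x3F];
		Result += Alphabet[Value & 0x3F];
	}
	// A trailing 1- or 2-byte remainder is zero-padded and marked with '='
	const size_t Remainder = Data.size() - Index;
	if (Remainder == 1)
	{
		const uint32_t Value = Data[Index] << 16;
		Result += Alphabet[(Value >> 18) & 0x3F];
		Result += Alphabet[(Value >> 12) & 0x3F];
		Result += "==";
	}
	else if (Remainder == 2)
	{
		const uint32_t Value = (Data[Index] << 16) | (Data[Index + 1] << 8);
		Result += Alphabet[(Value >> 18) & 0x3F];
		Result += Alphabet[(Value >> 12) & 0x3F];
		Result += Alphabet[(Value >> 6) & 0x3F];
		Result += '=';
	}
	return Result;
}
```

The point of the WinRT variant above, of course, is that you get all of this for free from the OS, with no code of your own to maintain.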

This thing can cope with XAML too, and if you are intrigued, go read up on C++/WinRT’s use of co_await and friends for built-in support of UWP IAsyncAction.

Is my system a hybrid (switchable) graphics system?

There are a number of systems out there equipped with multiple GPUs that work cooperatively. The technology started with integrated graphics processing units, which brought a “free” GPU into the system. Such a system, equipped with an additional discrete card, ended up with two GPUs at once, and at a certain point the challenge became not just choosing between the two but running them concurrently and utilizing the capacity of both.

These GPUs are typically quite different, and there are rational reasons to prefer one over the other in certain scenarios. Integrated graphics (iGPU) is typically slower but power-efficient, while discrete graphics (dGPU) is a powerful, fully featured unit offering performance over power saving.

At a certain point, the seamless operation of two GPUs received the name of hybrid graphics.

By the original definition, “The discrete GPU is a render-only device, and no display outputs are connected to it.” And so it was for quite some time: systems like laptops were given two GPUs with an option to choose the GPU for an application to run on. The cooperative operation of the GPUs was like this: “when the discrete GPU is handling all the rendering duties, the final image output to the display is still handled by the Intel integrated graphics processor (IGP). In effect, the IGP is only being used as a simple display controller, resulting in a seamless, flicker-free experience with no need to reboot.”

That is, the principal feature of hybrid graphics technology is the ability to transfer data between GPUs efficiently, so that computationally intensive rendering can happen on the performance GPU while the results are transferred to the other GPU, which has the physical wiring to a monitor.

We leverage this hardware capability in Rainway game streaming to offer a seamless low-latency experience using any hardware encoder present in the system, not necessarily the one belonging to the piece of hardware where the video originates.

The Microsoft Windows operating system, and its DirectX Graphics Infrastructure (DXGI) in particular, stepped in to hide the details of switchable graphics from applications. Depending on settings, which can be defined per application, an application sees a different enumeration order of adapters, and the operating system either indicates the “true” adapter as the host of the connected monitor, or indicates a different GPU while transferring the rendering results between the GPUs behind the scenes, such as during the desktop composition process.

Side effects of this seamless operation and of misreporting which GPU has the monitor connection are that in certain cases the Desktop Duplication API fails with nondescriptive error codes (Error generated when Desktop Duplication API-capable application is run against discrete GPU), or Output Protection Manager API communication reports wrong security certificates.

Recent updates of Windows introduced GPU preference right in the OS settings:

Starting with Windows 10 build 17093, Microsoft is introducing a new Graphics settings page for Multi-GPU systems that allows you to manage the graphics performance preference of your apps. You may be familiar with similar graphics control panels from AMD and Nvidia, and you can continue to use those control panels. When you set an application preference in the Windows Graphics settings, that will take precedence over the other control panel settings.

However, these ongoing updates actually extended the boundaries of the hybrid system itself. While the original hybrid system was defined as a system with a primary iGPU with a monitor connected plus an additional render-only, secondary but powerful dGPU, recent versions of Microsoft Windows can run multiple-GPU systems with a full-featured discrete graphics adapter having the monitor connected to it and a secondary iGPU still being part of the heterogeneous setup. In a certain sense, this update invalidated previous technical information and definitions that assumed it is the iGPU that has the physical wiring to the monitor in hybrid systems.

Even though the operation of multiple GPUs appears seamless, the GPUs still remain in a master/slave relation: the operating system is responsible for composing the final image on the GPU with the monitor (DXGI output) connection.

I developed a simple application (see the download link at the bottom of the post) that discovers properties of hybrid systems and identifies the “main” GPU with the output connection. The application displays details on whether the operating system can trick applications into seeing another GPU first, following the GPU preference settings, and indicates “main” GPUs with an asterisk.

Here are some of the results:

Continue reading →