Microsoft “FaceTracker” Face Detection in the Form of a Telegram Bot

Microsoft Windows operating systems come with a built-in API for face tracking in the Windows.Media.FaceAnalysis namespace. It has been available since Windows 10 “Threshold 1” build 1507, yet it is probably not the most popular API for a few reasons, perhaps the most important being that, like some other recent additions, it targets UWP development.

Nonetheless, Microsoft even published a couple of samples (see Basic face tracking sample), and the feature is perhaps best known for its use in the Windows Store Camera application: in Photo mode the application detects faces, and the documentation suggests the application might cooperate with hardware to auto-focus on a detected face. Overall, the face detection is of limited use and feature set, even though it is quite nicely fitted not only to still image face detection, but also to video and “tracking”: following faces through a sequence of video frames.

UWP video API documentation rarely mentions implementation internals, including the fact that, under the hood, this class of functionality is implemented in integration with the Media Foundation API. Even though this makes good sense overall, since Media Foundation is the current media API in Windows, the functionality is not advertised as available directly through Media Foundation: no documentation, no sample code. “This area is restricted to personnel, stay well clear of.”

So I just made a quick journey into this restricted territory and plugged FaceTracker into a classic desktop Media Foundation pipeline, which in turn is plugged into a Telegram bot template (which, in its turn, just runs as a Windows Service on some Internet-connected box: this way Telegram is convertible into a cheap cloud service).

The demo highlights detected faces in a user-provided clip, like this:

That is, in a conversation with the bot one supplies his or her own video clip, and the bot runs it against the face detector, overlays frames around found faces, then sends back the re-encoded video. Since Media Foundation is involved, and given that the face detector interfaces well with the media pipeline, the process takes place on the GPU to the extent possible: DXVA, Direct3D 11, hardware encoder ASIC.

All in all, meet @FaceDetectionDemoBot:

I guess next time the bot will also extract face detection information from clips recorded by iPhones. If I get it right, the recent fruity products do face detection automatically and embed the information right into the clip, helping cloud storage processing since the edge device has already invested its horsepower into the resource-consuming number crunching.

In closing,

I have a container element, but I will not give it to you…

A few weeks ago I posted a problem with the AMF SDK about a property that is included in enumeration but triggers a failure in the follow-up request for its value. It appeared to be an “internal property”, intended to tease the caller and report an error by design, unlike all the other dozens of SDK properties. And, of course, to raise an error in an unexpected place for those naive ones who trust third-party packages from reputable vendors too much.

So I am happy to report that another vendor, NVIDIA, is catching up in the competition.

uint32_t nPresetCount = 0;
NvCheck(NvEncGetEncodePresetCount(Identifier, nPresetCount));
// nPresetCount is 17 now even though it's been 10 for a long time
if(nPresetCount)
{
	[...]
	NvCheck(NvEncGetEncodePresetGUIDs(Identifier, pPresetIdentifiers, nPresetCount, nPresetCount));
	// Success
	for(uint32_t nPresetIndex = 0; nPresetIndex < nPresetCount; nPresetIndex++)
	{
		[...]
		NV_ENC_PRESET_CONFIG Configuration { NV_ENC_PRESET_CONFIG_VER };
		Configuration.presetCfg.version = NV_ENC_CONFIG_VER;
		// Argument for 11th item discovered above
		NvCheck(NvEncGetEncodePresetConfig(Identifier, PresetIdentifier, Configuration));
		// NV_ENC_ERR_UNSUPPORTED_PARAM
	}
}

If this behavior is by design, it is not an innovation anymore; we’d need something new.
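A way to keep enumeration useful despite such “by design” failures is to treat the per-item query as fallible and skip offenders rather than abort. A minimal sketch of the pattern with a hypothetical fallible query standing in for the SDK call (names are mine, not the NVENC API):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical fallible per-item query: returns false the way
// NvEncGetEncodePresetConfig reports NV_ENC_ERR_UNSUPPORTED_PARAM above
using QueryItem = std::function<bool(uint32_t ItemIndex, int& Value)>;

// Enumerate all items but tolerate per-item failures: skip the offender and
// keep the rest of the enumeration usable
std::vector<int> QueryAllItems(uint32_t ItemCount, QueryItem const& Query)
{
	std::vector<int> Values;
	for(uint32_t ItemIndex = 0; ItemIndex < ItemCount; ItemIndex++)
	{
		int Value;
		if(!Query(ItemIndex, Value))
			continue; // an "internal" item failing by design; move on
		Values.push_back(Value);
	}
	return Values;
}
```

With 17 items of which the eleventh fails, this yields 16 usable configurations instead of a dead enumeration.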

State of video remoting continued

Comparing time codes is one method; getting an impression of latency by actually driving is another. Here is the Rainway Xbox One UWP application as a thin client to a desktop PC game.

And we are happy that people love it:

Whoever the engineers who wrote the core technology, the minimal-latency streaming code - wow, I am so impressed by what they've created! It's SO quick, like I'm streaming now from two computers to remote platforms, and everything is all over WiFi and latency is 9ms or less. This is giving life to some old hardware, and it's enabling me to use my computer anywhere.

UPD. A few days later:

Telegram bot to extract contents of H.264 parameter set NAL units

In continuation of the previous post about C++/WinRT and Telegram, here we are with @ParameterSetAnalyzeBot: “Your buddy to extract H.264 parameter set NAL data”. In a chat, it expects an MP4 file with an H.264 video track to be sent to it. It then extracts data from the sample description box and deciphers it into readable form:

It literally feeds the MP4 file to the Media Foundation Source Reader API, pulls MF_MT_MPEG_SEQUENCE_HEADER and pipes the data to the h264_analyze tool (my fork of it has a Visual Studio 2019 project and adds the ability to take input from stdin for piping needs).

It will probably not stay online forever, but it is live now. Be aware that Telegram limits file transmissions to 20 MB per file at the moment.


So yes, C++/WinRT is how C++ development is to be done on Windows

“Modern” C++/WinRT is the way to write rather powerful things in a compact and readable manner, mixing everything you can think of: classic C++ and libraries, UWP APIs including HTTP client and JSON, COM, the ability to put code into console/desktop applications, the async API model, and C++20 coroutines.

For example, here is a fragment of Telegram bot code that echoes a message back, written against just the bare Windows 10 SDK API set, without external libraries:

for(auto&& UpdateValue: UpdateArray) // https://core.telegram.org/bots/api#update
{
	JsonObject Update = UpdateValue.GetObject();
	const UINT64 UpdateIdentifier = static_cast<UINT64>(Update.GetNamedNumber(L"update_id"));
	m_Context.m_NextUpdateIdentifier = UpdateIdentifier + 1;
	if(Update.HasKey(L"message"))
	{
		JsonObject Message = Update.GetNamedObject(L"message");
		m_Journal.Write(
		{ 
			L"Message",
			static_cast<std::wstring>(Message.Stringify()),
		});
		const UINT64 MessageIdentifier = static_cast<UINT64>(Message.GetNamedNumber(L"message_id"));
		JsonObject FromUser = Message.GetNamedObject(L"from");
		const UINT64 FromUserIdentifier = static_cast<UINT64>(FromUser.GetNamedNumber(L"id"));
		std::wstring FromUserUsername = static_cast<std::wstring>(FromUser.GetNamedString(L"username"));
		#pragma region ACK
		JsonObject Chat = Message.GetNamedObject(L"chat");
		const UINT64 ChatIdentifier = static_cast<UINT64>(Chat.GetNamedNumber(L"id"));
		{
			std::wstring Text = Format(L"Hey, *@%ls*, I confirm message _%llu_\\. Send me a file now\\!", FromUserUsername.c_str(), MessageIdentifier);
			Uri RequestUri(static_cast<winrt::hstring>(Format(L"https://api.telegram.org/bot%ls/sendMessage", m_Configuration.m_Token.c_str())));
			JsonObject Request;
			Request.Insert(L"chat_id", JsonValue::CreateNumberValue(static_cast<DOUBLE>(ChatIdentifier)));
			Request.Insert(L"text", JsonValue::CreateStringValue(static_cast<winrt::hstring>(Text)));
			Request.Insert(L"parse_mode", JsonValue::CreateStringValue(L"MarkdownV2"));
			m_Journal.Write(
			{ 
				L"sendMessage",
				L"Request",
				static_cast<std::wstring>(Request.Stringify()),
			});
			HttpStringContent Content(Request.Stringify(), UnicodeEncoding::Utf8);
			Content.Headers().ContentType(Headers::HttpMediaTypeHeaderValue(L"application/json"));
			HttpResponseMessage ResponseMessage = Client.PostAsync(RequestUri, Content).get();
			JsonObject Response = JsonObject::Parse(ResponseMessage.Content().ReadAsStringAsync().get());
			m_Journal.Write(
			{ 
				L"sendMessage",
				L"Response",
				static_cast<std::wstring>(Response.Stringify()),
			});
			__D(Response.GetNamedBoolean(L"ok"), E_UNNAMED);
		}
		#pragma endregion
	}
}

Please count me as a fan of this.

On efficiency of hardware-assisted JPEG decoding (AMD MFT MJPEG Decoder)

The previous post focused on problems with the hardware MFT decoder provided as a part of the video driver package. This time I am going to share some data on how the inefficiency affects video capture performance, using a high frame rate 260 FPS camera as a test stand. Apparently the effect is more visible at high frame rates because CPU and GPU hardware is already fast enough to process a less complicated signal.

There is already some interest from the AMD end (why this is exceptional on its own deserves a separate post), and some bug fixes are already under way.

The performance problem is not immediately visible because the decoder overall performs without fatal issues and provides the expected output: no failures or error codes, no deadlocks, neither CPU nor GPU engine is maxed out, so things look more or less fine at first glance… The test application uses Media Foundation and the Source Reader API to read textures in hardware MFT enabled mode, then discards the textures and just prints out the frame rate.
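For reference, the per-second figures in the printouts below are straightforward to produce: count samples and report a rate once a reporting interval elapses. A minimal sketch of such a meter (my naming, not the actual test application):

```cpp
#include <cstdint>
#include <optional>

// Produces a samples-per-second figure once per reporting interval, similar to
// the "video samples per second captured" lines in the printouts below
class SampleRateMeter
{
public:
	explicit SampleRateMeter(uint64_t IntervalMilliseconds = 2000) :
		m_IntervalMilliseconds(IntervalMilliseconds)
	{
	}
	// Register a sample at the given capture time; returns a rate when a full
	// reporting interval has elapsed, std::nullopt otherwise
	std::optional<double> AddSample(uint64_t TimeMilliseconds)
	{
		m_SampleCount++;
		if(TimeMilliseconds - m_BaseTimeMilliseconds < m_IntervalMilliseconds)
			return std::nullopt;
		double const Rate = m_SampleCount * 1000.0 / (TimeMilliseconds - m_BaseTimeMilliseconds);
		m_BaseTimeMilliseconds = TimeMilliseconds; // start the next interval
		m_SampleCount = 0;
		return Rate;
	}
private:
	uint64_t m_IntervalMilliseconds;
	uint64_t m_BaseTimeMilliseconds = 0;
	uint64_t m_SampleCount = 0;
};
```

In the real application the time would come from the sample timestamps or a steady clock; here it is a plain parameter to keep the sketch testable.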

AMD MFT MJPEG Decoder

C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
 Using camera HD USB Camera
 Using adapter Radeon RX 570 Series
 Using video capture format 640x360@260.004 MFVideoFormat_MJPG
 Using hardware decoder MFT AMD MFT MJPEG Decoder
 Using video frame format 640x384@260.004 MFVideoFormat_YUY2
 72.500 video samples per second captured
 134.000 video samples per second captured
 135.000 video samples per second captured
 134.500 video samples per second captured
 135.500 video samples per second captured
 134.000 video samples per second captured
 134.000 video samples per second captured
 135.000 video samples per second captured
 134.500 video samples per second captured
 133.500 video samples per second captured
 134.000 video samples per second captured

With no sign of hitting a bottleneck, the reader process produces ~134 FPS from the video capture device.

Alax.Info MJPG Video Decoder for AMD Hardware

My replacement for the hardware decoder MFT decodes the same signal and, generally, shares a lot with AMD’s own decoder: both MFTs are built on top of the Advanced Media Framework (AMF) SDK. The driver package installs a runtime for this SDK and installs a decoder MFT which is linked against a copy of the runtime (according to an AMD representative, the statically linked copy shares the same codebase).

C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
 Using camera HD USB Camera
 Using adapter Radeon RX 570 Series
 Using video capture format 640x360@260.004 MFVideoFormat_MJPG
 Using substitute decoder Alax.Info MJPG Video Decoder for AMD Hardware
 Using video frame format 640x360@260.004 MFVideoFormat_YUY2
 74.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 260.500 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 261.000 video samples per second captured
 260.500 video samples per second captured

Similar CPU and GPU utilization levels, with a higher frame rate. Actually, with the expected frame rate, because it is the rate the camera is supposed to operate at.


Hardware accelerated JPEG video decoder MFT from AMD

Video GPU vendors (AMD, Intel, NVIDIA) ship their hardware with drivers, which in turn provide a hardware-assisted decoder for JPEG (also known as MJPG, MJPEG, and Motion JPEG) video in the form factor of a Media Foundation Transform (MFT).

JPEG is not included in the DirectX Video Acceleration (DXVA) 2.0 specification; however, the hardware carries an implementation of the decoder. A separate additional MFT is a natural way to provide OS integration.

AMD’s decoder is named “AMD MFT MJPEG Decoder” and looks weird from the start. It is marked as MFT_ENUM_FLAG_HARDWARE, which is good, but this normally assumes that the MFT is also MFT_ENUM_FLAG_ASYNCMFT, a markup the MFT lacks. AMD’s other decoder MFT, “AMD D3D11 Hardware MFT Playback Decoder”, has the same problem though.

Hardware MFTs must use the new asynchronous processing model…

Presumably the MFT behaves as a normal asynchronous MFT, and as long as this markup has no side effects with Microsoft’s software, AMD does not care about the confusion it causes to others.
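The inconsistency is easy to check for, given an MFT’s registration flags; a tiny sketch using the flag values from mfapi.h (the helper function is mine):

```cpp
#include <cstdint>

// Flag values per mfapi.h
constexpr uint32_t MFT_ENUM_FLAG_ASYNCMFT = 0x00000002;
constexpr uint32_t MFT_ENUM_FLAG_HARDWARE = 0x00000004;

// Per the documentation, a hardware MFT must use the asynchronous processing
// model, so registration is inconsistent if the flags mark it hardware
// without also marking it async
bool IsConsistentHardwareMarkup(uint32_t Flags)
{
	if(!(Flags & MFT_ENUM_FLAG_HARDWARE))
		return true; // not a hardware MFT, no requirement applies
	return (Flags & MFT_ENUM_FLAG_ASYNCMFT) != 0;
}
```

Fed the registration flags of the decoder in question, this check fails: hardware without async.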

Furthermore, the registration information for this decoder suggests that it can decode into the MFVideoFormat_NV12 video format, and sadly this is again an inaccurate promise. Despite the claim, the capability is missing, and Microsoft’s Video Processor MFT jumps in as needed to satisfy such format conversion.

These were just minor things, more or less easy to tolerate. However, a rule of thumb is that the Media Foundation glue layer provided by technology partners such as GPU vendors only satisfies minimal certification requirements, and beyond that it causes suffering and pain to anyone who wants to use it in real-world scenarios.

AMD’s take on making developers feel miserable is the way hardware-assisted JPEG decoding actually takes place.

The thread 0xc880 has exited with code 0 (0x0).
The thread 0x593c has exited with code 0 (0x0).
The thread 0xa10 has exited with code 0 (0x0).
The thread 0x92c4 has exited with code 0 (0x0).
The thread 0x9c14 has exited with code 0 (0x0).
The thread 0xa094 has exited with code 0 (0x0).
The thread 0x609c has exited with code 0 (0x0).
The thread 0x47f8 has exited with code 0 (0x0).
The thread 0xe1ec has exited with code 0 (0x0).
The thread 0x6cd4 has exited with code 0 (0x0).
The thread 0x21f4 has exited with code 0 (0x0).
The thread 0xd8f8 has exited with code 0 (0x0).
The thread 0xf80 has exited with code 0 (0x0).
The thread 0x8a90 has exited with code 0 (0x0).
The thread 0x103a4 has exited with code 0 (0x0).
The thread 0xa16c has exited with code 0 (0x0).
The thread 0x6754 has exited with code 0 (0x0).
The thread 0x9054 has exited with code 0 (0x0).
The thread 0x9fe4 has exited with code 0 (0x0).
The thread 0x12360 has exited with code 0 (0x0).
The thread 0x31f8 has exited with code 0 (0x0).
The thread 0x3214 has exited with code 0 (0x0).
The thread 0x7968 has exited with code 0 (0x0).
The thread 0xbe84 has exited with code 0 (0x0).
The thread 0x11720 has exited with code 0 (0x0).
The thread 0xde10 has exited with code 0 (0x0).
The thread 0x5848 has exited with code 0 (0x0).
The thread 0x107fc has exited with code 0 (0x0).
The thread 0x6e04 has exited with code 0 (0x0).
The thread 0x6e90 has exited with code 0 (0x0).
The thread 0x2b18 has exited with code 0 (0x0).
The thread 0xa8c0 has exited with code 0 (0x0).
The thread 0xbd08 has exited with code 0 (0x0).
The thread 0x1262c has exited with code 0 (0x0).
The thread 0x12140 has exited with code 0 (0x0).
The thread 0x8044 has exited with code 0 (0x0).
The thread 0x6208 has exited with code 0 (0x0).
The thread 0x83f8 has exited with code 0 (0x0).
The thread 0x10734 has exited with code 0 (0x0).

For whatever reason they create a thread for every processed video frame, or close to that… Resource utilization and performance are affected accordingly. Imagine processing a video feed from a high frame rate camera. The decoder itself, including its AMF runtime overhead, decodes images in a millisecond or less, but they spoiled it with absurd threading topped with other bugs.
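The sane alternative to a thread per frame is one long-lived worker thread draining a frame queue. A condensed sketch of the pattern (an int stands in for a frame; names are my own, not AMD’s code):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One long-lived worker consuming a frame queue, instead of a new thread per
// frame as the decoder in question apparently does
class FrameWorker
{
public:
	FrameWorker() : m_Thread([this] { Run(); }) { }
	~FrameWorker() { Finish(); }
	void AddFrame(int Frame)
	{
		{
			std::unique_lock<std::mutex> Lock(m_Mutex);
			m_Queue.push(Frame);
		}
		m_Condition.notify_one();
	}
	// Drain the queue and stop the worker thread
	void Finish()
	{
		{
			std::unique_lock<std::mutex> Lock(m_Mutex);
			m_Stopping = true;
		}
		m_Condition.notify_one();
		if(m_Thread.joinable())
			m_Thread.join();
	}
	std::vector<int> TakeProcessedFrames()
	{
		std::unique_lock<std::mutex> Lock(m_Mutex);
		return std::move(m_ProcessedFrames);
	}
private:
	void Run()
	{
		for(; ; )
		{
			std::unique_lock<std::mutex> Lock(m_Mutex);
			m_Condition.wait(Lock, [this] { return m_Stopping || !m_Queue.empty(); });
			if(m_Queue.empty())
				return; // stopping and fully drained
			int const Frame = m_Queue.front();
			m_Queue.pop();
			m_ProcessedFrames.push_back(Frame); // "decode" stand-in
		}
	}
	std::mutex m_Mutex;
	std::condition_variable m_Condition;
	std::queue<int> m_Queue;
	bool m_Stopping = false;
	std::vector<int> m_ProcessedFrames;
	std::thread m_Thread; // last member, so it starts after the rest is initialized
};
```

Thread creation cost is paid once rather than per frame, which matters exactly in the high frame rate scenario measured above.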

However, AMD video cards still have the hardware implementation of the codec, and this capability is also exposed via their AMF SDK.

 AMFVideoDecoderUVD_MJPEG

 Acceleration Type: AMF_ACCEL_HARDWARE
 AMF_VIDEO_DECODER_CAP_NUM_OF_STREAMS: 16 

 CodecId    AMF_VARIANT_INT64   7
 DPBSize    AMF_VARIANT_INT64   1

 NumOfStreams    AMF_VARIANT_INT64   16

 Input
 Width Range: 32 - 7,680
 Height Range: 32 - 4,320
 Vertical Alignment: 32
 Format Count: 0
 Memory Type Count: 1
 Memory Type: AMF_MEMORY_HOST Native
 Interlace Support: 1 

 Output
 Width Range: 32 - 7,680
 Height Range: 32 - 4,320
 Vertical Alignment: 32
 Format Count: 4
 Format: AMF_SURFACE_YUY2 
 Format: AMF_SURFACE_NV12 Native
 Format: AMF_SURFACE_BGRA 
 Format: AMF_SURFACE_RGBA 
 Memory Type Count: 1
 Memory Type: AMF_MEMORY_DX11 Native
 Interlace Support: 1 

I guess they will stop harassing developers once those switch from the out-of-the-box MFT to the SDK interface into their decoder. “AMD MFT MJPEG Decoder” is highly likely just a wrapper over the AMF interface; however, my guess is that the problematic part is exactly this abandoned wrapper, not the core functionality.