Demo: Webcam with YOLOv4 object detection via Microsoft Media Foundation and DirectML

As we see announcements for more and more powerful NPUs and AI support in newer consumer hardware, here is a new demo with Artificial Intelligence attached to a good old sample camera application based on the Microsoft Media Foundation Capture Engine API.

This demo is a blend of several technologies: Media Foundation, Direct3D 11, Direct3D 12, DirectML (part of Windows AI), and Direct2D, all at once!

The video data never leaves the video hardware. We have the video captured via Media Foundation, internally processed by the Windows Camera Frame Server service, and then shared through the Capture Engine API (this perhaps still does not really work as truly “shared” access, but we are used to that). The data is then converted from Direct3D 11 to Direct3D 12 and, further, to a DirectML tensor. From there, we run the YOLOv4 model (see DirectML YOLOv4 sample) on the data while the video smoothly goes to preview presentation. As soon as the Machine Learning model produces predictions, we pass them to a Direct2D overlay, which attaches a mask onto the video flow going to presentation.
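To make the Direct3D 11 to Direct3D 12 hand-off more tangible, here is a minimal sketch of one common way to bridge the two APIs, via a shared NT handle; this is not the demo's actual code, and CaptureTexture and Device12 are hypothetical names for the D3D11 capture texture and the D3D12 device that also backs the DirectML device:

#include <d3d11_1.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <wil/result.h> // THROW_IF_FAILED
#include <wil/resource.h> // wil::unique_handle

using Microsoft::WRL::ComPtr;

// D3D11 side: the texture must be created shareable, e.g. with
// D3D11_RESOURCE_MISC_SHARED_NTHANDLE (a captured frame may first need
// to be copied into such a texture)
ComPtr<IDXGIResource1> DxgiResource;
THROW_IF_FAILED(CaptureTexture.As(&DxgiResource)); // CaptureTexture: ComPtr<ID3D11Texture2D>
wil::unique_handle SharedHandle;
THROW_IF_FAILED(DxgiResource->CreateSharedHandle(nullptr, DXGI_SHARED_RESOURCE_READ, nullptr, SharedHandle.put()));

// D3D12 side: open the same resource on the DirectML-backing device and
// use it as the source for the model input tensor upload
ComPtr<ID3D12Resource> Resource12;
THROW_IF_FAILED(Device12->OpenSharedHandle(SharedHandle.get(), IID_PPV_ARGS(&Resource12))); // Device12: ComPtr<ID3D12Device>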

The compiled application’s only dependency is the yolov4.weights file (you need to download it separately and place it in the same directory as the executable; there is an alternative download link); the rest is the Windows API and the application itself. As the work is handled by the GPU, Task Manager shows GPU load close to 100% while CPU load remains minimal.

The model is trained on 608×608 pixel images, and the model input is rescaled to that size accordingly. That is, the resolution of the camera video does not matter much, except that the overlay mask is more accurate with higher resolution video (the overlay is burnt into the video stream itself). To show the recent progress in hardware capabilities, here are some new numbers:

  1. Intel® UHD Graphics 770 integrated in the Intel® Core™ i9-13900K CPU achieves around 2.5 fps for real-time video processing.
  2. AMD Radeon 780M Graphics integrated in the AMD Ryzen 9 7940HS CPU runs the processing roughly three times faster, achieving around 8.0 fps.

Intel’s CPU is still a good one, but its integrated graphics is not, and it is the GPU doing the work here. The AMD Ryzen 9 7940HS also includes AMD’s dedicated XDNA AI technology. Its NPU performance is rated at up to 10 TOPS (tera operations per second), and upcoming Copilot+ PCs are expected to offer 40+ TOPS, so the new hardware should be a good fit for this class of applications.

Have fun!

Minimal Requirements:

  1. Windows 11 x64 (DirectML might run on Windows 10 as well, but for the simplicity of the demo it requires Windows 11; Windows 11 ARM64 is of certain interest too, but I don’t have hardware at hand to check)
  2. DirectX 12 compatible video adapter, which is used across the APIs; for the simplicity of the demo there is no option to offload to another GPU or NPU
  3. The yolov4.weights file downloaded and placed as mentioned, beside the application from the archive above

Demo update: Web camera video with MPEG-DASH live broadcasting, now also with HLS

Serving MPEG-DASH differs from serving HLS, but as long as you have the video packaged in ISO BMFF segments, adding an option to also expose the content as HLS (HTTP Live Streaming, RFC 8216) is not too difficult.
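For illustration, here is roughly what a media playlist for fMP4 segments looks like (segment names, durations, and the sequence number below are made up, not necessarily what the demo emits); the key piece is the EXT-X-MAP tag pointing at the same initialization segment the MPEG-DASH side already serves:

#EXTM3U
#EXT-X-VERSION:7
# fMP4 segments require a protocol version of 6 or higher
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:1841
#EXT-X-MAP:URI="init.mp4"
#EXTINF:2.000,
segment-1841.m4s
#EXTINF:2.000,
segment-1842.m4s
#EXTINF:2.000,
segment-1843.m4s

There is no EXT-X-ENDLIST tag, as the playlist describes live content and keeps sliding forward.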

Besides being able to stream the web camera signal as MPEG-DASH using http://localhost/MediaFoundationCameraToolkit/Capture/manifest.mpd, I also made it possible to use http://localhost/MediaFoundationCameraToolkit/Capture/master.m3u8 with, for example, the same Shaka Player Demo, or https://hlsjs.video-dev.org/demo/.

BTW, hls.js has even better visualization of media buffering:

Demo update: Web camera video with MPEG-DASH live broadcasting, now with MP4 export feature

Since the last demo turned out to be quite nice, here is one more addition: the ability to dump the application’s in-memory video content into an MP4 file.

As the application works and shows the video preview, it keeps an H.264/AVC version of the data in memory in the form of a sliding window. Now you can just hit F8 and have this video – the last two minutes, that is – written into an MP4 file.

What the application does is effectively this: it takes the initialization segment from the MPEG-DASH content and concatenates all the media segments the server keeps ready. This way we have an ISO BMFF media file of the fragmented MP4 flavor. It is playable, but not nicely, so the application continues and, right in memory and on the fly, re-packages this file into a standard MP4 file (there is a related bug in Windows Media Foundation which needs to be worked around, but it is what it is), using just Windows Media Foundation here and there. This whole process is instant, of course.
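To give an idea of the Media Foundation side of this, here is a minimal sketch of a transmux pass with a Source Reader and a Sink Writer; this is my reconstruction rather than the demo’s code, and InputStream is a hypothetical IMFByteStream over the in-memory fragmented MP4:

ComPtr<IMFSourceReader> Reader;
THROW_IF_FAILED(MFCreateSourceReaderFromByteStream(InputStream.Get(), nullptr, &Reader));
ComPtr<IMFMediaType> MediaType;
THROW_IF_FAILED(Reader->GetNativeMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &MediaType));

ComPtr<IMFSinkWriter> Writer;
THROW_IF_FAILED(MFCreateSinkWriterFromURL(L"Video.mp4", nullptr, nullptr, &Writer));
DWORD StreamIndex;
THROW_IF_FAILED(Writer->AddStream(MediaType.Get(), &StreamIndex));
// Same media type on input and output: no transcoding, just re-packaging
THROW_IF_FAILED(Writer->SetInputMediaType(StreamIndex, MediaType.Get(), nullptr));
THROW_IF_FAILED(Writer->BeginWriting());
for(; ; )
{
	DWORD ActualStreamIndex, Flags;
	LONGLONG Time;
	ComPtr<IMFSample> Sample;
	THROW_IF_FAILED(Reader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &ActualStreamIndex, &Flags, &Time, &Sample));
	if(Flags & MF_SOURCE_READERF_ENDOFSTREAM)
		break;
	if(Sample)
		THROW_IF_FAILED(Writer->WriteSample(StreamIndex, Sample.Get()));
}
THROW_IF_FAILED(Writer->Finalize());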

And more to this: alternatively, you can take this video snapshot even via the web server! Just request http://localhost/MediaFoundationCameraToolkit/Capture/video.mp4 and the application will do the same processing and preparation/export of the video file, but it will be delivered to the browser instead of being saved to the file system.

Have fun!

Demo: Web camera video with MPEG-DASH live broadcasting

A new series of demonstrations of what one can squeeze out of the Windows Media Foundation Capture Engine API.

This video camera capture demonstration application features a mounted MPEG-DASH (Dynamic Adaptive Streaming over HTTP) server. The concept is straightforward: during video capture, the application takes the video feed and compresses it in H.264/AVC format using GPU hardware-assisted encoding. It then retains approximately two minutes of data in memory and generates an MPEG-DASH-compatible view of this data. The view follows the dynamic manifest format specified by ISO/IEC 23009-1. The entire system is integrated with the HTTP Server API and accessible over the network.

Since it is pretty standard streaming media (just perhaps without the adaptive bitrate capability: the broadcast takes place in just one quality), the signal can be played back with something like Google’s Shaka Player. As the application keeps the last two minutes of data, you can rewind the web player to see yourself in the past… And then fast-forward yourself into the future once again.

Just Windows platform APIs, Microsoft Windows Media Foundation, and C++ code; the only external library is Windows Implementation Libraries (WIL), if that classifies as an external library at all. No FFmpeg, no GStreamer and such. No curl, no libhttpserver, or whatever web server libraries there are. That is, as simple as this:

// Formats an ISO 8601 duration, e.g. PT2.00S; NanoDuration counts
// 100 ns units (the Media Foundation time base), hence 1E7 per second
auto const ToSeconds = [] (NanoDuration const& Value, double Multiplier = 1.0) -> std::wstring
{
	return Format(L"PT%.2fS", Multiplier * Value.count() / 1E7);
};

Element Mpd(L"MPD", // ISO/IEC 23009-1:2019; Table 3 — Semantics of MPD element; 5.3.1.2 Semantics
{
	{ L"xmlns", L"urn:mpeg:dash:schema:mpd:2011" },
	//{ L"xmlns", L"xsi", L"http://www.w3.org/2001/XMLSchema-instance" },
	//{ L"xsi", L"schemaLocation", L"urn:mpeg:dash:schema:mpd:2011 http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-DASH_schema_files/DASH-MPD.xsd" },
	{ L"profiles", L"urn:mpeg:dash:profile:isoff-live:2011" },
	{ L"type", L"dynamic" },
	{ L"maxSegmentDuration", ToSeconds(2s) },
	{ L"minBufferTime", ToSeconds(4s) },
	{ L"minimumUpdatePeriod", ToSeconds(1s) },
	{ L"suggestedPresentationDelay", ToSeconds(3s) },
	{ L"availabilityStartTime", FormatDateTime(BaseLiveTime) },
	//{ L"publishTime", FormatDateTime(BaseLiveTime) },
});
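The two-minute rewind window mentioned above is what the MPD@timeShiftBufferDepth attribute expresses for a dynamic presentation; in the same style it would be just one more entry in the attribute list (whether the demo’s manifest spells it exactly like this is my assumption):

	{ L"timeShiftBufferDepth", ToSeconds(120s) },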

The video is compressed once as the capture process goes, and the application is integrated with the native HTTP web server, so the whole thing is pretty scalable: connect multiple clients and it is fine; the application mostly provides a view into the H.264/AVC data temporarily kept in memory within the retention window. For the same reason, resource consumption of the solution is what you would expect it to be. The playback clients do not even have to play the same historical part of the content:

So okay well, this demo opens a path to next steps down the road: audio, DRM, an HLS version, low latency variants such as LL-HLS, MPEG-DASH segment sequence representations.

So just have the webcam video capture application working, and open the MPEG-DASH manifest http://localhost/MediaFoundationCameraToolkit/Capture/manifest.mpd with https://shaka-player-demo.appspot.com/ using the “Custom Content” option.

Note that the application requires elevated (administrative) access in order to use the HTTP Server API capabilities (AFAIR it is possible to do it another way, but you don’t need that this time).
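For reference, a minimal sketch of the HTTP Server API v2 registration involved (my illustration, not the demo’s code; error handling via WIL’s THROW_IF_WIN32_ERROR); it is the URL registration that fails without elevation, unless the namespace has been reserved upfront, e.g. with netsh http add urlacl:

#include <http.h> // HTTP Server API; link against httpapi.lib

HTTPAPI_VERSION Version = HTTPAPI_VERSION_2;
THROW_IF_WIN32_ERROR(HttpInitialize(Version, HTTP_INITIALIZE_SERVER, nullptr));
HTTP_SERVER_SESSION_ID SessionId = 0;
THROW_IF_WIN32_ERROR(HttpCreateServerSession(Version, &SessionId, 0));
HTTP_URL_GROUP_ID UrlGroupId = 0;
THROW_IF_WIN32_ERROR(HttpCreateUrlGroup(SessionId, &UrlGroupId, 0));
// This call is the one that needs elevation (or a prior URL ACL reservation)
THROW_IF_WIN32_ERROR(HttpAddUrlToUrlGroup(UrlGroupId, L"http://+:80/MediaFoundationCameraToolkit/Capture/", 0, 0));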

The application, while doing video capture, rendering the 1920×1080@30 stream to the user interface, teeing the signal into additional processing, doing hardware-assisted video encoding, packaging, and serving MPEG-DASH content, does not take too many resources: it is just something that makes good sense.

Oh, and one can also use standard C# tooling to display this sort of video signal; here we go with the standard PlayReady C# Sample with a XAML MediaElement inside:

Demo: Live camera video with Microsoft’s Video Stabilization Effect MFT

In continuation of the camera demos, here is one more build, with Microsoft’s Video Stabilization MFT.

In the context of the Capture Engine application, with the MFT used as an effect, it runs in its default configuration, in particular without the explicit low latency mode. This creates a noticeable delay in the video transmission. Still, it is what it is: the effect passes the video feed through.

It is hardware accelerated, though, and apparently well suited for real-time video processing.
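For reference, attaching the effect to the Capture Engine boils down to something like the following sketch; this is my reconstruction rather than the demo’s code, CaptureEngine stands for the application’s IMFCaptureEngine, and CLSID_CMSVideoDSPMFT is the class documented for the Video Stabilization MFT:

#include <mfcaptureengine.h>
#include <wmcodecdsp.h> // CLSID_CMSVideoDSPMFT

ComPtr<IMFCaptureSource> CaptureSource;
THROW_IF_FAILED(CaptureEngine->GetSource(&CaptureSource));
ComPtr<IUnknown> Transform;
THROW_IF_FAILED(CoCreateInstance(CLSID_CMSVideoDSPMFT, nullptr, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&Transform)));
// The effect goes in with defaults; an explicit low latency configuration
// (e.g. via the MF_LOW_LATENCY attribute) is not attempted in this demo
THROW_IF_FAILED(CaptureSource->AddEffect((DWORD) MF_CAPTURE_ENGINE_PREFERRED_SOURCE_VIDEO_STREAM_FOR_PREVIEW, Transform.Get()));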