Demo: Webcam with YOLOv4 object detection via Microsoft Media Foundation and DirectML

As we see announcements for more and more powerful NPUs and AI support in newer consumer hardware, here is a new demo with Artificial Intelligence attached to a good old sample camera application based on the Microsoft Media Foundation Capture Engine API.

This demo is a blend of several technologies: Media Foundation, Direct3D 11, Direct3D 12, DirectML (of Windows AI), and Direct2D, all at once!

The video data never leaves the video hardware. The video is captured via Media Foundation, processed internally by the Windows Camera Frame Server service, and then shared through the Capture Engine API (this perhaps still does not really work as “shared access”, but we are used to this). The data is then converted from Direct3D 11 to Direct3D 12 and, further, to a DirectML tensor. From there, we run the YOLOv4 model (see the DirectML YOLOv4 sample) on the data while the video goes smoothly to preview presentation. As soon as the machine learning model produces predictions, we pass them to a Direct2D overlay, which draws a mask onto the video stream going to presentation.
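To picture how the preview can stay smooth while inference lags behind, here is a small simulation sketch. It is not the demo's code: it assumes that inference takes a fixed number of frame intervals, that only one inference runs at a time, and that the overlay always draws the most recent completed predictions. The function name and the frame-skipping policy are hypothetical.

```python
def simulate_overlay(frame_count, inference_frames):
    """For each preview frame, return the index of the frame whose
    predictions are overlaid (None until the first result is ready).

    Assumption: inference takes `inference_frames` frame intervals and
    frames arriving while the model is busy are skipped for inference,
    but still shown in the preview.
    """
    overlay_source = []
    latest = None       # frame index of the newest completed predictions
    in_flight = None    # frame index currently being processed by the model
    done_at = None      # frame index at which the running inference finishes

    for i in range(frame_count):
        # Pick up a finished inference, if any.
        if in_flight is not None and i >= done_at:
            latest = in_flight
            in_flight = None
        # The model is idle: feed it the current frame.
        if in_flight is None:
            in_flight = i
            done_at = i + inference_frames
        # The preview never waits; it reuses the newest predictions.
        overlay_source.append(latest)
    return overlay_source


print(simulate_overlay(7, 3))
```

Under these assumptions, a 30 fps camera combined with a model running at around 2.5 fps would correspond to `inference_frames` of about 12, so each set of boxes would stay on screen for about a dozen preview frames.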

The compiled application’s only dependency is the yolov4.weights file (you need to download it separately and place it in the same directory as the executable; there is an alternative download link); the rest is the Windows API and the application itself. As the work is handled by the GPU, Task Manager will show GPU load close to 100% while CPU load stays minimal.

The model is trained on 608 by 608 pixel images, so the video is rescaled to that size for the model input. That is, the resolution of the camera video does not matter much, except that the overlay mask is more accurate with higher resolution video (the overlay is burnt into the video stream itself). To show the recent progress in hardware capabilities, here are some new numbers:

  1. Intel® UHD Graphics 770 integrated in the Intel® Core™ i9-13900K CPU achieves around 2.5 fps for real-time video processing.
  2. AMD Radeon 780M Graphics integrated in the AMD Ryzen 9 7940HS CPU runs the processing roughly three times faster, achieving around 8.0 fps.
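For a sense of scale, the frame rates above translate into per-frame processing times and a relative speedup as follows. This is a quick arithmetic sketch that uses only the two figures quoted in the list; the helper and constant names are illustrative.

```python
def frame_time_ms(fps: float) -> float:
    """Time spent on one processed frame, in milliseconds."""
    return 1000.0 / fps


INTEL_UHD_770_FPS = 2.5  # figure quoted in the list above
AMD_780M_FPS = 8.0       # figure quoted in the list above

intel_ms = frame_time_ms(INTEL_UHD_770_FPS)
amd_ms = frame_time_ms(AMD_780M_FPS)
speedup = AMD_780M_FPS / INTEL_UHD_770_FPS

print(f"Intel UHD Graphics 770: {intel_ms:.0f} ms per frame")
print(f"AMD Radeon 780M:        {amd_ms:.0f} ms per frame")
print(f"Speedup:                {speedup:.1f}x")
```

That is 400 ms against 125 ms per frame, a ratio of 3.2.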

Intel’s CPU is still a good one, but its integrated graphics is not as strong, and it is the GPU that does the work here. The AMD Ryzen 9 7940HS also includes AMD’s dedicated XDNA AI technology. Its NPU performance is rated at up to 10 TOPS (tera operations per second), and upcoming Copilot+ PCs are expected to have 40+ TOPS, so the new hardware should be a good fit for this class of applications.
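Returning to the 608 by 608 model input described earlier: predictions come back in model coordinates and have to be scaled to the camera resolution before the overlay is drawn. Below is a minimal sketch of that mapping. It assumes the frame is simply stretched to the square input without letterboxing and that boxes are given as corner coordinates in model pixels; the demo may do either differently, and `to_frame_rect` is a hypothetical helper, not a function from the application.

```python
MODEL_SIZE = 608  # model input side in pixels, as stated in the text


def to_frame_rect(box, frame_width, frame_height, model_size=MODEL_SIZE):
    """Scale an (x0, y0, x1, y1) box from model pixels to frame pixels.

    Assumption: plain stretch resize, so x and y scale independently.
    """
    def scale(value, frame_extent):
        scaled = value * frame_extent / model_size
        # Clamp so the overlay rectangle never leaves the frame.
        return max(0.0, min(float(frame_extent), scaled))

    x0, y0, x1, y1 = box
    return (
        scale(x0, frame_width),
        scale(y0, frame_height),
        scale(x1, frame_width),
        scale(y1, frame_height),
    )


# A box covering the central half of the model input, mapped to 1920x1080.
print(to_frame_rect((152, 152, 456, 456), 1920, 1080))
```

On a 1920 by 1080 frame, one model pixel covers a bit more than three frame pixels horizontally, so the rectangle is drawn at the video's resolution even though the detection itself was computed at 608 by 608.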

Have fun!

Minimum Requirements:

  1. Windows 11 x64 (DirectML might run on Windows 10 as well, but for simplicity the demo requires Windows 11; Windows 11 ARM64 is of interest too, but I don’t have hardware at hand to check)
  2. DirectX 12 compatible video adapter, which is used across APIs; for simplicity, the demo has no option to offload to another GPU or NPU
  3. The .weights file downloaded and placed as mentioned, next to the application from the archive above
