Heterogeneous Media Foundation pipelines

Just a small test here to show the use of multiple GPUs within a single Media Foundation pipeline. The initial idea is pretty simple: quite a few systems are equipped with multiple GPUs, and some have a “free” onboard Intel GPU idling in the presence of a regular video card. Some other systems have an integrated “iGPU” and a discrete “dGPU” seamlessly blended by DXGI.

The Media Foundation API does not offer any specific feature set to leverage multiple GPUs at a time, but it is certainly possible to take advantage of them.

The application creates 20-second-long video clips by combining GPUs: one GPU is used for video rendering and another hosts hardware H.264 video encoding. No system memory is used for uncompressed video: system memory first comes into play to receive the encoded H.264 bitstream. The Media Foundation pipeline is hence:

  • Media Source generating video frames off its video stream using the first GPU
  • Transform to combine multiple GPUs
  • H.264 video encoder transform specific to second GPU
  • Stock MP4 Media Sink

The pipeline runs in a single media session, pretty much like a normal pipeline. Media Foundation is designed in such a way that primitives do not have to be aligned with the rest of the pipeline in their GPU usage. Of course they have to share devices and textures so that they can all operate together, but the pipeline itself does not impose many limitations there.

Microsoft Windows [Version 10.0.18363.815]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Temp>HeterogeneousRecordFile.exe
HeterogeneousRecordFile.exe 20200502.1-11-gd3d16d5 (Release)
d3d16d51e7f2a098c5765d445714f14051c7a68d
HEAD -> master, origin/master
2020-05-09 23:51:54 +0300
--
Found 2 DXGI adapters
Trying heterogeneous configurations…
Using NVIDIA GeForce GTX 1650 to render video and Intel(R) HD Graphics 4600 to encode the content
Using filename HeterogeneousRecordFile-20200509-235406.mp4 for recording
Using Intel(R) HD Graphics 4600 to render video and NVIDIA GeForce GTX 1650 to encode the content
Using filename HeterogeneousRecordFile-20200509-235411.mp4 for recording
Trying trivial configuration with loopback data transfer…
Using NVIDIA GeForce GTX 1650 to render video and NVIDIA GeForce GTX 1650 to encode the content
Using filename HeterogeneousRecordFile-20200509-235416.mp4 for recording
Using Intel(R) HD Graphics 4600 to render video and Intel(R) HD Graphics 4600 to encode the content
Using filename HeterogeneousRecordFile-20200509-235419.mp4 for recording
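
The order of runs in the console output above (every pair of different adapters first, then each adapter paired with itself) can be sketched in portable C++. This is only an illustration of the enumeration, not the application's code: `Configuration`, `HeterogeneousConfigurations` and `LoopbackConfigurations` are hypothetical names, and adapters are represented by their description strings rather than by DXGI objects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One recording run: which adapter renders and which adapter encodes.
// Hypothetical type, for illustration only.
struct Configuration {
    std::string renderAdapter;
    std::string encodeAdapter;
};

// Heterogeneous runs: every ordered pair of two different adapters.
std::vector<Configuration> HeterogeneousConfigurations(const std::vector<std::string>& adapters) {
    std::vector<Configuration> result;
    for (std::size_t i = 0; i < adapters.size(); ++i)
        for (std::size_t j = 0; j < adapters.size(); ++j)
            if (i != j)
                result.push_back({adapters[i], adapters[j]});
    // All n * n ordered pairs minus the n pairs where both adapters are the same.
    assert(result.size() + adapters.size() == adapters.size() * adapters.size());
    return result;
}

// Trivial (loopback) runs: the same adapter both renders and encodes.
std::vector<Configuration> LoopbackConfigurations(const std::vector<std::string>& adapters) {
    std::vector<Configuration> result;
    for (const std::string& adapter : adapters)
        result.push_back({adapter, adapter});
    return result;
}
```

For the two adapters in the log this yields two heterogeneous runs followed by two loopback runs, in the same order as printed.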

This is just a simple use case, and I believe there can be others: GPUs are pretty powerful for certain specific tasks, and they are also equipped with video-specific ASICs.

Download links

Binaries:

A readable version of HelloDirectML sample

So it came to my attention that there is a new API in the DirectX family: Direct Machine Learning (DirectML).

Direct Machine Learning (DirectML) is a low-level API for machine learning. It has a familiar (native C++, nano-COM) programming interface and workflow in the style of DirectX 12. You can integrate machine learning inferencing workloads into your game, engine, middleware, backend, or other application. DirectML is supported by all DirectX 12-compatible hardware.

You might want to check out this introduction video if you are interested:

I updated the HelloDirectML code and restructured it to be readable and easy to comprehend. In my variant I have two operators, addition and multiplication, following one another with a UAV resource barrier in between. The code does (1.5 * 2) ^ 2 math in tensor space.
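
To make the arithmetic concrete, here is a CPU-only stand-in for the two chained operators in plain C++. It does not use the DirectML API at all. It assumes that the addition adds the input tensor to itself (1.5 + 1.5 = 1.5 * 2) and that the multiplication multiplies that result by itself; the `Tensor` alias and the function names are hypothetical, and the element count in the usage note is likewise an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A flat buffer of floats stands in for a GPU tensor in this sketch.
using Tensor = std::vector<float>;

// Element-wise addition: CPU stand-in for the first operator.
Tensor Add(const Tensor& a, const Tensor& b) {
    assert(a.size() == b.size());
    Tensor out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] + b[i];
    return out;
}

// Element-wise multiplication: CPU stand-in for the second operator.
Tensor Multiply(const Tensor& a, const Tensor& b) {
    assert(a.size() == b.size());
    Tensor out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i];
    return out;
}

// Chains the two operators: sum = x + x, then result = sum * sum.
// On the GPU the second operator must wait for the first one's output,
// which is the role the UAV resource barrier plays between the dispatches.
Tensor AddThenSquare(const Tensor& x) {
    const Tensor sum = Add(x, x);
    return Multiply(sum, sum);
}
```

Calling `AddThenSquare(Tensor(24, 1.5f))` gives a tensor where every element is 9, that is (1.5 * 2) ^ 2.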

Here is my fork with the updated HelloDirectML; the top-level code with the tensor math takes fewer than 150 lines, starting here. If you are a fan of spaghetti style (according to Wiki, what I prefer appears to be referred to as “Ravioli code”), the original sample is there.

Hardware video encoding latency with NVIDIA GeForce GTX 1080 Ti

To complete the set of posts [1, 2, 3] on hardware video encoding at the lowest latency settings, I am sharing the juiciest part and the application for NVIDIA NVENC. I did not have a 20 series card at hand to run the measurements, and I hope the table below for the GeForce GTX 1080 Ti is eloquent enough.

It is sort of awkward to put the GTX 1080 Ti numbers (latency in milliseconds for every video frame sent to encoding) side by side with those of AMD products, at least those I had a chance to check out, so here we go with GeForce GTX 1080 Ti vs. GeForce GTX 1650:

Well, that’s fast, and the GeForce 10 series was released in 2016.

The numbers show that NVIDIA cards are powerful enough for game experience remoting (what you use Rainway for) in a wide range of video modes, including high frame rates of 144 and up.

I also added 640×360@260 just because I have a real camera (an inexpensive one, with a USB 2.0 connection) operating in this mode with high frame rate capture: the numbers suggest that it is generally possible to remote a high frame rate video signal at blink-of-an-eye speed.
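
As a rough yardstick for reading per-frame latency numbers against such frame rates, one can compare the latency to the time between two consecutive frames. The sketch below is an assumption about how to frame that comparison, not something taken from the measurement app, and both function names are hypothetical. An encoder that works on several frames at once may still keep up when its latency exceeds one frame interval, so the check is conservative.

```cpp
#include <cassert>

// Time between two consecutive frames, in milliseconds, at a given frame rate.
double FrameBudgetMs(double framesPerSecond) {
    assert(framesPerSecond > 0.0);
    return 1000.0 / framesPerSecond;
}

// True when a measured per-frame latency fits within a single frame interval.
bool FitsFrameInterval(double latencyMs, double framesPerSecond) {
    return latencyMs <= FrameBudgetMs(framesPerSecond);
}
```

At 144 frames per second the interval is about 6.9 ms, and at 260 frames per second it is about 3.8 ms.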

There might be many aspects to compare when choosing between AMD and NVIDIA products, but when it comes to video streaming, low latency video compression and hardware assisted video compression in general, the situation is pretty clear: just grab an NVIDIA thing and do not do what I did when I put an AMD Radeon RX 570 Series video card into my primary development system. I thought maybe at that time AMD had something cool.

So, here goes the app for NVIDIA hardware.

Download links

Binaries:

  • 64-bit: NvcEncode.exe (in .ZIP archive)
  • License: This software is free to use

AMD Radeon RX 570 Series video encoders compared to a couple of NVIDIA pieces of hardware

Continuing the previous posts on AMD AMF encoder latency at faster settings, here is a side-by-side comparison to NVIDIA encoders.

The numbers show how different the encoders are even though they do something similar. The NVIDIA cards are not high end: the GTX 1650 is literally the cheapest product in the Turing 16xx series, and the GeForce 700 series was released in 2013 (OK, the GTX 750 was midrange at that time).

The numbers are milliseconds of encoding latency per video frame.

In 2013 an NVIDIA card was already capable of real-time NVENC hardware encoding of video content in 4K resolution at 60 frames per second, and four years later the RX 570 was released with a significantly less powerful encoder.

The encoder of the GTX 1650 is much slower than that of the GTX 1080 Ti (application and data to come later), but it is still powerful enough to cover a wide range of video streams, including ones that are quite intensive in computation and bandwidth.

The GTX 750 vs. GTX 1650 comparison also shows that even though the two belong to very different microarchitectures (Maxwell and Turing respectively, with Pascal between them), there is no direct relationship where the much newer card is superior to the older one in every way. When it comes to real-time performance, the vendors design things to be just good enough.

Video encoders of Radeon RX Vega M GH Graphics

A follow-up observation about the encoders of the Radeon RX Vega M GH Graphics in the Hades Canyon NUC and the measurement app from the previous post:

The side-by-side comparison with the desktop RX 570 card shows a few interesting things:

  1. Radeon RX Vega M GH Graphics has the latest driver, but the AMF runtime version is way behind the latest: 1.4.11, that is, the system does not receive timely updates (and overall it is already discontinued)
  2. Even though some performance tuning might be coming from AMF updates, H.265/HEVC encoder performance suggests that the circuitry is pretty much the same, and the HEVC encoder numbers are close
  3. The embedded version of the H.264 encoder is limited to nicely supporting 1920×1080@60 with reasonable headroom, and it is assumed that higher resolutions are to be handled by the next-gen codec; yet the desktop version received an improved H.264 encoder to cover real-time processing of 2560×1440@60 and 3840×2160@30
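
To put the three video modes from the last item on one scale, a simple pixels-per-second figure can be computed. Treating pixel rate as a proxy for encoder load is a simplifying assumption, `VideoMode` and `PixelsPerSecond` are hypothetical names, and the last mode is taken to be 3840×2160, the standard 4K UHD frame size.

```cpp
#include <cstdint>

// A video mode as used in this post: frame size and frame rate.
struct VideoMode {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t framesPerSecond;
};

// Raw pixel rate of a mode, used here as a rough proxy for encoder load.
constexpr std::uint64_t PixelsPerSecond(const VideoMode& mode) {
    return static_cast<std::uint64_t>(mode.width) * mode.height * mode.framesPerSecond;
}

// Worked values for the three modes, checked at compile time.
static_assert(PixelsPerSecond(VideoMode{1920, 1080, 60}) == 124416000, "1920x1080@60");
static_assert(PixelsPerSecond(VideoMode{2560, 1440, 60}) == 221184000, "2560x1440@60");
static_assert(PixelsPerSecond(VideoMode{3840, 2160, 30}) == 248832000, "3840x2160@30");
```

By this measure the two higher modes need roughly 1.8 and 2 times the pixel throughput of 1920×1080@60.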

How fast is your AMD H.264 and H.265/HEVC GPU encoder?

Just a small tool here to try a few popular resolutions and measure video encoding latency. The encoder runs in a configuration that addresses the needs of real-time encoding, with a speed-over-quality setup.

Note that the performance might be affected by side load, such as a graphics application (I often use this application for my needs, with parameters that produce higher or lower GPU load). Also, the application itself uses Direct2D to generate the actual input video frames, so this activity also has a certain impact, presumably low enough because the operations are primitive, yet still.

The main point here is to measure the latency in the first place for a particular piece of hardware, to see how things possibly improve with driver updates, how the codecs compare to one another, and what the effect of the resolution choice is. Also, the question is whether the encoder is fast enough to process data in real time at all.

The application keeps drawing a simple chart, and then the same data is fed into the encoder. The application writes the raw elementary stream into .H264 and .H265 files respectively (use Media Player Classic to play them back) and also saves the last frame as a .PNG file.
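
For readers who want to collect similar per-frame numbers themselves, here is a minimal timing loop in standard C++. It is a sketch under simplifying assumptions rather than the tool's implementation: `MeasureLatenciesMs` is a hypothetical helper, and the `encodeFrame` callback is assumed to return only when the encoded frame is available, whereas a real hardware encoder is typically asynchronous.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Calls encodeFrame once per frame and records how long each call took,
// in milliseconds. The callback is a placeholder for whatever submits a
// frame and waits for its encoded output.
std::vector<double> MeasureLatenciesMs(int frameCount, const std::function<void(int)>& encodeFrame) {
    assert(frameCount >= 0);
    std::vector<double> latencies;
    latencies.reserve(static_cast<std::size_t>(frameCount));
    for (int frameIndex = 0; frameIndex < frameCount; ++frameIndex) {
        // steady_clock is monotonic, so wall clock adjustments do not skew the result.
        const auto start = std::chrono::steady_clock::now();
        encodeFrame(frameIndex);
        const auto stop = std::chrono::steady_clock::now();
        latencies.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return latencies;
}
```

The returned vector holds one latency value per frame, which can then be averaged or charted per resolution and codec.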
