Presenting Shadertoy output at low latency with DXGI and Direct3D 11

A few nice updates today for the Direct3D 11 Shadertoy rendering tool I posted here earlier. As a load tool it benefits from some flexibility, and it also serves as a demonstration of hardware capabilities.

First of all, the -EnumerateAdapters switch is now more verbose about DXGI and prints more flags, e.g.:

Factory: DXGI_FEATURE_PRESENT_ALLOW_TEARING 1
 Adapter 0: Radeon RX 570 Series, Vendor 0x1002, Adapter 0.0000CE52, Flags DXGI_ADAPTER_FLAG3_ACG_COMPATIBLE | DXGI_ADAPTER_FLAG3_SUPPORT_MONITORED_FENCES | DXGI_ADAPTER_FLAG3_KEYED_MUTEX_CONFORMANCE
   Output 0: \\.\DISPLAY4, BitsPerColor 10, ColorSpace DXGI_COLOR_SPACE_RGB_FULL_G22_NONE_P709, HardwareCompositionSupport DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_FULLSCREEN | DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_CURSOR_STRETCHED
   Output 1: \\.\DISPLAY5, BitsPerColor 10, ColorSpace DXGI_COLOR_SPACE_RGB_FULL_G22_NONE_P709, HardwareCompositionSupport DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_FULLSCREEN | DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_CURSOR_STRETCHED
 Adapter 1: Intel(R) UHD Graphics 630, Vendor 0x8086, Adapter 0.0000C9B9, Flags DXGI_ADAPTER_FLAG3_ACG_COMPATIBLE | DXGI_ADAPTER_FLAG3_SUPPORT_MONITORED_FENCES | DXGI_ADAPTER_FLAG3_KEYED_MUTEX_CONFORMANCE
 Adapter 2: Microsoft Basic Render Driver, Vendor 0x1414, Adapter 0.0000CE2B, Flags DXGI_ADAPTER_FLAG3_SOFTWARE | DXGI_ADAPTER_FLAG3_ACG_COMPATIBLE | DXGI_ADAPTER_FLAG3_SUPPORT_MONITORED_FENCES | DXGI_ADAPTER_FLAG3_KEYED_MUTEX_CONFORMANCE

I also added a few more Shadertoy options, selectable with the -ShaderIndex switch and values 0..3. The value 3 picks an HLSL version of Neon Tunnel by alro. Apart from being cool, this shader is really simple and lightweight, so it enables high FPS rendering while still showing smooth, perceptible motion.

The new switch -OutputMode 1 enables a text overlay in the top left corner. It is implemented using Direct2D with DirectWrite, interoperating with Direct3D 11 to print text over the graphics before frames are sent to presentation.

Even with shader rendering, overlay interop and a really simple single-threaded design, the process reaches pretty high frame rates.

The tool now implements better support for variable refresh rate monitors and low latency presentation. The latency in windowed mode falls below two frame sync intervals (on Radeon RX 570 Series; use PresentMon to collect statistics):

(DXGI): SyncInterval=0 Flags=512 1.81 ms/frame (551.3 fps, 59.8 fps displayed, 25.39 ms latency) Composed: Flip

With a 144 Hz monitor, it can still be under 2/144 of a second:

(DXGI): SyncInterval=0 Flags=512 0.68 ms/frame (1469.1 fps, 131.3 fps displayed, 13.45 ms latency) Composed: Flip  
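As a quick sanity check of the "two frame intervals" observation, the latency figures from the two PresentMon lines above can be expressed in units of the display refresh interval. This is only a rough sketch: the helper name is made up for illustration, and 60 Hz for the first monitor is an assumption based on the 59.8 fps displayed.

```python
def frame_intervals(latency_ms, refresh_hz):
    """Express a latency in units of display refresh intervals."""
    interval_ms = 1000.0 / refresh_hz
    return latency_ms / interval_ms

# Latency values come from the PresentMon lines above.
# 60 Hz is an assumed refresh rate for the first monitor.
for latency_ms, refresh_hz in ((25.39, 60.0), (13.45, 144.0)):
    ratio = frame_intervals(latency_ms, refresh_hz)
    print(f"{latency_ms} ms at {refresh_hz:g} Hz is {ratio:.2f} refresh intervals")
```

Both ratios come out below 2, which matches the statement about windowed mode.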

In fullscreen mode, direct flipping results in really low latency of just a few milliseconds.

(DXGI): SyncInterval=0 Flags=0 1.82 ms/frame (550.9 fps, 550.9 fps displayed, 3.57 ms latency) Hardware Composed: Independent Flip
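The Flags value in the PresentMon lines above is the numeric bitmask passed to Present. A small sketch of decoding it is below; the constant values are the DXGI_PRESENT_* values from dxgi.h, while the dictionary and function names are made up for illustration.

```python
# DXGI_PRESENT_* bit values as defined in dxgi.h
DXGI_PRESENT_FLAGS = {
    0x001: "DXGI_PRESENT_TEST",
    0x002: "DXGI_PRESENT_DO_NOT_SEQUENCE",
    0x004: "DXGI_PRESENT_RESTART",
    0x008: "DXGI_PRESENT_DO_NOT_WAIT",
    0x010: "DXGI_PRESENT_STEREO_PREFER_RIGHT",
    0x020: "DXGI_PRESENT_STEREO_TEMPORARY_MONO",
    0x040: "DXGI_PRESENT_RESTRICT_TO_OUTPUT",
    0x100: "DXGI_PRESENT_USE_DURATION",
    0x200: "DXGI_PRESENT_ALLOW_TEARING",
}

def decode_present_flags(value):
    """Turn a numeric Present flags value into a readable string."""
    names = [name for bit, name in DXGI_PRESENT_FLAGS.items() if value & bit]
    return " | ".join(names) if names else "0"

print(decode_present_flags(512))  # windowed mode lines above
print(decode_present_flags(0))    # fullscreen line above
```

So Flags=512 in the windowed runs is DXGI_PRESENT_ALLOW_TEARING, while the fullscreen run presents with no flags.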

Using the -WaitableObject 3 switch in windowed mode enables DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT and shows that the single-threaded rendering loop indeed spends some time sleeping, with a corresponding reduction in GPU engine load and frame rate.

Have fun!

Download links

Binaries:

  • 64-bit: RenderSc.exe (in .7z archive)
  • License: This software is free to use

Sakura Bliss version of Direct3D 11 rendering and presentation load tool

Today I am sharing another tool, which I use from time to time to simulate load (for example, against video/desktop streaming through Rainway).

The application renders an HLSL variant of the Sakura Bliss shadertoy (by Philippe Desgranges) using a Direct3D 11 swap chain in DXGI_SWAP_EFFECT_FLIP_DISCARD mode.

For starters, the shader is mesmerizing on its own, fantastic work!

Just as in the case of the earlier application with another shadertoy, the HLSL source code can be extracted from the application resources, and the shaders are compiled at runtime.

The application offers some important command line switches to configure the workload as needed.

Syntax: RenderSc [options]

Options: 
  -DisableDebugOutput - Disable forward of debug output to console output in debug mode (should appear before -Debug)
  -Debug[:<Normal|Full|<Value>>] - Enable self-debugging capability with specific minidump type
  -EnumerateAdapters - Enumerate DXGI adapters and exit
  -AdapterIndex <index> - Specify DXGI adapter index (default is 0)
  -OutputIndex <index> - Specify DXGI output index (default is 0)
  -Resolution <width> <height> - Specify resolution of generated video (default is 1920 x 1080)
  -Format <format> - Specify DXGI format (b8g8r8a8, r8g8b8a8, r10g10b10a2, r16g16b16a16; default is b8g8r8a8)
  -SwapChainBufferCount <count> - Specify DXGI swapchain buffer count (default is 2)
  -Fullscreen - Start in full screen mode (otherwise use Alt+Enter to switch)
  -Rate <numerator> <denominator> - Specify frame rate for fullscreen mode (default is 144 Hz)
  -PresentSyncInterval <interval> - Use specific presentation synchronization interval (default is 0)

Full-screen mode can be requested from command line as well as enabled or disabled by Alt+Enter.

It is possible to configure some important parameters, which you should be aware of from the MSDN documentation on DXGI and Direct3D 11. One specific thing to mention is that it is possible to request the DXGI_FORMAT_R10G10B10A2_UNORM and DXGI_FORMAT_R16G16B16A16_FLOAT formats. To reduce the amount of rendering, the -PresentSyncInterval 1 parameter can be used: it defines the first argument to the IDXGISwapChain::Present call.
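To illustrate why a non-zero sync interval reduces the amount of rendering: with sync interval n, presentation is synchronized to the n-th vertical blank, so a single rendering loop cannot present more than refresh rate / n frames per second. A rough sketch of that arithmetic follows; the function name is hypothetical, and actual rates also depend on composition mode and the display.

```python
def max_present_rate(refresh_hz, sync_interval):
    """Upper bound on presented frames per second for a sync interval.

    Returns None for interval 0, where presentation is not tied to
    vertical blanks and the rate is limited by rendering speed instead.
    """
    if not 0 <= sync_interval <= 4:
        raise ValueError("Present accepts sync intervals 0 through 4")
    if sync_interval == 0:
        return None
    return refresh_hz / sync_interval

for interval in range(0, 5):
    print(interval, max_present_rate(144.0, interval))
```

With the tool's default interval of 0 the loop runs as fast as the GPU allows, which is what the load scenario usually wants.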

Download links

Binaries:

  • 64-bit: RenderSc.exe (in .7z archive)
  • License: This software is free to use

A few interesting observations about NVIDIA Turing video encoders

| GPU | NVIDIA GeForce GTX 1080 Ti | NVIDIA GeForce GTX 1660 Ti |
|---|---|---|
| Microarchitecture | Pascal | Turing |
| H.264 NV_ENC_CAPS_SUPPORT_FIELD_ENCODING | Yes | No |
| H.264 NV_ENC_PRESET_HP_GUID entropyCodingMode | NV_ENC_H264_ENTROPY_CODING_MODE_CAVLC | NV_ENC_H264_ENTROPY_CODING_MODE_CABAC |
| H.265/HEVC NV_ENC_CAPS_NUM_MAX_BFRAMES | 0 | 5 |
| H.265/HEVC NV_ENC_CAPS_SUPPORT_TEMPORAL_AQ | No | Yes |

Apart from the capabilities, a whitepaper mentions these H.265/HEVC improvements:

Turing GPUs also ship with an enhanced NVENC encoder unit that adds support for H.265 (HEVC) 8K encode at 30 fps. The new NVENC encoder provides up to 25% bitrate savings for HEVC and up to 15% bitrate savings for H.264.

Media Foundation incorrectly reports resolution for H.265/HEVC video tracks

Another problem (bug) with the Microsoft Media Foundation MPEG-4 Media Source H.265/HEVC handler is that it ignores conformance_window_flag and the related offset values from H.265’s seq_parameter_set_rbsp (see H.265 spec, F.7.3.2.2.1 General sequence parameter set RBSP syntax).

The problem might or might not be limited to fragmented MP4 variants.

It is overall questionable whether it has been a good idea to report video stream properties using parameter set data. This is not necessarily bad, especially if it had been accurately documented in the first place. Apparently it raises certain issues from time to time, like this one: Media Foundation and Windows Explorer reporting incorrect video resolution, 2560×1440 instead of 1920×1080. Perhaps every other piece of software and library does not take the trouble to parse the bitstream and simply forwards values from the tkhd and/or stsd boxes, and why not?

This is not the case with Media Foundation primitives, which shake the properties out of bitstreams and their parameter sets. Of course, there is no problem if the values match one another throughout the file.

A bigger problem, however, is that when parsing the H.265/HEVC bitstream the media source handler fails to take the cropping window into account… Seriously!

conformance_window_flag equal to 1 indicates that the conformance cropping window offset parameters follow next in the SPS. conformance_window_flag equal to 0 indicates that the conformance cropping window offset parameters are not present.

The popular resolution of 1920×1080, when encoded in 16×16 macroblocks, effectively consists of 120×68 blocks with 1088 luma samples in height. The height of 1080 is obtained by cropping the 1088 rows at the top, the bottom, or both. By ignoring the cropping, Microsoft’s handler misreports the video size as 1920×1088 even if all other parts of the video file have the correct value of 1080.
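The block arithmetic and the effect of the conformance window can be sketched as follows. Per the H.265 spec, the conformance window offsets are expressed in chroma sample units, so for 4:2:0 video each unit removes two luma rows. The function names and the particular offsets (0 at the top, 4 at the bottom) are assumptions for illustration, not values read from the file discussed here.

```python
def coded_size(width, height, block=16):
    """Round display dimensions up to a whole number of coding blocks."""
    blocks_w = -(-width // block)   # ceiling division
    blocks_h = -(-height // block)
    return blocks_w * block, blocks_h * block

def cropped_height(coded_height, top_offset, bottom_offset, sub_height_c=2):
    """Apply HEVC conformance window offsets to the coded height.

    Offsets are in chroma sample units; sub_height_c is 2 for 4:2:0.
    """
    return coded_height - sub_height_c * (top_offset + bottom_offset)

coded_w, coded_h = coded_size(1920, 1080)
print(coded_w // 16, coded_h // 16, coded_h)                          # blocks and coded height
print(cropped_height(coded_h, top_offset=0, bottom_offset=4))         # assumed offsets
```

A handler that stops at the coded height reports 1088, while applying the cropping window restores the intended 1080.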

1920×1080 HEVC (meaning it does not play in every browser – beware and use Edge)
 MF_MT_MAJOR_TYPE, vValue {73646976-0000-0010-8000-00AA00389B71} (Type VT_CLSID, MFMediaType_Video, FourCC vids)
 MF_MT_SUBTYPE, vValue {43564548-0000-0010-8000-00AA00389B71} (Type VT_CLSID, MFVideoFormat_HEVC, FourCC HEVC)
 MF_MT_VIDEO_PROFILE, vValue 1 (Type VT_UI4)
 MF_MT_VIDEO_LEVEL, vValue 123 (Type VT_UI4)
 MF_MT_FRAME_SIZE, vValue 8246337209408 (Type VT_UI8, 1920x1088)
 MF_MT_INTERLACE_MODE, vValue 7 (Type VT_UI4)
 MF_MT_FRAME_RATE, vValue 65970697666816 (Type VT_UI8, 15360/256, 60.000)
 MF_MT_AVG_BITRATE, vValue 41976 (Type VT_UI4)
 MF_MT_MPEG4_CURRENT_SAMPLE_ENTRY, vValue 0 (Type VT_UI4)
 MF_MT_MPEG4_SAMPLE_DESCRIPTION, vValue 00 00 00 D1 68 76 63 31 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 80 04 38 00 00 00 48 00 00 00 48 00 00 00 00 00 01 0B 48 45 56 43 20 43 6F 64 69 6E 67 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 18 FF FF 00 00 00 7B 68 76 63 43 01 01 00 00 00 40 00 B0 00 00 00 00 7B F0 F0 FC FD F8 F8 3C 00 0B 03 A0 00 01 00 17 40 01 0C 01 FF FF… (Type VT_VECTOR | VT_UI1)
 MF_MT_VIDEO_ROTATION, vValue 0 (Type VT_UI4)
 MF_NALU_LENGTH_SET, vValue 1 (Type VT_UI4)
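The mismatch is visible in the dump itself. MF_MT_FRAME_SIZE and MF_MT_FRAME_RATE pack two 32-bit values into one 64-bit integer, and the MF_MT_MPEG4_SAMPLE_DESCRIPTION blob starts with the MP4 visual sample entry, which carries its own width and height. A minimal sketch of unpacking both, using only the values and leading bytes printed above (the helper name is made up for illustration):

```python
def unpack_pair(value):
    """Split a 64-bit attribute into its high and low 32-bit halves."""
    return value >> 32, value & 0xFFFFFFFF

frame_size = unpack_pair(8246337209408)    # MF_MT_FRAME_SIZE
frame_rate = unpack_pair(65970697666816)   # MF_MT_FRAME_RATE

# First 36 bytes of MF_MT_MPEG4_SAMPLE_DESCRIPTION from the dump above.
entry = bytes.fromhex(
    "000000D1" "68766331"          # box size, box type
    "000000000000" "0001"          # reserved, data_reference_index
    + "00" * 16 +                  # pre_defined / reserved fields
    "0780" "0438"                  # width, height
)
codec = entry[4:8].decode("ascii")
width = int.from_bytes(entry[32:34], "big")
height = int.from_bytes(entry[34:36], "big")

print("media type:", frame_size, frame_rate)
print("sample entry:", codec, width, height)
```

The sample entry says 1920 by 1080, while the media type derived from the bitstream says 1920 by 1088.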

CleanPoint markup fun with a fragmented MP4 file and Media Foundation MPEG-4 Source

The Media Foundation MPEG-4 Source stubbornly keeps marking the second video sample with the MFSampleExtension_CleanPoint flag even though nothing suggests that the video frame is an IDR frame.

The actual video frame is a P frame both in terms of MP4 box formatting and contained NAL units (the video is in fact an “infinite GOP” flavor of recording where all frames are P frames except the very first IDR one).

The problem is specific to fragmented MP4 files (and maybe even a subset of those); however, it is pretty much consistent and shows up with both H.264 and H.265/HEVC video.
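One way to check a sample independently of the CleanPoint flag is to walk its length-prefixed NAL units and look at the NAL unit types. The sketch below assumes 4-byte length prefixes and uses tiny synthetic payloads rather than data from the actual file; the function names are hypothetical.

```python
def nal_unit_types(sample, codec, length_size=4):
    """Return nal_unit_type of each length-prefixed NAL unit in a sample."""
    types = []
    offset = 0
    while offset + length_size <= len(sample):
        size = int.from_bytes(sample[offset:offset + length_size], "big")
        offset += length_size
        if size == 0 or offset + size > len(sample):
            break  # malformed or truncated sample, stop parsing
        header = sample[offset]
        if codec == "h264":
            types.append(header & 0x1F)         # low 5 bits of the first byte
        else:
            types.append((header >> 1) & 0x3F)  # HEVC: 6 bits after forbidden_zero_bit
        offset += size
    return types

def contains_idr(sample, codec):
    """True if the sample has an IDR slice (H.264 type 5, HEVC types 19 and 20)."""
    idr_types = {5} if codec == "h264" else {19, 20}
    return any(t in idr_types for t in nal_unit_types(sample, codec))

# Synthetic examples: one NAL unit each, payload bytes are placeholders.
h264_p = bytes([0, 0, 0, 3, 0x41, 0x9A, 0x00])
h264_idr = bytes([0, 0, 0, 3, 0x65, 0x88, 0x80])
hevc_idr = bytes([0, 0, 0, 3, 0x26, 0x01, 0xAF])
print(contains_idr(h264_p, "h264"), contains_idr(h264_idr, "h264"), contains_idr(hevc_idr, "hevc"))
```

Running such a check over the second sample of an affected file is how one would confirm that it carries only non-IDR slices despite the flag.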

Use of ICodecAPI interface with a video encoder managed by Media Foundation Sink Writer instance

A bump of a StackOverflow post about a Media Foundation design flaw related to video encoding.

Set attributes via ICodecAPI for a H.264 IMFSinkWriter Encoder

I am trying to tweak the attributes of the H.264 encoder created via ActivateObject() by retrieving the ICodecAPI interface to it. Although I do not get errors, my settings are not taken into account. […]

Media Foundation’s Sink Writer is a simplified API in which the question of encoder configuration slipped away. The fundamental problem here is that you don’t own the encoder MFT and you are accessing it over the writer’s head. The behavior of encoders when settings change after everything is set up depends on the implementation, which in the encoder’s case is vendor specific and might vary across hardware.

Your more reliable option is to manage the encoding MFT directly and supply the Sink Writer with already encoded video.

A potential trick to make things work with less effort is to retrieve the IMFTransform of the encoder as well, and to clear and then set back the input/output media types after you have finished with the ICodecAPI update. By nudging the media types you suggest that the encoder reconfigure its internals, and it would do so with your fine tuning already applied. Note that this, generally speaking, might have side effects.
