Some of high resolution HDMI video capture hardware

Recent submissions of DirectShow and Media Foundation capture capabilities brought references to new interesting hardware.

1. Magewell USB Capture HDMI Gen 2 AKA XI100DUSB-HDMI — USB\VID_2935&PID_0001

Reports ability to capture up to 1920×1080 video at 60 frames per second in YUY2 and RGB pixel formats.

2. Inogeni 4K HDMI to USB 3.0 Converter AKA 2A34-INOGENI 4K2USB3 — USB\VID_2997&PID_0004

Goes beyond that and promises resolutions up to 4096×2160@24, 2560×1440@60

Also, not only with Windows 10, but it seems that it can work well with older OSes like Windows 7 as well.

3. Magewell Pro Capture HDMI 4K — PCI\VEN_1CD7&DEV_0012

A PCI version of HDMI capture board with a rich offering of media types and 2K resolutions

4. Some simpler noname HDMICap USB dongle for up to 1920×1080 YUV and RGB capture — USB\VID_EB1A&PID_7595

(I did not have the actual hardware, I just see reported specifications)

Follow up: mixed parallel H.264 encoding, Intel® Quick Sync Video H.264 Encoder MFT + NVIDIA H.264 Encoder MFT

A scenario which was dropped out from previous post is mixed simultaneous encoding using both hardware encoders. Rationale: Intel QSV encoder might exist as a “free” capability of the motherboard (provided with onboard video adapter), The other one might be available with the video adapter intentionally plugged in (including for other reasons, such as to power dual monitor system etc).

From this standpoint, it might be interesting if one can benefit from using of both encoders.

Intel QSV Filter Graph

Nvidia NVEVC Filter Graph

Two filter graphs are set to produce 60 seconds of 1080p60 video as soon as possible, and are started simultaneously. The chart below show completion time, side by side with those for runs of one and two sessions of each encoder separately.

Completion Times: Intel® Quick Sync Video H.264 Encoder MFT + NVIDIA H.264 Encoder MFT

Informational: in single stream runs CPU load was around 30%, two session runs – around 50%, of which the part that synthesizes and converts the video to compatible MFT input format took 5-6% of CPU time overall. Or, if computed against 60 seconds of CPU time of eight core CPU, the synthesis-and-conversion itself consumed <4% CPU time for one stream, and <7% for dual stream runs.

 

Encoding H.264 video using hardware MFTs

Some time ago there were some pictures explaining performance and other properties of software H.264 encoder (x264). At this time, it is a turn of hardware H.264 encoders and more to that, two of them and side by side. Both encoders are nothing new: Intel® Quick Sync Video H.264 Encoder and NVIDIA H.264 Encoder already have been around for a while. Some would say it is already time for H.265 encoders.

Either way, on my test machine both encoders are available without additionally installed software (that is, no need for Intel Media SDK, Nvidia NVENC, redistributable files etc.). Out of the box, Windows 10 offers stock software only encoder, and hardware encoders in form factor of Media Foundation Transform (MFT).

Environment:

  • OS: Windows 10 Pro
  • CPU: Intel i7-4790
  • Video Adapter 1: Intel HD Graphics 4600 (on-board, not connected to monitors)
  • Video Adapter 2: NVIDIA GeForce GTX 750

It is not convenient or fun to do things with Media Foundation, but good news is that Media Foundation components are well-separable. A wrapper over MFT that converts them into DirectShow filters, make them available to DirectShow where it is already way easier to run various test runs. The pictures below show metrics for encoder defaults (bitrate, profiles and many other options that create a great deal of encoding modes). Still the pictures do show that both encoders are well usable for many scenarios including HD processing, simultaneous data processing etc.

Video Encoder MFT Wrapper in GraphStudioNext

Test runs are as simple as taking reference video source signal of different properties, pushing it through encoder filter and either writing to a file (to inspect the footage) or to Null Renderer Filter to measure performance.

Intel® Quick Sync Video H.264 Encoder produces files like these: 720×480.mp4, 2556×1440.mp4, which are of decent quality (with respect to low bitrate and “hard to handle” background changes). NVIDIA H.264 Encoder produces somewhat better output supposedly by choosing higher bitrate. Either way, both encoders have a number of ways to fine tune the encoding process. Not just bitrate, profile, GOP length, B frame settings but even more sophisticated parameters.

Intel® Quick Sync Video H.264 Encoder MFT

CODECAPI_AVEncCommonRateControlMode: VT_UI4 0, default VT_UI4 0, modifiable // eAVEncCommonRateControlMode_CBR = 0
CODECAPI_AVEncCommonQuality: minimal VT_UI4 0, maximal VT_EMPTY, step VT_EMPTY
CODECAPI_AVEncCommonBufferSize: VT_UI4 3131961357, default VT_UI4 0, modifiable
CODECAPI_AVEncCommonMaxBitRate: default VT_UI4 0
CODECAPI_AVEncCommonMeanBitRate: VT_UI4 3131961357, default VT_UI4 2222000, modifiable
CODECAPI_AVEncCommonQualityVsSpeed: VT_UI4 50, default VT_UI4 50, modifiable
CODECAPI_AVEncH264CABACEnable: modifiable
CODECAPI_AVEncMPVDefaultBPictureCount: VT_UI4 0, default VT_UI4 0, modifiable
CODECAPI_AVEncMPVGOPSize: VT_UI4 128, default VT_UI4 128, modifiable
CODECAPI_AVEncVideoEncodeQP: 
CODECAPI_AVEncVideoForceKeyFrame: VT_UI4 0, default VT_UI4 0, modifiable
CODECAPI_AVLowLatencyMode: VT_BOOL 0, default VT_BOOL 0, modifiable
CODECAPI_AVEncVideoLTRBufferControl: VT_UI4 65536, values { VT_UI4 65536, VT_UI4 65537, VT_UI4 65538, VT_UI4 65539, VT_UI4 65540, VT_UI4 65541, VT_UI4 65542, VT_UI4 65543, VT_UI4 65544, VT_UI4 65545, VT_UI4 65546, VT_UI4 65547, VT_UI4 65548, VT_UI4 65549, VT_UI4 65550, VT_UI4 65551, VT_UI4 65552 }, modifiable
CODECAPI_AVEncVideoMarkLTRFrame: 
CODECAPI_AVEncVideoUseLTRFrame: 
CODECAPI_AVEncVideoEncodeFrameTypeQP: default VT_UI8 111670853658, minimal VT_UI8 0, maximal VT_UI8 219046674483, step VT_UI8 1
CODECAPI_AVEncSliceControlMode: VT_UI4 0, default VT_UI4 2, minimal VT_UI4 2, maximal VT_UI4 2, step VT_UI4 0, modifiable
CODECAPI_AVEncSliceControlSize: VT_UI4 0, minimal VT_UI4 0, maximal VT_UI4 8160, step VT_UI4 1, modifiable
CODECAPI_AVEncVideoMaxNumRefFrame: minimal VT_UI4 0, maximal VT_UI4 16, step VT_UI4 1, modifiable
CODECAPI_AVEncVideoTemporalLayerCount: default VT_UI4 1, minimal VT_UI4 1, maximal VT_UI4 3, step VT_UI4 1, modifiable
CODECAPI_AVEncMPVDefaultBPictureCount: VT_UI4 0, default VT_UI4 0, modifiable

NVIDIA H.264 Encoder MFT

CODECAPI_AVEncCommonRateControlMode: VT_UI4 0
CODECAPI_AVEncCommonQuality: VT_UI4 65
CODECAPI_AVEncCommonBufferSize: VT_UI4 8923353
CODECAPI_AVEncCommonMaxBitRate: VT_UI4 8923353
CODECAPI_AVEncCommonMeanBitRate: VT_UI4 2974451
CODECAPI_AVEncCommonQualityVsSpeed: VT_UI4 33
CODECAPI_AVEncH264CABACEnable: VT_BOOL -1
CODECAPI_AVEncMPVGOPSize: VT_UI4 50
CODECAPI_AVEncVideoEncodeQP: VT_UI8 26
CODECAPI_AVEncVideoForceKeyFrame: 
CODECAPI_AVEncVideoMinQP: VT_UI4 0, minimal VT_UI4 0, maximal VT_UI4 51, step VT_UI4 1
CODECAPI_AVLowLatencyMode: VT_BOOL 0
CODECAPI_AVEncVideoLTRBufferControl: VT_UI4 0, values { VT_I4 65537, VT_I4 65538 }
CODECAPI_AVEncVideoMarkLTRFrame: 
CODECAPI_AVEncVideoUseLTRFrame: 
CODECAPI_AVEncVideoEncodeFrameTypeQP: VT_UI8 111670853658
CODECAPI_AVEncSliceControlMode: VT_UI4 2, minimal VT_UI4 0, maximal VT_UI4 2, step VT_UI4 1
CODECAPI_AVEncSliceControlSize: VT_UI4 0, minimal VT_UI4 0, maximal VT_UI4 3, step VT_UI4 1
CODECAPI_AVEncVideoMaxNumRefFrame: VT_UI4 1, minimal VT_UI4 0, maximal VT_UI4 16, step VT_UI4 1
CODECAPI_AVEncVideoMeanAbsoluteDifference: VT_UI4 0
CODECAPI_AVEncVideoMaxQP: VT_UI4 51, minimal VT_UI4 0, maximal VT_UI4 51, step VT_UI4 1
CODECAPI_AVEncVideoROIEnabled: VT_UI4 0
CODECAPI_AVEncVideoTemporalLayerCount: minimal VT_UI4 1, maximal VT_UI4 3, step VT_UI4 1

Important property of hardware encoder is that even that it does consume some of CPU time, the most of the complexity is offloaded to video hardware. In all single stream test runs, the eight-core CPU was loaded not more than 30% including time required to synthesize the image using WIC and Direct2D and convert it to YUV format using CPU. That is, offloading video encoding to GPU is a convenient way to free CPU for real time video processing applications.

I was mostly interested in how the encoders are in terms of being able to process real time data, esp. so that they are applied to record lengthy sessions. Both encoders appear to be fast enough to crack 1920×1080 HD video at frame rates up to 60 and higher. The test did encoding at highest rate possible and 100% number on the charts corresponds to situation that it took one second to synthesize and encode one second of video no matter what effective CPU/GPU load is. That is, values less than 100% indicate ability to encode video content in real time right away.

Intel and NVidia Hardware H.264 Encoders Side by Side

Basically, the numbers show that both encoders are fast enough to reliably encode 1080p60 stream.

Looking at it from another standpoint of being able to process two or more H.264 encoding sessions at once, encoder from NVidia has an important limitation of two sessions per system (supposedly related thread – for this or another reason test run with three streams fails).

Intel and NVidia H.264 Encoders in Concurrent Encoding

Both encoders are hardly suitable for reliable encoding of two 1080p60 streams simultaneously (or perhaps some fine tuning might make things faster by choosing appropriate encoding mode). However both look fine for encoding 1080p and lower resolution stream. Clearly, Intel’s encoder can be used to encoder multiple low resolution streams in parallel or mix real time encoding with background encoding (provided that background encoding is throttled to let the real time stream run fast enough). If otherwise real-time encoding is not necessary, both encoders can do the job as well, and with Nvidia the application needs to make sure that only two sessions are running simultaneously, Intel’s encoder can be used in a more flexible way.

Also, Nvidia’s encoder is slightly faster, however Intel’s allow 3+ concurrently encoded stream and also allows to supply RGB input directly without converting to YUV.

There is also Intel® Hardware H265 Encoder MFT available for H.265 encoding, but this is going to be another story some time later.

Blackmagic Design Intensity Pro 4K Issues

The new board is inexpensive,cool (well, actually it is hot, see below) and easy to interface with but has has severe issues.

The Intensity Pro 4K is great for video editors that need a realtime preview on a big screen TV, people doing live streaming presentations, or for those trying to save family videos from old VHS tapes.
Customers can capture NTSC, PAL, 720HD, 1080HD and Ultra HD.

Issue 1. The cooling fan is totally annoying. 40mm fan is running at constant high (max?) speed without dynamic speed control. Noise level is absolutely unacceptable and the board is a “no go” until the problem is solved. There is a grayed out check box to enable board’s control over spinning, so we might expect (and hope!) that firmware update starts doing what it is supposed to do from the start.

Issue 2. Blackmagic Design Desktop Video 10.4 is unstable. Internal problems partially disable board capabilities and certain modes are no longer available. Software can no longer capture XBox signal, in particular. Blackmagic Design is yet to release good operational version of software.

Issue 3. DeckLink SDK memory leak (applies to 10.3.7 and supposedly earlier versions as well; reference code). IDeckLinkInput does not properly manage internal video frame buffers and leak them once in a while. The problem does not happen if you:

  • reuse IDeckLinkInput interfaces
  • use custom memory allocator (which is preferred because stock allocator is also way too memory greedy)

Hardware assisted memory corruption detection

So you got a memory corruption issue with a piece of software. It comes in a unique scenario along the line of having a huge pile of weird code running well most of the time and then, right out of the blue, a corruption takes place followed by unexpected code execution and unstable software state in general.

The biggest problem with memory corruption is that a fragment of code is modifying a memory block which it does not own, and it has no idea who actually is the owner of the block, while the real owner has no timely way to detect the modification. You only face the consequences being unable to capture the modification moment in first place.

To get back to the original cause, an engineer has to drop into a time machine, turn back time and step back to where the trouble took originally place. As developers are not actually given state-of-the-art time machines, the time turning step is speculative.

CVirtualHeapPtr Class: Memory with Exception-on-Write access mode

At the same time a Windows platform developer is or might be aware of virtual memory API which among other things provides user mode application with capabilities to define memory protection modes. Having this on hands opens unique opportunity to apply read-only protection (PAGE_READONLY) onto a memory block and have exception raised at the very moment of unexpected memory modification, having call stack showing up a source of the problem. I refer to this mode of operation as “hardware assisted” because the access violation exception/condition would be generated purely in hardware without any need to additionally do any address comparison in code.

Needless to say that this way is completely convenient for the developer as he does not need to patch the monstrous application all around in order to compare access addresses against read-only fragment. Instead, a block defined as read-only will be immediately available as such for the whole process almost without any performance overhead.

As ATL provides a set of memory allocator templates (CHeapPtr for heap backed memory blocks, allocated with CCRTAllocator, alternate options include CComHeapPtr with CComAllocator wrapping CoTaskMemAlloc/CoTaskMemFree API), let us make an alternate allocator option that mimic well-known class interface and would facilitate corruption detection.

Because virtual memory allocation unit is a page, and protection mode is defined for the whole page, this would be the allocation granularity. For a single allocated byte we would need to request SYSTEM_INFO::dwPageSize bytes of virtual memory. Unlike normal memory heap manager, we have no way to share pages between allocations as we would be unable to effectively apply protection modes. This would definitely increase application pressure onto virtual memory, but is still acceptable for the sacred task of troubleshooting.

We define a CVirtualAllocator class to be compatible with ATL’s CCRTAllocator, however based on VirtualAlloc/VirtualFree API. The smart pointer class over memory pointer would be defined as follows:

template <typename T>
class CVirtualHeapPtr :
    public CHeapPtr<T, CVirtualAllocator>
{
public:
// CVirtualHeapPtr
    CVirtualHeapPtr() throw();
    explicit CVirtualHeapPtr(_In_ T* pData) throw();
    VOID SetProtection(DWORD nProtection)
    {
        // TODO: ...
    }
};

The SetProtection method is to define memory protection for the memory block. Full code for the classes is available on Trac here (lines 9-132):

  • CGlobalVirtualAllocator class is a singleton querying operating system for virtual memory page size, and provides alignment method
  • CVirtualAllocator class is a CCRTAllocator-compatible allocator class
  • CVirtualHeapPtr class is smart template class wrapping a pointer to allocated memory

Use case code will be as follows. “SetProtection(PAGE_READONLY)” enables protection on memory block and turns on exception generation at the moment memory block modification attempt. “SetProtection(PAGE_READWRITE)” would restore normal mode of memory operation.

CVirtualHeapPtr<BYTE> p;
p.Allocate(2);
p[1] = 0x01;
p.SetProtection(PAGE_READONLY);
// NOTE: Compile with /EHa on order to catch the exception
_ATLTRY
{
    p[1] = 0x02;
    // NOTE: We never reach here due to exception
}
_ATLCATCHALL()
{
    // NOTE: Catching the access violation for now to be able to continue execution
}
p.SetProtection(PAGE_READWRITE);
p[1] = 0x03;

Given the information what data gets corrupt, the pointer allocator provides an efficient opportunity to detect the violation attempt. The only thing remained is to keep memory read-only, and temporarily revert to write access when the “legal” memory modification code is about to be executed.

Continue reading →