Applying Hardsubs to H.264 Video with GPU

Video adapters currently offer a range of services which enables transcoding of H.264 content with certain modifications (including but not limited to flexible overlays, scaling, mirroring, effects and filters) end-to end on GPU keeping data as DirectX Graphics Infrastructure resource at all processing stages.

Such specialized processing capabilities are pretty powerful compared to traditional CPU processing, especially taking into consideration performance of low end low power-consumption systems still equipped with contemporary GPU.

For a test, I transcoded video H.264/MP4 files of different resolutions applying a text overlay having a time stamp of video frame being processed. The overlay is complex enough to be  varying frame to frame, be a standard font with respective rasterization (using DirectWrite). The test re-encoded H.264 content maintaining bitrate without giving too much care for other encoding details (defaults used).

Hardsub Performance Test

The roughly made test was successful with two video GPUs:

  1. Intel HD Graphics 4600 (7th gen; Desktop system; Core i7-4790 CPU)
  2. Intel HD Graphics (8th gen; Ultramobile system; Atom x5-Z8300 CPU) – the system is actually a $200 worth budget Chinese tablet Cube iWork 10 Ultimate

The test failed on other GPUs:

  • Intel HD Graphics 4000 (7th gen; Mobile system; Core i7-3517U CPU)
  • NVIDIA GeForce GTX 750

The problem – as it looks without getting into details – seems to be the inability of Media Foundation APIs to fit Direct3D-enabled pipelines out of the box, such as because of lack of certain conversion. It looks like transcoding can be achieved, with just putting some more effort into it.

As of now, Intel offers their 9th generation GPUs and the ones being tested are hardware of a few yeas in age…

Compared to real time performance of 100% (meaning that it takes one second to process one second of video of given metrics), both systems managed to do the transcoding relatively efficiently. With a roughly built test having a bottleneck at applying overlay, taking place serially in single thread, both systems showed performance sufficient to convert 1920×1080@60 video faster than in real time and without maxing CPU out.

Intel’s seventh generation desktop GPU managed to do the job way much faster.

It is interesting that even cheap tablet can process a Full HD video stream loading CPU less than 40%. Basically, the performance is sufficient for doing real time video processing (including using external web camera like Logitech C930E) with certain processing in 1080p resolution using budget grade hardware.

Re-encoding Performance

When there is no necessity to keep the real time processing pace, the cheap tablet showed the ability to do GPU processing on 2K video, which is also good news for those who wants to apply budget hardware to high resolution material.

Apparently, the key factor is ability of the process to keep data in video hardware. As Intel GPU H.264 abilities scale well when used for concurrent multi-stream processing, the performance numbers promise great performance recording video in several formats at a time: raw video, video with overlay, scaled down etc.

The table below gives more numbers for the tests concluded:

Re-encoding Performance Numbers

As mentioned above, the overlaying part itself is a single threaded bottleneck and presumably it is a reserve to be used to cut elapsed time down even further.

Another interesting observation is that while ultramobile system still uses much of CPU time (which is okay – it’s not a powerful system by design), the desktop GPU has minimal impact on CPU while doing pretty complicated task.

Leave a Reply