WebCodecs in StreamingServer for JavaScript H.264 decoding

Another small addition to the StreamingServer showcase/development application: a verification page for WebCodecs API video streaming. The WebCodecs API offers browser applications video decoding capabilities:

The WebCodecs API gives web developers low-level access to the individual frames of a video stream and chunks of audio. It is useful for web applications that require full control over the way media is processed. For example, video or audio editors, and video conferencing.

The API has shipped starting with Chrome version 94 (the explainer is here). In a nutshell, JavaScript code can handle raw, uncontainerized video data and convert it into video frames which can be, in particular, drawn onto an HTML canvas. This provides a lower level video decoding capability compared to Media Source Extensions (MSE): the video stream does not need to be containerized, yet the browser provides an interface to hardware accelerated video decoding for efficient video data processing.

StreamingServer now handles two types of requests in its HTTP/HTTPS interface: /webcodecs-videodecoder-A.html with JavaScript code driving the WebCodecs API for decoding followed by rendering of the obtained frames in a timer callback, and /webcodecs-videodecoder-A?frame= to send an individual H.264 video frame encoded on the fly. All together, the code simulates video playback receiving H.264 frames from the HTTP server one by one.

The setup is a proof of concept: it generates and encodes the full frame set on the original request rather than encoding each frame on demand, so be aware if you happen to request a long sequence.

To check things out, start StreamingServer and open Chrome Canary version 94+¹, then navigate to one of the following:

  • http://localhost/hls/webcodecs-videodecoder-A.html
  • http://localhost/hls/webcodecs-videodecoder-A.html?FrameSizeW=720&FrameSizeH=480&FrameRateN=30000&FrameRateD=1001&SegmentDuration=15

The second URL shows the available parameters for video encoding. The JavaScript code can be inspected directly from Chrome’s Developer Tools.

¹ Microsoft Edge self-updated today to Version 94.0.992.31 and it has WebCodecs API available in it as well!

Download links

Binaries:

  • 64-bit: StreamingServer.exe (in .ZIP archive)
  • License: This software is free to use; builds have time based expiration

MPEG-DASH trick play adaptation set

Just a small addition to the MPEG-DASH server: a separate 1 fps trick play video track consisting of just IDR frames.

The “trick mode” itself is essentially this:

3.2.9. Trick Mode Support

Trick Modes are used by DASH clients in order to support fast forward, seek, rewind and other operations in which typically the media, especially video, is displayed in a speed other than the normal playout speed. In order to support such operations, it is recommended that the content author adds Representations at lower frame rates in order to support faster playout with the same decoding and rendering capabilities.

However, Representations targeted for trick modes are typically not suitable for regular playout.

The application extends its manifest with an additional “trick” video track when the requested URL is http://localhost/hls/manifest.mpd?trickplay
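For illustration only, a trick play track in an MPD is conventionally a separate AdaptationSet annotated with the DASH-IF trick mode descriptor. The fragment below is a hypothetical sketch, not the actual StreamingServer output: the attribute values are made up, and only the EssentialProperty scheme URI follows the DASH-IF guidelines quoted above.

```xml
<!-- Hypothetical sketch of a 1 fps trick play adaptation set; per DASH-IF
     guidelines, @value of the descriptor references the @id of the main
     video Adaptation Set the trick track applies to -->
<AdaptationSet id="10" contentType="video" maxFrameRate="1">
  <EssentialProperty schemeIdUri="http://dashif.org/guidelines/trickmode" value="1"/>
  <Representation id="video-trick" codecs="avc1.640028" frameRate="1" bandwidth="250000"/>
</AdaptationSet>
```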

Download links

Binaries:

  • 64-bit: StreamingServer.exe (in .ZIP archive)
  • License: This software is free to use; builds have time based expiration

MPEG-DASH content in StreamingServer application

MPEG-DASH is the ISO/IEC 23009 “Dynamic Adaptive Streaming over HTTP” specification. It is widely used to stream audiovisual content over the internet, as opposed to playback of static content such as a downloaded clip.

The StreamingServer application I published some time ago generated test content using the HTTP Live Streaming (HLS) protocol, which is, well, similar.

So I extended StreamingServer a bit and made it expose the media as MPEG-DASH content as well. The feature set is way narrower than in the case of HLS, just a VOD asset, but it is a bit sophisticated: multi-period with three periods and a not so obvious internal layout. Experimental, sort of.

I will use the space of this post to document steps to enable playback of this content.

Once again, what does the application do in the first place? Once started, the application (or a service, if converted to run as a Windows service) hooks into the Windows HTTP Server API (so you might need to run it with elevated privileges) and extends the built-in web server by providing content. If executed with no arguments, it connects to the http://localhost/hls/ node and is ready to serve http://localhost/hls/master.m3u8 for HLS playback, and now also http://localhost/hls/manifest.mpd for MPEG-DASH playback. http://localhost/hls/about has some embedded documentation.

Serving the requests, the application prepares audio and video content on the fly; for video it leverages the NVIDIA GPU hardware video encoder if available, but it also has a fallback code path using the Microsoft software encoder. The application is not designed for concurrent access by multiple clients, and of course real time video encoding has its own capacity limits. The application is rather a verification tool; internally it runs a few Microsoft Media Foundation pipelines (media sessions) for various things: to obtain RFC 6381 “codecs” data, initialization and media segments, etc.

To play an MPEG-DASH asset, perhaps the most popular player is Shaka Player, which conveniently has an online demo. There is a custom content section where the manifest URL http://localhost/hls/manifest.mpd can be added for playback.

One problem here is CORS, along with browser security and permissions. The demo runs over HTTPS and so it can’t consume an HTTP media asset. To work around this, StreamingServer needs to be started with these command line switches, to register on both the HTTP and HTTPS nodes of the web server:

StreamingServer.exe -Location http://+:80/hls/ -Location https://+:443/hls/

In order to use the application non-locally over HTTPS you might need to configure IIS first and add a certificate there. A self-signed certificate works fine as long as you add trust to it on the client side.

What happens next? We are good to go.

The blue, green and red parts represent separate periods which are stitched smoothly during playback (it is easy to see what’s inside by downloading the manifest and opening it in your favorite XML editor).

The rest of the properties of video and audio are hardcoded for MPEG-DASH.

Download links

Binaries:

  • 64-bit: StreamingServer.exe (in .ZIP archive)
  • License: This software is free to use; builds have time based expiration

Wait for IAsyncAction on STA thread

Figured out how to elegantly do a blocking wait for an asynchronous coroutine-enabled function on an STA thread.

You can’t do this:

// /std:c++latest /await

#include <unknwn.h>
#include <winrt\base.h>
#include <winrt\Windows.Foundation.h>

#pragma comment(lib, "windowsapp.lib")

winrt::Windows::Foundation::IAsyncAction Foo()
{
    co_return;
}

int main()
{
    winrt::init_apartment(winrt::apartment_type::single_threaded);
    Foo().get(); // <<--- Debug Assertion Failed!
    return 0;
}

There is an assertion failure because .get() assumes the ability to block. On an STA this hits a failure in the winrt::impl::blocking_suspend call.

So you have to avoid using .get() to synchronize, and there should be a message pump instead (you might need one for another reason anyway; why else would you want the non-default single_threaded in the first place?).

So you would get something like this:

// /std:c++latest /await

#include <unknwn.h>
#include <winrt\base.h>
#include <winrt\Windows.Foundation.h>

#pragma comment(lib, "windowsapp.lib")

winrt::Windows::Foundation::IAsyncAction Foo()
{
    co_return;
}

int main()
{
    winrt::init_apartment(winrt::apartment_type::single_threaded);
    winrt::handle CompletionEvent { CreateEvent(nullptr, TRUE, FALSE, nullptr) };
    auto const Action { Foo() };
    Action.Completed([&](winrt::Windows::Foundation::IAsyncAction const&, winrt::Windows::Foundation::AsyncStatus Status) 
    {
        WINRT_ASSERT(Status == winrt::Windows::Foundation::AsyncStatus::Completed);
        WINRT_VERIFY(SetEvent(CompletionEvent.get()));
    });
    HANDLE const Objects[] { CompletionEvent.get() };
    for(; ; )
    {
        auto const WaitResult = MsgWaitForMultipleObjects(static_cast<DWORD>(std::size(Objects)), Objects, FALSE, INFINITE, QS_ALLEVENTS);
        if(WaitResult == WAIT_OBJECT_0 + 0) // CompletionEvent
            break;
        WINRT_ASSERT(WaitResult == WAIT_OBJECT_0 + std::size(Objects));
        MSG Message;
        while(PeekMessageW(&Message, NULL, WM_NULL, WM_NULL, PM_REMOVE))
            DispatchMessageW(&Message);
    }
    return 0;
}

Now the question is: what if the Foo function needs to switch contexts while being on an STA thread? Would it need to repeat the same pattern and dispatch messages while waiting?

NO!

Use of apartment_context makes it possible to switch contexts and return back to the STA in the coroutine execution sequence, while the outer message pump keeps spinning between the coroutine segments.

Below is full sample code that does strange threading things in the Foo function, with threads and COM apartment checks, and then returns to the calling STA at the end of the day. Additionally, it posts a message from a worker thread and makes sure that the outer message pump catches it.

// /std:c++latest /await

#include <unknwn.h>
#include <winrt\base.h>
#include <winrt\Windows.Foundation.h>

#pragma comment(lib, "windowsapp.lib")

using namespace winrt::Windows::Foundation;

#include <chrono>
#include <thread>

using namespace std::chrono_literals;

void ApartmentCheck(APTTYPE ExpectType, APTTYPEQUALIFIER ExpectQualifier)
{
    APTTYPE Type;
    APTTYPEQUALIFIER Qualifier;
    WINRT_VERIFY(SUCCEEDED(CoGetApartmentType(&Type, &Qualifier)));
    WINRT_ASSERT(Type == ExpectType && Qualifier == ExpectQualifier);
}

IAsyncAction Foo()
{
    ApartmentCheck(APTTYPE_MAINSTA, APTTYPEQUALIFIER_NONE);
    winrt::apartment_context Context;
    winrt::handle ExternalEvent { CreateEvent(nullptr, TRUE, FALSE, nullptr) };
    {
        auto const ThreadIdentifier = GetCurrentThreadId();
        std::thread SimulationThread([&] 
        {
            WINRT_VERIFY(PostThreadMessageW(ThreadIdentifier, WM_APP, 0, 0));
            std::this_thread::sleep_for(5s);
            WINRT_VERIFY(SetEvent(ExternalEvent.get())); 
        });
        //co_await winrt::resume_background();
        co_await winrt::resume_on_signal(ExternalEvent.get());
        SimulationThread.join();
    }
    ApartmentCheck(APTTYPE_MTA, APTTYPEQUALIFIER_IMPLICIT_MTA); // MtaThread enables this, see below
    co_await Context;
    ApartmentCheck(APTTYPE_MAINSTA, APTTYPEQUALIFIER_NONE);
    co_return;
}

int main()
{
    winrt::init_apartment(winrt::apartment_type::single_threaded);
    winrt::handle MtaThreadTerminationEvent { CreateEvent(nullptr, TRUE, FALSE, nullptr) };
    std::thread MtaThread([&] 
    { 
        winrt::init_apartment();
        WINRT_VERIFY(WaitForSingleObject(MtaThreadTerminationEvent.get(), INFINITE) == WAIT_OBJECT_0);
    });
    std::this_thread::sleep_for(1s);
    winrt::handle CompletionEvent { CreateEvent(nullptr, TRUE, FALSE, nullptr) };
    auto const Action { Foo() };
    Action.Completed([&](IAsyncAction const&, AsyncStatus Status) 
    {
        WINRT_ASSERT(Status == AsyncStatus::Completed);
        WINRT_VERIFY(SetEvent(CompletionEvent.get()));
    });
    unsigned int MessageCount = 0;
    HANDLE const Objects[] { CompletionEvent.get() };
    for(; ; )
    {
        auto const WaitResult = MsgWaitForMultipleObjects(static_cast<DWORD>(std::size(Objects)), Objects, FALSE, INFINITE, QS_ALLEVENTS);
        if(WaitResult == WAIT_OBJECT_0 + 0) // CompletionEvent
            break;
        WINRT_ASSERT(WaitResult == WAIT_OBJECT_0 + std::size(Objects));
        MSG Message;
        while(PeekMessageW(&Message, NULL, WM_NULL, WM_NULL, PM_REMOVE))
        {
            WINRT_ASSERT(Message.message == WM_USER || Message.message == WM_APP);
            if(Message.message == WM_APP)
                MessageCount++;
            DispatchMessageW(&Message);
        }
    }
    WINRT_ASSERT(MessageCount == 1);
    WINRT_VERIFY(SetEvent(MtaThreadTerminationEvent.get()));
    MtaThread.join();
    return 0;
}

Some additional comments on the code:

  • MtaThread is necessary for thread pool threads to belong to the implicit MTA; otherwise the COM-backed return to the STA would not work
  • The initial sleep is there to make sure that the MTA is up
  • DispatchMessageW dispatches two messages: the WM_APP one we PostThreadMessageW ourselves, and a WM_USER one which is a part of the co_await Context; machinery
  • SimulationThread simulates an externally signaled asynchronous event
  • The commented out co_await winrt::resume_background(); indicates that there is no need for an explicit switch to a worker thread: the coroutine machinery itself suspends execution and continues on a thread pool thread (or maybe it’s implementation specific?)

Hardware AV1 video encoders are coming

There is something interesting finally happening with video encoding and also Media Foundation:

Intel® Hybrid AV1 Encoder MFT

11 Attributes:

  • MFT_TRANSFORM_CLSID_Attribute: {62C053CE-5357-4794-8C5A-FBEFFEFFB82D} (Type VT_CLSID)
  • MF_TRANSFORM_FLAGS_Attribute: MFT_ENUM_FLAG_HARDWARE
  • MFT_ENUM_HARDWARE_VENDOR_ID_Attribute: VEN_8086 (Type VT_LPWSTR)
  • MFT_ENUM_HARDWARE_URL_Attribute: AA243E5D-2F73-48c7-97F7-F6FA17651651 (Type VT_LPWSTR)
  • MFT_INPUT_TYPES_Attributes: {3231564E-3961-42AE-BA67-FF47CCC13EED}, MFVideoFormat_NV12
  • MFT_OUTPUT_TYPES_Attributes: MFVideoFormat_AV1
  • MFT_CODEC_MERIT_Attribute: 7 (Type VT_UI4)
  • MFT_SUPPORT_DYNAMIC_FORMAT_CHANGE: 1 (Type VT_UI4)
  • MF_TRANSFORM_ASYNC: 1 (Type VT_UI4)

The Intel UHD graphics coming with the 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz turns out to be equipped with the new stuff. AMD and NVIDIA are probably also on the way.

Unicode vs. Windows Console

If I run this, what would the output be?

#include <string>
#include <iostream>

#include <winrt\base.h>
#include <winrt\Windows.Foundation.h>
#include <winrt\Windows.Globalization.DateTimeFormatting.h>

#pragma comment(lib, "windowsapp.lib")

int main()
{
	auto const Now = winrt::clock::now();
	winrt::Windows::Globalization::DateTimeFormatting::DateTimeFormatter DateTimeFormatter { L"shortdate longtime" };
	std::wcout << "Now is " << static_cast<std::wstring>(DateTimeFormatter.Format(Now)) << std::endl;
	return 0;
}

Here we go:

What appears to be wrong is the Unicode Left-to-Right Mark character, which kills the console: it stops accepting any further text!

Now if you’re going to do this:

#include <string>
#include <iostream>

#include <winrt\base.h>
#include <winrt\Windows.Foundation.h>
#include <winrt\Windows.Globalization.DateTimeFormatting.h>

#pragma comment(lib, "windowsapp.lib")

std::wstring Replace(std::wstring const& Input, std::wstring const& A, std::wstring const& B)
{
	std::wstring Output;
	for(size_t C = 0; ; )
	{
		auto const D = Input.find(A, C);
		if(D == Input.npos)
		{
			Output.append(Input.substr(C));
			break;
		}
		Output.append(Input.substr(C, D - C));
		Output.append(B);
		C = D + A.length();
	}
	return Output;
}

int main()
{
	auto const Now = winrt::clock::now();
	winrt::Windows::Globalization::DateTimeFormatting::DateTimeFormatter DateTimeFormatter { L"shortdate longtime" };
	std::wcout << "Now is " << Replace(static_cast<std::wstring>(DateTimeFormatter.Format(Now)), L"\u200E", L"") << std::endl;
	return 0;
}

Then you get what you want, and note that the trailing end-of-line is now in its place (it was not in the first run):

2021-07-17

A cleaner way to unblock Unicode character output is probably to specify the locale for console output explicitly:

std::wcout.imbue(std::locale(".UTF8"));

Modern asynchronous C++

The Windows API has offered asynchronously implemented functionality for file, network and other I/O for a long time. It was maybe one of the easiest ways to make a simple thing messy, ridiculously bloated, and sensitive to errors of all sorts.

If you’re sane and you don’t need to squeeze everything out of the system, you would just not use overlapped I/O and prefer the blocking versions of the API. One specific advantage synchronous APIs and blocking calls offer is linearity of code: you see clearly what happens next, and you don’t need to go back and forth between multiple functions, completions and structures that carry transit context. At the cost of threads, memory and blocking, you obtain an easier and more reliable method of writing code.

At some point C#, as a far more flexibly evolving language, made a move to approach asynchronous programming in a new way: Asynchronous programming in C# | Microsoft Docs. In C++ you remained where you were, able to do all the same only at the cost of code mess. There have been a few more attempts to make things easier with concurrency in C++, and eventually coroutines are making their way into modern C++.

So for a specific task I needed to quickly write some code to grab multiple images and have a Telegram bot throw them over into a channel as a notification measure. Not a superman’s job, but still a good small example of how to do things in parallel with compact C++ code.

MSVC C++17 with /await enables use of C++ coroutines, and the C++/WinRT language projection supplies us with suitable asynchronous APIs. The code snippet below starts multiple simultaneous tasks, each locating a file, reading it into memory, starting an HTTP POST request and posting the image to a remote web server. The controlling code then synchronizes on completion of all of the tasks, letting them run and complete independently.

struct CompletionContext
{
	CompletionContext(size_t Counter) :
		Counter(static_cast<uint32_t>(Counter))
	{
	}
	void Decrement()
	{
		if(--Counter == 0)
			WI_VERIFY(SetEvent(Event.get()));
	}

	std::atomic_uint32_t Counter;
	winrt::handle Event { CreateEvent(nullptr, TRUE, FALSE, nullptr) };
};

winrt::Windows::Foundation::IAsyncOperation<bool> Process(DateTime Time, Configuration::Channel& Channel, CompletionContext& Context)
{
	auto Decrement = wil::scope_exit([&]() { Context.Decrement(); });
	WI_ASSERT(!Channel.RecordDirectory.empty());
	WI_ASSERT(!Channel.Name.empty());
	auto const TimeEx = system_clock::from_time_t(winrt::clock::to_time_t(Time));
	winrt::Windows::Storage::Streams::IBuffer Buffer;
	WCHAR Path[MAX_PATH];
	PathCombineW(Path, Channel.RecordDirectory.c_str(), Channel.Name.c_str());
	PathCombineW(Path, Path, L"thumbnail.jpg");
	using namespace winrt::Windows::Storage;
	auto const File = co_await StorageFile::GetFileFromPathAsync(Path);
	auto const InputStream = co_await File.OpenAsync(FileAccessMode::Read, StorageOpenOptions::AllowOnlyReaders);
	Buffer = co_await TelegramHelper::ToBuffer(InputStream);
	std::wostringstream Stream;
	Stream << Format(L"ℹ️ Notification") << std::endl;
	Stream << std::endl;
	AppendComputerTime(Stream);
	Stream << "Directory: " << Channel.RecordDirectory << std::endl;
	Stream << "Channel: " << Channel.Name << L" (" << Channel.FriendlyName << L")" << std::endl;
	co_await TelegramHelper::SendPhoto(TelegramHelper::BinaryDocument(Buffer, L"thumbnail.jpg"), Stream.str());
	co_return true;
}
winrt::Windows::Foundation::IAsyncAction Completion(CompletionContext& Context)
{
	co_await winrt::resume_on_signal(Context.Event.get()); // https://docs.microsoft.com/en-us/uwp/cpp-ref-for-winrt/resume-on-signal
	co_return;
}

CompletionContext Context(m_Configuration.ChannelVector.size());
for(auto&& Channel : m_Configuration.ChannelVector)
	Process(Time, Channel, Context);
co_await Completion(Context);

(I am probably just not aware of an existing suitable pattern to synchronize with multiple completions, so I made it with a manual event and waited on it with the existing C++/WinRT helper that I was aware of.)

So how is this better than what we had before?

First, and perhaps most important: the code remains compact and linear. With this amount of C++ code you would not even guess that it runs highly parallelized. The only blocking is at the last line of the snippet, where we finally wait for completion of all of the tasks. Still, the task code is perfectly readable and free of excessive boilerplate you would have to desperately read through trying to figure out what is going on.

Second, the code is concurrent and parallel without any need to manage threads and the like. You don’t need to think about how many threads you want or how many CPU cores the system has. The code is just parallel enough and is mapped onto available system resources in a good way; you just focus on what is important. The scalability aspect is covered a couple of paragraphs below.

Third, the amount of co_await operators: they appear a lot in code around asynchronous operations. The way things work is this: the C++ compiler slices your function with a return type of winrt::Windows::Foundation::IAsync* (for details I refer you to coroutine theory; let’s just focus on the C++/WinRT part here) into multiple pieces separated by the co_await operators. This is done transparently, and you see the function as a solid piece of code, while effectively it’s broken into separate pieces joined by an execution context with arguments, local variables and the returned value. At every such operator the function can be suspended for as long as necessary (for example, to complete I/O) and then resumed on this or another thread. As a C++ developer you don’t have to think about the details anymore, as the C++ compiler is here to help you catch up in efficiency with the C# guys.

Even though this might not be exactly accurate technically, I think it is helpful to imagine that the C++ compiler compiles multiple “subfunctions”, breaking the original function at co_await boundaries. Now imagine that it is possible to quickly transfer such a “subfunction” to another CPU core, or put it aside to execute a “more important” “subfunction” from another context of execution. The application is now a deck of small tasks executed in a highly parallel manner, while at the same time the order of tasks within one context of execution is safely preserved. You also keep all the nice things you are used to from earlier C++.

The IAsync*/co_await implementation supplies you with a thread pool to place your multiple tasks and their function pieces onto available CPU cores for concurrent execution. That is, lots of small subtasks evenly distributed across cores over a reasonable number of threads, with unblocked execution and managed waiting for I/O completion.

All in all, you can now have compact, well readable and manageable concurrent code, scalable and with efficient resource consumption, with much less need to do threading and waiting on your own.