<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fooling Around &#187; performance</title>
	<atom:link href="http://alax.info/blog/tag/performance/feed" rel="self" type="application/rss+xml" />
	<link>http://alax.info/blog</link>
	<description>// Software Production Line</description>
	<lastBuildDate>Wed, 02 May 2012 15:42:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Obtaining number of thread context switches programmatically</title>
		<link>http://alax.info/blog/1182</link>
		<comments>http://alax.info/blog/1182#comments</comments>
		<pubDate>Sun, 20 Mar 2011 10:38:02 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[Seriously]]></category>
		<category><![CDATA[Source]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[context switch]]></category>
		<category><![CDATA[ntdll]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[undocumented]]></category>
		<category><![CDATA[x64]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=1182</guid>
		<description><![CDATA[<a href="http://alax.info/blog/1182" title="Obtaining number of thread context switches programmatically"></a>Previous post on thread synchronization and context switches used number of thread context switches as one of the performance indicators. One might have hard times getting the number from operating system though. The only well documented access to amount of &#8230;<p class="read-more"><a href="http://alax.info/blog/1182">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/1182" title="Obtaining number of thread context switches programmatically"></a><p>The previous post on <a href="http://alax.info/blog/1177">thread synchronization and context switches</a> used the number of thread context switches as one of the performance indicators. Getting that number from the operating system can be hard, though.</p>
<p>The only well-documented way to get the number of context switches seems to be the corresponding <a href="http://msdn.microsoft.com/en-us/library/aa373083%28VS.85%29.aspx">performance counters</a>. The Thread performance object lists the available thread instances, and the counter &#8220;<em>Thread(&lt;process-name&gt;/&lt;thread-number&gt;)/Context Switches/sec</em>&#8221; provides the context switch rate per second.</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2011/03/Image002.png"><img class="alignnone size-medium wp-image-1183" title="Context Switches Performance Counter" src="http://alax.info/blog/wp-content/uploads/2011/03/Image002-320x238.png" alt="" width="320" height="238" /></a></p>
<p>Performance counters are far from the most convenient API, and for programmatic access one would really prefer the absolute number of switches to a rate per second (which is still good for interactive monitoring).</p>
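<p>To illustrate why the absolute count is the more convenient primitive: a rate can always be derived from two absolute samples, while the reverse is not possible. A minimal sketch, in which the <em>Sample</em> type and the function name are hypothetical and not part of any Windows API:</p>

```cpp
#include <cstdint>

// Hypothetical sample: an absolute context switch count taken at a known time.
struct Sample {
    std::uint64_t switchCount;   // absolute number of context switches so far
    double timeSeconds;          // when the sample was taken, in seconds
};

// Derives an average rate (switches per second) from two absolute samples.
// Returns 0 if no time has elapsed or the samples are out of order.
double SwitchesPerSecond(const Sample& earlier, const Sample& later) {
    const double elapsed = later.timeSeconds - earlier.timeSeconds;
    if (elapsed <= 0.0 || later.switchCount < earlier.switchCount) return 0.0;
    return static_cast<double>(later.switchCount - earlier.switchCount) / elapsed;
}
```

<p>With absolute counts the caller chooses the measurement interval, for example the exact duration of a test run.</p>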
<p>A gate into the kernel world for grabbing the data of interest is the <a href="http://msdn.microsoft.com/en-us/library/ms724509%28VS.85%29.aspx">NtQuerySystemInformation</a> function. Although it is mentioned in the documentation, it is marked as unreliable for use, and the Windows SDK static library does not include it, so one has to obtain it explicitly using <a href="http://msdn.microsoft.com/en-us/library/ms683199%28VS.85%29.aspx">GetModuleHandle</a>/<a href="http://msdn.microsoft.com/en-us/library/ms683212%28VS.85%29.aspx">GetProcAddress</a>.</p>
<pre style="color: #000000; background: #ffffff;"><span style="color: #800000; font-weight: bold;">typedef</span> NTSTATUS <span style="color: #808030;">(</span><span style="color: #603000;">WINAPI</span> <span style="color: #808030;">*</span>NTQUERYSYSTEMINFORMATION<span style="color: #808030;">)</span><span style="color: #808030;">(</span>SYSTEM_INFORMATION_CLASS SystemInformationClass<span style="color: #808030;">,</span> <span style="color: #603000;">PVOID</span> SystemInformation<span style="color: #808030;">,</span> <span style="color: #603000;">ULONG</span> SystemInformationLength<span style="color: #808030;">,</span> <span style="color: #603000;">PULONG</span> ReturnLength<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
NTQUERYSYSTEMINFORMATION NtQuerySystemInformation <span style="color: #808030;">=</span> <span style="color: #808030;">(</span>NTQUERYSYSTEMINFORMATION<span style="color: #808030;">)</span> <span style="color: #400000;">GetProcAddress</span><span style="color: #808030;">(</span><span style="color: #400000;">GetModuleHandle</span><span style="color: #808030;">(</span>_T<span style="color: #808030;">(</span><span style="color: #800000;">"</span><span style="color: #0000e6;">ntdll.dll</span><span style="color: #800000;">"</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span> <span style="color: #800000;">"</span><span style="color: #0000e6;">NtQuerySystemInformation</span><span style="color: #800000;">"</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span>
ATLASSERT<span style="color: #808030;">(</span>NtQuerySystemInformation<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
ATLVERIFY<span style="color: #808030;">(</span>NtQuerySystemInformation<span style="color: #808030;">(</span><span style="color: #808030;">.</span><span style="color: #808030;">.</span><span style="color: #808030;">.</span><span style="color: #808030;">)</span> <span style="color: #808030;">=</span><span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span></pre>
<p>Once obtained, the function can provide SystemProcessInformation/SYSTEM_PROCESS_INFORMATION data about running processes.</p>
<p><span id="more-1182"></span>MSDN documents only part of the structure, and the information of interest is in the undocumented rest. The returned data is a list of variable-length SYSTEM_PROCESS_INFORMATION structures, each of which has a fixed part (partially documented on MSDN) and embeds a list of per-thread structures, one for every thread running within the process.</p>
<p>For the 32-bit Win32 platform, the layout of the structure is documented on the Internet, for example here: <a href="http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/System%20Information/Structures/SYSTEM_THREAD.html">SYSTEM_THREAD on Undocumented functions of NTDLL</a>.</p>
<p>For the 64-bit x64 (amd64) platform the access method works too; however, I had to adjust the structure layout so that it matches the actual data:</p>
<pre style="color: #000000; background: #ffffff;"><span style="color: #800000; font-weight: bold;">typedef</span> <span style="color: #800000; font-weight: bold;">struct</span> _SYSTEM_THREAD_INFORMATION <span style="color: #800080;">{</span>
    ULONGLONG KernelTime<span style="color: #800080;">;</span>
    ULONGLONG UserTime<span style="color: #800080;">;</span>
    ULONGLONG CreateTime<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> WaitTime<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> Reserved1<span style="color: #800080;">;</span>
    <span style="color: #603000;">PVOID</span> StartAddress<span style="color: #800080;">;</span>
    CLIENT_ID ClientId<span style="color: #800080;">;</span>
    KPRIORITY Priority<span style="color: #800080;">;</span>
    <span style="color: #603000;">LONG</span> BasePriority<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> ContextSwitchCount<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> State<span style="color: #800080;">;</span>
    KWAIT_REASON WaitReason<span style="color: #800080;">;</span>
<span style="color: #800080;">}</span> SYSTEM_THREAD_INFORMATION<span style="color: #808030;">,</span> <span style="color: #808030;">*</span>PSYSTEM_THREAD_INFORMATION<span style="color: #800080;">;</span>

<span style="color: #800000; font-weight: bold;">typedef</span> <span style="color: #800000; font-weight: bold;">struct</span> _SYSTEM_PROCESS_INFORMATION <span style="color: #800080;">{</span>
    <span style="color: #603000;">ULONG</span> NextEntryOffset<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> NumberOfThreads<span style="color: #800080;">;</span>
    ULONGLONG Reserved<span style="color: #808030;">[</span><span style="color: #008c00;">3</span><span style="color: #808030;">]</span><span style="color: #800080;">;</span>
    ULONGLONG CreateTime<span style="color: #800080;">;</span>
    ULONGLONG UserTime<span style="color: #800080;">;</span>
    ULONGLONG KernelTime<span style="color: #800080;">;</span>
    UNICODE_STRING ImageName<span style="color: #800080;">;</span>
    KPRIORITY BasePriority<span style="color: #800080;">;</span>
    <span style="color: #603000;">HANDLE</span> ProcessId<span style="color: #800080;">;</span>
    <span style="color: #603000;">HANDLE</span> InheritedFromProcessId<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> HandleCount<span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> Reserved2<span style="color: #808030;">[</span><span style="color: #008c00;">2</span><span style="color: #808030;">]</span><span style="color: #800080;">;</span>
    <span style="color: #603000;">ULONG</span> PrivatePageCount<span style="color: #800080;">;</span>  <span style="color: #696969;">// Garbage</span>
    VM_COUNTERS VirtualMemoryCounters<span style="color: #800080;">;</span>
    IO_COUNTERS IoCounters<span style="color: #800080;">;</span>
    SYSTEM_THREAD_INFORMATION Threads<span style="color: #808030;">[</span><span style="color: #008c00;">1</span><span style="color: #808030;">]</span><span style="color: #800080;">;</span>
<span style="color: #800080;">}</span> SYSTEM_PROCESS_INFORMATION<span style="color: #808030;">,</span> <span style="color: #808030;">*</span>PSYSTEM_PROCESS_INFORMATION<span style="color: #800080;">;</span></pre>
<p>Each thread structure contains the thread identifier and the related absolute number of context switches.</p>
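<p>The traversal of such a list (follow the next-entry offset from one process entry to the next, and read the per-thread records embedded in each entry) can be sketched in portable C++. The two structures below are deliberately simplified, hypothetical stand-ins, not the real SYSTEM_PROCESS_INFORMATION layout, and <em>AppendProcess</em> exists only to build sample data:</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Simplified, hypothetical stand-ins for the real structures: a fixed header
// followed by numberOfThreads thread records.
struct ProcessHeader {
    std::uint32_t nextEntryOffset;   // bytes to the next process entry, 0 for the last one
    std::uint32_t numberOfThreads;   // number of ThreadRecord items that follow
};
struct ThreadRecord {
    std::uint32_t threadId;
    std::uint32_t contextSwitchCount;
};

// Walks the buffer and returns (thread id, context switch count) pairs.
// Stops at the first record that would run past the end of the buffer.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
CollectContextSwitches(const std::vector<unsigned char>& buffer) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> result;
    std::size_t offset = 0;
    while (offset + sizeof(ProcessHeader) <= buffer.size()) {
        ProcessHeader header;
        std::memcpy(&header, buffer.data() + offset, sizeof header);
        std::size_t threadOffset = offset + sizeof header;
        for (std::uint32_t i = 0; i < header.numberOfThreads; ++i) {
            if (threadOffset + sizeof(ThreadRecord) > buffer.size()) return result;
            ThreadRecord thread;
            std::memcpy(&thread, buffer.data() + threadOffset, sizeof thread);
            result.emplace_back(thread.threadId, thread.contextSwitchCount);
            threadOffset += sizeof thread;
        }
        if (header.nextEntryOffset == 0) break;   // last entry in the list
        offset += header.nextEntryOffset;
    }
    return result;
}

// Appends one process entry in the same simplified layout (sample data only).
void AppendProcess(std::vector<unsigned char>& buffer,
                   const std::vector<ThreadRecord>& threads, bool last) {
    ProcessHeader header;
    header.numberOfThreads = static_cast<std::uint32_t>(threads.size());
    header.nextEntryOffset = last ? 0
        : static_cast<std::uint32_t>(sizeof header + threads.size() * sizeof(ThreadRecord));
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&header);
    buffer.insert(buffer.end(), bytes, bytes + sizeof header);
    for (const ThreadRecord& thread : threads) {
        bytes = reinterpret_cast<const unsigned char*>(&thread);
        buffer.insert(buffer.end(), bytes, bytes + sizeof thread);
    }
}
```

<p>With the real data the same loop shape applies, but the offsets and field positions have to come from the platform-specific structure layout.</p>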
<p>Also, the same information is visually available from <a href="http://technet.microsoft.com/en-us/sysinternals/bb896653">Process Explorer</a>:</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2011/03/Image003.png"><img class="alignnone size-medium wp-image-1184" title="Thread Context Switches on Process Explorer" src="http://alax.info/blog/wp-content/uploads/2011/03/Image003-320x320.png" alt="" width="320" height="320" /></a></p>
<p>A more complete code snippet to access the data from x64 code is <a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/EventSynchronizationTest01/EventSynchronizationTest01.cpp#ln18">here in a small application/project</a>, lines 18-144 within <em>#pragma region</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/1182/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thread synchronization and context switches</title>
		<link>http://alax.info/blog/1177</link>
		<comments>http://alax.info/blog/1177#comments</comments>
		<pubDate>Sat, 19 Mar 2011 16:41:15 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[ATL]]></category>
		<category><![CDATA[Source]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[context switch]]></category>
		<category><![CDATA[critical section]]></category>
		<category><![CDATA[event]]></category>
		<category><![CDATA[IPC]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[synchronization]]></category>
		<category><![CDATA[thread]]></category>
		<category><![CDATA[threading]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=1177</guid>
		<description><![CDATA[<a href="http://alax.info/blog/1177" title="Thread synchronization and context switches"></a>A basic task in thread synchronization is putting something on one thread and getting it out on another thread for further processing. Two or more threads of execution are accessing certain data, and in order to keep data consistent and &#8230;<p class="read-more"><a href="http://alax.info/blog/1177">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/1177" title="Thread synchronization and context switches"></a><p>A basic task in thread synchronization is putting something on one thread and getting it out on another thread for further processing. Two or more threads of execution access certain data, and in order to keep the data consistent the access is split into atomic operations which are only allowed for one thread at a time. Until one thread completes its operation, another is not allowed to touch the data and has to wait instead. This is what synchronization objects, and <a href="http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx">critical sections</a> in particular, are for. Furthermore, a thread which is waiting for data to become available has nothing to do, so it uses one of the <a href="http://msdn.microsoft.com/en-us/library/ms687069%28VS.85%29.aspx">wait functions</a> to avoid wasting CPU time, and both threads use <a href="http://msdn.microsoft.com/en-us/library/ms682655%28VS.85%29.aspx">event</a> or similar objects to send notifications and to wake up from the wait state on receiving them.</p>
<p>Let us see what the cost of doing things not quite right is. Let us take a <em>send</em> thread which generates data/events: it locks a shared resource and sets an event when something is done and requires another, <em>receive</em>, thread to wake up and take over. The send thread might be doing something like:</p>
<pre style="color: #000000; background: #ffffff;">CComCritSecLock<span style="color: #800080;">&lt;</span>CComAutoCriticalSection<span style="color: #800080;">&gt;</span> DataLock<span style="color: #808030;">(</span>m_CriticalSection<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
m_nSendCount<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #800080;">;</span>
ATLVERIFY<span style="color: #808030;">(</span>m_AvailabilityEvent<span style="color: #808030;">.</span>Set<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span></pre>
<p>And the receive thread will wait and take over like this:</p>
<pre style="color: #000000; background: #ffffff;">CComCritSecLock<span style="color: #800080;">&lt;</span>CComAutoCriticalSection<span style="color: #800080;">&gt;</span> DataLock<span style="color: #808030;">(</span>m_CriticalSection<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
m_nReceiveCount<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #800080;">;</span>
ATLVERIFY<span style="color: #808030;">(</span>m_AvailabilityEvent<span style="color: #808030;">.</span>Reset<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span></pre>
<p>Let us have three send threads and one receive thread running in parallel:</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2011/03/4d84bf5d-ba88-42d2-8815-4b700af9a3bf.png"><img class="alignnone size-large wp-image-1178" title="Test 1" src="http://alax.info/blog/wp-content/uploads/2011/03/4d84bf5d-ba88-42d2-8815-4b700af9a3bf-800x481.png" alt="" width="640" height="384" /></a></p>
<p>The simplicity is tempting; having run this for 60 seconds, the result is:</p>
<p><span id="more-1177"></span></p>
<pre>Send Count:        280,577,239
Receive Count:         703,238

Process Time:        User     14,539 ms, Kernel     60,746 ms
Send Thread 0 Time:  User      5,584 ms, Kernel     22,666 ms, Context Switches       5,449,926
Receive Thread Time: User        811 ms, Kernel      2,776 ms, Context Switches       8,836,263</pre>
<p>One of the send threads performed 280M cycles (note that the number is for one thread of the three, and the test code involves random delays, so the amount varies from execution to execution), while the receive thread could only do 703K. The huge difference is definitely related to the behavior around the critical section and thread wait time.</p>
<p>All four threads lock the critical section and make other threads wait. However, the send threads only set the event, while the receive thread waits for the event and wakes up on (that is, soon after) its setting. Once the receive thread wakes up, it tries to enter the critical section and acquire the lock. Still, the send threads set the event before leaving the critical section, so if things happen too fast and the receive thread tries to enter the locked critical section, it will fail, up to eventually giving its time slice away in a <a href="http://msdn.microsoft.com/en-us/library/ms682105%28VS.85%29.aspx">context switch</a>.</p>
<p>This explains the low number of iterations of the receive thread together with the rather high number of context switches.</p>
<p>To address this problem, let us modify the sending part to set the event from outside of the critical section lock scope.</p>
<pre style="color: #000000; background: #ffffff;"><span style="color: #800080;">{</span>
    CComCritSecLock<span style="color: #800080;">&lt;</span>CComAutoCriticalSection<span style="color: #800080;">&gt;</span> DataLock<span style="color: #808030;">(</span>m_CriticalSection<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
    m_nSendCount<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #800080;">;</span>
<span style="color: #800080;">}</span>
ATLVERIFY<span style="color: #808030;">(</span>m_AvailabilityEvent<span style="color: #808030;">.</span>Set<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span></pre>
<p>This makes sense too: the event itself does not need to be signaled from inside the protected fragment. The sending thread might do this a bit later, with a small but acceptable chance that another send thread will set the event between the original thread leaving the critical section and setting the event. From the point of view of the receive thread this means that sometimes it might enter the protected area with the result of work of 2+ threads, and sometimes the event is set again after all work is done and there is nothing more to do.</p>
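<p>The same idea can be sketched with standard C++ primitives. This is a portable analogue under stated assumptions, not the ATL code of the test project: <em>std::mutex</em> stands in for the critical section, <em>std::condition_variable</em> for the event, and <em>Channel</em>, <em>Send</em>, <em>Stop</em>, <em>ReceiveAll</em> and <em>RunTest</em> are hypothetical names.</p>

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Shared state: the mutex plays the role of the critical section,
// the condition variable the role of the availability event.
struct Channel {
    std::mutex mutex;
    std::condition_variable available;
    long pending = 0;       // items sent but not yet received
    bool stopped = false;
};

void Send(Channel& channel) {
    {
        std::lock_guard<std::mutex> lock(channel.mutex);
        ++channel.pending;
    }   // the lock is released here...
    channel.available.notify_one();   // ...and only then is the receiver woken
}

void Stop(Channel& channel) {
    {
        std::lock_guard<std::mutex> lock(channel.mutex);
        channel.stopped = true;
    }
    channel.available.notify_one();
}

// Receives until stopped and everything pending has been drained;
// returns the total number of items received.
long ReceiveAll(Channel& channel) {
    long received = 0;
    std::unique_lock<std::mutex> lock(channel.mutex);
    for (;;) {
        channel.available.wait(lock, [&] { return channel.pending > 0 || channel.stopped; });
        received += channel.pending;   // may hold the work of several senders
        channel.pending = 0;
        if (channel.stopped) return received;
    }
}

// Runs senderCount send threads against one receive thread.
long RunTest(int senderCount, int itemsPerSender) {
    Channel channel;
    long received = 0;
    std::thread receiver([&] { received = ReceiveAll(channel); });
    std::vector<std::thread> senders;
    for (int i = 0; i < senderCount; ++i)
        senders.emplace_back([&] { for (int j = 0; j < itemsPerSender; ++j) Send(channel); });
    for (std::thread& sender : senders) sender.join();
    Stop(channel);
    receiver.join();
    return received;
}
```

<p>Because the receiver drains everything pending on each wake-up, a notification that arrives when there is nothing left to do is harmless, which matches the behavior described above.</p>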
<p><a href="http://alax.info/blog/wp-content/uploads/2011/03/4d84c85a-608c-4586-92dc-42870af9a3bf.png"><img class="alignnone size-large wp-image-1179" title="Test 2" src="http://alax.info/blog/wp-content/uploads/2011/03/4d84c85a-608c-4586-92dc-42870af9a3bf-800x483.png" alt="" width="640" height="386" /></a></p>
<p>Result of execution is:</p>
<pre>Send Count:        233,620,244
Receive Count:      10,901,974

Process Time:        User     35,006 ms, Kernel     90,808 ms
Send Thread 0 Time:  User     10,654 ms, Kernel     27,190 ms, Context Switches       7,879,948
Receive Thread Time: User      3,338 ms, Kernel     12,012 ms, Context Switches       8,021,273
</pre>
<p>Apparently, the change had a positive effect on the receive thread, which was able to make 15 times more iterations.</p>
<p>After all, how does it compare to a scenario where no critical section is involved at all, that is, if synchronization is implemented using an alternate technique (such as, for example, <a href="http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx">interlocked variable access</a>)?</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2011/03/4d84d6c8-4bdc-49a0-b07e-24780afe86df.png"><img class="alignnone size-large wp-image-1180" title="Test 3" src="http://alax.info/blog/wp-content/uploads/2011/03/4d84d6c8-4bdc-49a0-b07e-24780afe86df-800x481.png" alt="" width="640" height="384" /></a></p>
<p>With the critical section removed, the result of execution is:</p>
<pre>Send Count:        246,133,642
Receive Count:      24,743,638

Process Time:        User     16,723 ms, Kernel    225,515 ms
Send Thread 0 Time:  User      4,461 ms, Kernel     56,519 ms, Context Switches           1,353
Receive Thread Time: User      3,354 ms, Kernel     56,019 ms, Context Switches          17,372</pre>
<p>About the same number of send thread iterations and twice as many on the receive thread. At the same time, the number of context switches dropped dramatically, which indicates that this way the threads do not have to fall into wait state and wait for one another. The threads do still wait for the stop event (with zero timeout, that is, they just check for it without any need to wait), and the receive thread keeps synchronizing on the availability event.</p>
<p>Having done that, let us do one more test to check the effect of entering/leaving a critical section which is never owned by another thread. If send threads keep running without locking the critical section, and the receive thread goes back to entering and leaving it (on its own only, without congestion from other threads), how much slower will the overall execution be?</p>
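<p>As a rough illustration of lock-free counting, here is a portable sketch that uses <em>std::atomic</em> in place of the Win32 interlocked functions. That substitution is an assumption made for portability, and <em>CountWithAtomics</em> is a hypothetical name rather than code from the test project:</p>

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Increments a shared counter from several threads without any lock.
// Returns the final value, which is threadCount * incrementsPerThread.
long CountWithAtomics(int threadCount, int incrementsPerThread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([&counter, incrementsPerThread] {
            for (int j = 0; j < incrementsPerThread; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);   // never enters a wait state
        });
    }
    for (std::thread& thread : threads) thread.join();
    return counter.load();
}
```

<p>No thread ever blocks on another here, so an increment by itself gives the scheduler no reason to switch contexts.</p>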
<p><a href="http://alax.info/blog/wp-content/uploads/2011/03/4d84d8ea-4600-4a15-9e0f-423f0af9a3bf.png"><img class="alignnone size-large wp-image-1181" title="Test 4" src="http://alax.info/blog/wp-content/uploads/2011/03/4d84d8ea-4600-4a15-9e0f-423f0af9a3bf-800x480.png" alt="" width="640" height="384" /></a></p>
<p>The results show that without concurrent access to the critical section, no extra context switches take place and the execution is similar to the one in previous test 3:</p>
<pre>Send Count:        213,333,676
Receive Count:      18,542,098

Process Time:        User     15,022 ms, Kernel    227,496 ms
Send Thread 0 Time:  User      3,322 ms, Kernel     57,626 ms, Context Switches           2,786
Receive Thread Time: User      3,510 ms, Kernel     56,316 ms, Context Switches          10,616</pre>
<p>Some conclusions:</p>
<ul>
<li>setting an event for a shared resource while still holding a lock on it might be pretty expensive</li>
<li><a href="http://msdn.microsoft.com/en-us/library/ms682530(VS.85).aspx">critical sections</a> are quite lightweight as promised and add minimal overhead unless congestion takes place</li>
<li>critical sections might still create a bottleneck in case of intensive use on multiple threads, in which case alternatives such as <a href="http://msdn.microsoft.com/en-us/library/aa904937(VS.85).aspx">slim reader/writer locks</a> and <a href="http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx">interlocked variable access</a> are worth considering</li>
<li>the number of a thread&#8217;s <a href="http://msdn.microsoft.com/en-us/library/ms682105%28VS.85%29.aspx">context switches</a> is a good indicator of execution congestion</li>
</ul>
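<p>For the reader/writer alternative mentioned in the list, a portable sketch with C++17 <em>std::shared_mutex</em> is given below. It is used here as a stand-in for a slim reader/writer lock, and the <em>Settings</em> class is purely hypothetical:</p>

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical settings store: many readers may hold the lock at once,
// while a writer holds it exclusively.
class Settings {
public:
    void Set(const std::string& name, int value) {
        std::unique_lock<std::shared_mutex> lock(m_mutex);   // exclusive access
        m_values[name] = value;
    }
    // Returns the stored value, or fallback if the name is unknown.
    int Get(const std::string& name, int fallback) const {
        std::shared_lock<std::shared_mutex> lock(m_mutex);   // shared access
        const auto it = m_values.find(name);
        return it != m_values.end() ? it->second : fallback;
    }
private:
    mutable std::shared_mutex m_mutex;
    std::map<std::string, int> m_values;
};
```

<p>This pays off mostly when reads heavily outnumber writes; for a plain counter an atomic variable is likely the simpler choice.</p>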
<p>Good luck with further research on synchronization and performance and here is <a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/EventSynchronizationTest01">source code for the test</a>, a <a href="http://www.microsoft.com/visualstudio/">Visual Studio</a> 2010 C++ project.</p>
<p>Obtaining information on amount of context switches for x64 code deserves a separate post, and the code is <a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/EventSynchronizationTest01/EventSynchronizationTest01.cpp#ln18">here</a>, lines 18-144 within <em>#pragma region</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/1177/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ProcessSnapshot: Create process minidump for post-mortem debugging</title>
		<link>http://alax.info/blog/1119</link>
		<comments>http://alax.info/blog/1119#comments</comments>
		<pubDate>Wed, 24 Mar 2010 22:17:42 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Utilities]]></category>
		<category><![CDATA[.DMP]]></category>
		<category><![CDATA[ATL]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[minidump]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[process]]></category>
		<category><![CDATA[snapshot]]></category>
		<category><![CDATA[Source]]></category>
		<category><![CDATA[utility]]></category>
		<category><![CDATA[WTL]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=1119</guid>
		<description><![CDATA[<a href="http://alax.info/blog/1119" title="ProcessSnapshot: Create process minidump for post-mortem debugging"></a>ProcessSnapshot is a utility to take a snapshot of process call stacks, and the snapshot taken is written into a human friendly text file. Additionally to this, the utility has been given a capability to create process minidump files, on &#8230;<p class="read-more"><a href="http://alax.info/blog/1119">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/1119" title="ProcessSnapshot: Create process minidump for post-mortem debugging"></a><p><a href="http://alax.info/blog/665">ProcessSnapshot</a> is a utility to take a snapshot of process call stacks, and the snapshot taken is written into a human friendly text file.</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2010/03/24-Image001.png"><img class="alignnone size-medium wp-image-1120" title="ProcessSnapshot is taking process minidump files" src="http://alax.info/blog/wp-content/uploads/2010/03/24-Image001-320x189.png" alt="ProcessSnapshot is taking process minidump files" width="320" height="189" /></a></p>
<p>In addition, the utility has been given the capability to create process <a href="http://msdn.microsoft.com/en-us/library/ms680369%28VS.85%29.aspx">minidump files</a> on user request. The minidump files can be used with a debugger to analyze the context of the process in a feature-rich debug environment, esp. Microsoft Visual Studio. To create a minidump for a process, check the corresponding box and press the &#8220;Take a Dump&#8221; button. A file named &#8220;&lt;process-image-name&gt; &#8211; &lt;date&gt; &lt;time&gt;.dmp&#8221; will be created in the directory of the utility executable.</p>
<p>See also:</p>
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/ms680369%28VS.85%29.aspx">Minidump Files (MSDN)</a></li>
<li><a href="http://support.microsoft.com/kb/315263">How to read the small memory dump files that Windows creates for debugging</a></li>
<li><a href="http://www.codeproject.com/KB/debug/postmortemdebug_standalone1.aspx">Post-Mortem Debugging Your Application with Minidumps and Visual Studio .NET</a></li>
<li><a href="http://www.pchell.com/support/minidumps.shtml">How to View Windows Minidump Files</a></li>
</ul>
<p>A binary [<a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/ProcessSnapshot/Win32/Release/ProcessSnapshot.exe?format=raw">Win32</a>, <a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/ProcessSnapshot/x64/Release/ProcessSnapshot.exe?format=raw">x64</a>] and partial Visual C++ .NET 2008 source code <a href="http://trac2.assembla.com/roatl-utilities/browser/trunk/ProcessSnapshot/Release/ProcessSnapshot.exe">are available from SVN</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/1119/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Sharing Memory Allocators while at the same time Handling Dynamic Media Type Changes</title>
		<link>http://alax.info/blog/981</link>
		<comments>http://alax.info/blog/981#comments</comments>
		<pubDate>Thu, 16 Jul 2009 06:09:57 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[Seriously]]></category>
		<category><![CDATA[allocation]]></category>
		<category><![CDATA[DirectShow]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[IMemAllocator]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[sharing]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=981</guid>
		<description><![CDATA[<a href="http://alax.info/blog/981" title="Sharing Memory Allocators while at the same time Handling Dynamic Media Type Changes"></a>Sharing memory allocators between input and output pins is an important concept to keep performance of filter graph. Unlike more frequent scenario with different allocators, a filter (referred to as &#8220;middle filter&#8221; below) which has equal media types on input &#8230;<p class="read-more"><a href="http://alax.info/blog/981">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/981" title="Sharing Memory Allocators while at the same time Handling Dynamic Media Type Changes"></a><p>Sharing memory allocators between input and output pins is an important concept for keeping up the performance of a filter graph. Unlike the more frequent scenario with different allocators, a filter (referred to as &#8220;middle filter&#8221; below) which has equal media types on its input and output pins can avoid a memory-to-memory copy operation for every frame processed, by delivering downstream the buffer obtained from an upstream filter. With high resolution video, at a high rate, and with multiple streams running simultaneously, this is an expense one would try to avoid for performance reasons.</p>
<p>Memory allocators are (or can be) shared by well known filters, such as <a href="http://msdn.microsoft.com/en-us/library/dd377544%28VS.85%29.aspx">Sample Grabber Filter</a>, <a href="http://msdn.microsoft.com/en-us/library/dd390336%28VS.85%29.aspx">Infinite Tee Pin Filter</a> and in-place transformation base filters (<a href="http://msdn.microsoft.com/en-us/library/dd388191%28VS.85%29.aspx">CTransInPlaceFilter Class</a>).</p>
<p>Still handling <a href="http://msdn.microsoft.com/en-us/library/dd388799%28VS.85%29.aspx">Dynamic Format Changes</a> (not only from video renderer filter) filters that share memory allocators may run into the problem of being notified of media type change. Because allocator are typically owned by another filter (e.g. <a href="http://msdn.microsoft.com/en-us/library/dd407343%28VS.85%29.aspx">Video Mixing Renderer Filter</a>) and originally its buffer is queried by an upstream filter, the upstream filter obtains allocated buffer independently from the middle filter that shares memory allocators. If the upstream filter decides to never deliver this buffer, however the buffer has a media type attached (see <a href="http://msdn.microsoft.com/en-us/library/dd373499%28VS.85%29.aspx">AM_SAMPLE2_PROPERTIES::pMediaType</a>), there is no way for the middle filter to learn about dynamic format change completed.</p>
<p>As a workaround for handling <a href="http://msdn.microsoft.com/en-us/library/dd388901%28VS.85%29.aspx">Format Changes from the Video Renderer</a>, when resolution is not changed and it is only stride which might be extended, middle filter might be checking data size in lActual field and learn about the change from an increase in this value.</p>
<p>To be reliably notified of a media type change, the middle filter has to take extra measures while sharing the allocator. Instead of using the raw allocator obtained from one pin on another pin (typically the output pin&#8217;s allocator is used on the input pin), the middle filter may use an internal proxy object which implements the <a href="http://msdn.microsoft.com/en-us/library/dd407061%28VS.85%29.aspx">IMemAllocator</a> interface and forwards calls to the internal, originally obtained IMemAllocator. In addition, the proxy can check for an attached media type on every buffer taken from the allocator. Once a change is noticed &#8211; at the moment the upstream filter requests the buffer &#8211; the proxy has a timely chance to remember the new media type, so that the following <a href="http://msdn.microsoft.com/en-us/library/dd407077%28VS.85%29.aspx">IMemInputPin::Receive</a> call can check for it in case the upstream filter decided not to deliver the buffer with the attached media type.</p>
<pre><span style="color: #800000; font-weight: bold;">if</span><span style="color: #808030;">(</span>IsSharingMemAllocators<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span>
<span style="color: #800080;">{</span>
    <span style="color: #696969;">// ...</span>
    ATLASSERT<span style="color: #808030;">(</span><span style="color: #808030;">(</span>InputMediaSampleProperties<span style="color: #808030;">.</span>pMediaType <span style="color: #808030;">!</span><span style="color: #808030;">=</span> <span style="color: #7d0045;">NULL</span><span style="color: #808030;">)</span> <span style="color: #808030;">^</span> <span style="color: #808030;">!</span><span style="color: #808030;">(</span>InputMediaSampleProperties<span style="color: #808030;">.</span>dwSampleFlags <span style="color: #808030;">&amp;</span> AM_SAMPLE_TYPECHANGED<span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span>
    <span style="color: #800080;">{</span>
        CRoCriticalSectionLock DataLock<span style="color: #808030;">(</span>GetDataCriticalSection<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span>
        <span style="color: #800000; font-weight: bold;">const</span> CObjectPtr<span style="color: #800080;">&lt;</span>CProxyMemAllocator<span style="color: #800080;">&gt;</span><span style="color: #808030;">&amp;</span> pInputProxyMemAllocator <span style="color: #808030;">=</span> m_pInputPin<span style="color: #808030;">-</span><span style="color: #808030;">&gt;</span>GetProxyMemAllocatorReference<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #800080;">;</span>
        CMediaType pMediaType<span style="color: #800080;">;</span>
        <span style="color: #800000; font-weight: bold;">if</span><span style="color: #808030;">(</span>pInputProxyMemAllocator <span style="color: #808030;">&amp;</span><span style="color: #808030;">&amp;</span> pInputProxyMemAllocator<span style="color: #808030;">-</span><span style="color: #808030;">&gt;</span>GetDynamicallyChangedMediaType<span style="color: #808030;">(</span>pMediaType<span style="color: #808030;">,</span> TRUE<span style="color: #808030;">)</span><span style="color: #808030;">)</span>
        <span style="color: #800080;">{</span>
            m_pInputPin<span style="color: #808030;">-</span><span style="color: #808030;">&gt;</span>SetMediaType<span style="color: #808030;">(</span>pMediaType<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
            m_pOutputPin<span style="color: #808030;">-</span><span style="color: #808030;">&gt;</span>SetMediaType<span style="color: #808030;">(</span>pMediaType<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
            <span style="color: #696969;">// ...</span>
        <span style="color: #800080;">}</span>
    <span style="color: #800080;">}</span>
    <span style="color: #800000; font-weight: bold;">if</span><span style="color: #808030;">(</span>InputMediaSampleProperties<span style="color: #808030;">.</span>pMediaType<span style="color: #808030;">)</span>
    <span style="color: #800080;">{</span>
        m_pInputPin<span style="color: #808030;">-</span><span style="color: #808030;">&gt;</span>SetMediaType<span style="color: #808030;">(</span>InputMediaSampleProperties<span style="color: #808030;">.</span>pMediaType<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
        m_pOutputPin<span style="color: #808030;">-</span><span style="color: #808030;">&gt;</span>SetMediaType<span style="color: #808030;">(</span>InputMediaSampleProperties<span style="color: #808030;">.</span>pMediaType<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
        <span style="color: #696969;">// ...</span>
    <span style="color: #800080;">}</span>
    DeliverMediaSample<span style="color: #808030;">(</span>pMemInputPin<span style="color: #808030;">,</span> pInputMediaSample<span style="color: #808030;">)</span><span style="color: #800080;">;</span>
<span style="color: #800080;">}</span></pre>
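<p>The proxy idea itself can be modelled without any DirectShow types. The sketch below is a simplified, portable illustration (C++17), not the actual CProxyMemAllocator: Sample, Allocator, ProxyAllocator and GetChangedMediaType are hypothetical stand-ins, a media type is reduced to a string, and a real implementation would also have to forward the remaining IMemAllocator methods and guard its state with a lock:</p>
<pre>#include &lt;memory&gt;
#include &lt;optional&gt;
#include &lt;string&gt;

// Stand-in for a media sample; a real one would carry an AM_MEDIA_TYPE.
struct Sample
{
    std::optional&lt;std::string&gt; mediaType; // set only when the format changes
};

// Stand-in for the part of IMemAllocator this sketch needs.
struct Allocator
{
    virtual ~Allocator() = default;
    virtual std::unique_ptr&lt;Sample&gt; GetBuffer() = 0;
};

// Forwards GetBuffer to the real allocator and remembers any attached type.
class ProxyAllocator : public Allocator
{
public:
    explicit ProxyAllocator(Allocator&amp; inner) : m_inner(inner) {}

    std::unique_ptr&lt;Sample&gt; GetBuffer() override
    {
        std::unique_ptr&lt;Sample&gt; sample = m_inner.GetBuffer();
        if (sample &amp;&amp; sample-&gt;mediaType)
            m_changedType = sample-&gt;mediaType; // noticed at request time
        return sample;
    }

    // Copies out the remembered type, if any; optionally forgets it afterwards.
    bool GetChangedMediaType(std::string&amp; mediaType, bool reset)
    {
        if (!m_changedType)
            return false;
        mediaType = *m_changedType;
        if (reset)
            m_changedType.reset();
        return true;
    }

private:
    Allocator&amp; m_inner;                       // allocator owned by another filter
    std::optional&lt;std::string&gt; m_changedType; // last type seen on a buffer
};</pre>
<p>Even if the upstream side drops the buffer that carried the new type, the next delivered sample still finds the remembered type in the proxy, which is what the Receive code above relies on.</p>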
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/981/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ProcessSnapshot to take a snapshot of process modules, threads, stacks and performance</title>
		<link>http://alax.info/blog/665</link>
		<comments>http://alax.info/blog/665#comments</comments>
		<pubDate>Tue, 23 Dec 2008 18:30:43 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[ATL]]></category>
		<category><![CDATA[Seriously]]></category>
		<category><![CDATA[Utilities]]></category>
		<category><![CDATA[WTL]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[process]]></category>
		<category><![CDATA[snapshot]]></category>
		<category><![CDATA[Source]]></category>
		<category><![CDATA[utility]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=665</guid>
		<description><![CDATA[<a href="http://alax.info/blog/665" title="ProcessSnapshot to take a snapshot of process modules, threads, stacks and performance"></a>While troubleshooting released application on remote production site, it is very useful to grasp a state of the process for further analysis. There are several scenarios in which the following information about process state is helpful: modules (DLLs) loaded into &#8230;<p class="read-more"><a href="http://alax.info/blog/665">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/665" title="ProcessSnapshot to take a snapshot of process modules, threads, stacks and performance"></a><p>While troubleshooting a released application on a remote production site, it is very useful to capture the state of the process for further analysis. There are several scenarios in which the following information about process state is helpful:</p>
<ul>
<li>modules (DLLs) loaded into process and their versions</li>
<li>threads and their call stacks</li>
<li>process and thread performance</li>
</ul>
<p>The ProcessSnapshot utility takes advantage of <a href="http://www.microsoft.com/whdc/devtools/debugging/default.mspx">Debugging Tools API</a> (<a href="http://msdn.microsoft.com/en-us/library/ms679294.aspx">dbghelp.dll</a> &#8211; note the dialog also displays the DLL version in the bottom right corner) and writes this information to a text file. It can also take a sequence of snapshots so that thread performance and/or stacks can be compared and the differences checked.</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/10/14-image001.png"><img class="alignnone size-medium wp-image-666" title="Process Snapshot" src="http://alax.info/blog/wp-content/uploads/2008/10/14-image001-300x175.png" alt="" width="300" height="175" /></a></p>
<p>The generated file is placed in the directory of the utility application and looks like this:</p>
<pre>Snapshot
  System Time: 10/14/2008 8:46:33 PM
  Local Time: 10/14/2008 11:46:33 PM

Performance
  Creation System Time: 10/14/2008 8:46:28 PM
  Kernel Time: 0.094 s
  User Time: 0.031 s

Modules

  Module: ProcessSnapshot.exe @00400000
    Base Address: 0x00400000
    Base Size: 0x0005b000 (372736)
    Name: ProcessSnapshot.exe
    Path: D:\Projects\Utilities\ProcessSnapshot\Release\ProcessSnapshot.exe
    Product Version: 1.0.0.1
    File Version: 1.0.0.125

  Module: ntdll.dll @7c900000
    Base Address: 0x7c900000
    Base Size: 0x000af000 (716800)
    Name: ntdll.dll
    Path: C:\WINDOWS\system32\ntdll.dll
    Product Version: 5.1.2600.5512
    File Version: 5.1.2600.5512
[...]

Threads

  Thread: 3824
    Base Priority: 8
    Creation System Time: 10/14/2008 8:46:57 PM
    Kernel Time: 0.063 s
    User Time: 0.031 s
    Call Stack
      ntdll!7c90e4f4 KiFastSystemCallRet (+ 0) @7c900000
      USER32!7e4249c4 GetCursorFrameInfo (+ 460) @7e410000
      USER32!7e424a06 DialogBoxIndirectParamAorW (+ 54) @7e410000
      USER32!7e4247ea DialogBoxParamW (+ 63) @7e410000
      ProcessSnapshot!00403f45 ATL::CDialogImpl&lt;CMainDialog,ATL::CWindow&gt;::DoModal (+ 67) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlwin.h, 3478] (+ 28) @00400000
      ProcessSnapshot!00403b6f CProcessSnapshotModule::RunMessageLoop (+ 74) [d:\projects\utilities\processsnapshot\processsnapshot.cpp, 67] (+ 0) @00400000
      ProcessSnapshot!004049b9 ATL::CAtlExeModuleT&lt;CProcessSnapshotModule&gt;::Run (+ 17) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlbase.h, 3552] (+ 0) @00400000
      ProcessSnapshot!004041c3 ATL::CAtlExeModuleT&lt;CProcessSnapshotModule&gt;::WinMain (+ 48) [c:\program files\microsoft visual studio 9.0\vc\atlmfc\include\atlbase.h, 3364] (+ 5) @00400000
      ProcessSnapshot!00434477 wWinMain (+ 5) [*d:\projects\utilities\processsnapshot\release\processsnapshot.inj:5, 14] (+ 0) @00400000
      ProcessSnapshot!00415058 __tmainCRTStartup (+ 274) [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c, 263] (+ 27) @00400000
      !00360033</pre>
<p><span id="more-665"></span></p>
<p>How exactly can this facilitate troubleshooting problems with software? Here are several scenarios:</p>
<ul>
<li>the application shows an unexpected error message and it is desired to find out the position and call stack</li>
<li>the application deadlocks and call stacks are required for further troubleshooting</li>
<li>the application maxes out CPU load on one of the cores and the thread needs to be identified</li>
<li>the application runs slowly and the bottleneck thread is to be found</li>
<li>the application loads an undesired third party module (or otherwise has it mapped into the process, esp. antivirus software, or a DLL hosting an undesired DirectShow filter) or a module with an improper version</li>
</ul>
<p>In all of the scenarios mentioned above, the snapshot is very helpful for troubleshooting, profiling and fixing.</p>
<p>Update 23-Dec-2008. The application auto-enables <a href="http://msdn.microsoft.com/en-us/library/bb530716(VS.85).aspx">SeDebugPrivilege</a> (SE_DEBUG_NAME) so that a snapshot can also be taken from processes such as service processes.</p>
<p>A binary [<a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/ProcessSnapshot/Win32/Release/ProcessSnapshot.exe?format=raw">Win32</a>, <a href="http://www.assembla.com/code/roatl-utilities/subversion/nodes/trunk/ProcessSnapshot/x64/Release/ProcessSnapshot.exe?format=raw">x64</a>] and Visual C++ .NET 2008 source code <a href="http://trac2.assembla.com/roatl-utilities/browser/trunk/ProcessSnapshot/Release/ProcessSnapshot.exe">are available from SVN</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/665/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Performance</title>
		<link>http://alax.info/blog/707</link>
		<comments>http://alax.info/blog/707#comments</comments>
		<pubDate>Wed, 19 Nov 2008 08:30:50 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[Seriously]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=707</guid>
		<description><![CDATA[<a href="http://alax.info/blog/707" title="Performance"></a>&#62; How do you test performance? I don&#8217;t. I just believe in it. This is actually what we have here but still we have managed to deliver software that gives more frames per second than rivals. Why? We hopefully knew &#8230;<p class="read-more"><a href="http://alax.info/blog/707">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/707" title="Performance"></a><blockquote><p>&gt; How do you test performance?</p>
<p>I don&#8217;t. I just believe in it.</p></blockquote>
<p>This is actually what we have here, and still we have managed to deliver software that gives more frames per second than rivals. Why? Hopefully because we knew what we were doing in the first place. According to one of our partner hardware vendors, there are only two software packages which can render multiple megapixel video feeds at the rates cameras can provide: ours, and another one whose track leads to sources in Eastern Europe&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/707/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An effect of excessive RGB conversion onto video streaming performance (continued)</title>
		<link>http://alax.info/blog/551</link>
		<comments>http://alax.info/blog/551#comments</comments>
		<pubDate>Mon, 18 Aug 2008 09:00:07 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[Source]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Video]]></category>
		<category><![CDATA[DirectShow]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[RGB]]></category>
		<category><![CDATA[YUV]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=551</guid>
		<description><![CDATA[<a href="http://alax.info/blog/551" title="An effect of excessive RGB conversion onto video streaming performance (continued)"></a>This continues the topic raised by previous post. As fairly noticed by The March Hare, video renderer is using hardware overlay and the benchmark is incorrect if we are to extrapolate the performance to scenario with multiple video renderers. So, &#8230;<p class="read-more"><a href="http://alax.info/blog/551">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/551" title="An effect of excessive RGB conversion onto video streaming performance (continued)"></a><p>This continues the topic raised by the <a href="http://alax.info/blog/538">previous post</a>. As fairly noted by <a href="http://tmhare.mvps.org/help.htm">The March Hare</a>, the video renderer uses a hardware overlay, so the benchmark is incorrect if we are to extrapolate the performance to a scenario with multiple video renderers.</p>
<p>So, an updated test application creates 16 video renderers, with 16 threads pumping two media samples through each of the 16 filter graphs.</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/08/18-image001.png"><img class="alignnone size-medium wp-image-552" title="16 VMRs" src="http://alax.info/blog/wp-content/uploads/2008/08/18-image001-300x203.png" alt="" width="300" height="203" /></a></p>
<p>The screenshot shows that only one video overlay is in use (its image was not captured and blackness is shown instead), so results may be inaccurate for one of the 16 graphs. In this simple test I disregard this.</p>
<p>Here are the results (in all tests CPU usage is maxed out):</p>
<ul>
<li>YUY2 Source -&gt; VMR: <span style="color: #0000ff;"><strong>3,480 fps</strong></span></li>
<li>YUY2 Source -&gt; AVI Decompressor (converts to 24-bit RGB) -&gt; Sample Grabber (without processing) -&gt; Color Space Converter (converts to 32-bit RGB) -&gt; VMR: <span style="color: #0000ff;"><strong>560 fps</strong></span></li>
<li>YUY2 Source -&gt; AVI Decompressor (converts to 32-bit RGB) -&gt; Color Space Converter -&gt; VMR: <span style="color: #0000ff;"><strong>390 fps</strong></span></li>
</ul>
<p><span id="more-551"></span></p>
<p>For a comprehensive study, let us also measure direct rendering of 24-bit and 32-bit RGB data the way we stream YUY2:</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/08/18-image002.png"><img class="alignnone size-medium wp-image-556" title="24-bit RGB to VMR" src="http://alax.info/blog/wp-content/uploads/2008/08/18-image002-300x182.png" alt="" width="300" height="182" /></a> <a href="http://alax.info/blog/wp-content/uploads/2008/08/18-image003.png"><img class="alignnone size-medium wp-image-557" title="32-bit RGB to VMR" src="http://alax.info/blog/wp-content/uploads/2008/08/18-image003-300x182.png" alt="" width="300" height="182" /></a></p>
<p>It is worth mentioning that the <a href="http://msdn.microsoft.com/en-us/library/ms787917(VS.85).aspx">Video Mixing Renderer Filter</a> does not like 24-bit RGB on input, and intelligent connect auto-inserts an additional <a href="http://msdn.microsoft.com/en-us/library/ms781972(VS.85).aspx">Color Space Converter Filter</a> to convert 24-bit RGB into 32-bit RGB. This explains the better performance with 32-bit RGB.</p>
<ul>
<li>24-bit RGB Source -&gt; Color Space Converter (converts to 32-bit RGB) -&gt; VMR: <span style="color: #0000ff;"><strong>1,175 fps</strong></span></li>
<li>32-bit RGB Source -&gt; VMR: <span style="color: #0000ff;"><strong>1,660 fps</strong></span></li>
</ul>
<p>Another popular YUV pixel format is <a href="http://fourcc.org/yuv.php#YV12">YV12</a>, which is also very important because it is the original output format of <a href="http://en.wikipedia.org/wiki/MPEG-4">MPEG-4</a> and <a href="http://en.wikipedia.org/wiki/H.264">H.264/MPEG-4 AVC</a> decoders. It is also normally supported natively by hardware:</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/08/18-image004.png"><img class="alignnone size-medium wp-image-563" title="DXCapsViewer FourCC Codes" src="http://alax.info/blog/wp-content/uploads/2008/08/18-image004-300x237.png" alt="" width="300" height="237" /></a></p>
<p>Performance tests repeat the <a href="http://fourcc.org/yuv.php#YUY2">YUY2</a> pattern, with higher frame rates for <a href="http://fourcc.org/yuv.php#YV12">YV12</a> because of its more compact 12 bits per pixel format (as opposed to 16 bits per pixel for <a href="http://fourcc.org/yuv.php#YUY2">YUY2</a>):</p>
<ul>
<li>YV12 Source -&gt; VMR: <span style="color: #0000ff;"><strong>4,780 fps</strong></span></li>
<li>YV12 Source -&gt; AVI Decompressor (converts to 24-bit RGB) -&gt; Sample Grabber (without processing) -&gt; Color Space Converter (converts to 32-bit RGB) -&gt; VMR: <span style="color: #0000ff;"><strong>900 fps</strong></span></li>
<li>YV12 Source -&gt; AVI Decompressor (converts to 32-bit RGB) -&gt; Color Space Converter -&gt; VMR: <span style="color: #0000ff;"><strong>1,510 fps</strong></span></li>
</ul>
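<p>The size argument can be checked with simple arithmetic. The sketch below assumes the 640&#215;480 frame size from the previous post (this post does not restate the resolution) and ignores stride padding; FrameBytes and DataRate are hypothetical helper names:</p>
<pre>#include &lt;cstdint&gt;

// Bytes in one uncompressed frame, ignoring any stride padding.
constexpr std::int64_t FrameBytes(int width, int height, int bitsPerPixel)
{
    return static_cast&lt;std::int64_t&gt;(width) * height * bitsPerPixel / 8;
}

// Bytes of pixel data streamed per second at the given frame rate.
constexpr std::int64_t DataRate(std::int64_t frameBytes, int framesPerSecond)
{
    return frameBytes * framesPerSecond;
}

// Assumed frame size: 640x480, as in the previous post.
static_assert(FrameBytes(640, 480, 12) == 460800, "YV12, 12 bits per pixel");
static_assert(FrameBytes(640, 480, 16) == 614400, "YUY2, 16 bits per pixel");</pre>
<p>Under that assumption, 4,780 fps of YV12 and 3,480 fps of YUY2 both come to roughly 2.1 to 2.2 GB of pixel data per second, which is consistent with the frame rate difference coming from the smaller frame.</p>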
<p>The difference between <a href="http://fourcc.org/yuv.php#YV12">YV12</a> and <a href="http://fourcc.org/yuv.php#YUY2">YUY2</a> is that the scenario with the <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a> removed shows [expectedly] higher frame rates than the one with the 24-bit RGB sample grabber.</p>
<p>Reference source code (will require additional headers to compile): <a href="http://alax.info/blog/wp-content/uploads/2008/08/frameratesample0203.zip">FrameRateSample02.03.zip</a> (note that Release build binary is included)</p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/551/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>An effect of excessive RGB conversion onto video streaming performance</title>
		<link>http://alax.info/blog/538</link>
		<comments>http://alax.info/blog/538#comments</comments>
		<pubDate>Sat, 16 Aug 2008 22:31:57 +0000</pubDate>
		<dc:creator>Roman</dc:creator>
				<category><![CDATA[Source]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Video]]></category>
		<category><![CDATA[DirectShow]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[RGB]]></category>
		<category><![CDATA[YUV]]></category>

		<guid isPermaLink="false">http://alax.info/blog/?p=538</guid>
		<description><![CDATA[<a href="http://alax.info/blog/538" title="An effect of excessive RGB conversion onto video streaming performance"></a>Started here: How can I overlay timestamp on the image? on microsoft.public.win32.programmer.directx.video Let us see if RGB conversion adds any noticeable effect on streaming YUY2 video, typical output of video decompressor. As a reference I am taking a simple YUY2 &#8230;<p class="read-more"><a href="http://alax.info/blog/538">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://alax.info/blog/538" title="An effect of excessive RGB conversion onto video streaming performance"></a><p>Started here: <a href="http://groups.google.com/group/microsoft.public.win32.programmer.directx.video/browse_thread/thread/6995bdadae24a279?lnk=igtc#">How can I overlay timestamp on the image? on microsoft.public.win32.programmer.directx.video</a></p>
<p>Let us see if RGB conversion has any noticeable effect on streaming <a href="http://fourcc.org/yuv.php#YUY2">YUY2</a> video, a typical output of a video decompressor.</p>
<p>As a reference I am taking a simple YUY2 source -&gt; <a href="http://msdn.microsoft.com/en-us/library/ms787917(VS.85).aspx">Video Mixing Renderer Filter</a> (VMR) graph, where the source filter streams the same pre-allocated and pre-initialized data in an infinite loop:</p>
<pre>while(WaitForSingleObject(TerminationEvent, 0) == WAIT_TIMEOUT)
{
	ATLENSURE_SUCCEEDED(m_pSourceFilter-&gt;InjectMediaSample(m_pnData, m_nDataSize));
	CRoCriticalSectionLock DataLock(m_DataCriticalSection);
	m_pnInjectedFrameCounts[0]++;
}</pre>
<p>Video resolution is 640&#215;480 pixels.</p>
<p>What actually consumes CPU resources here is the data copy into the VMR&#8217;s media sample buffer and the streaming itself. The VMR might block the call while waiting for rendering to complete; I am leaving this for the default VMR to decide (it might be hardware dependent etc.).</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/08/17-image001.png"><img class="alignnone size-medium wp-image-539" title="YUY2 to VMR" src="http://alax.info/blog/wp-content/uploads/2008/08/17-image001-300x182.png" alt="" width="300" height="182" /></a></p>
<p>Running at full pace, the application renders 510 frames per second while consuming virtually no CPU. That is, the VMR waits until a media sample is rendered, which limits streaming to the mentioned number of media samples per second; the rendering itself does not take CPU resources, the thread is just waiting for the video hardware to complete.</p>
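<p>A rate like this can be derived from a running frame counter such as the one incremented in the loop above. The following is a minimal sketch using std::chrono; FrameRateMeter is a hypothetical helper and not the code the test application actually uses:</p>
<pre>#include &lt;chrono&gt;
#include &lt;cstdint&gt;

// Turns a monotonically growing frame counter into a frames-per-second figure.
class FrameRateMeter
{
public:
    using Clock = std::chrono::steady_clock;

    // Records the starting point of a measurement interval.
    void Start(std::uint64_t frameCount, Clock::time_point now = Clock::now())
    {
        m_startCount = frameCount;
        m_startTime = now;
    }

    // Average rate since Start; returns 0 if no time has elapsed.
    double Rate(std::uint64_t frameCount, Clock::time_point now = Clock::now()) const
    {
        const std::chrono::duration&lt;double&gt; elapsed = now - m_startTime;
        if (elapsed.count() &lt;= 0)
            return 0.0;
        return static_cast&lt;double&gt;(frameCount - m_startCount) / elapsed.count();
    }

private:
    std::uint64_t m_startCount = 0;
    Clock::time_point m_startTime{};
};</pre>
<p>The time points are passed in explicitly only to make the arithmetic easy to check: 1020 frames counted over 2 seconds give 510 frames per second.</p>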
<p><span id="more-538"></span></p>
<p>I am inserting a <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a> into the graph, initialized with a 640&#215;480 24-bit RGB media type, between the source and the renderer. There is no callback, just a filter inserted to insist on the media type. &#8220;Before&#8221; and &#8220;after&#8221; code:</p>
<pre>#if TRUE &amp;&amp; FALSE
	CComPtr&lt;IBaseFilter&gt; pSampleGrabberBaseFilter;
	ATLENSURE_SUCCEEDED(pSampleGrabberBaseFilter.CoCreateInstance(CLSID_SampleGrabber));
	CComQIPtr&lt;ISampleGrabber&gt; pSampleGrabber = pSampleGrabberBaseFilter;
	CMediaType pRgbMediaType;
	pRgbMediaType.AllocateVideoInfo(640, 480, 24);
	ATLENSURE_THROW(pSampleGrabber, E_NOINTERFACE); // Verify the interface before using it
	ATLENSURE_SUCCEEDED(pSampleGrabber-&gt;SetMediaType(pRgbMediaType));
	ATLENSURE_SUCCEEDED(pGraphBuilder-&gt;AddFilter(pSampleGrabberBaseFilter, CStringW(_T("24-bit RGB Sample Grabber"))));
	ATLENSURE_SUCCEEDED(pGraphBuilder-&gt;Connect(m_pSourceFilter-&gt;GetOutputPin(), _FilterGraphHelper::GetFilterPin(pSampleGrabberBaseFilter, PINDIR_INPUT)));
	ATLENSURE_SUCCEEDED(pGraphBuilder-&gt;Render(_FilterGraphHelper::GetFilterPin(pSampleGrabberBaseFilter, PINDIR_OUTPUT)));
#else
	ATLENSURE_SUCCEEDED(pGraphBuilder-&gt;Render(m_pSourceFilter-&gt;GetOutputPin()));
#endif</pre>
<p>DirectShow intelligent connect is inserting two additional filters to the graph: <a href="http://msdn.microsoft.com/en-us/library/ms779629(VS.85).aspx">AVI Decompressor Filter</a> to convert YUY2 to RGB and <a href="http://msdn.microsoft.com/en-us/library/ms781972(VS.85).aspx">Color Space Converter Filter</a> for the VMR to have necessary upstream flexibility to choose a media type with extended stride.</p>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/08/17-image002.png"><img class="alignnone size-medium wp-image-540" title="YUY2 to RGB24 to VMR" src="http://alax.info/blog/wp-content/uploads/2008/08/17-image002-300x182.png" alt="" width="300" height="182" /></a></p>
<p>Running still at full pace the application is only rendering 210 frames per second while CPU consumption jumped to 30%. What is consuming CPU cycles in the changed filter graph? The conversion from YUY2 to RGB in <a href="http://msdn.microsoft.com/en-us/library/ms779629(VS.85).aspx">AVI Decompressor Filter</a> and possible additional data copy between buffers.</p>
<p>Hardware:</p>
<ul>
<li>CPU: <a href="http://processorfinder.intel.com/details.aspx?sSpec=SLA9V">Intel® Core™2 Duo Desktop Processor E6750</a></li>
<li>Video Adapter: <a href="http://ati.amd.com/products/radeonhd3400/index.html">ATI Radeon HD 3470</a></li>
</ul>
<p>Reference source code (will require additional headers to compile): <a href="http://alax.info/blog/wp-content/uploads/2008/08/frameratesample0101.zip">FrameRateSample01.01.zip</a> (note that Release build binaries are included)</p>
<p>UPDATE: Another test to bring more detail into the performance impact. I am taking the <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a> out and letting the graph run without it but with the auto-inserted filters, in order to remove the effect of the <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a> itself.</p>
<pre>CComPtr&lt;IPin&gt; pSampleGrabberInputPeerPin = _FilterGraphHelper::GetPeerPin(_FilterGraphHelper::GetFilterPin(pSampleGrabberBaseFilter, PINDIR_INPUT));
CComPtr&lt;IPin&gt; pSampleGrabberOutputPeerPin = _FilterGraphHelper::GetPeerPin(_FilterGraphHelper::GetFilterPin(pSampleGrabberBaseFilter, PINDIR_OUTPUT));
ATLENSURE_SUCCEEDED(pGraphBuilder-&gt;RemoveFilter(pSampleGrabberBaseFilter));
ATLENSURE_SUCCEEDED(pGraphBuilder-&gt;Connect(pSampleGrabberInputPeerPin, pSampleGrabberOutputPeerPin));</pre>
<p><a href="http://alax.info/blog/wp-content/uploads/2008/08/17-image003.png"><img class="alignnone size-medium wp-image-546" title="YUY2 to RGB32 to VMR" src="http://alax.info/blog/wp-content/uploads/2008/08/17-image003-300x182.png" alt="" width="300" height="182" /></a></p>
<p>The frame rate is slightly higher than in the previous test with the <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a>, but it does not stay constant: it keeps jumping between 210 and 250 fps, with CPU load jumping between 10% and 50% (note that 50% is a 100% load on one of the two CPU cores).</p>
<p>What makes it different in comparison with previous run where <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a> is present? A check of pins&#8217; media types with <a href="http://msdn.microsoft.com/en-us/library/ms787460(VS.85).aspx">GraphEdit</a> shows that filters decided to not connect on 24-bit RGB media type. Instead they are connecting on 32-bit RGB between <a href="http://msdn.microsoft.com/en-us/library/ms779629(VS.85).aspx">AVI Decompressor Filter</a> and <a href="http://msdn.microsoft.com/en-us/library/ms781972(VS.85).aspx">Color Space Converter Filter</a>. Previously when <a href="http://msdn.microsoft.com/en-us/library/ms787594(VS.85).aspx">Sample Grabber Filter</a> insisted on 24-bit RGB, <a href="http://msdn.microsoft.com/en-us/library/ms781972(VS.85).aspx">Color Space Converter Filter</a> performed additional data conversion from 24-bit RGB to 32-bit RGB because VMR chose to use 32-bit RGB to accept data in.</p>
<p>Updated binary for the third test: <a href="http://alax.info/blog/wp-content/uploads/2008/08/frameratesample01-rgb32.zip">FrameRateSample01-RGB32.zip</a></p>
]]></content:encoded>
			<wfw:commentRss>http://alax.info/blog/538/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

