[FFmpeg-devel] Idea about speedup of startcode search

Michael Niedermayer michaelni
Fri Feb 8 15:36:12 CET 2008


On Fri, Feb 08, 2008 at 02:35:11PM +0100, Thorsten Jordan wrote:
> Hello,
> 
> here is an idea about tiny optimization i stumbled over when scanning
> for mpeg startcodes, that i want to share.
> 
> I was working on some stream correction like ProjectX, and profiling on
> a weak architecture (VIA C3) showed that the start code scan in the
> GOP-analyse was the weak spot, eating over 50% of CPU time.
> 
> This is typically a simple for() loop iterating over a buffer and
> checking for zero bytes. H.264 has the same with the ill "emulation
> prevention three byte" stuff for NALs, see libavcodec/h264.c line 1393
> (SVN of today).
> 
> Here one searches for 00 00 03 xx patterns, for mpeg 1/2 or h264 you
> often look for 00 00 01 xx patterns or 00 00 00 01.
> 
> This can be done in a simple C loop, but gcc does a bad job and uses up
> to 7 instructions per scanned byte. On Core 2 Duo measurement shows 2.8
> instructions per byte.
> 
> This can be brought down to 0.8 cycles per byte (less than 2
> instructions per byte) with the following idea:

If gcc compiles it to 7 instructions per scanned byte, that is a bug in gcc
which should be reported!
It can easily be done with 3 instructions (or fewer if unrolled further),
that is:

xor %%eax, %%eax
1:
cmpb %%al,  (%%ebx, %%ecx)
 jz blah
cmpb %%al, 2(%%ebx, %%ecx)
 jz blah2
add $4, %%ecx
 jnc 1b

Also don't forget that "per scanned byte" really means every 2nd byte, as
not every byte is scanned.
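
For reference, a minimal C sketch of the "scan only every 2nd byte" idea.
It is not taken from libavcodec; the function name find_zero_pair_stride2
and its "return len when nothing is found" convention are assumptions made
for illustration. Any 00 00 pair covers exactly one odd offset, so testing
only the odd offsets cannot miss a pair; the neighbours are checked only
when a zero is hit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the offset of the first 00 00 pair in
 * buf[0..len), or len if there is none. */
static size_t find_zero_pair_stride2(const uint8_t *buf, size_t len)
{
    size_t i;

    /* Only odd offsets are tested; two adjacent zero bytes always
     * include exactly one of them. */
    for (i = 1; i < len; i += 2) {
        if (buf[i] != 0)
            continue;
        if (buf[i - 1] == 0)
            return i - 1;               /* pair at (i-1, i) */
        if (i + 1 < len && buf[i + 1] == 0)
            return i;                   /* pair at (i, i+1) */
    }
    return len;
}
```

The caller would then look at the byte(s) after the returned offset to tell
00 00 01, 00 00 03 and 00 00 00 01 apart.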


> 
> all startcodes mentioned above have two consecutive zero bytes. To
> filter them out, load 8 bytes to a mmx register and check 4x2 bytes for
> equality with zero, by using packed compare, packing to 4x1 bytes,
> or-ing and testing. Do this for 8 bytes at address x and x+1, until
> there are any two consecutive zero bytes found, then fine-check with c-code.
> 
> It is worth only for large data chunks with rather rare startcodes, but
> this is mostly the case. Every byte of a h.264 stream must be piped
> through the "emulation prevention 3 byte" checker.
> 
> Gain is however maybe too small to do it, at 20mbit with h264 that would
> be 2,38mb/sek to parse, so saving ca. 5 million cpu cycles - only a tiny
> fragment of a 2ghz cpu. But everything counts...
> 
> anyway it may not be an important idea, but if anyone wants to try it
> out, here is some test code that i have written and declare as free to
> use (PD).
[...]
> 	uint32_t flag = 0; //dummy
> 	for ( ; buf < bufend_mmx; buf += 8) {
> 		asm volatile("1:			\n\t"
> 			     "movq (%0), %%mm0		\n\t"
> 			     "movq 1(%0), %%mm1		\n\t"
> 			     "pcmpeqw %%mm2, %%mm0	\n\t"
> 			     "pcmpeqw %%mm2, %%mm1	\n\t"
> 			     "packsswb %%mm0, %%mm0	\n\t"
> 			     "packsswb %%mm1, %%mm1	\n\t"
> 			     "por %%mm1, %%mm0		\n\t"

movq (%0), %%mm0
por  1(%0), %%mm0
pcmpeqb %%mm2, %%mm0
packsswb %%mm0, %%mm0

and this is not a patch ...
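
The same filter can be sketched in portable C, as an illustration only. It
swaps the MMX registers for a 64-bit integer: OR the 8 bytes at x with the
8 bytes at x+1, so that a result byte is zero exactly when two consecutive
input bytes are zero, then test for any zero byte with the usual
subtract-and-mask trick instead of pcmpeqb/packsswb. The name
find_zero_pair_swar and the "return len when nothing is found" convention
are assumptions, not existing code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first 00 00 pair in
 * buf[0..len), or len if there is none. */
static size_t find_zero_pair_swar(const uint8_t *buf, size_t len)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    size_t i = 0;

    /* Each step reads buf[i .. i+8], so 9 bytes must remain. */
    while (len - i >= 9) {
        uint64_t a, b, v;

        memcpy(&a, buf + i,     8);     /* unaligned-safe loads */
        memcpy(&b, buf + i + 1, 8);
        v = a | b;      /* byte k == 0 iff buf[i+k] == buf[i+k+1] == 0 */
        if ((v - ones) & ~v & highs)
            break;      /* some byte of v is zero: fine-check below */
        i += 8;
    }

    /* Bytewise fine check, also handles the tail of the buffer. */
    for (; i + 1 < len; i++)
        if (buf[i] == 0 && buf[i + 1] == 0)
            return i;
    return len;
}
```

Both loads use the same byte order and the OR is bytewise, so the result
does not depend on endianness.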


> 			     "movd %%mm0, %1		\n\t"
> 			     "testl %1, %1		\n\t"
> 			     "jne 2f			\n\t"
> 			     "addl $8, %0		\n\t"
> 			     "cmpl %2, %0		\n\t"
> 			     "jl 1b			\n\t"


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The misfortune of the wise is better than the prosperity of the fool.
-- Epicurus