[Ffmpeg-devel] [PATCH] Snow mmx+sse2 asm optimizations

Guillaume POIRIER poirierg
Sun Feb 5 23:50:08 CET 2006


Hi,

On 2/5/06, Robert Edele <yartrebo at earthlink.net> wrote:
> I've written assembly SIMD optimizations (MMX, SSE2) for three parts of
> snow. These changes include:
>
>  - MMX and SSE2 code for the bottom part of add_yblock_buffered.
>    - Left shifting the OBMC tables by 2, and updating parts of the code
> to work with the change. This makes for somewhat faster code by
> eliminating some shift operations in the innermost loop of
> add_yblock_buffered.
>
>  - vertical compose has a straightforward SIMD implementation.
>
>  - horizontal compose has substantially modified internally to allow for
> an efficient SIMD implementation and improving cache performance. For
> plain C code, it may be faster or slower on your system (faster on
> mine). The largest change is that it is almost entirely in-place and the
> temp buffer is only half used now, allowing for SIMD optimization and
> improving cache performance. An added step, interleave_line(), has been
> added because the in-place lifts do not leave the coefficients in the
> proper places. This code is extremely fast in SIMD.

Tested on my machine with GCC-4.1 snapshot, on an x86-64 setup. It
works like a charm. The speed-up depends on the video res and on
encoding features it uses, but it's always there!


> I am aware that conditional compilation for SIMD code is frowned upon,
> so could someone give me some feedback on how my code could be
> efficiently done using function pointers like the other SIMD
> optimizations in ffmpeg? Some functions (interleave_line, 8x8 obmc) take
> nary 500 clocks to finish.

I've looked at this problem, and it puzzled me quite a bit:
Take this function added by the patch:

+static void interleave_line(DWTELEM * low, DWTELEM * high, DWTELEM
*b, int width){
+    int i = width - 2;
+
+    if (width & 1)
+    {
+        b[i+1] = low[(i+1)/2];
+        i--;
+    }
+
+#ifdef HAVE_MMX
+    interleave_line_asm
+#endif
+    for (; i>=0; i-=2){
+        b[i+1] = high[i/2];
+        b[i] = low[i/2];
+    }
+}


It looks like if I were to convert that function to the way they are
defined on FFmpeg so that they can be selected upon CPU detection, 3
flavors would need to be written:

C flavor:

void interleave_line_C(DWTELEM * low, DWTELEM * high, DWTELEM *b, int width){
    int i = width - 2;

    if (width & 1)
    {
        b[i+1] = low[(i+1)/2];
        i--;
    }

    for (; i>=0; i-=2){
        b[i+1] = high[i/2];
        b[i] = low[i/2];
    }
}


(MMX|SSE2) flavor:
void interleave_line_(MMX|SSE2)(DWTELEM * low, DWTELEM * high, DWTELEM
*b, int width){
    int i = width - 2;

    if (width & 1)
    {
        b[i+1] = low[(i+1)/2];
        i--;
    }

/* (MMX|SSE2) ASM block */
}

Note that to the best of my knowledge, that means you _need_ to get
rid of the "static" qualifier, as it just can't work with function
pointers. But that means that there will always be a call overhead in
each of those functions... which hopefully can be compensated a bit by
the execution core of the CPU.

Having those 3 version leads to quite a bit of code duplication... but
it can be mostly okay as hopefully that part of the code forms the
roots of the codec, therefore won't change too much in the future.


Maybe an alternate version could be:

void interleave_line_inner_loop_C(DWTELEM * low, DWTELEM * high,
DWTELEM *b, int width, int i)
    for (; i>=0; i-=2){
        b[i+1] = high[i/2];
        b[i] = low[i/2];
}

void interleave_line_inner_loop_(MMX|SSE2)(DWTELEM * low, DWTELEM *
high, DWTELEM *b, int width)
/* ASM block */
interleave_line_inner_loop_C(low, high, b, width, i) // for the
eventual the remaining loops iterations not covered by the SIMD
version
}


While I was looking at Snow code, I noticed that a fair deal of
functions were inline (the obvious reason is for performance
reasons)... so quite frankly, I have a hard time understanding why we
would want to have all CPU intensive codes having a additional call
overhead...

But maybe I've just miss-understood what the whole dsp context and
function pointer thingie was about, and in that case, I'd like to know
where I'm wrong.

Best regards,

Guillaume
--
Just because code is syntactically "valid" GNU C doesn't mean gcc can
always compile it.
  Steven Bosscher - 2005-01-01
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203#c14
---
MPlayer's doc isn't up-to-date. Visit my updated mirror here:
http://tuxrip.free.fr//MPlayer-DOCS-HTML/en/
http://tuxrip.free.fr//MPlayer-DOCS-HTML/fr/





More information about the ffmpeg-devel mailing list