[Ffmpeg-devel] [PATCH] Snow mmx+sse2 asm optimizations

Mon Feb 6 14:12:45 CET 2006

Hi

On Sun, Feb 05, 2006 at 12:47:14PM -0500, Robert Edele wrote:
> I've written assembly SIMD optimizations (MMX, SSE2) for three parts of
> snow. These changes include:
> 
>  - MMX and SSE2 code for the bottom part of add_yblock_buffered.
>    - Left shifting the OBMC tables by 2, and updating parts of the code
> to work with the change. This makes for somewhat faster code by
> eliminating some shift operations in the innermost loop of
> add_yblock_buffered.
> 
>  - vertical compose has a straightforward SIMD implementation.
> 
>  - horizontal compose has substantially modified internally to allow for
> an efficient SIMD implementation and improving cache performance. For
> plain C code, it may be faster or slower on your system (faster on
> mine). The largest change is that it is almost entirely in-place and the
> temp buffer is only half used now, allowing for SIMD optimization and
> improving cache performance. An added step, interleave_line(), has been
> added because the in-place lifts do not leave the coefficients in the
> proper places. This code is extremely fast in SIMD.
> 
> I am aware that conditional compilation for SIMD code is frowned upon,
> so could someone give me some feedback on how my code could be
> efficiently done using function pointers like the other SIMD
> optimizations in ffmpeg? Some functions (interleave_line, 8x8 obmc) take
> nary 500 clocks to finish.

1. how much speed do we loose if you convert them to naive function pointers
also keep in mind that gcc has serious difficulty optimizion large functions
2. if you want to decrease the overhead:
then change:
for(){
 func_ptr()
}
to
func_mmx(){
 for(){
  mmx()
 }
}
func_c(){
 for(){
  c()
 }
}

yeah you duplicate a few lines of code, but its MUCH cleaner
and if there is lots of other stuff in the loop which needs to be duplicated
then that should be split into its own inline function ...

[...]
> @@ -1409,6 +1484,121 @@
>          spatial_compose53i_dy(&cs, buffer, width, height, stride);
>  }
>  
> +static void interleave_line(DWTELEM * low, DWTELEM * high, DWTELEM *b, int width){
> +    int i = width - 2;
> +    
> +    if (width & 1)
> +    {
> +        b[i+1] = low[(i+1)/2];

dividing signed integers by 2^x is slow due to the special case of negative
numbers, so use a >>1 here ur use unsigned

[...]
> +static void horizontal_compose97i_unified(DWTELEM *b, int width){
> +    const int w2= (width+1)>>1;
> +#ifdef HAVE_SSE2
> +// SSE2 code runs faster with pointers aligned on a 32-byte boundary.
> +    DWTELEM temp_buf[width + 4];
> +    DWTELEM * const temp = temp_buf + 4 - (((int)temp_buf & 0xF) / 4);
> +#else
> +    DWTELEM temp[width];
> +#endif
> +    const int w_l= (width>>1);
> +    const int w_r= w2 - 1;
> +    int i;
> +        
> +    {
> +    DWTELEM * const ref = b + w2 - 1;
> +    DWTELEM b_0 = b[0]; //By allowing the first entry in b[0] to be calculated twice
> +    // (the first time erroneously), we allow the SSE2 code to run an extra pass.
> +    // The savings in code and time are well worth having to store this value and
> +    // calculate b[0] correctly afterwards.
> +
> +    i = 0;
> +#ifdef HAVE_MMX
> +horizontal_compose97i_lift0_asm
> +#endif
> +    for(; i<w_l; i++){
> +      b[i] = b[i] - ((W_DM * (ref[i] + ref[i + 1]) + W_DO) >> W_DS);
> +    }
> +
> +    if(width&1){
> +        b[w_l] = b[w_l] - ((W_DM * 2 * ref[w_l] + W_DO) >> W_DS);
> +    }
> +    b[0] = b_0 - ((W_DM * 2 * ref[1]+W_DO)>>W_DS);
> +    }
> +    
> +    {
> +    DWTELEM * const dst = b+w2;
> +    DWTELEM * const src = dst;
> +    
> +    i = 0;
> +    for(; (((long)&dst[i]) & 0xF) && i<w_r; i++){
> +        dst[i] = src[i] - (b[i] + b[i + 1]);
> +    }
> +#ifdef HAVE_SSE2    
> +    horizontal_compose97i_lift1_asm
> +#endif
> +    for(; i<w_r; i++){
> +        dst[i] = src[i] - (b[i] + b[i + 1]);
> +    }
> +
> +    if(!(width&1)){
> +        dst[w_r] = src[w_r] - (2 * b[w_r]);
> +    }
> +    }

umm, indention ...
and this is quite repeative code, so should be in its own function/macro
similar to lift()

[...]

-- 
Michael