[FFmpeg-devel] [PATCH] VP8 MMX optimizations (MC and IDCT dc_add)

Jason Garrett-Glaser darkshikari
Wed Jun 23 00:32:13 CEST 2010


On Tue, Jun 22, 2010 at 3:29 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Tue, Jun 22, 2010 at 03:35:40PM -0400, Ronald S. Bultje wrote:
>> Hi,
>>
>> as per $subj.
>>
>> Speed gain:
>> - dc_add goes from 1800 to 1350 cycles (where 1150 is overhead,
>> measured as empty asm func), so about 3-3.5x faster.
>> - The MC functions are each about 4-5x faster (I only measured the 4x4
>> ones, the rest I assume are similarly faster but not measured).
>> - Total time spent on a shell-script that decodes the whole testsuite
>> (vp8-test-vectors-r1, file 001-017) including shell overhead and
>> everything goes from 2.3 to 2.1 seconds with these applied.
>>
>> Results are bit-identical, and this is my first MMX/etc. ever! Thanks
>> to Jason for teaching me. ;-).
>>
>> Ronald
>
> [...]
>> +; 4x4 block, H-only 4-tap filter
>> +cglobal put_vp8_epel4_h4_mmxext, 5, 5
>> +    sub        r0, r1
>> +    movd      mm4, [fourtap_filter+r4*4-4] ; set up 4tap filter in words
>> +    movd      mm5, [fourtap_filter+r4*4]
>> +    movq      mm7, [ff_pw_64]
>> +    pxor      mm6, mm6
>
>> +    punpckldq mm4, mm4
>> +    punpckldq mm5, mm5
>
> you could avoid these by doing them to the table

Already done locally.

>> +; 4x4 block, V-only 4-tap filter
>> +cglobal put_vp8_epel4_v4_mmxext, 4, 5
>> +    mov        r4, r5m                     ; my - FIXME prevent this on X86_64
>> +    sub        r0, r1
>> +    movq      mm7, [fourtap_filter+r4*4-4] ; load 4-tap filter coeffs
>> +    pxor      mm6, mm6
>> +    movq      mm5, [ff_pw_64]
>> +
>> +    ; read 3 lines
>> +    sub        r1, r2
>> +    movd      mm0, [r1]
>> +    movd      mm1, [r1+  r2]
>> +    movd      mm2, [r1+2*r2]
>> +    add        r1, r2
>> +    punpcklbw mm0, mm6
>> +    punpcklbw mm1, mm6
>> +    punpcklbw mm2, mm6
>> +
>> +.nextrow
>> +    ; first tap
>> +    pshufw    mm3, mm7, 0x0                ; splat first coeff
>
> are you sure all these pshufw are faster than reading them from a table?

Already done locally.
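
For anyone following along, here is a rough C sketch of the table idea, not the actual code from my tree. The helper name splat_taps and the LANES constant are made up for illustration. Each tap is stored pre-broadcast across all four 16-bit lanes, so the loop can load a ready-made vector from memory instead of issuing a pshufw per tap. The cost is 8 bytes per tap in the table rather than 2.

```c
#include <stdint.h>

enum { LANES = 4 };  /* four 16-bit words fit in one 64-bit MMX register */

/* Hypothetical helper (not from the patch): expand `n` filter taps so
 * that row i of `out` holds taps[i] repeated in every lane. A SIMD loop
 * can then load row i directly instead of broadcasting one word with a
 * shuffle instruction. */
static void splat_taps(const int16_t *taps, int n, int16_t out[][LANES])
{
    for (int i = 0; i < n; i++)
        for (int lane = 0; lane < LANES; lane++)
            out[i][lane] = taps[i];
}
```

Whether the extra loads actually beat the shuffles depends on the CPU, so this is something to benchmark rather than assume.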

>> +    pmullw    mm3, mm0
>> +
>
>> +    ; update cache for second/third already
>> +    movq      mm0, mm1
>> +    movq      mm1, mm2
>
> these could be avoided by unrolling the loop but i guess that makes it
> too bloated?

Working on this locally.
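
To make the trade-off concrete, here is a scalar C sketch of what the vertical 4-tap loop computes. It is a reference model under my own assumptions, not the patch: the name epel4_v4_ref and the caller-supplied taps array are hypothetical, and it assumes the taps sum to 128, so that adding 64 (the ff_pw_64 constant) and shifting right by 7 rounds to the nearest integer.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference: 4x4 block, vertical-only 4-tap filter.
 * `src` points at the top-left pixel of the block; one row above and
 * two rows below the block must be readable. */
static void epel4_v4_ref(uint8_t *dst, ptrdiff_t dst_stride,
                         const uint8_t *src, ptrdiff_t src_stride,
                         const int16_t taps[4])
{
    for (int x = 0; x < 4; x++) {
        /* "read 3 lines": prime the row cache for this column */
        int a = src[x - src_stride];
        int b = src[x];
        int c = src[x + src_stride];

        for (int y = 0; y < 4; y++) {
            int d   = src[x + (y + 2) * src_stride];
            int sum = taps[0] * a + taps[1] * b + taps[2] * c + taps[3] * d + 64;

            /* clamp negatives before shifting to stay portable */
            dst[x + y * dst_stride] = clip_u8(sum < 0 ? 0 : sum >> 7);

            /* rotate the cache: these copies are the movq's in question */
            a = b;
            b = c;
            c = d;
        }
    }
}
```

Unrolling the row loop lets each copy of the body simply use the registers under rotated names, which removes the copies at the cost of a larger function.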

> [...]
>> +cglobal vp8_idct_dc_add_mmx, 3, 3
>> +    ; load data
>> +    movd       mm0, [r1]
>> +    pxor       mm2, mm2
>> +    mov         r1, 4
>> +
>> +    ; calculate DC
>> +    paddw      mm0, [ff_pw_4]
>> +    punpcklwd  mm0, mm0
>> +    punpckldq  mm0, mm0
>> +    psraw      mm0, 3
>> +
>> +.nextblock
>> +    ; add DC
>> +    movd       mm1, [r0]
>> +    punpcklbw  mm1, mm2
>> +    paddw      mm1, mm0
>> +
>> +    ; write out
>> +    packuswb   mm1, mm2
>> +    movd      [r0], mm1
>
> movq    mm0, [r0]
> paddusb mm0, mm1
> psubusb mm0, mm2
> movq    [r0], mm0
>
> can be used to do this with 8 samples at once, aka 2 4x4 blocks

Already done locally for the 4x4 version; 8-sample versions can come later.
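
For the record, here is a scalar C model of why the paddusb/psubusb pair gives the same result as the widen/add/pack sequence. It is a sketch with hypothetical names, not the code in my tree. The DC is split into a non-negative part and a negated negative part, each clamped to a byte. One of the two is always zero, so a saturating add followed by a saturating subtract clamps correctly at both ends. dc_from_coeff models the (coeff + 4) >> 3 step from the patch with floor rounding and ignores 16-bit wraparound.

```c
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar models of the unsigned saturating byte ops paddusb / psubusb. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b) { return clip_u8(a + b); }
static uint8_t sat_sub_u8(uint8_t a, uint8_t b) { return clip_u8(a - b); }

/* (coeff + 4) >> 3 with floor rounding, as paddw + psraw would give. */
static int dc_from_coeff(int16_t coeff)
{
    int v = coeff + 4;
    return v >= 0 ? v >> 3 : -((-v + 7) >> 3);
}

/* Widening version: what unpack + paddw + packuswb computes per pixel. */
static uint8_t dc_add_widen(uint8_t pix, int dc)
{
    return clip_u8(pix + dc);
}

/* Byte-only version: what paddusb + psubusb computes per pixel. */
static uint8_t dc_add_bytes(uint8_t pix, int dc)
{
    uint8_t pos = clip_u8(dc);   /* dc if positive, else 0 */
    uint8_t neg = clip_u8(-dc);  /* -dc if dc is negative, else 0 */
    return sat_sub_u8(sat_add_u8(pix, pos), neg);
}
```

Since everything stays in bytes, a 64-bit register covers 8 pixels per iteration instead of 4, which is where the two-blocks-at-once version would come from.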

Dark Shikari


