[FFmpeg-devel] [PATCH] MMX2/SSSE3 VC1 loop filter

Ronald S. Bultje rsbultje
Mon Jul 5 22:30:29 CEST 2010


Hi,

On Mon, Jul 5, 2010 at 1:44 AM, David Conrad <lessen42 at gmail.com> wrote:
> Updated to patch cleanly, compile, and added mmx/sse2 versions
[..]
> +SECTION_RODATA
> +pw_4: times 8 dw 4
> +pw_5: times 8 dw 5

Maybe use cextern pw_4 and pw_5 here (i.e. reuse the ones in dsputil_mmx.c)?

> +; low, high (src), zero
> +%macro UNPACK2 4
> +    mova      m%2, m%3
> +    punpckh%1 m%3, m%4
> +    punpckl%1 m%2, m%4
> +%endmacro

This looks like a duplicate of SBUTTERFLY in x86util.asm, maybe?
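
To make the overlap concrete, here is a rough scalar C model of the two macros on an 8-byte (MMX-sized) register. The function names are made up for illustration, and sbutterfly_model reflects my reading of SBUTTERFLY (first register gets the low interleave, second gets the high one), which is an assumption rather than a copy of x86util.asm:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of punpcklbw/punpckhbw on an 8-byte register:
 * interleave the low (high == 0) or high (high == 1) four bytes
 * of a and b as a0 b0 a1 b1 a2 b2 a3 b3. */
static void interleave_bytes(uint8_t dst[8], const uint8_t a[8],
                             const uint8_t b[8], int high)
{
    uint8_t tmp[8];                     /* lets dst alias a or b */
    int base = high ? 4 : 0;

    assert(high == 0 || high == 1);
    for (int i = 0; i < 4; i++) {
        tmp[2 * i]     = a[base + i];
        tmp[2 * i + 1] = b[base + i];
    }
    memcpy(dst, tmp, 8);
}

/* Model of the quoted UNPACK2: low = punpckl(src, zero),
 * then src = punpckh(src, zero). */
static void unpack2_model(uint8_t low[8], uint8_t src[8],
                          const uint8_t zero[8])
{
    interleave_bytes(low, src, zero, 0);
    interleave_bytes(src, src, zero, 1);
}

/* Assumed model of SBUTTERFLY: a = punpckl(a, b), b = punpckh(old a, b). */
static void sbutterfly_model(uint8_t a[8], uint8_t b[8])
{
    uint8_t t[8];

    interleave_bytes(t, a, b, 1);       /* high half, computed from old a */
    interleave_bytes(a, a, b, 0);       /* low half, in place */
    memcpy(b, t, 8);
}
```

With a zero register as the second operand, both models produce the same low/high pair of zero-extended words; only the registers that end up holding them differ.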

> +%macro STORE_4_WORDS_MMX 6
> +    movd   %6, %5
> +%if mmsize==16
> +    psrldq %5, 4
> +%else
> +    psrlq  %5, 32
> +%endif
> +    mov    %1, %6w
> +    shr    %6, 16
> +    mov    %2, %6w
> +    movd   %6, %5
> +    mov    %3, %6w
> +    shr    %6, 16
> +    mov    %4, %6w
> +%endmacro

For the VP8 H loopfilter, I also keep the two neighbouring rows (p1/q1)
and write all four out as dwords, using movd straight from the mm
register. Have you tried that, and if so, is it faster? (I'm not asking
you to rewrite it if you haven't.)

(I suppose this isn't very practical because of the SSE4 version below...)
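
As a sketch of the two store strategies being compared, here is a scalar C model. The names store_4_words and store_4_dwords are hypothetical, the byte order is written out explicitly as little-endian to match x86, and the dword variant assumes each row has already been packed as p1 p0 q0 q1:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of STORE_4_WORDS: word k of reg (two filtered pixels, p0 and q0)
 * goes to row k of the destination. */
static void store_4_words(uint8_t *dst, ptrdiff_t stride, uint64_t reg)
{
    assert(dst != NULL);
    for (int row = 0; row < 4; row++) {
        uint16_t w = (uint16_t)(reg >> (16 * row));
        dst[row * stride + 0] = (uint8_t)(w & 0xff);  /* low byte first */
        dst[row * stride + 1] = (uint8_t)(w >> 8);
    }
}

/* Model of the dword alternative: each row is written as one 4-byte
 * store that also covers the unchanged neighbours p1 and q1. */
static void store_4_dwords(uint8_t *dst, ptrdiff_t stride,
                           const uint32_t rows[4])
{
    assert(dst != NULL);
    for (int row = 0; row < 4; row++) {
        for (int byte = 0; byte < 4; byte++)
            dst[row * stride + byte] = (uint8_t)(rows[row] >> (8 * byte));
    }
}
```

The dword variant starts one pixel further left and rewrites p1/q1 with their existing values, so the memory contents end up identical; the trade-off is one wider store per row against the extra work of keeping p1/q1 around through the transpose.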

> +%macro STORE_4_WORDS_SSE4 6
> +    pextrw %1, %5, %6+0
> +    pextrw %2, %5, %6+1
> +    pextrw %3, %5, %6+2
> +    pextrw %4, %5, %6+3
> +%endmacro
[..]

> +%macro VC1_H_LOOP_FILTER 1-2
> +    movq      m0, [r0     -4]
> +    movq      m1, [r0+  r1-4]
> +    movq      m2, [r0+2*r1-4]
> +    movq      m3, [r0+  r3-4]
> +%if %1 > 4
> +    movq      m4, [r4     -4]
> +    movq      m5, [r4+  r1-4]
> +    movq      m6, [r4+2*r1-4]
> +    movq      m7, [r4+  r3-4]
> +    punpcklbw m0, m1
> +    punpcklbw m2, m3
> +    punpcklbw m4, m5
> +    punpcklbw m6, m7
> +    SWAP 1, 2
> +    SWAP 2, 4
> +    SWAP 3, 6
> +    SBUTTERFLY wd, 0, 1, 4
> +    SBUTTERFLY wd, 2, 3, 4
> +    SBUTTERFLY dq, 0, 2, 4
> +    SBUTTERFLY dq, 1, 3, 4
> +%else
> +    SBUTTERFLY bw, 0, 1, 4
> +    SBUTTERFLY bw, 2, 3, 4
> +    SBUTTERFLY wd, 0, 2, 4
> +    SBUTTERFLY wd, 1, 3, 4
> +%endif

Could these use TRANSPOSE4x4W / TRANSPOSE4x4B?
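
For reference, the butterfly sequences above amount to an interleave-based transpose. A minimal scalar C sketch of the 4x4 word case follows; the reg64 type and helper names are invented here, and the instruction order is the generic two-pass scheme, not necessarily the exact order used by TRANSPOSE4x4W in x86util.asm:

```c
#include <assert.h>
#include <stdint.h>

/* One MMX-sized register viewed as four 16-bit words. */
typedef struct { uint16_t w[4]; } reg64;

/* Scalar models of the word and dword unpack instructions. */
static reg64 punpcklwd(reg64 a, reg64 b)
{
    return (reg64){{ a.w[0], b.w[0], a.w[1], b.w[1] }};
}

static reg64 punpckhwd(reg64 a, reg64 b)
{
    return (reg64){{ a.w[2], b.w[2], a.w[3], b.w[3] }};
}

static reg64 punpckldq(reg64 a, reg64 b)
{
    return (reg64){{ a.w[0], a.w[1], b.w[0], b.w[1] }};
}

static reg64 punpckhdq(reg64 a, reg64 b)
{
    return (reg64){{ a.w[2], a.w[3], b.w[2], b.w[3] }};
}

/* Transpose a 4x4 matrix of words held in four registers, in place. */
static void transpose4x4w(reg64 m[4])
{
    assert(m != 0);

    /* Pass 1: interleave words of row pairs (0,1) and (2,3). */
    reg64 t0 = punpcklwd(m[0], m[1]);   /* r00 r10 r01 r11 */
    reg64 t1 = punpckhwd(m[0], m[1]);   /* r02 r12 r03 r13 */
    reg64 t2 = punpcklwd(m[2], m[3]);   /* r20 r30 r21 r31 */
    reg64 t3 = punpckhwd(m[2], m[3]);   /* r22 r32 r23 r33 */

    /* Pass 2: interleave dwords to assemble the output columns. */
    m[0] = punpckldq(t0, t2);           /* r00 r10 r20 r30 */
    m[1] = punpckhdq(t0, t2);           /* r01 r11 r21 r31 */
    m[2] = punpckldq(t1, t3);           /* r02 r12 r22 r32 */
    m[3] = punpckhdq(t1, t3);           /* r03 r13 r23 r33 */
}
```

The byte version works the same way with one extra interleave level, which is why a shared macro can usually replace the hand-written sequence.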

> +cglobal vc1_h_loop_filter8_sse4, 3,5,8

Should this (and others like it) be under #ifdef X86_64? I got compile
errors when I tried to use xmm8-15 on x86_32.

Ronald


