[FFmpeg-devel] [PATCH] MMX2/SSSE3 VC1 loop filter

Tue Jul 6 05:34:23 CEST 2010

On Jul 5, 2010, at 4:30 PM, Ronald S. Bultje wrote:

> Hi,
> 
> On Mon, Jul 5, 2010 at 1:44 AM, David Conrad <lessen42 at gmail.com> wrote:
>> Updated to patch cleanly, compile, and added mmx/sse2 versions
> [..]
>> +SECTION_RODATA
>> +pw_4: times 8 dw 4
>> +pw_5: times 8 dw 5
> 
> cextern pw_4, pw_5 (i.e. use the ones in dsputil_mmx.c) maybe?

Doesn't this cause non-PIC problems on x86-64? I remember something similar causing a problem in libvpx.

Well, if it is problematic, vp8dsp.asm already does the same thing, so changed.

>> +; low, high (src), zero
>> +%macro UNPACK2 4
>> +    mova      m%2, m%3
>> +    punpckh%1 m%3, m%4
>> +    punpckl%1 m%2, m%4
>> +%endmacro
> 
> duplicate of SBUTTERFLY in x86util.asm, maybe?

Not quite, conceptually: this is intended to expand one vector from 8-bit to 16-bit, whereas SBUTTERFLY is intended to merge two vectors. In particular, the SWAP in SBUTTERFLY will send the constant zero vector all over the place.

I could implement it by calling SBUTTERFLY, but IMO it should be a different macro.
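
To illustrate the distinction, here is a rough scalar C model of the two operations on one 8-byte (MMX-sized) vector. It is only a sketch: the function names `unpack2_model` and `sbutterfly_bw_model` and the `LANES` constant are made up for this example and do not exist in the patch or in x86util.asm, and the register renaming that SBUTTERFLY's SWAP performs is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* bytes in one MMX register (assumed size for this sketch) */

/* Model of UNPACK2 with bw and a zero register: one source vector is
 * widened from 8-bit to 16-bit lanes. Interleaving with zero bytes on a
 * little-endian machine is the same as zero-extending each byte. */
static void unpack2_model(const uint8_t src[LANES],
                          uint16_t lo[LANES / 2], uint16_t hi[LANES / 2])
{
    assert(src && lo && hi);
    for (size_t i = 0; i < LANES / 2; i++) {
        lo[i] = src[i];             /* punpcklbw src, zero */
        hi[i] = src[i + LANES / 2]; /* punpckhbw src, zero */
    }
}

/* Model of the data movement in SBUTTERFLY bw: two source vectors are
 * merged by interleaving their bytes into a low and a high result. */
static void sbutterfly_bw_model(const uint8_t a[LANES], const uint8_t b[LANES],
                                uint8_t lo[LANES], uint8_t hi[LANES])
{
    assert(a && b && lo && hi);
    for (size_t i = 0; i < LANES / 2; i++) {
        lo[2 * i]     = a[i];             /* punpcklbw a, b */
        lo[2 * i + 1] = b[i];
        hi[2 * i]     = a[i + LANES / 2]; /* punpckhbw a, b */
        hi[2 * i + 1] = b[i + LANES / 2];
    }
}
```

Passing an all-zero vector as `b` to `sbutterfly_bw_model` yields the same bytes as `unpack2_model`, which is why the two macros look alike even though the intent differs.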

>> +%macro STORE_4_WORDS_MMX 6
>> +    movd   %6, %5
>> +%if mmsize==16
>> +    psrldq %5, 4
>> +%else
>> +    psrlq  %5, 32
>> +%endif
>> +    mov    %1, %6w
>> +    shr    %6, 16
>> +    mov    %2, %6w
>> +    movd   %6, %5
>> +    mov    %3, %6w
>> +    shr    %6, 16
>> +    mov    %4, %6w
>> +%endmacro
> 
> For VP8 H loopfilter, I save the neighbouring two rows (p1/q1) and
> write the four out as dwords using movd at once from the mm register,
> have you tried that (I'm not asking you to rewrite it if you didn't),
> and if so, is it faster?

I just tried doing it this way, and it seems to be a little slower. I'll try using STORE_4_WORDS_MMX in vp8's simple loop filter later, but not soon.
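
For readers following along, this is roughly what STORE_4_WORDS_MMX does with the low 64 bits of the register, written as a scalar C model. It is a sketch under assumptions: `store_4_words_model` and `store16_le` are hypothetical names, the four destinations stand in for the macro's four memory operands, and the 64-bit value stands in for the MMX register (or the low half of an XMM register).

```c
#include <assert.h>
#include <stdint.h>

/* Write one 16-bit word to memory in little-endian order, as an x86
 * "mov [mem], r16" would. */
static void store16_le(uint8_t *dst, uint32_t w)
{
    dst[0] = (uint8_t)(w & 0xff);
    dst[1] = (uint8_t)((w >> 8) & 0xff);
}

/* Model of STORE_4_WORDS_MMX: peel four 16-bit words off a 64-bit
 * vector through a 32-bit temporary and store one word per destination. */
static void store_4_words_model(uint8_t *d0, uint8_t *d1,
                                uint8_t *d2, uint8_t *d3, uint64_t v)
{
    assert(d0 && d1 && d2 && d3);

    uint32_t t = (uint32_t)v; /* movd  tmp, reg  (low 32 bits)     */
    v >>= 32;                 /* psrlq reg, 32  (or psrldq reg, 4) */
    store16_le(d0, t);        /* mov   [d0], tmpw                  */
    t >>= 16;                 /* shr   tmp, 16                     */
    store16_le(d1, t);        /* mov   [d1], tmpw                  */

    t = (uint32_t)v;          /* movd  tmp, reg  (next 32 bits)    */
    store16_le(d2, t);        /* mov   [d2], tmpw                  */
    t >>= 16;                 /* shr   tmp, 16                     */
    store16_le(d3, t);        /* mov   [d3], tmpw                  */
}
```

The dword alternative suggested above would replace the four word stores with four `movd` stores covering twice as many bytes per destination, so it trades the GPR round trip for extra transposition work.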

> (I suppose this isn't very practical because of the SSE4 version below...)
> 
>> +%macro STORE_4_WORDS_SSE4 6
>> +    pextrw %1, %5, %6+0
>> +    pextrw %2, %5, %6+1
>> +    pextrw %3, %5, %6+2
>> +    pextrw %4, %5, %6+3
>> +%endmacro
> [..]
> 
>> +%macro VC1_H_LOOP_FILTER 1-2
>> +    movq      m0, [r0     -4]
>> +    movq      m1, [r0+  r1-4]
>> +    movq      m2, [r0+2*r1-4]
>> +    movq      m3, [r0+  r3-4]
>> +%if %1 > 4
>> +    movq      m4, [r4     -4]
>> +    movq      m5, [r4+  r1-4]
>> +    movq      m6, [r4+2*r1-4]
>> +    movq      m7, [r4+  r3-4]
>> +    punpcklbw m0, m1
>> +    punpcklbw m2, m3
>> +    punpcklbw m4, m5
>> +    punpcklbw m6, m7
>> +    SWAP 1, 2
>> +    SWAP 2, 4
>> +    SWAP 3, 6
>> +    SBUTTERFLY wd, 0, 1, 4
>> +    SBUTTERFLY wd, 2, 3, 4
>> +    SBUTTERFLY dq, 0, 2, 4
>> +    SBUTTERFLY dq, 1, 3, 4
>> +%else
>> +    SBUTTERFLY bw, 0, 1, 4
>> +    SBUTTERFLY bw, 2, 3, 4
>> +    SBUTTERFLY wd, 0, 2, 4
>> +    SBUTTERFLY wd, 1, 3, 4
>> +%endif
> 
> TRANSPOSE4x4W, TRANSPOSE4x4B?

Fixed.

>> +cglobal vc1_h_loop_filter8_sse4, 3,5,8
> 
> Should this (and others like it) be under #ifdef X86_64? I got compile
> errors if I tried to use xmm8-15 on x86_32.

What compiler errors are you getting? This should never use xmm8 or above.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: textmate stdin E0V4hx.txt
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100705/49a4ba13/attachment.txt>