[FFmpeg-devel] [PATCH] VP3 DC-only IDCT

Fri Apr 16 14:20:44 CEST 2010

On Mar 13, 2010, at 2:18 PM, Michael Niedermayer wrote:

> On Sat, Mar 13, 2010 at 01:36:20AM -0500, David Conrad wrote:
>> Hi,
>> 
>> This gives 2-4% faster overall decode for normal files.
>> 
>> Some thoughts:
>> I can't think of any shortcuts that could make the IDCT faster with 128-bit simd that don't rely on knowing the last non-zero coefficient.
>> 
>> Knowing that before calling the idct, you could do a slightly faster IDCT that assumes the right and bottom of the block are all 0. This seems to be significantly faster only for mmx; for sse2 it's nearly a wash between the added check vs. the time saved.
>> 
>> For an average video, around a third of all idcts are DC-only, a third more could be done with that shortcut (i.e. last_nnz is under 10), and the rest require a full IDCT.
>> 
>> libtheora only does the 10 element shortcut, not DC-only. It also only has an mmx IDCT.
>> 
>> I also haven't really looked at whether a DC-only IDCT is beneficial for mpeg codecs, thus the vp3-specific dsputil function.
>> 
> 
> [...]
>> diff --git a/libavcodec/vp3dsp.c b/libavcodec/vp3dsp.c
>> index 87b64de..606e361 100644
>> --- a/libavcodec/vp3dsp.c
>> +++ b/libavcodec/vp3dsp.c
>> @@ -223,6 +223,25 @@ void ff_vp3_idct_add_c(uint8_t *dest/*align 8*/, int line_size, DCTELEM *block/*
>>     idct(dest, line_size, block, 2);
>> }
>> 
>> +void ff_vp3_idct_dc_add_c(uint8_t *dest/*align 8*/, int line_size, DCTELEM *block/*align 16*/){
>> +    const uint8_t *cm = ff_cropTbl + MAX_NEG_CROP;
>> +    int i, dc = block[0];
> 
>> +    dc = (46341*dc)>>16;
>> +    dc = (46341*dc)>>16;
> 
> me searches for a bag to vomit into ...
> do they do all x>>1 in theora that way or just selected ones?

Every multiplication in the IDCT is immediately followed by cutting the least significant 16 bits.
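
To illustrate why the two multiplies can't simply be replaced by dc >> 1: 46341 is roughly 65536/sqrt(2), and because the low 16 bits are dropped after each multiply, the result can come out lower than a plain shift. A minimal sketch, where the helper name is made up for illustration and right shift of negative values is assumed to be arithmetic (as the patch itself assumes):

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: scale dc by about
 * 1/sqrt(2) twice, discarding the low 16 bits after each multiply,
 * the same way the C version in the patch does. */
static int vp3_dc_scale_twice(int dc)
{
    dc = (46341 * dc) >> 16;
    dc = (46341 * dc) >> 16;
    return dc;
}
```

For dc = 2 this returns 0, while 2 >> 1 is 1; for dc = 100 it returns 49 rather than 50.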

> [...]
>> diff --git a/libavcodec/x86/vp3dsp_mmx.c b/libavcodec/x86/vp3dsp_mmx.c
>> index fead8e8..e39d0a1 100644
>> --- a/libavcodec/x86/vp3dsp_mmx.c
>> +++ b/libavcodec/x86/vp3dsp_mmx.c
>> @@ -395,3 +395,65 @@ void ff_vp3_idct_add_mmx(uint8_t *dest, int line_size, DCTELEM *block)
>>     ff_vp3_idct_mmx(block);
>>     add_pixels_clamped_mmx(block, dest, line_size);
>> }
>> +
> 
>> +void ff_vp3_idct_dc_add_mmx2(uint8_t *dest, int linesize, DCTELEM *block)
>> +{
>> +    int dc = block[0];
>> +    dc = (46341*dc)>>16;
> 
>> +    dc = (46341*dc)>>16;
>> +    dc = (dc + 8) >> 4;
> 
> you can merge these 2

Done
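
For reference, one way the second multiply and the rounding shift could be merged is to fold the +8 into the product before a single combined shift. This is only a sketch under that assumption; the function names are hypothetical, and the attached patch may word it differently:

```c
#include <stdint.h>

/* Unmerged form, as in the quoted patch. */
static int dc_round_separate(int dc)
{
    dc = (46341 * dc) >> 16;
    dc = (46341 * dc) >> 16;
    dc = (dc + 8) >> 4;
    return dc;
}

/* Merged form: adding 8 after ">> 16" is the same as adding 8 << 16
 * before it, and the two shifts then combine into one ">> 20".
 * Assumes arithmetic right shift for negative values. */
static int dc_round_merged(int dc)
{
    dc = (46341 * dc) >> 16;
    dc = (46341 * dc + (8 << 16)) >> 20;
    return dc;
}
```

Both forms should give identical results over the whole 16-bit coefficient range, since a floor of a floor equals a single floor by the combined divisor.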

>> +    __asm__ volatile(
>> +        "movd          %0, %%mm0 \n\t"
>> +        "pshufw $0, %%mm0, %%mm0 \n\t"
>> +        "pxor       %%mm1, %%mm1 \n\t"
>> +        "psubw      %%mm0, %%mm1 \n\t"
>> +        "packuswb   %%mm0, %%mm0 \n\t"
>> +        "packuswb   %%mm1, %%mm1 \n\t"
>> +        ::"r"(dc)
>> +    );
>> +    __asm__ volatile(
>> +        "movq          %0, %%mm2 \n\t"
>> +        "movq          %1, %%mm3 \n\t"
>> +        "movq          %2, %%mm4 \n\t"
>> +        "movq          %3, %%mm5 \n\t"
>> +        "paddusb    %%mm0, %%mm2 \n\t"
>> +        "paddusb    %%mm0, %%mm3 \n\t"
>> +        "paddusb    %%mm0, %%mm4 \n\t"
>> +        "paddusb    %%mm0, %%mm5 \n\t"
>> +        "psubusb    %%mm1, %%mm2 \n\t"
>> +        "psubusb    %%mm1, %%mm3 \n\t"
>> +        "psubusb    %%mm1, %%mm4 \n\t"
>> +        "psubusb    %%mm1, %%mm5 \n\t"
>> +        "movq       %%mm2, %0    \n\t"
>> +        "movq       %%mm3, %1    \n\t"
>> +        "movq       %%mm4, %2    \n\t"
>> +        "movq       %%mm5, %3    \n\t"
>> +        :"+m"(*(uint32_t*)(dest+0*linesize)),
>> +         "+m"(*(uint32_t*)(dest+1*linesize)),
>> +         "+m"(*(uint32_t*)(dest+2*linesize)),
>> +         "+m"(*(uint32_t*)(dest+3*linesize))
>> +    );
>> +    dest += 4*linesize;
>> +    __asm__ volatile(
>> +        "movq          %0, %%mm2 \n\t"
>> +        "movq          %1, %%mm3 \n\t"
>> +        "movq          %2, %%mm4 \n\t"
>> +        "movq          %3, %%mm5 \n\t"
>> +        "paddusb    %%mm0, %%mm2 \n\t"
>> +        "paddusb    %%mm0, %%mm3 \n\t"
>> +        "paddusb    %%mm0, %%mm4 \n\t"
>> +        "paddusb    %%mm0, %%mm5 \n\t"
>> +        "psubusb    %%mm1, %%mm2 \n\t"
>> +        "psubusb    %%mm1, %%mm3 \n\t"
>> +        "psubusb    %%mm1, %%mm4 \n\t"
>> +        "psubusb    %%mm1, %%mm5 \n\t"
>> +        "movq       %%mm2, %0    \n\t"
>> +        "movq       %%mm3, %1    \n\t"
>> +        "movq       %%mm4, %2    \n\t"
>> +        "movq       %%mm5, %3    \n\t"
>> +        :"+m"(*(uint32_t*)(dest+0*linesize)),
>> +         "+m"(*(uint32_t*)(dest+1*linesize)),
>> +         "+m"(*(uint32_t*)(dest+2*linesize)),
>> +         "+m"(*(uint32_t*)(dest+3*linesize))
>> +    );
> 
> please write it as a single asm block, gcc had the habit of putting unneeded
> instructions between asm blocks

Fixed
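
In case the asm is hard to follow: the paddusb/psubusb pair amounts to a clamped add of the signed dc to each pixel. packuswb turns dc into an unsigned byte to add and -dc into an unsigned byte to subtract, and only one of the two is ever non-zero. A scalar sketch of the same idea, with made-up function names and assuming dc has already been scaled and rounded to a small range:

```c
#include <stdint.h>

/* Saturate a signed value to 0..255, as packuswb does per element. */
static uint8_t sat_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of the MMX code for one 8x8 block. */
static void dc_add_model(uint8_t *dest, int linesize, int dc)
{
    uint8_t add = sat_u8(dc);   /* mm0: packuswb of dc      */
    uint8_t sub = sat_u8(-dc);  /* mm1: packuswb of (0 - dc) */

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int p = dest[x];
            p = p + add > 255 ? 255 : p + add;  /* paddusb */
            p = p - sub < 0   ? 0   : p - sub;  /* psubusb */
            dest[x] = (uint8_t)p;
        }
        dest += linesize;
    }
}
```

The result for every pixel is the same as clamping p + dc to 0..255, without needing to unpack the pixels to 16 bits.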

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: textmate stdin s0mPt4.txt
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100416/12d7ac46/attachment.txt>