[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations

Mon Jul 7 21:02:28 CEST 2008

Michael Niedermayer schrieb:
> On Thu, Jul 03, 2008 at 02:51:18PM +0200, Victor Pollex wrote:
>   
>> Michael Niedermayer schrieb:
>>     
>>> On Sat, Jun 28, 2008 at 12:31:41PM +0200, Victor Pollex wrote:
>>>       
> [...]
>   
>>> [...]
>>>
>>>   
>>>       
>>>> +#define LOAD_ADD_CLAMP_STORE_8X1(io,t0,t1,r0,r1)\
>>>> +    "movq      "#io", "#t0"\n\t"\
>>>> +    "movq      "#t0", "#t1"\n\t"\
>>>> +    "punpcklbw %%mm7, "#t0"\n\t"\
>>>> +    "punpckhbw %%mm7, "#t1"\n\t"\
>>>> +    "paddw     "#r0", "#t0"\n\t"\
>>>> +    "paddw     "#r1", "#t1"\n\t"\
>>>> +    "packuswb  "#t1", "#t0"\n\t"\
>>>> +    "movq      "#t0", "#io"\n\t"
>>>>     
>>>>         
>>> some of the movq seem redundant
>>>   
>>>       
>> I'm sorry but I don' see it. Although this is now obsolete, I'd still like 
>> to know which one is redundant and why.
>>     
>
> i thought io was a register, if it is memory then things are different
> sorry
>
>
> [...]
>
> the LOAD4/STORE4 patch is ok
>
> [...]
>
>   
>> +/*
>> +    postcondition:
>> +        dst0 = [15:0](dst0 + src);
>> +        dst1 = [15:0](dst1 + src);
>> +        dst2 = [15:0](dst2 - src);
>> +        dst3 = [15:0](dst3 - src);
>> +*/
>> +#define ADD2SUB2(src, dst0, dst1, dst2, dst3)\
>> +    ADD1SUB1(src, dst0, dst2)\
>> +    ADD1SUB1(src, dst1, dst3)
>>     
>
> This occurs 6 times, thus it safes 6 ADD1SUB1 lines while the macro
> with documentation needs 10 lines thus overall this is a loss, not only
> is it one more macro the reader has to understand it is also more code
> with the macro than without
>   
removed macro and replaced all occurences with two ADD1SUB1
>
> [...]
>   
>> +/*
>> +    precodition:
>> +        for all values v in r0, r1, r2, r3: -3971 <= v <= 3971
>> +
>> +    postcondition:
>> +        r3 = ((17 * (r0 + r2) + (22 * r1 + 10 * r3) + c) >> 3)
>> +        r4 = ((17 * (r0 - r2) - (10 * r1 - 22 * r3) + c) >> 3)
>> +        r1 = ((17 * (r0 - r2) + (10 * r1 - 22 * r3) + c) >> 3)
>> +        r2 = ((17 * (r0 + r2) - (22 * r1 + 10 * r3) + c) >> 3)
>> +        r0 undefined
>> +        r5 undefined
>> +        r6 undefined
>> +        r7 undefined
>> +*/
>> +#define TRANSFORM_4X4_ROW(r0,r1,r2,r3,r4,r5,r6,r7,c)\
>> +    TRANSPOSE4(r0,r1,r2,r3,r4)\
>> +    TRANSFORM_4X4_COMMON(r0,r3,r4,r2,r1,r5,r6,r7,c)\
>> +    "paddw "#r4", "#r4"\n\t" /* 2 * (r0 + r2) */\
>> +    SUMSUB_BA(r3,r4)\
>> +    "paddw "#r1", "#r3"\n\t"\
>> +    "paddw "#r7", "#r4"\n\t"\
>> +    "paddw "#r0", "#r0"\n\t" /* 2 * (r0 - r2) */\
>> +    SUMSUB_BA(r2,r0)\
>> +    "paddw "#r5", "#r0"\n\t"\
>> +    "paddw "#r6", "#r2"\n\t"\
>> +    TRANSPOSE4(r3,r0,r2,r4,r1)
>>     
>
> It should be possible to merge one transpose into the scantble (the mpeg1/2/4
> decoder does that too)
>
>   
I'm not sure if this should be done as I found the following lines in 
decode_sequence_header in vc1.c
    if (!v->res_fasttx)
    {
        v->s.dsp.vc1_inv_trans_8x8 = ff_simple_idct;
        v->s.dsp.vc1_inv_trans_8x4 = ff_simple_idct84_add;
        v->s.dsp.vc1_inv_trans_4x8 = ff_simple_idct48_add;
        v->s.dsp.vc1_inv_trans_4x4 = ff_simple_idct44_add;
    }

> [...]
>   
>> +/*
>> +    postcondition:
>> +        r0 = [15:0](2 * r0);
>> +        r1 = [15:0](3 * r0);
>> +*/
>> +#define G3X(r0,r1)\
>> +    "movq  "#r0", "#r1"\n\t" /* r0 */\
>> +    "paddw "#r0", "#r0"\n\t" /* 2 * r0 */\
>> +    "paddw "#r0", "#r1"\n\t" /* 3 * r0 */
>>     
>
> 4 uses, saving 8 lines, macro with docs is 9 lines
>   
removed docs
>
> [...]
>   
>> +static void vc1_inv_trans_4x4_mmx(uint8_t *dest, int linesize, DCTELEM *block)
>> +{
>> +    asm volatile(
>> +    LOAD4(0x10,0x00(%0),%%mm0,%%mm1,%%mm2,%%mm3)
>> +    TRANSFORM_4X4_ROW(%%mm0,%%mm1,%%mm2,%%mm3,%%mm4,%%mm5,%%mm6,%%mm7,%3)
>> +    TRANSFORM_4X4_COL(%%mm3,%%mm4,%%mm1,%%mm2,%%mm0,%%mm5,%%mm6,%%mm7,0x08+%3)
>> +    "pxor %%mm7, %%mm7\n\t"
>> +    LOAD_ADD_CLAMP_STORE_4X4(%2,%1,%%mm0,%%mm4,%%mm3,%%mm2,%%mm1)
>> +    :
>> +    : "r"(block), "r"(dest), "r"(linesize), "m"(constants[0])
>> +    : "memory"
>>     
>
> you are modifying a register (%1) which is just an input, this isnt
> correct, gcc could assume it didnt change ...
> it should be "+r" not "r" and listed with the outputs.
>
> [...]
>   
changed it
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vc1_inv_trans_mmx_v5.patch
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/ed3b45ef/attachment.asc>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dsputil_v2.patch
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/ed3b45ef/attachment.txt>