[FFmpeg-devel] what is h264_idct_add8()?

Ronald S. Bultje rsbultje
Tue Sep 14 15:37:11 CEST 2010


Hi,

On Mon, Sep 13, 2010 at 6:03 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> Hi,
>
> On Mon, Sep 13, 2010 at 5:26 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> On Sun, Sep 12, 2010 at 08:24:45PM -0400, Ronald S. Bultje wrote:
>>> Hi,
>>>
>>> On Sun, Sep 12, 2010 at 8:26 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>> > On Fri, Sep 10, 2010 at 09:48:53PM -0400, Ronald S. Bultje wrote:
>>> >> On Mon, Sep 6, 2010 at 4:32 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>> >> > On Mon, Sep 06, 2010 at 12:33:13PM -0400, Ronald S. Bultje wrote:
>>> >> > [...]
>>> >> >> Michael, do you still have the patch that enables using idct_add8()
>>> >> >> for chroma (probably in h264.c) so I can test the performance of
>>> >> >> yasmified idct_add8 against the current code that doesn't use
>>> >> >> idct_add8()?
>>> >> >
>>> >> > I tried a bit of find and grep, but it seems I am not looking in the right
>>> >> > place or not searching for the right thing
>>> >>
>>> >> So what do you suggest we do?
>>> >> a) remove the idct_add8() functions from H264DSPContext
>>> >> b) leave as-is (because I can't test that my yasm conversion is correct)
>>> >> c) convert it to yasm along with the rest, hope that it is correct
>>> >> without testing (?)
>>> >> d) something else?
>>> >>
>>> >> (A) is easiest, but (C) may have some benefit if I decide to test the
>>> >> performance benefit in the future with the yasmified version. (B)
>>> >> means duplication of code and thus sounds like a bad plan...
>>> >
>>> > I am against (a); I don't care about the rest. Mans' suggestion is possible
>>> > too but seems like much work
>>>
>>> I appear to have wasted too much time on this already, so let's get this
>>> over with. I only took a single measurement because the difference is quite
>>> strong (the reason is obviously MMX vs SSE2, along with what you did
>>> earlier to not have to call a vfunc 8 times).
>>>
>>> Current SVN:
>>> 1838 dezicycles in chroma idct add8, 262111 runs, 33 skips
>>>
>>> Using add8 (see attached patch):
>>> 1745 dezicycles in chroma idct add8, 262124 runs, 20 skips
>>>
>>> add8, SSE2:
>>> 1264 dezicycles in chroma idct add8, 262106 runs, 38 skips
>>>
>>> My recommendation: we should apply this (along with the rest of my
>>> yasmification).
>>>
>>> The rest of the yasmification patch is attached and will have to be
>>> applied with it. I can in all honesty (I measured them all, bleh) say
>>> that no single function is slower in yasm at this point, although that
>>> took a good hack in h264_idct_add16_sse2() (somehow the unroll of the
>>> loop plus inlining of scan8[] makes it a good 20% faster - right now
>>> it's 10 cycles faster than the gcc one, but the not-unrolled one was
>>> 20-25% slower than gcc (which unrolls it too)).
>>>
>>> Many (+/- half of the) functions are a few (5-30) cycles faster in
>>> yasm, the other half is approximately equal speed. The speedups are
>>> generally in functions where gcc screws up loop conditionals (e.g. for
>>> (x=0;x<16;x++) { if (a || b) { .. } }, which it performs horribly at by
>>> creating something like if (!a1) goto end1; { yes1: .. } if (!a2) goto
>>> end2; { yes2: .. } [.. and so on until 16 ..] end1: if (b1) goto yes1;
>>> if (b2) goto yes2; [.. and so on ..]). It's quite hilarious.
>>>
>>> Ronald
>>
>>>  h264.c |    8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>> b89da7914f847f12bbd9c9ca547deedafe4f6326  h264_use_add8.patch
>>
>> If it's faster (also time ./ffmpeg) and someone has looked over the code,
>> then I've no objections
>
> Everything measured on a Core i7, OS X 10.6, cathedral sample:
>
> time ffmpeg (x86-64) after:
> 9.393
> 9.468
> 9.353
>
> before:
> 9.411
> 9.537
> 9.649
>
> time ffmpeg (x86-32) after:
> 10.110
> 10.143
> 10.098
>
> time ffmpeg (x86-32) before:
> 10.161
> 10.154
> 10.210
>
> decode_mb START/STOP_TIMER before x86-32:
> 8453 dezicycles in decode_mb, 4192657 runs, 1647 skips
> 8462 dezicycles in decode_mb, 4192564 runs, 1740 skips
> 8439 dezicycles in decode_mb, 4192540 runs, 1764 skips
>
> after x86-32:
> 8371 dezicycles in decode_mb, 4192574 runs, 1730 skips
> 8384 dezicycles in decode_mb, 4192549 runs, 1755 skips
> 8375 dezicycles in decode_mb, 4192546 runs, 1758 skips
>
> decode_mb START/STOP_TIMER before x86-64:
> 7617 dezicycles in decode_mb, 4192592 runs, 1712 skips
> 7594 dezicycles in decode_mb, 4192654 runs, 1650 skips
> 7610 dezicycles in decode_mb, 4192527 runs, 1777 skips
>
> after x86-64:
> 7524 dezicycles in decode_mb, 4192683 runs, 1621 skips
> 7587 dezicycles in decode_mb, 4192043 runs, 2261 skips
> 7528 dezicycles in decode_mb, 4192627 runs, 1677 skips
>
> Will apply tomorrow if nobody objects.

Today is tomorrow; applied.

Ronald


