[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Michael Niedermayer michaelni
Sun Jun 20 00:35:17 CEST 2010


On Sun, Jun 20, 2010 at 12:24:45AM +0200, Vitor Sessak wrote:
> On 06/20/2010 12:03 AM, Michael Niedermayer wrote:
>> On Sat, Jun 19, 2010 at 09:25:51PM +0200, Vitor Sessak wrote:
>>> On 06/19/2010 08:47 PM, Michael Niedermayer wrote:
>>>> On Fri, Jun 11, 2010 at 11:34:21PM +0200, Vitor Sessak wrote:
>>>>> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>>>>>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>>>>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>>>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>>>>>> Moving discussion to -devel...
>>>>>>>>>
>>>>>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM,
>>>>>>>>>>>>>> michael<subversion at mplayerhq.hu>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Author: michael
>>>>>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Log:
>>>>>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> :)
>>>>>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct
>>>>>>>>>>>>> and
>>>>>>>>>>>>> or
>>>>>>>>>>>>> simd optimize it?
>>>>>>>>>>>>
>>>>>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly
>>>>>>>>>>>> hack
>>>>>>>>>>>> to test
>>>>>>>>>>>> it attached.
>>>>>>>>>>>
>>>>>>>>>>> if dct32() is faster then it should be used by our generic dct
>>>>>>>>>>> code.
>>>>>>>>>>> at least for the plain C case
>>>>>>>>>>
>>>>>>>>>> I've given a try at a SSE dct32(). It is much faster than current 
>>>>>>>>>> C
>>>>>>>>>> code. The only problem is that current code in mpegaudiodec.c 
>>>>>>>>>> expect
>>>>>>>>>> two
>>>>>>>>>> arguments, one input (which is destructed) and one output. ITOH,
>>>>>>>>>> ff_dct_calc() does everything in-place, what does not glue well 
>>>>>>>>>> with
>>>>>>>>>> the
>>>>>>>>>> current mpegaudiodec.c code. Can you (or anyone else that knows
>>>>>>>>>> mpegaudiodec.c well) fix it?
>>>>>>>>>
>>>>>>>>> I've given a try of making mpegaudiodec.c use the same buffer for 
>>>>>>>>> dct
>>>>>>>>> input
>>>>>>>>> and output and it is not trivial. It is much easier (and has no
>>>>>>>>> measurable
>>>>>>>>> slowdown) to make ff_dct_calc() take both an input and an output
>>>>>>>>> pointer
>>>>>>>>> as
>>>>>>>>> in attached patch.
>>>>>>>>>
>>>>>>>>> -Vitor
>>>>>>>>
>>>>>>>>>      avfft.c     |    2 +-
>>>>>>>>>      binkaudio.c |    2 +-
>>>>>>>>>      dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>>>>>>      fft-test.c  |    6 ++----
>>>>>>>>>      fft.h       |   11 +++++++++--
>>>>>>>>>      wmavoice.c  |    4 ++--
>>>>>>>>>      6 files changed, 38 insertions(+), 27 deletions(-)
>>>>>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>>>>>>
>>>>>>>> ok if tested and no slowdown is confirmed
>>>>>>>
>>>>>>> I retested carefully and found a 3% slowdown. It is due to aliasing,
>>>>>>> which
>>>>>>> does not allow the compiler to unroll the loops. I tested unrolling 
>>>>>>> by
>>>>>>> hand
>>>>>>> the loops and afterwards it is as fast as before.
>>>>>>>
>>>>>>> Are you ok with the patch as is or ok if I apply another patch
>>>>>>> afterwards
>>>>>>> unrolling the loops?
>>>>>>
>>>>>> i think that a 3% speedloss is significant so iam definitly not ok 
>>>>>> with
>>>>>> something that leads to such speedloss.
>>>>>>
>>>>>> also if yu test this patch + unroll against svn, i wonder how
>>>>>> svn+unroll performs
>>>>>> as well as what code cache effects the unroll actually has in actual 
>>>>>> use
>>>>>
>>>>> Ok, I took some time to test it really careful and I gave up making a
>>>>> code
>>>>> as fast as in-place (to begin with, gcc always get register-starved). 
>>>>> So
>>>>> I
>>>>> propose the attached patch. At least the faster code can be used by the
>>>>> common DCT framework and it makes easier to add ASM optimisations.
>>>>
>>>> looks like you created a mess
>>>> The questions are
>>>> 1. what is the ideal way speedwise to implement this for mp3?
>>>
>>> Inlining either the DCT32 C code (or when available the ASM code) in
>>> mpegaudiodec.c since there is only speed gain and no loss in inlining - 
>>> it
>>> is called in just one place. BTW, I do not believe that optimizing (simd
>>> asm or C) my general pre-process->FFT->post-process dct implementation
>>> might beat the 32-point specific code. Allowing for different 
>>> input/output
>>> pairs or not is irrelevant for the current C or ASM implementation (nor
>>> could I imagine any difference for a future one).
>>
>> i didnt mean inlining i meant the in!= out
>
> I don't remember seeing a big difference _for the dct32 code_ between in == 
> out and in != out.

now iam confused, i thought the 3% you quoted was about in ==out vs in!= out
?

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Its not that you shouldnt use gotos but rather that you should write
readable code and code with gotos often but not always is less readable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100620/d10e1506/attachment.pgp>



More information about the ffmpeg-devel mailing list