[FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm

Ganesh Ajjanagadde gajjanag at mit.edu
Sun Oct 11 02:37:59 CEST 2015


On Sat, Oct 10, 2015 at 8:34 PM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> On Sat, Oct 10, 2015 at 8:22 PM, Michael Niedermayer
> <michael at niedermayer.cc> wrote:
>> On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote:
>>> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote:
>>> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer
>>> > <michael at niedermayer.cc> wrote:
>>> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote:
>>> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde
>>> > >> <gajjanagadde at gmail.com> wrote:
>>> > >> > This uses Stein's binary GCD algorithm:
>>> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm
>>> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard architectures.
>>> > >> > Have not benchmarked, so I can't comment
>>> > >>
>>> > >> Before you submit a patch that's supposed to make something faster,
>>> > >> you should benchmark it to verify that it is in fact faster. Do this
>>> > >> with inputs of various sizes on both 32- and 64-bit architectures and
>>> > >> both with and without compilers that support __builtin_ctzll(v).
>>> > >
>>> > > without __builtin_ctzll() the old code seems faster in a simple test
>>> > > like:
>>> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov
>>> >
>>> > >
>>> > > also it seems a simple
>>> > >     while (u != v) {
>>> > >         if (u > v)
>>> > >             FFSWAP(int64_t, v, u);
>>> > >         v -= u;
>>> > >         do {
>>> > >             v >>= 1;
>>> > >         } while(!(v & 1));
>>> > >     }
>>> > > is faster than the non builtin ctzll in its place but still not as
>>> > > fast as the current code
>>> >
>>> > So I checked out the various implementation of (non built in) ctzll at
>>> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch.
>>> > The funny thing is, among the looping ones, the fastest is the simplest :):
>>> > static inline int ff_ctzll_c(long long v)
>>> > {
>>> >     int c = 0;
>>> >     while (!(v & 1)) {
>>> >         c++;
>>> >         v >>= 1;
>>> >     }
>>> >     return c;
>>> > }
>>
>> tested this with the matrixbench_mpeg2 ->mov test and it seems
>> 20% or so slower than the LUT based version
>
> If the LUT seems to be a good idea, I think the De Bruijn one should
> be even better - no branching, only a multiply. and some shifts etc
> Cache effects should be similar for the LUT and the De Bruijn, perhaps
> slightly in favor of De Bruijn due to a smaller table (64 vs 256
> entries).
>
> Most places I see only the 32 bit version, so I can't quickly check this.

For reference: http://supertech.csail.mit.edu/papers/debruijn.pdf.

>
>>
>>
>>>
>>> its possible gcc/clang recognizes this and replaces it by the
>>> instruction for the builtin
>>> this is perfectly fine when such an instruction is available
>>> but when not this might be slower than what we would expect from
>>> just testing on architectures with such instructions
>>
>> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>> fails to replace the loop by anything smart
>>
>> you can use
>> make libavutil/mathematics.s
>>
>> to compile the C code to asm and see how the functions get compiled
>>
>> [...]
>>
>> --
>> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>>
>> What does censorship reveal? It reveals fear. -- Julian Assange
>>
>> _______________________________________________
>> ffmpeg-devel mailing list
>> ffmpeg-devel at ffmpeg.org
>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>>


More information about the ffmpeg-devel mailing list