[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen

Wed Jan 6 06:01:40 CET 2016

On Tue, Jan 5, 2016 at 10:46 AM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> On Tue, Jan 5, 2016 at 10:10 AM, Daniel Serpell <dserpell at gmail.com> wrote:
>> Hi!,
>>
>> El Tue, Jan 05, 2016 at 08:08:35AM -0800, Ganesh Ajjanagadde escribio:
>>> On Tue, Jan 5, 2016 at 7:44 AM, Daniel Serpell <dserpell at gmail.com> wrote:
>>> > Hi!,
>>> >
>>> > El Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde escribio:
>>> >> This exploits an approach based on the sieve of Eratosthenes, a popular
>>> >> method for generating prime numbers.
>>> >>
>>> >> Tables are identical to previous ones.
>>> >>
>>> >> Tested with FATE with/without --enable-hardcoded-tables.
>>> >>
>>> >> Sample benchmark (Haswell, GNU/Linux+gcc):
>>> >> prev:
>>> >> 7860100 decicycles in cbrt_tableinit,       1 runs,      0 skips
>>> >> 7777490 decicycles in cbrt_tableinit,       2 runs,      0 skips
>>> >> [...]
>>> >> 7582339 decicycles in cbrt_tableinit,     256 runs,      0 skips
>>> >> 7563556 decicycles in cbrt_tableinit,     512 runs,      0 skips
>>> >>
>>> >> new:
>>> >> 2099480 decicycles in cbrt_tableinit,       1 runs,      0 skips
>>> >> 2044470 decicycles in cbrt_tableinit,       2 runs,      0 skips
>>> >> [...]
>>> >> 1796544 decicycles in cbrt_tableinit,     256 runs,      0 skips
>>> >> 1791631 decicycles in cbrt_tableinit,     512 runs,      0 skips
>>> >>
>>> >
>>> > See attached code, function "test1", based on an approximation of:
>>> >
>>> >   (i+1)^(1/3) ~= i^(1/3) * ( 1 + 1/(3i) - 1/(9i) + 5/(81i) - .... )
[...]

So I looked up Mr. Fog's manuals, and it seems like divides are
considerably slower than multiplies and adds. This made me somewhat
skeptical of your idea, and so I benched it.
Seems like your patch is actually significantly slower on my setup
(gcc 5.3.0, GNU libm):
3349080 decicycles in cbrt_tableinit,       1 runs,      0 skips
3466605 decicycles in cbrt_tableinit,       2 runs,      0 skips
[...]
3425939 decicycles in cbrt_tableinit,     256 runs,      0 skips
3414528 decicycles in cbrt_tableinit,     512 runs,      0 skips

(clang 3.7.0):
3337590 decicycles in cbrt_tableinit,       1 runs,      0 skips
3276225 decicycles in cbrt_tableinit,       2 runs,      0 skips
[...]
2871065 decicycles in cbrt_tableinit,     256 runs,      0 skips
2832137 decicycles in cbrt_tableinit,     512 runs,      0 skips

The divides seem to be what is killing your approach.
Times (just for the divisions, clang):
1085430 decicycles in cbrt_tableinit,       1 runs,      0 skips
1005165 decicycles in cbrt_tableinit,       2 runs,      0 skips
[...]
 863732 decicycles in cbrt_tableinit,     256 runs,      0 skips
 864907 decicycles in cbrt_tableinit,     512 runs,      0 skips

Another illustration, even if I change the 1.0/(3*i) to simply i to
get rid of the multiplication as well, obviously not possible but
really wishful thinking, you still only get:
(clang)
2013390 decicycles in cbrt_tableinit,       1 runs,      0 skips
1950135 decicycles in cbrt_tableinit,       2 runs,      0 skips
[...]
1668963 decicycles in cbrt_tableinit,     256 runs,      0 skips
1653179 decicycles in cbrt_tableinit,     512 runs,      0 skips

To complete my curiousity,
(gcc)
1485240 decicycles in cbrt_tableinit,       1 runs,      0 skips
1522115 decicycles in cbrt_tableinit,       2 runs,      0 skips
[...]
1324325 decicycles in cbrt_tableinit,     256 runs,      0 skips
1322110 decicycles in cbrt_tableinit,     512 runs,      0 skips

i.e not a whole lot faster than the benchmark I posted.

If you feel this can't be right, I can do give a higher level
justification, which shows that for a reasonable libm cbrt, and
standard architectural assumptions, the approach I have is faster.

>
>>
>>     Daniel.
>>
>> _______________________________________________
>> ffmpeg-devel mailing list
>> ffmpeg-devel at ffmpeg.org
>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel