[FFmpeg-devel] [PATCH] swr/resample: use fma when it is faster

Ganesh Ajjanagadde gajjanagadde at gmail.com
Mon Dec 14 00:08:33 CET 2015


On Sun, Dec 13, 2015 at 5:55 PM, Ganesh Ajjanagadde
<gajjanagadde at gmail.com> wrote:
> On Sun, Dec 13, 2015 at 5:47 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> Hi,
>>
>> On Sun, Dec 13, 2015 at 4:59 PM, Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>> wrote:
>>>
>>> fma is a faster function on architectures supporting a native CPU
>>> instruction for it.
>>> This may be tested by the ISO C optionally defined FP_FAST_FMA. Although
>>> in the x86 lineup this came fairly late
>>> (from Haswell onwards, and hence is absent unless appropriate -march is
>>> passed),
>>> numerous other architectures support it:
>>> https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation.
>>>
>>> Concretely, one can expect ~ 15-25% speedup that is of course heavily
>>> architecture dependent.
>>>
>>> This patch also ensures that as people migrate to newer CPUs, the
>>> benefit will slowly trickle in.
>>>
>>> I doubt this will cause build failures on broken libm's since I can't
>>> imagine a platform where FP_FAST_FMA is defined but the function fma is
>>> absent.
>>>
>>> Sample benchmark (x86-64, Haswell, GNU/Linux under -march=native)
>>>
>>> old:
>>> 515828458 decicycles in build_filter (loop 1000),    1024 runs,      0
>>> skips
>>>
>>> new (fma):
>>> 435866377 decicycles in build_filter (loop 1000),    1024 runs,      0
>>> skips
>>>
>>> Tested with FATE.
>>>
>>> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>>> ---
>>>  libswresample/resample.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/libswresample/resample.c b/libswresample/resample.c
>>> index 34eb4c0..e61d4c5 100644
>>> --- a/libswresample/resample.c
>>> +++ b/libswresample/resample.c
>>> @@ -33,8 +33,12 @@ static inline double eval_poly(const double *coeff, int
>>> size, double x) {
>>>      double sum = coeff[size-1];
>>>      int i;
>>>      for (i = size-2; i >= 0; --i) {
>>> +#ifdef FP_FAST_FMA
>>> +        sum = fma(sum, x, coeff[i]);
>>> +#else
>>>          sum *= x;
>>>          sum += coeff[i];
>>> +#endif
>>>      }
>>>      return sum;
>>>  }
>>> --
>>> 2.6.4
>>
>>
>> Nope, this is not how we do CPU-specific optimizations. Check example
>> implementations in libswresample/x86/*.asm and the related init functions
>> plus macros to check for runtime cpu support in libswresample/x86/*_init.c.
>> You want to follow that pattern.
>
> No, this is not x86 specific. This is generic code. If I did such a
> maneuver, benefits would apply only to x86, an inferior outcome.

To clarify: yes, in theory one could put such code into
libswresample/x86, libswresample/aarch64, and directories for many
other architectures (some of which do not even exist yet). Such a diff
is far larger and more brittle: I can't even test targets like mips,
and looking up the manuals for each and every architecture to find out
what its fma equivalent is, and when it was introduced, is a pain in
the neck.

ISO C provides a mechanism (FP_FAST_FMA), albeit one based on
build-time rather than runtime detection.

This patch thus gives a benefit with minimal scope for regressions.
Unless others show where and how fma detection can be done for all
arches (aarch64, arm, mips, powerpc, itanium, etc., in addition to
x86-64), I view your idea as future work.
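
For illustration, the build-time check can be sketched as follows. This
is a minimal, self-contained version of the eval_poly change from the
patch, not the exact libswresample code; eval_poly_uses_fma is a
hypothetical helper added here only to report which branch was compiled
in.

```c
#include <math.h>   /* fma(); may define FP_FAST_FMA (optional in ISO C) */

/* Evaluate coeff[0] + coeff[1]*x + ... + coeff[size-1]*x^(size-1)
 * using Horner's rule. Assumes size >= 1. */
static inline double eval_poly(const double *coeff, int size, double x)
{
    double sum = coeff[size - 1];
    int i;
    for (i = size - 2; i >= 0; --i) {
#ifdef FP_FAST_FMA
        /* The toolchain reports that fma() is at least as fast as a
         * separate multiply and add, so use the fused form. */
        sum = fma(sum, x, coeff[i]);
#else
        /* Generic fallback: separate multiply and add. */
        sum *= x;
        sum += coeff[i];
#endif
    }
    return sum;
}

/* Hypothetical helper (not part of the patch): returns 1 if the fma
 * branch was selected at build time, 0 otherwise. */
static inline int eval_poly_uses_fma(void)
{
#ifdef FP_FAST_FMA
    return 1;
#else
    return 0;
#endif
}
```

With coefficients {1, 2, 3} and x = 2, either branch evaluates
1 + 2*2 + 3*4 = 17. Which branch is compiled depends only on the build
flags; as noted in the commit message, on x86 the macro is absent
unless an appropriate -march is passed.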

>
>>
>> Ronald

