[FFmpeg-devel] [PATCH] avfilter/avf_showcqt: cqt_calc optimization on x86

Muhammad Faiz mfcc64 at gmail.com
Wed Jun 8 20:13:36 CEST 2016


On Wed, Jun 8, 2016 at 11:34 PM, James Almer <jamrial at gmail.com> wrote:
> On 6/7/2016 6:18 AM, Muhammad Faiz wrote:
>>>> +        sub     lend, 2
>>>> >> +        lea     dstq, [dstq + 16]
>>> >
>>> > Use add
>>> >
>>>> >> +        lea     coeffsq, [coeffsq + 2*Coeffs.sizeof]
>>> >
>>> > Same, assuming sizeof is an immediate.
>>> >
>> This is optimization to separate sub and jnz with lea.
>
> Does it make a noticeable difference? We don't really bother
> keeping the instruction that sets the flag and the one that
> checks it separated anywhere else, so i wonder.
>
I don't know.

>> Using add will clobber flag register.
>
> You could keep the coeffsq lea but change the dstq into an add
> and move it above the sub instruction. No need for two slow leas
> for the above purpose.
>
OK, probably add is faster.
But AMD64 optimization guide said that lea instr have a latency 1
when there are two source operand (so it is as fast as add)
Anyway, this code is not the critical path, so measuring faster one
is difficult.

>> Also lea does not need rex prefix
>
> add dstq, 16 seems to be four bytes, same as lea dstq [dstq+16],
> at least on win64.
I'm sorry, I'm wrong. It uses rex prefix

>
>>>> +    SELECT_CQT_CALC(sse,  SSE,  4, 0);
>>>> >> +    SELECT_CQT_CALC(sse3, SSE3, 4, 0);
>>>> >> +    SELECT_CQT_CALC(avx,  AVX,  8, 01452367);
>>> >
>>> > Use AVX_FAST, so this function will not be used on CPUs that set the
>>> > AV_CPU_FLAG_AVXSLOW flag.
>>> >
>>>> >> +    SELECT_CQT_CALC(fma3, FMA3, 8, 01452367);
>>> >
>>> > Same, use FMA3_FAST. The result will then be the FMA3 version used by
>>> > Intel CPUs and hopefully AMD Zen, and the FMA4 one by Bulldozer based
>>> > CPUs.
>>> >
>>>> >> +    SELECT_CQT_CALC(fma4, FMA4, 8, 01452367);
>>>> >> +}
>>>> >>
>>> >
>> OK, also reorder (FMA4 before AVX because AVX/ymm without FMA4 is faster than
>> FMA4/xmm)
>
> Not in any machine supporting FMA4. And they will not use the
> AVX or FMA3 functions here because they set the avxslow flag,
> which the EXTERNAL_*_FAST macros check for.
>
> You could have kept the FMA4 init at the end and it would be
> functionally the same, but in any case it's purely cosmetic.
>
If there is machine that support AVX (and fast ymm), FMA3 and FMA4,
without reorder the selected code is FMA4.

Anyway, I have applied the patch.

Thank's for the review


More information about the ffmpeg-devel mailing list