[FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths
Martin Storsjö
martin at martin.st
Wed Mar 30 17:01:27 EEST 2022
On Wed, 30 Mar 2022, Martin Storsjö wrote:
> On Fri, 25 Mar 2022, Ben Avison wrote:
>
>> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
>>
>> vc1dsp.vc1_inv_trans_4x4_c: 158.2
>> vc1dsp.vc1_inv_trans_4x4_neon: 65.7
>> vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
>> vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
>> vc1dsp.vc1_inv_trans_4x8_c: 335.2
>> vc1dsp.vc1_inv_trans_4x8_neon: 106.2
>> vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
>> vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
>> vc1dsp.vc1_inv_trans_8x4_c: 365.7
>> vc1dsp.vc1_inv_trans_8x4_neon: 97.2
>> vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
>> vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
>> vc1dsp.vc1_inv_trans_8x8_c: 547.7
>> vc1dsp.vc1_inv_trans_8x8_neon: 137.0
>> vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
>> vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5
>>
>> Signed-off-by: Ben Avison <bavison at riscosopen.org>
>> ---
>> libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 +
>> libavcodec/aarch64/vc1dsp_neon.S | 678 +++++++++++++++++++++++
>> 2 files changed, 697 insertions(+)
>
> Looks generally reasonable. Is it possible to factorize out the individual
> transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 4x4
> functions) without too much loss? The downshift, which differs between the two
> passes, could either be left outside of the macro, or the downshift amount
> could be made a macro parameter.
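The factorization suggested above can be illustrated in plain C, where the NEON code would use an assembler macro instead of a function. This is a sketch modelled on the arithmetic of the C reference for the 8x8 VC-1 inverse transform. The names `vc1_idct8_1d` and `vc1_inv_trans_8x8_sketch` are hypothetical and are not part of the patch. One shared 8-point 1-D transform is invoked for both passes. The rounding bias is passed in, and the downshift (3 for the first pass, 7 for the second) stays outside the helper. The sketch also assumes that `>>` on negative `int` values is an arithmetic shift, which mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one 8-point VC-1 inverse transform over the
 * elements src[0], src[stride], ..., src[7 * stride]. The bias is a
 * parameter; the final downshift is left to the caller. Assumes >> on
 * negative ints is an arithmetic shift. */
static void vc1_idct8_1d(const int16_t *src, int stride, int bias, int out[8])
{
    assert(stride > 0);

    /* Even part: coefficients 12, 16 and 6. */
    int t1 = 12 * (src[0] + src[4 * stride]) + bias;
    int t2 = 12 * (src[0] - src[4 * stride]) + bias;
    int t3 = 16 * src[2 * stride] +  6 * src[6 * stride];
    int t4 =  6 * src[2 * stride] - 16 * src[6 * stride];
    int t5 = t1 + t3, t6 = t2 + t4, t7 = t2 - t4, t8 = t1 - t3;

    /* Odd part: coefficients 16, 15, 9 and 4. */
    const int16_t s1 = src[1 * stride], s3 = src[3 * stride];
    const int16_t s5 = src[5 * stride], s7 = src[7 * stride];
    t1 = 16 * s1 + 15 * s3 +  9 * s5 +  4 * s7;
    t2 = 15 * s1 -  4 * s3 - 16 * s5 -  9 * s7;
    t3 =  9 * s1 - 16 * s3 +  4 * s5 + 15 * s7;
    t4 =  4 * s1 -  9 * s3 + 15 * s5 - 16 * s7;

    /* Butterfly; results are not yet shifted. */
    out[0] = t5 + t1; out[1] = t6 + t2; out[2] = t7 + t3; out[3] = t8 + t4;
    out[4] = t8 - t4; out[5] = t7 - t3; out[6] = t6 - t2; out[7] = t5 - t1;
}

/* Hypothetical 8x8 inverse transform: the same 1-D helper is called for
 * both passes; only the bias and the downshift differ. */
static void vc1_inv_trans_8x8_sketch(int16_t block[64])
{
    int16_t temp[64];
    int out[8];

    /* First pass: bias 4, downshift 3; column i of block becomes row i of temp. */
    for (int i = 0; i < 8; i++) {
        vc1_idct8_1d(block + i, 8, 4, out);
        for (int j = 0; j < 8; j++)
            temp[8 * i + j] = (int16_t)(out[j] >> 3);
    }

    /* Second pass: bias 64, downshift 7; the last four outputs get an extra +1. */
    for (int i = 0; i < 8; i++) {
        vc1_idct8_1d(temp + i, 8, 64, out);
        for (int j = 0; j < 8; j++)
            block[8 * j + i] = (int16_t)((out[j] + (j >= 4)) >> 7);
    }
}
```

For example, a block whose only non-zero coefficient is `block[0] = 64` comes out as 9 in all 64 positions. In this sketch, the bias is passed in and the shift is left to the caller, not made a parameter. The reason is that the second pass also adds 1 to half of its outputs before shifting, and that step is simpler to express at the call site.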

Another aspect I forgot to mention: we already have 32-bit arm assembly for
the idct. In some cases there's value in keeping the implementations similar,
where possible and relevant. But your implementation seems quite
straightforward and seems to get better benchmark numbers on the same cores,
so I guess it's fine to diverge and add a new from-scratch implementation
here.
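The checkasm figures quoted above are easier to read as speedups. Because checkasm reports time, lower is better, so the speedup of each NEON function over its C reference is the C timing divided by the NEON timing. The following is a minimal C sketch. The numbers are copied from the quoted benchmark list, and the names `struct bench`, `speedup` and `print_speedups` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* checkasm timings quoted above (1.5 GHz Cortex-A72); lower is faster. */
struct bench {
    const char *name;
    double c;     /* C reference */
    double neon;  /* AArch64 NEON version */
};

static const struct bench vc1_benches[] = {
    { "vc1_inv_trans_4x4",    158.2,  65.7 },
    { "vc1_inv_trans_4x4_dc",  86.5,  26.5 },
    { "vc1_inv_trans_4x8",    335.2, 106.2 },
    { "vc1_inv_trans_4x8_dc", 151.2,  25.5 },
    { "vc1_inv_trans_8x4",    365.7,  97.2 },
    { "vc1_inv_trans_8x4_dc", 139.7,  16.5 },
    { "vc1_inv_trans_8x8",    547.7, 137.0 },
    { "vc1_inv_trans_8x8_dc", 268.2,  30.5 },
};

/* Speedup of the NEON version relative to C (ratio of the two timings). */
static double speedup(const struct bench *b)
{
    assert(b->neon > 0.0);
    return b->c / b->neon;
}

/* Print one line per function, e.g. "vc1_inv_trans_8x8        4.0x". */
static void print_speedups(void)
{
    for (size_t i = 0; i < sizeof(vc1_benches) / sizeof(vc1_benches[0]); i++)
        printf("%-22s %5.1fx\n", vc1_benches[i].name, speedup(&vc1_benches[i]));
}
```

By this measure, the NEON versions of the full transforms are roughly 2.4x (4x4) to 4.0x (8x8) faster than C. The DC-only variants are roughly 3.3x to 8.8x faster.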
// Martin