[FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

Wed Mar 30 17:01:27 EEST 2022

On Wed, 30 Mar 2022, Martin Storsjö wrote:

> On Fri, 25 Mar 2022, Ben Avison wrote:
>
>> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
>> 
>> vc1dsp.vc1_inv_trans_4x4_c: 158.2
>> vc1dsp.vc1_inv_trans_4x4_neon: 65.7
>> vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
>> vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
>> vc1dsp.vc1_inv_trans_4x8_c: 335.2
>> vc1dsp.vc1_inv_trans_4x8_neon: 106.2
>> vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
>> vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
>> vc1dsp.vc1_inv_trans_8x4_c: 365.7
>> vc1dsp.vc1_inv_trans_8x4_neon: 97.2
>> vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
>> vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
>> vc1dsp.vc1_inv_trans_8x8_c: 547.7
>> vc1dsp.vc1_inv_trans_8x8_neon: 137.0
>> vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
>> vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5
>> 
>> Signed-off-by: Ben Avison <bavison at riscosopen.org>
>> ---
>> libavcodec/aarch64/vc1dsp_init_aarch64.c |  19 +
>> libavcodec/aarch64/vc1dsp_neon.S         | 678 +++++++++++++++++++++++
>> 2 files changed, 697 insertions(+)
>
> Looks generally reasonable. Is it possible to factorize out the individual 
> transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 4x4 
> functions) without too much loss? The downshift which differs between the two 
> could either be left outside of the macro, or the downshift amount could be 
> made a macro parameter.
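
To illustrate the factoring suggested in the quoted review, here is a rough sketch in C rather than NEON assembly. The names inv_trans_4pt and inv_trans_4x4_sketch are hypothetical, and the coefficients, rounding and shift amounts (3 and 7) are assumptions modelled on the usual VC-1 4-point pass, not taken from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one 4-point inverse transform pass.
 * 'step' is the distance between consecutive inputs, so the same code
 * serves rows (step 1) and columns (step = row length).
 * 'shift' is the downshift, passed in instead of being hard-coded;
 * it must be at least 1. */
static inline void inv_trans_4pt(const int16_t *src, ptrdiff_t step,
                                 int out[4], int shift)
{
    const int rnd = 1 << (shift - 1);   /* round to nearest before shifting */
    const int t1 = 17 * (src[0] + src[2 * step]) + rnd;
    const int t2 = 17 * (src[0] - src[2 * step]) + rnd;
    const int t3 = 22 * src[1 * step] + 10 * src[3 * step];
    const int t4 = 22 * src[3 * step] - 10 * src[1 * step];

    out[0] = (t1 + t3) >> shift;
    out[1] = (t2 - t4) >> shift;
    out[2] = (t2 + t4) >> shift;
    out[3] = (t1 - t3) >> shift;
}

/* Hypothetical 4x4 residual transform: the same helper is invoked for
 * both passes, and only the shift differs (assumed 3, then 7). */
static inline void inv_trans_4x4_sketch(int16_t block[16])
{
    int tmp[4];

    for (int i = 0; i < 4; i++) {               /* first pass: rows */
        inv_trans_4pt(block + 4 * i, 1, tmp, 3);
        for (int j = 0; j < 4; j++)
            block[4 * i + j] = (int16_t)tmp[j];
    }
    for (int i = 0; i < 4; i++) {               /* second pass: columns */
        inv_trans_4pt(block + i, 4, tmp, 7);
        for (int j = 0; j < 4; j++)
            block[4 * j + i] = (int16_t)tmp[j];
    }
}
```

In assembly the equivalent would be a macro taking the shift as a parameter, or a macro that stops before the shift and leaves it to the caller. Whether that costs anything in scheduling is something only benchmarks can tell.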

One more thing I forgot to mention: we already have arm assembly for 
the idct. In some cases, there's value in keeping the implementations 
similar where possible and relevant. But your implementation seems quite 
straightforward, and seems to get better benchmark numbers on the same 
cores, so I guess it's fine to diverge and add a new from-scratch 
implementation here.

// Martin