[FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

Sun Dec 1 23:01:29 EET 2019

On Sun, 1 Dec 2019, Clément Bœsch wrote:

> On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote:
> [...]
>> From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001
>> From: Sebastian Pop <spop at amazon.com>
>> Date: Sun, 17 Nov 2019 14:13:13 -0600
>> Subject: [PATCH] [aarch64] use FMA and increase vector factor to 4
>> 
>> This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
>> and bumps the vectorization factor from 2 to 4.
>> The speedup is of 34% on Graviton A1 instances based on A-72 cpus:
>> 
>> $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
>> before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
>> after:  t:0.030079 avg:0.030102 max:0.030462 min:0.030051
>> 
>> Tested with `make check` on aarch64-linux.

FWIW, I'm not certain how much this routine actually is tested by that - 
in particular, there's no checkasm test for it as far as I can see.

>> +        add                 x17, x3, x18                // srcp + filterPos[0]
>> +        add                 x18, x3, x0                 // srcp + filterPos[1]
>> +        add                 x0, x3, x2                  // srcp + filterPos[2]
>> +        add                 x2, x3, x6                  // srcp + filterPos[3]
>
>> +2:      ldr                 d4, [x17, x15]              // srcp[filterPos[0] + {0..7}]
>> +        ldr                 q5, [x16]                   // load 8x16-bit filter values, part 1
>> +        ldr                 d6, [x18, x15]              // srcp[filterPos[1] + {0..7}]
>> +        ldr                 q7, [x16, x12]              // load 8x16-bit at filter+filterSize
>
> Why not use ld1 {v4.8B} etc like it was before? The use of Dn/Qn in is
> very confusing here.

The ldr instruction, instead of ld1, allows you to to do a load (or store, 
similarly, for str instead of st1) with a constant/register offset, like 
[x17, x15] here, without incrementing the source register inbetween for 
each load (which can help with latency between individual load 
instructions, or can avoid extra instructions for incrementing the 
register inbetween).

That works for loading the first 1/2/4/8/16 bytes of a vector, but can't 
be used e.g. for loading a lane other than the first (e.g. ld1 {v4.s}[1]). 
But it does require using the b/h/s/d/q names for the registers instead of 
v.

I didn't check the changes here if they're essential for the optimization 
though.

// Martin