[FFmpeg-devel] [PATCH] swscale/aarch64: dotprod implementation of rgba32_to_Y

Sun Mar 2 00:55:55 EET 2025

On Thu, 27 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
> I was curious whether it's possible to implement this function without
> any widening, and it turns out it not only is, but it's quite
> performant at the same time!
>
> The idea is to split the 16 bit coefficients into lower and upper half,
> invoke udot for the lower half, shift by 8, and follow by udot for the
> upper half. The code is based upon existing version.

As with the other patches, this explanation and the benchmarks are worth
keeping after the patch is committed, so please move them into the
permanent part of the commit message, above the "---".
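
For readers following along, the split described above can be modelled in
scalar C. This is only a sketch under stated assumptions, not the code from
the patch: udot4() stands in for one 32-bit lane of udot, the helper names
are hypothetical, and the coefficient values in the usage note are purely
illustrative. It relies on the identity that, for c = lo + 256 * hi,
(offset + sum(c*p)) >> 8 equals ((offset + sum(lo*p)) >> 8) + sum(hi*p),
since the high-half products contribute a multiple of 256.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of udot: four byte products added to acc. */
static uint32_t udot4(uint32_t acc, const uint8_t w[4], const uint8_t p[4])
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)w[i] * p[i];
    return acc;
}

/* Reference: multiply with the full 16-bit coefficients, then drop 8 bits. */
static uint32_t dot16_ref(uint32_t offset, const uint16_t c[4],
                          const uint8_t p[4])
{
    uint32_t acc = offset;
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)c[i] * p[i];
    return acc >> 8;
}

/* Split form: low bytes first, shift right by 8, then the high bytes. */
static uint32_t dot16_split(uint32_t offset, const uint16_t c[4],
                            const uint8_t p[4])
{
    uint8_t lo[4], hi[4];
    for (int i = 0; i < 4; i++) {
        lo[i] = (uint8_t)(c[i] & 0xff);
        hi[i] = (uint8_t)(c[i] >> 8);
    }
    uint32_t acc = udot4(offset, lo, p);  /* low halves of the coefficients */
    acc >>= 8;                            /* discard the 8 bits udot #2 skips */
    return udot4(acc, hi, p);             /* high halves, already scaled */
}

/* Compare both forms over a coarse grid of pixel values (alpha fixed). */
static void check_split(uint32_t offset, const uint16_t c[4])
{
    for (int r = 0; r < 256; r += 15)
        for (int g = 0; g < 256; g += 15)
            for (int b = 0; b < 256; b += 15) {
                const uint8_t p[4] = { (uint8_t)r, (uint8_t)g, (uint8_t)b, 200 };
                assert(dot16_split(offset, c, p) == dot16_ref(offset, c, p));
            }
}
```

Calling check_split(0x80100, c) with, say, c = { 8414, 16519, 3208, 0 }
exercises the identity; 0x80100 is the const_offset built by the mov/movk
pair in the patch, and the zero fourth coefficient makes the alpha byte
drop out of the sum.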

> Benchmark on A78:
> bgra_to_y_128_c:                                       682.0 ( 1.00x)
> bgra_to_y_128_neon:                                    181.2 ( 3.76x)
> bgra_to_y_128_dotprod:                                 117.8 ( 5.79x)
> bgra_to_y_1080_c:                                     5742.5 ( 1.00x)
> bgra_to_y_1080_neon:                                  1472.5 ( 3.90x)
> bgra_to_y_1080_dotprod:                                906.5 ( 6.33x)
> bgra_to_y_1920_c:                                    10194.0 ( 1.00x)
> bgra_to_y_1920_neon:                                  2589.8 ( 3.94x)
> bgra_to_y_1920_dotprod:                               1573.8 ( 6.48x)
>
> Krzysztof
>
> libswscale/aarch64/input.S   | 88 ++++++++++++++++++++++++++++++++++++
> libswscale/aarch64/swscale.c | 17 +++++++
> 2 files changed, 105 insertions(+)
>
> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
> index 5cb18711fb..5fe6c3f6f5 100644
> --- a/libswscale/aarch64/input.S
> +++ b/libswscale/aarch64/input.S
> @@ -313,3 +313,91 @@ rgbToUV_neon bgr24, rgb24, element=3
> rgbToUV_neon bgra32, rgba32, element=4
>
> rgbToUV_neon abgr32, argb32, element=4, alpha_first=1
> +
> +#if HAVE_DOTPROD
> +ENABLE_DOTPROD
> +
> +function ff_bgra32ToY_neon_dotprod, export=1
> +        cmp             w4, #0                  // check width > 0
> +        ldp             w12, w11, [x5]          // w12: ry, w11: gy
> +        ldr             w10, [x5, #8]           // w10: by
> +        b.gt            4f
> +        ret
> +endfunc
> +
> +function ff_rgba32ToY_neon_dotprod, export=1
> +        cmp             w4, #0                  // check width > 0
> +        ldp             w10, w11, [x5]          // w10: ry, w11: gy
> +        ldr             w12, [x5, #8]           // w12: by
> +        b.le            3f
> +4:
> +        mov             w9, #256                // w9 = 1 << (RGB2YUV_SHIFT - 7)
> +        movk            w9, #8, lsl #16         // w9 += 32 << (RGB2YUV_SHIFT - 1)
> +        dup             v6.4s, w9               // w9: const_offset
> +
> +        cmp             w4, #16
> +        mov             w7, w10
> +        bfi             w7, w11, 8, 8
> +        bfi             w7, w12, 16, 8

These bfi instructions are quite esoteric; it'd probably be good to add 
some comments to explain what you do here.
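
As a rough illustration of what such comments might say, here is a scalar C
model of the two packing sequences. It is a sketch, not part of the patch:
bfi32(), pack_low_bytes() and pack_high_bytes() are hypothetical names, and
it assumes each coefficient fits in 16 bits, so the top byte of the packed
word stays zero and lines up with the alpha byte of each pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "bfi wd, wn, #lsb, #width": copy the low `width` bits of
 * src into dst starting at bit `lsb`, leaving all other bits of dst alone. */
static uint32_t bfi32(uint32_t dst, uint32_t src, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width < 32 && lsb + width <= 32);
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

/* mov w7, w10; bfi w7, w11, 8, 8; bfi w7, w12, 16, 8
 * Result bytes, least significant first: lo(c0), lo(c1), lo(c2), 0. */
static uint32_t pack_low_bytes(uint32_t c0, uint32_t c1, uint32_t c2)
{
    uint32_t w = c0;              /* bits 8..23 are overwritten below */
    w = bfi32(w, c1, 8, 8);
    w = bfi32(w, c2, 16, 8);
    return w;
}

/* lsr by 8 on each coefficient, then the same two bfi inserts.
 * Result bytes, least significant first: hi(c0), hi(c1), hi(c2), 0. */
static uint32_t pack_high_bytes(uint32_t c0, uint32_t c1, uint32_t c2)
{
    uint32_t w = c0 >> 8;
    w = bfi32(w, c1 >> 8, 8, 8);
    w = bfi32(w, c2 >> 8, 16, 8);
    return w;
}
```

With the illustrative inputs 0x20DE, 0x4087 and 0x0C88, pack_low_bytes()
gives 0x008887DE and pack_high_bytes() gives 0x000C4020. Replicating each
word across a vector with dup then places one coefficient byte opposite
each channel byte of every 4-byte pixel.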

> +        dup             v0.4s, w7
> +
> +        lsr             w6, w10, #8
> +        lsr             w7, w11, #8
> +        lsr             w8, w12, #8
> +
> +        bfi             w6, w7, 8, 8
> +        bfi             w6, w8, 16, 8
> +        dup             v1.4s, w6
> +        b.lt            2f
> +1:
> +        ld1             { v16.16b, v17.16b, v18.16b, v19.16b }, [x1], #64
> +        sub             w4, w4, #16             // width -= 16
> +        cmp             w4, #16                 // width >= 16 ?

The cmp could be moved, e.g. below the mov.

Other than that, this patch looks really good to me, thanks!

And while swscale is being rewritten elsewhere, adding this function 
shouldn't make the transition to a rewrite any harder, so I don't see any 
problem with adding this in the meantime.

// Martin