[FFmpeg-devel] [PATCH 4/4] swscale/aarch64: add neon {lum, chr}ConvertRange
Martin Storsjö
martin at martin.st
Mon Jun 10 14:56:00 EEST 2024
On Fri, 7 Jun 2024, Ramiro Polla wrote:
> chrRangeFromJpeg_8_c: 28.5
> chrRangeFromJpeg_8_neon: 21.2
> chrRangeFromJpeg_24_c: 81.2
> chrRangeFromJpeg_24_neon: 34.7
> chrRangeFromJpeg_128_c: 425.2
> chrRangeFromJpeg_128_neon: 162.0
> chrRangeFromJpeg_144_c: 480.2
> chrRangeFromJpeg_144_neon: 180.2
> chrRangeFromJpeg_256_c: 838.2
> chrRangeFromJpeg_256_neon: 318.0
> chrRangeFromJpeg_512_c: 1698.2
> chrRangeFromJpeg_512_neon: 630.0
> chrRangeToJpeg_8_c: 56.0
> chrRangeToJpeg_8_neon: 23.5
> chrRangeToJpeg_24_c: 147.7
> chrRangeToJpeg_24_neon: 38.2
> chrRangeToJpeg_128_c: 760.2
> chrRangeToJpeg_128_neon: 182.5
> chrRangeToJpeg_144_c: 857.7
> chrRangeToJpeg_144_neon: 204.5
> chrRangeToJpeg_256_c: 1504.2
> chrRangeToJpeg_256_neon: 358.5
> chrRangeToJpeg_512_c: 3025.7
> chrRangeToJpeg_512_neon: 710.5
> lumRangeFromJpeg_8_c: 24.0
> lumRangeFromJpeg_8_neon: 18.2
> lumRangeFromJpeg_24_c: 64.0
> lumRangeFromJpeg_24_neon: 22.2
> lumRangeFromJpeg_128_c: 289.2
> lumRangeFromJpeg_128_neon: 79.2
> lumRangeFromJpeg_144_c: 334.7
> lumRangeFromJpeg_144_neon: 87.7
> lumRangeFromJpeg_256_c: 579.5
> lumRangeFromJpeg_256_neon: 152.0
> lumRangeFromJpeg_512_c: 1208.0
> lumRangeFromJpeg_512_neon: 299.0
> lumRangeToJpeg_8_c: 30.0
> lumRangeToJpeg_8_neon: 19.0
> lumRangeToJpeg_24_c: 82.2
> lumRangeToJpeg_24_neon: 24.0
> lumRangeToJpeg_128_c: 440.7
> lumRangeToJpeg_128_neon: 90.5
> lumRangeToJpeg_144_c: 502.0
> lumRangeToJpeg_144_neon: 102.2
> lumRangeToJpeg_256_c: 893.7
> lumRangeToJpeg_256_neon: 178.0
> lumRangeToJpeg_512_c: 1793.7
> lumRangeToJpeg_512_neon: 355.0
> ---
> libswscale/aarch64/Makefile | 1 +
> libswscale/aarch64/range_convert_neon.S | 103 ++++++++++++++++++++++++
> libswscale/aarch64/swscale.c | 21 +++++
> libswscale/swscale_internal.h | 1 +
> libswscale/utils.c | 4 +-
> 5 files changed, 129 insertions(+), 1 deletion(-)
> create mode 100644 libswscale/aarch64/range_convert_neon.S
>
> diff --git a/libswscale/aarch64/Makefile b/libswscale/aarch64/Makefile
> index da1d909561..6923827f82 100644
> --- a/libswscale/aarch64/Makefile
> +++ b/libswscale/aarch64/Makefile
> @@ -4,5 +4,6 @@ OBJS += aarch64/rgb2rgb.o \
>
> NEON-OBJS += aarch64/hscale.o \
> aarch64/output.o \
> + aarch64/range_convert_neon.o \
> aarch64/rgb2rgb_neon.o \
> aarch64/yuv2rgb_neon.o \
> diff --git a/libswscale/aarch64/range_convert_neon.S b/libswscale/aarch64/range_convert_neon.S
> new file mode 100644
> index 0000000000..5e104971f0
> --- /dev/null
> +++ b/libswscale/aarch64/range_convert_neon.S
> @@ -0,0 +1,103 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +.macro lumConvertRange name max mult offset shift
We usually use commas between the macro arguments here. Apparently it
doesn't make any difference for any of the tools we support, but it would
be nice for consistency. (When invoking macros, commas between arguments
are optional for most platforms, but not when targeting Apple platforms,
so being strict with consistent use of commas is generally good.)
> +const offset_\name, align=4
> + .word \offset, \offset, \offset, \offset
> +endconst
> +function ff_\name, export=1
> +.if \max != 0
> + mov w3, #\max
> + dup v24.8h, w3
> +.endif
> + mov w3, #\mult
> + dup v25.4s, w3
> + movrel x3, offset_\name
> + ld1 {v26.4s}, [x3]
FWIW, I did see that you were recommended this form, over ld1r, based on
some microarchitectural performance numbers. However in our preexisting
assembly, manually pre-splatting vectors like this is unusual I would say.
I don't have a strong opinion on the matter though.
Anyway, the assembly looks reasonable to me.
// Martin
More information about the ffmpeg-devel
mailing list