[FFmpeg-devel] [RFC] New swscale internal design prototype
Niklas Haas
ffmpeg at haasn.xyz
Wed Mar 12 01:56:51 EET 2025
On Sun, 09 Mar 2025 20:45:23 +0100 Niklas Haas <ffmpeg at haasn.xyz> wrote:
> On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <martin at martin.st> wrote:
> > On Sat, 8 Mar 2025, Niklas Haas wrote:
> >
> > > What are the thoughts on the float-first approach?
> >
> > In general, for modern architectures, relying on floats probably is
> > reasonable. (On architectures that aren't of quite as widespread interest,
> > it might not be so clear cut though.)
> >
> > However with the benchmark example you provided a couple of weeks ago, we
> > concluded that even on x86 on modern HW, floats were faster than int16
> > only in one case: When using Clang, not GCC, and when compiling with
> > -mavx2, not without it. In all the other cases, int16 was faster than
> > float.
>
> Hi Martin,
>
> I should preface that this particular benchmark was a very specific test for
> floating point *filtering*, which is considerably more punishing than the
> conversion pipeline I have implemented here, and I think it's partly the
> fault of compilers generating very suboptimal filtering code.
>
> I think it would be better to re-assess using the current prototype on actual
> hardware. I put up a quick NEON test branch (untested, should hopefully work):
> https://github.com/haasn/FFmpeg/commits/swscale3-neon
>
> # adjust the benchmark iters count as needed based on the HW perf
> make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50
>
> If this differs significantly from the ~1.8x speedup I measure on x86, I
> will be far more concerned about the new approach.
I gave it a try. The result of a naive, blind run on a Cortex-X1 using clang
version 20.0.0 (from the latest Android NDK v29) is:

  Overall speedup=1.688x faster, min=0.141x max=45.898x

This has considerably more significant speed regressions than x86, though.
In particular, clang/LLVM refuses to vectorize packed reads of 2 or 3 elements,
so any operation involving rgb24 or bgr24 suffers horribly:
Conversion pass for rgb24 -> rgba:
[ u8 XXXX -> +++X] SWS_OP_READ : 3 elem(s) packed >> 0
[ u8 ...X -> ++++] SWS_OP_CLEAR : {_ _ _ 255}
[ u8 .... -> XXXX] SWS_OP_WRITE : 4 elem(s) packed >> 0
(X = unused, + = exact, 0 = zero)
rgb24 1920x1080 -> rgba 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
time=2856 us, ref=387 us, speedup=0.136x slower
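For illustration, the op chain above boils down to the loop below. This is a
minimal scalar sketch and not code from the prototype; the function name
rgb24_to_rgba_scalar is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the swscale prototype: converts
 * `pixels` packed rgb24 pixels into packed rgba with opaque alpha. */
static void rgb24_to_rgba_scalar(uint8_t *dst, const uint8_t *src, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        /* SWS_OP_READ: 3 elem(s) packed (stride-3 access) */
        uint8_t r = src[3 * i + 0];
        uint8_t g = src[3 * i + 1];
        uint8_t b = src[3 * i + 2];

        /* SWS_OP_CLEAR {_ _ _ 255}, then SWS_OP_WRITE: 4 elem(s) packed */
        dst[4 * i + 0] = r;
        dst[4 * i + 1] = g;
        dst[4 * i + 2] = b;
        dst[4 * i + 3] = 255;
    }
}
```

The stride-3 access in the read step is the part that would have to be
deinterleaved in vector code.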
Another thing LLVM seemingly does not optimize at all is integer shifts; they
also end up as horribly inefficient scalar code:
Conversion pass for yuv444p -> yuv444p16le:
[ u8 XXXX -> +++X] SWS_OP_READ : 3 elem(s) planar >> 0
[ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> u16
[u16 ...X -> +++X] SWS_OP_LSHIFT : << 8
[u16 ...X -> XXXX] SWS_OP_WRITE : 3 elem(s) planar >> 0
(X = unused, + = exact, 0 = zero)
yuv444p 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
time=1564 us, ref=590 us, speedup=0.377x slower
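Per plane, the convert and shift steps of this pass amount to something like
the following. Again, this is only a scalar sketch with a hypothetical
function name, assuming a little-endian host so that a native uint16_t store
matches the 16le layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widens one 8-bit plane to 16 bits.
 * Assumes a little-endian host, so native stores match yuv444p16le. */
static void plane_u8_to_u16_lshift8(uint16_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t v = src[i];          /* SWS_OP_CONVERT: u8 -> u16 */
        dst[i] = (uint16_t)(v << 8);  /* SWS_OP_LSHIFT: << 8 */
    }
}
```

The same function would be called once for each of the three planes.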
On the other hand, float performance does not seem to be an issue here:
Conversion pass for rgba -> yuv444p:
[ u8 XXXX -> +++X] SWS_OP_READ : 4 elem(s) packed >> 0
[ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> f32
[f32 ...X -> ...X] SWS_OP_LINEAR : matrix3+off3 [[0.256788 0.504129 0.097906 0 16] [-0.148223 -0.290993 112/255 0 128] [112/255 -0.367788 -0.071427 0 128] [0 0 0 1 0]]
[f32 ...X -> ...X] SWS_OP_DITHER : 16x16 matrix
[f32 ...X -> ...X] SWS_OP_CLAMP : 0 <= x <= {255 255 255 _}
[f32 ...X -> +++X] SWS_OP_CONVERT : f32 -> u8
[ u8 ...X -> XXXX] SWS_OP_WRITE : 3 elem(s) planar >> 0
(X = unused, + = exact, 0 = zero)
rgba 1920x1080 -> yuv444p 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
time=4074 us, ref=6987 us, speedup=1.715x faster
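As a rough scalar picture of what this float pass computes, using the matrix
coefficients from the log above: the sketch below is not the prototype's code,
the function names are hypothetical, and the 16x16 dither matrix is replaced
by a constant 0.5 rounding offset, which is an assumption made purely to keep
the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* SWS_OP_CLAMP (0 <= x <= 255) followed by SWS_OP_CONVERT f32 -> u8 */
static uint8_t clamp_to_u8(float x)
{
    if (x < 0.0f)
        x = 0.0f;
    if (x > 255.0f)
        x = 255.0f;
    return (uint8_t)x; /* truncating conversion */
}

/* Hypothetical helper: packed rgba in, three separate planes out. */
static void rgba_to_yuv444p_float(uint8_t *y, uint8_t *u, uint8_t *v,
                                  const uint8_t *src, size_t pixels)
{
    const float c = 112.0f / 255.0f; /* the 112/255 entries in the matrix */

    for (size_t i = 0; i < pixels; i++) {
        /* SWS_OP_READ + SWS_OP_CONVERT u8 -> f32; alpha is unused */
        float r = src[4 * i + 0];
        float g = src[4 * i + 1];
        float b = src[4 * i + 2];

        /* SWS_OP_LINEAR: matrix3+off3 */
        float fy =  0.256788f * r + 0.504129f * g + 0.097906f * b + 16.0f;
        float fu = -0.148223f * r - 0.290993f * g + c * b + 128.0f;
        float fv =  c * r - 0.367788f * g - 0.071427f * b + 128.0f;

        /* Stand-in for SWS_OP_DITHER: constant 0.5 offset (assumption) */
        y[i] = clamp_to_u8(fy + 0.5f);
        u[i] = clamp_to_u8(fu + 0.5f);
        v[i] = clamp_to_u8(fv + 0.5f);
    }
}
```

Each output sample is a handful of multiply-adds on independent lanes, which
is the shape of code the compiler handled well in the result above.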
So in summary, from what I gather, on all platforms I have tested so far, the
two most important ASM routines to focus on are:
1. packed reads/writes
2. integer shifts
Compilers seem to have a very hard time generating good code for these. On the
other hand, simple floating point FMAs and planar reads/writes are handled
quite well as is.
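To give a rough idea of what a hand-written routine for the packed case could
look like, here is a sketch using NEON intrinsics with a scalar fallback. It
is an illustration only: the function name is hypothetical, the code is not
taken from the swscale3-neon branch, and it has not been benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: packed rgb24 -> packed rgba with opaque alpha. */
static void rgb24_to_rgba_simd(uint8_t *dst, const uint8_t *src, size_t pixels)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    const uint8x16_t alpha = vdupq_n_u8(255);

    /* 16 pixels per iteration: vld3q_u8 deinterleaves R, G and B into
     * separate registers, vst4q_u8 interleaves four registers on store. */
    for (; i + 16 <= pixels; i += 16) {
        uint8x16x3_t rgb = vld3q_u8(src + 3 * i);
        uint8x16x4_t rgba;

        rgba.val[0] = rgb.val[0];
        rgba.val[1] = rgb.val[1];
        rgba.val[2] = rgb.val[2];
        rgba.val[3] = alpha;
        vst4q_u8(dst + 4 * i, rgba);
    }
#endif

    /* Scalar tail, and the full fallback when NEON is not available */
    for (; i < pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 255;
    }
}
```

Whether intrinsics like these or actual assembly would be the better fit is
left open here.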
>
> > After doing those benchmarks, my understanding was that you concluded that
> > we probably need to keep int16 based codepaths still, then.
>
> This may have been a misunderstanding. While I think we should keep the option
> of using fixed point precision *open*, the main take-away for me was that we
> will definitely need to transition to custom SIMD, since we cannot rely on the
> compiler to generate good code for us.
>
> > Did something fundamental come up since we did these benchmarks that
> > changed your conclusion?
> >
> > // Martin