[FFmpeg-devel] [PATCH] flac/x86: add ff_flac_lpc_32_sse4()

Sat Feb 1 13:24:28 CET 2014

On Sat, 1 Feb 2014, James Almer wrote:
> On 01/02/14 1:38 AM, James Almer wrote:
> > x64
> > 1261661 decicycles in flac_lpc_32_c, 32768 runs
> > 1045689 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> >
> > 1431506 decicycles in flac_lpc_32_c, 32768 runs
> > 1209322 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> >
> > x86
> > 1429597 decicycles in flac_lpc_32_c, 32768 runs
> > 953667 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> >
> > 1610348 decicycles in flac_lpc_32_c, 32768 runs
> > 1079424 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> >
> > About 100 to 500 ms faster decoding using -threads 1 depending on song and arch.
> > Tested using a few 24-bit samples on an AMD FX 6300, Win7 x64 and x86.
> > Biggest speedup appears to be on x86 builds.
> >
> > Signed-off-by: James Almer <jamrial at gmail.com>
> > ---
> >  libavcodec/flacdsp.c          |  2 ++
> >  libavcodec/flacdsp.h          |  1 +
> >  libavcodec/x86/Makefile       |  2 ++
> >  libavcodec/x86/flacdsp.asm    | 61 +++++++++++++++++++++++++++++++++++++++++++
> >  libavcodec/x86/flacdsp_init.c | 39 +++++++++++++++++++++++++++
> >  5 files changed, 105 insertions(+)
> >  create mode 100644 libavcodec/x86/flacdsp.asm
> >  create mode 100644 libavcodec/x86/flacdsp_init.c
> >
>
> Couldn't test with Valgrind, or on a Linux box for that matter.
> I have access to this FX 6300 for the time being so I used it to write this, but can't
> install a VM.
>
> I originally wrote this doing two calculations per packed instruction (using all 128
> bits of the xmm registers instead of 64), but after punpckldq-ing and pshufd-ing values
> around and adding extra checks for odd pred_order values, it somehow ended up slower
> than the pure C implementation.
> This will do until I get that other version working faster. If I can, of course.

Did you try applying the optimization from flac_lpc_16_c to flac_lpc_32_c?

A SIMD implementation shouldn't need any shuffles: just leave the samples
in their natural order in the xmm registers and let a single pmuldq apply to
nonadjacent samples. You also shouldn't need any check on the parity of
pred_order if you zero-pad coefs[].
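
To illustrate the zero-padding point, here is a minimal scalar sketch in C. It
is not the patch and not SIMD code. The names lpc32_ref and lpc32_padded are
hypothetical, and lpc32_ref is only assumed to have the same shape as the C
loop being benchmarked. The padded variant always consumes taps in pairs; when
pred_order is odd, the extra tap is a zero coefficient multiplied by the sample
about to be written, so it contributes nothing and no parity branch is needed.
The two accumulators loosely stand in for independent 64-bit products; the
actual lane layout of pmuldq is not modeled.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ORDER 32  /* assumed upper bound on pred_order */

/* Straightforward scalar loop: one 64-bit sum per output sample. */
static void lpc32_ref(int32_t *decoded, const int32_t *coeffs,
                      int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * decoded[j];
        decoded[pred_order] += (int32_t)(sum >> qlevel);
    }
}

/* Same result, but the taps are consumed two at a time with no
 * special case for odd pred_order: the coefficients are copied into
 * a zero-filled array, so the padding tap multiplies by zero. */
static void lpc32_padded(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    int32_t padded[MAX_ORDER + 1] = { 0 };
    int padded_order = (pred_order + 1) & ~1;  /* round up to even */

    memcpy(padded, coeffs, pred_order * sizeof(*padded));

    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t even = 0, odd = 0;
        for (int j = 0; j < padded_order; j += 2) {
            even += (int64_t)padded[j]     * decoded[j];
            /* For odd pred_order the last odd tap reads
             * decoded[pred_order] (the current output slot, still
             * in bounds) and multiplies it by the zero padding. */
            odd  += (int64_t)padded[j + 1] * decoded[j + 1];
        }
        decoded[pred_order] += (int32_t)((even + odd) >> qlevel);
    }
}
```

Both functions update decoded[] in place and should produce identical output
for any pred_order up to MAX_ORDER. A real SIMD version would load wider
groups and might then need readable padding past the end of the sample buffer,
which this pairwise sketch avoids.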

--Loren Merritt