[FFmpeg-devel] [WIP][PATCH]v4 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Ivan Kalvachev ikalvachev at gmail.com
Sun Jul 9 03:49:22 EEST 2017

This should be the final work-in-progress patch.

What's changed:

1. Removed the macros' conditional defines. The defaults seem to be
optimal on all machines that I got benchmarks from. HADDPS and PHADDD
are always slower, and the "BLEND"s are never slower than the emulation.

2. Remove SHORT_SYY_UPDATE. It is always slower.

3. Remove ALL_FLOAT_PRESEARCH, it is always slower. Remove the ugly
hack to use 256bit ymm with avx1 and integer operations.

4. Remove remaining disabled code.

5. Use the HADDD macro from "x86util"; I don't need the result in all lanes/elements.

6. Use v-prefix for all avx code.

7. Small optimization: Move some of the HSUMPS in the K!=0 branch.

8. Small optimization: Instead of pre-calculating 2*Y[i] and then
correcting it on exit, it is possible to use Syy/2 in the
distortion parameter calculations. It saves a few multiplications in
the pre-search and the sign restore loop. It does, however, give a
different approximation of sqrt(). It's not (consistently) better or
worse than the previous approximation.

9. Use movdqa to load "const_int32_offsets". The wrong load type might
explain why using mem constants directly is sometimes faster.

10. Move some code around and do minor tweaks.
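
To illustrate item 8, here is a minimal scalar C sketch, not the code from the patch. It assumes the usual greedy pulse-placement criterion, maximizing (Sxy + X[i])^2 / (Syy + 2*Y[i] + 1), and shows that keeping Syy/2 and the plain Y[i] only scales every score by 2, so the chosen position is the same. The function names `pick_pulse_full` and `pick_pulse_half` are hypothetical, and the sketch uses a real division, so it says nothing about the sqrt() approximation difference mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: older scheme. y2[] holds 2*Y[i], syy holds Syy,
 * so the denominator is Syy + 2*Y[i] + 1 with no multiply in the loop. */
static size_t pick_pulse_full(const float *x, const float *y2, size_t n,
                              float sxy, float syy)
{
    assert(n > 0);
    size_t best = 0;
    float best_score = -1.0f;
    for (size_t i = 0; i < n; i++) {
        float num = (sxy + x[i]) * (sxy + x[i]);
        float den = syy + y2[i] + 1.0f;
        float score = num / den;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}

/* Hypothetical helper: newer scheme. y[] holds the plain Y[i] and
 * syy_half holds Syy/2, so the denominator is exactly half of the one
 * above and every score is doubled; the argmax does not change. */
static size_t pick_pulse_half(const float *x, const float *y, size_t n,
                              float sxy, float syy_half)
{
    assert(n > 0);
    size_t best = 0;
    float best_score = -1.0f;
    for (size_t i = 0; i < n; i++) {
        float num = (sxy + x[i]) * (sxy + x[i]);
        float den = syy_half + y[i] + 0.5f;
        float score = num / den;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

With the second form, Y[i] never has to be doubled on entry or halved on exit, which is where the saved multiplications would come from.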

I do not intend to remove "PRESEARCH_ROUNDING" and
"USE_APPROXIMATION" (although for the latter I think I will remove
method#1; I've left it in this time just for consistency).
These defines control the precision and the type of results that the
function generates.
E.g. This function can produce same results as opus functions with
If you want the same results as the ffmpeg improved function, then you
need "approx#0". It uses real division and is much slower on older
CPUs, but reasonably fast on anything recent.

I've left 2 other defines. "CONST_IN_X64_REG_IS_FASTER" and
On Sandy Bridge and later, "const_in_x64" has always been faster. On
my cpu it is about the same.
On Ryzen, "const_in_x64" was consistently faster in all sse/avx
variants, by about 5%, but not if "stall_write" is enabled too.
Ryzen (allegedly) has no write stalling, but that method alone is just
a few cycles faster (about 0.5%).

I'd like to see whether the changes I've made this time affect the
above results.

The code is much cleaner and you are free to nitpick on it.

There is one thing that I'm not sure I need.
The function gets 2 integer parameters, and I am not sure
if I have to sign extend them in 64-bit mode in order to clear
the high 32 bits. These parameters should never be negative, so the
sign is not needed.
Any instruction that writes a 32-bit register should also clear the
high bits.
Still, I'm not sure if there is a guarantee that
the high bits won't contain garbage.
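
The question can be modelled in plain C. This is only a sketch under a worst-case assumption, namely that the caller may leave arbitrary bits in the upper half of the 64-bit register that carries a 32-bit argument. All three helper names are hypothetical. It shows that for non-negative values a sign-extending move and a zero-extending 32-bit move give the same clean 64-bit value, while using the full register directly does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit register: the low half carries the
 * 32-bit argument, the high half carries leftover bits (assumption). */
static uint64_t reg_with_garbage(int32_t arg, uint32_t garbage)
{
    return ((uint64_t)garbage << 32) | (uint32_t)arg;
}

/* Models a sign-extending move: only the low 32 bits are read and the
 * sign bit is copied into the upper half. The conversion to int32_t is
 * implementation-defined for values above INT32_MAX; the usage here
 * only relies on non-negative arguments. */
static int64_t sign_extend32(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}

/* Models a 32-bit register write, which zeroes the upper half. */
static uint64_t zero_extend32(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg;
}
```

For example, `reg_with_garbage(7, 0xDEADBEEFu)` is not equal to 7 when read as a full 64-bit value, but both `sign_extend32` and `zero_extend32` recover 7 from it. If the assumption holds, an explicit extension is needed before the parameter is used in 64-bit address arithmetic.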

Best Regards
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-SIMD-opus-pvq_search-implementation-v4.patch
Type: text/x-patch
Size: 22633 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20170709/a36c6985/attachment.bin>
