[FFmpeg-devel] [WIP][PATCH]v4 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Rostislav Pehlivanov atomnuker at gmail.com
Tue Jul 18 06:08:19 EEST 2017


On 9 July 2017 at 01:49, Ivan Kalvachev <ikalvachev at gmail.com> wrote:

> This should be the final work-in-progress patch.
>
> What's changed:
>
> 1. Removed the conditional defines for the macros. The defaults seem
> to be optimal on all machines that I got benchmarks from. HADDPS and
> PHADDD are always slower, "BLEND"s are never slower than the emulation.
>
> 2. Remove SHORT_SYY_UPDATE. It is always slower.
>
> 3. Remove ALL_FLOAT_PRESEARCH, it is always slower. Remove the ugly
> hack to use 256bit ymm with avx1 and integer operations.
>
> 4. Remove remaining disabled code.
>
> 5. Use the HADDD macro from "x86util"; I don't need the result in all
> lanes/elements.
>
> 6. Use v-prefix for all avx code.
>
> 7. Small optimization: Move some of the HSUMPS in the K!=0 branch.
>
> 8. Small optimization: Instead of pre-calculating 2*Y[i] and then
> correcting it on exit, it is possible to use Syy/2 in the
> distortion parameter calculations. It saves a few multiplications in
> pre-search and the sign restore loop. It does, however, give a
> different approximation of sqrt(). It's not (consistently) better or
> worse than the previous approximation.
>
> 9. Use movdqa to load "const_int32_offsets". The wrong load type might
> explain why directly using mem constants is sometimes faster.
>
> 10. Move some code around and do minor tweaks.
> ---
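
Regarding point 8: as far as I can tell, tracking Syy/2 is safe for the
search itself, because scaling every candidate's denominator by the same
constant does not change which position wins the Sxy^2/Syy comparison.
A minimal C sketch of that reasoning follows. It is not the asm from the
patch: best_pulse_pos(), its parameters and the half_syy flag are names
made up for illustration, and X is assumed to already hold absolute
values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the position where adding one pulse
 * maximises Sxy'^2 / Syy'.
 * half_syy == 0: denominator is Syy   + 2*Y[i] + 1
 * half_syy != 0: denominator is Syy/2 +   Y[i] + 0.5 (same thing halved) */
static int best_pulse_pos(const float *X, const float *Y, size_t n,
                          float Sxy, float Syy, int half_syy)
{
    int best = 0;
    float best_num = 0.0f, best_den = 1.0f;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        float sxy = Sxy + X[i];
        float den = half_syy ? (0.5f * Syy + Y[i] + 0.5f)
                             : (Syy + 2.0f * Y[i] + 1.0f);
        float num = sxy * sxy;

        /* num/den > best_num/best_den, cross-multiplied to avoid division */
        if (i == 0 || num * best_den > best_num * den) {
            best     = (int)i;
            best_num = num;
            best_den = den;
        }
    }
    return best;
}
```

Both variants rank the candidates identically; only the absolute value
of the denominator differs, which is where the changed sqrt()
approximation mentioned above would come from.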
>
> I do not intend to remove "PRESEARCH_ROUNDING" and
> "USE_APPROXIMATION" (though for the latter I think I will remove
> method#1; I've left it in this time just for consistency).
> These defines control the precision and the type of results that the
> function generates.
> E.g. this function can produce the same results as the opus functions
> with "PRESEARCH_ROUNDING 0".
> If you want the same results as the ffmpeg improved function, then you
> need "approx#0". It uses real division and is much slower on older
> CPUs, but reasonably fast on anything recent.
>
> I've left 2 other defines, "CONST_IN_X64_REG_IS_FASTER" and
> "STALL_WRITE_FORWARDING".
> On Sandy Bridge and later, "const_in_x64" has always been faster. On
> my cpu it is about the same.
> On Ryzen "const_in_x64" was consistently faster in all sse/avx
> variants, by about 5%. But not if "stall_write" is enabled too.
> Ryzen (allegedly) has no write stalling, but that method alone is just
> a few cycles faster (about 0.5%).
>
> I'd like to see if the changes I've made this time affect the
> above results.
>
>
> The code is much cleaner and you are free to nitpick on it.
>
> There is one thing that I'm not sure I need.
> The function gets 2 integer parameters, and I am not sure
> if I have to sign extend them in 64-bit mode, in order to clear
> the high 32 bits. These parameters should never be negative, so the
> sign is not needed.
> All 32-bit operations should also clear the high bits.
> Still, I'm not sure if there is a guarantee that
> the high bits won't contain garbage.
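
On the high bits: as far as I know, the x86-64 calling conventions do
not promise that the upper 32 bits of a register holding a 32-bit int
argument are clean, so they should be treated as garbage until the
value has been extended. x86inc has movsxdifnidn for this case. A small
C model of the situation, where a uint64_t stands in for the register
and every function name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Model a 64-bit register carrying a 32-bit int argument. The upper half
 * is deliberately filled with junk, on the assumption that the caller is
 * not required to clear it. */
static uint64_t reg_with_garbage(int32_t arg)
{
    return 0xDEADBEEF00000000ULL | (uint32_t)arg;
}

/* Effect of writing a 32-bit register in 64-bit mode: the upper 32 bits
 * of the full register become zero. */
static uint64_t zero_extend32(uint64_t reg)
{
    return (uint32_t)reg;
}

/* Effect of movsxd: the low 32 bits are sign-extended to 64 bits. */
static uint64_t sign_extend32(uint64_t reg)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)reg;
}

/* Convenience check: does the raw register equal the intended value? */
static int raw_reg_is_clean(int32_t arg)
{
    uint64_t reg = reg_with_garbage(arg);
    assert(zero_extend32(reg) == (uint32_t)arg);
    return reg == (uint64_t)(uint32_t)arg;
}
```

For non-negative arguments both extensions give the same 64-bit value,
so either one makes the register safe to use as a pointer offset or
loop counter; using the raw register does not.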
>
>
> Best Regards
>
No detectable regression from v3.

Whitespace error though:
.git/rebase-apply/patch:154: trailing whitespace.
; Horizontal Sum Packed Single precision float
warning: 1 line adds whitespace errors.

