[FFmpeg-devel] [PATCH]v5 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Rostislav Pehlivanov atomnuker at gmail.com
Sun Jul 23 03:47:45 EEST 2017

On 22 July 2017 at 12:18, Ivan Kalvachev <ikalvachev at gmail.com> wrote:

> This patch is ready for review and inclusion.
>+%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>+%macro lea_pic_base 2; reg, const_label
Capitalize macro names.

>+      %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
Remove this error

>+    %error "blendvps" emulation needs SSE
Definitely remove this error too.

>+        %error AVX1 does not support integer 256bit ymm operations
And this one is particularly pointless...

>+      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3

>+    %error "broadcastss" emulation needs SSE

>+    %error "pbroadcastd" emulation needs SSE2
Yep, the same...

>+    %error pick HSUMPS implementation

All of these are pointless and unnecessary. Always assume at least SSE2.

>+; 256bit version seems slower than the 128bit AVX on all current CPU's
>+; So disable it for now and keep the code in case something changes in
>+%if HAVE_AVX2_EXTERNAL > 0 && 0
>+INIT_YMM avx2

Remove this altogether.

>+%if 1
>+    emu_pbroadcastd m1,   xmm1
>+        ; Ryzen is the only CPU at the moment that doesn't have
>+        ; write forwarding stall and where this code is faster
>+        ; with about 7 cycles on avarage.
>+        %{1}ps      m5,   mm_const_float_1
>+        movss       [tmpY + r4q], xmm5      ; Y[max_idx] +-= 1.0;

Remove the %else and always use the first bit of code.

>+%if cpuflag(avx2) && 01
>+%elif cpuflag(avx) && 01
>+%if cpuflag(avx2) && 01

Remove the && 01 in these 3 places.

>+;      vperm2f128   %1,   %1,   %1,   0 ; ymm, ymm, ymm     ; 2-3c 1/1

Remove this commented out code.

>+cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>+%if ARCH_X86_64 == 0    ;sbrdsp

You're using more than 11 xmm regs, so the entire code is always going to
need a 64 bit system.
So wrap the entire file, from top to bottom after the %include in
%if ARCH_X86_64

>+const_float_abs_mask:   times 8 dd 0x7fffffff
>+const_align_abs_edge:   times 8 dd 0
>+const_float_0_5:        times 8 dd 0.5
>+const_float_1:          times 8 dd 1.0
>+const_float_sign_mask:  times 8 dd 0x80000000
>+                        %rep 8
>+                                dd $-const_int32_offsets
>+                        %endrep
>+SECTION .text

This whole thing needs to go right at the very top after the %include. All
macros must then follow it.
Having read only sections in the middle of the file is very confusing and
not at all what all of our asm code follows.

More information about the ffmpeg-devel mailing list