[FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Henrik Gramner henrik at gramner.com
Fri Aug 4 23:20:38 EEST 2017


On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev <ikalvachev at gmail.com> wrote:
>> 1234_____________1234_1234_123
>>     VBROADCASTSS ym1, xm1
>>     BLENDVPS      m1, m2, m3
>>
>> is the most commonly used alignment.
>
> I see that a lot of .asm files use different alignments.
> I'll try to pick something similar that I find good looking.
>
> I also see that different function/macro can have
> different position for the first comma.
> This could be quite useful, to visually separate
> the macros.

Things like indentation can be a bit inconsistent at times, yes. Cody
by different authors written over the span of many year etc.

It's normal to diverge from "standard alignment" in cases when macro
names (or memory addresses) are fairly long. Otherwise the amount of
whitespace between instructions and operands would easily get too
large to be practical.

>> Now that I think about it I believe that part wasn't merged because
>> smartalign is broken on some older nasm versions (which is another
>> reason for avoiding things like this in individual files) and FFmpeg
>> doesn't enforce any specific version requirements. I guess it could be
>> worked around with some version checking ifdeffery if we know which
>> versions are affected.
>
> I don't see anything relevant in here:
> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
> I also didn't notice anything similar in the changelog.

http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c
https://bugzilla.nasm.us/show_bug.cgi?id=3392319

> I had it mixed before, but I changed it to be consistent.
> Some new avx instruction are not overloaded or
> available without their "v-" prefix.
> e.g. x86util uses vpbroadcastw in SPLATW.

Every instruction that has both a legacy encoding and a VEX-encoding
will be automatically converted to VEX in AVX functions when written
without the v-prefix. There is no legacy encoding for newer
instructions so there's no reason for x86inc to overload those.

> Isn't one of the selling points for x86inc that
> it would warn if unsupported instructions are used?

Not really, no. It's done in a small subset of cases as a part of the
aforementioned auto legacy/VEX conversion because it was trivial to
add that functionality, but correctly tracking every single x86
instruction in existence would be a whole different thing.

Some instructions requires different instruction sets not just
depending on the vector width but the operand types as well (for
example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires
sse2, from xmm to memory requires sse4.1) and there's no generic way
to handle every special corner case which would make things fairly
complex. I'm guessing doing so would also cause a significant
reduction in compilation speed which I'm sure some people would
dislike.

>> Add your improvements to the existing x86util macro (can be done in
>> the same patch) and then use that.
>
> FFmpeg guidelines request that it is a separate patch in separate mail.

Doesn't really seem to be followed historically by looking at the git
history, but doing things by the book is never wrong so making it a
separate commit might be the correct choice.

> Just renaming it is not enough, because SPLATD
> works differently (it can broadcast second, third or forth element too)
> and it is already used in existing code.
> Maybe PBROADCASTD macro could use SPLATD internally.

SPLATD is hardcoded to broadcast the lowest element.

http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782

Ideally the SSE branch of SPLATD should be also removed and any float
code using it should be changed to use the BROADCASTSS macro instead,
but that's perhaps going a bit off-topic.

> "cmpps" is not scalar, but it is handled by x86inc macros and converted
> to "vcmpps", so no SSE penalties are possible with it.
>
> Read again what I wrote previously.

'cmpps' is packed, not scalar (hence the 'p' instead of 's') so it
obviously shouldn't be used in scalar code. I believe you misread my
initial comment which said to use 'cmpss' instead of 'cmpless'.

>> One could of course argue whether or not x86inc should deal with
>> psuedo-instructions but right now it doesn't so it will cause AVX
>> state transitions if INIT_YMM is used since it wont be VEX-encoded.
>
> Let's say that using a number for selecting comparison mode
> is NOT developer friendly.

I did indeed say one can argue about it. I can see if there's any
clean way of handling it that doesn't require a hundred lines of code.

>> Doing PIC-mangling optimally will in many cases be context-dependent
>> which is probably why it doesn't already exist. For example if you
>> need to do multiple complex PIC-loads you should do a single lea and
>> reuse it which is harder to convey as a generic macro.
>
> I think that this macro can help with that too:
> e.g.
>     SET_PIC_BASE lea,  r5, const_1
>     movaps        m1,  [pic_base_const_1 + r2 ]
>     movaps        m2,  [pic_base_const_1 + const_2 - const_1 + r2]
>
> I guess you might want a second macro, something like:
>     %define pic_mangle(a,b) (pic_base_%+ a + b - a)
>     movaps        m2,  [pic_mangle(const_1, const_2) ]

Hmm, something like this might be doable (completely untested):

%macro SET_PIC_BASE 1-2 $$ ; reg, base
    %ifdef PIC
        lea %1, [%2]
        %xdefine WRT_PIC_BASE +%1-%2
    %else
        %xdefine WRT_PIC_BASE
    %endif
%endmacro

SET_PIC_BASE r5, const_1
movaps m1, [const_1 + r2 WRT_PIC_BASE]
movaps m1, [const_2 + r2 WRT_PIC_BASE]

Note that it's generally not possible to create a valid relocation
between symbols if one or more of them are external. In that case the
relocations needs to be relative to some local reference point, such
as $$, which in the above example would be done by simply omitting the
base:

SET_PIC_BASE r5

It's of course possible to just always use $$, but it wastes a bit of
code size by always forcing imm32 offsets so it'd be preferable to
leave it optional.

> The cmpps version definitely got converted to vcmpps
> and no SSE penalty could have been imposed on it.
>
> The "vcmpps" had absolutely the same speed on Skylake.
> On Ryzen it was slower.
>
> The reason for the slowdown is somewhere else.

The code you posted used cmpless, not cmpps. Pseudo-instruction do
_NOT_ automatically become VEX-encoded in AVX mode, feel free to look
at the disassembly.

> As for now,
> should I bring back the disabled "INIT_YMM avx2"
> or leave it as it is?

If you tested it with no AVX state transitions and it was still slower
than 128-bit SIMD on a CPU with 256-bit SIMD units than it's of course
better to just stick with the existing 128-bit code.


More information about the ffmpeg-devel mailing list