[FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Sat Aug 5 01:58:11 EEST 2017

On 8/4/17, Henrik Gramner <henrik at gramner.com> wrote:
> On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev <ikalvachev at gmail.com>
> wrote:
>>> 1234_____________1234_1234_123
>>>     VBROADCASTSS ym1, xm1
>>>     BLENDVPS      m1, m2, m3
>>>
>>> is the most commonly used alignment.
>>
>> I see that a lot of .asm files use different alignments.
>> I'll try to pick something similar that I find good looking.
>>
>> I also see that different function/macro can have
>> different position for the first comma.
>> This could be quite useful, to visually separate
>> the macros.
>
> Things like indentation can be a bit inconsistent at times, yes. Cody
> by different authors written over the span of many year etc.
>
> It's normal to diverge from "standard alignment" in cases when macro
> names (or memory addresses) are fairly long. Otherwise the amount of
> whitespace between instructions and operands would easily get too
> large to be practical.

I tried different things. The one space after comma indeed looks best.
I used 2 spaces after instruction.

>>> Now that I think about it I believe that part wasn't merged because
>>> smartalign is broken on some older nasm versions (which is another
>>> reason for avoiding things like this in individual files) and FFmpeg
>>> doesn't enforce any specific version requirements. I guess it could be
>>> worked around with some version checking ifdeffery if we know which
>>> versions are affected.
>>
>> I don't see anything relevant in here:
>> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
>> I also didn't notice anything similar in the changelog.
> http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c
> https://bugzilla.nasm.us/show_bug.cgi?id=3392319

This would break generation of dependencies?
I think that FFmpeg uses combined mode with nasm,
so it should not call preprocessor only mode,
unless DBG=1 is used.

I'll investigate this further.
(revert the nasm fix, try building ffmpeg dependencies ...)

>> I had it mixed before, but I changed it to be consistent.
>> Some new avx instruction are not overloaded or
>> available without their "v-" prefix.
>> e.g. x86util uses vpbroadcastw in SPLATW.
>
> Every instruction that has both a legacy encoding and a VEX-encoding
> will be automatically converted to VEX in AVX functions when written
> without the v-prefix. There is no legacy encoding for newer
> instructions so there's no reason for x86inc to overload those.

That's great, but it doesn't address the original problem.
Do you insist on removing the "v-" prefix from AVX only instructions?

You said that this is purely cosmetic and
I think that it would make compilation a bit slower,
because of unnecessary macro expansions. ;)

>> Isn't one of the selling points for x86inc that
>> it would warn if unsupported instructions are used?
>
> Not really, no. It's done in a small subset of cases as a part of the
> aforementioned auto legacy/VEX conversion because it was trivial to
> add that functionality, but correctly tracking every single x86
> instruction in existence would be a whole different thing.
>
> Some instructions requires different instruction sets not just
> depending on the vector width but the operand types as well (for
> example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires
> sse2, from xmm to memory requires sse4.1) and there's no generic way
> to handle every special corner case which would make things fairly
> complex. I'm guessing doing so would also cause a significant
> reduction in compilation speed which I'm sure some people would
> dislike.

There is already check for register size and integer operations,
because SSE1 does not have 128bit xmm integer operations.
It should be just few lines of code to add the same check for
AVX1 and 256bit ymm integer operations.
(I'm not confident enough to mess with that code, yet.)

As for the immediate task:
The "error" message stays.

Maybe I should also add "use PBROADCASTD instead" to it?

>>> Add your improvements to the existing x86util macro (can be done in
>>> the same patch) and then use that.
>>
>> FFmpeg guidelines request that it is a separate patch in separate mail.
>
> Doesn't really seem to be followed historically by looking at the git
> history, but doing things by the book is never wrong so making it a
> separate commit might be the correct choice.

Will do.

>> Just renaming it is not enough, because SPLATD
>> works differently (it can broadcast second, third or forth element too)
>> and it is already used in existing code.
>> Maybe PBROADCASTD macro could use SPLATD internally.
>
> SPLATD is hardcoded to broadcast the lowest element.
>
> http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782
>
> Ideally the SSE branch of SPLATD should be also removed and any float
> code using it should be changed to use the BROADCASTSS macro instead,
> but that's perhaps going a bit off-topic.

It seems I've been looking at x264 x86util.asm:

https://git.videolan.org/?p=x264.git;a=blob;f=common/x86/x86util.asm;h=7a140ebd93ccd845b61bfbe1e7e379b09d41d293;hb=HEAD#l280

I won't be using SPLATD in PBROADCASTD.

>> "cmpps" is not scalar, but it is handled by x86inc macros and converted
>> to "vcmpps", so no SSE penalties are possible with it.
>>
>> Read again what I wrote previously.
>
> 'cmpps' is packed, not scalar (hence the 'p' instead of 's') so it
> obviously shouldn't be used in scalar code. I believe you misread my
> initial comment which said to use 'cmpss' instead of 'cmpless'.

cmpss/ps seem to have same latency/throughput on AVX capable CPUs.
I'm using cmpss simply because I need to compare only the lowest element.
It's the only scalar instruction in that block.

Anyway, already changed the cmpless to cmpss.
Maybe I'll be able to ask for another benchmark on Ryzen.

>>> One could of course argue whether or not x86inc should deal with
>>> psuedo-instructions but right now it doesn't so it will cause AVX
>>> state transitions if INIT_YMM is used since it wont be VEX-encoded.
>>
>> Let's say that using a number for selecting comparison mode
>> is NOT developer friendly.
>
> I did indeed say one can argue about it. I can see if there's any
> clean way of handling it that doesn't require a hundred lines of code.

Hundred??

cmpless is already handled by the compiler.
I think all you'd need to handle it is:
AVX_INSTR cmpless, sse, 1, 0, 0

and the same for cmpeqss, cmpeqps, cmpltss, cmpltps, cmpleps, etc...
8 packed, 8 scalar.

Unless I miss something (and as I've said before,
I'm not confident enough to mess with that code.)

(AVX does extend to 32 variants, but they are not
SSE compatible, so no need to emulate them.)

>>> Doing PIC-mangling optimally will in many cases be context-dependent
>>> which is probably why it doesn't already exist. For example if you
>>> need to do multiple complex PIC-loads you should do a single lea and
>>> reuse it which is harder to convey as a generic macro.
>>
>> I think that this macro can help with that too:
>> e.g.
>>     SET_PIC_BASE lea,  r5, const_1
>>     movaps        m1,  [pic_base_const_1 + r2 ]
>>     movaps        m2,  [pic_base_const_1 + const_2 - const_1 + r2]
>>
>> I guess you might want a second macro, something like:
>>     %define pic_mangle(a,b) (pic_base_%+ a + b - a)
>>     movaps        m2,  [pic_mangle(const_1, const_2) ]
>
> Hmm, something like this might be doable (completely untested):
>
> %macro SET_PIC_BASE 1-2 $$ ; reg, base
>     %ifdef PIC
>         lea %1, [%2]
>         %xdefine WRT_PIC_BASE +%1-%2
>     %else
>         %xdefine WRT_PIC_BASE
>     %endif
> %endmacro
>
> SET_PIC_BASE r5, const_1
> movaps m1, [const_1 + r2 WRT_PIC_BASE]
> movaps m1, [const_2 + r2 WRT_PIC_BASE]

    movaps m1, [WRT_PIC_BASE + const_2 + r2 ]

Looks better. (Also not tested. Will do, later.)

> Note that it's generally not possible to create a valid relocation
> between symbols if one or more of them are external. In that case the
> relocations needs to be relative to some local reference point, such
> as $$, which in the above example would be done by simply omitting the
> base:
>
> SET_PIC_BASE r5
>
> It's of course possible to just always use $$, but it wastes a bit of
> code size by always forcing imm32 offsets so it'd be preferable to
> leave it optional.

Yeh $$ is the start of the current section, and that's is going to be
".text"  not "rodata".

Anyway, for now I'll keep my PIC macro code in my file.
OK?

>> The cmpps version definitely got converted to vcmpps
>> and no SSE penalty could have been imposed on it.
>>
>> The "vcmpps" had absolutely the same speed on Skylake.
>> On Ryzen it was slower.
>>
>> The reason for the slowdown is somewhere else.
>
> The code you posted used cmpless, not cmpps. Pseudo-instruction do
> _NOT_ automatically become VEX-encoded in AVX mode, feel free to look
> at the disassembly.

And I've been saying that I asked people to do benchmarks
with "cmpless" changed to "cmpps", just to rule out SSE vs AVX.

>> As for now,
>> should I bring back the disabled "INIT_YMM avx2"
>> or leave it as it is?
>
> If you tested it with no AVX state transitions and it was still slower

YES!

> than 128-bit SIMD on a CPU with 256-bit SIMD units than it's of course
> better to just stick with the existing 128-bit code.

This comes close, but does not exactly answer my question.

Should I restore the disabled code (as disabled code),
or should I just keep it deleted, so other people would
also wonder "What's going on here?!". ;)

Disabled or Deleted.

Best Regards