[FFmpeg-devel] [PATCH] avfilter/vf_overlay: add x86 SIMD for yuv444 format when main stream has no alpha

Paul B Mahol onemda at gmail.com
Mon Apr 30 21:57:04 EEST 2018


On 4/30/18, Henrik Gramner <henrik at gramner.com> wrote:
> On Mon, Apr 30, 2018 at 6:17 PM, Paul B Mahol <onemda at gmail.com> wrote:
>> +    .loop0:
>> +        movu      m1, [dq + xq]
>> +        movu      m2, [aq + xq]
>> +        movu      m3, [sq + xq]
>> +
>> +        pshufb       m1, [pb_b2dw]
>> +        pshufb       m2, [pb_b2dw]
>> +        pshufb       m3, [pb_b2dw]
>> +        mova         m4, [pd_255]
>> +        psubd        m4, m2
>> +        pmulld       m1, m4
>> +        pmulld       m3, m2
>> +        paddd        m1, m3
>> +        paddd        m1, [pd_128]
>> +        pmulld       m1, [pd_257]
>> +        psrad        m1, 16
>> +        pshufb       m1, [pb_dw2b]
>> +        movd    [dq+xq], m1
>> +        add          xq, mmsize / 4
>
> Unpacking to dwords seems inefficient when you could do something like
> this (untested):
>
>     mova         m3, [pw_255]
>     mova         m4, [pw_128]
>     mova         m5, [pw_257]
> .loop0:
>     pmovzxbw     m0, [sq + xq]
>     pmovzxbw     m2, [aq + xq]
>     pmovzxbw     m1, [dq + xq]
>     pmullw       m0, m2
>     pxor         m2, m3
>     pmullw       m1, m2
>     paddw        m0, m4
>     paddw        m0, m1
>     pmulhuw      m0, m5
>     packuswb     m0, m0
>     movq    [dq+xq], m0
>     add          xq, mmsize / 2


Will experiment with this.
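
As a sanity check on the word-based suggestion, here is a scalar C sketch of
what each loop computes per byte. It is a model of the arithmetic only, not
the filter's code; the function names are made up for illustration. The point
is that s*a + d*(255-a) + 128 never exceeds 65153, so every intermediate fits
in an unsigned 16-bit word and pmulhuw by 257 gives the same result as the
dword multiply followed by a shift.

```c
#include <stdint.h>

/* Dword path, modelled on the patch: all intermediates held in 32 bits. */
static uint8_t blend_dword(uint8_t s, uint8_t a, uint8_t d)
{
    int32_t v = (int32_t)d * (255 - a) + (int32_t)s * a; /* pmulld x2, paddd */
    v += 128;                                             /* paddd  pd_128   */
    v *= 257;                                             /* pmulld pd_257   */
    return (uint8_t)(v >> 16);                            /* psrad  16       */
}

/* Word path, modelled on the suggestion: intermediates truncated to 16 bits. */
static uint8_t blend_word(uint8_t s, uint8_t a, uint8_t d)
{
    uint16_t sa  = (uint16_t)(s * a);       /* pmullw                          */
    uint16_t ia  = (uint16_t)(a ^ 255);     /* pxor with pw_255, i.e. 255 - a  */
    uint16_t da  = (uint16_t)(d * ia);      /* pmullw                          */
    uint16_t sum = (uint16_t)(sa + 128);    /* paddw  pw_128                   */
    sum = (uint16_t)(sum + da);             /* paddw                           */
    /* pmulhuw keeps the high 16 bits of the unsigned 16x16 product. */
    uint16_t hi  = (uint16_t)(((uint32_t)sum * 257u) >> 16);
    return (uint8_t)hi;                     /* packuswb: hi <= 255, no clamp   */
}

/* Exhaustively compare both paths over every (s, a, d) byte triple. */
int count_mismatches(void)
{
    int bad = 0;
    for (int s = 0; s < 256; s++)
        for (int a = 0; a < 256; a++)
            for (int d = 0; d < 256; d++)
                bad += blend_dword((uint8_t)s, (uint8_t)a, (uint8_t)d) !=
                       blend_word((uint8_t)s, (uint8_t)a, (uint8_t)d);
    return bad;
}
```

Calling count_mismatches() walks all 2^24 input combinations, so it is cheap
enough to run once as a check before benchmarking the asm.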

>
> which does twice as much per iteration. Also note that pmulld is slow
> on most CPUs.

This SIMD is not for CPUs found in museums.

>
>> +    .loop1:
>> +        xor         tq, tq
>> +        xor         uq, uq
>> +        xor         vq, vq
>> +        mov         rd, 255
>> +        mov         tb, [aq + xq]
>> +        neg         tb
>> +        add         rb, tb
>> +        mov         ub, [sq + xq]
>> +        neg         tb
>> +        imul        ud, td
>> +        mov         vb, [dq + xq]
>> +        imul        rd, vd
>> +        add         rd, ud
>> +        add         rd, 128
>> +        imul        rd, 257
>> +        sar         rd, 16
>> +        mov  [dq + xq], rb
>> +        add         xq, 1
>> +        cmp         xq, wq
>> +        jl .loop1
>
> Is doing the tail in scalar necessary? E.g. can you pad the buffers so
> that reading/writing past the end is OK and just run the SIMD loop?

Overlay does not operate that way; you can overlay 1 pixel onto an hd720 frame.
Do you get it now?

>
> If that's impossible it'd probably be better to do a separate SIMD
> loop and pinsr/pextr input/output pixels depending on the number of
> elements left.

That seems too complicated.
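
For comparison, a third way to handle the tail, not proposed in this thread,
is to stage the leftover bytes in a small stack buffer, run the fixed-width
block on that, and copy back only the valid bytes. The C sketch below is an
assumption-laden model: BLOCK, blend_block and blend_row are hypothetical
names, and blend_block merely stands in for one iteration of the SIMD loop.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 8 /* assumed bytes per SIMD iteration (mmsize / 2 with SSE) */

/* Stand-in for one SIMD iteration: always reads and writes BLOCK bytes. */
static void blend_block(uint8_t *d, const uint8_t *s, const uint8_t *a)
{
    for (int i = 0; i < BLOCK; i++)
        d[i] = (uint8_t)(((d[i] * (255 - a[i]) + s[i] * a[i] + 128) * 257) >> 16);
}

/* Blend one row of w bytes without touching memory past the end of the row. */
void blend_row(uint8_t *d, const uint8_t *s, const uint8_t *a, int w)
{
    int x = 0;

    /* Full blocks go straight through the fixed-width routine. */
    for (; x + BLOCK <= w; x += BLOCK)
        blend_block(d + x, s + x, a + x);

    /* Tail: copy the remaining n < BLOCK bytes into zero-padded temporaries. */
    if (x < w) {
        uint8_t td[BLOCK] = {0}, ts[BLOCK] = {0}, ta[BLOCK] = {0};
        size_t n = (size_t)(w - x);

        memcpy(td, d + x, n);
        memcpy(ts, s + x, n);
        memcpy(ta, a + x, n);
        blend_block(td, ts, ta);
        memcpy(d + x, td, n); /* write back only the valid bytes */
    }
}
```

This keeps a single arithmetic path and works for the 1-pixel overlay case,
at the cost of up to six small copies per row.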

