[FFmpeg-devel] [PATCH 3/3] avfilter/vf_framerate: add SIMD functions for frame blending

Marton Balint cus at passwd.hu
Mon Jan 15 01:09:46 EET 2018



On Sun, 14 Jan 2018, Henrik Gramner wrote:

> On Sat, Jan 13, 2018 at 10:57 PM, Marton Balint <cus at passwd.hu> wrote:
>> +    .loop:
>> +        movu            m0, [src1q + xq]
>> +        movu            m1, [src2q + xq]
>> +        punpckl%1%2     m5, m0, m2         ; 0e0f0g0h
>> +        punpckh%1%2     m0, m2             ; 0a0b0c0d
>> +        punpckl%1%2     m6, m1, m2         ; 0E0F0G0H
>> +        punpckh%1%2     m1, m2             ; 0A0B0C0D
>> +        pmull%2         m0, m3
>> +        pmull%2         m5, m3
>> +        pmull%2         m1, m4
>> +        pmull%2         m6, m4
>> +        padd%2          m0, m7
>> +        padd%2          m5, m7
>> +        padd%2          m0, m1
>> +        padd%2          m5, m6
>
> pmaddubsw should work here for the 8-bit case. pmaddwd might work for
> the 16-bit case depending on how many bits are actually used.
>

As far as I can see, I would have to reduce the blending factors to 7 bits
(15 bits in the 16-bit case) for this to work, because the pmadd*
instructions treat the factors as signed integers. Losing 1 bit of
precision in the blending factors is probably not a problem for the
framerate filter.
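
To illustrate the constraint, here is a minimal scalar sketch in C of what
one 16-bit lane of pmaddubsw computes. It is not part of the patch. The
helper names (pmaddubsw_lane, blend_7bit) are made up for illustration, and
the rounding constant 64 and shift of 7 are assumptions that follow from
7-bit factors summing to 128.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit output lane of pmaddubsw: two unsigned bytes
 * (the pixels) are multiplied by two signed bytes (the factors) and the
 * two products are added with signed 16-bit saturation. */
static int16_t pmaddubsw_lane(uint8_t p1, uint8_t p2, int8_t f1, int8_t f2)
{
    int32_t sum = (int32_t)p1 * f1 + (int32_t)p2 * f2;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Hypothetical blend of two pixels with 7-bit factors (assumed to sum to
 * 128). Each factor fits in a signed byte, and the largest possible sum,
 * 255 * 128 = 32640, stays below the saturation limit of 32767. */
static uint8_t blend_7bit(uint8_t p1, uint8_t p2, int f1, int f2)
{
    assert(f1 >= 0 && f1 <= 127 && f2 >= 0 && f2 <= 127 && f1 + f2 == 128);
    int16_t acc = pmaddubsw_lane(p1, p2, (int8_t)f1, (int8_t)f2);
    return (uint8_t)((acc + 64) >> 7);   /* add half, then shift by 7 */
}
```

With 8-bit factors summing to 256, a factor of 128 or more would no longer
fit in a signed byte, which is why one bit of precision has to go.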

So my loop would look like this:

     .loop:
         movu            m0, [src1q + xq]
         movu            m1, [src2q + xq]
         SBUTTERFLY      %1%2, 0, 1, 5       ; aAbBcCdD
                                             ; eEfFgGhH
         pmadd%3         m0, m3
         pmadd%3         m1, m3

         padd%2          m0, m7
         padd%2          m1, m7
         psrl%2          m0, %4              ; 0A0B0C0D
         psrl%2          m1, %4              ; 0E0F0G0H

         packus%2%1      m0, m1              ; ABCDEFGH
         movu   [dstq + xq], m0
         add             xq, mmsize
     jl .loop

Is this what you had in mind?
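
For reference, the data flow of the loop above can be modelled in scalar C.
The sketch below covers the 16-bit case and rests on several assumptions
that are not stated in the patch: that the macro parameters expand to
pmaddwd / paddd / psrld / packusdw, that the shift is 15, that m3 holds the
two factors interleaved, and that m7 holds the rounding constant 1 << 14.
The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit value the way pmaddwd does: as a signed word. */
static int32_t as_signed_word(uint16_t v)
{
    return v >= 32768 ? (int32_t)v - 65536 : (int32_t)v;
}

/* Scalar model of the loop for 16-bit samples (assumed instruction
 * selection: pmaddwd, paddd, psrld, packusdw; assumed shift: 15). */
static void blend_row16_model(uint16_t *dst, const uint16_t *src1,
                              const uint16_t *src2, size_t n,
                              int factor1, int factor2)
{
    /* 15-bit factors: each fits in a signed word, sum is 1 << 15. */
    assert(factor1 > 0 && factor2 > 0 && factor1 + factor2 == 32768);

    for (size_t x = 0; x < n; x++) {
        /* SBUTTERFLY pairs up src1[x] with src2[x]; pmaddwd multiplies
         * the pair by the interleaved factors and adds the products. */
        int64_t acc = (int64_t)as_signed_word(src1[x]) * factor1 +
                      (int64_t)as_signed_word(src2[x]) * factor2;
        uint32_t lane = (uint32_t)acc;   /* low 32 bits, as in the register */
        lane += 16384;                   /* paddd: add half */
        lane >>= 15;                     /* psrld: logical right shift */
        /* packusdw: saturate the dword to the unsigned 16-bit range. */
        dst[x] = lane > 65535 ? 65535 : (uint16_t)lane;
    }
}
```

In this model, samples that use at most 15 bits blend as expected, while a
sample of 32768 or more is read as a negative word and produces a wrong
(saturated) result. That matches the remark that pmaddwd works only
"depending on how many bits are actually used".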

>> +    pinsrw    xm3, r8m, 0                   ; factor1
>> +    pinsrw    xm4, r9m, 0                   ; factor2
>> +    pinsrw    xm7, r10m, 0                  ; half
>> +    SPLATW     m3, xm3
>> +    SPLATW     m4, xm4
>> +    SPLATW     m7, xm7
>
> vpbroadcast* from memory on avx2, otherwise movd instead of pxor+pinsrw.
>
>> +    pxor       m3, m3
>> +    pxor       m4, m4
>> +    pxor       m7, m7
>> +    pinsrw    xm3, r8m, 0                   ; factor1
>> +    pinsrw    xm4, r9m, 0                   ; factor2
>> +    pinsrw    xm7, r10m, 0                  ; half
>> +    XSPLATD       3
>> +    XSPLATD       4
>> +    XSPLATD       7
>
> Ditto.
>
>> +    neg word r11m                           ; shift = -shift
>> +    add word r11m, 16                       ; shift += 16
>> +    pxor       m2, m2
>> +    pinsrw    xm2, r11m, 0                  ; 16 - shift
>> +    pslld      m3, xm2
>> +    pslld      m4, xm2
>> +    pslld      m7, xm2
>
> You probably want to use a temporary register instead of doing slow
> load-modify-store instructions.

OK, I will rework these, although they are only initialization code, so I
guess they are not performance-critical.

Thanks,
Marton

