[FFmpeg-devel] [PATCH] x86: hevc: adding transform_add

Ronald S. Bultje rsbultje at gmail.com
Wed Jul 30 23:12:44 CEST 2014


Hi,

On Wed, Jul 30, 2014 at 5:04 PM, James Almer <jamrial at gmail.com> wrote:

> On 30/07/14 10:33 AM, Pierre Edouard Lepere wrote:
>
> > +%macro TR_ADD_INIT_SSE_8 2
> > +    movu              m4, [r1]
> > +    movu              m6, [r1+16]
> > +    movu              m8, [r1+32]
> > +    movu             m10, [r1+48]
>
> You can use mova here, and probably in every other movu as well.
>
> > +    lea               %1, [%2*3]
> > +    pxor              m5, m5
> > +    psubw             m5, m4
> > +    packuswb          m4, m4
> > +    packuswb          m5, m5
> > +    pxor              m7, m7
> > +    psubw             m7, m6
> > +    packuswb          m6, m6
> > +    packuswb          m7, m7
> > +    pxor              m9, m9
> > +    psubw             m9, m8
> > +    packuswb          m8, m8
> > +    packuswb          m9, m9
> > +    pxor             m11, m11
> > +    psubw            m11, m10
> > +    packuswb         m10, m10
> > +    packuswb         m11, m11
> > +%endmacro
> >
> > +%macro TR_ADD_OP_SSE 4
> > +    %1                m0, [%2     ]
> > +    %1                m1, [%2+%3  ]
> > +    %1                m2, [%2+%3*2]
> > +    %1                m3, [%2+%4  ]
> > +    paddusb           m0, m4
> > +    paddusb           m1, m6
> > +    paddusb           m2, m8
> > +    paddusb           m3, m10
> > +    psubusb           m0, m5
> > +    psubusb           m1, m7
> > +    psubusb           m2, m9
> > +    psubusb           m3, m11
> > +    %1         [%2     ], m0
> > +    %1         [%2+%3  ], m1
> > +    %1         [%2+2*%3], m2
> > +    %1         [%2+%4  ], m3
> > +%endmacro
>
> You can use packuswb to pack two regs into one, like you did in
> TR_ADD_INIT_SSE_16.
> Then you simply use movq+movhps to load and store data, like so:
>
> %macro TR_ADD_INIT_SSE_8 2
>     mova              m4, [r1]
>     mova              m6, [r1+16]
>     mova              m0, [r1+32]
>     mova              m2, [r1+48]
>     lea               %1, [%2*3]
>     pxor              m5, m5
>     psubw             m5, m4
>     pxor              m7, m7
>     psubw             m7, m6
>     pxor              m1, m1
>     psubw             m1, m0
>     packuswb          m4, m0
>     packuswb          m5, m1
>     pxor              m3, m3
>     psubw             m3, m2
>     packuswb          m6, m2
>     packuswb          m7, m3
> %endmacro
>
> %macro TR_ADD_OP_SSE 4
>     movq                m0, [%2     ]
>     movq                m1, [%2+%3  ]
>     movhps              m0, [%2+%3*2]
>     movhps              m1, [%2+%4  ]
>     paddusb             m0, m4
>     paddusb             m1, m6
>     psubusb             m0, m5
>     psubusb             m1, m7
>     movq         [%2     ], m0
>     movq         [%2+%3  ], m1
>     movhps       [%2+2*%3], m0
>     movhps       [%2+%4  ], m1
> %endmacro


Why all these memory round-trips?

Ronald
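
Note on the arithmetic in the quoted macros: both versions clamp dst + residual to 0..255 without unpacking dst to 16 bits. They pack the residual and its negation to unsigned bytes (packuswb), then apply a saturating add (paddusb) followed by a saturating subtract (psubusb). The scalar C model below is only a sketch of that per-pixel sequence, not code from the patch or from FFmpeg. All function names in it are hypothetical and were chosen for illustration.

```c
#include <stdint.h>

/* One lane of packuswb: signed 16-bit -> unsigned 8-bit, saturating. */
static uint8_t pack_us(int16_t w)
{
    return w < 0 ? 0 : (w > 255 ? 255 : (uint8_t)w);
}

/* One lane of "pxor m, m; psubw m, x": 16-bit negate with wraparound.
 * -32768 has no positive counterpart in 16 bits and maps to itself. */
static int16_t neg_wrap(int16_t w)
{
    return w == INT16_MIN ? INT16_MIN : (int16_t)-w;
}

/* One lane of paddusb: unsigned 8-bit add, saturating at 255. */
static uint8_t add_us(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* One lane of psubusb: unsigned 8-bit subtract, saturating at 0. */
static uint8_t sub_us(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Hypothetical per-pixel model of the quoted sequence:
 * pos = packuswb(res), neg = packuswb(0 - res),
 * dst = psubusb(paddusb(dst, pos), neg). */
static uint8_t tr_add_model(uint8_t dst, int16_t res)
{
    uint8_t pos = pack_us(res);
    uint8_t neg = pack_us(neg_wrap(res));
    return sub_us(add_us(dst, pos), neg);
}

/* Straightforward reference: widen, add, clamp to 0..255. */
static uint8_t tr_add_ref(uint8_t dst, int16_t res)
{
    int v = dst + res;
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Count (dst, res) pairs where the model and the reference disagree,
 * for res in [lo, hi] and every dst in 0..255. */
static long count_mismatches(int lo, int hi)
{
    long bad = 0;
    for (int r = lo; r <= hi; r++)
        for (int d = 0; d < 256; d++)
            if (tr_add_model((uint8_t)d, (int16_t)r) !=
                tr_add_ref((uint8_t)d, (int16_t)r))
                bad++;
    return bad;
}
```

In this model the two paths agree for every residual from -32767 to 32767. They differ only at -32768, where the 16-bit negation wraps and both packed values become 0, leaving dst unchanged. Whether the decoder can ever produce that value is not addressed in this thread.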

