[FFmpeg-devel] [PATCH] JPEG2000: SSE optimisation of DWT decoding

Thu Aug 10 22:58:11 EEST 2017

> From: Ivan Kalvachev <ikalvachev at gmail.com>
> 
> On 8/8/17, maxime taisant <maximetaisant at hotmail.fr> wrote:
> > From: Maxime Taisant <maximetaisant at hotmail.fr>
> >
> > Hi,
> >
> > Here are some SSE optimisations for the dwt function used to decode
> > JPEG2000.
> > I tested this code by using the time command while reading a
> > JPEG2000-encoded video with ffmpeg and, on average, I observed a 4.05%
> > general improvement, and a 12.67% improvement on the dwt decoding part
> > alone.
> > In the nasm code, you can notice that the SR1DFLOAT macro appears
> > twice. One version is called in the nasm code by the HORSD macro and
> > the other is called in the C code of the dwt function. I couldn't
> > figure out a way to make only one macro.
> 
> You want to use the same macro at two locations or you want to have
> 1 function and "call" it from 2 places?
> 
> For the former, I'd guess that you might have been getting errors
> about duplicated labels, since you use the ordinary local label form
> instead of the macro-local form, i.e. ".loop" vs "%%loop".

Currently I have one function declared with "cglobal" and called in the C code, and one macro with exactly the same behavior used in the nasm code.
So I guess I would like to keep only one of the two and call it from both places. (Sorry if it's still not clear, English is not my native language).

> > I also couldn't figure out a good way to optimize the VER_SD part, so
> > that is why I left it unchanged, with just an SSE-optimized version of
> > the SR_1D_FLOAT function.
> 
> [...]
> > +.extend:
> > +    shl i0d, 2
> > +    shl i1d, 2
> > +    mov j0q, i0q
> > +    mov j1q, i1q
> > +    movups m0, [lineq+j0q+4]
> > +    shufps m0, m0, 0x1B
> 
> x86inc provides a readable notation for the shuffle constant:
> qXXXX, where each X is an index into the source reg.
> Using q3210 would generate a constant that leaves all elements in their
> original places.
> 0x1B is q0123, which reverses the element order, doesn't it?
>
> Also, a minor cosmetic nitpick:
> usually the first parameters are placed so their commas are vertically
> aligned.
> This applies only when the parameter is a register (so no jmp labels or []
> addresses).
> 

Ok, I will change all that.
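
To make the q notation above concrete, here is a small Python model of how a name such as q0123 maps to the 8-bit immediate, and of what shufps does with it. This is only a sketch: the helper names `q` and `shufps` are made up for illustration and are not part of x86inc or of the patch, and the lane selection follows the usual SSE definition of shufps (low two result elements taken from the destination, high two from the source).

```python
def q(name):
    """Return the shuffle immediate for an x86inc-style name such as 'q0123'.

    The leftmost digit selects the highest result element, so each digit
    is shifted in as a 2-bit field from the left.
    """
    assert len(name) == 5 and name[0] == "q"
    imm = 0
    for digit in name[1:]:
        index = int(digit)
        assert 0 <= index <= 3
        imm = (imm << 2) | index
    return imm


def shufps(dst, src, imm):
    """Element-level model of 'shufps dst, src, imm' on 4-element lists.

    Result elements 0 and 1 are picked from dst, elements 2 and 3 from src.
    Element 0 is the lowest lane (lowest memory address).
    """
    return [
        dst[imm & 3],
        dst[(imm >> 2) & 3],
        src[(imm >> 4) & 3],
        src[(imm >> 6) & 3],
    ]


v = [10.0, 11.0, 12.0, 13.0]
identity = shufps(v, v, q("q3210"))    # leaves every element in place
reversed_v = shufps(v, v, q("q0123"))  # same immediate as 0x1B
```

With both operands set to the same register, as in the quoted "shufps m0, m0, 0x1B", the q0123 constant reverses the four floats.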

> [...]
> > +    ;line{2*i,2*(i+1),2*(i+2),2*(i+3)} -= F_LFTG_DELTA*(line{2*i-1,2*(i+1)-1,2*(i+2)-1,2*(i+3)-1}+line{2*i+1,2*(i+1)+1,2*(i+2)+1,2*(i+3)+1})
> > +    movups m0, [lineq+2*j0q-28]
> > +    movups m4, [lineq+2*j0q-12]
> > +    movups m1, m0
> > +    shufps m0, m4, 0xDD
> > +    shufps m1, m4, 0x88
> 
> x86inc provides a way to emulate 3-operand AVX syntax.
> This means it hides one of the movaps (use 'a' for reg-to-reg moves).
>     shufps m1, m0, m4, 0x88
>     shufps m0, m4, 0xDD

I know, but I figured that I would do an SSE version first and add AVX support afterwards.
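
For readers following the shuffle constants, the two immediates in the quoted snippet split eight consecutive floats into their odd-indexed and even-indexed halves. The Python sketch below models this at the element level. It assumes the usual SSE semantics of shufps; the names `shufps`, `lo` and `hi` are illustrative only and do not appear in the patch.

```python
def shufps(dst, src, imm):
    """Element-level model of 'shufps dst, src, imm' on 4-element lists.

    Result elements 0 and 1 are picked from dst, elements 2 and 3 from src.
    """
    return [
        dst[imm & 3],
        dst[(imm >> 2) & 3],
        src[(imm >> 4) & 3],
        src[(imm >> 6) & 3],
    ]


# Eight consecutive samples, loaded as two 4-float vectors (like m0 and m4).
line = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
lo, hi = line[0:4], line[4:8]

# 3-operand form 'shufps m1, m0, m4, 0x88': the result goes to a separate
# register, so no explicit copy of m0 is needed before it is overwritten.
even = shufps(lo, hi, 0x88)  # elements 0, 2, 4, 6

# 2-operand form 'shufps m0, m4, 0xDD': overwrites the first operand.
odd = shufps(lo, hi, 0xDD)   # elements 1, 3, 5, 7
```

In the 2-operand SSE form the first operand is destroyed, which is why the original code copies m0 into m1 before the first shuffle. The emulated 3-operand form lets the macro layer insert that copy.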

> 
> [...]
> > +    movups m2, [lineq+2*j0q-24]
> > +    movups m5, [lineq+2*j0q-8]
> > +    shufps m2, m5, 0xDD
> > +    addps m2, m1
> > +    mulps m2, m3
> > +    subps m0, m2
> > +    movups m4, m1
> > +    shufps m1, m0, 0x44 ; 0100'0100 q1010
> Is that movlhps m1, m0 ?

No, this instruction places the first two values of m1 in the last two doublewords of m1, and the first two values of m0 in the first two doublewords of m1.
Movhlps would simply replace the first two values of m1 with those of m0.
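
As a sanity check on this exchange, the Python sketch below models the three instructions at the element level, following the lane semantics given in Intel's instruction reference (element 0 is the lowest lane). The function names are illustrative and not part of the patch. In this model, shufps with 0x44 gives the same result as movlhps, while movhlps differs, so the question raised in the review may be worth rechecking against the real code.

```python
def shufps(dst, src, imm):
    """'shufps dst, src, imm': low two results from dst, high two from src."""
    return [
        dst[imm & 3],
        dst[(imm >> 2) & 3],
        src[(imm >> 4) & 3],
        src[(imm >> 6) & 3],
    ]


def movlhps(dst, src):
    """'movlhps dst, src': low half of src goes to the high half of dst."""
    return [dst[0], dst[1], src[0], src[1]]


def movhlps(dst, src):
    """'movhlps dst, src': high half of src goes to the low half of dst."""
    return [src[2], src[3], dst[2], dst[3]]


m1 = [1.0, 2.0, 3.0, 4.0]
m0 = [5.0, 6.0, 7.0, 8.0]

shuffled = shufps(m1, m0, 0x44)  # 0x44 is q1010
via_movlhps = movlhps(m1, m0)
via_movhlps = movhlps(m1, m0)
```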

Thank you for all of your comments.

Best regards.