[FFmpeg-devel] [PATCH] JPEG2000: SSE optimisation of DWT decoding

Ivan Kalvachev ikalvachev at gmail.com
Tue Aug 8 21:23:45 EEST 2017


On 8/8/17, maxime taisant <maximetaisant at hotmail.fr> wrote:
> From: Maxime Taisant <maximetaisant at hotmail.fr>
>
> Hi,
>
> Here is some SSE optimisations for the dwt function used to decode JPEG2000.
> I tested this code by using the time command while reading a JPEG2000
> encoded video with ffmpeg and, on average, I observed a 4.05% general
> improvement, and a 12.67% improvement on the dwt decoding part alone.
> In the nasm code, you can notice that the SR1DFLOAT macro appear twice. One
> version is called in the nasm code by the HORSD macro and the other is
> called in the C code of the dwt function, I couldn't figure out a way to
> make only one macro.

You want to use the same macro at two locations or
you want to have 1 function and "call" it from 2 places?

For the former, I'd guess that you might have been getting
errors about duplicated labels, since you use the local to the file form
instead local to the macro form. aka: ".loop" vs "%%loop".

> I also couldn't figure out a good way to optimize the VER_SD part, so that
> is why I left it unchanged, with just a SSE-optimized version of the
> SR_1D_FLOAT function.

[...]
> +.extend:
> +    shl i0d, 2
> +    shl i1d, 2
> +    mov j0q, i0q
> +    mov j1q, i1q
> +    movups m0, [lineq+j0q+4]
> +    shufps m0, m0, 0x1B

The x86inc provides with readable method for the shuffle constant.
qXXXX where X is index in the source reg.
Using q3210 would generate constant that leaves all elements at their
original places.
The 0x1B is q0123 , that is swap, isn't it?.

Also, minor cosmetic nitpick.
 usually the first parameters are placed so their commas are vertically aligned.
This applies only when the parameter is register (so no jmp labels or
[] addresses ).

[...]
> +    ;line{2*i,2*(i+1),2*(i+2),2*(i+3)} -=
> F_LFTG_DELTA*(line{2*i-1,2*(i+1)-1,2*(i+2)-1,2*(i+3)-1}+line{2*i+1,2*(i+1)+1,2*(i+2)+1,2*(i+3)+1})
> +    movups m0, [lineq+2*j0q-28]
> +    movups m4, [lineq+2*j0q-12]
> +    movups m1, m0
> +    shufps m0, m4, 0xDD
> +    shufps m1, m4, 0x88

The x86inc provides with a way to emulate 3 operand avx.
This means it hides one of the movaps (use 'a' for reg reg).
    shufps m1, m0, m4, 0x88
    shufps m0, m4, 0xDD

[...]
> +    movups m2, [lineq+2*j0q-24]
> +    movups m5, [lineq+2*j0q-8]
> +    shufps m2, m5, 0xDD
> +    addps m2, m1
> +    mulps m2, m3
> +    subps m0, m2
> +    movups m4, m1
> +    shufps m1, m0, 0x44 ; 0100'0100 q1010
Is that movlhps m1, m0 ?

> +    shufps m1, m1, q3120
> +    shufps m4, m0, 0xEE
> +    shufps m4, m4, 0xD8


That's not comprehensive review, so other should still join in.

Best Regards.


More information about the ffmpeg-devel mailing list