[FFmpeg-devel] libavcodec/exr : add SSE SIMD for reorder_pixels v2 (WIP)

Ivan Kalvachev ikalvachev at gmail.com
Tue Aug 29 16:13:48 EEST 2017

On 8/26/17, Martin Vignali <martin.vignali at gmail.com> wrote:
> Hello,
> in attach new patch for SSE simd of reorder pixels in exr decoder (use by
> zip and rle uncompress),
> after comments on the previous patch by Ivan Kalvachev.
> After testing only on a small buffer, i fix the overread problem of the
> previous patch (who hid the last loop trouble)
> pass fate test for me (on Mac Os X)
> Tested with the decoding of a sequence of 150 HD Exr images (CGI render
> with 17 layers per file in float pixel, ZIP16 compression)
> SSE :
> 349190 decicycles in reorder_pixels_zip,  130716 runs,    356 skips
> bench: utime=109.222s
> bench: maxrss=607002624kB
> Scalar :
> 3039686 decicycles in reorder_pixels_zip,  130395 runs,    677 skips
> bench: utime=123.042s
> bench: maxrss=607019008kB

That's impressive. :)
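For anyone following along, the operation being vectorized is a byte interleave of the two halves of the source buffer. A scalar sketch, reconstructed from the asm below (not the actual exr.c code, so names are hypothetical):

```c
#include <stdint.h>

/* Hypothetical scalar reference, reconstructed from the SSE2 code below:
 * interleave the first and second half of src into dst. */
static void reorder_pixels_scalar(const uint8_t *src, uint8_t *dst, int size)
{
    int half = size / 2;
    for (int i = 0; i < half; i++) {
        dst[2 * i]     = src[i];        /* byte from the first half  */
        dst[2 * i + 1] = src[i + half]; /* byte from the second half */
    }
}
```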

> Comments Welcome

Since you asked ;)
> [...]
> +;******************************************************************************
> +
> +%include "libavutil/x86/x86util.asm"

The explicit include of x86inc.asm is still missing.

> +SECTION .text
> +
> +;------------------------------------------------------------------------------
> +; void ff_reorder_pixels_sse2(uint8_t *src, uint8_t *dst, int size)
> +;------------------------------------------------------------------------------
> +
> +
> +INIT_XMM sse2
> +cglobal reorder_pixels, 3,6,3, src, dst, size

You use r5 but not r4. Change r5 to r4 and decrease the register count in cglobal accordingly (3,6,3 -> 3,5,3).

> +
> +    shr                sizeq, 1;       sizeq = half_size
> +    mov                   r3, sizeq
> +    shr                   r3, 4;       r3 = half_size/16 -> loop_simd count
> +
> +loop_simd:
> +;initial condition loop
> +    jle      after_loop_simd;          jump to scalar part if loop_simd count(r3) is 0
> +
> +    movdqa                m0, [srcq];           load first part
> +    movdqu                m1, [srcq + sizeq];   load second part

Would you test whether moving the movdqu first makes any difference in speed?
I had a similar case and I think it was faster that way,
since movdqu has higher latency.
It might not matter on newer CPUs.

(If you can't tell the difference, leave it as it is.)

> +    movdqa                m2, m0;               copy m0
> +
> +    punpcklbw             m2, m1;               interleaved part 1

Both instructions could be combined as
     punpcklbw             m2, m0, m1
and the movdqa would be inserted automatically by the x86inc macro.

> +    movdqa            [dstq], m2;               copy to dst array
> +
> +    punpckhbw             m0, m1;               interleaved part 2
> +    movdqa     [dstq+mmsize], m0;               copy to dst array
> +    add                 dstq, 2*mmsize;         inc dst
> +    add                 srcq, mmsize;           inc src
> +    sub                   r3, 1
> +    jmp            loop_simd

I think there is a way to remove some of this arithmetic.
x86 allows slightly more complicated addressing of the form
[const+r1+r2*x], where x can be a power of two: 1, 2, 4 or 8.

If you use r3 as the address offset (i.e. r3 += mmsize)
you can do:
    mov  [dstq+r3*2+mmsize], m3

This, however, means that you'd need
src2 as a separate pointer/register, because
[src1q+sizeq+r3] uses 3 different registers.

You'd also need two instructions, "add" and "cmp",
at the end of the loop.

Well, there are (at least) two ways to combine the add+cmp.

1. Start with r3 = aligned(sizeq) - mmsize
and make a descending loop with a single "sub r3, mmsize",
until you go negative.

2. A trick used in x264:
start with r3 = -1*aligned(sizeq) and use "add r3, mmsize"
until you reach zero.
Since r3 is used as an offset, you'd have to correct all the pointers too
(i.e. add aligned sizeq to src1 and src2, and twice that to dst).

In my tests, on my machine, #2 was always slower
(and that's without a second scalar loop
that has to deal with the "corrected" pointers).

I also found that accessing memory in descending order
is no worse than ascending, so I've been using
descending loops unless I needed an increasing
positive index for some other reason.

I'm not asking you to use any of these methods,
but you might want to test and see
if any of them is faster for you.
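For illustration, option 1 expressed in C (a hypothetical sketch, not code from the patch): the single index doubles as loop counter and address offset, so the asm version would need only one "sub r3, mmsize" plus a sign-based branch per iteration.

```c
#include <stdint.h>

/* Descending-loop sketch (option 1): i serves as both the loop counter
 * and the address offset, removing the separate add/cmp pair.
 * Hypothetical illustration of the idea, not the patch itself. */
static void reorder_halves_desc(const uint8_t *src, uint8_t *dst, int half)
{
    for (int i = half - 1; i >= 0; i--) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = src[i + half];
    }
}
```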

> +after_loop_simd:
> +;scalar part
> +    mov                   r3, sizeq;            r3 = half_size
> +    and                   r3, 15;               r3 = half_size % 16

Use mmsize-1 instead of the literal 15.

> +loop_scalar:
> +;initial condition loop
> +    cmp                   r3, 0
> +    jle                    end
> +
> +    mov                  r5b, [srcq];           load byte first part
> +    mov               [dstq], r5b;              copy first byte to dst
> +
> +    mov                  r5b, [srcq + sizeq];   load byte second part
> +    mov             [dstq+1], r5b;              copy byte second part

In theory, doing one write instead of two should be faster.

You could do some evil code here, using the high/low byte register access
available since the 16-bit days of x86 (ax = {ah, al}). Doing so is not
better, though, because it would do the same shifting and "or"-ing as the
explicit code.

I'm OK if you don't change that.
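As a C sketch of the single-write idea (x86 is little-endian, so the low byte lands at dst[0]; the helper name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Combine two neighbouring byte stores into one 16-bit store.
 * Little-endian only, which is fine for x86. Hypothetical helper. */
static void store_pair(uint8_t *dst, uint8_t first, uint8_t second)
{
    uint16_t v = (uint16_t)(first | (second << 8));
    memcpy(dst, &v, sizeof(v)); /* dst[0] = first, dst[1] = second */
}
```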

> +    add                 dstq, 2;                inc dst
> +    inc                 srcq

Use "add ,1"  instead of "inc", just like "dec"/"sub",
the increment instruction does partial update of the flag register
and introduces dependency.

> +    sub                   r3, 1
> +    jmp          loop_scalar

In the first (vector) loop you did not use "cmp" at the start of the loop;
you used the "sub" at the end to set the flags.
Any reason not to do the same here?

Also, I think "and" sets the Zero flag, so you
might be able to skip the "cmp" entirely
if you use a different branch instruction.

Now, there is one big question:
can you get rid of the scalar loop entirely?

That is, can you make the C code use bigger (padded) buffers
(input and output), so that overreading/overwriting is not a problem?

This way the vector code could read the final
data bytes together with some garbage,
write the result along with a bit more garbage,
and the extra bytes would simply be ignored.

The answer depends on the codec C code.

In order to make SIMD easier, ffmpeg usually uses a stride/linesize
that is bigger than the width and aligned to the maximum SIMD alignment.
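A sketch of the rounding involved, assuming the allocations themselves carry the padding (names and the macro value are hypothetical; whether exr.c can guarantee the padding is the open question):

```c
#include <stddef.h>

#define SIMD_ALIGN 16 /* mmsize for SSE2; hypothetical constant */

/* Round half_size up to a whole number of vectors. If both buffers are
 * allocated with at least this much room, the SIMD loop can run over the
 * tail, reading/writing a few garbage bytes, and no scalar loop is
 * needed. Sketch only, not the actual allocation logic in exr.c. */
static size_t padded_half(size_t half_size)
{
    return (half_size + SIMD_ALIGN - 1) & ~(size_t)(SIMD_ALIGN - 1);
}
```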

Also, there are instructions that do masked reads and writes from memory;
unfortunately, maskmov* is insanely slow on almost all AMD CPUs,
to the point of negating all the speedup of the vector code.

> +/*
> +static void ff_reorder_pixels_sse2_intrinsics(uint8_t *src, uint8_t *dst, int size)
> +{

How fast/slow is the intrinsics version compared to the others?

Best Regards
