[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

Wed Jul 6 17:14:19 EEST 2016

Hi,

ok, so let's review your implementation of rgb24toY:

On Tue, Jul 5, 2016 at 10:37 PM, Dan Parrot <dan.parrot at mail.com> wrote:

> +static void rgb24ToY_c_vsx(uint8_t *_dst, const uint8_t *src, const
> uint8_t *unused1, const uint8_t *unused2,
> +                           int width, uint32_t *rgb2yuv)
> +{
> +//START_TIMER;
>

OK, we have to start with basics here (why didn't anyone review this
properly?):
- remove commented code (START_TIMER)
- this is not a c function, so it should not have a _c suffix or _c as part
of its suffix. Please rename to rgb24ToY_vsx.

Performance basics:
- it is _really_ important to do _identical_ speed comparisons between C
and SIMD. Therefore, the START/STOP_TIMER should be in the caller (i.e. in
hscale.c where it calls lumToYV12()), not in the callee.

> +    for (i = 0; i < width_adj; i += 8) {
> +        v_rd0 = vec_vsx_ld(0, (unsigned char *)src_addr);
> +        v_rd1 = vec_vsx_ld(0, (unsigned char *)(src_addr + 16));
> +        src_addr += 24;
> +
> +        for (j = 0; j < 2; j++) {
> +            v_tmpr = vec_perm(v_rd0, v_rd0, ((vector unsigned char)
> +                                             {0, 3, 6, 9, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0}));
> +            v_tmpg = vec_perm(v_rd0, v_rd0, ((vector unsigned char)
> +                                             {1, 4, 7, 10, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0}));
> +            v_tmpb = vec_perm(v_rd0, v_rd0, ((vector unsigned char)
> +                                             {2, 5, 8, 11, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0}));
> +
> +            v_rd0 = vec_perm(v_rd0, v_rd1, ((vector unsigned char)
> +                                            {12, 13, 14, 15, 16, 17, 18,
> 19, 20, 21, 22, 23, 0, 0, 0, 0}));
>

Why are you loading 24 bytes (8 pixels), using only 12 bytes (4 pixels) in
your inner loop, and then shifting your pixels within the register to "make
it work"? The proper way to do this is to:
- hand-unroll the code so the inner loop handles all 8 pixels (24 bytes)
without this vec_perm
- load only 4 pixels so you don't need to vec_perm
In both cases, you should not have this vec_perm here. As cute as it looks,
data manipulation (like shuffling - which is what vec_perm does) should be
minimized or removed entirely for optimal simd code.

+            v_tmp_s = vec_unpackh((vector signed char)v_tmpr);
> +            v_tmp_s = vec_and(v_tmp_s, vec_splats((short)0x0ff));
> +            v_r = vec_unpackh(v_tmp_s);
> +            v_tmp_s = vec_unpackh((vector signed char)v_tmpg);
> +            v_tmp_s = vec_and(v_tmp_s, vec_splats((short)0x0ff));
> +            v_g = vec_unpackh(v_tmp_s);
> +            v_tmp_s = vec_unpackh((vector signed char)v_tmpb);
> +            v_tmp_s = vec_and(v_tmp_s, vec_splats((short)0x0ff));
> +            v_b = vec_unpackh(v_tmp_s);
>

vec_splats should be outside the loop.

Shuffling like vec_unpackh should be done as part of the vec_perm. Multiple
permutations on the same data are never a good idea, particularly with
free-form reshuffling instructions like vec_perm which can go essentially
form any to any. Don't forget that 12 out of 16 bytes in v_tmpr/g/b are
unused, so you might just as well do something useful with them.

> +            vector unsigned v_opr1 =
> vec_splats((unsigned)(RGB2YUV_SHIFT-1));
> +            vector unsigned v_opr2 =
> vec_splats((unsigned)(RGB2YUV_SHIFT-7));
> +            vector unsigned v_opr3 =
> vec_splats((unsigned)(RGB2YUV_SHIFT-6));
>

vec_splats outside loop.

> +            v_rslt = v_ry*v_r + v_gy*v_g + v_by*v_b;
>

Is this SIMD'ed? Can you show disassembly? I wouldn't be surprised if the
compiler screws this up.

Which multiply instructions exist (at word level) for ppc64/vsx? Can you
interleave and multiply (like x86 pmaddwd)? In that case, the interleave
should be done as part of the vec_perm also. See x86 SIMD for examples
(input.asm).

> +            v_rslt += vec_sl(vec_splats((int)32), v_opr1);
> +            v_rslt += vec_sl(vec_splats((int)1), v_opr2);
>

vec_splats outside loop.

> +            v_rslt = vec_sr(v_rslt, v_opr3);
> +
> +            v_tmp_s = vec_pack(v_rslt, v_rslt);
>

So here's why the x86 code unrolls to 8 pixels per iteration: it saves a
store. But that only helps if you actually pack 2 sets of 4 pixels - and
ideally interleave the instructions doing the 4 pixels per iteration in the
inner loop. You should do that here also.

> +            v_dst = vec_sld(v_dst, v_tmp_s, 8);
> +        }
> +        v_dst = vec_sld(v_dst, v_dst, 8);
>

The vec_slds should be merged.

> +//STOP_TIMER("rgb24ToY_c_vsx");
>

Remove commented code.

Ronald