[FFmpeg-devel] [PATCH] vp9: 16bpp tm/dc/h/v intra pred simd (mostly sse2) functions.

Sat Oct 3 11:12:01 CEST 2015

On Sat, Oct 3, 2015 at 2:12 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> Well, the prototype is different. For H/V, it's not critical, but for the
> directional ones, the edge handling is very quirky so I wanted to do that
> in C, so l/a are arguments instead of part of the source buffer.
>
> (And because we do in-loop filtering, doing V as-is from h264 won't work,
> since a can be post-loopfilter, whereas in h264 it's required to be pre-,
> and we don't swap in vp9.)

Oh, I see. Then it's fine.

> +cglobal vp9_ipred_v_32x32_16, 2, 4, 4, dst, stride, l, a
[...]
> +.loop:
> +    mova   [dstq+strideq*0+ 0], m0
> +    mova   [dstq+strideq*0+16], m1
> +    mova   [dstq+strideq*0+32], m2
> +    mova   [dstq+strideq*0+48], m3
> +    mova   [dstq+strideq*1+ 0], m0
> +    mova   [dstq+strideq*1+16], m1
> +    mova   [dstq+strideq*1+32], m2
> +    mova   [dstq+strideq*1+48], m3
> +    mova   [dstq+strideq*2+ 0], m0
> +    mova   [dstq+strideq*2+16], m1
> +    mova   [dstq+strideq*2+32], m2
> +    mova   [dstq+strideq*2+48], m3
> +    mova   [dstq+stride3q + 0], m0
> +    mova   [dstq+stride3q +16], m1
> +    mova   [dstq+stride3q +32], m2
> +    mova   [dstq+stride3q +48], m3
> +    lea                   dstq, [dstq+strideq*4]
> +    dec               cntd
> +    jg .loop
> +    RET

Missed this one before, but you could cut the number of
stores/iteration in half here as well. Feel free to push after that.
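
To illustrate, here is a rough scalar C model of the suggestion, not the actual asm and not FFmpeg code. It assumes that "half the stores per iteration" means writing two rows per loop pass instead of four and running the loop twice as many times. The function name, the rows_per_iter parameter and the element-based stride are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the 32x32 16bpp vertical predictor:
 * every row of the block is a copy of the 32 "above" pixels
 * (32 * 2 bytes = 64 bytes, i.e. four 16-byte stores per row in SSE2).
 * rows_per_iter models how far the asm loop is unrolled. */
static void ipred_v_32x32_16_model(uint16_t *dst, ptrdiff_t stride,
                                   const uint16_t *a, int rows_per_iter)
{
    /* stride is counted in uint16_t elements here (assumption of this
     * sketch; the asm uses a byte stride). */
    assert(rows_per_iter > 0 && 32 % rows_per_iter == 0);

    for (int cnt = 32 / rows_per_iter; cnt > 0; cnt--) {
        /* Loop body: rows_per_iter rows, each one 64-byte copy. */
        for (int r = 0; r < rows_per_iter; r++)
            memcpy(dst + r * stride, a, 32 * sizeof(*a));
        dst += rows_per_iter * stride;
    }
}
```

With rows_per_iter = 4 this matches the quoted loop (16 stores per pass, 8 passes); with rows_per_iter = 2 it is 8 stores per pass and 16 passes. The output is identical, only the size of the loop body changes.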