[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2

Mon Jul 12 01:02:19 CEST 2010

Hi Eli,

On Sun, Jul 11, 2010 at 2:20 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Sun, Jul 11, 2010 at 8:53 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> as per $subj. All tested to be identical to C reference. If wanted, I
>> can try to share parts of the filter code with the simple loopfilter,
>> but I'm a little scared that it'll turn into massive spaghetti so I
>> didn't do it yet.
>
> +    mova             m4, m1
> +    SWAP              4, 1
>
> This pattern seems to be repeated a lot... I fail to see the point.
> Swapping two registers with the same contents doesn't do anything
> significant.

What I've been told is that for a mova x, y, the x (dest) is available
one cycle after the source, so using y (src) directly after is
preferred over using x directly after. In this function, I'm trying to
organize it such that m2-m5 are (at least in the source) consistently
referring to p1/p0/q0/q1, hence the SWAP.

> For the following:
> +    mova             m4, [rsp+mmsize]
> +    pxor             m3, m3
> +    psubusb          m0, m4
> +    psubusb          m1, m4
> +    psubusb          m7, m4
> +    psubusb          m6, m4
> +    pcmpeqb          m0, m3        ; abs(p3-p2) <= I
> +    pcmpeqb          m1, m3        ; abs(p2-p1) <= I
> +    pcmpeqb          m7, m3        ; abs(q3-q2) <= I
> +    pcmpeqb          m6, m3        ; abs(q2-q1) <= I
> +    pand             m0, m1
> +    pand             m7, m6
> +    pand             m0, m7
>
> The following should be faster with mmxext/sse2:
>
>    mova             m4, [rsp+mmsize]
>    pxor             m3, m3
>    pmaxub           m0, m1
>    pmaxub           m6, m7
>    pmaxub           m0, m6
>    psubusb          m0, m4
>    pcmpeqb          m0, m3

Indeed, and I've extended that idea a bit further as well; it's quite a
big win, >10 cycles.
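
For readers following along, here is a rough scalar model in C of why the
pmaxub version gives the same mask. It is only a sketch: the function names
are made up for illustration, each function handles a single byte lane, and
it models just the four differences from the hunk quoted above, not the
further extension mentioned in the reply.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of psubusb: unsigned saturating byte subtraction. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Scalar model of pmaxub: unsigned byte maximum. */
static uint8_t max_u8(uint8_t a, uint8_t b)
{
    return a > b ? a : b;
}

/* Original pattern: one saturating subtract and one compare-with-zero
 * per difference, then AND the four results together. */
static int within_limit_each(uint8_t d0, uint8_t d1, uint8_t d2, uint8_t d3,
                             uint8_t limit)
{
    return (sat_sub_u8(d0, limit) == 0) & (sat_sub_u8(d1, limit) == 0) &
           (sat_sub_u8(d2, limit) == 0) & (sat_sub_u8(d3, limit) == 0);
}

/* Suggested pattern: take the maximum of the four differences first,
 * then do a single saturating subtract and a single compare. */
static int within_limit_max(uint8_t d0, uint8_t d1, uint8_t d2, uint8_t d3,
                            uint8_t limit)
{
    uint8_t m = max_u8(max_u8(d0, d1), max_u8(d2, d3));
    return sat_sub_u8(m, limit) == 0;
}

/* Spot-check that both forms agree over a coarse grid of inputs. */
static void check_equivalence(void)
{
    for (int limit = 0; limit < 256; limit++)
        for (int a = 0; a < 256; a += 51)
            for (int b = 0; b < 256; b += 51)
                for (int c = 0; c < 256; c += 51)
                    for (int d = 0; d < 256; d += 51)
                        assert(within_limit_each(a, b, c, d, limit) ==
                               within_limit_max(a, b, c, d, limit));
}
```

The two forms agree because every difference is <= I exactly when the
largest of them is <= I, so three pmaxub plus one psubusb/pcmpeqb pair can
stand in for four psubusb, four pcmpeqb and three pand.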

> +    mova             m6, [rsp+mmsize*3]
> +    pxor             m7, m7
> +    pand             m0, m6
> +    pand             m1, m6
> +    pavgb            m0, m7        ; a
> +    psubusb          m1, [pb_1]
> +    pavgb            m1, m7        ; -a
> +    psubusb          m5, m0
> +    paddusb          m5, m1        ; q1-a
> +    psubusb          m2, m1
> +    paddusb          m2, m0        ; p1+a
>
> pavgb is mmxext/sse2 only.

Indeed again, I've replaced the MMX version with slightly slower code.
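
As an illustration only (the attached patch is not reproduced here, so this
is not necessarily what the MMX path in it does): pavgb against a zeroed
register computes (x + 1) >> 1 per byte, and the same result can be built
from a shift, two masks and an add, all of which plain MMX has. The C
sketch below models that on 8 bytes packed in a 64-bit word; the function
names and mask constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference for one byte: pavgb x, 0 rounds up, i.e. (x + 0 + 1) >> 1. */
static uint8_t avg_with_zero_u8(uint8_t x)
{
    return (uint8_t)((x + 1) >> 1);
}

/* Same operation on 8 packed bytes using only a shift, ANDs and an add.
 * Per byte, (x + 1) >> 1 == (x >> 1) + (x & 1). */
static uint64_t avg_with_zero_packed(uint64_t x)
{
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL; /* drop bits shifted in from the neighbouring byte */
    const uint64_t bit0 = 0x0101010101010101ULL; /* rounding bit of each byte */

    /* Each byte sum is at most 127 + 1 = 128, so no carry crosses a byte. */
    return ((x >> 1) & low7) + (x & bit0);
}

/* Compare the packed form against the per-byte reference for all values. */
static void check_avg_with_zero(void)
{
    const uint64_t splat = 0x0101010101010101ULL;

    for (unsigned v = 0; v < 256; v++)
        assert(avg_with_zero_packed(v * splat) ==
               avg_with_zero_u8((uint8_t)v) * splat);
}
```

This costs several instructions where pavgb needs one, which fits with a
fallback of this kind being somewhat slower.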

New version attached, still bitexact for everything, thanks for the
comments so far.

Ronald.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vp8_inner_loop_filter16.patch
Type: application/octet-stream
Size: 17156 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100711/943af748/attachment.obj>