[FFmpeg-devel] [PATCH] VP8 chroma(8) inner loopfilter MMX/MMX2/SSE2

Ronald S. Bultje rsbultje
Sun Jul 18 18:54:50 CEST 2010


Hi,

(I thought I had sent this out yesterday, but it shows up in my drafts
and isn't in my sent mail, so sorry if I sent this twice.)

as per $subj.

What I did, as suggested by Jason, is change the function prototype
of "filter8_inner" from the regular uint8_t *dst, int stride,
[options] to uint8_t *dstU, uint8_t *dstV, int stride, [options]. This
has no advantage for C, MMX or MMX2, but it lets me handle a complete
row of U *and* V pixels at once in the SSE2 case (8 U bytes plus 8 V
bytes fill one 128-bit XMM register), and should thus speed that one
up quite a bit. I can't test that, of course, because my CPU is
shitty. Jason also suggested I rename "filter16" and "filter8" in
VP8DSPContext to "filtery" or "filter16y" and "filteruv" or
"filter8uv". I haven't done that because I'm not sure which name is
better; other suggestions welcome.
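
To make the prototype change concrete, here is a minimal sketch in plain
C. It is not the code from the patch: the toy_* names are made up, the
[options] arguments are left out, and the "filter" is just an average of
the two pixels across a horizontal edge rather than the real VP8
loopfilter math. It only shows how the two-pointer form lets one call
gather 8 U and 8 V pixels into a single 16-byte row, which is what one
XMM register would hold.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Old-style prototype: one plane per call (toy filter, NOT VP8 math).
 * Averages the pixel above and below a horizontal edge for 8 columns. */
static void toy_v8_inner_old(uint8_t *dst, int stride)
{
    for (int x = 0; x < 8; x++) {
        uint8_t p0  = dst[x - stride];            /* row above the edge */
        uint8_t q0  = dst[x];                     /* row below the edge */
        uint8_t avg = (uint8_t)((p0 + q0 + 1) >> 1);
        dst[x - stride] = avg;
        dst[x]          = avg;
    }
}

/* New-style prototype: both chroma planes in one call. */
static void toy_v8_inner_uv(uint8_t *dstU, uint8_t *dstV, int stride)
{
    uint8_t p0[16], q0[16];

    assert(dstU && dstV);

    /* Gather 8 U + 8 V bytes per row: 16 bytes, one XMM register's worth. */
    memcpy(p0,     dstU - stride, 8);
    memcpy(p0 + 8, dstV - stride, 8);
    memcpy(q0,     dstU,          8);
    memcpy(q0 + 8, dstV,          8);

    /* One pass over all 16 lanes; SIMD code would do this in one go. */
    for (int i = 0; i < 16; i++) {
        uint8_t avg = (uint8_t)((p0[i] + q0[i] + 1) >> 1);
        p0[i] = avg;
        q0[i] = avg;
    }

    /* Scatter the results back to the two planes. */
    memcpy(dstU - stride, p0,     8);
    memcpy(dstV - stride, p0 + 8, 8);
    memcpy(dstU,          q0,     8);
    memcpy(dstV,          q0 + 8, 8);
}
```

Calling toy_v8_inner_old() once per plane and calling toy_v8_inner_uv()
once on both planes give the same pixels; the difference is only in how
much work a single call can batch.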

performance, C:
12971 dezicycles in h8 inner, 64 runs, 0 skips
15601 dezicycles in v8 inner, 64 runs, 0 skips
12858 dezicycles in h8 inner, 128 runs, 0 skips
15384 dezicycles in v8 inner, 128 runs, 0 skips
12690 dezicycles in h8 inner, 256 runs, 0 skips
15292 dezicycles in v8 inner, 256 runs, 0 skips
12621 dezicycles in h8 inner, 512 runs, 0 skips
15273 dezicycles in v8 inner, 512 runs, 0 skips
12653 dezicycles in h8 inner, 1024 runs, 0 skips
15295 dezicycles in v8 inner, 1024 runs, 0 skips

performance, MMX2 (MMX is nearly identical to this, similar to the
luma loopfilter, so not measured):
3712 dezicycles in h8 inner, 64 runs, 0 skips
2838 dezicycles in v8 inner, 64 runs, 0 skips
3563 dezicycles in h8 inner, 128 runs, 0 skips
2732 dezicycles in v8 inner, 128 runs, 0 skips
3491 dezicycles in h8 inner, 256 runs, 0 skips
2680 dezicycles in v8 inner, 256 runs, 0 skips
3453 dezicycles in h8 inner, 512 runs, 0 skips
2651 dezicycles in v8 inner, 512 runs, 0 skips
3436 dezicycles in h8 inner, 1023 runs, 1 skips
2634 dezicycles in v8 inner, 1023 runs, 1 skips

performance, SSE2:
4102 dezicycles in h8 inner, 64 runs, 0 skips
2791 dezicycles in v8 inner, 64 runs, 0 skips
4067 dezicycles in h8 inner, 128 runs, 0 skips
2734 dezicycles in v8 inner, 128 runs, 0 skips
3894 dezicycles in h8 inner, 256 runs, 0 skips
2644 dezicycles in v8 inner, 256 runs, 0 skips
3810 dezicycles in h8 inner, 512 runs, 0 skips
2599 dezicycles in v8 inner, 512 runs, 0 skips
3790 dezicycles in h8 inner, 1024 runs, 0 skips
2603 dezicycles in v8 inner, 1024 runs, 0 skips

As with the inner luma filter, the horizontal (h8) SSE2 version is a
little slower than MMX2 on my shitty CPU (about 10% in the longest runs
above: 3790 vs. 3436 dezicycles). This is expected, and Loren's patch
should fix it once we have a flag to distinguish "slow" or "fake" SSE2
CPUs from real ones (which will show a 30-40% performance improvement
over MMX2). Loren, what happened to that patch?
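
For illustration, the kind of selection such a flag would allow could
look roughly like the sketch below. Everything in it is hypothetical:
the TOY_CPU_* flag names, the toy_* function names and the empty stub
bodies are placeholders, not FFmpeg's actual flags or init code, and
this is not Loren's patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU capability bits, for illustration only. */
enum {
    TOY_CPU_MMX2      = 1 << 0,
    TOY_CPU_SSE2      = 1 << 1,
    TOY_CPU_SSE2_SLOW = 1 << 2  /* SSE2 present, but slower than MMX2 */
};

typedef void (*toy_loopfilter_uv_fn)(uint8_t *dstU, uint8_t *dstV,
                                     int stride);

/* Empty stubs standing in for the C, MMX2 and SSE2 implementations. */
static void toy_h8_inner_c(uint8_t *dstU, uint8_t *dstV, int stride)
{ (void)dstU; (void)dstV; (void)stride; }
static void toy_h8_inner_mmx2(uint8_t *dstU, uint8_t *dstV, int stride)
{ (void)dstU; (void)dstV; (void)stride; }
static void toy_h8_inner_sse2(uint8_t *dstU, uint8_t *dstV, int stride)
{ (void)dstU; (void)dstV; (void)stride; }

/* Pick the fastest usable version; skip SSE2 on CPUs flagged as slow. */
static toy_loopfilter_uv_fn toy_select_h8_inner(int cpu_flags)
{
    toy_loopfilter_uv_fn fn = toy_h8_inner_c;

    if (cpu_flags & TOY_CPU_MMX2)
        fn = toy_h8_inner_mmx2;
    if ((cpu_flags & TOY_CPU_SSE2) && !(cpu_flags & TOY_CPU_SSE2_SLOW))
        fn = toy_h8_inner_sse2;

    return fn;
}
```

With this shape, a CPU reporting MMX2, SSE2 and the "slow" bit keeps the
MMX2 function, while a CPU with fast SSE2 gets the SSE2 one.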

I will change the mbedge loopfilter function prototypes in the same
way, but separately (once I have that loopfilter SIMD'ified, which is
next on my list). Once that's done, we should be able to do some
performance comparisons of our decoder vs. libvpx. Jason, how are the
CABAC ops going? :-).

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vp8_chroma_mmx_loopfilter.patch
Type: application/octet-stream
Size: 18917 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100718/b542ce36/attachment.obj>


