[Ffmpeg-devel] [PATCH] fix mpeg4 lowres chroma bug and increase h264/mpeg4 MC speed

Trent Piepho xyzzy
Mon Feb 12 00:40:39 CET 2007


On Fri, 9 Feb 2007, Michael Niedermayer wrote:
> > Do you disagree with me that avg_h264_chroma_mc4_mmx2 is completely broken?

How come you never answer this?

> > Estimated relative speed improvement against current svn:
> > fixed	       -5.68%
> > svn		0.00%
> > table	       16.04%
> > fixed+table    14.77%
> >
> > The version with both patches is only 1.51% slower than if just the table
> > lookup is applied (which will not fix the bugs), and is still 14.77% faster
> > than what is in svn now.
>
> ive benchmarked it too, and the table version alone is slower then svn

What processor?  I'm using an Athlon XP and gcc 4.0.1.  Could you use a
publicly available clip, so that the benchmark can be replicated?

> static void put_h264_chroma_mc2_mmx2_wrap(uint8_t *dst/*align 2*/, uint8_t *src/*align 1*/, int stride, int h, int x, int y){
> START_TIMER
>     put_h264_chroma_mc2_mmx2(dst,src,stride,h,x,y);
> STOP_TIMER("put_h264_chroma_mc2_mmx2")
> }
>
> and *_h264_chroma_mc2_mmx2 marked with attribute((noinline))
> file was a 512x256 movie trailer i had laying around decoded with
> ffplay -lowres 2

I re-did the benchmarks the same way and got different results than my
initial benchmark.  I think the problem may have been that I was counting
the total cycles in a global variable that was close to the xtimesy table
in memory, and that changed the cache behaviour, making the table lookup
cheaper than it should have been.

Why do you discard some times in your TIMER code?  Is the goal just to
discard those times in which an interrupt occurred?  Or is there some other
reason you think discarding some times makes for a better benchmark?

You can see a graph of my results at:
http://www.speakeasy.org/~xyzzy/pictures/h264_chroma_benchmark.png

Each type (original svn code, code fixed with punpck, code with table-based
multiply) was run in a batch of 50 runs (with 2^22 calls per run).  This
was then repeated three times with the batches interleaved, i.e. first 50
runs of the original code, then 50 runs of the punpck code, 50 table, 50
more orig, 50 more punpck, etc.
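
To make the batch ordering concrete, here is a minimal sketch in C of the
schedule described above.  The function name and its parameters are made up
for illustration and are not taken from my actual benchmark harness.  The
point of interleaving is presumably obvious: slow drift in the system
(other processes, clock changes) should hit all three variants about
equally instead of biasing whichever one ran last.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fill order[] with the index of the code variant
 * to use for each run, so that batches are interleaved as
 *   variant 0 x runs_per_batch, variant 1 x runs_per_batch, ...,
 * and the whole sequence is repeated `rounds` times.
 * Returns the total number of runs; writes at most `cap` entries.
 */
static size_t build_schedule(int *order, size_t cap,
                             int n_variants, int runs_per_batch, int rounds)
{
    size_t n = 0;

    for (int round = 0; round < rounds; round++) {
        for (int variant = 0; variant < n_variants; variant++) {
            for (int run = 0; run < runs_per_batch; run++) {
                if (n < cap)
                    order[n] = variant;
                n++;
            }
        }
    }
    return n;
}
```

With 3 variants, 50 runs per batch and 3 rounds this gives 450 runs, where
runs 0-49 use variant 0 (orig), 50-99 variant 1 (punpck), 100-149 variant 2
(table), and run 150 starts over with variant 0.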

The code with the table multiply is clearly faster than what's in svn now,
at least for my source file and my processor.

How to interpret the box plots:

The dark line in the box is the median value.  The lower and upper bounds
of the box are the first and third quartiles, which are roughly the 25th
and 75th percentiles.  The whiskers on the box extend to the most extreme
data point that is no more than 1.5 times the box's length (the
interquartile range) away from the box.  Outliers farther than 1.5 box
lengths from the box are drawn as circles.
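
For anyone who wants to reproduce the numbers behind such a plot, the
following is a minimal sketch in C of the statistics described above.  The
names BoxStats, quantile_sorted and box_stats are made up for this example.
The quartiles here use linear interpolation between sorted samples, which
is an assumption on my part; plotting tools use slightly different quartile
rules, so the box edges may differ a little from what the graph shows.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical container for the values a box plot displays. */
typedef struct {
    double median, q1, q3;          /* dark line and box bounds */
    double whisker_lo, whisker_hi;  /* ends of the whiskers */
    size_t n_outliers;              /* points drawn as circles */
} BoxStats;

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Quantile of sorted data, linearly interpolated (assumed rule). */
static double quantile_sorted(const double *v, size_t n, double p)
{
    double pos  = p * (double)(n - 1);
    size_t i    = (size_t)pos;
    double frac = pos - (double)i;

    return i + 1 < n ? v[i] + frac * (v[i + 1] - v[i]) : v[i];
}

/* Sorts v in place and computes the box plot values; n must be >= 1. */
static BoxStats box_stats(double *v, size_t n)
{
    BoxStats s;

    qsort(v, n, sizeof *v, cmp_double);
    s.median = quantile_sorted(v, n, 0.50);
    s.q1     = quantile_sorted(v, n, 0.25);
    s.q3     = quantile_sorted(v, n, 0.75);

    /* Whiskers may reach at most 1.5 box lengths beyond the box. */
    double reach = 1.5 * (s.q3 - s.q1);
    double lo_fence = s.q1 - reach, hi_fence = s.q3 + reach;

    s.whisker_lo = v[n - 1];
    s.whisker_hi = v[0];
    s.n_outliers = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < lo_fence || v[i] > hi_fence) {
            s.n_outliers++;          /* too far from the box */
        } else {
            if (v[i] < s.whisker_lo) s.whisker_lo = v[i];
            if (v[i] > s.whisker_hi) s.whisker_hi = v[i];
        }
    }
    return s;
}
```

Feeding it one cycle count per run for a given code variant gives the
median, the box, the whisker ends and the number of outliers for that
variant.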



