[FFmpeg-devel] [PATCH] Faster VP3 coeff decoding

David Conrad lessen42
Sun Feb 21 01:52:23 CET 2010


Hi,

This patch replaces the current scheme, which reads the coefficient tokens and builds a linked list from them, with a two-pass scheme. In the first pass the tokens are read into a linear array, keeping pointers to the start of each level of each plane. In the second pass, sorting the coefficients into blocks pulls from the necessary lists as determined by each token.
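
To illustrate the idea, here is a minimal sketch of such a two-pass layout. It is not the code from the attached patches: the Token fields, the TokenStore type, the ts_* function names and MAX_TOKENS are all made up for this example, and real VP3 tokens (EOB runs that span several blocks in particular) are more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_PLANES 3
#define NUM_LEVELS 64    /* one token list per coefficient index 0..63 */
#define MAX_TOKENS 1024  /* arbitrary capacity for the sketch */

/* Hypothetical, simplified token: either "end of block" or
 * "skip zero_run zeros, then store coeff". */
typedef struct Token {
    uint8_t eob;
    uint8_t zero_run;
    int16_t coeff;
} Token;

typedef struct TokenStore {
    Token  tokens[MAX_TOKENS];             /* pass 1 output, one linear array */
    size_t start[NUM_PLANES][NUM_LEVELS];  /* where each list begins */
    size_t count[NUM_PLANES][NUM_LEVELS];  /* tokens in each list */
    size_t pos[NUM_PLANES][NUM_LEVELS];    /* pass 2 read cursor per list */
    size_t used;
    int    cur_plane, cur_level;
} TokenStore;

static void ts_init(TokenStore *s)
{
    memset(s, 0, sizeof(*s));
}

/* Pass 1: open the list for (plane, level); later pushes append to it,
 * so every list is a contiguous run of the linear array. */
static void ts_begin_list(TokenStore *s, int plane, int level)
{
    assert(plane >= 0 && plane < NUM_PLANES);
    assert(level >= 0 && level < NUM_LEVELS);
    s->start[plane][level] = s->used;
    s->count[plane][level] = 0;
    s->cur_plane = plane;
    s->cur_level = level;
}

static void ts_push(TokenStore *s, Token t)
{
    assert(s->used < MAX_TOKENS);
    s->tokens[s->used++] = t;
    s->count[s->cur_plane][s->cur_level]++;
}

/* Pass 2: fill one block. The coefficient index reached so far selects
 * the list to pull from next, and each token decides how far it advances. */
static void ts_unpack_block(TokenStore *s, int plane, int16_t block[64])
{
    int i = 0;

    memset(block, 0, 64 * sizeof(block[0]));
    while (i < NUM_LEVELS) {
        if (s->pos[plane][i] >= s->count[plane][i])
            break;                         /* list exhausted: treat as EOB */
        Token t = s->tokens[s->start[plane][i] + s->pos[plane][i]++];
        if (t.eob)
            break;
        i += t.zero_run;
        if (i >= NUM_LEVELS)
            break;
        block[i++] = t.coeff;
    }
}
```

The point of the layout is that pass 1 is a plain append into one array, and pass 2 only needs one read cursor per list rather than a linked list per block.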

This requires the IDCT to be done in coding (Hilbert) order, which quadruples the slice size from 16 to 64 pixels. It also requires that the slices be decoded in order, ruling out slice threading (e.g. http://github.com/astrange/ffmpeg/commit/1b66f4a0ad812e546c40d2760834b3a03aafceae ). However, that threading method gets about an 11% speedup with 2 cores, which is slightly less than I get on the same sample with this patch.
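
For readers unfamiliar with coding order: VP3 groups blocks into superblocks of 4x4 blocks and codes the blocks inside each superblock along a Hilbert curve. The sketch below maps a position in that order to block coordinates. The table is one valid Hilbert traversal of a 4x4 grid; whether its orientation matches the decoder's exactly is an assumption here, and the names hilbert_xy and coded_block_xy are invented for the example.

```c
#include <assert.h>

/* (x, y) block offset inside a 4x4 superblock for each coding position.
 * Consecutive entries are always neighbouring blocks. */
static const unsigned char hilbert_xy[16][2] = {
    { 0, 0 }, { 1, 0 }, { 1, 1 }, { 0, 1 },
    { 0, 2 }, { 0, 3 }, { 1, 3 }, { 1, 2 },
    { 2, 2 }, { 2, 3 }, { 3, 3 }, { 3, 2 },
    { 3, 1 }, { 2, 1 }, { 2, 0 }, { 3, 0 },
};

/* Block coordinates (in units of 8x8 blocks) of the i-th coded block
 * of the superblock at (sb_x, sb_y). */
static void coded_block_xy(int sb_x, int sb_y, int i, int *bx, int *by)
{
    assert(i >= 0 && i < 16);
    *bx = 4 * sb_x + hilbert_xy[i][0];
    *by = 4 * sb_y + hilbert_xy[i][1];
}
```

Because the curve visits all four block rows of a superblock before moving on to the next superblock, nothing smaller than a full superblock row can be finished at once, which is why the slices have to grow.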

Anyway, I benchmarked both the slowdown from doing the IDCT in coding order and the total speedup including the new coeff decoding, using several different videos, sizes and CPUs (time relative to current svn head, best of 3; lower is faster):

                coding  new
                order   decode
penryn (T9300):
1920x1080       1.04    0.88
1280x720        1.01    0.90
854x480         1.00    0.88
720x480         1.00    0.91
640x272         1.00    0.90

yonah (T1300):
1920x1080       1.01    0.84
1280x720        1.02    0.89
854x480         1.01    0.84
720x480         1.01    0.87
640x272         1.00    0.87

g4 (7447):
1920x1080       1.00    0.72
1280x720        1.01    0.87
854x480         1.00    0.84
720x480         1.01    0.88
640x272         1.00    0.87

cortex-a8 (omap3530):
1920x1080       1.06    0.84
1280x720        1.06    0.91
854x480         1.03    0.87
720x480         1.03    0.91
640x272         1.02    0.88
512x384         1.03    0.92

arm1176jzf-s:
720x480         1.00    0.81
640x272         1.00    0.79
512x384         1.01    0.85

So in the worst case the loss from the larger slices is about 6%, and usually not above 2%; in all cases the gain from the two-pass coefficient decoding more than offsets the loss.

With this, ffmpeg is slightly faster than libtheora on my penryn.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Explictly-separate-decoding-whether-fragments-are-co.patch
Type: application/octet-stream
Size: 4295 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100220/d49af605/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Do-MC-and-IDCT-in-coding-hilbert-order.patch
Type: application/octet-stream
Size: 6028 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100220/d49af605/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Delay-translating-DCT-tokens-into-coefficients-until.patch
Type: application/octet-stream
Size: 24098 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100220/d49af605/attachment-0002.obj>


