[FFmpeg-devel] Fw: [foms] Paper submissions to LCA

Fri Jul 17 00:12:41 CEST 2009

2009/7/16 Måns Rullgård <mans at mansr.com>:
> Jason Garrett-Glaser <darkshikari at gmail.com> writes:
>
>> On Thu, Jul 16, 2009 at 6:37 AM, Mike Melanson<mike at multimedia.cx> wrote:
>>> Frank Barchard wrote:
>>>>
>>>> I don't feel qualified to speak for ffmpeg, but 2 potential topics
>>>> would be Chrome, and subtitles:
>>>>
>>>> 1. The Chrome topic, because Chrome/Chromium use ffmpeg to implement
>>>> html5 video tag.  We could talk about what's great, and not so great,
>>>> about using ffmpeg, which would hopefully lead to improvements.
>>>
>>> That's easy enough: The H.264 decoder is great; the Theora decoder sucks. :)
>>
>> The H.264 decoder isn't great because CoreAVC is a crapton faster,
>> primarily due to better architecture, despite the fact that ffmpeg's
>> assembly is significantly superior.
>
> Could we improve this?

Yes.  Doing the following would make ffmpeg faster than CoreAVC for
progressive decoding (interlaced/MBAFF is harder, and I don't want to
get into that).  Some of these would be useful for x264, but I don't
do them because they would only help at the fastest encoding modes
(and I don't want to redesign the encoder around such useless modes):

1.  Template the code twice, once for CABAC, once for CAVLC.
Interleave entropy decoding and MC/idct.  This means, for example,
decoding an MV, and immediately performing motion compensation with
that MV.
2.  Write a paranoid-schizophrenic entropy decoder; separate load_bits
and get_bits into two functions and only call load_bits when one knows
that the bit buffer needs to be reloaded.
3.  Use a constant stride instead of a variable stride (à la x264).  Use
ring buffers instead of full-frame data for syntax elements.  Never
load any pixel data from the frame itself, only from the ring buffer
and from the left side of the previous macroblock to fill the right
side of the current one, and so forth.
4.  Frame-based multithreading (obviously).
5.  Eliminate fill_caches.  Split it into a few separate functions,
which are only called when needed.  For example, caching intra pred
data is only called before decoding an i4x4 macroblock, after the
macroblock header is parsed.
6.  Use a better compiler.  MSVC gave me a 10% performance boost on
CoreAVC; this might just be because it was optimized from the ground
up for it, I don't know.  Maybe ICC with profiling will do better for
ffmpeg.
7.  Template everything you can get your hands on.  Motion
compensation functions should be templated for weighted pred, implicit
weighted pred, bipred, non-bipred, etc.  Decoding functions should be
templated based on frametype (I, P, B).
8.  Borrow every bit of assembly you can get your hands on from x264
to squeeze out as much performance as possible.
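
To make 2) concrete, here is a rough sketch of what splitting the refill
from the read could look like.  It is not ffmpeg's GetBitContext; the
BitReader struct, br_init, get_bits_noload and read_fields are
hypothetical names made up for illustration, and the 64-bit cache size is
an assumption.  The point is only that the caller refills once and then
reads several fields without any per-read reload check:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader for illustration; not ffmpeg's GetBitContext. */
typedef struct BitReader {
    const uint8_t *buf;  /* input bytes */
    size_t size;         /* number of input bytes */
    size_t pos;          /* index of the next byte to load */
    uint64_t cache;      /* unread bits, left-aligned (MSB first) */
    unsigned bits;       /* number of valid bits in cache */
} BitReader;

static void br_init(BitReader *br, const uint8_t *buf, size_t size)
{
    br->buf = buf;
    br->size = size;
    br->pos = 0;
    br->cache = 0;
    br->bits = 0;
}

/* Refill only: tops the cache up to at least 57 valid bits.
 * Reads past the end of the input are padded with zero bits. */
static void load_bits(BitReader *br)
{
    while (br->bits <= 56) {
        uint64_t byte = 0;
        if (br->pos < br->size)
            byte = br->buf[br->pos++];
        br->cache |= byte << (56 - br->bits);
        br->bits += 8;
    }
}

/* Read only: no reload check.  The caller must know that at least n
 * bits are already cached (1 <= n <= 32). */
static inline uint32_t get_bits_noload(BitReader *br, unsigned n)
{
    uint32_t v;
    assert(n >= 1 && n <= 32 && n <= br->bits);
    v = (uint32_t)(br->cache >> (64 - n));
    br->cache <<= n;
    br->bits -= n;
    return v;
}

/* Example caller: three fields totalling 20 bits, one refill up front. */
static void read_fields(BitReader *br, uint32_t out[3])
{
    load_bits(br);                    /* guarantees >= 57 cached bits */
    out[0] = get_bits_noload(br, 4);
    out[1] = get_bits_noload(br, 8);
    out[2] = get_bits_noload(br, 8);
}
```

The "paranoid" part is that the decoder has to track, for every code
path, the worst-case number of bits consumed since the last load_bits
call.  The assert above only catches mistakes in debug builds.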

Many of these changes would involve a great deal of refactoring, both
in h264.c and dsputil.  Some would probably be completely impossible
to get past a patch review, particularly 2).
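
For a sense of what the templating in 1) and 7) means in C, the usual
trick is one inline generic function whose mode flags are constants at
each call site, plus a macro that stamps out a specialized function per
mode.  The sketch below is only an illustration: mc_block_generic,
MC_FUNC and mc_table are hypothetical names, and the weighting formula is
a simplified stand-in (fixed 6-bit shift, non-negative weights, no
offsets), not the actual H.264 weighted prediction:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic body.  'bipred' and 'weighted' are literal constants in every
 * wrapper below, so after inlining the compiler can drop dead branches. */
static inline void mc_block_generic(uint8_t *dst, const uint8_t *src0,
                                    const uint8_t *src1, ptrdiff_t stride,
                                    int w, int h, int bipred, int weighted,
                                    int w0, int w1)
{
    assert(!bipred || src1 != NULL);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int p = src0[x];
            if (bipred) {
                if (weighted)
                    p = (p * w0 + src1[x] * w1 + 32) >> 6; /* simplified */
                else
                    p = (p + src1[x] + 1) >> 1;            /* plain average */
            } else if (weighted) {
                p = (p * w0 + 32) >> 6;                    /* simplified */
            }
            dst[x] = (uint8_t)(p < 0 ? 0 : p > 255 ? 255 : p);
        }
        dst += stride;
        src0 += stride;
        if (bipred)
            src1 += stride;
    }
}

/* "Template" instantiation: one specialized function per mode. */
#define MC_FUNC(name, bipred, weighted)                                   \
static void name(uint8_t *dst, const uint8_t *src0, const uint8_t *src1,  \
                 ptrdiff_t stride, int w, int h, int w0, int w1)          \
{                                                                         \
    mc_block_generic(dst, src0, src1, stride, w, h,                       \
                     bipred, weighted, w0, w1);                           \
}

MC_FUNC(mc_copy,      0, 0)
MC_FUNC(mc_weight,    0, 1)
MC_FUNC(mc_bipred,    1, 0)
MC_FUNC(mc_bipred_wt, 1, 1)

typedef void (*mc_fn)(uint8_t *dst, const uint8_t *src0, const uint8_t *src1,
                      ptrdiff_t stride, int w, int h, int w0, int w1);

/* Indexed as mc_table[bipred][weighted]; chosen once per block,
 * not tested once per pixel. */
static const mc_fn mc_table[2][2] = {
    { mc_copy,   mc_weight    },
    { mc_bipred, mc_bipred_wt },
};
```

The same pattern would apply to CABAC vs. CAVLC or I/P/B decoding
functions: the mode check moves out of the inner loop and into a single
function pointer (or switch) selected when the mode is known.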

Dark Shikari