[FFmpeg-devel] [HACK] 50% faster H.264 decoding

Wed Aug 18 18:42:11 CEST 2010

Hi,

On Tue, Aug 17, 2010 at 1:35 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Tue, Aug 17, 2010 at 11:01:03AM -0400, Ronald S. Bultje wrote:
>> On Mon, Aug 16, 2010 at 6:40 PM, Jason Garrett-Glaser
>> <darkshikari at gmail.com> wrote:
>> > On Mon, Aug 16, 2010 at 3:35 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> >> Hi,
>> >>
>> >> On Wed, Aug 11, 2010 at 5:32 PM, Jason Garrett-Glaser
>> >> <darkshikari at gmail.com> wrote:
>> >>> 13. Use MPEG-2 MC for chroma MC, since we know that MVs are
>> >>> fullpel-only.  Simplify edge emulation stuff accordingly too.
>> >>
>> >> Does h264 chroma subpel actually use a memcpy shortcut if it's
>> >> fullpel? I don't remember exactly, but I don't think it has such a
>> >> shortcut for chroma, only for luma.
>> >
>> > It doesn't.  It should at least have a shortcut for the 0,0 motion
>> > vector because it's very high probability (relative to other fullpel
>> > motion vectors that result in no chroma interpolation).  For other
>> > cases, it might or might not be worthwhile to add a branch in the asm
>> > to the 1D-only case.
>>
>> Attached sets up framework for that. The [0] functions can be copied
>> straight from VP8 (they are pixel_copy functions, with very fast
>> aligned implementations for all relevant archs) and others, and should
>> make VC-1, RV3/4, h264, H264/MPEG etc. significantly faster for the
>> MVxy==0 case. The [1]/[2] functions are probably going to be faster as
>> well but that would need some testing to see how big the effect is.
>> [3] is the function as-is now, which should obviously stay the way it
>> is.
>>
>> Michael, OK to apply this? It's mostly just changing all kind of files
>
> if its not slower ...

Same speed. Attached is an updated version that fixes a bug that showed
up in one of the FATE samples, where mx gets changed and we therefore
called the wrong version.

I've tested this version with a semi-finished patch that splits up the
h264 chroma MC functions (particularly the mc8 ones) into smaller
ones, giving cleaner, branch-free handling of mx==0/my==0. This will
remove most (if not all) of the branching, which might give a minor
speedup. It also removes a little duplicate code (in the binary, not
the source): e.g. the fullpel handling in the mmx/3dnow/mmx2/ssse3
rv40/h264/vc1 mc8 functions is identical (it's all put_pixels8_mmx)
and only needs a single function. I'm only doing this for the C and
x86 versions because I can't test any of the others.
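
To make the idea concrete, here is a rough C sketch of such a split. It
is not the attached patch: the names (chroma_mc_func, put_chroma8_*,
put_chroma8_tab) and the index formula (mx != 0) | ((my != 0) << 1) are
assumptions made up for illustration, chosen so that [0] is the plain
copy, [1]/[2] are the 1D filters and [3] is the existing 2D filter.
Only the bilinear weights with the +32 >> 6 rounding come from the
H.264 chroma interpolation itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature; mx/my are the 1/8-pel chroma fractions (0..7). */
typedef void (*chroma_mc_func)(uint8_t *dst, const uint8_t *src,
                               int stride, int h, int mx, int my);

/* [0]: mx == 0 && my == 0, a plain 8-pixel-wide block copy. */
static void put_chroma8_copy(uint8_t *dst, const uint8_t *src,
                             int stride, int h, int mx, int my)
{
    (void)mx; (void)my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        memcpy(dst, src, 8);
}

/* [1]: my == 0, horizontal 1D filter (2D formula with one factor of 8 removed). */
static void put_chroma8_h(uint8_t *dst, const uint8_t *src,
                          int stride, int h, int mx, int my)
{
    (void)my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < 8; x++)
            dst[x] = ((8 - mx) * src[x] + mx * src[x + 1] + 4) >> 3;
}

/* [2]: mx == 0, vertical 1D filter. */
static void put_chroma8_v(uint8_t *dst, const uint8_t *src,
                          int stride, int h, int mx, int my)
{
    (void)mx;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < 8; x++)
            dst[x] = ((8 - my) * src[x] + my * src[x + stride] + 4) >> 3;
}

/* [3]: general case, the full 2D bilinear filter. */
static void put_chroma8_2d(uint8_t *dst, const uint8_t *src,
                           int stride, int h, int mx, int my)
{
    const int A = (8 - mx) * (8 - my), B = mx * (8 - my);
    const int C = (8 - mx) * my,       D = mx * my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < 8; x++)
            dst[x] = (A * src[x]          + B * src[x + 1] +
                      C * src[x + stride] + D * src[x + stride + 1] + 32) >> 6;
}

static const chroma_mc_func put_chroma8_tab[4] = {
    put_chroma8_copy, put_chroma8_h, put_chroma8_v, put_chroma8_2d,
};

/* The caller picks the variant once per block; the kernels have no mx/my branches. */
static void put_chroma8(uint8_t *dst, const uint8_t *src,
                        int stride, int h, int mx, int my)
{
    assert(mx >= 0 && mx < 8 && my >= 0 && my < 8);
    put_chroma8_tab[(mx != 0) | ((my != 0) << 1)](dst, src, stride, h, mx, my);
}
```

The 1D variants give bit-identical output to the 2D filter, since
(8*k + 32) >> 6 equals (k + 4) >> 3 for any non-negative integer k, so
in this sketch the split only changes which code runs, not the result.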

After that's done, I plan to do a third patch which will add fullpel
or 1D-filter versions for mc4/mc2 as well, which should actually
provide a speedup for code on our desktops, as we saw for Jason's
hackpatch.

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-chroma-mvzero-shortcut.patch
Type: application/octet-stream
Size: 38965 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100818/415cab10/attachment.obj>