[FFmpeg-devel] [HACK] 50% faster H.264 decoding

Michael Niedermayer michaelni
Fri Aug 20 01:00:11 CEST 2010


On Thu, Aug 19, 2010 at 06:36:56PM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Thu, Aug 19, 2010 at 4:46 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> > On Thu, Aug 19, 2010 at 4:34 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >> On Thu, Aug 19, 2010 at 12:55 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >>> On Thu, Aug 19, 2010 at 9:56 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >>>> On Wed, Aug 18, 2010 at 6:44 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >>>>> On Wed, Aug 18, 2010 at 6:28 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >>>>>> On Wed, Aug 18, 2010 at 12:42:11PM -0400, Ronald S. Bultje wrote:
> >>>>>>> On Tue, Aug 17, 2010 at 1:35 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >>>>>>> > On Tue, Aug 17, 2010 at 11:01:03AM -0400, Ronald S. Bultje wrote:
> >>>>>>> >> On Mon, Aug 16, 2010 at 6:40 PM, Jason Garrett-Glaser
> >>>>>>> >> <darkshikari at gmail.com> wrote:
> >>>>>>> >> > On Mon, Aug 16, 2010 at 3:35 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >>>>>>> >> >> Hi,
> >>>>>>> >> >>
> >>>>>>> >> >> On Wed, Aug 11, 2010 at 5:32 PM, Jason Garrett-Glaser
> >>>>>>> >> >> <darkshikari at gmail.com> wrote:
> >>>>>>> >> >>> 13. Use MPEG-2 MC for chroma MC, since we know that MVs are
> >>>>>>> >> >>> fullpel-only.  Simplify edge emulation stuff accordingly too.
> >>>>>>> >> >>
> >>>>>>> >> >> Does h264 chroma subpel actually use a memcpy shortcut if it's
> >>>>>>> >> >> fullpel? I don't remember exactly, but I don't think it has such a
> >>>>>>> >> >> shortcut for chroma, only for luma.
> >>>>>>> >> >
> >>>>>>> >> > It doesn't.  It should at least have a shortcut for the 0,0 motion
> >>>>>>> >> > vector because it's very high probability (relative to other fullpel
> >>>>>>> >> > motion vectors that result in no chroma interpolation).  For other
> >>>>>>> >> > cases, it might or might not be worthwhile to add a branch in the asm
> >>>>>>> >> > to the 1D-only case.
> >>>>>>> >>
> >>>>>>> >> Attached sets up framework for that. The [0] functions can be copied
> >>>>>>> >> straight from VP8 (they are pixel_copy functions, with very fast
> >>>>>>> >> aligned implementations for all relevant archs) and others, and should
> >>>>>>> >> make VC-1, RV3/4, h264, H264/MPEG etc. significantly faster for the
> >>>>>>> >> MVxy==0 case. The [1]/[2] functions are probably going to be faster as
> >>>>>>> >> well but that would need some testing to see how big the effect is.
> >>>>>>> >> [3] is the function as-is now, which should obviously stay the way it
> >>>>>>> >> is.
> >>>>>>> >>
> >>>>>>> >> Michael, OK to apply this? It's mostly just changing all kinds of files
> >>>>>>> >
> >>>>>>> > if it's not slower ...
> >>>>>>>
> >>>>>>> Same speed. Attached is an updated version that fixes a bug in one of
> >>>>>>> the fate samples where mx gets changed and thus we called the wrong
> >>>>>>> version.
> >>>>>>>
> >>>>>>> I've tested this version with a semi-finished patch that splits up the
> >>>>>>> h264 chroma MC functions (particularly the mc8 ones) into smaller
> >>>>>>> ones, thus having cleaner (and unbranched) handling of mx==0/my==0.
> >>>>>>> This will remove most (if not all) of the branching, which might give
> >>>>>>> a minor speedup, and also removes a little duplicate code (in the
> >>>>>>> binary, not source), e.g. the fullpel handling between
> >>>>>>> mmx/3dnow/mmx2/ssse3 rv40/h264/vc1 mc8 is identical (it's all
> >>>>>>> put_pixels8_mmx) and only needs a single function. I'm only doing this
> >>>>>>> for the C and x86 ones because I can't test any of the others.
> >>>>>>>
> >>>>>>> After that's done, I plan to do a third patch which will add fullpel
> >>>>>>> or 1D-filter versions for mc4/mc2 as well, which should actually
> >>>>>>> provide a speedup for code on our desktops, as we saw for Jason's
> >>>>>>> hackpatch.
> >>>>>>>
> >>>>>>> Ronald
> >>>>>>
> >>>>>>>  arm/dsputil_init_neon.c |   32 ++++++++++---
> >>>>>>>  cavs.c                  |   13 ++---
> >>>>>>>  dsputil.c               |   40 +++++++++++++---
> >>>>>>>  dsputil.h               |   12 ++--
> >>>>>>>  h264.c                  |   24 +++++----
> >>>>>>>  mpegvideo.c             |   28 ++++++-----
> >>>>>>>  ppc/h264_altivec.c      |   20 ++++++--
> >>>>>>>  rv34.c                  |    9 ++-
> >>>>>>>  rv40dsp.c               |   20 ++++++--
> >>>>>>>  sh4/dsputil_align.c     |   30 +++++++++---
> >>>>>>>  vc1dec.c                |   33 +++++++------
> >>>>>>>  vp6.c                   |    6 +-
> >>>>>>>  x86/dsputil_mmx.c       |  118 +++++++++++++++++++++++++++++++++++++-----------
> >>>>>>>  13 files changed, 272 insertions(+), 113 deletions(-)
> >>>>>>> 183027123a1213b2e037504a01d87c9c0678c1db  h264-chroma-mvzero-shortcut.patch
> >>>>>>
> >>>>>> no objections
> >>>>>
> >>>>> Attached are the follow-up patches, C-only for now (still working on the asm).
> >>>>>
> >>>>> Patch #1 splits the H264 macro function creation macros into two, and
> >>>>> makes vc1_no_rnd use this macro instead of re-doing its own version of
> >>>>> it. Patch somehow thinks I changed mc2 into mc8, mc4 into mc2 and mc8
> >>>>> into mc4, rather than seeing I moved mc8 up from below, but the patch
> >>>>> should be readable nevertheless.
> >>>>>
> >>>>> Patch #2 then splits the C functions into 3: one each for x=0 or y=0,
> >>>>> and the remaining one for 2D bilinear filtering. It also adds one for
> >>>>> the case where x=0 AND y=0 (direct copy). Make fate has no objections.
> >>>>> There is no speed change for 1D/2D. The direct copy would be expected
> >>>>> to be faster but I didn't test because the C code isn't that relevant.
> >>>>> I can test if you prefer, but I'd rather focus on the asm functions
> >>>>> and make sure every change there is speed-tested. If you want, I can
> >>>>> move the adding of the direct copy functions to a separate patch, but
> >>>>> I didn't think that was necessary.
> >>>>>
> >>>>> I will do similar splits to the asm code
> >>>> [..]
> >>>>
> >>>> And these can be found in attached. I've checked make fate for MMX,
> >>>> MMX2 and SSSE3 and all is identical. I will do some basic performance
> >>>> checks to make sure I didn't screw up anything, but speed should be
> >>>> identical except maybe for MMX avg_mc8 for x=0&&y=0, which is added by
> >>>> this patch (it was pretty much a one-liner). This is generally not
> >>>> used since MMX2/3DNOW versions are available also. If wanted, I can
> >>>> separate this or remove it.
> >>>>
> >>>> Next step is to actually implement new functions for 1D/no-filter
> >>>> mc4/mc2 which leads to the actually wanted speedup.
> >>>
> >>> Example of such an optimization attached, so we can start applying
> >>> this whole thing (now that I'm showing an actual improvement in
> >>> performance :-) ).
> >>>
> >>> START/STOP_TIMER around chroma_op[]() in h264.c, measuring only the
> >>> case where mx=0, my=0 and chroma_function_index=1 (local hack). CPU is
> >>> Intel Core i7 (Macbook Pro, OSX 10.6.4). GCC:
> >>> i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664).
> >>> Sample: /Users/ronaldbultje/Movies/fate-suite/h264-conformance/MR3_TANDBERG_B.264
> >>>
> >>> after:
> >>> 1925 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
> >>> 2075 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
> >>> 2445 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
> >>> 1903 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
> >>> 1792 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
> >>> 1609 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
> >>>
> >>> before (here it would use the 2D filter ssse3 code):
> >>> 2990 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
> >>> 2850 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
> >>> 2917 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
> >>> 2623 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
> >>> 2505 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
> >>> 2518 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
> >>>
> >>> C-only (the version after my patches applied, so the 32-bit direct
> >>> read/write loop):
> >>> 5230 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
> >>> 5215 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
> >>> 5755 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
> >>> 4255 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
> >>> 3819 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
> >>> 3772 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
> >>
> >> By popular request, here's one that adds the new code to
> >> dsputil_yasm.asm instead of dsputil_mmx.c. Now I can actually read my
> >> own code, too. make fate-h264 didn't complain about this change.
> >
> > Now with actual new patch. Thanks Alex for noticing.
> 
> And same patch for mc2,x=0,y=0.
> 
> after (direct copy mc2):
> 2178 dezicycles in w=2,mx=0,my=0, 512 runs, 0 skips
> 2096 dezicycles in w=2,mx=0,my=0, 1023 runs, 1 skips
> 2109 dezicycles in w=2,mx=0,my=0, 2047 runs, 1 skips
> 
> before (ssse3 2D filter):
> 2493 dezicycles in w=2,mx=0,my=0, 511 runs, 1 skips
> 2469 dezicycles in w=2,mx=0,my=0, 1022 runs, 2 skips
> 2477 dezicycles in w=2,mx=0,my=0, 2046 runs, 2 skips
> 
> c (note how this is actually faster than ssse3 2D filter):
> 2329 dezicycles in w=2,mx=0,my=0, 512 runs, 0 skips
> 2354 dezicycles in w=2,mx=0,my=0, 1024 runs, 0 skips
> 2334 dezicycles in w=2,mx=0,my=0, 2048 runs, 0 skips

can you show benchmarks of w=2 without limiting to mx/my=0?
we know the 0,0 case will be faster if it's optimized by adding a special
case, but we don't know if the cost of the additional branch mispredictions
and code cache pressure will be less than that gain, so i think this should
be tested

if it's faster, the patches should be ok, provided you looked at what you
did before commit
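
For readers following the thread, here is a minimal C sketch of the kind of
dispatch being discussed: a 4-entry table selecting copy, 1D or 2D bilinear
chroma interpolation from mx/my. It is not the code from the patches. All
names (chroma_mc_fn, mc_copy, mc_h, mc_v, mc_2d, chroma_mc_tab, chroma_mc)
are hypothetical, and the assignment of index 1 to "mx only" and index 2 to
"my only" is an assumption; the thread only says [0] is the copy, [1]/[2]
are the 1D cases and [3] is the existing 2D function. The arithmetic is the
standard H.264 eighth-pel bilinear chroma filter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature: dst and src share one stride; mx, my are in 0..7. */
typedef void (*chroma_mc_fn)(uint8_t *dst, const uint8_t *src, int stride,
                             int w, int h, int mx, int my);

/* mx == 0 && my == 0: the filter degenerates to a plain block copy. */
static void mc_copy(uint8_t *dst, const uint8_t *src, int stride,
                    int w, int h, int mx, int my)
{
    (void)mx; (void)my;
    for (int y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, (size_t)w);
}

/* my == 0: horizontal-only 1D filter. */
static void mc_h(uint8_t *dst, const uint8_t *src, int stride,
                 int w, int h, int mx, int my)
{
    (void)my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)(((8 - mx) * src[x] + mx * src[x + 1] + 4) >> 3);
}

/* mx == 0: vertical-only 1D filter. */
static void mc_v(uint8_t *dst, const uint8_t *src, int stride,
                 int w, int h, int mx, int my)
{
    (void)mx;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)(((8 - my) * src[x] + my * src[x + stride] + 4) >> 3);
}

/* General case: 2D bilinear filter with weights summing to 64. */
static void mc_2d(uint8_t *dst, const uint8_t *src, int stride,
                  int w, int h, int mx, int my)
{
    const int A = (8 - mx) * (8 - my), B = mx * (8 - my);
    const int C = (8 - mx) * my,       D = mx * my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)((A * src[x] + B * src[x + 1] +
                                C * src[x + stride] + D * src[x + stride + 1] +
                                32) >> 6);
}

/* Assumed index layout: bit 0 = (mx != 0), bit 1 = (my != 0). */
static const chroma_mc_fn chroma_mc_tab[4] = { mc_copy, mc_h, mc_v, mc_2d };

static void chroma_mc(uint8_t *dst, const uint8_t *src, int stride,
                      int w, int h, int mx, int my)
{
    assert(mx >= 0 && mx < 8 && my >= 0 && my < 8);
    chroma_mc_tab[(mx != 0) | ((my != 0) << 1)](dst, src, stride, w, h, mx, my);
}
```

The table lookup replaces a chain of branches with one indexed call, but it
still adds extra functions to the binary. The 0,0-only timings cannot show
whether that costs more in code cache pressure than it gains, which is why
numbers over all mx/my values are needed.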

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Let us carefully observe those good qualities wherein our enemies excel us
and endeavor to excel them, by avoiding what is faulty, and imitating what
is excellent in them. -- Plutarch