[FFmpeg-devel] [PATCH] Further optimization of base64 decode using AV_WB32.

Michael Niedermayer michaelni at gmx.at
Sat Jan 21 20:47:59 CET 2012


On Sat, Jan 21, 2012 at 06:53:53PM +0100, Reimar Döffinger wrote:
> On Sat, Jan 21, 2012 at 06:39:29PM +0100, Reimar Döffinger wrote:
> > On Sat, Jan 21, 2012 at 06:30:48PM +0100, Reimar Döffinger wrote:
> > > On Sat, Jan 21, 2012 at 06:13:19PM +0100, Reimar Döffinger wrote:
> > > > On Sat, Jan 21, 2012 at 05:56:32PM +0100, Michael Niedermayer wrote:
> > > > > On Sat, Jan 21, 2012 at 05:52:27PM +0100, Reimar Döffinger wrote:
> > > > > > This is somewhat questionable.
> > > > > > The biggest issue is that av_bswap32 is not replaced
> > > > > > with our asm version on gcc 4.5 or newer.
> > > > > > This causes gcc to generate horrible code that is slower
> > > > > > than the unoptimized variant.
> > > > > > Old:                                  248852 decicycles
> > > > > > New with gcc's attempt at av_bswap32: 256576 decicycles
> > > > > > New with our bswap32:                 200260 decicycles
> > > > > [...]
> > > > > > diff --git a/libavutil/x86/bswap.h b/libavutil/x86/bswap.h
> > > > > > index 52ffb4d..aa39d97 100644
> > > > > > --- a/libavutil/x86/bswap.h
> > > > > > +++ b/libavutil/x86/bswap.h
> > > > > > @@ -37,7 +37,7 @@ static av_always_inline av_const unsigned av_bswap16(unsigned x)
> > > > > >  }
> > > > > >  #endif /* !AV_GCC_VERSION_AT_LEAST(4,1) */
> > > > > >  
> > > > > > -#if !AV_GCC_VERSION_AT_LEAST(4,5)
> > > > > > +#if 1 || !AV_GCC_VERSION_AT_LEAST(4,5)
> > > > > >  #define av_bswap32 av_bswap32
> > > > > >  static av_always_inline av_const uint32_t av_bswap32(uint32_t x)
> > > > > >  {
> > > > > 
> > > > > also make sure -cpu/arch/tune is set so gcc is allowed to use bswap
> > > > > (it's 486+), so gcc cannot use it on strict x86
> > > > 
> > > > It is an x86_64 build, so I'd hope that gcc will not try to "optimize"
> > > > for 486 on that...
> > > 
> > > gcc version is actually 4.6.2 and it fails to use the bswap instruction
> > > regardless of whether I use no extra options, -march=native, -m32, or
> > > -m32 -march=native.
> > > In all cases the code without our inline bswap is significantly slower
> > > (ca. 20%).
> > > I have no idea where the claim that gcc would recognize the bswap comes
> > > from (hm, I haven't tested if the << 8 confuses it though, will now).
> > 
> > Yes, only completely removing the shift fixes it.
> > One option would be to make the table 16 bit to avoid that shift.
> > However my tests show that even though this saves the shift instruction
> > the code does not become any faster in 64 bit mode and only maybe 2% in
> > 32 bit mode (except of course for unbreaking the compiler), so it
> > seems quite wasteful.
> 
> This works, too; it actually seems about 2% faster than with our bswap
> asm (I assume better scheduling):

What effect does this have on ARM and/or (let's randomly pick) mpeg2
decoding speed?
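
For reference, the pattern discussed in the quoted text can be sketched roughly
as below. The patch itself is not shown in this mail, so this is only an
illustration under assumptions: the names swap32_generic, pack24 and store_be32
are hypothetical and are not the libavutil functions. pack24 left-aligns 24
decoded bits in a 32-bit word (the "<< 8" mentioned above), store_be32 writes
that word out big-endian in the spirit of AV_WB32, and swap32_generic is a
plain shift-and-mask byte swap of the kind a compiler may or may not turn into
a single bswap instruction.

```c
#include <stdint.h>

/* Hypothetical portable byte swap built from shifts and masks.
 * Whether a given compiler reduces this to one bswap instruction
 * depends on compiler version and surrounding code. */
static inline uint32_t swap32_generic(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Pack four 6-bit values into 24 bits, then left-align them in a
 * 32-bit word so the three payload bytes come first in big-endian order. */
static inline uint32_t pack24(const uint8_t s[4])
{
    uint32_t v = (uint32_t)s[0] << 18 | (uint32_t)s[1] << 12 |
                 (uint32_t)s[2] << 6  | (uint32_t)s[3];
    return v << 8;
}

/* Endian-independent big-endian store, one byte at a time. */
static inline void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

With the sextets 19, 22, 5, 46 (the base64 text "TWFu"), store_be32(out,
pack24(s)) leaves the bytes 'M', 'a', 'n' in out[0..2] and a zero in out[3].
On a little-endian machine a single 32-bit store of the byte-swapped word gives
the same memory contents, which is where the cost of the swap matters.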


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No human being will ever know the Truth, for even if they happen to say it
by chance, they would not even know they had done so. -- Xenophanes