[FFmpeg-devel] [PATCH] Move MLP's dot product to DSPContext

Michael Niedermayer michaelni
Thu Apr 30 02:58:46 CEST 2009


On Wed, Apr 29, 2009 at 01:15:14AM -0300, Ramiro Polla wrote:
> Hi,
> 
> On Tue, Apr 21, 2009 at 12:31 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Mon, Apr 20, 2009 at 11:01:10PM -0300, Ramiro Polla wrote:
> >> On Mon, Apr 20, 2009 at 9:40 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >> > On Mon, Apr 20, 2009 at 02:29:09AM -0300, Ramiro Polla wrote:
> >> >> On Mon, Apr 20, 2009 at 12:14 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >> >> > On Sun, Apr 19, 2009 at 10:10:05PM -0300, Ramiro Polla wrote:
> >> >> >> The attached patch moves MLP's dot product to DSPContext. The filter
> >> >> >> order is at most 8, and in the rematrix stage it's at most 5+2
> >> >> >> channels for MLP and 7+0 channels for TrueHD, so it all fits in 8
> >> >> >> (hopefully) optimized functions.
> >> >> >
> >> >> > the functions are too small, the call overhead is too much
> >> >> > 1-8 multiplications and 1-8 additions are not enough ...
> >> >>
> >> >> I thought that would happen too, but strangely there was a speedup.
> >> >
> >> > you wrote the whole function in asm() and that was slower?
> >>
> >> Attached are three asm variants: sse2, sse4, and altivec.
> >
> > 1. i meant non SIMD asm :)
> > If one wanted to do this in SIMD, it should do several channels
> > at once, or FIR & IIR at once, or several blocks at once; then
> > SIMD should be faster, but as is it's not SIMD-friendly
> >
> > 2. i mean the whole outer function not the 1-8 arithemtic ops one
> > this one as said is too small, the call overhead will kill it when
> > you try the same code (that is asm not gcc deoptiranomized C)
> >
> > ahh and note, gcc and 64 operations -> very poor code, naive asm
> > will be much faster at least it was that way in the past ...
> 
> After some of my latest commits (especially the one that cut down the
> buffer sizes), mlp in general got faster, but gcc decided to vectorize
> filter_channels() on x86_64, so the current speeds for 7.1.thd are
> now:
> ppc   : 9750 ms
> x86_32: 3585 ms
> x86_64: 2546 ms
> x86_64: 2142 ms (-fno-tree-vectorize)
> 
> After 0001-mlpdec-Move-MLP-s-filter_channel-to-dsputils.patch:
> 
> ppc   : 9220 ms
> x86_32: 3046 ms
> x86_64: 2504 ms
> x86_64 with -fno-tree-vectorize was not measured because of a hard disk
> failure (and I won't be able to test again for a while), but IIRC it
> was something around 2100 ms.
> 
> After 0002-mlpdec-x86_-32-64-optimized-mlp_filter_channel.patch:
> 
> x86_32: 2951 ms
> x86_64: 1897 ms
> 
> After 0003-Check-for-SSE4.patch and
> 0004-mlpdec-sse4-optimized-mlp_filter_channel.patch:
> x86_32: 2985 ms
> 
> The sse4 code is around the same speed as x86_32 (and slower than
> x86_64). Unless it can be optimized much further, I don't think it's
> worth spending more time on it. I also decided to drop the mmx2 code,
> since it was always slower, unless someone has a good idea to optimize
> it much more.
> 
> rematrix_channels() is slower when moved into dsputils, and more
> annoying to optimize in assembly because of the if (matrix_noise_shift)
> branch. It showed a great speedup when optimized with altivec, though.
> (This part is not done yet.)
> 
> Ramiro Polla - looking for a way to bring a hard disk back to life

>  Makefile  |    4 ++--
>  dsputil.c |    8 ++++++++
>  dsputil.h |    6 ++++++
>  mlpdec.c  |   34 ++++++++++------------------------
>  mlpdsp.c  |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 82 insertions(+), 26 deletions(-)
> 8a0d237b76f1e656c084d0d921fbd92a96fcf02b  0001-mlpdec-Move-MLP-s-filter_channel-to-dsputils.patch
> From d2cfa257f0c5a7a4b42314c23d3bdfe4be49d319 Mon Sep 17 00:00:00 2001
> From: Ramiro Polla <ramiro at lisha.ufsc.br>
> Date: Tue, 28 Apr 2009 23:54:49 -0300
> Subject: [PATCH] mlpdec: Move MLP's filter_channel to dsputils.

ok


[...]

> +void ff_mlp_filter_channel_x86_64(int32_t *firbuf, const int32_t *fircoeff, int firorder,
> +                                  int32_t *iirbuf, const int32_t *iircoeff, int iirorder,
> +                                  unsigned int filter_shift, int32_t mask, int blocksize,
> +                                  int32_t *sample_buffer)
> +{
> +    void *firjump = ff_mlp_firtable_x86_64[firorder];
> +    void *iirjump = ff_mlp_iirtable_x86_64[iirorder];
> +
> +    blocksize = -blocksize;
> +
> +    __asm__ volatile(
> +        "1:                        \n\t"
> +        "xor     %%rsi      , %%rsi\n\t"
> +        "jmp    *%[firjump]        \n\t"
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x1c, ff_mlp_firorder_x86_64_8)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x18, ff_mlp_firorder_x86_64_7)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x14, ff_mlp_firorder_x86_64_6)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x10, ff_mlp_firorder_x86_64_5)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x0c, ff_mlp_firorder_x86_64_4)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x08, ff_mlp_firorder_x86_64_3)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x04, ff_mlp_firorder_x86_64_2)
> +        MUL64("%[firbuf]", "%[fircoeff]", 0x00, ff_mlp_firorder_x86_64_1)
> +        MANGLE(ff_mlp_firorder_x86_64_0)":\n\t"
> +        "jmp    *%[iirjump]        \n\t"
> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x0c, ff_mlp_iirorder_x86_64_4)
> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x08, ff_mlp_iirorder_x86_64_3)
> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x04, ff_mlp_iirorder_x86_64_2)
> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x00, ff_mlp_iirorder_x86_64_1)

you probably could put some of the coeffs in registers


> +        MANGLE(ff_mlp_iirorder_x86_64_0)":\n\t"

> +        "mov     %%rsi      , %%rax\n\t"

useless


> +        "shr     %%cl       , %%rax\n\t"
> +
> +        "mov     %%rax      , %%rdx\n\t"
> +        "add    (%[sample]) , %%rax\n\t"
> +        "and     %[mask]    , %%rax\n\t"
> +        "sub              $4,  %[firbuf]\n\t"
> +        "sub              $4,  %[iirbuf]\n\t"

these 2 buffers can apparently be merged simplifying addressing


> +        "mov     %%eax      , (%[firbuf])\n\t"
> +        "mov     %%eax      , (%[sample])\n\t"

this looks mildly redundant ...


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Many things microsoft did are stupid, but not doing something just because
microsoft did it is even more stupid. If everything ms did were stupid they
would be bankrupt already.