[FFmpeg-devel] [PATCH] split-radix FFT

Michael Niedermayer michaelni
Tue Jul 29 14:47:12 CEST 2008


On Tue, Jul 29, 2008 at 02:26:08PM +0200, Michael Niedermayer wrote:
> On Tue, Jul 29, 2008 at 02:05:46PM +0200, Michael Niedermayer wrote:
> > On Tue, Jul 29, 2008 at 01:43:43PM +0200, Michael Niedermayer wrote:
> > > On Tue, Jul 29, 2008 at 12:22:59AM -0600, Loren Merritt wrote:
> > > > AOn Tue, 29 Jul 2008, Michael Niedermayer wrote:
> > > > > On Fri, Jul 25, 2008 at 08:14:00PM -0600, Loren Merritt wrote:
> > > > >
> > > > >> +#ifdef EMULATE_3DNOWEXT
> > > > >> +#define PSWAPD(s,d)\
> > > > >> +    "movq "#s","#d"\n"\
> > > > >> +    "psrlq $32,"#d"\n"\
> > > > >> +    "punpckldq "#s","#d"\n"
> > > > >
> > > > >> +#define PSWAPD_UNARY(s)\
> > > > >> +    "sub $8, %%"REG_SP"\n"\
> > > > >> +    "movd "#s", 4(%%"REG_SP")\n"\
> > > > >> +    "punpckhdq (%%"REG_SP"), "#s"\n"\
> > > > >> +    "add $8, %%"REG_SP"\n"
> > > > >
> > > > > Gcc failed with a "+m" ?
> > > > 
> > > > No, I just designed the 3dn1 emulation of 3dn2 for simplicity (including 
> > > > code locality) rather than speed. I wouldn't have written it at all 
> > > > except that then I wouldn't be able to delete the radix-2 init code. 
> > > > (I still can't delete it until someone ports split-radix to altivec, 
> > > > but I assume that'll happen.)
> > > > 
> > > > >> +static void fft4(FFTComplex *z)
> > > > >>  {
> > > > >> -    int ln = s->nbits;
> > > > >> -    long j;
> > > > >> -    x86_reg i;
> > > > >> -    long nblocks, nloops;
> > > > >> -    FFTComplex *p, *cptr;
> > > > >> +    T2(z[0], z[1], %%mm0, %%mm1);
> > > > >> +    LOAD(z[2], %%mm2);
> > > > >> +    LOAD(z[3], %%mm3);
> > > > >> +    T4(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5);
> > > > >> +    PUNPCK(%%mm0, %%mm1, %%mm4);
> > > > >> +    PUNPCK(%%mm2, %%mm3, %%mm5);
> > > > >> +    SAVE(z[0], %%mm0);
> > > > >> +    SAVE(z[1], %%mm4);
> > > > >> +    SAVE(z[2], %%mm2);
> > > > >> +    SAVE(z[3], %%mm5);
> > > > >> +}
> > > > >
> > > > > is there any reason why seperate asm() are chained? I think a single
> > > > > asm block, or even nasm/yasm if you prefer would be better.
> > > > 
> > > > Because it works for me, and I don't see any alternatives that are as 
> > > > concise.
> > > > yasm, ok.
> > > 
> > > I prefer code that is easy to maintain over concise code, and code that gcc
> > > can silently pessimize is not easy to maintain IMHO. It easily can cost
> > > someone quite some time to debug why some codec is slower on some gcc
> > > version or compiled with different flags ...
> > > 
> > > It would of course be different if such "silent pessimization" where just
> > > hypothetical but it isnt, gcc is really following murphis law here, if it
> > > can mess up it does.
> > > Thats why i would strogly prefer if gcc couldnt put anything at all between
> > > the asm parts ...
> > > 
> > > 
> > > > 
> > > > > The way its written is almost asking for gcc to put something in between,
> > > > > iam especially concerned about the -fPIC case and gcc putting all the GOT
> > > > > "magic" in between the asms ...
> > > > 
> > > > Is gcc so stupid as to emit GOT stuff when dereferencing a pointer that's 
> > > > already in a register, no global variables involved?
> > > 
> > > yes, examples below with gcc 4.3.1
> > 
> > after re-reading your question, i realize that you didnt ask for stuff
> > involving globals. So my reply didnt match it, sorry ive read over
> > it too quickly.
> > The GOT stuff quoted though still is not strictly needed. Even though
> > for some parts thats not related to the asm() being split, so again
> > sorry for talking nonsense ill try to avoid writing replies when iam
> > tired.
> > The 2 leal though while they are use to access globals are not needed
> > at all and added by gcc between asm()
> 
> Also if i use MANGLE to access the 2 root2 globals things clear up a bit in
> the code generated, so i would suggest that to be done at least, even if
> the asm stays split up.

And with m1m1m1m1/p1m1p1m1/ff_cos_16 also accessed through MANGLE
The remainder of what i complained about dissapears

@@ -42,8 +42,6 @@
        .type   fft8, @function
 fft8:
        .loc 1 97 0
-       call    __i686.get_pc_thunk.cx  # 30    set_got [length = 12]
-       addl    $_GLOBAL_OFFSET_TABLE_, %ecx
        .loc 1 97 0
        movl    4(%esp), %eax   # 2     *movsi_1/1      [length = 4]
        .loc 1 98 0
...
 fft16:
        .loc 1 111 0
-       call    __i686.get_pc_thunk.cx  # 38    set_got [length = 12]
-       addl    $_GLOBAL_OFFSET_TABLE_, %ecx
        .loc 1 111 0
        movl    4(%esp), %eax   # 2     *movsi_1/1      [length = 4]
        .loc 1 112 0


and a few things that i did not realize where there in the first place:
@@ -1678,10 +1662,9 @@
 shufps $0xdd, %xmm0, %xmm7 
 
        .loc 1 129 0
-       movl    ff_cos_16 at GOT(%ebx), %edx       # 130   *movsi_1/1      [length = 6]
-       movaps          (%edx), %xmm0 
+       movaps      ff_cos_16, %xmm0 
 movaps      %xmm4, %xmm2 
-movaps       16+(%edx), %xmm1 
+movaps       16+ff_cos_16, %xmm1 
 movaps      %xmm5, %xmm3 
 mulps       %xmm0, %xmm2 
 mulps       %xmm1, %xmm3 
...

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Its not that you shouldnt use gotos but rather that you should write
readable code and code with gotos often but not always is less readable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080729/35aff048/attachment.pgp>



More information about the ffmpeg-devel mailing list