[FFmpeg-devel] [PATCH] split-radix FFT

Michael Niedermayer michaelni
Tue Jul 29 14:26:08 CEST 2008


On Tue, Jul 29, 2008 at 02:05:46PM +0200, Michael Niedermayer wrote:
> On Tue, Jul 29, 2008 at 01:43:43PM +0200, Michael Niedermayer wrote:
> > On Tue, Jul 29, 2008 at 12:22:59AM -0600, Loren Merritt wrote:
> > > AOn Tue, 29 Jul 2008, Michael Niedermayer wrote:
> > > > On Fri, Jul 25, 2008 at 08:14:00PM -0600, Loren Merritt wrote:
> > > >
> > > >> +#ifdef EMULATE_3DNOWEXT
> > > >> +#define PSWAPD(s,d)\
> > > >> +    "movq "#s","#d"\n"\
> > > >> +    "psrlq $32,"#d"\n"\
> > > >> +    "punpckldq "#s","#d"\n"
> > > >
> > > >> +#define PSWAPD_UNARY(s)\
> > > >> +    "sub $8, %%"REG_SP"\n"\
> > > >> +    "movd "#s", 4(%%"REG_SP")\n"\
> > > >> +    "punpckhdq (%%"REG_SP"), "#s"\n"\
> > > >> +    "add $8, %%"REG_SP"\n"
> > > >
> > > > Gcc failed with a "+m" ?
> > > 
> > > No, I just designed the 3dn1 emulation of 3dn2 for simplicity (including 
> > > code locality) rather than speed. I wouldn't have written it at all 
> > > except that then I wouldn't be able to delete the radix-2 init code. 
> > > (I still can't delete it until someone ports split-radix to altivec, 
> > > but I assume that'll happen.)
> > > 
> > > >> +static void fft4(FFTComplex *z)
> > > >>  {
> > > >> -    int ln = s->nbits;
> > > >> -    long j;
> > > >> -    x86_reg i;
> > > >> -    long nblocks, nloops;
> > > >> -    FFTComplex *p, *cptr;
> > > >> +    T2(z[0], z[1], %%mm0, %%mm1);
> > > >> +    LOAD(z[2], %%mm2);
> > > >> +    LOAD(z[3], %%mm3);
> > > >> +    T4(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5);
> > > >> +    PUNPCK(%%mm0, %%mm1, %%mm4);
> > > >> +    PUNPCK(%%mm2, %%mm3, %%mm5);
> > > >> +    SAVE(z[0], %%mm0);
> > > >> +    SAVE(z[1], %%mm4);
> > > >> +    SAVE(z[2], %%mm2);
> > > >> +    SAVE(z[3], %%mm5);
> > > >> +}
> > > >
> > > > is there any reason why seperate asm() are chained? I think a single
> > > > asm block, or even nasm/yasm if you prefer would be better.
> > > 
> > > Because it works for me, and I don't see any alternatives that are as 
> > > concise.
> > > yasm, ok.
> > 
> > I prefer code that is easy to maintain over concise code, and code that gcc
> > can silently pessimize is not easy to maintain IMHO. It easily can cost
> > someone quite some time to debug why some codec is slower on some gcc
> > version or compiled with different flags ...
> > 
> > It would of course be different if such "silent pessimization" where just
> > hypothetical but it isnt, gcc is really following murphis law here, if it
> > can mess up it does.
> > Thats why i would strogly prefer if gcc couldnt put anything at all between
> > the asm parts ...
> > 
> > 
> > > 
> > > > The way its written is almost asking for gcc to put something in between,
> > > > iam especially concerned about the -fPIC case and gcc putting all the GOT
> > > > "magic" in between the asms ...
> > > 
> > > Is gcc so stupid as to emit GOT stuff when dereferencing a pointer that's 
> > > already in a register, no global variables involved?
> > 
> > yes, examples below with gcc 4.3.1
> 
> after re-reading your question, i realize that you didnt ask for stuff
> involving globals. So my reply didnt match it, sorry ive read over
> it too quickly.
> The GOT stuff quoted though still is not strictly needed. Even though
> for some parts thats not related to the asm() being split, so again
> sorry for talking nonsense ill try to avoid writing replies when iam
> tired.
> The 2 leal though while they are use to access globals are not needed
> at all and added by gcc between asm()

Also if i use MANGLE to access the 2 root2 globals things clear up a bit in
the code generated, so i would suggest that to be done at least, even if
the asm stays split up.


-       pushl   %esi    # 76    *pushsi2        [length = 1]
-.LCFI20:
-       pushl   %ebx    # 77    *pushsi2        [length = 1]
-.LCFI21:
-       movl    12(%esp), %eax  # 2     *movsi_1/1      [length = 4]
-       call    __i686.get_pc_thunk.bx  # 78    set_got [length = 12]
-       addl    $_GLOBAL_OFFSET_TABLE_, %ebx
+       call    __i686.get_pc_thunk.cx  # 70    set_got [length = 12]
+       addl    $_GLOBAL_OFFSET_TABLE_, %ecx
+       .loc 1 327 0
+       movl    4(%esp), %eax   # 2     *movsi_1/1      [length = 4]
 .LBB21:
 .LBB22:
        .loc 1 112 0
@@ -1963,24 +1944,6 @@
        movaps 48(%eax), %xmm3
 # 0 "" 2
        .loc 1 117 0
-#NO_APP
-       leal    root2 at GOTOFF, %esi      # 14    *lea_1  [length = 6]
-       leal    root2mppm at GOTOFF, %ecx  # 15    *lea_1  [length = 6]
-#APP
 # 117 "libavcodec/i386/fft_sse.c" 1
        movaps        %xmm2, %xmm4 
 shufps $0x44, %xmm3, %xmm2 
@@ -1990,8 +1953,8 @@
 addps         %xmm4, %xmm5 
 movaps        %xmm2, %xmm4 
 shufps $0xb1, %xmm4, %xmm4 
-mulps           (%ebx,%ecx),  %xmm2 
-mulps           (%ebx,%esi),  %xmm4 
+mulps         root2mppm,  %xmm2 
+mulps         root2,  %xmm4 
 addps         %xmm4, %xmm2 
 movaps        %xmm5, %xmm4 
 shufps $0x36, %xmm2, %xmm5 
...
-       movl    $4, 12(%esp)    # 61    *movsi_1/2      [length = 8]
-       popl    %ebx    # 80    popsi1  [length = 1]
-       popl    %esi    # 81    popsi1  [length = 1]
+       movl    $4, 4(%esp)     # 55    *movsi_1/2      [length = 8]


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In a rich man's house there is no place to spit but his face.
-- Diogenes of Sinope
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080729/8604b47f/attachment.pgp>



More information about the ffmpeg-devel mailing list