Loren Merritt lorenm
Fri Aug 8 04:22:35 CEST 2008

```
On Thu, 7 Aug 2008, Michael Niedermayer wrote:
>
> I am not sure if it's worth it to simplify this, but I think if we don't attempt
> to mask off the high bits inside the function then the following might work:
>
> if(!(i & m))          return  split_radix_permutation(i, m, inverse)<<1;
> m >>= 1;
> if(inverse == !(i&m)) return (split_radix_permutation(i, m, inverse)<<2) + 1;
> else                  return (split_radix_permutation(i, m, inverse)<<2) - 1;

done

> s->revtab[(-split_radix_permutation(i, n, s->inverse)) & (n-1)] = i;

done

> It would be nice if the forced duplication could be limited to
> #ifndef CONFIG_SMALL unless it's significantly slower that way

I tried several combinations of recursive fft##n and/or re-rolling
pass{,_big} and/or re-rolling fft16 and/or removing pass or pass_big.
I can make it smaller and retain speed on core2 or prescott, but not both
cpus at once.
k8 is equally happy with any version.

2^4  2^5  2^6  2^7   2^8   2^9  2^10   2^11   2^12  version code_size
penryn:
142  417 1120 2837  6589 14935 33433  74609 164273  fft.00  4070
142  418 1132 2863  6662 15108 33844  74712 165418  fft.11  3189
142  417 1120 2838  6590 14938 46809 114069 282947  fft.10  3133
142  462 1231 3011  6982 15769 35297  78270 170920  fft.05  2572
142  462 1194 2997  6947 15780 48557 117461 289381  fft.01  2516
175  516 1396 3338  7673 17166 51432 123494 301169  fft.03  1652
180  542 1411 3414  7853 17452 51895 124489 304666  fft.04  1175

prescott:
423 1122 2854 7044 16366 37274 84451 187963 418948  fft.10  2414
423 1120 2855 7056 16390 37437 87674 196322 442723  fft.00  3176
420 1162 2972 7082 16693 38034 85973 189885 421885  fft.01  1745
466 1235 3149 7451 17410 39395 89301 202842 447159  fft.03  1162
472 1209 3130 7543 17438 40310 91024 206670 456248  fft.04  830
425 1227 3217 8032 18968 43605 98880 219511 487624  fft.11  2532
421 1286 3316 8082 19250 44563 99940 223647 495350  fft.05  1872

fft.00 is the previous patch; all versions were compiled with -Os.
fft.10 (which removes pass_big) might be a decent compromise if you
don't care about a huge speed regression in cases that aren't currently
used by any audio codec.

>> +    int n = 1<<s->nbits;
>> +    int i;
>> +    ff_fft_dispatch_3dn2(z, s->nbits);
>>      asm volatile("femms");
>> +    for(i=0; i<n; i+=2)
>> +        FFSWAP(FFTSample, z[i].im, z[i+1].re);
>>  }
>
> could you elaborate on why this FFSWAP pass is needed?

Intermediate results are not arrays of complex numbers, but rather group
reals and imaginaries into blocks according to the simd register size. I
suppose I could merge the swap pass into the last fft pass, like I did for
sse.
This is only needed in plain fft. My next commit after split-radix will be
to update imdct to take unswapped output from fft.

> position independent code right after a table that needs relocations ...
> no complaint, I just find it ironic

Blame gnu for allowing 64bit textrels but not 32bit textrels in x86_64
shared libs.

--Loren Merritt

```
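For readers following the first exchange, here is a minimal C sketch of the simplified permutation and the table fill quoted above. The three return statements and the `revtab` line come from the quoted text; the base case (`n <= 2`), the initial `m = n >> 1`, the helper name `build_revtab`, and the `uint16_t` table type are assumptions added to make the fragment compile. Multiplication replaces `<<` because the intermediate value can be negative. It is an illustration, not necessarily the code that was committed.

```c
#include <assert.h>
#include <stdint.h>

/* Recursive index mapping, following the quoted simplification.
 * Assumptions: the base case and the initial m = n/2 are not shown in
 * the quoted fragment. inverse must be 0 or 1. The result may be
 * negative or exceed n; the caller masks off the high bits. */
static int split_radix_permutation(int i, int n, int inverse)
{
    int m;
    if (n <= 2)
        return i & 1;
    m = n >> 1;
    if (!(i & m))
        return split_radix_permutation(i, m, inverse) * 2;
    m >>= 1;
    if (inverse == !(i & m))
        return split_radix_permutation(i, m, inverse) * 4 + 1;
    else
        return split_radix_permutation(i, m, inverse) * 4 - 1;
}

/* Hypothetical helper: fill revtab (n entries, n a power of two) the
 * way the quoted line does, masking once at the point of use. */
static void build_revtab(uint16_t *revtab, int n, int inverse)
{
    int i;
    assert(n >= 2 && (n & (n - 1)) == 0);
    for (i = 0; i < n; i++)
        revtab[(-split_radix_permutation(i, n, inverse)) & (n - 1)] = (uint16_t)i;
}
```

Calling `build_revtab(tab, 8, 0)` fills `tab` with a permutation of 0..7; masking only at the call site is what lets the recursion stay this short.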
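The FFSWAP explanation can also be shown in isolation. The sketch below assumes blocks of two floats (one 64-bit 3DNow! register), so each pair of complex slots holds `re0 re1 im0 im1` in memory; swapping `z[i].im` with `z[i+1].re` restores the usual interleaved `re0 im0 re1 im1` order. The typedefs and the function name `swap_blocked_to_interleaved` are hypothetical stand-ins, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

typedef float FFTSample;                          /* assumed sample type */
typedef struct { FFTSample re, im; } FFTComplex;  /* assumed layout: re, then im */

/* Convert from the blocked layout (re0 re1 im0 im1 per pair of slots)
 * to interleaved complex numbers, as the FFSWAP loop in the quoted
 * diff does. n is the number of complex elements and must be even. */
static void swap_blocked_to_interleaved(FFTComplex *z, size_t n)
{
    size_t i;
    assert(n % 2 == 0);
    for (i = 0; i < n; i += 2) {
        FFTSample tmp = z[i].im;   /* holds re1 in the blocked layout */
        z[i].im = z[i + 1].re;     /* move im0 into place */
        z[i + 1].re = tmp;         /* move re1 into place */
    }
}
```

With a wider register (for example four floats under SSE) the blocks would be larger and a plain pairwise swap would no longer be enough, which is presumably why that path merges the reordering into its last pass.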