[FFmpeg-devel] pre discussion around Blackfin dct_quantize_bfin routine

Marc Hoffman mmhoffm
Tue Jun 12 15:24:22 CEST 2007


On 6/12/07, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> On 6/12/07, Michael Niedermayer <michaelni at gmx.at> wrote:
> > Hi
> >
> > On Tue, Jun 12, 2007 at 05:49:19AM -0400, Marc Hoffman wrote:
> > > Do these allow me to ignore the DCT permutation?
> >
> > no it would still break if the user selected the other integer idct
>
> Is it possible to add a configure option to be able to compile ffmpeg
> only with IDCT that do not need permutation (and do not allow the user
> to select other idct)? At least it would eliminate table lookups in
> many places (replace table lookups with a macro which expands either
> to table lookup or the value itself). The point is that ARM devices
> are heavily CPU limited and ARMv5TE optimized IDCT does not use
> permutation. Blackfin powered devices may be CPU limited too (Marc can
> probably provide more information about blackfin performance). I'll
> try to do some benchmarks on ARM and post some results later.
>
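
To make the macro idea above concrete, here is a rough sketch of what
it might look like.  CONFIG_NO_IDCT_PERM, PERM, perm_table and
sum_coeffs are names made up for illustration; none of them is an
existing ffmpeg option or identifier, and the table is just the
identity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build-time switch; NOT an existing ffmpeg configure option. */
#ifndef CONFIG_NO_IDCT_PERM
#define CONFIG_NO_IDCT_PERM 1
#endif

#if CONFIG_NO_IDCT_PERM
/* Every compiled-in IDCT uses natural order: the index is the value itself. */
#define PERM(i) (i)
#else
/* Hypothetical permutation table (identity here, for illustration only). */
static const uint8_t perm_table[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
/* A permuting IDCT may be selected: pay for the table lookup. */
#define PERM(i) (perm_table[i])
#endif

/* Sum up to 8 coefficients through the (possibly trivial) permutation. */
static int sum_coeffs(const int16_t *data, int n)
{
    int i, sum = 0;

    assert(n >= 0 && n <= 8);
    for (i = 0; i < n; i++)
        sum += data[PERM(i)];
    return sum;
}
```

With the switch set, data[PERM(i)] is plain data[i], so the compiler
never has to emit the load of the index that feeds the second load.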

On Blackfin you want to eliminate those permutations; they are costly.
Basically, something like:

    j=scantable[i];
    x=data[j];

expands into:

    p0=[p1++];
    /* 3 cycle delay waiting for p0 to become valid.  Thank god it's interlocked. */
    r0=[p0];

You don't really want to do this very often.  The execution pipeline
looks something like this:

    IF0 IF1 IF2 ID AC M0 M1 M2 EX WB

    IFx  instruction fetch
    ID   instruction decode
    AC   address computation, before addresses are fed into the memory pipe
    Mx   memory access stages; they overlap with other things not needed
         for this discussion
    EX   execute; Blackfin actually has two stages of execution, the
         other one overlaps with M2
    WB   write back

There are 3 stages in the pipeline for accessing memory on these
parts, and the feedback of the load into the register p0 has to wait
until the end of the pipeline before it's used.
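
As a back-of-the-envelope model of that delay, here is a small sketch.
It assumes the loaded value becomes usable once the load has left M2
and that the dependent instruction needs it when it enters AC; that
choice is an assumption picked because it reproduces the 3 cycle
figure, not something taken from the hardware manual.  The enum and
load_use_stall are illustrative names only.

```c
#include <assert.h>

/* Pipeline stages in the order listed above. */
enum stage { IF0, IF1, IF2, ID, AC, M0, M1, M2, EX, WB };

/*
 * Stall cycles for an instruction issued right behind a load whose
 * result it needs in stage `need`, when the load only delivers that
 * result after leaving stage `ready` (assumed forwarding point).
 *
 * The load leaves `ready` one cycle after the follower would normally
 * leave the stage before `need`, so the follower has to wait
 * (ready - need) extra cycles.
 */
static int load_use_stall(enum stage need, enum stage ready)
{
    assert(need <= ready);
    return (int)ready - (int)need;
}
```

Under these assumptions load_use_stall(AC, M2) comes out as 3, one
cycle for each of M0, M1 and M2.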

This is what I/we have to work with on these lighter weight embedded
processors.  We are talking about fairly simple microarchitectures in
comparison to things like PPC and x86.  Actually, this pipeline layout
works very well for numerical calculations that don't require
permutations :).
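
    #include <stdio.h>

    int main (void)
    {
      int clk;
      int mem[10];

      /* mem[0] is loaded into p0 and then dereferenced, so it must
         hold a valid address */
      mem[0] = (int) mem;
      while (1) {
        asm volatile (
             "%0=cycles;\n\t"
             "p0=[%1];\n\t"        /* load the pointer ...          */
             "r0=[p0];\n\t"        /* ... and use it straight away  */
             "r0=cycles;\n\t"
             "%0=r0-%0 (ns);\n\t"
             /* "=&d": %0 is written before %1 is read */
             : "=&d" (clk) : "a" (mem) : "R0", "P0", "memory");
        printf ("%d\n", clk);
      }
    }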

This prints 6.  Subtract 1 for the last read of cycles and we get 5;
take off the two instructions which actually execute and that leaves
3 dead cycles.  What is it on the ARM?

Marc
