[FFmpeg-devel] IDCT permutation (was: pre discussion around Blackfin dct_quantize_bfin routine)

Siarhei Siamashka siarhei.siamashka
Wed Jun 13 09:57:26 CEST 2007

On Wednesday 13 June 2007 02:39, Michael Niedermayer wrote:

> On Tue, Jun 12, 2007 at 09:24:22AM -0400, Marc Hoffman wrote:
> > On 6/12/07, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> > > On 6/12/07, Michael Niedermayer <michaelni at gmx.at> wrote:
> > > > Hi
> > > >
> > > > On Tue, Jun 12, 2007 at 05:49:19AM -0400, Marc Hoffman wrote:
> > > > > Does these allow me to ignore the DCT permutation?
> > > >
> > > > no it would still break if the user selected the other integer idct
> > >
> > > Is it possible to add a configure option to be able to compile ffmpeg
> > > only with IDCT that do not need permutation (and do not allow the user
> > > to select other idct)? At least it would eliminate table lookups in
> > > many places (replace table lookups with a macro which expands either
> > > to table lookup or the value itself). The point is that ARM devices
> > > are heavily CPU limited and ARMv5TE optimized IDCT does not use
> > > permutation. Blackfin powered devices may be CPU limited too (Marc can
> > > probably privide more information about blackfin performance). I'll
> > > try to do some benchmarks on ARM and post some results later.

Benchmarks done, see results at the end of this message...

> > On Blackfin you want to elliminate those permutations they are costly.
> >  Basically, something like:
> >
> >     j=scantable[i];
> >     x=data[j];
> >
> > expands into:
> >
> >     p0=[p1++];
> >     3 cycle delay waiting for p0 to validate.  Thank god its interlocked.
> >     r0=[p0];
> >
> > you don't really want to do this very often.  The execution pipeline
> > looks something like this
> >
> >     IF0 IF1 IF2 ID AC M0 M1 M2 EX WB
> >
> > AC is where addresses are computed before they are feed into the memory
> > pipe. Mx are memory access stages they overlap with other things not
> > needed for this discussion.
> > IFx instruction fetch
> > ID instruction decode
> > WB write back
> > EX execute, actually Blackfin has two stages of execution the other
> > one overlaps with M2.

Table lookups are also somewhat expensive on ARM. If you try to use register
immediately after it gets loaded (from cache), you get a pipeline stall. On
ARM9 it is 1 cycle penalty for loading aligned 32-bit data and 2 cycles
penalty for loading 8-bit values which are not at 32-bit boundary (extra
pipeline stage is required for internally shifting data). On ARM11 load-use
penalty is always 2 or 3 cycles depending on the type of instruction which
processes data. It is possible to insert some other instructions between data
loading and use to avoid pipeline interlock.

> > There are 3 stages of execution in the pipeline for accessing the
> > memory on the parts and the feed back of the load into the register p0
> > needs to wait until the end of the pipeline before its used.
> why not read 3 into 3 registers and then write them, doesnt this avoid
> the delays?

Yes, it is possible unroll some loops such at this one to avoid pipeline
interlock on data loading (a clever compiler might do this job):

                for(i=1;i<8;i++) {
                    block[s->dsp.idct_permutation[i]] += ac_val[i + 8];

But some other code fragments are slightly harder to optimize.


> also decoding involes a mandatory permutation
> so no matter what idct_permutation is set to it will be the same speed and
> wisely setting the idct permutation can simplify the idct and thus speed
> it up, this is a high level optimization and wont make code slower no
> matter how expensive the permutation is as there arent more permutations
> done

By looking at ffmpeg code, this does not seem to be absolutely true...

> the extra cost is just on the encoder side, where its just a single if()
> if its the no permutation case ...

Please check the patch which is attached. It was generated by a ruby script
which is also attached:
'ruby replace_permutate.rb libavcodec'

It results in identical ffmpeg regression tests output 
(with ./configure --enable-gpl) when IDCT_PERMUTATION is defined 
as follows (defined in the patch):
#define IDCT_PERMUTATION(lookup_tbl, index) lookup_tbl[index]

If we change this define to 
#define IDCT_PERMUTATION(lookup_tbl, index) (index)
it still provides the same output with 'simple' idct but is ~2% faster on
Nokia 770 (ARM9).

Before patch:

$ ./mplayer.orig  -nosound -quiet -benchmark -vo null -loop 
3 /media/mmc1/Video/MissionImpossible3_Trailer4.divx | grep BENCHMARKs
BENCHMARKs: VC:  89.976s VO:   0.034s A:   0.000s Sys:   1.089s =   91.098s
BENCHMARKs: VC:  93.419s VO:   0.033s A:   0.000s Sys:   1.069s =   94.521s
BENCHMARKs: VC:  93.307s VO:   0.032s A:   0.000s Sys:   1.078s =   94.418s

After patch:

~ $ ./mplayer.patched -nosound -quiet -benchmark -vo null -loop 
3 /media/mmc1/Video/MissionImpossible3_Trailer4.divx | grep BENCHMARKs
BENCHMARKs: VC:  87.998s VO:   0.036s A:   0.000s Sys:   1.086s =   89.120s
BENCHMARKs: VC:  91.074s VO:   0.035s A:   0.000s Sys:   1.069s =   92.177s
BENCHMARKs: VC:  91.377s VO:   0.036s A:   0.000s Sys:   1.069s =   92.482s

I did not have time to review the changes in the code that were generated by
the script, so it still needs to carefully verified (though regression tests
pass). Usage of one or another IDCT_PERMUTATION macro variant could be
probably selected at configure time. But it at least proves that such
optimization can be quite useful. By the way, probably a similar effect can be
achieved by combining 'idct_permutation' and 'scantable' tables in code
fragments such as this one (a comment probably implies this optimization
    j= s->dsp.idct_permutation[ scantable[i] ]; //FIXME optimize

Marc, can you check this patch (with IDCT_PERMUTATION macro changed 
to return just a value instead of table lookup) to see if it provides a
noticeable performance improvement for blackfin too?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: removable-idct-permutate.diff
Type: text/x-diff
Size: 17872 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070613/e7ac6255/attachment.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: replace_permutate.rb
Type: application/x-ruby
Size: 596 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070613/e7ac6255/attachment.rb>

More information about the ffmpeg-devel mailing list