[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)

Siarhei Siamashka siarhei.siamashka
Mon Sep 17 02:47:41 CEST 2007

Hello Michael,

Well, the current status can be tracked at:

After adding the optimizations you suggested, 'simple_idct_armv5te' is now
a bit faster than the current version of 'simple_idct_armv6', even without
having special SIMD and result saturation instructions. That's a very nice

With the latest changes, a few more options for further optimization have appeared:
1. As the +W6 constant is not needed anymore, we get a free half of a register,
which could probably be useful.
2. With the addition of a few condition checks and branches, these extra
instructions can probably be scheduled in a more favourable way and let us
make use of more 64-bit loads.
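
To illustrate what "half of a register" means in point 1, here is a minimal C
sketch of two 16-bit constants sharing one 32-bit word, with multiplies that
pick the top or bottom half, in the spirit of the ARMv5TE halfword multiply
instructions (SMULxy/SMLAxy). The helper names pack16, top16, bottom16, mul_bb
and mul_tb are hypothetical and are not taken from simple_idct_armv5te.S.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit "register". */
static inline uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Extract the halves again. The conversion to int16_t assumes the usual
 * two's complement wrap-around, which common compilers provide. */
static inline int16_t bottom16(uint32_t r) { return (int16_t)(r & 0xFFFFu); }
static inline int16_t top16(uint32_t r)    { return (int16_t)(r >> 16); }

/* Signed 16x16 -> 32 multiply of bottom(a) by bottom(b). */
static inline int32_t mul_bb(uint32_t a, uint32_t b)
{
    return (int32_t)bottom16(a) * bottom16(b);
}

/* Signed 16x16 -> 32 multiply of top(a) by bottom(b). */
static inline int32_t mul_tb(uint32_t a, uint32_t b)
{
    return (int32_t)top16(a) * bottom16(b);
}
```

When a constant such as +W6 is dropped, the half it occupied in such a packed
word becomes available for another constant or a temporary value.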

I'll try to fine-tune the code in the next few days. Until then, it would be
nice to make a decision about what to do with the stack alignment issue.

Changelog with the fixes done since the last review:
r257 | serge | 2007-09-17 02:13:12 +0300 (Mon, 17 Sep 2007) | 3 lines
Changed paths:
   M /trunk/libavcodec/armv4l/simple_idct_armv5te.S

Added one more optimization suggested by Michael Niedermayer -
skip over blocks of multiplications for empty column elements
3, 5, 7.
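
The idea of skipping multiplications for empty column elements can be sketched
in C roughly as follows. This is only an illustration under assumptions, not
the actual assembly: the W1..W7 values and the sign pattern are those I
believe the C reference simple IDCT uses, and the function names
odd_part_full and odd_part_skip are made up for this example.

```c
#include <stdint.h>

/* Odd-coefficient constants, assumed to match the C reference simple IDCT. */
enum { W1 = 22725, W3 = 19266, W5 = 12873, W7 = 4520 };

/* Reference: always performs all 16 multiplications for the odd part.
 * col points at the top of one column of an 8x8 block (stride 8). */
static void odd_part_full(const int16_t *col, int32_t b[4])
{
    b[0] = W1 * col[8*1] + W3 * col[8*3] + W5 * col[8*5] + W7 * col[8*7];
    b[1] = W3 * col[8*1] - W7 * col[8*3] - W1 * col[8*5] - W5 * col[8*7];
    b[2] = W5 * col[8*1] - W1 * col[8*3] + W7 * col[8*5] + W3 * col[8*7];
    b[3] = W7 * col[8*1] - W5 * col[8*3] + W3 * col[8*5] - W1 * col[8*7];
}

/* Same result, but each block of four multiplications for elements
 * 3, 5 and 7 is skipped when the element is zero. */
static void odd_part_skip(const int16_t *col, int32_t b[4])
{
    b[0] = W1 * col[8*1];
    b[1] = W3 * col[8*1];
    b[2] = W5 * col[8*1];
    b[3] = W7 * col[8*1];

    if (col[8*3]) {
        b[0] += W3 * col[8*3];
        b[1] -= W7 * col[8*3];
        b[2] -= W1 * col[8*3];
        b[3] -= W5 * col[8*3];
    }
    if (col[8*5]) {
        b[0] += W5 * col[8*5];
        b[1] -= W1 * col[8*5];
        b[2] += W7 * col[8*5];
        b[3] += W3 * col[8*5];
    }
    if (col[8*7]) {
        b[0] += W7 * col[8*7];
        b[1] -= W5 * col[8*7];
        b[2] += W3 * col[8*7];
        b[3] -= W1 * col[8*7];
    }
}
```

Whether the branch pays off depends on how often these elements are zero in
real streams and on the branch cost of the target core.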
r255 | serge | 2007-09-16 21:37:30 +0300 (Sun, 16 Sep 2007) | 6 lines
Changed paths:
   M /trunk/libavcodec/armv4l/simple_idct_armv5te.S

Added optimization suggested by Michael Niedermayer in
It saves 4 cycles in the full row processing code (32 total) and 8 cycles in
the column processing macro (also 32 cycles total). So the overall improvement
is 32 cycles for the case where all idct coefficients except the first row are
zero, and 64 cycles for a full idct calculation.
r254 | serge | 2007-09-15 13:46:59 +0300 (Sat, 15 Sep 2007) | 2 lines
Changed paths:
   M /trunk/libavcodec/armv4l/simple_idct_armv5te.S

Optimized empty rows processing loop, overall 16 cycles should be saved on
ARM9E and XScale
