[FFmpeg-devel] [RFC] 'dsputil_arm_s.S' code review for possible improvements
Fri Sep 14 23:53:29 CEST 2007
While waiting for a new armv5te idct patch review, took a look at
'dsputil_arm_s.S' source today.
There are a number of problems and possible ways for improvements:
1. This source uses '.align 8' directive everywhere. Actually '.align'
is a not portable way to specify alignment and gnu assembler may interpret
it either as an alignment in bytes or as an alignment of power of 2 of the
specified value depending on the target platform. The latter is true for
ARM and '.align 8' actually results in 256 bytes alignment! Such alignment
blows up code size a lot and makes rather big holes in it, this is definitely
not very good for cache.
2. An indirect jump is used to detect alignment of the source buffer in all
functions. This is not very fast as it is a complex instruction which requires
a lot of cycles to execute, especially on long pipeline cores (it also may
sometimes cause data cache misses on fetching label addresses further slowing
down the code).
The following code may be used as a faster alternative and has the same
number of instructions (so the code size will remain the same). It will
work faster on all ARM cores if we try to calculate cpu clock cycles, even
not taking into account potential data cache misses of the old implementation
that uses indirect jumps:
movs ip, r1, lsl #31 /* inverted bit 0 of r1 register gets into Z flag */
/* bit 1 of r1 register gets into C flag */
bic r1, #3 /* clear lower 2 bits of r1 register */
bhi 4f /* lower two bits of r1 were '11' */
bcs 3f /* lower two bits of r1 were '10' */
bne 2f /* lower two bits of r1 were '01' */
/* lower two bits of r1 were '00' */
It should help to provide some speedup and eliminate the need to have some
data space occupied by label addresses. All the data variables would be also
better to be grouped together for better data cache utilization.
3. This source uses PLD (cache preload) instruction which is ARMv5TE+. This
issue was already raised on the mailing list earlier, though nobody seems to
actively complain or send patches to fix it :) Older ARM cores may have
troubles with it. In addition, judging from the look of it, this PLD
instruction might be not very useful - prefetch distance is generally very
short. In addition, this instruction may pull into the cache (and
unnecessarily pollute it) some extra unneeded data on the last iteration.
I guess extensive benchmarks in some simulated environment (various
cache hit/miss combinations for source and destination buffers) are needed
to find the best prefetching strategy. Maybe just removing these PLD
instructions will appear to be the best choice in the end :)
4. Some instructions are not scheduled in an optimal way and may cause
pipeline stalls resulting in some minor performance penalty.
What would be the best way to address all these problems? Is it worth fixing
them one at a time or should I provide a cumulative patch?
More information about the ffmpeg-devel