[FFmpeg-devel] [RFC] Loop unrolling in C code for 'vector_fmul_*' functions

Siarhei Siamashka siarhei.siamashka
Tue Jan 8 01:20:07 CET 2008


Hello,

I tried to figure out why 'ffvorbis' is somewhat slower than the fixed-point
'tremor' decoder on the Nokia N800, even though the device has hardware
floating point support onboard (a VFP coprocessor).

A noticeable amount of time is spent in the 'vector_fmul_*' functions:

# opreport -l ./mplayer
CPU: ARM V6 PMU, speed 0 MHz (estimated)
Counted CPU_CYCLES events (clock cycles counter) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
50133    20.7036  ff_fft_calc_c
43325    17.8921  ff_imdct_calc
31860    13.1574  vorbis_decode_frame
30818    12.7270  ff_vector_fmul_add_add_c
18704     7.7243  vorbis_inverse_coupling
18465     7.6256  vector_fmul_c
18057     7.4571  vector_fmul_reverse_c
11140     4.6005  ff_float_to_int16_c
8429      3.4810  render_line
4228      1.7461  vorbis_floor1_decode
1188      0.4906  ogg_page_checksum_set

The 'vector_fmul_*' functions look like low-hanging fruit, in the sense
that they seem to be quite easy to optimize :) But there is another
interesting thing: the C implementation of these functions is very
straightforward and does not even unroll the loops, while assembly or other
SIMD optimizations exist only for x86 and ppc at the moment. Is this
intentional, with code readability being the main priority for them? Or could
some tweaks be added to improve the performance of the 'generic C' code?

I have written and attached a simple test program which benchmarks
various 'vector_fmul' implementations, and tried it on the N800 and on my
desktop PC. The results are listed at the bottom of this message. I wonder
what the results would be on other platforms which do not have optimized
implementations of these functions yet (non-x86 and non-ppc).
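For readers who cannot fetch the attachment, here is a minimal sketch of how
such a benchmark could be structured. It is not the attached program: the
names 'vector_fmul_ref', 'time_vector_fmul' and 'cycles_per_element' are
hypothetical, and the use of clock() for timing is an assumption. The clock
frequency in MHz is passed in by the caller, as in the command lines further
down.

```c
#include <assert.h>
#include <time.h>

/* Signature shared by the variants under test (same shape as vector_fmul_c). */
typedef void (*vector_fmul_fn)(float *dst, const float *src, int len);

/* Plain reference loop: dst[i] *= src[i]. */
static void vector_fmul_ref(float *dst, const float *src, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] *= src[i];
}

/* Convert elapsed seconds into cycles per element for a CPU at cpu_mhz. */
static double cycles_per_element(double seconds, double cpu_mhz, double elements)
{
    assert(elements > 0.0);
    return seconds * cpu_mhz * 1e6 / elements;
}

/*
 * Call fn 'iterations' times on the same buffers and return the CPU time
 * in seconds. Filling src with 1.0f keeps dst finite across repeated calls.
 */
static double time_vector_fmul(vector_fmul_fn fn, float *dst, const float *src,
                               int len, int iterations)
{
    clock_t start = clock();
    int i;
    for (i = 0; i < iterations; i++)
        fn(dst, src, len);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A caller would time each variant with time_vector_fmul() and pass the result
to cycles_per_element() together with len * iterations as the element count.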

But at least for ARM, it looks like the compiler is quite stupid and can't
schedule instructions properly, as seen from the benchmark results (just
unrolling the loop is not enough, and some extra tweaks are needed
in 'vector_fmul_c_other_unrolled'). The VFP coprocessor has a high result
latency (8 cycles), though its throughput is quite good (1 cycle), and it has
some other nice features which can improve performance (documentation for VFP
can be found at http://www.arm.com). The compiler (gcc) does not even try to
reorder instructions, so the pipeline is stalled most of the time. I would
not be surprised if the compiler also generated something suboptimal for the
more complicated floating point code (fft and imdct).
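To illustrate the latency argument: with an 8-cycle result latency and a
1-cycle throughput, roughly eight independent multiplications need to be in
flight to keep the pipeline busy. The sketch below shows one way to express
that in C by grouping loads, multiplies and stores. It is an assumption about
what such a tweak could look like, not the 'vector_fmul_c_other_unrolled'
code from the attachment; the name 'vector_fmul_grouped' is hypothetical, and
the compiler remains free to reorder the statements.

```c
#include <assert.h>

/*
 * Hypothetical latency-aware variant: eight independent products are
 * started before the first result is stored, so no multiply has to wait
 * for the previous one. Requires len to be a multiple of 8 and dst/src
 * not to overlap (C99 'restrict').
 */
static void vector_fmul_grouped(float *restrict dst,
                                const float *restrict src, int len)
{
    int i;
    assert(len % 8 == 0);
    for (i = 0; i < len; i += 8) {
        /* Issue all loads first. */
        float d0 = dst[i + 0], d1 = dst[i + 1], d2 = dst[i + 2], d3 = dst[i + 3];
        float d4 = dst[i + 4], d5 = dst[i + 5], d6 = dst[i + 6], d7 = dst[i + 7];
        float s0 = src[i + 0], s1 = src[i + 1], s2 = src[i + 2], s3 = src[i + 3];
        float s4 = src[i + 4], s5 = src[i + 5], s6 = src[i + 6], s7 = src[i + 7];

        /* Eight multiplies with no dependencies between them. */
        d0 *= s0; d1 *= s1; d2 *= s2; d3 *= s3;
        d4 *= s4; d5 *= s5; d6 *= s6; d7 *= s7;

        /* Store only after all products have been issued. */
        dst[i + 0] = d0; dst[i + 1] = d1; dst[i + 2] = d2; dst[i + 3] = d3;
        dst[i + 4] = d4; dst[i + 5] = d5; dst[i + 6] = d6; dst[i + 7] = d7;
    }
}
```

Whether this ordering survives into the generated code depends on the
compiler and flags, so the resulting assembly is worth checking.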

By tweaking the C code, performance can be improved quite a lot
('vector_fmul_c_other_unrolled' vs. 'vector_fmul_c_unrolled').
But unnecessarily cluttering the code because of inefficient compilers is not
a good option. Still, perhaps the loops could at least be unrolled to help
the compiler do its job? The compiler itself does not know that 'len' is a
multiple of 8, so manual loop unrolling seems reasonable.
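As a concrete picture of the plain unrolling meant here, a minimal sketch
follows. It assumes the caller guarantees that 'len' is a multiple of 8 and
uses the hypothetical name 'vector_fmul_unrolled8'; it is not copied from the
attachment or from the FFmpeg tree.

```c
#include <assert.h>

/*
 * Straightforward 8x manual unroll of dst[i] *= src[i].
 * No remainder loop is needed because len is required to be a
 * multiple of 8; the assert documents that precondition.
 */
static void vector_fmul_unrolled8(float *dst, const float *src, int len)
{
    int i;
    assert(len % 8 == 0);
    for (i = 0; i < len; i += 8) {
        dst[i + 0] *= src[i + 0];
        dst[i + 1] *= src[i + 1];
        dst[i + 2] *= src[i + 2];
        dst[i + 3] *= src[i + 3];
        dst[i + 4] *= src[i + 4];
        dst[i + 5] *= src[i + 5];
        dst[i + 6] *= src[i + 6];
        dst[i + 7] *= src[i + 7];
    }
}
```

This removes seven out of every eight loop-counter updates and branches, but
leaves instruction ordering entirely to the compiler.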

Well, I will do the rest of the ARM VFP optimizations for all
these 'vector_fmul_*' functions anyway :)

I hope this post was not completely useless spam.

------------------------------
Nokia N800 (ARM11 400MHz) 
gcc 4.2.2, '-O3 -fomit-frame-pointer -mcpu=arm1136jf-s -mfloat-abi=softfp'

$ ./vector_fmul_test 400
Function: 'vector_fmul_c', time=34.439 (cycles/element=26.905)
Function: 'vector_fmul_c_unrolled', time=22.801 (cycles/element=17.813)
Function: 'vector_fmul_c_other_unrolled', time=7.012 (cycles/element=5.479)
Function: 'vector_fmul_c_simd', time=20.049 (cycles/element=15.664)
Function: 'vector_fmul_vfp', time=2.628 (cycles/element=2.053)

------------------------------
AthlonXP 2400+ (2000MHz)
gcc 4.1.2 '-O3 -fomit-frame-pointer -march=athlon-xp'

$ ./vector_fmul_test 2000
Function: 'vector_fmul_c', time=0.893 (cycles/element=3.486)
Function: 'vector_fmul_c_unrolled', time=0.565 (cycles/element=2.207)
Function: 'vector_fmul_c_other_unrolled', time=0.574 (cycles/element=2.242)
Function: 'vector_fmul_c_simd', time=0.310 (cycles/element=1.210)

------------------------------
AthlonXP 2400+ (2000MHz)
gcc 4.1.2 '-O3 -fomit-frame-pointer -march=athlon-xp -mfpmath=sse'

$ ./vector_fmul_test 2000
Function: 'vector_fmul_c', time=0.810 (cycles/element=3.166)
Function: 'vector_fmul_c_unrolled', time=0.803 (cycles/element=3.137)
Function: 'vector_fmul_c_other_unrolled', time=0.801 (cycles/element=3.131)
Function: 'vector_fmul_c_simd', time=0.312 (cycles/element=1.219)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vector_fmul_test.c
Type: text/x-csrc
Size: 6426 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080108/7476ce7e/attachment.c>


