[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Siarhei Siamashka siarhei.siamashka
Mon Jul 7 09:39:32 CEST 2008


Here is a patch that adds a somewhat more flexible variant of the
'float_to_int16' function ('more_flexible_variant_of_float_to_int16.diff').

It allows a quite noticeable WMA decoding performance improvement
('float_to_int16_wma.diff'), at least ~15% in my tests. Using the current
'float_to_int16' for WMA is hard without introducing an unnecessary
intermediate step that interleaves the samples in a temporary buffer.

Currently 'dca.c' and 'ac3dec.c' also use extra code for interleaving
samples and could be optimized in the same way.
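
To illustrate the difference between interleaving through a temporary buffer
and doing the interleaving inside the conversion itself, here is a minimal
sketch. It is not the code from the attached patches: the function names, the
signature taking an array of per-channel pointers, and the conversion
convention (samples nominally in [-1.0, 1.0), scaled by 32768, rounded and
clipped) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed convention: input samples are nominally in [-1.0, 1.0).
 * Values outside that range are clipped to the int16_t limits. */
static int16_t sample_to_int16(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)
        return INT16_MAX;
    if (scaled <= -32768.0f)
        return INT16_MIN;
    /* Round half away from zero without depending on libm. */
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Two-pass approach: interleave the floats into a temporary buffer first,
 * then convert the interleaved buffer. 'tmp' must hold 2 * len floats. */
static void convert_via_temp(int16_t *dst, float *tmp,
                             const float *left, const float *right,
                             size_t len)
{
    for (size_t i = 0; i < len; i++) {
        tmp[2 * i]     = left[i];
        tmp[2 * i + 1] = right[i];
    }
    for (size_t i = 0; i < 2 * len; i++)
        dst[i] = sample_to_int16(tmp[i]);
}

/* Fused approach (hypothetical signature): convert and interleave in a
 * single pass, reading directly from the per-channel buffers. */
static void convert_interleave(int16_t *dst, const float *const *src,
                               size_t len, int channels)
{
    assert(channels > 0);
    for (size_t i = 0; i < len; i++)
        for (int c = 0; c < channels; c++)
            dst[i * (size_t)channels + c] = sample_to_int16(src[c][i]);
}
```

Both functions produce the same output for stereo input. The fused version
skips the temporary buffer and the extra pass over the data.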

'float_to_int16_vorbis.diff' contains a patch which moves the channel
interleaving logic from the 'vector_*' function to 'float_to_int16_*'. It
simplifies the logic in 'vorbis_parse_audio_packet' and creates
opportunities for further optimizations. It also makes vorbis decoding
a bit faster (roughly 1.5% in my tests on Pentium-M) because
'vector_fmul_add_add' is now called with the 'step' argument set to 1. I
would also expect fewer L1 cache misses because of better data access
locality, but the callgrind results are somewhat weird (the test script
'benchffmpeg.rb' is attached). Pentium-M has a large 64K L1 data cache, so
benchmarking on cores with less L1 cache (P3, P4, Core2) would be very
interesting, to see how the performance difference between patched and
unpatched ffmpeg changes for vorbis decoding.

As the next step, I would like to improve vorbis decoding performance by
making it fit the L1 cache better (mostly needed for devices with an ARM11
core and 32K of L1 cache on board, with no L2 cache). It seems that many
different improvements are possible.

Regarding the subject: does it make sense to completely replace the current
'float_to_int16' function with the new one? Using the new function in place
of the old one is simple (though a bit cumbersome, because it requires
creating a temporary array with a single entry holding a pointer to the
samples), while using the old function is problematic for at least WMA, DCA
and AC3.
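
The single-entry temporary array mentioned above could look roughly like the
sketch below. The function names and signatures are hypothetical and not
taken from the patch, and the conversion convention (samples in [-1.0, 1.0),
scaled by 32768 and clipped) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed convention: samples in [-1.0, 1.0), clipped to int16_t. */
static int16_t to_int16(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)
        return INT16_MAX;
    if (scaled <= -32768.0f)
        return INT16_MIN;
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Hypothetical new-style function: one source pointer per channel. */
static void float_to_int16_multi(int16_t *dst, const float *const *src,
                                 size_t len, int channels)
{
    assert(channels > 0);
    for (size_t i = 0; i < len; i++)
        for (int c = 0; c < channels; c++)
            dst[i * (size_t)channels + c] = to_int16(src[c][i]);
}

/* Old-style call (a single contiguous buffer) expressed through the new
 * function: the only extra work is the one-entry pointer array. */
static void float_to_int16_single(int16_t *dst, const float *src, size_t len)
{
    const float *chans[1] = { src };
    float_to_int16_multi(dst, chans, len, 1);
}
```

A caller that already holds interleaved or mono float data would use
'float_to_int16_single' exactly as it used the old function.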

The main problem left to solve is efficient handling of channel counts other
than 1 or 2. It needs to be investigated whether a generic variant can be
optimized well (it should at least be faster than manual interleaving of
samples) and which other channel counts deserve special-case handling. Also,
I can do the ARM VFP optimizations, but 3DNow! and AltiVec versions would be
needed as well.
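
One possible shape for such special-casing, as a plain C sketch only: fast
paths for 1 and 2 channels and a generic fallback for everything else. The
function name, signature and conversion convention are assumptions, and
whether the generic path can be made fast enough is exactly the open
question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed convention: samples in [-1.0, 1.0), clipped to int16_t. */
static int16_t clip_round_int16(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)
        return INT16_MAX;
    if (scaled <= -32768.0f)
        return INT16_MIN;
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Hypothetical dispatcher: dedicated loops for mono and stereo,
 * generic strided writes for any other channel count. */
static void interleave_to_int16(int16_t *dst, const float *const *src,
                                size_t len, int channels)
{
    assert(channels > 0);
    if (channels == 1) {
        for (size_t i = 0; i < len; i++)
            dst[i] = clip_round_int16(src[0][i]);
    } else if (channels == 2) {
        const float *l = src[0];
        const float *r = src[1];
        for (size_t i = 0; i < len; i++) {
            dst[2 * i]     = clip_round_int16(l[i]);
            dst[2 * i + 1] = clip_round_int16(r[i]);
        }
    } else {
        /* Generic case: walk one channel at a time, so reads are
         * sequential and writes are strided by the channel count. */
        for (int c = 0; c < channels; c++) {
            const float *in = src[c];
            for (size_t i = 0; i < len; i++)
                dst[i * (size_t)channels + c] = clip_round_int16(in[i]);
        }
    }
}
```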

The patches are tested and seem to work fine, but they are not intended to be
committed yet. I'm mostly interested in feedback before doing any further
work.

Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchffmpeg.rb
Type: application/x-ruby
Size: 4139 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/ec6dc5b4/attachment.rb>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: float_to_int16_vorbis.diff
Type: text/x-diff
Size: 4126 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/ec6dc5b4/attachment.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: float_to_int16_wma.diff
Type: text/x-diff
Size: 1095 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/ec6dc5b4/attachment-0001.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: more_flexible_variant_of_float_to_int16.diff
Type: text/x-diff
Size: 5547 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/ec6dc5b4/attachment-0002.diff>
