[FFmpeg-devel] Review request - ra288.{c,h} ra144.{c,h}

Siarhei Siamashka siarhei.siamashka
Thu Sep 18 11:00:42 CEST 2008


On Wednesday 17 September 2008, Michael Niedermayer wrote:
> On Wed, Sep 17, 2008 at 02:48:41AM +0300, Siarhei Siamashka wrote:
[...]
> > Well, the summary of my message was the following: "if you want to find
> > an easy optimization target, grep ffmpeg sources for lrintf". One of such
> > targets is WMA decoder, using 'float_to_int16_interleave' in it is quite
> > trivial and provides a very noticeable performance improvement, there
> > were even several patches floating around which can be used with minor
> > changes.
>
> yes, and no ....
> yes lrintf() is a easy optimization target, but its not
> float_to_int16_interleave() that should be used, instead the floats
> themselfs should be returned.
>
> > It would not be very nice to use lrintf in new code and dsputil functions
> > should be preferred. For the targets (if such targets exist) where lrintf
> > is the fastest way to convert float to integer, the dsputil function
> > should be implemented using lrintf. Otherwise it should use SIMD
> > instructions available on the target system, or unrolled/pipelined
> > sequence of instructions.
>
> as said above, IMO the floats should be returned, but it for some reason
> didnt work when vitor tried so i suggested lrintf() until floats can be
> returned. dsputil would be IMHO overkill as its not supposed to stay like
> that when float output is fixed.
> If it was supposed tp stay then yes id agree with you that dsputil should
> be used.

Moving conversion outside of the decoder makes everything more modular. But
modularity also sometimes has a cost. Let's consider the following example:

decoder X: A -> B -> C -> D ...
decoder Y: E -> B -> F -> G ...
decoder Z: A -> B -> H -> I ...

Let's suppose that B and C operations can be fused together to make them
faster (for example merging postrotation in IMDCT and applying windowing 
function for vorbis). In this case keeping these operations separate impacts
performance. Introducing alternative fast path for using fused BC operation
can be a solution. Merging some operations may be beneficial for some
platforms but bad for the others (the number of registers in x86 is small,
preventing some optimizations), so this selection must be somewhat
intelligent.

I hope that making use of C version of float_to_int16 (it is to some extent an
example of 'fused' operation now) will not make much troubles when it is moved
out of dsputil.

Moreover, ARM VFP code could use one more optimization trick. VFP has 
instruction which can convert float to int32_t with saturation 
built-in. But there is no direct way to convert to int16_t, so the 
current code must do saturation of values to int16_t separately. If 
decoder output is premultiplied by 64K, it should be possible to 
avoid this extra saturation operation and improve performance (just 
take high 16-bits of the converted values instead of lower 16-bits).
This also has an issue with rounding precision, but performance 
improvement may be worth it.

Will it be still possible to use such optimizations in a redesigned ffmpeg?

I'm not trying to argue. I just want to know what's going to happen and what
troubles/benefits we can expect and ensure that nothing was overlooked.

-- 
Best regards,
Siarhei Siamashka




More information about the ffmpeg-devel mailing list