[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Siarhei Siamashka siarhei.siamashka
Tue Jul 15 02:05:24 CEST 2008

On Tuesday 15 July 2008, Loren Merritt wrote:
> On Mon, 14 Jul 2008, Siarhei Siamashka wrote:
> > For example, it is possible to get rid of "memcpy(saved, buf+blocksize/4,
> > blocksize/4*sizeof(float))" and probably "vc->buf", performing output
> > directly to "vc->ret" and "vc->saved" from "fft.imdct_half".
> > It should further improve both performance and L1 cache use, making
> > vorbis decoder even better than it is now.
> It's not that clear cut. I can remove vc->buf (overwriting some other
> buffer that's not used at the time, like channel_residues). 

Yes, this optimization is quite obvious.

> But eliminating the memcpy requires increasing the amount of memory used,
> since you then need to keep one saved array per channel plus one for the
> current block to be pointer-swapped. This is faster if the data still
> fits in L1 after that expansion, but slower if you have an old cpu with
> a small cache.

Why would memory use increase? We still keep the "vc->saved" buffer, and all
the needed data ends up in it after each iteration. Maybe imdct_half could
produce non-contiguous output, storing part of the data directly to "vc->ret"
and part directly to "saved", to avoid moving bytes around later. Is it
possible to perform "vector_fmul_window" in-place in the "ret" buffer? I just
suggest trying to make "vc->ret", "vc->buf" and channel_residues reuse the
same buffer, if that is possible of course. I feel the memory footprint could
be reduced quite significantly, so that the data fits L1 cache better.
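
To make the in-place question concrete, here is a minimal scalar sketch. It
is not the actual "vector_fmul_window" code; the name "overlap_window", the
argument order and the window layout (2*len coefficients) are assumptions
made for illustration. The point is only that a loop of this shape loads
both inputs of an iteration before storing the two outputs at the same
indices, so it tolerates dst == prev and cur == dst + len.

```c
/* Hypothetical scalar windowed overlap-add, for illustration only.
 * dst: 2*len floats, prev and cur: len floats each, win: 2*len floats. */
static void overlap_window(float *dst, const float *prev, const float *cur,
                           const float *win, int len)
{
    for (int i = 0, j = len - 1; i < len; i++, j--) {
        /* Load everything this iteration needs before storing anything. */
        float p  = prev[i];
        float c  = cur[j];
        float wi = win[i];
        float wj = win[len + j];

        /* Each (i, len + j) pair is written exactly once, and only after
         * prev[i] and cur[j] have been read, so aliasing is harmless. */
        dst[i]       = p * wj - c * wi;
        dst[len + j] = p * wi + c * wj;
    }
}
```

Calling it as overlap_window(ret, ret, ret + len, win, len) would then work
without a temporary buffer. Whether a SIMD version that processes several
elements per iteration keeps this property would have to be checked
separately.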

Regarding the "float_to_int16_interleave" function, it would be nice to add
at least a "stride" argument in addition to "len". That would make it usable
for WMA. It could also still be useful for vorbis (with some changes to the
code, a stride might become necessary).
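
A rough scalar sketch of what such a variant could look like follows. The
name "float_to_int16_stride" and its signature are hypothetical, not an
existing function. It assumes the input floats are already scaled to the
int16 range, that "stride" counts int16 elements in the destination, and it
rounds halves away from zero, which may differ from what the SIMD versions
do.

```c
#include <stdint.h>

/* Hypothetical helper: clip to the int16 range, then round to nearest
 * (halves away from zero). Clipping first keeps the cast in range. */
static int16_t float_to_int16_one(float x)
{
    if (x >= 32767.0f)
        return INT16_MAX;
    if (x <= -32768.0f)
        return INT16_MIN;
    return (int16_t)(x < 0.0f ? x - 0.5f : x + 0.5f);
}

/* Hypothetical strided conversion: writes src[i] to dst[i * stride].
 * With stride == number of channels and dst offset by the channel index,
 * one call per channel yields interleaved output. */
static void float_to_int16_stride(int16_t *dst, const float *src,
                                  int len, int stride)
{
    for (int i = 0; i < len; i++)
        dst[i * stride] = float_to_int16_one(src[i]);
}
```

For stereo output this would be called as
float_to_int16_stride(out + ch, samples[ch], len, 2) for ch = 0 and 1, where
"out" and "samples" are placeholder names.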

By the way, have you benchmarked the SSE2 optimized "float_to_int16_*"
functions? On what kind of CPU should they be faster than the SSE versions?
Also, the SSE version looks very suspicious. Is it really correct?

Best regards,
Siarhei Siamashka
