[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Loren Merritt lorenm
Tue Jul 15 06:26:54 CEST 2008


On Tue, 15 Jul 2008, Siarhei Siamashka wrote:
> On Tuesday 15 July 2008, Loren Merritt wrote:
>
>> But eliminating the memcpy requires increasing the amount of memory used,
>> since you then need to keep one saved array per channel plus one for the
>> current block to be pointer-swapped. This is faster if the data still
>> fits in L1 after that expansion, but slower if you have an old cpu with
>> a small cache.
>
> Why would memory increase? We still keep the "vc->saved" buffer, and all the
> needed data ends up in it after each iteration. Maybe imdct_half could produce
> non-contiguous output, storing part of the data directly to "vc->ret" and
> part of the data directly to "saved", in order to avoid moving bytes around
> later.

Before: the right half of the current imdct'ed block can be in any old 
temp buffer, and is copied into saved[] after we're done using the 
previous value of saved[].

After: the right half of the current imdct'ed block must be in a buffer of 
size blocksize/4, which can be swapped with the previous saved[]. We can't 
write the imdct'ed block directly into saved[], since we need both values 
at the same time. There aren't any other arrays of exactly the right size 
to cannibalize, and we can't re-use something bigger or we're wasting even 
more memory due to increased size of the other saved[] entries.

See patch (which won't apply to svn, since it depends on other patches I 
haven't committed yet, but the strategy should be clear).
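
To illustrate the before/after bookkeeping described above, here is a minimal
C sketch. It is not the patch itself: the names CHANNELS, OverlapState,
keep_by_copy and keep_by_swap are hypothetical, and the imdct and windowing
steps are left out. It only shows why the swap variant needs one extra
buffer of exactly the saved[] size.

```c
#include <assert.h>
#include <string.h>

#define CHANNELS 2   /* hypothetical channel count, for illustration only */

/* Hypothetical state: one saved buffer per channel, plus one spare that
 * receives the right half of the block currently being transformed. */
typedef struct {
    float *saved[CHANNELS]; /* right half of each channel's previous block */
    float *spare;           /* same size as a saved[] entry */
} OverlapState;

/* "Before": the right half sits in any temp buffer and is copied into
 * saved[] once the previous contents of saved[] have been consumed. */
static void keep_by_copy(float *saved, const float *tmp, int half)
{
    assert(half > 0);
    memcpy(saved, tmp, half * sizeof(*saved));
}

/* "After": the right half was written into st->spare, so the old saved[ch]
 * and the new data exist at the same time. Exchanging the two pointers
 * replaces the copy, and the old saved[ch] becomes the next spare. */
static void keep_by_swap(OverlapState *st, int ch)
{
    float *t;
    assert(ch >= 0 && ch < CHANNELS);
    t             = st->saved[ch];
    st->saved[ch] = st->spare;
    st->spare     = t;
}
```

With the copy variant the temp buffer can be anything that already exists;
with the swap variant there are CHANNELS + 1 buffers of the saved[] size in
rotation, which is the extra memory mentioned above.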

> Is it possible to perform "vector_fmul_window" in-place in "ret"
> buffer? I just suggest trying to make "vc->ret", "vc->buf" and
> channel_residues reuse the same buffer, if it is possible of course. But I
> feel that memory footprint can be reduced quite significantly, better fitting
> L1 cache.

ok

> Regarding the "float_to_int16_interleave" function, it would be nice to also
> add at least a "stride" argument in addition to "len". That would make it
> usable for WMA. And it could still be useful for vorbis (with some changes
> to the code, a stride might become needed).

ok
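
For reference, a scalar sketch of what an interleaving conversion with a
stride argument might look like. The thread does not pin down what "stride"
means, so this assumes it is the distance, in floats, between the start of
one channel's samples and the next inside a single input buffer. The
function name and signature are hypothetical and not the dsputil prototype.
Input is assumed to be already scaled to the int16 range, and ties round
away from zero, which may differ from what the SIMD versions do.

```c
#include <assert.h>
#include <stdint.h>

/* Clip to the int16 range, then round to nearest (ties away from zero). */
static int16_t clip_round_int16(float x)
{
    if (x != x)         return 0;        /* NaN: pick a defined result */
    if (x >=  32767.0f) return  32767;
    if (x <= -32768.0f) return -32768;
    return (int16_t)(x < 0 ? x - 0.5f : x + 0.5f);
}

/* Hypothetical signature: channel c occupies src[c*stride .. c*stride+len-1],
 * and the output is interleaved as ch0, ch1, ..., ch0, ch1, ... */
static void f2i16_interleave_stride(int16_t *dst, const float *src,
                                    int len, int channels, int stride)
{
    int i, c;
    assert(len >= 0 && channels > 0 && stride >= len);
    for (i = 0; i < len; i++)
        for (c = 0; c < channels; c++)
            dst[i * channels + c] = clip_round_int16(src[c * stride + i]);
}
```

A caller with two channels stored 4 floats apart and 3 valid samples each
would pass len=3, channels=2, stride=4.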

> By the way, have you benchmarked SSE2 optimized "float_to_int16_*" functions?
> On what kind of CPU should they be faster than the SSE versions?

(cycles)
k8:
4676 float_to_int16_c
 818 float_to_int16_3dnow
 698 float_to_int16_sse
 691 float_to_int16_sse2
6654 float_to_int16_interleave_c
1965 float_to_int16_interleave_3dnow
1161 float_to_int16_interleave_sse
1304 float_to_int16_interleave_sse2

conroe:
3040 float_to_int16_c
 457 float_to_int16_sse
 356 float_to_int16_sse2
4586 float_to_int16_interleave_c
1030 float_to_int16_interleave_sse
1071 float_to_int16_interleave_sse2

penryn:
3164 float_to_int16_c
 505 float_to_int16_sse
 324 float_to_int16_sse2
4910 float_to_int16_interleave_c
1062 float_to_int16_interleave_sse
 782 float_to_int16_interleave_sse2

prescott-celeron:
8770 float_to_int16_c
1596 float_to_int16_sse
 738 float_to_int16_sse2
3670 float_to_int16_interleave_c
3500 float_to_int16_interleave_sse
2219 float_to_int16_interleave_sse2

> And SSE version looks very suspicious, is it really correct?

fixed.

--Loren Merritt



