[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Siarhei Siamashka siarhei.siamashka
Fri Jul 18 09:07:18 CEST 2008


On Wednesday 16 July 2008, Loren Merritt wrote:
> On Wed, 16 Jul 2008, Siarhei Siamashka wrote:
> > But could you also benchmark SSE version of float_to_int16_interleave
> > from my original submission on the cores where SSE2 was winning? It is
> > quite a bit faster than the code from SVN in my tests:
> > FLOAT_TO_INT16_INTERLEAVE(sse,
> >    "1:                              \n"
> >    "cvtps2pi  (%2,%0), %%mm0        \n"
> >    "cvtps2pi 8(%2,%0), %%mm2        \n"
> >    "cvtps2pi  (%3,%0), %%mm1        \n"
> >    "cvtps2pi 8(%3,%0), %%mm3        \n"
> >    "add         $16,   %0           \n"
> >    "packssdw    %%mm1, %%mm0        \n"
> >    "packssdw    %%mm3, %%mm2        \n"
> >    "pshufw      $0xD8, %%mm0, %%mm0 \n"
> >    "pshufw      $0xD8, %%mm2, %%mm2 \n"
> >    "movq        %%mm0, -16(%1,%0)   \n"
> >    "movq        %%mm2, -8(%1,%0)    \n"
> >    "js 1b                           \n"
> >    "emms                            \n"
>
> k8:
> 1139 float_to_int16_interleave_siarhei
> 1161 float_to_int16_interleave_sse
> 1304 float_to_int16_interleave_sse2
>
> conroe:
>  978 float_to_int16_interleave_siarhei
> 1030 float_to_int16_interleave_sse
> 1071 float_to_int16_interleave_sse2
>
> penryn:
>  997 float_to_int16_interleave_siarhei
> 1062 float_to_int16_interleave_sse
>  782 float_to_int16_interleave_sse2
>
> prescott-celeron:
> 3846 float_to_int16_interleave_siarhei
> 3500 float_to_int16_interleave_sse
> 2219 float_to_int16_interleave_sse2

Thanks, this is interesting. In the long run it is natural to expect the
SSE2 implementation to be faster on new cores. For now, though, some cores
support both SSE and SSE2 but run the SSE version of
float_to_int16_interleave faster (k8 and conroe in your numbers). Does it
make sense to add an extra check to the SSE vs. SSE2 selection (based on
additional bits of information from CPUID, or on a short RDTSC benchmark run
in 'dsputil_mmx.c' at DSPContext initialization)? Or would that just be a
source of extra complexity that is not worth the bother?
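
To make the RDTSC idea concrete, here is a rough sketch of what such an
init-time selection could look like. It is not taken from dsputil_mmx.c: the
names interleave_a, interleave_b, best_time and
select_float_to_int16_interleave are hypothetical, and the two scalar loops
only stand in for the real SSE and SSE2 routines. It assumes two channels and
samples already scaled to the int16 range, and it rounds half away from zero,
which differs slightly from what cvtps2pi does under the default rounding mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
static uint64_t read_ticks(void) { return __rdtsc(); }         /* RDTSC counter */
#else
static uint64_t read_ticks(void) { return (uint64_t)clock(); } /* coarse fallback */
#endif

/* Hypothetical signature: interleave two float channels into int16 pairs. */
typedef void (*f2i16_fn)(int16_t *dst, const float *left,
                         const float *right, size_t len);

/* Saturating conversion, similar in effect to packssdw clamping. */
static int16_t float_to_s16(float v)
{
    if (v >= 32767.0f)  return INT16_MAX;
    if (v <= -32768.0f) return INT16_MIN;
    return (int16_t)(int)(v < 0 ? v - 0.5f : v + 0.5f);
}

/* Stand-in for one implementation (e.g. the SSE variant). */
static void interleave_a(int16_t *dst, const float *left,
                         const float *right, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[2 * i]     = float_to_s16(left[i]);
        dst[2 * i + 1] = float_to_s16(right[i]);
    }
}

/* Stand-in for the other implementation (e.g. SSE2): unrolled by two. */
static void interleave_b(int16_t *dst, const float *left,
                         const float *right, size_t len)
{
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        dst[2 * i]     = float_to_s16(left[i]);
        dst[2 * i + 1] = float_to_s16(right[i]);
        dst[2 * i + 2] = float_to_s16(left[i + 1]);
        dst[2 * i + 3] = float_to_s16(right[i + 1]);
    }
    if (i < len) {                      /* odd tail sample */
        dst[2 * i]     = float_to_s16(left[i]);
        dst[2 * i + 1] = float_to_s16(right[i]);
    }
}

/* Time a candidate on a small buffer; keep the best of several runs. */
static uint64_t best_time(f2i16_fn fn)
{
    enum { LEN = 256, RUNS = 16 };
    static float left[LEN], right[LEN];
    static int16_t dst[2 * LEN];
    volatile int16_t sink;              /* keeps the stores observable */
    uint64_t best = UINT64_MAX;

    assert(fn != NULL);
    for (int i = 0; i < LEN; i++) {
        left[i]  = (float)(i * 100 - 12800);
        right[i] = -left[i];
    }
    for (int run = 0; run < RUNS; run++) {
        uint64_t t0 = read_ticks();
        fn(dst, left, right, LEN);
        uint64_t t1 = read_ticks();
        sink = dst[0];
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    (void)sink;
    return best;
}

/* Would be called once at init; ties go to interleave_a. */
static f2i16_fn select_float_to_int16_interleave(void)
{
    return best_time(interleave_b) < best_time(interleave_a)
         ? interleave_b : interleave_a;
}
```

The result of select_float_to_int16_interleave() would be stored in the
function pointer once at init, so the decoding loop itself pays nothing
extra. The obvious downside is that such a micro-benchmark is noisy and
makes the choice non-deterministic between runs, which is part of the
complexity I am asking about.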

Nevertheless, it would be nice to benchmark the SSE implementations of
float_to_int16_interleave on cores that do not support SSE2, such as
P3 and K7. Unfortunately, I no longer have access to any of these.

-- 
Best regards,
Siarhei Siamashka



