[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc
Mon May 4 06:46:15 CEST 2009
On Sun, 3 May 2009 21:21:19 -0700
Jason Garrett-Glaser <darkshikari at gmail.com> wrote:
> On Sun, May 3, 2009 at 8:39 PM, Bobby Bingham <uhmmmm at gmail.com>
> wrote:
> > On Sat, 25 Apr 2009 03:03:30 +0000 (UTC)
> > Loren Merritt <lorenm at u.washington.edu> wrote:
> >> On Fri, 24 Apr 2009, Bobby Bingham wrote:
> >> > Attached are patches to move flac_encode_residual_lpc to
> >> > dsputils, and to add SSE3 and SSE4 implementations. I wrote the
> >> > SSE3 first, but since it doesn't have signed 32x32
> >> > multiplication AFAICT, I ended up using double precision floats
> >> > for it, and the result is code that's slower than the C version.
> >> > Unless somebody has a suggestion of how to fix this, I'll drop
> >> > the SSE3 version.
> >> >
> >> > I tried an SSE4 version because it does have signed 32x32->32
> >> > multiplication, like the C version uses. Unfortunately, I don't
> >> > have an SSE4-capable processor to test it with, so I can't check
> >> > its speed or even its correctness. Benchmarks welcome.
> >> fails regression test on my Penryn.
> >> > +// TODO: look into palignr?
> >> Yea, do that. It should be possible to load each sample just once
> >> (aligned), and do all other manipulation in registers.
> >> There are no cpus with both lddqu and sse4, so you're paying the
> >> full cost of unaligned loads.
> > I've changed the code to use palignr, and hopefully fixed it to work
> > correctly now. I've also removed the SSE3 code from this patch as I
> > haven't managed to get it any faster by using integer arithmetic
> > yet.
> > "movdqu -16(%3,%0), %%xmm4 \n\t" // xmm4 = smp [i-4 .. i-1]
> > "movdqu -12(%3,%0), %%xmm6 \n\t" // xmm6 = smp [i-3 .. i ]
> Any reason you didn't use palignr here?
Because it slipped my mind?
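
For reference, a minimal sketch of the palignr idea using intrinsics rather
than the patch's inline asm. The helper name window_shift1 is hypothetical
and is not taken from the patch. It assumes both 4-sample blocks are 16-byte
aligned, and falls back to a scalar model of the same shift when the build
does not enable SSSE3.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 maps to palignr */
#endif

/*
 * Hypothetical helper: given two aligned blocks
 *   prev = smp[i-4 .. i-1] and cur = smp[i .. i+3],
 * produce out = smp[i-3 .. i] without a second (unaligned) load.
 */
void window_shift1(const int32_t *prev, const int32_t *cur, int32_t *out)
{
    /* Assumption of this sketch: both inputs are 16-byte aligned. */
    assert(((uintptr_t)prev & 15) == 0 && ((uintptr_t)cur & 15) == 0);
#if defined(__SSSE3__)
    __m128i lo = _mm_load_si128((const __m128i *)prev);
    __m128i hi = _mm_load_si128((const __m128i *)cur);
    /* Concatenate hi:lo, shift right by 4 bytes, keep the low 128 bits. */
    __m128i w = _mm_alignr_epi8(hi, lo, 4);
    _mm_storeu_si128((__m128i *)out, w);
#else
    /* Scalar model of the same byte shift for builds without SSSE3. */
    out[0] = prev[1];
    out[1] = prev[2];
    out[2] = prev[3];
    out[3] = cur[0];
#endif
}
```

Shifts of 8 and 12 bytes would give the windows starting at i-2 and i-1 in
the same way, so each block of samples only needs to be loaded once.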
> > "movdqu %%xmm5, %2 \n\t"
> Is there a good reason why this store has to be unaligned?
Not if the calling code is changed to ensure the input and output
arrays have the same alignment.
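
A small sketch of what "same alignment" would mean for the caller. The
function name same_alignment16 is hypothetical. If the two arrays sit at the
same offset modulo 16, a loop can handle a few leading samples in scalar
code and then use aligned loads and aligned stores for the rest.

```c
#include <stdint.h>

/*
 * Hypothetical check: nonzero if a and b have the same offset within a
 * 16-byte line, i.e. aligning one pointer also aligns the other.
 */
int same_alignment16(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 15) == 0;
}
```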
> > "phaddd %%xmm1, %%xmm0 \n\t"
> > "phaddd %%xmm3, %%xmm2 \n\t"
> > "phaddd %%xmm2, %%xmm0 \n\t" // xmm0 = [p0, p1, p2, p3]
> Did you not find a better way of doing this without PHADD, given how
> slow it is?
Also slipped my mind.
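
One possible PHADD-free reduction, sketched with SSE2 intrinsics: transpose
the four accumulators with unpack instructions and finish with plain adds.
The helper name hsum4x4 and the flat 16-element input layout are assumptions
made for this example. Whether it actually beats three phaddd instructions
depends on the CPU and would need benchmarking.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: punpck*dq, punpck*qdq, paddd */
#endif

/*
 * Hypothetical helper: acc holds four accumulators of four 32-bit lanes
 * each (accumulator k is acc[4*k .. 4*k+3]); out[k] = sum of its lanes.
 */
void hsum4x4(const int32_t acc[16], int32_t out[4])
{
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128((const __m128i *)(acc + 0));
    __m128i b = _mm_loadu_si128((const __m128i *)(acc + 4));
    __m128i c = _mm_loadu_si128((const __m128i *)(acc + 8));
    __m128i d = _mm_loadu_si128((const __m128i *)(acc + 12));

    /* [a0+a2, b0+b2, a1+a3, b1+b3] */
    __m128i s0 = _mm_add_epi32(_mm_unpacklo_epi32(a, b),
                               _mm_unpackhi_epi32(a, b));
    /* [c0+c2, d0+d2, c1+c3, d1+d3] */
    __m128i s1 = _mm_add_epi32(_mm_unpacklo_epi32(c, d),
                               _mm_unpackhi_epi32(c, d));
    /* Add the even-lane and odd-lane halves: [sum a, sum b, sum c, sum d] */
    __m128i r = _mm_add_epi32(_mm_unpacklo_epi64(s0, s1),
                              _mm_unpackhi_epi64(s0, s1));
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Scalar fallback for builds without SSE2. */
    for (int k = 0; k < 4; k++)
        out[k] = acc[4 * k] + acc[4 * k + 1] + acc[4 * k + 2] + acc[4 * k + 3];
#endif
}
```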
> pmulld is really really slow (6 clocks on Nehalem!). If you make
> certain assumptions about the nature of the input data (say, restrict
> your code to only 16-bit samples), you might be able to use a faster
Well, when trying to get rid of the floating point conversions in the
SSE3 version, I tried using 16-bit multiplies, relying on the fact that
the LPC coefficients are <16 bits. I didn't manage to get it even to
the speed of the floating point code because, like Loren said, the
samples might be 17 bits after stereo decorrelation, and there doesn't
seem to be any single signed 16x16->32 multiply instruction that I've
seen. I expect that pmulld is faster than trying to implement
something similar on top of other instructions, but as I don't have an
SSE4-capable CPU, I can't really benchmark it.
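
To make the bit-width argument concrete, here is a scalar sketch. It is
modelled on the usual FLAC LPC predictor and is not the code from the patch;
the function names and signatures are hypothetical. It shows the 32x32->32
multiply in the inner loop, and why a side channel computed from 16-bit
input no longer fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Side channel of stereo decorrelation: difference of two 16-bit samples. */
int32_t side_sample(int16_t left, int16_t right)
{
    return (int32_t)left - (int32_t)right;
}

/* Nonzero if v is representable as a signed 16-bit value. */
int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}

/*
 * Hypothetical scalar reference: res[i] = smp[i] - (prediction >> shift).
 * The first `order` samples are copied through as warm-up.
 * Assumes >> on a negative int32_t is an arithmetic shift, which is
 * implementation-defined in C but holds on common compilers.
 */
void lpc_residual(int32_t *res, const int32_t *smp, int n,
                  int order, const int32_t *coefs, int shift)
{
    assert(order >= 1 && n >= order);
    for (int i = 0; i < order; i++)
        res[i] = smp[i];
    for (int i = order; i < n; i++) {
        int32_t pred = 0;
        for (int j = 0; j < order; j++)
            pred += coefs[j] * smp[i - j - 1];  /* 32x32->32 multiply */
        res[i] = smp[i] - (pred >> shift);
    }
}
```

With left = 32767 and right = -32768 the side sample is 65535, which needs
17 bits as a signed value, so a 16-bit multiply operand cannot hold it.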
> > "movdqa %%xmm5, %%xmm9 \n\t"
> Does this asm really need to be x86_64-only? If so, how about an
> x86_32 version?
I'll look into it.