[FFmpeg-devel] [PATCH] SSE3/4 implementation of flac_encode_residual_lpc
Sat May 30 18:40:01 CEST 2009
On Fri, 29 May 2009 17:00:12 +0000 (UTC)
Loren Merritt <lorenm at u.washington.edu> wrote:
> On Thu, 28 May 2009, Bobby Bingham wrote:
> > Attached is a version I hope is about ready for inclusion.
> > Provides an overall encoding speedup of ~30% at
> > compression_level=12.
> > "movdqa %%xmm3, %%xmm6 \n\t" // verify that 16 bits is enough
> > "movdqa %%xmm5, %%xmm7 \n\t"
> > "pslld $16, %%xmm6 \n\t"
> > "pslld $16, %%xmm7 \n\t"
> > "psrad $16, %%xmm6 \n\t"
> > "psrad $16, %%xmm7 \n\t"
> > "pcmpeqd %%xmm3, %%xmm6 \n\t"
> > "pcmpeqd %%xmm5, %%xmm7 \n\t"
> > "pand %%xmm6, %%xmm7 \n\t"
> > "pmovmskb %%xmm7, %2 \n\t"
> > "cmp $0xffff, %2 \n\t"
> > "jne 2f \n\t"
> About half of the invocations to flac_encode_residual_lpc will know
> in advance that all of the samples fit in 16bit, so those shouldn't
> check this at all.
I've made this change in the attached patch. But in my testing, any
speed difference is so small that it gets lost in the noise, and I think
the change makes the code less readable, so I'm tempted to revert it.
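
For readers following the quoted asm: each 32-bit lane is shifted left by
16, shifted back arithmetically, and compared with the original, which
succeeds only for values in [-32768, 32767]. A scalar C model of that
check is sketched here. The function names are made up for illustration
and are not from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x and compare with x itself.
 * Scalar model of pslld $16 / psrad $16 / pcmpeqd on a single lane.
 * (Hypothetical helper, not part of the patch.) */
static int fits_in_16(int32_t x)
{
    int32_t low = (int32_t)((uint32_t)x & 0xffffu);
    int32_t narrowed = (low ^ 0x8000) - 0x8000; /* portable sign extension */
    return narrowed == x;
}

/* AND the per-sample results together, as the pand in the asm does. */
static int all_fit_in_16(const int32_t *v, size_t n)
{
    int ok = 1;
    for (size_t i = 0; i < n; i++)
        ok &= fits_in_16(v[i]);
    return ok;
}
```

In the asm the combined mask is then read with pmovmskb and compared
against 0xffff, which plays the role of the final return value here.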
> For the remainder, this logic should be doable
> with just 1 paddd and 1 por per vector. Merge several vectors before
I'm afraid I don't quite see what you mean by using 1 paddd and 1 por.
The attached patch does have a slight improvement in this piece of
code, but I doubt it's what you meant.
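
One possible reading of the paddd/por suggestion, offered as a guess
since the thread does not spell it out: add a bias of 0x8000 to every
lane (paddd), OR the biased values into an accumulator (por), and test
the upper 16 bits only once at the end. A value x lies in
[-32768, 32767] exactly when x + 0x8000, taken as unsigned, is below
0x10000. A scalar sketch with a hypothetical function name:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every element of v fits in a signed 16-bit integer.
 * The add models paddd with a constant 0x8000 in each lane, the |=
 * models por into an accumulator. (Hypothetical helper.) */
static int all_fit_in_16_biased(const int32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= (uint32_t)v[i] + 0x8000u; /* wraps modulo 2^32 */
    /* Any out-of-range element leaves a set bit in the upper half. */
    return (acc >> 16) == 0;
}
```

This needs no per-vector compare, and the single test on the accumulator
can sit outside the loop.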
> The double branch is inelegant. It could be removed if you either
> wrote the whole loop in asm, or split the asm block and branched in
> C. Especially if the 16bit checking is moved to a separate loop as
> appropriate for not always needing to run it.
Done: I split the asm block and now branch in C.
> With 6 "r" constraints, you need #if HAVE_6REGS.
Splitting the asm also means that I'm down to 5 "r" constraints.
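
As a rough illustration of the "branch in C" structure, here is a scalar
sketch. It is not the code from the patch: the two residual functions
are hypothetical stand-ins for the two asm blocks, the residual formula
is a plain scalar LPC residual assumed for illustration, and the range
check is done once up front rather than per vector.

```c
#include <stdint.h>

/* Hypothetical range check, hoisted into its own loop. */
static int samples_fit_16(const int32_t *smp, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc |= (uint32_t)smp[i] + 0x8000u;
    return (acc >> 16) == 0;
}

/* Stand-in for the fast asm block: samples are narrowed to 16 bits,
 * so it is only valid when every sample fits. */
static void residual_lpc_16(int32_t *res, const int32_t *smp, int n,
                            int order, const int32_t *coefs, int shift)
{
    for (int i = 0; i < order; i++)
        res[i] = smp[i];
    for (int i = order; i < n; i++) {
        int32_t pred = 0;
        for (int j = 0; j < order; j++)
            pred += coefs[j] * (int16_t)smp[i - j - 1];
        /* >> on a negative value is assumed to be an arithmetic shift */
        res[i] = smp[i] - (pred >> shift);
    }
}

/* Stand-in for the general asm block: full 32-bit samples. */
static void residual_lpc_32(int32_t *res, const int32_t *smp, int n,
                            int order, const int32_t *coefs, int shift)
{
    for (int i = 0; i < order; i++)
        res[i] = smp[i];
    for (int i = order; i < n; i++) {
        int32_t pred = 0;
        for (int j = 0; j < order; j++)
            pred += coefs[j] * smp[i - j - 1];
        res[i] = smp[i] - (pred >> shift);
    }
}

/* The branch lives in C; callers that already know the samples fit
 * in 16 bits pass known_16bit = 1 and skip the check entirely. */
static void encode_residual(int32_t *res, const int32_t *smp, int n,
                            int order, const int32_t *coefs, int shift,
                            int known_16bit)
{
    if (known_16bit || samples_fit_16(smp, n))
        residual_lpc_16(res, smp, n, order, coefs, shift);
    else
        residual_lpc_32(res, smp, n, order, coefs, shift);
}
```

With the branch in C, each block needs fewer operands than a single
combined block with an internal fallback jump would.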
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 12321 bytes
Desc: not available