[FFmpeg-devel] [PATCH 04/10] lavc/flacenc: add sse4 version of the lpc encoder

James Darnley james.darnley at gmail.com
Thu Feb 13 14:07:38 CET 2014


On 2014-02-12 12:41, Christophe Gisquet wrote:
> If I'm not mistaken and x264asm isn't already brighter than me, you're
> forcing the loading of shift into a gpr, while you really never have
> to.
> This 6th register will always be on stack, so you need one less gpr in
> all cases.
> 
> I'm not sure, but is it possible to leave order or len wherever they
> are for x86, so as to save another gpr? That may require to manually
> load the args.

I managed to reduce the function to 5 auto-loaded args.  It doesn't much
matter where r5mp (shift) actually lives, as I only read it once and can
then use r5 however I want.  That means I don't need r7 on x64, so I
have dropped the function down to 7 registers.

Further reductions don't seem worth the amount of code *I think* I would
have to add (it is a lot!) to ensure correct loading on all 3 platforms.
Given the length of these functions, I don't think avoiding the store of
1 more register would save much time at all.
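
Just to illustrate what I mean, the prologue ends up along these lines
(sketch only: the xmm count and the use of m3 for shift are placeholders,
not necessarily what I'll send in the next version):

INIT_XMM sse4
cglobal flac_enc_lpc_16, 5, 7, 4, res, smp, len, order, coefs, shift
    movd m3, shiftm            ; read shift once from wherever it lives;
                               ; after this r5 is free as a scratch register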

>> +.looplen:
>> +    pxor m0,  m0
>> +    xor posj, posj
>> +    xor negj, negj
>> +    .looporder:
>> +        movd   m2, [coefsq+posj*4] ; c = coefs[j]
>> +        SPLATD m2
>> +        movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1]
>> +        pmulld m1,  m2
>> +        paddd  m0,  m1             ; p += c * s
>> +
>> +        add posj, 1
>> +        sub negj, 1
>> +        cmp posj, ordermp
>> +    jne .looporder
> 
> Potentially stupid question: do the add and sub get compiled to
> inc/dec?  Is there a benefit compared to adding/subtracting 4? (I
> guess it does)
> Also, maybe not worthwhile, coefsq could be incremented by orderq*4,
> posj set to -orderq, and then you would do:
> dec negj
> inc posj
> jl/jnz .looporder

Changing to inc/dec and dropping the cmp saves 8 bytes' worth of
instructions, so I'll make the change.
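
With coefsq advanced by orderq*4 beforehand and posj starting at -order,
as you describe, the inner loop should end up roughly like this (a sketch
of what I intend, not the final code):

.looporder:
    movd   m2, [coefsq+posj*4] ; c = coefs[j], coefsq pre-advanced by order*4
    SPLATD m2
    movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1]
    pmulld m1,  m2
    paddd  m0,  m1             ; p += c * s

    dec negj
    inc posj                   ; ZF is set once posj counts up to 0
jnz .looporder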

>> +    movu  [resq], m1               ; res[i] = smp[i] - (p >> shift)
>> +
>> +    add resq, mmsize
>> +    add smpq, mmsize
>> +    sub lenmp, mmsize/4
>> +jg .looplen
> 
> Equivalent trick here if len is in a reg: add 4*len*mmsize to resq,
> neg lenq then:
> movu  [resq+4*lenq], m1
> add smpq, mmsize
> add lenq, mmsize/4
> jg .looplen
> There are probably errors in what I gave, but this should be
> sufficient to give you the idea.

Actually I re-use r2 (len) on x86 to hold one of my j offsets above, so
I couldn't do this trick without working around that.  I will have a go
at it and see how much extra code it requires.
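
For reference, my reading of your suggested outer loop is roughly the
following (it assumes len stays in a gpr for the whole function, which is
exactly what the r2 reuse on x86 currently prevents):

    lea  resq, [resq+lenq*4]   ; point resq past the end of the output
    neg  lenq                  ; count upwards from -len towards 0
.looplen:
    ...                        ; inner order loop as before
    movu  [resq+lenq*4], m1    ; res[i] = smp[i] - (p >> shift)

    add smpq, mmsize
    add lenq, mmsize/4
jl .looplen                    ; keep looping while lenq is still negative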

