[FFmpeg-devel] [PATCH 04/10] lavc/flacenc: add sse4 version of the lpc encoder
James Darnley
james.darnley at gmail.com
Thu Feb 13 14:07:38 CET 2014
On 2014-02-12 12:41, Christophe Gisquet wrote:
> If I'm not mistaken and x264asm isn't already brighter than me, you're
> forcing the loading of shift into a gpr, while you really never have
> to.
> This 6th register will always be on stack, so you need one less gpr in
> all cases.
>
> I'm not sure, but is it possible to leave order or len wherever they
> are for x86, so as to save another gpr? That may require to manually
> load the args.
I managed to reduce the function to 5 auto-load args. It doesn't much
matter where r5mp (shift) really is, as I only use it once and can then
use r5 as I want. That means I don't need r7 on x64, so I have dropped
that down to 7 registers.
More reductions don't seem worth the amount of code *I think* I would
have to add (it is a lot!) to ensure correct loading on all 3 platforms.
With the length of these functions I don't think it would save much
time at all to avoid storing 1 more register.
>> +.looplen:
>> + pxor m0, m0
>> + xor posj, posj
>> + xor negj, negj
>> + .looporder:
>> + movd m2, [coefsq+posj*4] ; c = coefs[j]
>> + SPLATD m2
>> + movu m1, [smpq+negj*4-4] ; s = smp[i-j-1]
>> + pmulld m1, m2
>> + paddd m0, m1 ; p += c * s
>> +
>> + add posj, 1
>> + sub negj, 1
>> + cmp posj, ordermp
>> + jne .looporder
>
> Potentially stupid question: do the add and sub get compiled to
> inc/dec? Is there a benefit compared to adding/subtracting 4? (I
> guess it does)
> Also, maybe not worthwhile, coefsq could be incremented by orderq*4,
> posj set to -orderq, and then you would do:
> dec negj
> inc posj
> jl/jnz .looporder
Changing to inc/dec and dropping the cmp saves 8 bytes' worth of
instructions, so I'll make the change.
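
For readers following along, here is a rough C sketch of the counting
trick suggested above. It is not code from the patch: the function names
and the coefs_end pointer are made up for illustration. The first
function compares j against order on every pass. The second points past
the end of the coefficients and counts posj up from -order to zero, so
the loop condition is just "did the increment reach zero", which is what
lets the assembly drop the cmp.

```c
#include <assert.h>
#include <stdint.h>

/* Plain form: j is compared against order on every iteration. */
static int32_t predict_cmp(const int32_t *coefs, const int32_t *smp,
                           int i, int order)
{
    int32_t p = 0;
    for (int j = 0; j < order; j++)
        p += coefs[j] * smp[i - j - 1];   /* p += c * s */
    return p;
}

/* Count-up form (hypothetical names): posj runs from -order to -1 and
 * indexes backwards from the end of coefs; negj runs from 0 downwards.
 * The loop ends when posj reaches zero, so no separate compare against
 * order is needed. Assumes order >= 1, like a loop tested at the bottom. */
static int32_t predict_countup(const int32_t *coefs, const int32_t *smp,
                               int i, int order)
{
    assert(order >= 1);
    const int32_t *coefs_end = coefs + order;
    int32_t p = 0;
    int posj = -order;
    int negj = 0;
    do {
        p += coefs_end[posj] * smp[i + negj - 1];
        negj--;
        posj++;
    } while (posj != 0);
    return p;
}
```

Both functions visit the same coefficient/sample pairs in the same
order. The sketch uses plain int32_t arithmetic, so unlike pmulld it
must not be fed values that overflow.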
>> + movu [resq], m1 ; res[i] = smp[i] - (p >> shift)
>> +
>> + add resq, mmsize
>> + add smpq, mmsize
>> + sub lenmp, mmsize/4
>> +jg .looplen
>
> Equivalent trick here if len is in a reg: add 4*len*mmsize to resq,
> neg lenq then:
> movu [resq+4*lenq], m1
> add smpq, mmsize
> add lenq, mmsize/4
> jg .looplen
> There are probably errors in what I gave, but this should be
> sufficient to give you the idea.
Actually I re-use r2 (len) on x86 for holding one of my j offsets above
so I couldn't do this trick without working around this. I will have a
go at it and see how much extra code it requires.
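
As a scalar illustration of the len trick discussed above (again a
sketch, not the patch's code; residual_countup, smp_end and res_end are
hypothetical names), both pointers can be moved to the end of their
buffers and a single negative index counted up towards zero. One
register then serves as both the store offset and the loop counter,
which is why it only works if len stays in a register of its own. The
loop runs while the index is still negative.

```c
#include <stdint.h>

/* Computes res[i] = smp[i] - (p >> shift) for i in [order, len), where
 * p is the sum of coefs[j] * smp[i-j-1]. The first `order` entries of
 * res are left untouched by this sketch.
 * Assumption: >> on a negative int32_t is an arithmetic shift, as on
 * the usual x86 compilers. */
static void residual_countup(int32_t *res, const int32_t *smp, int len,
                             int order, const int32_t *coefs, int shift)
{
    const int32_t *smp_end = smp + len;   /* one past the last sample   */
    int32_t *res_end = res + len;         /* one past the last residual */

    /* i is the offset from the end: it starts at order - len and the
     * loop stops when it reaches zero. */
    for (int i = order - len; i < 0; i++) {
        int32_t p = 0;
        for (int j = 0; j < order; j++)
            p += coefs[j] * smp_end[i - j - 1];
        res_end[i] = smp_end[i] - (p >> shift);
    }
}
```

With len no larger than order the loop body never runs, matching the
plain i = order .. len-1 form.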