[FFmpeg-devel] [PATCH 2/2] x86/dsputilenc: implement SSE2 versions of pix_{sum16, norm1}
James Almer
jamrial at gmail.com
Tue May 27 18:34:30 CEST 2014
On 27/05/14 5:09 AM, Christophe Gisquet wrote:
> Hi,
>
> 2014-05-27 9:27 GMT+02:00 James Almer <jamrial at gmail.com>:
>> - mov r2, r1
>> - neg r2
>> - shl r2, 4
>> - sub r0, r2
> [...]
>> - movd eax, m6
>> - and eax, 0xffff
>> + paddw m4, m3
>> +%if mmsize == 8
>> + add r0, r1
>> +%else
>> + lea r0, [r0+r1*2]
>> +%endif
>> + dec r2
>> + jne .loop
>
> Sorry for not doing my work of sufficiently studying these changes, but...
> Are you removing some corner cases handling? The top part is maybe a
> better loop control in your code, but the second part you remove looks
> like some overflow prevention.
>
Ah, I did not intend to remove the "and eax, 0xffff" at all. Good thing you
noticed, thanks.
>
>> INIT_MMX mmx
>> +PIX_SUM16 0, 16
>> +INIT_XMM sse2
>> +PIX_SUM16 6, 8
>> +
> [...]
>> +INIT_MMX mmx
>> +PIX_NORM1 0, 16
>> +INIT_XMM sse2
>> +PIX_NORM1 6, 8
>
> PHADDW may map to an ssse3 instruction. I suspect it's probably not
> worth the 3% cycle count decrease, but do you have plans about it? (eg
> in a later patch)
The HADDW macro does not use phaddw if ssse3 is available.
And as you said that instruction is slower in most cases. To horizontally add
the values in a single reg pshufd+paddd will always be faster.
IMO There's no point in adding an SSSE3 version that's going to be slower.
I however might add a XOP version, since its horizontal add instructions are
single reg and fast.
>
>> --- a/libavutil/x86/x86util.asm
>> +++ b/libavutil/x86/x86util.asm
>> @@ -288,7 +288,12 @@
>> paddd %1, %2
>> %endif
>> %if notcpuflag(xop) || sizeof%1 != 16
>> +%if cpuflag(mmxext)
>> PSHUFLW %2, %1, q0032
>> +%else ; mmx
>> + mova %2, %1
>> + psrlq %2, 32
>> +%endif
>> paddd %1, %2
>> %endif
>> %undef %1
>
> If I'm not mistaken, x264 folks don't care about mmx, so no need to
> wait for their opinion on this change. This part OK then.
Actually, they do use mmx. And good think you mention this because now i see
they have been using HADD* macros in mmx functions for a while now.
Guess nobody tried x264 encoding with pre-SSE2 CPUs to experience a crash.
> Not tested but looks good to me otherwise.
Will send an updated patch in a moment. Thanks for reviewing.
>
> Best regards,
>
More information about the ffmpeg-devel
mailing list