[FFmpeg-devel] [PATCH 2/2] x86/dsputilenc: implement SSE2 versions of pix_{sum16, norm1}

Tue May 27 18:34:30 CEST 2014

On 27/05/14 5:09 AM, Christophe Gisquet wrote:
> Hi,
> 
> 2014-05-27 9:27 GMT+02:00 James Almer <jamrial at gmail.com>:
>> -    mov          r2, r1
>> -    neg          r2
>> -    shl          r2, 4
>> -    sub          r0, r2
> [...]
>> -    movd        eax, m6
>> -    and         eax, 0xffff
>> +    paddw        m4, m3
>> +%if mmsize == 8
>> +    add          r0, r1
>> +%else
>> +    lea          r0, [r0+r1*2]
>> +%endif
>> +    dec r2
>> +    jne .loop
> 
> Sorry for not doing my work of sufficiently studying these changes, but...
> Are you removing some corner cases handling? The top part is maybe a
> better loop control in your code, but the second part you remove looks
> like some overflow prevention.
> 

Ah, I did not intend to remove the "and eax, 0xffff" at all. Good thing you 
noticed, thanks.

> 
>>  INIT_MMX mmx
>> +PIX_SUM16 0, 16
>> +INIT_XMM sse2
>> +PIX_SUM16 6, 8
>> +
> [...]
>> +INIT_MMX mmx
>> +PIX_NORM1 0, 16
>> +INIT_XMM sse2
>> +PIX_NORM1 6, 8
> 
> PHADDW may map to an ssse3 instruction. I suspect it's probably not
> worth the 3% cycle count decrease, but do you have plans about it? (eg
> in a later patch)

The HADDW macro does not use phaddw if ssse3 is available.
And as you said that instruction is slower in most cases. To horizontally add 
the values in a single reg pshufd+paddd will always be faster.
IMO There's no point in adding an SSSE3 version that's going to be slower.

I however might add a XOP version, since its horizontal add instructions are 
single reg and fast.

> 
>> --- a/libavutil/x86/x86util.asm
>> +++ b/libavutil/x86/x86util.asm
>> @@ -288,7 +288,12 @@
>>      paddd   %1, %2
>>  %endif
>>  %if notcpuflag(xop) || sizeof%1 != 16
>> +%if cpuflag(mmxext)
>>      PSHUFLW %2, %1, q0032
>> +%else ; mmx
>> +    mova    %2, %1
>> +    psrlq   %2, 32
>> +%endif
>>      paddd   %1, %2
>>  %endif
>>  %undef %1
> 
> If I'm not mistaken, x264 folks don't care about mmx, so no need to
> wait for their opinion on this change. This part OK then.

Actually, they do use mmx. And good think you  mention this because now i see 
they have been using HADD* macros in mmx functions for a while now.
Guess nobody tried x264 encoding with pre-SSE2 CPUs to experience a crash.

> Not tested but looks good to me otherwise.

Will send an updated patch in a moment. Thanks for reviewing.

> 
> Best regards,
>