[FFmpeg-devel] [PATCH 4/5] avcodec/h264: add avx 8-bit h264_idct_add

Thu Apr 6 19:06:45 EEST 2017

On 4/6/2017 12:34 PM, James Darnley wrote:
> On 2017-04-05 05:44, James Almer wrote:
>> On 4/4/2017 10:53 PM, James Darnley wrote:
>>> Haswell:
>>>  - 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext
>>>
>>> Skylake-U:
>>>  - 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
>>
>> Again, you should add an SSE2 version first, then an AVX one if it's
>> measurably faster than the SSE2 one.
> 
> On a Yorkfield sse2 is barely faster: 1.02x faster (728±2.1 vs. 710±3.9
> decicycles).  So 1 or 2 cycles
> 
> On a Skylake-U sse2 is most of the speedup: 1.15x faster (661±2.2 vs
> 573±1.9).  Then avx gains a mere 3 cycles: 547±0.5
> 
> On a Haswell sse2 provides only half the speedup:
>  - sse2: 1.06x faster (525±2.5 vs 497±1.0 decicycles)
>  - avx:  1.06x faster (497±1.0 vs 468±1.2 decicycles)
> 
> (All on 64-bit Linux)
> 
> On Nehalem and 64-bit Windows sse2 is slower:  0.92x faster (597±3.0 vs.
> 650±9.3 decicycles)

Slower than what? MMX or AVX?

Your numbers are really confusing. Could you post the actual numbers for
each function instead of doing comparisons?

ff_h264_idct_add_8_mmx  = ??? cycles
ff_h264_idct_add_8_sse2 = ??? cycles
ff_h264_idct_add_8_avx  = ??? cycles
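
For illustration only, a minimal Python sketch of the kind of per-function report meant above. The helper name `speedup_table` is hypothetical, and the Haswell decicycle values are one reading of the figures quoted earlier in this thread (taking the slower number of each pair as the older instruction set), not new measurements.

```python
# Hypothetical helper: list each function's timing next to its speedup
# over a single baseline, instead of chaining pairwise comparisons.

def speedup_table(timings, baseline):
    """Return (name, decicycles, speedup vs. baseline) for each function."""
    base = timings[baseline]
    return [(name, t, base / t) for name, t in timings.items()]

# Assumed reading of the quoted Haswell numbers, in decicycles.
haswell = {
    "ff_h264_idct_add_8_mmx": 525,
    "ff_h264_idct_add_8_sse2": 497,
    "ff_h264_idct_add_8_avx": 468,
}

for name, t, ratio in speedup_table(haswell, "ff_h264_idct_add_8_mmx"):
    print(f"{name:<24} = {t:4d} decicycles ({ratio:.2f}x vs. mmx)")
```

With one baseline per CPU, it is immediately clear what each "faster" or "slower" is relative to.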

Does checkasm support these functions? Maybe you could just run that and
paste the results you get, which would be easier and faster.

> 
> And on that note I should probably recheck the deblock patches I pushed
> a little while ago.
> 
> So...  SSE2 for this function, yay or nay?

By default, always add SSE2 if it's measurably faster than MMX, especially
when you take advantage of the wider XMM regs, like you're doing here.
What you need to check is whether adding AVX is worth it on top of SSE2 when
you're not taking advantage of the wider YMM regs, as is the case here.
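
As a rough sketch of that rule in Python: the function `measurably_faster` is hypothetical, and treating "measurably" as "the gain exceeds the sum of the two reported ± spreads" is an assumption made here for illustration, not an FFmpeg policy. The timings are read from the figures quoted above, again assuming the slower number of each pair belongs to the older instruction set.

```python
# Hypothetical check: is `new` faster than `old` by more than the noise?
# Each timing is (mean, spread) in decicycles, as reported in the thread.

def measurably_faster(new, old, min_gain=0.0):
    """True if the gain exceeds the combined spread and min_gain.

    min_gain is an arbitrary, assumed minimum relative improvement
    (fraction of the old timing) below which a version is not worth adding.
    """
    new_mean, new_err = new
    old_mean, old_err = old
    gain = old_mean - new_mean
    return gain > (new_err + old_err) and gain / old_mean >= min_gain

# Assumed reading of the quoted Skylake-U numbers.
skylake = {"mmx": (661, 2.2), "sse2": (573, 1.9), "avx": (547, 0.5)}
# Assumed reading of the quoted Nehalem numbers (no AVX on that CPU).
nehalem = {"mmx": (597, 3.0), "sse2": (650, 9.3)}

# Step 1: add SSE2 if it beats MMX.
add_sse2 = measurably_faster(skylake["sse2"], skylake["mmx"])
# Step 2: add AVX only if it beats SSE2, not merely MMX.
add_avx = measurably_faster(skylake["avx"], skylake["sse2"])
print(add_sse2, add_avx)
print(measurably_faster(nehalem["sse2"], nehalem["mmx"]))
```

The point of the second step is that AVX is compared against SSE2, since both use XMM regs here; whether a gain that clears the noise is also large enough to justify the extra code remains a judgment call, which is what `min_gain` stands in for.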