[FFmpeg-devel] [PATCH] Add x86-optimized versions of exponent_min().

Justin Ruggles justin.ruggles
Fri Feb 4 01:13:39 CET 2011


On 02/03/2011 06:47 PM, Loren Merritt wrote:

> On Thu, 3 Feb 2011, Justin Ruggles wrote:
>> On 02/03/2011 12:05 AM, Loren Merritt wrote:
>>> On Wed, 2 Feb 2011, Justin Ruggles wrote:
>>>>
>>>> Thanks for the suggestion.  Below is a chart of the results for
>>>> adding ALIGN 8 and ALIGN 16 before each of the 2 loops.
>>>>
>>>> LOOP1/LOOP2   MMX   MMX2   SSE2
>>>> -------------------------------
>>>>  NONE/NONE :  5270   5283   2757
>>>>     NONE/8 :  5200   5077   2644
>>>>    NONE/16 :  5723   3961   2161
>>>>     8/NONE :  5214   5339   2787
>>>>        8/8 :  5198*  5083   2722
>>>>       8/16 :  5936   3902   2128
>>>>    16/NONE :  6613   4788   2580
>>>>       16/8 :  5490   3702   2020
>>>>      16/16 :  5474   3680*  2000*
>>>> (* = fastest time in each column)
>>>
>>> Other things that affect instruction size/count and therefore alignment
>>> include:
>>> * compiling for x86_32 vs x86_64-unix vs win64
>>> * register size (d vs q as per my previous patch)
>>> * whether PIC is enabled (not relevant this time because this function
>>> doesn't use any static consts)
>>
>> Doesn't yasm take these into account when using ALIGN?
> 
> ALIGN computes the number of NOPs to add in order to make the next
> instruction start at an address aligned by the requested amount. But that
> isn't necessarily solving the right problem. If align16 is in some cases
> slower than align8, then clearly it isn't just a case of being slow when
> it doesn't have "enough" alignment.


Indeed. I thought that was strange.
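
For reference, the shape being benchmarked is roughly the following (a
minimal standalone sketch in yasm/NASM syntax for the x86_64 SysV ABI;
the function name, the 256-byte block stride, and the register
assignments are my assumptions, not the actual patch):

    ; void exponent_min(uint8_t *exp, int num_reuse_blocks, int nb_coefs)
    ; For each exponent, take the minimum across the reuse blocks that
    ; follow at a 256-byte stride. Assumes num_reuse_blocks >= 1,
    ; nb_coefs a multiple of 16, and exp 16-byte aligned (movdqa).
    section .text
    global exponent_min_sse2
    exponent_min_sse2:
        ; rdi = exp, esi = num_reuse_blocks, edx = nb_coefs
    ALIGN 16                    ; "LOOP1" in the table: pad with NOPs so
    .loop1:                     ; the branch target is 16-byte aligned
        movdqa  xmm0, [rdi]     ; current minima for 16 exponents
        mov     rax, rdi
        mov     ecx, esi
    ALIGN 16                    ; "LOOP2" (this padding executes on every
    .loop2:                     ; outer iteration, which is part of why it
        add     rax, 256        ; isn't free)
        pminub  xmm0, [rax]     ; byte-wise unsigned min (SSE2)
        dec     ecx
        jg      .loop2
        movdqa  [rdi], xmm0     ; write the minima back
        add     rdi, 16
        sub     edx, 16
        jg      .loop1
        ret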

> One possible cause of such effects is that which instructions are packed
> into a 16-byte aligned window affects the number of instructions that can
> be decoded at once. This applies to every instruction everywhere (if
> decoding is the bottleneck), not just at branch targets. Adding alignment
> in one place can bump some later instruction across a decode-window
> boundary, and whether it does so depends on all of the size factors I
> mentioned.

Ok, that makes sense.
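
Concretely, the byte counts below are the encoded sizes of these SSE2
instructions for these operand combinations; the grouping into windows
is just an illustration of the effect:

    ; the front end fetches/decodes in 16-byte windows, so only whole
    ; instructions that fit in the current window decode together:
    movdqa  xmm0, [rdi+rcx]   ; 5 bytes
    pminub  xmm0, [rsi+rcx]   ; 5 bytes
    movdqa  [rdi+rcx], xmm0   ; 5 bytes  -> 15 bytes, one window
    add     ecx, 16           ; 3 bytes  -> straddles into the next window
    ; NOPs from an ALIGN earlier in the function shift all of this, so
    ; the padding can push an instruction across a window boundary far
    ; away from the ALIGN itself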

>>> * and sometimes not only the mod16 or mod64 alignment matters, but also
>>> the difference in memory address between this function and the rest of the
>>> library.
>>>
>>> While this isn't as bad as gcc's random code generator, don't assume 
>>> that the optimum you found in one configuration will be non-pessimal in 
>>> the others.
>>> If there is a single optimal place to add a single optimal number of NOPs, 
>>> great. But often when I run into alignment weirdness, there is no such 
>>> solution, and the best I can do is poke it with a stick until I find some 
>>> combination of instructions that isn't so sensitive to alignment.
>>
>> I don't have much room to poke around with different instructions in
>> this case.
> 
> One stick to poke with is unrolling.
> 
>> So should we just accept an obviously bad case in one configuration
>> because there is a chance that fixing it would be worse in another?
> 
> My expectation of the effect of this fix on the performance of the
> configurations you haven't benchmarked is positive. If you don't want to
> benchmark them, I won't reject this patch on those grounds.
> 
> I am merely saying that as long as you haven't identified the actual 
> cause of the slowdowns, as long as performance is still random unto you, 
> making decisions based on a thorough benchmark of only one compiler 
> configuration is generalizing from one data point.
> 
>> Even the worst-case versions are 80-90% faster than the C version in the
>> tested configuration (x86_64 unix). Is it likely that the worst case
>> will be much slower in another?
> 
> Not more than 40% slower. (Some confidence, since on this question your
> benchmark counts as 24 data points, not 1.)
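
On the unrolling stick mentioned above: a 2x unroll changes both the
instruction count and the code size, so it reshuffles everything
relative to the decode windows. A hypothetical 2x unroll of the inner
loop from the sketch earlier (now requiring nb_coefs % 32 == 0):

    ALIGN 16
    .loop1:
        movdqa  xmm0, [rdi]
        movdqa  xmm1, [rdi+16]  ; second 16-exponent lane of the unroll
        mov     rax, rdi
        mov     ecx, esi
    .loop2:
        add     rax, 256
        pminub  xmm0, [rax]
        pminub  xmm1, [rax+16]
        dec     ecx
        jg      .loop2
        movdqa  [rdi],    xmm0
        movdqa  [rdi+16], xmm1
        add     rdi, 32
        sub     edx, 32
        jg      .loop1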


I can recompile with "--extra-cflags=-m32 --extra-ldflags=-m32" and add
24 more data points if you think this would be useful.

-Justin


