[FFmpeg-devel] [PATCH] libavutil: add x86 optimized av_popcount

Reimar Döffinger Reimar.Doeffinger at gmx.de
Wed Feb 25 22:09:41 CET 2015


On 25.02.2015, at 19:29, "Ronald S. Bultje" <rsbultje at gmail.com> wrote:
> Hi,
> 
> On Wed, Feb 25, 2015 at 1:21 PM, James Almer <jamrial at gmail.com> wrote:
> 
>> On 25/02/15 2:58 PM, Ronald S. Bultje wrote:
>>> Hi,
>>> 
>>> On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamrial at gmail.com> wrote:
>>> 
>>>> On 25/02/15 2:36 PM, Ronald S. Bultje wrote:
>>>>> Hi James,
>>>>> 
>>>>> On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamrial at gmail.com>
>> wrote:
>>>>> 
>>>>>> On 25/02/15 9:41 AM, Ronald S. Bultje wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamrial at gmail.com>
>>>> wrote:
>>>>>>>> 
>>>>>>>> +#if HAVE_FAST_POPCNT
>>>>>>>> +#if AV_GCC_VERSION_AT_LEAST(4,5)
>>>>>>>> +#ifndef av_popcount
>>>>>>>> +    #define av_popcount   __builtin_popcount
>>>>>>>> +#endif /* av_popcount */
>>>>>>>> +#if HAVE_FAST_64BIT
>>>>>>>> +#ifndef av_popcount64
>>>>>>>> +    #define av_popcount64 __builtin_popcountll
>>>>>>>> +#endif /* av_popcount64 */
>>>>>>>> +#endif /* HAVE_FAST_64BIT */
>>>>>>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */
>>>>>>>> +#endif /* HAVE_FAST_POPCNT */
>>>>>>>> 
>>>>>>> 
>>>>>>> Is this just to get the sse4 popcnt instruction if we compile with
>>>>>>> -mcpu=sse4? The slightly odd thing is that we're using a built-in,
>> yet
>>>>>>> configure still does an arch/cpu check. I'd expect the
>>>> built-in/compiler
>>>>>> to
>>>>>>> do that for us based on -mcpu, and we could always unconditionally
>> use
>>>>>> this
>>>>>>> (as long as gcc >= 4.5); alternatively, you could use inline asm and
>>>> then
>>>>>>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a
>>>>>> little
>>>>>>> odd. I have no objection to it, patch is still fine, just odd.
>>>>>>> 
>>>>>>> Ronald
>>>>>> 
>>>>>> I purposely made the checks for gcc 4.5 and in configure for cpus that
>>>>>> support popcnt
>>>>>> because otherwise __builtin_popcount (at least gcc's) is slower than
>> our
>>>>>> generic
>>>>>> av_popcount_c function from lavu/common.h.
>>>>>> When the CPU supports popcnt the builtin becomes a single inlined
>>>>>> instruction.
>>>>>> 
>>>>>> I tried the __asm__ approach, but the code generated by the builtin
>>>> seemed
>>>>>> better.
>>>>> 
>>>>> 
>>>>> That's interesting, can you show the code you tried?
>>>>> 
>>>>> Ronald
>>>> 
>>>> If instead of builtins i use
>>>> 
>>>> +#if HAVE_INLINE_ASM
>>>> +
>>>> +#ifdef __POPCNT__
>>>> +
>>>> +#define av_popcount av_popcount_abm
>>>> +static av_always_inline av_const int av_popcount_abm(uint32_t a)
>>>> +{
>>>> +    int x;
>>>> +
>>>> +    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
>>>> +    return x;
>>>> +}
>>>> +
>>>> +#if ARCH_X86_64
>>>> +#define av_popcount64 av_popcount64_abm
>>>> +static av_always_inline av_const int av_popcount64_abm(uint64_t a)
>>>> +{
>>>> +    uint64_t x;
>>>> +
>>>> +    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
>>>> +    return x;
>>>> +}
>>>> +#endif
>>>> +
>>>> +#endif /* __POPCNT__ */
>>>> +
>>>> +#endif /* HAVE_INLINE_ASM */
>>>> 
>>>> in the x86/ header from the second version i sent, on x86_32
>> av_cpu_count
>>>> from
>>>> libavutil/cpu.o on Windows (Will not work with other OSes because of
>>>> HAVE_GETPROCESSAFFINITYMASK) generates
>>>> 
>>>> 1af:   f3 0f b8 5c 24 18       popcnt ebx,DWORD PTR [esp+0x18]
>>>> 1b5:   31 c0                   xor    eax,eax
>>>> 1b7:   f3 0f b8 c0             popcnt eax,eax
>>>> 1bb:   01 c3                   add    ebx,eax
>>>> 
>>>> Whereas with the builtins i instead get
>>>> 
>>>> 1af:   f3 0f b8 5c 24 18       popcnt ebx,DWORD PTR [esp+0x18]
>>>> 
>>>> This is probably because of av_popcount64_c (Used unconditionally on
>>>> x86_32) calling
>>>> av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes
>> in
>>>> x86_32.
>>>> The builtin seems to know proc_aff can't be right shifted 32 bits so it
>>>> generates
>>>> a single popcnt.
>>> 
>>> 
>>> It seems to me rather than the second version is defined as "pure" (or
>>> whatever the keyword was), i.e. "we don't depend on external state" ->
>> "if
>>> you call this with a constant, we can calculate this at compile-time".
>>> 
>>> Ronald
>> 
>> Do you know how to get the asm version working right for this one case?
>> 
> 
> No, const should do it but looks like inline asm breaks that assumption.

There are gcc builtins to check if something is a constant and use e.g. the c version then.
Not sure if it works here or makes sense, just in principle.


More information about the ffmpeg-devel mailing list