[FFmpeg-devel] [PATCH] libavutil: add x86 optimized av_popcount

Ronald S. Bultje rsbultje at gmail.com
Wed Feb 25 18:58:18 CET 2015


Hi,

On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamrial at gmail.com> wrote:

> On 25/02/15 2:36 PM, Ronald S. Bultje wrote:
> > Hi James,
> >
> > On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamrial at gmail.com> wrote:
> >
> >> On 25/02/15 9:41 AM, Ronald S. Bultje wrote:
> >>> Hi,
> >>>
> >>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamrial at gmail.com>
> wrote:
> >>>>
> >>>> +#if HAVE_FAST_POPCNT
> >>>> +#if AV_GCC_VERSION_AT_LEAST(4,5)
> >>>> +#ifndef av_popcount
> >>>> +    #define av_popcount   __builtin_popcount
> >>>> +#endif /* av_popcount */
> >>>> +#if HAVE_FAST_64BIT
> >>>> +#ifndef av_popcount64
> >>>> +    #define av_popcount64 __builtin_popcountll
> >>>> +#endif /* av_popcount64 */
> >>>> +#endif /* HAVE_FAST_64BIT */
> >>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */
> >>>> +#endif /* HAVE_FAST_POPCNT */
> >>>>
> >>>
> >>> Is this just to get the sse4 popcnt instruction if we compile with
> >>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, yet
> >>> configure still does an arch/cpu check. I'd expect the
> built-in/compiler
> >> to
> >>> do that for us based on -mcpu, and we could always unconditionally use
> >> this
> >>> (as long as gcc >= 4.5); alternatively, you could use inline asm and
> then
> >>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a
> >> little
> >>> odd. I have no objection to it, patch is still fine, just odd.
> >>>
> >>> Ronald
> >>
> >> I purposely made the checks for gcc 4.5 and in configure for cpus that
> >> support popcnt
> >> because otherwise __builtin_popcount (at least gcc's) is slower than our
> >> generic
> >> av_popcount_c function from lavu/common.h.
> >> When the CPU supports popcnt the builtin becomes a single inlined
> >> instruction.
> >>
> >> I tried the __asm__ approach, but the code generated by the builtin
> seemed
> >> better.
> >
> >
> > That's interesting, can you show the code you tried?
> >
> > Ronald
>
> If instead of builtins i use
>
> +#if HAVE_INLINE_ASM
> +
> +#ifdef __POPCNT__
> +
> +#define av_popcount av_popcount_abm
> +static av_always_inline av_const int av_popcount_abm(uint32_t a)
> +{
> +    int x;
> +
> +    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
> +    return x;
> +}
> +
> +#if ARCH_X86_64
> +#define av_popcount64 av_popcount64_abm
> +static av_always_inline av_const int av_popcount64_abm(uint64_t a)
> +{
> +    uint64_t x;
> +
> +    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
> +    return x;
> +}
> +#endif
> +
> +#endif /* __POPCNT__ */
> +
> +#endif /* HAVE_INLINE_ASM */
>
> in the x86/ header from the second version i sent, on x86_32 av_cpu_count
> from
> libavutil/cpu.o on Windows (Will not work with other OSes because of
> HAVE_GETPROCESSAFFINITYMASK) generates
>
>  1af:   f3 0f b8 5c 24 18       popcnt ebx,DWORD PTR [esp+0x18]
>  1b5:   31 c0                   xor    eax,eax
>  1b7:   f3 0f b8 c0             popcnt eax,eax
>  1bb:   01 c3                   add    ebx,eax
>
> Whereas with the builtins i instead get
>
>  1af:   f3 0f b8 5c 24 18       popcnt ebx,DWORD PTR [esp+0x18]
>
> This is probably because of av_popcount64_c (Used unconditionally on
> x86_32) calling
> av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes in
> x86_32.
> The builtin seems to know proc_aff can't be right shifted 32 bits so it
> generates
> a single popcnt.


It seems to me rather than the second version is defined as "pure" (or
whatever the keyword was), i.e. "we don't depend on external state" -> "if
you call this with a constant, we can calculate this at compile-time".

Ronald


More information about the ffmpeg-devel mailing list