[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

wm4 nfxjfg at googlemail.com
Tue Jun 16 14:32:40 CEST 2015


On Tue, 16 Jun 2015 14:16:11 +0200
Gwenole Beauchesne <gb.devel at gmail.com> wrote:

> Hi,
> 
> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer <michaelni at gmx.at>:
> > On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
> >> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
> >> > On Mon, 15 Jun 2015 17:55:35 +0200
> >> > Stefano Sabatini <stefasab at gmail.com> wrote:
> >> >
> >> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
> >> > > [...]
> >> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> >> > > > From: Stefano Sabatini <stefasab at gmail.com>
> >> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200
> >> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
> >> > > >
> >> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> >> > > > <fenrir at videolan.org>.
> >> > > >
> >> > > > TODO: bump minor, update APIchanges
> >> > > > ---
> >> > > >  libavutil/mem.c          |  9 +++++
> >> > > >  libavutil/mem.h          | 14 ++++++++
> >> > > >  libavutil/mem_internal.h | 26 +++++++++++++++
> >> > > >  libavutil/x86/Makefile   |  1 +
> >> > > >  libavutil/x86/mem.c      | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
> >> > > >  5 files changed, 135 insertions(+)
> >> > > >  create mode 100644 libavutil/mem_internal.h
> >> > > >  create mode 100644 libavutil/x86/mem.c
> >> > > >
> >> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c
> >> > > > index da291fb..0e1eb01 100644
> >> > > > --- a/libavutil/mem.c
> >> > > > +++ b/libavutil/mem.c
> >> > > > @@ -42,6 +42,7 @@
> >> > > >  #include "dynarray.h"
> >> > > >  #include "intreadwrite.h"
> >> > > >  #include "mem.h"
> >> > > > +#include "mem_internal.h"
> >> > > >
> >> > > >  #ifdef MALLOC_PREFIX
> >> > > >
> >> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
> >> > > >      ff_fast_malloc(ptr, size, min_size, 0);
> >> > > >  }
> >> > > >
> >> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
> >> > > > +{
> >> > > > +#if ARCH_X86
> >> > > > +    ff_memcpynt_x86(dst, src, size, cpu_flags);
> >> > > > +#else
> >> > > > +    memcpy(dst, src, size, cpu_flags);
> >> > > > +#endif
> >> > > > +}
> >> > >
> >> > > Alternatively, what about something like:
> >> > >
> >> > > av_memcpynt_fn av_memcpynt_get_fn(void);
> >> > >
> >> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for
> >> > > a wrapper calling the right function.
> >> >
> >>
> >> > I don't see much value in this, unless determining the right function
> >> > causes too much overhead.
> >>
> >> I see two advantages, 1. no branch and function call when the function
> >> is called, 2. the cpu_flags must not be passed around, so it's somehow
> >> safer.
> >>
> >> I have no strong preference though, updated (untested patch) in
> >> attachment.
> >> --
> >> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle
> >
> >>  mem.c          |    9 +++++
> >>  mem.h          |   13 +++++++
> >>  mem_internal.h |   26 +++++++++++++++
> >>  x86/Makefile   |    1
> >>  x86/mem.c      |   98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  5 files changed, 147 insertions(+)
> >> f536b25834e0927b8cab5c996042aae697b8d773  0003-lavu-mem-add-av_memcpynt_get_fn.patch
> >> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
> >> From: Stefano Sabatini <stefasab at gmail.com>
> >> Date: Mon, 15 Jun 2015 11:02:50 +0200
> >> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()
> >>
> >> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> >> <fenrir at videolan.org>.
> >>
> >> TODO: remove use of inline assembly, bump minor, update APIchanges
> >> ---
> >>  libavutil/mem.c          |  9 +++++
> >>  libavutil/mem.h          | 13 +++++++
> >>  libavutil/mem_internal.h | 26 +++++++++++++
> >>  libavutil/x86/Makefile   |  1 +
> >>  libavutil/x86/mem.c      | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>  5 files changed, 147 insertions(+)
> >>  create mode 100644 libavutil/mem_internal.h
> >>  create mode 100644 libavutil/x86/mem.c
> >>
> >> diff --git a/libavutil/mem.c b/libavutil/mem.c
> >> index da291fb..325bfc9 100644
> >> --- a/libavutil/mem.c
> >> +++ b/libavutil/mem.c
> >> @@ -42,6 +42,7 @@
> >>  #include "dynarray.h"
> >>  #include "intreadwrite.h"
> >>  #include "mem.h"
> >> +#include "mem_internal.h"
> >>
> >>  #ifdef MALLOC_PREFIX
> >>
> >> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
> >>      ff_fast_malloc(ptr, size, min_size, 0);
> >>  }
> >>
> >> +av_memcpynt_fn av_memcpynt_get_fn(void)
> >> +{
> >> +#if ARCH_X86
> >> +    return ff_memcpynt_get_fn_x86();
> >> +#else
> >> +    return memcpy;
> >> +#endif
> >> +}
> >> diff --git a/libavutil/mem.h b/libavutil/mem.h
> >> index 2a1e36d..d9f1b7a 100644
> >> --- a/libavutil/mem.h
> >> +++ b/libavutil/mem.h
> >> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
> >>   */
> >>  void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
> >>
> >> +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
> >> +
> >> +/**
> >> + * Return possibly optimized function to copy size bytes from from src
> >> + * to dst, using non-temporal copy.
> >> + *
> >> + * The returned function works as memcpy, but adopts non-temporal
> >> + * instructios when available. This can lead to better performances
> >> + * when transferring data from source to destination is expensive, for
> >> + * example when reading from GPU memory.
> >> + */
> >> +av_memcpynt_fn av_memcpynt_get_fn(void);
> >> +
> >>  /**
> >>   * @}
> >>   */
> >> diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
> >> new file mode 100644
> >> index 0000000..de61cba
> >> --- /dev/null
> >> +++ b/libavutil/mem_internal.h
> >> @@ -0,0 +1,26 @@
> >> +/*
> >> + * This file is part of FFmpeg.
> >> + *
> >> + * FFmpeg is free software; you can redistribute it and/or
> >> + * modify it under the terms of the GNU Lesser General Public
> >> + * License as published by the Free Software Foundation; either
> >> + * version 2.1 of the License, or (at your option) any later version.
> >> + *
> >> + * FFmpeg is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> + * Lesser General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU Lesser General Public
> >> + * License along with FFmpeg; if not, write to the Free Software
> >> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> >> + */
> >> +
> >> +#ifndef AVUTIL_MEM_INTERNAL_H
> >> +#define AVUTIL_MEM_INTERNAL_H
> >> +
> >> +#include "mem.h"
> >> +
> >> +av_memcpynt_fn ff_memcpynt_get_fn_x86(void);
> >> +
> >> +#endif /* AVUTIL_MEM_INTERNAL_H */
> >> diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile
> >> index a719c00..171c351 100644
> >> --- a/libavutil/x86/Makefile
> >> +++ b/libavutil/x86/Makefile
> >> @@ -2,6 +2,7 @@ OBJS += x86/cpu.o                                                       \
> >>          x86/float_dsp_init.o                                            \
> >>          x86/imgutils.o                                                  \
> >>          x86/lls_init.o                                                  \
> >> +        x86/mem.o                                                       \
> >>
> >>  OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o                      \
> >>
> >> diff --git a/libavutil/x86/mem.c b/libavutil/x86/mem.c
> >> new file mode 100644
> >> index 0000000..6326c90
> >> --- /dev/null
> >> +++ b/libavutil/x86/mem.c
> >> @@ -0,0 +1,98 @@
> >> +/*
> >> + * This file is part of FFmpeg.
> >> + *
> >> + * FFmpeg is free software; you can redistribute it and/or
> >> + * modify it under the terms of the GNU Lesser General Public
> >> + * License as published by the Free Software Foundation; either
> >> + * version 2.1 of the License, or (at your option) any later version.
> >> + *
> >> + * FFmpeg is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> + * Lesser General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU Lesser General Public
> >> + * License along with FFmpeg; if not, write to the Free Software
> >> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> >> + */
> >> +
> >> +#include <inttypes.h>
> >> +#include "config.h"
> >> +#include "libavutil/avassert.h"
> >> +#include "libavutil/mem_internal.h"
> >> +
> >> +#if HAVE_SSE2
> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
> >> + * load and storing data with the SSE>=2 instruction store.
> >> + */
> >> +#define COPY16(dstp, srcp, load, store) \
> >> +    __asm__ volatile (                  \
> >> +        load "  0(%[src]), %%xmm1\n"    \
> >> +        store " %%xmm1,    0(%[dst])\n" \
> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
> >> +
> >> +#define COPY64(dstp, srcp, load, store) \
> >> +    __asm__ volatile (                  \
> >> +        load "  0(%[src]), %%xmm1\n"    \
> >> +        load " 16(%[src]), %%xmm2\n"    \
> >> +        load " 32(%[src]), %%xmm3\n"    \
> >> +        load " 48(%[src]), %%xmm4\n"    \
> >> +        store " %%xmm1,    0(%[dst])\n" \
> >> +        store " %%xmm2,   16(%[dst])\n" \
> >> +        store " %%xmm3,   32(%[dst])\n" \
> >> +        store " %%xmm4,   48(%[dst])\n" \
> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
> >> +#endif
> >> +
> >> +#define COPY_LINE(dstp, srcp, size, load)                               \
> >> +    const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f;               \
> >> +    unsigned x = unaligned;                                             \
> >> +                                                                        \
> >> +    av_assert0(((intptr_t)dstp & 0x0f) == 0);                           \
> >> +                                                                        \
> >> +    __asm__ volatile ("mfence");                                        \
> >> +    if (!unaligned) {                                                   \
> >> +        for (; x+63 < size; x += 64)                                    \
> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqa");                 \
> >> +    } else {                                                            \
> >> +        COPY16(dst, src, "movdqu", "movdqa");                           \
> >> +        for (; x+63 < size; x += 64)                                    \
> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqu");                 \
> >
> > to use SSE registers in inline asm operands or clobber list you need
> > to build with -msse (which probably is default on on x86-64)
> >
> > files build with -msse will result in undefined behavior if anything
> > in them is executed on a pre SSE cpu, as these allow gcc to put
> > SSE instructions directly in the code where it likes
> >
> > The way out of this "design" is not to tell gcc that it passes
> > a string with SSE code to the assembler
> > that is not to use SSE registers in operands and not to put them
> > on the clobber list unless gcc actually is in SSE mode and can use and
> > need them there.
> > see XMM_CLOBBERS*
> 
> Well, from past experience, lying to gcc is generally not a good thing
> either. There are multiple interesting ways it could fail from time to
> time. :)
> 
> Other approaches:
> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
> "ssse3", "sse4.1", etc. This is the easiest way ;
> - Split into several separate files per target. Though, one would then
> argue that while we are at it why not just start moving to yasm.
> 
> The former approach looks more appealing to me, considering there may
> be an effort to migrate to yasm afterwards.

Or how about:
- use yasm, which is used for all other (new) x86 SIMD usage exactly
  for avoiding such issues


More information about the ffmpeg-devel mailing list