[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

wm4 nfxjfg at googlemail.com
Thu May 14 15:32:03 CEST 2015


On Thu, 14 May 2015 14:52:29 +0200
Stefano Sabatini <stefasab at gmail.com> wrote:

> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
> > On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
> > > One limitation is that, as the manual says, the decoded frame needs
> > > to be copied from the GPU to system memory. ffmpeg_dxva2.c does not
> > > implement an optimized copy function for this; it uses plain old memcpy.
> > > Intel introduced a new instruction for this in SSE4.1, MOVNTDQA, which
> > > is optimized for copying from USWC memory (Uncacheable Speculative
> > > Write Combining) to system memory. Using this may help speed up the
> > > process significantly, and VLC probably uses it.
> > 
> > Now the question is, how would it be possible to optimize the GPU to
> > CPU copy to get an overall performance gain? At least VLC seems able to
> > get better performance when using HW decoding, but I'm not sure it is
> > copying decoded data back to the CPU (indeed it may perform direct
> > rendering).
> 
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar <fenrir at videolan.org>
> Date:   Tue Nov 17 01:09:43 2009 +0100
> 
>     Improved performance when copying video surface in dxva2.
> 
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
> instructions are available.

Here's what LAV Filters appears to use:

http://git.1f0.de/gitweb?p=lavfsplitter.git;a=blob;f=common/DSUtilLite/gpu_memcpy_sse4.h
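
For illustration, here is a rough sketch of what a MOVNTDQA-based copy of
this kind can look like. This is not the code from LAV Filters or VLC; the
function names (copy_from_uswc, copy_sse41) are made up, and it assumes
GCC or clang on x86, 16-byte aligned buffers, and a memcpy fallback
everywhere else:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MOVNTDQA needs x86 with SSE4.1; on anything else we just use memcpy. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h>   /* _mm_stream_load_si128 (MOVNTDQA) */
#define HAVE_STREAM_LOAD 1
#else
#define HAVE_STREAM_LOAD 0
#endif

#if HAVE_STREAM_LOAD
/* Hypothetical helper: src and dst must both be 16-byte aligned. */
__attribute__((target("sse4.1")))
static void copy_sse41(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t i = 0;

    /* Read 64 bytes with streaming loads, then write them out with
     * ordinary aligned stores, since the destination is cached memory. */
    for (; i + 64 <= size; i += 64) {
        __m128i a = _mm_stream_load_si128((__m128i *)(src + i));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + i + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + i + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + i + 48));
        _mm_store_si128((__m128i *)(dst + i),      a);
        _mm_store_si128((__m128i *)(dst + i + 16), b);
        _mm_store_si128((__m128i *)(dst + i + 32), c);
        _mm_store_si128((__m128i *)(dst + i + 48), d);
    }

    /* Leftover bytes (fewer than 64) are copied the plain way. */
    memcpy(dst + i, src + i, size - i);
}
#endif

/* Hypothetical entry point: uses the streaming path when the CPU has
 * SSE4.1 and both pointers are 16-byte aligned, otherwise plain memcpy. */
static void *copy_from_uswc(void *dst, const void *src, size_t size)
{
#if HAVE_STREAM_LOAD
    if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0 &&
        __builtin_cpu_supports("sse4.1")) {
        copy_sse41((uint8_t *)dst, (const uint8_t *)src, size);
        return dst;
    }
#endif
    return memcpy(dst, src, size);
}
```

Note that the streaming load should only pay off when the source really
is write-combining memory, such as a locked DXVA2 surface. On ordinary
cached memory it behaves like a normal load, so benchmarking this against
regular malloc'ed buffers will not show the gain.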

