[FFmpeg-trac] #7943(undetermined:reopened): Ffmpeg QSV backend uses >2x more GPU memory compared to VAAPI or MSDK

FFmpeg trac at avcodec.org
Tue Oct 29 04:03:32 EET 2019


#7943: Ffmpeg QSV backend uses >2x more GPU memory compared to VAAPI or MSDK
-------------------------------------+-------------------------------------
             Reporter:  eero-t       |                    Owner:
                 Type:  enhancement  |                   Status:  reopened
             Priority:  normal       |                Component:
                                     |  undetermined
              Version:  git-master   |               Resolution:
             Keywords:               |               Blocked By:
             Blocking:               |  Reproduced by developer:  0
Analyzed by developer:  0            |
-------------------------------------+-------------------------------------
Changes (by fulinjie):

 * type:  defect => enhancement


Comment:

 Based on the discussion with Eero:

 QSV uses a pre-allocation pool for memory allocation.
 One of the reasons why it needs noticeably more memory than VA-API is
 look_ahead.

 The pre-allocated memory allows look_ahead to analyze LookAheadDepth
 frames to find per-frame costs, using a sliding window of DependencyDepth
 frames.


 https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#figure-6-lookahead-brc-qp-calculation-algorithm

 If the pre-allocated pool is set to a value similar to VA-API's, there'll
 be a memory allocation error when look_ahead is enabled with a large
 look_ahead_depth.

 It can also be verified with sample_multi_transcode that MSDK does use
 more memory with LA (however, it scales the memory usage according to
 the option).

 This could be improved in the FFmpeg QSV backend by scaling the memory
 allocation according to the relevant options. We need to find a proper
 way to handle this and make sure it won't introduce other
 concerns/regressions. Let's keep this issue open and change it into an
 enhancement.


 Details:
 ----

 > >>>> 1. Why can't the QSV backend be made to do whatever the MSDK sample
 > >>>> transcode application does?
 > >>>
 > >>> Actually, the QSV backend is able to match the behavior of sample
 > >>> transcode in MSDK.
 > >>>
 > >>> It *works well* if you set initial_pool_size to *a value similar*
 > >>> to VAAPI's:
 > >>>
 > >>> frames_ctx->initial_pool_size = 22 + s->extra_hw_frames;
 > >>>
 > >>> And similar GPU memory occupation could be observed.
 > >>>
 > >>> If you set the value too small, errors may occur.
 > >>


 > >> 2. If a value similar to VA-API's works well, why not use that as a fix?
 > >
 > > QSV and VA-API are not the same codec and they have different
 > > features, for example look_ahead (currently specific to h264_qsv).
 >
 > Good point.
 >
 >
 > > One of the important reasons for the pre-allocated memory could be to
 > > allow look_ahead to analyze LookAheadDepth frames to find per-frame
 > > costs using a sliding window of DependencyDepth frames.
 > >

 > > Details:
 > > https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#figure-6-lookahead-brc-qp-calculation-algorithm
 > >
 > > If the pre-allocated memory is set to a value similar to VA-API's,
 > > there'll be a memory allocation error when look_ahead is enabled.
 > >
 > > CMDLINE for testing:
 > > ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv \
 > >     -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 \
 > >     -vf scale_qsv=format=nv12 -c:v h264_qsv -b:v 20M -look_ahead 1 \
 > >     -look_ahead_depth 40 -async_depth 4 output.h264
 >
 > Gives:

 {{{
 > ----------------
 > Error while filtering: Cannot allocate memory
 > Failed to inject frame into filter network: Cannot allocate memory
 > ----------------
 }}}

 >
 > With LA depth 20 it still works; 30 or more doesn't.
 >
 >
 >
 > > Also, would you please help verify whether sample_multi_transcode
 > > uses more GEM memory when look_ahead is enabled?
 >
 > Same with MSDK:
 > sample_multi_transcode -i::h265
 > Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -ec::nv12
 > -o::h264 output.h264 -b 20000 -async 4 -hw
 >
 > MSDK does use more memory with LA:
 > * No LA:         0.67 GB
 > * LA depth 10:   0.89 GB  (options: -la -lad 10)
 > * LA depth 100:  2.12 GB
 >
 > But FFmpeg QSV uses much more:
 > * LA depth 10:   2.54 GB (!)
 >
 >
 > >>>> This workaround option seems to require the very latest FFmpeg git
 > >>>> version, but it doesn't help, neither with the original transcode
 > >>>> command, nor when doing just decode. Memory usage is the same, and
 > >>>> FFmpeg outputs this warning:
 > >>>
 > >>> GPU copy can only be used when copying data from video memory to
 > >>> system memory, thus it won't help in a transcode pipeline.
 > >>>
 > >>> I tried to explain the memory usage in MSDK with GpuCopy as an
 > >>> example.
 > >>
 > >> In case I was unclear, it also fails with decoding:
 > >>

 {{{
  ----------------------------------------------------------------
 > >> ffmpeg -an -y -hwaccel qsv -qsv_device /dev/dri/renderD128 -gpu_copy on
 > >>   -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265
 > >>   -vf hwdownload,format=p010 -f rawvideo /dev/null
 > >> ...
 > >> Input #0, hevc, from 'input/Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265':
 > >>     Duration: N/A, bitrate: N/A
 > >>       Stream #0:0: Video: hevc (Main 10), yuv420p10le(tv), 4096x2160, 60 fps, 60 tbr, 1200k tbn, 60 tbc
 > >> Stream mapping:
 > >>     Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> rawvideo (native))
 > >> Press [q] to stop, [?] for help
 > >> [hevc_qsv @ 0x55f94862d840] GPU-accelerated memory copy only works in MFX_IOPATTERN_OUT_SYSTEM_MEMORY.
 > >>       Last message repeated 1 times
 > >> Output #0, rawvideo, to '/dev/null':
 > >> ----------------------------------------------------------------
 }}}

 > >
 > > There is something wrong with the test cmdline:
 > >
 > > "-hwaccel qsv" should be removed,
 >
 > Thanks, I hadn't noticed.  Without that it works!
 >
 >
 > > and "-vf hwdownload,format=p010" is not needed.
 > > Try the provided cmdline in:
 > > https://trac.ffmpeg.org/ticket/7943#comment:1
 > >
 > > or use the following cmd:
 > >
 > > ffmpeg -an -y -qsv_device /dev/dri/renderD128 -gpu_copy on \
 > >     -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 \
 > >     -f rawvideo /dev/null
 >
 > => 0.47GB
 >
 >
 > >>>> Why doesn't VA-API fail with those errors when using the same
 > >>>> initial pool size?
 > >>>
 > >>> Both VAAPI and QSV work well.
 > >>>
 > >>> 4. IMHO, the pre-allocation doesn't mean QSV really uses all the
 > >>> memory allocated in the pool.
 > >>
 > >> If it shows up as a GEM allocation in sysfs, I think it's really used
 > >> in the sense that it's taken away from everybody else in the system,
 > >> but I don't know for sure.
 > >>
 > >> (With normal memory, non-dirtied allocations are all mapped to the
 > >> same zero-page, but GEM memory doesn't show up in SMAPS statistics,
 > >> and I don't know whether QSV dirties all of its allocations.)
 > >>
 > >> Btw. Because DRI/GEM allocations aren't visible to the rest of the
 > >> system, they can't be controlled [1] and they can easily cause
 > >> OOM-kills of innocent (other) processes.  I've seen this happen with
 > >> 3D on the X desktop, and with media services in a Kubernetes
 > >> environment (the control plane gets killed and the node needs to be
 > >> rebooted).
 > >
 > > Yes, GEM memory is allocated in the pre-allocation pool. So from
 > > this perspective, all allocated GEM memory is used by FFmpeg.
 > >
 > > From another perspective, FFmpeg QSV takes memory from the
 > > pre-allocation pool when it "actually" uses it. So the memory QSV
 > > actually uses may be smaller than the GEM allocation, but that can't
 > > be observed through "GEM object usage".
 >
 > I assume that's what GPU cgroups support is going to look at when it
 > finally gets implemented (cgroups are mandatory for any reasonable GPU
 > sharing with container loads).  If it does, it would matter, a lot.
 >
 >
 > > Summary:
 > > 1. Larger memory pre-allocation for QSV is reasonable for some
 > > features (look_ahead, for example).
 > > 2. GpuCopy can work if the cmdline is refined.
 >
 > It works, but IMHO isn't really usable; normal users aren't going to
 > find out all these weird command line option combinations needed to get
 > performance and reasonable memory usage out of QSV (the -gpu_copy
 > option doesn't even seem to be documented).

 GpuCopy mainly helps improve performance (on APL, 4K NV12 decode
 performance could be improved from 3.4 fps to 50+ fps); the lower memory
 usage is more of a side benefit.

 And thanks for the reminder; documentation for this option makes sense,
 and I'll consider sending a patch to add it.

 >
 > I think these kinds of optimal encoding settings should be handled by
 > the FFmpeg backend automatically (at least for cases with >= 2x
 > differences).  It should be enough for the user to tell it what to do,
 > not how.
 >
 >
 > > 3. GEM memory is all allocated and used by FFmpeg, but QSV may only
 > > use part of the memory allocated in the pre-allocation pool.
 > >
 > > I'm not sure whether there are other potential concerns if we
 > > initialize a smaller pre-allocation pool.
 > > One possible solution is to leave it as an enhancement to enable large
 > > allocation for specific features only.
 >
 > MSDK sample application seems to scale its memory usage according to
 > what is actually needed.  Couldn't FFmpeg QSV backend do the same?

 Agreed, this could be improved in the FFmpeg QSV backend by scaling the
 memory allocation according to the relevant options. We need to find a
 proper way to handle this and make sure it won't introduce other
 concerns/regressions. Let's keep this issue open and change it into an
 enhancement.

--
Ticket URL: <https://trac.ffmpeg.org/ticket/7943#comment:6>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker

