[FFmpeg-user] FFmpeg single threaded bottleneck

Gabriel Balaich roderrooder at gmail.com
Thu May 14 12:32:30 EEST 2020

Thanks for the feedback.

On Thu, 14 May 2020 at 01:52, Edward Park <kumowoon1025 at gmail.com> wrote:

> Hi,
> Some values don't look right, try getting rid of them.
> -thread_queue_size 9999 seems arbitrary,
> it is queue length, not bytes

I was getting the error "Thread message queue blocking; consider raising
the thread_queue_size option" when I left -thread_queue_size at its
default. I set it to 9999 because that is the highest value it accepts
before the command just errors out. When I remove "-thread_queue_size 9999"
the errors come back and I drop a massive number of frames, even with a
single 4K60 input / output.

> -indexmem 9999 seems arbitrary, pretty sure default value is bigger
-indexmem is one of the magical options I never really understood. I added
it at some point (at least 2 years ago) hoping it would solve this issue.
I can't seem to find any information on what the default is, and removing
it from the command doesn't change the results. That said, as far as I can
tell no single option matters much here, considering that everything works
just fine with every option in place when I run each input(s) / output in
its own instance of FFmpeg, all at the same time.

> -rtbufsize 2147.48M is kind of abusive, especially for the audio inputs
> I don't think you should be trying to buffer more, if the buffer keeps
> growing then it won't last.
I couldn't try more buffer if I wanted to; 2147.48M (max INT) is the
maximum buffer size allowed. Even then it only overfills if the hardware
can't keep up, which only happens when transcoding over 9K60 worth of video
in a single FFmpeg instance.
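
To put the buffer size in perspective, here is a rough back-of-the-envelope
sketch of how long one 2147.48M buffer lasts when a raw 4K60 stream arrives
faster than it is consumed. The pixel format (NV12, 1.5 bytes per pixel) and
the "10% too slow" figure are assumptions made for illustration; they are
not taken from the actual capture setup.

```python
# Rough arithmetic only. The pixel format and the consumer speeds below are
# assumptions for illustration, not values from the original command.

RTBUF_BYTES = 2**31 - 1  # 2147483647 bytes, i.e. "2147.48M" (max signed 32-bit int)


def raw_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps


def seconds_until_full(buffer_bytes, in_rate, out_rate):
    """Time for a buffer to fill when data arrives faster than it is drained."""
    deficit = in_rate - out_rate
    if deficit <= 0:
        return float("inf")  # the consumer keeps up, so the buffer never fills
    return buffer_bytes / deficit


# Assumed: 3840x2160 at 60 fps in NV12 (12 bits = 1.5 bytes per pixel).
rate_4k60 = raw_bytes_per_second(3840, 2160, 60, 1.5)

# Case 1: nothing is consumed at all. Case 2: the consumer is 10% too slow.
stalled = seconds_until_full(RTBUF_BYTES, rate_4k60, 0)
ten_pct_behind = seconds_until_full(RTBUF_BYTES, rate_4k60, 0.9 * rate_4k60)

print(f"4K60 NV12 raw rate: {rate_4k60 / 1e6:.1f} MB/s")
print(f"buffer lasts {stalled:.1f} s if nothing is consumed")
print(f"buffer lasts {ten_pct_behind:.1f} s if the consumer is 10% too slow")
```

Under these assumptions even the maximum buffer only covers a few seconds of
raw 4K60 video, so it can absorb short stalls but not a pipeline that is
consistently behind.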

> I can't really tell what the dshow input mapping looks like, but I think
> this is about the limit of your system.
> With a 6800K, assuming the GPU is full sized,  are there enough lanes left
> for 3 additional capture cards?
As seen in the screenshots, my 6800K is only overly taxed when I run all
the inputs / outputs in one command / one instance of FFmpeg, *and only on
one thread, with plenty of headroom left on all other threads* (see task
manager screenshots). When I split them into multiple commands running in
different processes, still at the same time and with all the same options,
the 6800K isn't even at 35% total usage, with plenty of headroom per
thread. So it seems pretty clear to me that the 6800K is not the
bottleneck. Even so, I'm replacing it with a Threadripper (1950X, 2.5 times
as powerful as my 6800K), as described in the original message, so I have
headroom to run FFmpeg and OBS at the same time.

> Using the hardware encoder for so many streams at once might also have to
> do with it, you could try saving
> the raw input to fast enough scratch disk to check for that quickly.

I'm using a GTX 1080, which has dual NVENC processing chips (see NVIDIA
encode matrix:
https://developer.nvidia.com/video-encode-decode-gpu-support-matrix). As
can be seen in my screenshots the encoder is only at 40% usage, and while
Nvidia typically only allows 2-3 encodes at once, that is a
pseudo-limitation enforced by software which can be bypassed with a patch:

Just to further show that the hardware is not yet an issue in itself, I can
run 4 separate 4K60 transcodes simultaneously in real time using just the
6800K and the GTX 1080, with 30% headroom left on the CPU, 20% headroom on
the GPU's encoding chips, multiple gigabytes of VRAM still available on the
GPU, over 12 GB of available system memory, and below 30% SSD usage. The
one caveat is that each input / output has to be running in *separate
instances of FFmpeg*. As soon as I try to transcode more than 9K60 in a
*single FFmpeg command / instance*, a single thread on my 6800K reaches
100%, despite the rest of the chip having 70% headroom, and then the
command falls behind, filling the buffer until there is no memory left.

Just to make it clear, from my extensive testing the issue only presents
itself when running massive commands with 3 or more 4K60 transcodes in *one
instance* of *FFmpeg*. *When I run them separately but still at the same
time I have zero issues*... other than the fact that I have to run them in
separate instances, which is what I'm trying to avoid due to
synchronization issues, among others. What I'm really trying to determine
here is what part of a single FFmpeg instance is being limited to 1 thread
when transcoding 3+ 4K60 streams.
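
For reference, the "separate instances at the same time" setup can be
sketched as a small launcher script. This is only an illustration: the
device names, output file names and the choice of h264_nvenc are
hypothetical placeholders, not the real command, and starting the processes
back to back does nothing to synchronize the captures with each other.

```python
import subprocess

# Hypothetical placeholders: the real dshow device names and output paths
# are not given here, so these are made up for illustration.
JOBS = [
    ("Capture Card 1", "out1.mkv"),
    ("Capture Card 2", "out2.mkv"),
    ("Capture Card 3", "out3.mkv"),
    ("Capture Card 4", "out4.mkv"),
]


def build_command(device, output):
    """Build the argv for one FFmpeg instance with one input and one output."""
    return [
        "ffmpeg",
        "-f", "dshow",
        "-rtbufsize", "2147.48M",      # input options go before -i
        "-thread_queue_size", "9999",
        "-i", f"video={device}",
        "-c:v", "h264_nvenc",          # assumed encoder; could equally be hevc_nvenc
        output,
    ]


def launch_all(jobs):
    """Start one FFmpeg process per job without waiting in between."""
    return [subprocess.Popen(build_command(dev, out)) for dev, out in jobs]


# To actually start the captures (requires ffmpeg on PATH and real devices):
# for proc in launch_all(JOBS):
#     proc.wait()
```

Each process here gets its own input thread and its own main loop, which is
the difference from a single command that carries every input and output.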
