[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

Mark Thompson sw at jkqxz.net
Wed Mar 27 00:44:12 EET 2019


On 26/03/2019 22:07, Shaofei Wang wrote:
> It enabled MULTIPLE SIMPLE filter graph concurrency, which bring above about
> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
> 
> ...
> ---
> The patch will only effect on multiple SIMPLE filter graphs pipeline,
> Passed fate and refine the possible data race,
> AFL tested, without introducing extra crashs/hangs
> 
>  fftools/ffmpeg.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++------
>  fftools/ffmpeg.h |  13 +++++
>  2 files changed, 169 insertions(+), 16 deletions(-)

On further thought, I don't think this patch should continue in its present form.

The fundamental problem is that the bolted-on parallelism renders it pretty much unreviewable - the codebase here was not designed with any parallelism in mind, so it does not follow the usual rules expected of such code.  In particular, there are global variables in many places accessed without regard for data races, and that becomes fatal undefined behaviour once parallelism is added.  While in theory every one of those can be fixed (by adding more locks, or one big lock covering many instances together), it will be very hard to verify that all cases have been covered and the code will suffer significantly in readability/maintainability for the change.

To give an explicit example of the sort of problems you are hitting, the -benchmark_all option is entirely broken in the current proposed version.  It invokes undefined behaviour by writing to the current_time global in every thread.  Even if that were avoided by moving the value to a thread-local structure, it also gives wrong results - getrusage(RUSAGE_SELF) returns values for the whole process, so it won't say anything useful once there are multiple threads looking at the value in parallel.  Perhaps that is fixable by using some sort of per-thread storage and RUSAGE_THREAD (and some equivalent on Windows), but whatever that is certainly requires more investigation and probably a separate patch.

Three possible thoughts on where to go from here:

* Start by refactoring the code to make it easier to verify.  This would entail something like removing all global variable accesses from parallel paths, and then ensuring that each thread only ever writes to its own local working set (e.g. OutputStream structure) while marking all other structures as const.  After such refactoring has happened, a new version of the present patch would then be possible to assess.

* Run exhaustively in tsan/valgrind/other race detectors and fix every problem it finds, then provide evidence from that to help with review.  Since these issues can be hidden behind specific option combinations (as the benchmark one is), this would need a lot of care to ensure full coverage, and some parts (like the benchmark one) would need rewriting anyway.  As noted above I'm not sure this would be very good for the code, but we could at least have some amount of confidence that the result was correct.

* Reconsider whether the patch is worth pursuing.  While I agree that a correct implementation of this would help in the specific use-cases you describe, I'm not sure that the set of users who gain from it is actually particularly large - most people wanting to maximise throughput in the way this change can help with already run multiple FFmpeg instances (or their own programs using the libraries) because the serial limitations of FFmpeg are well-known.


- Mark


More information about the ffmpeg-devel mailing list