[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

Wang, Shaofei shaofei.wang at intel.com
Wed Mar 27 11:51:46 EET 2019


> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of
> Mark Thompson
> Sent: Wednesday, March 27, 2019 6:44 AM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH] Improved the performance of 1 decode +
> N filter graphs and adaptive bitrate.
>
Working out a perfect solution is a never-ending task, and that is exactly the
current situation of the parallelism issue in ffmpeg.
However users choose to work around the "well-known" serial limitation in
their own applications, why not also do something positive for the ffmpeg tool
that such a large number of users rely on? Although its codebase was not
designed with parallelism in mind, key parts of the pipeline can still be
parallelized with fairly light modifications, as this patch shows. The
alternative, a brand-new ffmpeg designed around parallelism, would take a much
bigger effort, and the concerns and behaviour problems it raises would be no
fewer than with the current one.
Some thoughts:
- Provide the -abr_pipeline option to enable this parallel optimization of
multiple simple filter graphs, instead of enabling it by default. Users who
turn it on then know what they are doing and that they are chasing
performance (see the example command after these notes). ffmpeg is a toolbox
offering a huge number of different features; this optimization does not have
to cover every situation, since some cases need parallelism and some do not.
- Fix the already known parallelism problems, such as -benchmark_all.
These considerations are good for improving and refining the feature/solution;
even hidden or hard-to-trigger issues found later can be fixed by dedicated
patches.
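
As an example of the target use case, a 1-decode + N-simple-filter-graph ABR
run with the option enabled could look roughly like this (the -abr_pipeline
flag is the one the patch proposes; the scalers, encoders and bitrates are
only an illustrative ladder):

    ffmpeg -abr_pipeline -i input.mp4 \
        -vf scale=1280:720 -c:v libx264 -b:v 3M    out_720p.mp4 \
        -vf scale=848:480  -c:v libx264 -b:v 1500k out_480p.mp4 \
        -vf scale=640:360  -c:v libx264 -b:v 800k  out_360p.mp4

Each output keeps its own simple filter graph, so the idea is that the N
graphs and encoders can run in parallel while the single decode is shared.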

Thanks

> On further thought, I don't think this patch should continue in its present
> form.
> 
> The fundamental problem is that the bolted-on parallelism renders it pretty
> much unreviewable - the codebase here was not designed with any
> parallelism in mind, so it does not follow the usual rules expected of such
> code.  In particular, there are global variables in many places accessed
> without regard for data races, and that becomes fatal undefined behaviour
> once parallelism is added.  While in theory every one of those can be fixed
> (by adding more locks, or one big lock covering many instances together), it
> will be very hard to verify that all cases have been covered and the code will
> suffer significantly in readability/maintainability for the change.
> 
> To give an explicit example of the sort of problems you are hitting, the
> -benchmark_all option is entirely broken in the current proposed version.  It
> invokes undefined behaviour by writing to the current_time global in every
> thread.  Even if that were avoided by moving the value to a thread-local
> structure, it also gives wrong results - getrusage(RUSAGE_SELF) returns values
> for the whole process, so it won't say anything useful once there are multiple
> threads looking at the value in parallel.  Perhaps that is fixable by using some
> sort of per-thread storage and RUSAGE_THREAD (and some equivalent on
> Windows), but whatever that is certainly requires more investigation and
> probably a separate patch.
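Agreed that this one needs the timing moved into per-thread storage. A rough
sketch of the direction, assuming Linux and RUSAGE_THREAD (the helper below is
only illustrative, not code from the patch; Windows would need something like
GetThreadTimes() instead):

    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdint.h>

    /* Per-thread CPU time: RUSAGE_THREAD (Linux-specific) reports only the
     * calling thread, while RUSAGE_SELF accumulates over the whole process
     * and becomes meaningless once several threads sample it in parallel. */
    static int64_t get_thread_cpu_time_us(void)
    {
        struct rusage ru;
    #ifdef RUSAGE_THREAD
        getrusage(RUSAGE_THREAD, &ru);
    #else
        getrusage(RUSAGE_SELF, &ru);   /* fallback: whole-process figures */
    #endif
        return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * INT64_C(1000000)
               + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
    }

Each worker thread would then accumulate into its own value instead of
writing to the shared current_time global.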
> 
> Three possible thoughts on where to go from here:
> 
> * Start by refactoring the code to make it easier to verify.  This would entail
> something like removing all global variable accesses from parallel paths, and
> then ensuring that each thread only ever writes to its own local working set
> (e.g. OutputStream structure) while marking all other structures as const.
> After such refactoring has happened, a new version of the present patch
> would then be possible to assess.
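To make sure we read this the same way, the ownership rule would look roughly
like the sketch below (illustrative only: AbrWorker and abr_worker_thread are
made-up names, not from the patch, while the structure types are the existing
fftools ones):

    #include <pthread.h>
    #include "ffmpeg.h"

    /* Each worker owns exactly one OutputStream and may write only to it;
     * everything shared with other threads is taken as const, so a stray
     * write shows up at compile time rather than as a data race. */
    typedef struct AbrWorker {
        OutputStream      *ost;   /* owned by this thread, writable   */
        const InputStream *ist;   /* shared decoder state, read-only  */
    } AbrWorker;

    static void *abr_worker_thread(void *arg)
    {
        AbrWorker *w = arg;
        /* feed decoded frames through w->ost's simple filter graph and
         * encoder; no global variable is touched on this path */
        (void)w;
        return NULL;
    }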
> 
> * Run exhaustively in tsan/valgrind/other race detectors and fix every
> problem it finds, then provide evidence from that to help with review.  Since
> these issues can be hidden behind specific option combinations (as the
> benchmark one is), this would need a lot of care to ensure full coverage, and
> some parts (like the benchmark one) would need rewriting anyway.  As
> noted above I'm not sure this would be very good for the code, but we could
> at least have some amount of confidence that the result was correct.
Even for the existing code, valgrind/tsan already detect a lot of issues. We
can provide evidence to show that the problems introduced here have been
covered, for example roughly along the lines below.
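(Illustrative invocations only, assuming the clang-tsan toolchain preset in
configure; not a claim about which exact runs were already done.)

    ./configure --toolchain=clang-tsan && make -j8
    ./ffmpeg -abr_pipeline -i input.mp4 \
        -vf scale=1280:720 out_720p.mp4 \
        -vf scale=640:360  out_360p.mp4

The same 1:N command can also be run under valgrind --tool=helgrind with a
normal build; reports from such runs are the kind of evidence we would attach.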
> 
> * Reconsider whether the patch is worth pursuing.  While I agree that a
> correct implementation of this would help in the specific use-cases you
> describe, I'm not sure that the set of users who gain from it is actually
> particularly large - most people wanting to maximise throughput in the way
> this change can help with already run multiple FFmpeg instances (or their own
> programs using the libraries) because the serial limitations of FFmpeg are
> well-known.
Running multiple FFmpeg instances does not cover the 1:N case. And of course
users can call the libraries from their own programs, but we also see a lot of
users who build their solutions, demos, apps and so on around the ffmpeg tool.


