[FFmpeg-devel] Parallelized h264 proof-of-concept
Wed Jun 6 12:03:58 CEST 2007
Here's a version rewritten from scratch.
Michael Niedermayer wrote:
> On Fri, May 18, 2007 at 11:00:57PM +0200, Andreas Öman wrote:
>> The issues left to fix are:
>> o The error resilience data structures are not protected (but
>> still shared). This usually manifests itself as:
>> [h264 @ 0xb7c64208]concealing 0 DC, 0 AC, 0 MV errors
>> because the s->error_count decrement races between
>> cpus. This is pretty easy to fix if the avcodec thread
>> implementations would expose a locking primitive.
> you don't need any locking, just n separate error_counts, one for each
> thread, and then sum them at the end
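A minimal sketch of the per-thread counter idea, assuming plain pthreads rather than the avcodec thread wrappers. `SliceWorker`, `decode_slice()` and `decode_frame_count_errors()` are hypothetical names for illustration, and the "every 7th macroblock is damaged" rule is only a stand-in for real slice decoding. Each thread writes only its own counter, so no lock is needed; the sum is taken after the joins.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16

/* Hypothetical per-thread state; in a real decoder this would live in
 * each thread's private copy of the context. */
typedef struct SliceWorker {
    int error_count;   /* written only by the owning thread */
    int first_mb;      /* first macroblock this worker handles */
    int mb_count;      /* number of macroblocks it handles */
} SliceWorker;

/* Stand-in for slice decoding: every 7th macroblock counts as damaged. */
static void *decode_slice(void *arg)
{
    SliceWorker *w = arg;
    for (int mb = w->first_mb; mb < w->first_mb + w->mb_count; mb++)
        if (mb % 7 == 0)
            w->error_count++;
    return NULL;
}

/* Runs nb_threads workers and sums their private counters afterwards. */
static int decode_frame_count_errors(int nb_threads, int mbs_per_thread)
{
    pthread_t tid[MAX_THREADS];
    SliceWorker w[MAX_THREADS];
    int total = 0;

    assert(nb_threads > 0 && nb_threads <= MAX_THREADS);
    for (int i = 0; i < nb_threads; i++) {
        w[i] = (SliceWorker){ 0, i * mbs_per_thread, mbs_per_thread };
        pthread_create(&tid[i], NULL, decode_slice, &w[i]);
    }
    for (int i = 0; i < nb_threads; i++) {
        pthread_join(tid[i], NULL);   /* join before reading the counter */
        total += w[i].error_count;
    }
    return total;
}
```

The join acts as the synchronization point, so the summing loop never reads a counter that is still being written.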
>> o deblocking doesn't work correctly. When deblocking is enabled
>> the md5 sum output from my test program changes for every run.
>> I'm quite sure this is caused by the fact that deblocking is done
>> over the entire frame, not locally per slice, and thus, if
>> slices complete out-of-order, there will be errors.
>> I don't see any visual artifacts, but something is fishy for
>> sure. I'll need to nail the exact reason before I can be
>> more specific about problems / solutions here.
> this is serious, md5 must match ...
Note that this patch does not enable multi-threading if
deblocking type == 1.
I'm going to look into whether it's worth postponing type 1 deblocking
until after the frame is decoded when running multi-threaded.
I'll also check whether it is possible to parallelize deblocking
itself (by doing it in diagonal strokes or something; I don't know yet).
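One way the "diagonal strokes" idea could look is a wavefront traversal. This is only a sketch under an assumption that is not verified against the decoder: that a macroblock may be filtered once its left, top and top-right neighbours are finished. With that assumption, all macroblocks sharing the same value of x + 2*y are independent of each other. `deblock_wavefront()`, `WaveCheck` and `check_mb()` are hypothetical names, and the callback here only records visiting order instead of filtering anything.

```c
#include <assert.h>

/* Hypothetical callback type: would filter the edges of one macroblock. */
typedef void (*deblock_mb_fn)(int mb_x, int mb_y, void *opaque);

/* Visit macroblocks wave by wave, where wave = x + 2*y. Macroblocks in
 * the same wave do not depend on each other (under the assumption in the
 * lead-in), so the inner loop could be split across threads.
 * Returns the number of waves. */
static int deblock_wavefront(int mb_width, int mb_height,
                             deblock_mb_fn fn, void *opaque)
{
    assert(mb_width > 0 && mb_height > 0);
    int last_wave = (mb_width - 1) + 2 * (mb_height - 1);

    for (int wave = 0; wave <= last_wave; wave++) {
        for (int y = 0; y < mb_height; y++) {
            int x = wave - 2 * y;
            if (x >= 0 && x < mb_width)
                fn(x, y, opaque);
        }
    }
    return last_wave + 1;
}

/* Illustration-only checker: counts macroblocks visited before their
 * left, top or top-right neighbour. */
#define CHECK_MAX_MBS 256
typedef struct WaveCheck {
    int mb_width;
    int visited;
    int violations;
    unsigned char done[CHECK_MAX_MBS];
} WaveCheck;

static void check_mb(int x, int y, void *opaque)
{
    WaveCheck *c = opaque;
    int w = c->mb_width;

    if (x > 0 && !c->done[y * w + x - 1])
        c->violations++;                          /* left not done */
    if (y > 0 && !c->done[(y - 1) * w + x])
        c->violations++;                          /* top not done */
    if (y > 0 && x + 1 < w && !c->done[(y - 1) * w + x + 1])
        c->violations++;                          /* top-right not done */
    c->done[y * w + x] = 1;
    c->visited++;
}
```

For a 4x3 macroblock frame this gives 8 waves, with at most two macroblocks per wave, so the available parallelism grows with frame size.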
>> o The SVQ3 decoding has not yet been adapted. (one needs to configure
>> with --disable-decoder=svq3 to compile at all now)
> that too is serious, nothing may break ... though there's no need to make
> SVQ3 multithreaded too ...
>> Okay, a few words about the changes.
>> A new structure H264Thread (name suggestions very welcome) is
>> passed around to almost all functions. This structure is
>> local for every slice (perhaps H264Slice would be a better
>> name) and contains all members from H264Context that
>> changed during slice decode. I also moved a few things
>> (most notably mb_[xy]) from MpegEncContext here.
> what about copying the MpegEncContext & H264Context for each thread
> and using them, this should significantly reduce the changes
> needed (note i didn't look at your patch at all ...)
Yep, that's how it's done now.
There is some ugliness after MPV_common_init() since the
thread contexts it allocates are only sizeof(MpegEncContext).
A simple av_realloc() doesn't work since it does not correctly align
stuff when CONFIG_MEMALIGN
I see a few options here:
* Pass a second argument to MPV_common_init()
* Let MPV_common_init() look at some pre-initialized field in
's' (s->super_context_size) or something...
* "Fix" av_realloc to correctly align (by using free + memalign +
* Any other ideas?
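For the third option, an alignment-preserving reallocation could be sketched roughly as follows. It uses libc directly (posix_memalign, memcpy, free) rather than av_malloc/av_free, and `aligned_realloc()` and `CTX_ALIGN` are hypothetical names, not existing libavutil API. Unlike realloc(), the caller has to pass the old size, because the allocator does not expose it.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CTX_ALIGN 16   /* assumed alignment requirement, for illustration */

/* Allocate a new aligned block, copy the old contents over and free the
 * old block. Returns NULL on failure and leaves the old block untouched. */
static void *aligned_realloc(void *old, size_t old_size, size_t new_size)
{
    void *ptr = NULL;

    assert(new_size > 0);
    if (posix_memalign(&ptr, CTX_ALIGN, new_size) != 0)
        return NULL;
    if (old) {
        memcpy(ptr, old, old_size < new_size ? old_size : new_size);
        free(old);
    }
    return ptr;
}
```

The cost is an unconditional copy on every call, which should not matter for a context that is resized once at init time.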
> also look at how slice level multithreading is implemented for
> mpeg2/mpeg4 ...
>> If this is something that ffmpeg is willing to integrate
>> I'd like to get a few pointers, hints and answers on the
>> topics above before I continue with the stuff that's left.
> i am not against slice level threading support, though the
> implementation must be clean, simple and there must be no
> speedloss for the single threaded case (>1% is completely
This version is much cleaner, there are some "unrelated"
changes (border backup + copy stuff) that might be beneficial
to commit anyway (but the deblocking-type-2 conditional in xchg must
be there in order for deblocking to work correctly when run in parallel)
I've done a couple of tests with two streams.
There is no longer any slowdown compared to an unmodified version
of ffmpeg from head.
Each test was 1000 frames from the two streams; 10 tests were run
and the 6 best average times from av_decode_video() have been
averaged into the 'Time' column.
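The "average of the 6 best out of 10 runs" step can be sketched like this. `mean_of_best()` is a hypothetical helper, not the actual test program, and "best" is taken to mean the smallest times.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for ascending doubles */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts runs[] in place and returns the mean of the `keep` smallest
 * values, i.e. the fastest runs. */
static double mean_of_best(double *runs, int n, int keep)
{
    double sum = 0.0;

    assert(keep > 0 && keep <= n);
    qsort(runs, (size_t)n, sizeof(*runs), cmp_double);
    for (int i = 0; i < keep; i++)
        sum += runs[i];
    return sum / keep;
}
```

Dropping the slowest runs filters out runs that were disturbed by other activity on the machine.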
File A, CABAC, 6 slices, deblocking type 2
File B, CAVLC, 8 slices, deblocking type 0 + 1
Content  ffmpeg      CPU                   Concurrency   Time
File A   unmodified  3GHz Xeon HT          n/a          16211
File A   patched     3GHz Xeon HT          1            16113
File A   patched     3GHz Xeon HT          2            15594
File A   patched     2.66GHz 4way Xeon HT  8             4401
File A   unmodified  1.73GHz Pentium-M     n/a          15609
File A   patched     1.73GHz Pentium-M     1            15538
File A   unmodified  2.13GHz Core2 duo     n/a          11148
File A   patched     2.13GHz Core2 duo     1            11019
File A   patched     2.13GHz Core2 duo     2             7168
File B   unmodified  3GHz Xeon HT          n/a          30286
File B   patched     3GHz Xeon HT          1            29993
File B   patched     3GHz Xeon HT          2            25913
File B   patched     2.66GHz 4way Xeon HT  8             5129
File B   unmodified  1.73GHz Pentium-M     n/a          26892
File B   patched     1.73GHz Pentium-M     1            26777
File B   unmodified  2.13GHz Core2 duo     n/a          19938
File B   patched     2.13GHz Core2 duo     1            19681
File B   patched     2.13GHz Core2 duo     2            11458
MD5 sums match in all tests. (If anyone wants, I can post
test output with md5 sums as well.)
I've also run some long-running tests on the 8-way system to make
sure there are no race conditions around.
Comments are of course welcome...
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 30364 bytes
Desc: not available