[FFmpeg-devel] VP8 sliced threading

Tue Jul 10 20:19:15 CEST 2012

On Fri, Jul 06, 2012 at 05:36:56PM -0400, Daniel Kang wrote:
> On Fri, Jul 6, 2012 at 12:37 PM, Reimar Döffinger
> <Reimar.Doeffinger at gmx.de> wrote:
> 
> > On Fri, Jul 06, 2012 at 03:14:17AM -0400, Daniel Kang wrote:
> > > >
> > > >> Also how did you get performance numbers? For low horizontal
> > > >> resolution I'd expect it to potentially get vastly slower on
> > > >> Windows when the sleep comes in, since the default minimum
> > > >> granularity of the sleep is 10ms, which should be longer than
> > > >> decoding a whole frame takes.
> > > >>
> > > >
> > > > I only tested HD clips, on Linux and Windows. I will test a low-res
> > > > clip once I can find a suitable one.
> > > >
> > >
> > > Sorry for the second email. Where did you find the information on
> > > granularity on sleep?
> > >
> > > http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx
> > > states that a value of 0 will "[cause] the thread to relinquish the
> > > remainder of its time slice to any other thread that is ready to run." I
> > > cannot find information on implementation details.
> >
> > I'm afraid it is not documented.
> > Note I am not sure if things changed _after_ XP but my information
> > should at least be correct for Windows XP.
> >
> 
> My benchmarks were on Windows 7.

I don't know if it makes a difference. Does adding a timeBeginPeriod call
change anything in performance?
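
A minimal sketch of how such an experiment could be wired up, assuming
the benchmark can be wrapped in a single call. decode_clip() and
run_with_fine_timer() are hypothetical names used only for illustration;
timeBeginPeriod()/timeEndPeriod() are the documented winmm calls. On
non-Windows builds the wrapper simply calls the function.

```c
#ifdef _WIN32
#include <windows.h>
#include <mmsystem.h>   /* timeBeginPeriod / timeEndPeriod, link with winmm */
#endif

/* Hypothetical stand-in for the decode loop being benchmarked. */
static int decode_clip(void *opaque)
{
    int *frames = opaque;
    return *frames;     /* pretend every frame was decoded */
}

/* Run fn() with the Windows timer resolution raised to 1 ms.
 * On other systems this just calls fn(). */
static int run_with_fine_timer(int (*fn)(void *), void *opaque)
{
    int ret;
#ifdef _WIN32
    int raised = timeBeginPeriod(1) == TIMERR_NOERROR;
#endif
    ret = fn(opaque);
#ifdef _WIN32
    if (raised)
        timeEndPeriod(1);   /* must match the timeBeginPeriod argument */
#endif
    return ret;
}
```

Comparing the timings with and without the wrapper should show whether
the sleep granularity matters for the clip being tested.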

> I don't have access to an XP box. Can I get ssh/remote desktop to one for
> testing?

I'm afraid I do not have an XP machine at hand.

> > Also the pause instruction is meant for spinlock cases. You are using
> > sleep in the same loop which will usually go into the kernel and in
> > general this is not really a spinlock, so I don't think it is helping
> > and I think it is not supposed to be used like this.
> 
> 
> Do you think implementing this as a spinlock will help in this case?

You can use a real spinlock only if you know for sure the process you
are waiting for is running on another CPU. You can't properly guarantee
that in a userspace application.

> > I also think I read that using sched_yield this way is not portable and
> > thus very much discouraged (it is often implemented as a NOP).
> > The idea being that proper locking/signalling should be fast enough (and
> > actually can be quite a bit faster if sched_yield is actually a NOP).
> 
> 
> On one Linux machine I tested on, using mutex locks and waits caused sliced
> threading to dramatically slow down.

Do you know what causes that? Is it overhead in the case where we do
not have to wait, or is it the waiting code that costs time?
Or is it maybe just the mutex around the assignment on its own?
If it is the mutexes alone, is the cost of the one around the assignment
or of the one around the check significantly higher?
If it is not the mutexes, might it be that the broadcast results in a
thread switch even though, most of the time, the progress is not actually
sufficient to release any thread?
Maybe you should not check/update the position for every MB but only
for e.g. every 8th instead?
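
A minimal sketch of the "every 8th MB" idea, assuming a per-row progress
counter protected by a pthread mutex and condition variable. RowProgress,
REPORT_INTERVAL and the function names are hypothetical and are not taken
from the patch.

```c
#include <pthread.h>

#define REPORT_INTERVAL 8   /* hypothetical: publish every 8th MB */

typedef struct RowProgress {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int mb_pos;       /* number of MBs published as finished */
    int broadcasts;   /* counts wake-ups, for illustration only */
} RowProgress;

/* Called by the decoding thread after finishing macroblock mb_x. */
static void report_progress(RowProgress *p, int mb_x, int mb_width)
{
    /* Skip the lock entirely unless this is an 8th MB or the row end. */
    if ((mb_x + 1) % REPORT_INTERVAL && mb_x + 1 != mb_width)
        return;
    pthread_mutex_lock(&p->lock);
    p->mb_pos = mb_x + 1;
    p->broadcasts++;
    pthread_cond_broadcast(&p->cond);
    pthread_mutex_unlock(&p->lock);
}

/* Called by a dependent thread; blocks until `needed` MBs are published. */
static void wait_progress(RowProgress *p, int needed)
{
    pthread_mutex_lock(&p->lock);
    while (p->mb_pos < needed)
        pthread_cond_wait(&p->cond, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Single-threaded walk over one row, returning the number of broadcasts. */
static int count_broadcasts(int mb_width)
{
    RowProgress p = { .mb_pos = 0, .broadcasts = 0 };
    int n;

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.cond, NULL);
    for (int mb_x = 0; mb_x < mb_width; mb_x++)
        report_progress(&p, mb_x, mb_width);
    wait_progress(&p, mb_width);   /* returns at once, the row is done */
    n = p.broadcasts;
    pthread_cond_destroy(&p.cond);
    pthread_mutex_destroy(&p.lock);
    return n;
}
```

The cost is that a waiting thread may be released up to 7 MBs later than
strictly necessary, in exchange for roughly an eighth of the lock and
broadcast traffic.
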
And can't this somehow be parallelized more efficiently, especially
in the case where the number of slices is much higher than the number of
threads?
However, I just realized that your implementation is not going to work
reliably in some cases.
There is no "volatile" on thread_mb_pos, so the compiler might move
your assignment to a completely different place (in particular, it
might move it far ahead of the point where the slice is actually
decoded, leading to higher speed but also wrong decoding if you are
unlucky).
Of course even volatile does not really help on architectures such as (I
believe) ARM, where you will at least need some memory barrier to make
it work reliably (possibly you need that even on x86 when MMX writes
that bypass the cache are done).
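
A minimal sketch of the ordering requirement described above, assuming a
compiler with C11 <stdatomic.h> (an assumption; the patch itself uses a
plain variable). Only the name thread_mb_pos comes from the patch;
mb_data, the two thread functions and run_ordering_demo() are
hypothetical. The release store plays the role of the barrier after the
decode, the acquire load the role of the barrier before dependent reads.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_MBS 10000

static int mb_data[NUM_MBS];       /* stands in for the decoded pixels */
static atomic_int thread_mb_pos;   /* number of MBs finished so far */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_MBS; i++) {
        mb_data[i] = i + 1;   /* "decode" the macroblock */
        /* release: the write above cannot be moved after this store */
        atomic_store_explicit(&thread_mb_pos, i + 1, memory_order_release);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    int *errors = arg;
    for (int i = 0; i < NUM_MBS; i++) {
        /* acquire: the read below cannot be moved before this load */
        while (atomic_load_explicit(&thread_mb_pos,
                                    memory_order_acquire) <= i)
            ;   /* busy-wait only to keep the sketch short */
        if (mb_data[i] != i + 1)
            (*errors)++;
    }
    return NULL;
}

/* Returns the number of MBs the consumer saw before they were written. */
static int run_ordering_demo(void)
{
    pthread_t prod, cons;
    int errors = 0;

    atomic_store(&thread_mb_pos, 0);
    pthread_create(&cons, NULL, consumer, &errors);
    pthread_create(&prod, NULL, producer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return errors;
}
```

With the release/acquire pair in place the consumer should never observe
an undecoded macroblock. The busy-wait is there only for brevity and is
not a suggestion for how the waiting itself should be done.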