[FFmpeg-devel] PATCH: allow load_input_picture, select_input_picture to be architecture dependent

Marc Hoffman mmhoffm
Tue Jul 24 20:08:25 CEST 2007


Hi

On 7/23/07, Michael Niedermayer <michaelni at gmx.at> wrote:
>
> Hi
>
> On Mon, Jul 23, 2007 at 06:06:23PM -0400, Robin Getz wrote:
> > On Thu 19 Jul 2007 09:11, Michael Niedermayer pondered:
> > > On Thu, Jul 19, 2007 at 07:35:55AM -0400, Marc Hoffman wrote:
> > > > We would be me ++ folks using Blackfin in real systems that are
> > > > waiting for better system performance.
> > >
> > > doing the copy in the background like you originally did requires
> > > a few more modifications than you made, that is you would have to add
> > > checks at several points so that we don't read the buffer before the
> > > specific part has been copied, this sounds quite hackish and I am not
> > > happy about it
> >
> > architecture specific optimisations are never a happy thing.
>
> no, most of them are clean and well separated but this dma memcpy thing
> is a mess and has no chance to reach svn unless someone shows first that
> all alternatives are worse (benchmarks absolutely required)
> the alternatives are: using the preserve flag and changing ffmpeg.c,
> or doing the dma copy but waiting until it's done


I have been thinking along these lines for the input image used in the
mpegvideo encode process.  The patch would be pretty clean, but we would
incur a one-frame delay for it to work correctly.  When I get the data into
an easily reviewable format I will provide it to you.  I don't think Blackfin
is the only processor that would benefit from this type of system
optimization.
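
Roughly what I have in mind, as a sketch only; async_copy_start(),
async_copy_wait() and the rest of the names are placeholders, not code from
any patch:

/* One-frame-delay idea: start a background copy of frame N and encode
 * frame N-1, whose copy has already completed.  The async_copy_* calls
 * stand in for whatever the architecture provides (2D DMA on Blackfin,
 * plain memcpy elsewhere). */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct Frame { uint8_t *data; size_t size; } Frame;

/* fallback: a plain blocking copy, so the sketch works on any target */
static void async_copy_start(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void async_copy_wait(void *dst)                             { (void)dst; }

/* stands in for the real encode path */
static void encode_frame(const Frame *f) { (void)f; }

static void encode_sequence(Frame *in, Frame *buf, int nb_frames)
{
    int i;

    if (nb_frames <= 0)
        return;
    async_copy_start(buf[0].data, in[0].data, in[0].size);
    for (i = 1; i < nb_frames; i++) {
        /* kick off the copy of frame i into the other half of the
         * double buffer ... */
        async_copy_start(buf[i & 1].data, in[i].data, in[i].size);
        /* ... and encode frame i-1 once its copy is known to be complete */
        async_copy_wait(buf[(i - 1) & 1].data);
        encode_frame(&buf[(i - 1) & 1]);
    }
    async_copy_wait(buf[(nb_frames - 1) & 1].data);
    encode_frame(&buf[(nb_frames - 1) & 1]);
}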

> if these 2 are slower than a correct implementation with all the needed
> checks and locks in place then we can see if the gain (seen in the
> benchmark) is worth the mess (seen in the patch)
>
>
> >
> > I would think that with the proper defines
> >
> > #ifdef USE_NONBLOCKINGCPY
> > extern void non_blocking_memcpy(void *dest, const void *src, size_t n);
> > extern void non_blocking_memcpy_done(void *dest);
> > #else
> > #define non_blocking_memcpy(dest, src, n) memcpy(dest, src, n)
> > #define non_blocking_memcpy_done(dest)
> > #endif
> >
> > it could be made less "hackish" - and still provide the optimisation.
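
Something like that is what I was imagining at the call sites too.  Just as a
sketch (the plane copy below is simplified, not the real mpegvideo code, and
the fallback defines are repeated only so the example stands alone):

#include <stdint.h>
#include <string.h>

#ifndef USE_NONBLOCKINGCPY
#define non_blocking_memcpy(dest, src, n)  memcpy(dest, src, n)
#define non_blocking_memcpy_done(dest)     do { } while (0)
#endif

static void copy_plane(uint8_t *dst, const uint8_t *src, int stride, int height)
{
    non_blocking_memcpy(dst, src, (size_t)stride * height);

    /* other per-picture setup could run here while the copy is in flight */

    non_blocking_memcpy_done(dst);  /* must not return until dst is valid */
    /* only now is it safe to read from dst */
}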
>
> the buffer is immediately needed after the copy, it's just not the whole
> buffer which is, it's rather used from top to bottom, so with
> spin locks or equivalent placed all over the place and some way to figure
> out how much has been copied it's possible
>
> also you could change the code more significantly to make the memcpy +
> done possible but it would add 1 frame delay and as said require some
> changes
> all in all I do not think this is worth it ...


I'm looking at the mpeg encoder right now, specifically these functions:

load_input_picture
select_input_picture

which seem like relatively easy places to make the adjustment in a way that
is not too convoluted.  I'm also thinking I would like to add the memory-move
primitives to the dsp specifics, which might even include padding
optimizations.  This is all very blue sky right now; I have a few other
things I need to finish up before circling back to this particular issue.
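
Roughly what I mean by the dsp-specific memory-move primitive, again only a
sketch; the struct and field names below are made up for illustration and are
not the real DSPContext:

#include <stdint.h>
#include <string.h>

/* a per-architecture picture-copy hook, in the spirit of the other dsputil
 * function pointers; names are placeholders */
typedef struct VideoCopyContext {
    /* copies a w x h region with independent strides; an architecture
     * specific init could point this at a DMA based version and also fold
     * in the edge padding */
    void (*copy_plane)(uint8_t *dst, int dst_stride,
                       const uint8_t *src, int src_stride, int w, int h);
} VideoCopyContext;

static void copy_plane_c(uint8_t *dst, int dst_stride,
                         const uint8_t *src, int src_stride, int w, int h)
{
    while (h--) {
        memcpy(dst, src, w);
        dst += dst_stride;
        src += src_stride;
    }
}

void video_copy_init(VideoCopyContext *c)
{
    c->copy_plane = copy_plane_c;
    /* a Blackfin init would override copy_plane with the 2D-DMA version */
}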


>
> > > is mpeg4 encoding speed on blackfin really that important?
> >
> > There are lots of people waiting for it to get better than it is. (Like me)
> >
> > > can't you just optimize memcpy() in a compatible non-background way?
> >
> > memcpy is already as optimized as it can be
> >   - it is already in assembly
> >   - doing int (32-bit) copies when possible.
> >   - The loop comes down to:
> >     MNOP || [P0++] = R3 || R3 = [I1++];
> >     Which is a read/write in a single instruction cycle (if things are all
> >     in cache). This coupled with zero overhead hardware loops makes things
> >     as fast as they can be.
> >
> > The things that slow this down are cache misses, cache flushes, external
> > memory page open/close - things you can't avoid. If we could be doing
> > compute at the same time - it could make up for some of these stalls.
>
> it should be faster to read several and then write several things instead
> of read 1 write it, read next write it, ...


This does 4 samples at a time, and the memory system brings 32 bytes at a
whack into the L1 caches.  It doesn't matter whether we move the data
byte-wise or in quads; the problem is not here, it's in moving the data from
external to internal memory.  Now we have some nice tools for working around
this on the Blackfin, one of them being the 2D DMA engine, which can be
thought of as a separate processor that can rearrange and move data samples
around in the background.  I would like to do some benchmark analysis, as
Michael points out, but haven't done so yet.  The load_input_picture and
select_input_picture APIs would, I think, be the right place to take
advantage of this type of architectural feature.
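
In code the idea looks roughly like this.  bfin_mdma2d_start() and
bfin_mdma2d_wait() are only placeholders for the descriptor/MDMA register
programming, not a real API, and the fallback bodies below just do a strided
memcpy so the sketch builds anywhere:

#include <stdint.h>
#include <string.h>

/* placeholder stubs; a Blackfin version would program the MDMA channels */
static void bfin_mdma2d_start(uint8_t *dst, int dst_stride,
                              const uint8_t *src, int src_stride, int w, int h)
{
    while (h--) {
        memcpy(dst, src, w);
        dst += dst_stride;
        src += src_stride;
    }
}

static void bfin_mdma2d_wait(void)
{
    /* would poll or sleep on the DMA completion interrupt */
}

static void copy_plane_bfin(uint8_t *dst, int dst_stride,
                            const uint8_t *src, int src_stride, int w, int h)
{
    /* the 2D engine walks w bytes per row for h rows with independent
     * source and destination strides, so strided external->L1 moves need
     * no CPU cycles once started */
    bfin_mdma2d_start(dst, dst_stride, src, src_stride, w, h);

    /* load_input_picture() could return here and let the copy finish in the
     * background; a strictly blocking version just waits right away: */
    bfin_mdma2d_wait();
}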


> also you can do the memcpy() with the DMA thing, you just have to wait for
> it to finish before returning


Yeah, this does make things a little faster, but it's very far from optimal.
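
For the benchmarks I was planning on starting with something as simple as
timing the copy path in isolation, along these lines (buffer size and run
count are arbitrary, and the memcpy would be swapped for the DMA variants):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    const size_t size = 720 * 576;          /* one PAL luma plane */
    const int    runs = 200;
    uint8_t *src = malloc(size), *dst = malloc(size);
    double t0, t1;
    int i;

    if (!src || !dst)
        return 1;
    memset(src, 0x80, size);

    t0 = now_us();
    for (i = 0; i < runs; i++)
        memcpy(dst, src, size);             /* plain memcpy baseline */
    t1 = now_us();

    printf("memcpy: %.1f us/frame\n", (t1 - t0) / runs);
    free(src);
    free(dst);
    return 0;
}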



