[FFmpeg-devel] [PATCH 04/12] Add vector_fmul_matrix to dsputil

Michael Niedermayer michaelni
Wed Oct 21 20:09:47 CEST 2009


On Mon, Oct 19, 2009 at 01:00:24AM +0100, Måns Rullgård wrote:
> Michael Niedermayer <michaelni at gmx.at> writes:
> 
> > On Sun, Oct 18, 2009 at 11:29:22PM +0100, Måns Rullgård wrote:
> >> Michael Niedermayer <michaelni at gmx.at> writes:
> >> 
> >> > On Sun, Oct 18, 2009 at 10:13:20PM +0100, Måns Rullgård wrote:
> >> >> Michael Niedermayer <michaelni at gmx.at> writes:
> >> >> 
> >> >> > On Sun, Oct 18, 2009 at 09:17:48PM +0100, Måns Rullgård wrote:
> >> >> >> Michael Niedermayer <michaelni at gmx.at> writes:
> >> >> > [...]
> >> >> >> >> +        }
> >> >> >> >> +    } else {
> >> >> >> >> +        for (i = 0; i < len; i++) {
> >> >> >> >> +            const float *m = mtx;
> >> >> >> >> +            for (j = 0; j < w; j++) {
> >> >> >> >> +                float s = 0;
> >> >> >> >
> >> >> >> >> +                for (k = 0; k < w; k++)
> >> >> >> >> +                    s += v[k][i] * *m++;
> >> >> >> >
> >> >> >> > this is quite inefficient because, in the for(k) loop, v[k][i]
> >> >> >> > needs 2 memory reads; a flat 2d array would be better
> >> >> >> 
> >> >> >> And how will the data magically transform itself into such a layout?
> >> >> >
> >> >> > What is the reason that the data is not in that layout?
> >> >> > If the answer is that some decoder is implemented that way then my next
> >> >> > question is, would there be a disadvantage in changing it?
> >> >> 
> >> >> Many of the audio decoders allocate the channels separately.  I didn't
> >> >> write them, so I can't say how difficult it would be to change that.
> >> >
> >> > for many channels it should even be faster to memcpy them instead of the
> >> > double dereferences
> >> > memcpy needs O(w*len)
> >> > the dereferences are O(w*w*len)
> >> 
> >> I don't expect w to be greater than 8.
> >> It will probably be 2 or 6 in most cases.
> >
> > for 6 channels we have 36 dereferences per sample; a copy moving just
> > 1 value at a time needs 6 reads and 6 writes to get rid of these 36.
> > At that naive instruction-counting level, it seems my suggestion
> > with a copy is faster than yours without
> 
> Can you please explain exactly what you're thinking of.  I thought you
> were saying the audio channel data was to be moved such that all the
> channels would be contiguous in memory instead of passing a pointer to
> each.

> Copying it into such a layout would require w*len operations,

yes

the matrix multiplication, as it's implemented, requires w*w*len operations
though; for large w (like 6) the copy might be faster (if it's in cache
and all that) than 1 extra dereference for each of the w*w*len operations
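
To make the counting concrete, here is a minimal sketch of the two variants
being compared. It is not the code from the patch: the function names, the
separate output arrays, the MAX_CH bound and the mtx[j*w + k] indexing are
assumptions made for illustration. Whether the gather version is actually
faster depends on the compiler and cache behaviour, so it would need
benchmarking.

```c
#include <stddef.h>

#define MAX_CH 8  /* assumption: w <= 8, as estimated earlier in the thread */

/* Variant A: one pointer per channel, in the style of the quoted loop.
 * out[j][i] = sum over k of in[k][i] * mtx[j*w + k] */
static void mix_planar(float **out, float *const *in, const float *mtx,
                       int w, int len)
{
    for (int i = 0; i < len; i++) {
        const float *m = mtx;
        for (int j = 0; j < w; j++) {
            float s = 0;
            for (int k = 0; k < w; k++)
                s += in[k][i] * *m++;  /* naively: load in[k], then in[k][i] */
            out[j][i] = s;
        }
    }
}

/* Variant B: gather one sample from every channel into a small flat
 * buffer (w reads + w writes), then do the w*w multiply-adds from it. */
static void mix_gather(float **out, float *const *in, const float *mtx,
                       int w, int len)
{
    float tmp[MAX_CH];

    for (int i = 0; i < len; i++) {
        const float *m = mtx;
        for (int k = 0; k < w; k++)
            tmp[k] = in[k][i];         /* the only pointer-table accesses */
        for (int j = 0; j < w; j++) {
            float s = 0;
            for (int k = 0; k < w; k++)
                s += tmp[k] * *m++;    /* flat access, no extra dereference */
            out[j][i] = s;
        }
    }
}
```

Both functions compute the same result; they differ only in how often the
channel pointer table is consulted per sample (w*w times versus w times).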


> not w*w, and I still don't see how that would be massively more
> efficient.  

> I also don't understand what 36 unnecessary dereferences
> you're talking about.  

as it's not a flat array, reading it top-down needs an extra dereference
for each element compared to what a flat array would need, unless the
pointers are all in registers ...


> The entire matrix must of course be read for
> each sample.  

> We are doing len [1 x w]*[w x w] matrix multiplications.

you can also see it as a single [len x w]*[w x w] matrix multiplication
and that may also allow faster matrix multiplication algorithms to be used ...
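
A rough sketch of that view, assuming the samples were stored interleaved in
one flat row-major [len x w] buffer. The function name mix_flat and the
mtx[j*w + k] layout (weight of input channel k in output channel j) are
hypothetical, chosen to match the indexing of the quoted loop rather than
taken from the patch.

```c
#include <stddef.h>

/* x and y are flat, row-major [len x w] arrays (one row per sample);
 * mtx is a row-major [w x w] array.
 * y[i*w + j] = sum over k of x[i*w + k] * mtx[j*w + k] */
static void mix_flat(float *y, const float *x, const float *mtx,
                     int w, int len)
{
    for (int i = 0; i < len; i++) {
        const float *xi = x + (size_t)i * w;   /* row i of the input */
        float *yi       = y + (size_t)i * w;   /* row i of the output */
        for (int j = 0; j < w; j++) {
            const float *m = mtx + (size_t)j * w;
            float s = 0;
            for (int k = 0; k < w; k++)
                s += xi[k] * m[k];
            yi[j] = s;
        }
    }
}
```

Written this way the whole operation is one dense matrix product over flat
arrays, so the triple loop could in principle be swapped for any optimised
matrix-multiply routine without touching the callers.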


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Into a blind darkness they enter who follow after the Ignorance,
they as if into a greater darkness enter who devote themselves
to the Knowledge alone. -- Isha Upanishad