[FFmpeg-devel] [PATCH 04/12] Add vector_fmul_matrix to dsputil

Måns Rullgård mans
Wed Oct 21 03:00:25 CEST 2009


Loren Merritt <lorenm at u.washington.edu> writes:

> M?ns Rullg?rd <mans at mansr.com> writes:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>>> On Sun, Oct 18, 2009 at 11:29:22PM +0100, M?ns Rullg?rd wrote:
>>>> Michael Niedermayer <michaelni at gmx.at> writes:
>>>>> On Sun, Oct 18, 2009 at 10:13:20PM +0100, M?ns Rullg?rd wrote:
>>>>>> Michael Niedermayer <michaelni at gmx.at> writes:
>>>>>>> On Sun, Oct 18, 2009 at 09:17:48PM +0100, M?ns Rullg?rd wrote:
>>>>>>>> Michael Niedermayer <michaelni at gmx.at> writes:
>>>>>>>>>> +        }
>>>>>>>>>> +    } else {
>>>>>>>>>> +        for (i = 0; i < len; i++) {
>>>>>>>>>> +            const float *m = mtx;
>>>>>>>>>> +            for (j = 0; j < w; j++) {
>>>>>>>>>> +                float s = 0;
>>>>>>>>>>
>>>>>>>>>> +                for (k = 0; k < w; k++)
>>>>>>>>>> +                    s += v[k][i] * *m++;
>>>>>>>>>
>>>>>>>>> this is quite inefficient because for(k) v[k][i] needs 2
>>>>>>>>> memory reads a flat 2d array would be better
>>>>>>>>
>>>>>>>> And how will the data magically transform itself into such a layout?
>>>>>>>
>>>>>>> What is the a reason that the data is not in that layout?  If
>>>>>>> the awnser is that some decoder is implemenetd that way then
>>>>>>> my next question is, would there be a disadvanatge in
>>>>>>> changing it?
>>>>>>
>>>>>> Many of the audio decoders allocate the channels separately.  I didn't
>>>>>> write them, so I can't say how difficult it would be to change that.
>>>>>
>>>>> for many channels it should even be faster to memcpy them instead of the
>>>>> double dereferences
>>>>> memcpy needs O(w*len)
>>>>> the dereferences are O(w*w*len)
>>>>
>>>> I don't expect w to be greater than 8.
>>>> It will probably be 2 or 6 in most cases.
>>>
>>> for 6 channels we have 36 dereferences, a cpy copying just
>>> 1 value at a time needs 6 reads and 6 writes to get rid of these 36
>>> at that naive instruction counting level, it seems my suggesting
>>> with copy is faster than yours without
>>
>> Can you please explain exactly what you're thinking of.  I thought you
>> were saying the audio channel data was to be moved such that all the
>> channels would be contiguous in memory instead of passing pointer to
>> each.  Copying it into such a layout would require w*len operations,
>> not w*w, and I still don't see how that would be massively more
>> efficient.  I also don't understand what 36 unnecessary dereferences
>> you're talking about.  The entire matrix must of course be read for
>> each sample.  We are doing len [1 x w]*[w x w] matrix multiplications.
>
> The unnecessary dereference is v[k][i]. Which is 2 memory loads as-is,
> but could be 1 memory load either if v[][] is a flat array,

But it's not, and I don't control that.

> or if the k loop is fully-unrolled and all the v[k] fit in registers
> at once.

We could make it an array of functions up to 8 or so.  Then the size
would be a constant, possibly allowing the compiler to do a better
job.

I still don't see what copying data to a temp buffer would achieve.

> Ideally the matrix coefficients would stay in regs too, but that's not
> going to happen in C for w=6, so save it for the simd version. Still,
> you could eliminate half of the m[] loads (and half of the
> aforementioned v[] loads, if you don't fix those), by 2x unrolling the
> i loop.

That's the job of the compiler.  If anyone cares, they can write asm
versions for the common cases.  That is the entire point of adding
this function in the first place.

-- 
M?ns Rullg?rd
mans at mansr.com



More information about the ffmpeg-devel mailing list