[FFmpeg-devel] Nellymoser encoder

Bartlomiej Wolowiec bartek.wolowiec
Sun Aug 31 13:07:15 CEST 2008


Saturday 30 August 2008 18:10:41 Michael Niedermayer wrote:
> On Sat, Aug 30, 2008 at 03:42:37PM +0200, Bartlomiej Wolowiec wrote:
> > > Friday 29 August 2008 22:36:10 Michael Niedermayer wrote:
> > > > > > > > +
> > > > > > > > +void apply_mdct(NellyMoserEncodeContext *s, float *in, float *coefs)
> > > > > > > > +{
> > > > > > > > +    DECLARE_ALIGNED_16(float, in_buff[NELLY_SAMPLES]);
> > > > > > > > +
> > > > > > > > +    memcpy(&in_buff[0], &in[0], NELLY_SAMPLES * sizeof(float));
> > > > > > > > +    s->dsp.vector_fmul(in_buff, ff_sine_128, NELLY_BUF_LEN);
> > > > > > > > +    s->dsp.vector_fmul_reverse(in_buff + NELLY_BUF_LEN, in_buff + NELLY_BUF_LEN, ff_sine_128, NELLY_BUF_LEN);
> > > > > > > > +    ff_mdct_calc(&s->mdct_ctx, coefs, in_buff);
> > > > > > > > +}
> > > > > > >
> > > > > > > The data is copied once in encode_frame and twice here.
> > > > > > > There is no need to copy the data 3 times.
> > > > > > > vector_fmul can be used with a single memcpy to get the data
> > > > > > > into any destination, and vector_fmul_reverse doesn't even need
> > > > > > > 1 memcpy, so overall a single memcpy is enough.
> > > > > >
> > > > > > I hope you meant something similar to my solution.
> > > > >
> > > > > no, you still do 2 memcpy() but now the code is really messy as
> > > > > well.
> > > > >
> > > > > what you should do is, for each block of samples you get from the
> > > > > user:
> > > > > 1. apply one half of the window onto it with vector_fmul_reverse,
> > > > >    with the destination being some internal buffer
> > > > > 2. memcpy into the 2nd destination and apply the other half of the
> > > > >    window onto it with vector_fmul
> > > > > 3. run the mdct as appropriate on the internal buffers.
> > > >
> > > > Hmm, I considered it, but I don't understand exactly what I should
> > > > change... In the code I copy data two times:
> > > > a) in encode_frame - I convert int16_t to float and copy data to
> > > >    s->buf - I need to do this somewhere because vector_fmul requires
> > > >    float *. Additionally, part of the data is needed for the next
> > > >    call of encode_frame
> > > > b) in apply_mdct - here I think that some additional part of the
> > > >    buffer is needed.
> > > > If I understood correctly, I have to get rid of a), but how do I get
> > > > access to the old data when the next call of encode_frame is
> > > > performed, and how do I call vector_fmul on int16_t?
> > >
> > > have you tried setting AVCodec.sample_fmts to SAMPLE_FMT_FLT?
> > > I think ffmpeg should support this already. If it does not work then
> > > we can keep int16 for now, which would imply more copying.
> >
> > Hmm... I tried to use SAMPLE_FMT_FLT, but something doesn't work. I
> > made only these changes:
> >
> > float *samples = data;
> > ...
> > for (i = 0; i < avctx->frame_size; i++) {
> >     s->buf[s->bufsel][i] = samples[i]*(1<<15);
> > }
> > ...
> > .sample_fmts = (enum SampleFormat[]){SAMPLE_FMT_FLT,SAMPLE_FMT_NONE},
>
> hmm

Any idea? Or should I leave it as it is?
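For reference, Michael's three-step windowing scheme quoted above might look
roughly like this - just a sketch, where plain C loops stand in for the
dsputil vector_fmul/vector_fmul_reverse calls and all buffer names are made
up for illustration:

```c
#include <string.h>

#define NELLY_BUF_LEN 128

/* Hypothetical sketch of the single-memcpy windowing: for each incoming
 * block of NELLY_BUF_LEN samples, apply the falling half-window into the
 * tail of the previous MDCT input buffer (no copy needed), then memcpy
 * the block once and apply the rising half-window in place. */
static void window_block(const float *win,  /* stands in for ff_sine_128  */
                         const float *in,   /* new NELLY_BUF_LEN samples  */
                         float *prev_tail,  /* 2nd half of prev MDCT input */
                         float *cur_head)   /* 1st half of cur MDCT input  */
{
    int i;
    /* step 1: falling half-window (vector_fmul_reverse), straight into
     * the internal buffer, no memcpy */
    for (i = 0; i < NELLY_BUF_LEN; i++)
        prev_tail[i] = in[i] * win[NELLY_BUF_LEN - 1 - i];
    /* step 2: the single memcpy, then rising half-window in place
     * (vector_fmul) */
    memcpy(cur_head, in, NELLY_BUF_LEN * sizeof(float));
    for (i = 0; i < NELLY_BUF_LEN; i++)
        cur_head[i] *= win[i];
    /* step 3 (not shown): ff_mdct_calc on the completed buffers */
}
```

Each input block is thus touched by exactly one memcpy, as Michael suggests;
the two half-windowed copies land in the two MDCT input buffers that overlap
on this block.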

> > [...]
> >
> > > > +
> > > > +#define find_best(val, table, LUT, LUT_add, LUT_size) \
> > > > +    best_idx = \
> > > > +        LUT[av_clip ((((int)val) >> 8) + LUT_add, 0, LUT_size - 1)]; \
> > > > +    if (abs(val - table[best_idx]) > abs(val - table[best_idx + 1])) \
> > > > +        best_idx++;
> > >
> > > (int)some_float is slow, lrintf() should be faster
> >
> > yes, but rounding down is needed here. lrintf(x - 0.5)?
>
> ok but merge that 0.5 into LUT_add

val is divided by 256, so it doesn't make a difference. I changed only
(int) to lrintf()

> > > also if val is a float instead of an int then fabs() may actually be
> > > better than abs()
> >
> > ok
> >
> > > > +
> > > > +/**
> > > > + * Encodes NELLY_SAMPLES samples. It assumes that samples contains
> > > > + * 3 * NELLY_BUF_LEN values.
> > > > + *  @param s               encoder context
> > > > + *  @param output          output buffer
> > > > + *  @param output_size     size of output buffer
> > > > + */
> > > > +static void encode_block(NellyMoserEncodeContext *s, unsigned char *output, int output_size)
> > > > +{
> > > > +    PutBitContext pb;
> > > > +    int i, band, block, best_idx, power_idx = 0;
> > > > +    float power_val, power_candidate, coeff, coeff_sum;
> > > > +    int band_start, band_end;
> > > > +    float pows[NELLY_FILL_LEN];
> > > > +    int bits[NELLY_BUF_LEN];
> > > > +
> > > > +    const float C = 1.0;
> > > > +    const float D = 2.0;
> > > > +
> > > > +    apply_mdct(s);
> > > > +
> > > > +    init_put_bits(&pb, output, output_size * 8);
> > > > +
> > > > +    band_start = 0;
> > > > +    band_end = ff_nelly_band_sizes_table[0];
> > > > +    for (band = 0; band < NELLY_BANDS; band++) {
> > > > +        coeff_sum = 0;
> > > > +        for (i = band_start; i < band_end; i++) {
> > > > +            //coeff_sum += s->mdct_out[i] * s->mdct_out[i]
> > > > +            //           + s->mdct_out[i + NELLY_BUF_LEN] * s->mdct_out[i + NELLY_BUF_LEN];
> > > > +            coeff_sum += pow(fabs(s->mdct_out[i]), D) + pow(fabs(s->mdct_out[i + NELLY_BUF_LEN]), D);
> > > > +        }
> > > > +        power_candidate =
> > > > +            //log(FFMAX(1.0, coeff_sum / (ff_nelly_band_sizes_table[band] << 7))) * 1024.0 / M_LN2;
> > > > +            C * log(FFMAX(1.0, coeff_sum / (ff_nelly_band_sizes_table[band] << 7))) * 1024.0 / log(D);
> > > > +
> > > > +        if (band) {
> > > > +            power_candidate -= power_idx;
> > > > +            find_best(power_candidate, ff_nelly_delta_table, sf_delta_lut, 37, 78);
> > > > +            put_bits(&pb, 5, best_idx);
> > > > +            power_idx += ff_nelly_delta_table[best_idx];
> > > > +        } else {
> > > > +            //base exponent
> > > > +            find_best(power_candidate, ff_nelly_init_table, sf_lut, -20, 96);
> > > > +            put_bits(&pb, 6, best_idx);
> > > > +            power_idx = ff_nelly_init_table[best_idx];
> > > > +        }
> > >
> > > the choice of power_idx/best_idx values could still be found with
> > > viterbi. It's somewhat similar to (and simpler than) our
> > > viterbi/trellis ADPCM encoder.
> >
> > Ok. I read some information about viterbi
> > ( http://en.wikipedia.org/wiki/Viterbi_algorithm ;) ).
> > I see that it should be assumed:
> > states - values - there will be about 2^16 of them?
> > observations - the following values of power_candidate
> > probabilities - e.g. |searched value - assumed value|^C
> > transition_probability - anything, e.g. 1
>
> Wiki is using the term "viterbi algorithm" in a very narrow sense;
> I've not seen anyone outside wiki do that.
> Wikipedians seem to love describing the same algorithm 50 times
> with different variables and attributing it to different authors. I am
> not disputing that it has been reinvented >50 times in various fields,
> but it's still the same algorithm.
> I suspect several points of what is under
> "Algorithms that use dynamic programming"

such a name tells more :)

> are identical to viterbi, just
> with different variable names.
>
> Considering wiki's narrow view, "viterbi" might not be the best choice
> as a reference, as it's written with a specific use case in mind that is
> different from ours. I am not sure if there is an abstract description
> outside of specific use cases on wiki.
>
> > and write it more cleverly. It seems to me that it will be quite a
> > slow algorithm.
>
> yes, if it's not written cleverly it would be.
>
> let's first agree on a little random terminology that is closer to our
> use case:
> there are the power_candidates, which are the ideal values we want to
> store but likely won't be able to store exactly.
>
> there is a graph with bands*2^16 nodes, a column for each band and a
> row for each value.
> And we want to find the path with the least "weight" through this graph
> from left to right. The weight of each node is simply the distance
> (squared) from the power_candidate. The steps we are allowed to take
> correspond to the 32 difference or 64 start values that could be stored.
>
> Without optimizations, we would thus have to keep track of up to 2^16
> paths and their weights and try each of these paths with each of the 32
> possible differences for each band, that's
>
> 23*2^16*32 ~ 50 million operations per frame
>
> If we now limit ourselves to just the 2000 nodes surrounding each
> power_candidate, then there would be just 2000*32 transitions to
> consider; but as only <=6 of the values from the difference table would
> fall within the next 2000-node area, it would be just 2000*6, thus
> making the overall operation count
>
> 23*2000*6 = 276000 operations per frame, which is IMHO acceptable speed
> for a high-quality mode.
> And it is unlikely that the globally optimal path will have nodes far
> away from the 2000 surrounding; besides, it's easy to adjust this for
> speed vs. quality.
>

I tried to implement such a version.
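The band-limited trellis Michael outlines might be sketched like this -
heavily simplified: tiny made-up table sizes, a window of WINDOW candidate
values per band instead of 2000, squared distance as node weight, and the
function name is hypothetical:

```c
#include <float.h>

#define NBANDS 4   /* 23 bands in the real codec        */
#define NDELTA 4   /* 32 entries in ff_nelly_delta_table */
#define WINDOW 8   /* nodes kept per band; 2000 in the mail */

/* Windowed Viterbi search: for each band keep only WINDOW candidate
 * power_idx values centred on the ideal power_candidate; for each kept
 * node remember the cheapest way to reach it from the previous band via
 * one of the allowed delta values. Returns the total path cost. */
static float trellis_search(const float *cand,  /* ideal value per band */
                            const int *deltas,  /* allowed differences  */
                            int *out_idx)       /* chosen value per band */
{
    float cost[NBANDS][WINDOW];
    int   from[NBANDS][WINDOW];
    int   node[NBANDS][WINDOW];
    float total;
    int b, i, j, d, best;

    /* build the window of candidate values around each ideal value */
    for (b = 0; b < NBANDS; b++)
        for (i = 0; i < WINDOW; i++) {
            node[b][i] = (int)cand[b] - WINDOW / 2 + i;
            cost[b][i] = FLT_MAX;
        }
    /* first band: in this sketch any start value in the window is allowed */
    for (i = 0; i < WINDOW; i++) {
        float e = cand[0] - node[0][i];
        cost[0][i] = e * e;
        from[0][i] = -1;
    }
    /* forward pass: each reachable node tries every allowed delta;
     * most deltas land outside the next window and are skipped */
    for (b = 1; b < NBANDS; b++)
        for (j = 0; j < WINDOW; j++) {
            if (cost[b-1][j] == FLT_MAX)
                continue;
            for (d = 0; d < NDELTA; d++) {
                int v = node[b-1][j] + deltas[d];
                int k = v - node[b][0];  /* position in band b's window */
                float e, c;
                if (k < 0 || k >= WINDOW)
                    continue;
                e = cand[b] - v;
                c = cost[b-1][j] + e * e;
                if (c < cost[b][k]) {
                    cost[b][k] = c;
                    from[b][k] = j;
                }
            }
        }
    /* pick the cheapest final node and walk the path back */
    best = 0;
    for (i = 1; i < WINDOW; i++)
        if (cost[NBANDS-1][i] < cost[NBANDS-1][best])
            best = i;
    total = cost[NBANDS-1][best];
    for (b = NBANDS - 1; b >= 0; b--) {
        out_idx[b] = node[b][best];
        best = from[b][best];
    }
    return total;
}
```

With the real sizes (23 bands, a 2000-node window, <=6 feasible deltas per
node) the inner work matches the 23*2000*6 = 276000 operations per frame
estimated above.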

> An alternative would be, instead of keeping the N paths that are
> closest to the current power_candidate, to keep the N overall best
> paths so far; that's what adpcm.c does, and it likely has better
> quality per speed, but is harder to implement.
>
> > Maybe, relying only on this idea, it would be better to allow
> > transitions to e.g. the 3 best states? (but personally I hardly see a
> > connection with viterbi here...)
>
> optimized viterbi :)

The enclosed patch contains an attempt at dynamic allocation of exponents.
It isn't ideal, but I want to be sure I'm not going wrong. It's quite
slow - and I don't know if it can be accelerated more than 2-3 times.
Because it's so slow and the quality is often not really better, I suggest
keeping both versions - how should the user be allowed to choose?


-- 
Bartlomiej Wolowiec
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nellymoser5.patch
Type: text/x-diff
Size: 9903 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080831/d2719b5a/attachment.patch>
