[FFmpeg-devel] [RFC] AAC Encoder

Sat Aug 16 23:06:33 CEST 2008

On Sat, Aug 16, 2008 at 06:00:39PM +0300, Kostya wrote:
> On Sat, Aug 16, 2008 at 03:57:56AM +0200, Michael Niedermayer wrote:
> > On Fri, Aug 15, 2008 at 07:59:52PM +0300, Kostya wrote:
> > [...]
> > 
> > > /**
> > >  * Set window sequence and related parameters for channel element.
> > >  *
> > >  * @param ctx   model context
> > >  * @param audio samples for the current frame
> > >  * @param la    lookahead samples (NULL when unavailable)
> > >  * @param tag   number of channel element to analyze
> > >  * @param type  channel element type (e.g. ID_SCE or ID_CPE)
> > >  * @param cpe   pointer to the current channel element
> > >  */
> > > void ff_aac_psy_suggest_window(AACPsyContext *ctx, int16_t *audio, int16_t *la, int tag, int type, ChannelElement *cpe);
> > 
> > I really think this should return the window size as int
> > 0 could mean "dont know try both" possibly, of course it can be done
> > differently.
> 
> It also set window shape (sine/KBD) and does that for up to two channels.

ive missed that. In that case returning them as done currently is ok

> As for "don't know" window type, it will require some radical changes in
> encoder and model and may be considered only as a future thing.

:/
well but do not expect me to accept that awnser for everything!

>  
> > [...]
> > > /**
> > >  * Quantize one coefficient.
> > >  * @return absolute value of the quantized coefficient
> > >  * @see 3GPP TS26.403 5.6.2 "Scalefactor determination"
> > >  */
> > > static av_always_inline int quant(float coef, const float Q)
> > > {
> > >     return av_clip((int)(pow(fabsf(coef) * Q, 0.75) + 0.4054), 0, 8191);
> > > }
> > 
> > converting float to int by casting is rather slow on x86
> > also i do not see why the cliping against 0 is done
> > 
> > and where does the 0.4054 come from? How has this value been selected?
> 
> ask 3GPP folks, in their spec (there's a reference in the comment above)
> it's also called MAGIC_NUMBER.

ideg
morons
anyway, its
1.0 - 0.5^0.75

and i seriously doubt this is optimal in the psychoacoustic sense or any
rate distortion sense.
It IS optimal in the "least squares distortion but i dont care about the bits"
sense
please add a note that this constant needs to be finetuned with listening
tests or some more solid math!

> 
> as for clipping, it seemed more logical than applying FFMIN()

speaking of cliping, can this even overflow 8191? and if so is it even
correct to clip?
most signals do not like being cliped randomly

> 
> > > 
> > > /**
> > >  * Convert coefficients to integers.
> > 
> > >  * @return sum of coefficients
> > 
> > this is not true, its the sum of their absolute values
> 
> corrected 
>  
> > [...]
> > > 
> > > /**
> > >  * constants for 3GPP AAC psychoacoustic model
> > >  * @{
> > >  */
> > > #define PSY_3GPP_C1 3.0f                    // log2(8.0)
> > > #define PSY_3GPP_C2 1.32192809488736234787f // log2(2.5)
> > > #define PSY_3GPP_C3 0.55935730170421255071f // 1 - C2/C1
> > > 
> > > #define PSY_3GPP_SPREAD_LOW  1.5f // spreading factor for ascending threshold spreading  (15 dB/Bark)
> > > #define PSY_3GPP_SPREAD_HI   3.0f // spreading factor for descending threshold spreading (30 dB/Bark)
> > > 
> > 
> > > #define PSY_3GPP_RPEMIN      0.01f
> > > #define PSY_3GPP_RPELEV      2.0f
> > 
> > RPE ?
> > please document what that means, iam no psychoacoustic guru
> > Its easier to review code when one knows what something is.
> 
> Oh, please become one :)

:)

> Do you see any comments about them? That's because it's hard to say
> anything about them. All I know is that they have some relation to
> pre-echo control.

:/

[...]
> > >     Psy3gppBand band[2][128];               ///< bands information
> > >     Psy3gppBand prev_band[2][128];          ///< bands information from the previous frame
> > 
> > no next_band? yes it may be a naive question but my naive mind would consider
> > the next to be usefull as well when the previous is.
> 
> previous frame band data is used for pre-echo control,
> while next frame band data is nor known yet (but it may
> be used in the same way indeed)

It should be rather easy to delay things by one frame to make it available
if you have an idea of what could be done with it ...

>  
> > > 
> > >     float       win_nrg[2];                 ///< sliding average of channel energy
> > 
> > nrg is not an acceptable abbreviation for energy
> 
> renamed
> 
> > [...]
> > > /**
> > >  * Tell encoder which window types to use.
> > >  * @see 3GPP TS26.403 5.4.1 "Blockswitching"
> > >  */
> > > static void psy_3gpp_window(AACPsyContext *apc, int16_t *audio, int16_t *la, int tag, int type, ChannelElement *cpe)
> > > {
> > >     int ch;
> > >     int chans = type == ID_CPE ? 2 : 1;
> > >     int i, j;
> > >     int br = apc->avctx->bit_rate / apc->avctx->channels;
> > >     int attack_ratio = (br <= 16000 + 8000*chans) ? 18 : 10;
> > >     Psy3gppContext *pctx = (Psy3gppContext*) apc->model_priv_data;
> > >     Psy3gppChannel *pch = &pctx->ch[tag];
> > >     uint8_t grouping[2];
> > >     enum WindowSequence win[2];
> > > 
> > 
> > >     if(la && !(apc->flags & PSY_MODEL_NO_SWITCH)){
> > >         float s[8], v;
> > >         for(ch = 0; ch < chans; ch++){
> > >             enum WindowSequence last_window_sequence = cpe->ch[ch].ics.window_sequence[0];
> > >             int switch_to_eight = 0;
> > >             float sum = 0.0, sum2 = 0.0;
> > >             int attack_n = 0;
> > >             for(i = 0; i < 8; i++){
> > >                 for(j = 0; j < 128; j++){
> > >                     v = iir_filter(audio[(i*128+j)*apc->avctx->channels+ch], pch->iir_state[ch]);
> > 
> > this filter can be unrolled by a factor of 2 to avoid some moves in its
> > state (like the low pass one)
> 
> that's useless. It's order-1 filter after all. 

indeed ...

>  
> > >                     sum += v*v;
> > >                 }
> > >                 s[i] = sum;
> > >                 sum2 += sum;
> > >             }
> > >             for(i = 0; i < 8; i++){
> > >                 if(s[i] > pch->win_nrg[ch] * attack_ratio){
> > >                     attack_n = i + 1;
> > >                     switch_to_eight = 1;
> > >                     break;
> > >                 }
> > >             }
> > >             pch->win_nrg[ch] = pch->win_nrg[ch]*7/8 + sum2/64;
> > > 
> > >             switch(last_window_sequence){
> > >             case ONLY_LONG_SEQUENCE:
> > >                 win[ch] = switch_to_eight ? LONG_START_SEQUENCE : ONLY_LONG_SEQUENCE;
> > >                 grouping[ch] = 0;
> > >                 break;
> > >             case LONG_START_SEQUENCE:
> > >                 win[ch] = EIGHT_SHORT_SEQUENCE;
> > >                 grouping[ch] = pch->next_grouping[ch];
> > >                 break;
> > >             case LONG_STOP_SEQUENCE:
> > >                 win[ch] = ONLY_LONG_SEQUENCE;
> > >                 grouping[ch] = 0;
> > >                 break;
> > >             case EIGHT_SHORT_SEQUENCE:
> > >                 win[ch] = switch_to_eight ? EIGHT_SHORT_SEQUENCE : LONG_STOP_SEQUENCE;
> > >                 grouping[ch] = switch_to_eight ? pch->next_grouping[ch] : 0;
> > >                 break;
> > >             }
> > >             pch->next_grouping[ch] = window_grouping[attack_n];
> > >         }
> > 
> > How much quality is lost by using this compared to RD optimal switching?
> 
> err, is there such thing as quality?

yes, you take n people let them in random order listen to 2 files and ask
about the quality ...
n can be 1 and people={kostya}

> Per my understanding, 8 short windows sequence is used for better preserving
> of sudden change of signal. And the change decision is implemented after
> the same 3GPP spec.

thats also my understanding, but i have my doubts about the actual way
that is used to detect such changes.
I suspect that  doing the windowd MDCT for both cases and
picking the one which has a lower sum of absolute coeffs would work better.
of course this would absolutely NEED a listening test as i may be dead wrong
iam just applying what i know form video coding to audio ...

[...]
> > [...]
> > > /**
> > >  * Determine scalefactors and prepare coefficients for encoding.
> > >  * @see 3GPP TS26.403 5.4 "Psychoacoustic model"
> > >  */
> > > static void psy_3gpp_process(AACPsyContext *apc, int tag, int type, ChannelElement *cpe)
> > > {
> > >     int start;
> > >     int ch, w, wg, g, g2, i;
> > >     int prev_scale;
> > >     Psy3gppContext *pctx = (Psy3gppContext*) apc->model_priv_data;
> > >     float pe_target;
> > >     int bits_avail;
> > >     int chans = type == ID_CPE ? 2 : 1;
> > >     Psy3gppChannel *pch = &pctx->ch[tag];
> > > 
> > >     //calculate energies, initial thresholds and related values - 5.4.2 "Threshold Calculation"
> > >     memset(pch->band, 0, sizeof(pch->band));
> > >     for(ch = 0; ch < chans; ch++){
> > >         start = 0;
> > >         for(w = 0; w < cpe->ch[ch].ics.num_windows; w++){
> > >             for(g = 0; g < cpe->ch[ch].ics.num_swb; g++){
> > >                 g2 = w*16 + g;
> > >                 for(i = 0; i < cpe->ch[ch].ics.swb_sizes[g]; i++)
> > >                     pch->band[ch][g2].energy +=  cpe->ch[ch].coeffs[start+i] *  cpe->ch[ch].coeffs[start+i];
> > 
> > >                 pch->band[ch][g2].energy /= 262144.0f;
> > 
> > and this factor does what?
> > besides it should be *= 1.0/...
> 

> I hope gcc will optimize division by constant.

I think it can with the correct command line options but iam not sure if it
works in practice, gcc can be kinda lame ...

> And it's used for compensate effect of input coefficients
> being 512 times larger because of throwing out division
> from windowing+MDCT operation. The same thing is done
> to form factor calculation.

ahh ok,
but why not write 512*512 then ?

> 
> [...]
> > >         //determine scalefactors - 5.6.2 "Scalefactor determination"
> > >         for(ch = 0; ch < chans; ch++){
> > >             prev_scale = -1;
> > >             for(w = 0; w < cpe->ch[ch].ics.num_windows; w++){
> > >                 for(g = 0; g < cpe->ch[ch].ics.num_swb; g++){
> > >                     g2 = w*16 + g;
> > 
> > >                     cpe->ch[ch].zeroes[w][g] = pch->band[ch][g2].thr >= pch->band[ch][g2].energy;
> > 
> > how much quality is lost compared to full RD decission ? its just a matter of
> > checking how many bits this would need which is likely negligible speed wise.
> > (assuming you can unentangle the threshold check into a distortion
> >  computation)
>  
> well, energy < threshold means resulting band will be zero anyway,
> and without that check weird values for perceptual entropy start
> to appear

iam perfectly fine with making bands that would quantize to all zero, 
"zero bands" but this again isnt optimal because a band with just a single
+1 coeff almost certainly would do better as "zero band" as well.
Also what about these noise bands? i think they are never used ...

>  
> > >                     if(cpe->ch[ch].zeroes[w][g]) continue;
> > >                     //spec gives constant for lg() but we scaled it for log2()
> > >                     cpe->ch[ch].sf_idx[w][g] = (int)(2.66667 * (log2(6.75*pch->band[ch][g2].thr) - log2(pch->band[ch][g2].ffac)));
> > 
> > 2.666... * log2(6.75*pch->band[ch][g2].thr / pch->band[ch][g2].ffac)
> > 
> > 
> > 
> > >                     if(prev_scale != -1)
> > >                         cpe->ch[ch].sf_idx[w][g] = av_clip(cpe->ch[ch].sf_idx[w][g], prev_scale - SCALE_MAX_DIFF, prev_scale + SCALE_MAX_DIFF);
> > >                     prev_scale = cpe->ch[ch].sf_idx[w][g];
> > >                 }
> > >             }
> > >         }
> > >         break;
> > >     case PSY_MODE_QUALITY:
> > >         for(ch = 0; ch < chans; ch++){
> > >             start = 0;
> > >             for(w = 0; w < cpe->ch[ch].ics.num_windows; w++){
> > >                 for(g = 0; g < cpe->ch[ch].ics.num_swb; g++){
> > >                     g2 = w*16 + g;
> > >                     if(pch->band[ch][g2].thr >= pch->band[ch][g2].energy){
> > >                         cpe->ch[ch].sf_idx[w][g] = 0;
> > >                         cpe->ch[ch].zeroes[w][g] = 1;
> > >                     }else{
> > >                         cpe->ch[ch].zeroes[w][g] = 0;
> > >                         cpe->ch[ch].sf_idx[w][g] = (int)(2.66667 * (log2(6.75*pch->band[ch][g2].thr) - log2(pch->band[ch][g2].ffac)));
> > >                         while(cpe->ch[ch].sf_idx[ch][g] > 3){
> > >                             float dist = calc_distortion(cpe->ch[ch].coeffs + start, cpe->ch[ch].ics.swb_sizes[g], SCALE_ONE_POS + cpe->ch[ch].sf_idx[ch][g]);
> > >                             if(dist < pch->band[ch][g2].thr) break;
> > >                             cpe->ch[ch].sf_idx[ch][g] -= 3;
> > >                         }
> > >                     }
> > 
> > looks rather similar to the previous cases
> 
> partially - it determines primary scalefactor index from threshold
> in the same way, what it does to thresholds before that and index
> after that differs. 

if possible please factorize the code

[...]
> > > 
> > 
> > > #ifndef CONFIG_HARDCODED_TABLES
> > >    for (i = 0; i < 316; i++)
> > >         ff_aac_pow2sf_tab[i] = pow(2, (i - 200)/4.);
> > > #endif /* CONFIG_HARDCODED_TABLES */
> > 
> > this is likely duplicated
> 
> it is not. When table is not hardcoded, it should be initialized. 

so ff_aac_pow2sf_tab is used just by the psychoacoustic model?

>  
> > [...]
> > > void ff_aac_psy_preprocess(AACPsyContext *ctx, int16_t *audio, int16_t *dest, int tag, int type)
> > > {
> > >     int chans = type == ID_CPE ? 2 : 1;
> > >     const int chstride = ctx->avctx->channels;
> > >     int i, ch;
> > >     float t[2];
> > > 
> > >     if(chans == 1){
> > 
> > >         for(ch = 0; ch < chans; ch++){
> > >             for(i = 0; i < 1024; i++){
> > >                 dest[i * chstride + ch] = audio[i * chstride + ch];
> > >             }
> > >         }
> > 
> > memcpy
> 
> no, it copies with gaps 

and chans=1 which makes the loop rather bloated

[...]
> > >             }
> > >             for(ch = 0; ch < 2; ch++)
> > >                 dest[i * chstride + ch] = av_clip_int16(t[ch]);
> > 
> > we have optimized code for converting float to int16, please use it
>  
> 
> I may have overlooked some stuff for now but that's because it's too
> hot here to think properly.

come to austra we have 13 ?C here ATM in 2 days its probably 30 again though

ill review the psy model ASAP, just wanted to reply with the comments above
first

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In a rich man's house there is no place to spit but his face.
-- Diogenes of Sinope
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080816/0a22789d/attachment.pgp>