[FFmpeg-devel] [RFC] AAC Encoder

Michael Niedermayer michaelni
Wed Aug 13 14:57:50 CEST 2008


On Wed, Aug 13, 2008 at 09:16:48AM +0300, Kostya wrote:
> On Tue, Aug 12, 2008 at 07:48:59PM +0200, Michael Niedermayer wrote:
> > On Tue, Aug 12, 2008 at 08:09:36PM +0300, Kostya wrote:
> > > On Tue, Aug 12, 2008 at 02:14:20PM +0200, Michael Niedermayer wrote:
> [...]
> > > > We have a problem here, because this isn't optimal.
> > > > It seems we agree that each bit counts the same no matter what psy says.
> > > > Maybe an example will best show the problem.
> > > > Let's assume we have a coeff of 11.5; the psy model decides that a change
> > > > to 10 would be ok for the given audio quality/bitrate and thus outputs 10.
> > > > Let us assume that storing a coefficient of 10 and one of 11 both take
> > > > 7 bits; the decision to store 10 clearly was bad. OTOH it could have
> > > > been that storing 11 requires twice as many bits, in which case the
> > > > decision would have been good. One simply cannot quantize values optimally
> > > > without considering the number of bits they need. This is even more true
> > > > for vector quantization based codecs than it is for scalar quantization.
> > > > It may very well be that psy thinks that both {-1,1} and {-2,0} are an
> > > > equally good representation of the exact {-1.5,0.5}, but it's not until
> > > > the encoding that it becomes known which of the two needs fewer bits.
> > > > 
> > > > I'd say the psy model should return an array of perceptual weights W[i]
> > > > and the bitstream encoder should choose the (global) minimum of
> > > > bits[i] + distortion(W[i], coeff[i]-stored[i])
> > > > where distortion is an appropriate function whose output matches how
> > > > audible a change is; this may be a simple W[i]*(coeff[i]-stored[i])^2,
> > > > but I am no psychoacoustic expert so there may be better choices.
> > > > 
> > > > And of course the suggested system above needs to be compared to what you
> > > > have currently so that we can be sure it really does sound better.
> > > 
> > > I understand what you mean but I suspect that is of complexity O("shaving piglets").
> > > 
> > > I followed 3GPP TS26.403 which relies on perceptual entropy which more
> > > or less corresponds to the number of bits needed to code it since it's easier.
> > > Anyway, it would be easy to implement a psy model that will consider
> > > real coding cost vs. distortion.
> > 
> > If you do not want to implement this then I will have to investigate whether
> > it is doable or not and why. Could you provide me with a more elaborate
> > explanation of where the problem is?
> 
> Current scheme (just to clarify things a bit):

> 1. encoder calls psy model functions to preprocess data

This should eventually be done in a filter prior to the encoder, but that
can wait until after it's in svn and libavfilter is there and capable of
filtering audio.


> 2. then encoder calls psy model to determine frame and window type

This is almost ok.
What the psy model should return is long_window, short_windows, or dont_know,
and in the dont_know case both should be encoded and the one with better
rate distortion chosen (distortion would be calculated by the psy model
using whatever (possibly non-trivial) method it sees fit).
How often the dont_know case is returned could be determined by some
speed/quality tradeoff option from the command line.
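
A minimal sketch of that three-way decision, in C. All names here
(WindowHint, Trial, use_short_windows) are hypothetical and not existing
encoder code, and the cost is assumed to be simply bits plus the distortion
reported by the psy model:

```c
#include <assert.h>

/* Hypothetical three-way answer from the psy model (illustrative names). */
enum WindowHint { HINT_LONG_WINDOW, HINT_SHORT_WINDOWS, HINT_DONT_KNOW };

/* Hypothetical summary of one trial encode of the frame. */
struct Trial {
    int    bits;        /* bits the trial encode needed */
    double distortion;  /* distortion as judged by the psy model */
};

/* Assumed rate distortion cost: bits plus distortion. */
static double rd_cost(const struct Trial *t)
{
    return t->bits + t->distortion;
}

/*
 * Returns 1 to use short windows, 0 to use a long window.
 * The trial results are only consulted in the dont_know case, so a real
 * encoder would only need to run both encodes when the psy model is unsure.
 */
static int use_short_windows(enum WindowHint hint,
                             const struct Trial *long_trial,
                             const struct Trial *short_trial)
{
    if (hint == HINT_LONG_WINDOW)
        return 0;
    if (hint == HINT_SHORT_WINDOWS)
        return 1;
    assert(long_trial && short_trial);
    return rd_cost(short_trial) < rd_cost(long_trial);
}
```

The speed/quality option would then only change how often the psy model
answers HINT_DONT_KNOW; the selection itself stays the same.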


> 3. based on psy model suggestions, encoder performs windowing and MDCT

ok


> 4. encoder feeds coefficients to psy model
> 5. psy model by some magic determines scalefactors and uses them to convert
> coefficients into integer form
> 6. encoder encodes obtained scalefactors and integer coefficients
> 
> There are 11 codebooks for AAC, each designed to code either pairs or quads
> of values with sign coded separately or incorporated into value,
> each has a maximum value limit.
> While it's feasible to find the best encoding (e.g. take a raw coeff, quantize
> it and round up or down, then see which vector takes fewer bits), I feel
> it would be too slow.

That's fine. You already have the fast variant implemented and I do not
suggest removing it; what we need is a high quality variant. The encoder
should be better than other encoders ...
Also, the max value you mentioned is another example of where your code
fails fatally: a single +3 that would sound nearly as good when encoded as +2
could force a less efficient codebook to be chosen. Also, the +3 could be
encoded as a pulse; I don't remember whether your code optimally chooses
between pulse and normal codebook encodings?
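
A rough sketch of what the high quality variant could do for one pair of
values, in C. It tries rounding each value down and up and keeps the
candidate with the smallest bits + W*err^2. Note that toy_pair_bits() is a
made-up stand-in for a real codebook lookup (it is NOT the AAC tables), and
rd_quantize_pair() and the weights w[] are hypothetical names:

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>

/* Toy bit cost for a pair: larger magnitudes cost more. NOT real AAC data. */
static int toy_pair_bits(int a, int b)
{
    return 2 + 3 * (abs(a) + abs(b));
}

/* Largest integer not greater than x (avoids needing libm). */
static int floor_int(double x)
{
    int f = (int)x;
    if (x < f)
        f--;
    return f;
}

/*
 * Try all 4 round-down/round-up combinations for a pair and keep the one
 * minimizing bits + sum of w[k] * (coeff[k] - q[k])^2.
 * Writes the chosen integers to best[] and returns the minimal cost.
 */
static double rd_quantize_pair(const double coeff[2], const double w[2],
                               int best[2])
{
    double best_cost = DBL_MAX;

    for (int i = 0; i < 4; i++) {
        int q[2];
        double cost;

        q[0] = floor_int(coeff[0]) + (i & 1);
        q[1] = floor_int(coeff[1]) + ((i >> 1) & 1);
        cost = toy_pair_bits(q[0], q[1]);
        for (int k = 0; k < 2; k++) {
            double err = coeff[k] - q[k];
            cost += w[k] * err * err;
        }
        if (cost < best_cost) {
            best_cost = cost;
            best[0]   = q[0];
            best[1]   = q[1];
        }
    }
    return best_cost;
}
```

With this toy cost, the pair {2.9, 0} is stored as {2, 0} when both weights
are 0.1 and as {3, 0} when both weights are 100, which is the +3 versus +2
tradeoff described above. A real implementation would also have to search
over codebooks and pulse coding, which this sketch ignores.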

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Dictatorship naturally arises out of democracy, and the most aggravated
form of tyranny and slavery out of the most extreme liberty. -- Plato