[FFmpeg-devel] audio gop_size

Tue Aug 24 23:59:05 CEST 2010

Michael Niedermayer wrote:

> On Sun, Aug 22, 2010 at 04:40:54PM -0400, Justin Ruggles wrote:
>> Hi,
>>
>> There are several audio codecs (such as MLP, ALS, and Speex) which
>> utilize the concept of groups of frames.  For decoding, it is simple
>> enough to either decode the whole group at once or decode
>> frame-by-frame, but for encoding there are other issues.  This is how
>> Thilo and I have implemented it in the ALS encoder we've been working
>> on, and I wanted to run it by the list to make sure we're on the right
>> track.
>>
>> 1) Make gop_size an audio option as well.
>>
>> alternative: The downside is that it defaults to 12, which may not be an
>> appropriate default for some audio codecs.  Would changing the gop_size
>> default have API implications?  Should we instead add another field for
>> audio group of frames size that would default to 0 or -1?
>>
>> 2) Use an internal buffer in the encoder to store each encoded frame
>> until it has enough for a group, then set coded_frame->pts appropriately
>> and output the whole group.  The reason that the whole group needs to be
>> encoded at once is because the container packet should always contain a
>> whole group (at least this is the case for ALS and Speex).
> 
> if the container needs a "GOP" per packet and the encoder outputs a whole gop
> at a time then this is semantically different
> from video gops and it appears more an encoder internal structuring like mp3
> or aac using short instead of long windows

Well, it can be handled completely internally, but then there are some
other issues.  I think the best way to describe this is to give specific
examples.

MPEG-4 ALS has a concept of random access (RA) units, where an RA frame
is the first frame in an RA unit.  RA frames do not rely on any samples
from the previous frame in the linear prediction, while the remaining frames
in the RA unit do rely on previous samples.  One weird situation is that
there is such a thing as a stream with no RA frames, in which case the whole
file is a single packet (the first frame pretends there was a previous
frame with all zero samples).

Demuxing : A single MP4 packet contains the whole RA unit, and there is
no way to parse out single frames.

Decoding : The decoder only decodes a single frame at a time, requiring
multiple decoding calls for each input packet.  This is fine according
to our audio API.
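
To illustrate the calling pattern only, here is a rough sketch of a caller
draining one packet with repeated decode calls.  toy_decode_frame() and
decode_packet() are made-up names, and the length-prefixed "frame" layout is
purely an assumption so that the example runs; it is not the ALS bitstream
and not the libavcodec API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for a codec's decode call (hypothetical).  A "frame" here is
 * one length byte followed by that many payload bytes.  Returns the number
 * of bytes consumed, or -1 if the input is truncated.
 */
static int toy_decode_frame(const unsigned char *data, size_t size)
{
    assert(data);
    if (size < 1 || (size_t)data[0] + 1 > size)
        return -1;
    return data[0] + 1;
}

/*
 * Decode every frame contained in one packet by calling the decoder until
 * the packet is used up.  Returns the number of frames, or -1 on error.
 */
static int decode_packet(const unsigned char *pkt, size_t size)
{
    int frames = 0;

    while (size > 0) {
        int used = toy_decode_frame(pkt, size);
        if (used < 0)
            return -1;
        pkt  += used;          /* advance past the frame just decoded */
        size -= (size_t)used;
        frames++;
    }
    return frames;
}
```

The point is only that the caller, not the demuxer, discovers the frame
boundaries, by asking the decoder how much input each call consumed.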

Encoding : The encoder-side equivalent of what the decoder does would be
like #2 in my original email.  Encode a single frame at a time and
buffer the frames internally until a whole RA unit is done, then output
the whole thing at once.  But for the encoder, we need a way for the
user to specify how many frames to put in an RA unit.  If gop_size is
not really a logical equivalent I guess we need a new field.
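
Roughly, the buffering I have in mind looks like the sketch below.  This is
not the code from our ALS encoder; RAUnitBuffer, ra_buffer_init(),
ra_buffer_push() and the fixed buffer size are all hypothetical names and
values chosen for illustration, and frames_per_unit stands in for whatever
user-settable field we end up with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RA_BUF_CAPACITY 4096  /* arbitrary size for this sketch */

/* Hypothetical encoder-private state; not an FFmpeg structure. */
typedef struct RAUnitBuffer {
    unsigned char data[RA_BUF_CAPACITY];
    size_t        size;            /* bytes buffered so far            */
    int           frames;          /* frames buffered so far           */
    int           frames_per_unit; /* user-supplied RA unit length     */
} RAUnitBuffer;

static void ra_buffer_init(RAUnitBuffer *b, int frames_per_unit)
{
    assert(frames_per_unit > 0);
    memset(b, 0, sizeof(*b));
    b->frames_per_unit = frames_per_unit;
}

/*
 * Append one encoded frame.  Returns 0 while the RA unit is incomplete
 * (no packet is output), the packet size in bytes once a whole unit has
 * been copied to out, or -1 if either buffer is too small.
 */
static int ra_buffer_push(RAUnitBuffer *b,
                          const unsigned char *frame, size_t frame_size,
                          unsigned char *out, size_t out_cap)
{
    int completes_unit = (b->frames + 1 == b->frames_per_unit);
    size_t total;

    if (frame_size > RA_BUF_CAPACITY - b->size)
        return -1;
    if (completes_unit && b->size + frame_size > out_cap)
        return -1;

    memcpy(b->data + b->size, frame, frame_size);
    b->size += frame_size;
    b->frames++;

    if (!completes_unit)
        return 0;

    /* Whole RA unit is ready: emit it as one packet and start over. */
    total = b->size;
    memcpy(out, b->data, total);
    b->size   = 0;
    b->frames = 0;
    return (int)total;
}
```

With frames_per_unit set to 3, the first two calls return 0 and the third
returns the size of the complete RA unit.  Setting coded_frame->pts for the
output packet is left out here.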

Muxing : If the demuxer and encoder both output full RA units then there
is no issue here as far as muxing is concerned.

Speex is a similar situation to ALS, but has some other oddities due to
individual frames not being byte-aligned.  In this case, the encoder
could just as easily encode a whole packet at once.  But the user still
needs to be able to specify how many frames to put in a packet.
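
To show why non-byte-aligned frames make per-frame buffering awkward, here
is a small sketch of writing several frames back to back into one packet.
BitPacker, bp_init(), bp_put() and bp_finish() are hypothetical helpers,
not libspeex or libavcodec functions, and padding the last byte with zero
bits is an assumption of this example rather than what the Speex bitstream
actually requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical MSB-first bit writer over a caller-supplied buffer. */
typedef struct BitPacker {
    unsigned char *buf;
    size_t         cap;     /* buffer capacity in bytes      */
    size_t         bit_pos; /* index of the next bit to write */
} BitPacker;

static void bp_init(BitPacker *p, unsigned char *buf, size_t cap)
{
    assert(buf && cap > 0);
    memset(buf, 0, cap);    /* unwritten bits stay zero */
    p->buf     = buf;
    p->cap     = cap;
    p->bit_pos = 0;
}

/*
 * Append the low nbits bits of value, most significant bit first.
 * Returns 0 on success, -1 if nbits is out of range or the buffer is full.
 */
static int bp_put(BitPacker *p, uint32_t value, int nbits)
{
    int i;

    if (nbits < 0 || nbits > 32 || p->bit_pos + (size_t)nbits > p->cap * 8)
        return -1;
    for (i = nbits - 1; i >= 0; i--) {
        if ((value >> i) & 1u)
            p->buf[p->bit_pos >> 3] |= (unsigned char)(0x80u >> (p->bit_pos & 7));
        p->bit_pos++;
    }
    return 0;
}

/* Packet size in bytes, rounding the final partial byte up. */
static size_t bp_finish(const BitPacker *p)
{
    return (p->bit_pos + 7) / 8;
}
```

Two 5-bit "frames" written this way occupy 10 bits, so the second frame
starts in the middle of the first byte and only the packet as a whole ends
on a byte boundary.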

So, I think both situations can be handled slightly differently and
still work correctly, but a new field is needed.  Something like
audio_frames_per_packet or similar?

If that sounds ok I can send a patch.

Cheers,
Justin