[FFmpeg-devel] Discuss mp4 fragmented: TFDT::BaseMediaDecodeTime. TRUN::SampleDuration

Mon Nov 3 03:16:58 CET 2014

*Hi ffmpeg-devel,

I’m sending this mail to encourage some discussion about the ISO BMFF
specification, to find out what people think about the problems described
below, and to learn whether anybody else has seen them.

I’m a software engineer and I write an MP4 muxer/demuxer (it’s not FFmpeg
code) which is used to generate DASH streams (fragmented MP4 container).
Some players reported playback issues, so I had to apply some controversial
patches to the muxer to make these players accept the streams it generates.
My current understanding is that these issues happen because people read the
ISO BMFF specification differently, so they expect different behavior.

Below I describe these issues, and I would be glad to hear your opinion
about them. The latest spec I have on hand is ISO/IEC 14496-12:2012, Fourth
edition 2012-07-15, Corrected version 2012-09-15.

TFDT::BaseMediaDecodeTime

The ISO BMFF standard defines TFDT::BaseMediaDecodeTime as an integer equal
to the sum of the decode durations of all earlier samples in the media:

8.8.12 Track fragment decode time

The Track Fragment Base Media Decode Time Box provides the absolute
decode time, measured on the media timeline, of the first sample in decode
order in the track fragment. This can be useful, for example, when
performing random access in a file; it is not necessary to sum the sample
durations of all preceding samples in previous fragments to find this value
(where the sample durations are the deltas in the Decoding Time to Sample
Box and the sample_durations in the preceding track
runs).

...

baseMediaDecodeTime is an integer equal to the sum of the decode
durations of all earlier samples in the media, expressed in the media's
timescale. It does not include the samples added in the enclosing track
fragment.

The player side claims that TFDT::BaseMediaDecodeTime must be strictly
equal to the sum of all preceding sample decode durations. Most likely the
player side is absolutely right here, because the specification just says
“baseMediaDecodeTime is an integer equal to the sum of the decode durations
of all earlier samples in the media”.

Unfortunately, following the spec strictly makes the transcoding/muxing
process much more complicated and flaky.
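
To make the strict reading concrete, here is a minimal sketch in Python (it
is not FFmpeg code; the function name and the list-of-durations layout are
assumptions made only for illustration). It computes the baseMediaDecodeTime
of each fragment as the running sum of the sample durations of all earlier
fragments, expressed in the media timescale:

```python
from itertools import accumulate


def base_media_decode_times(fragments):
    """Return the tfdt baseMediaDecodeTime for each fragment.

    `fragments` is a list of fragments; each fragment is a list of
    per-sample decode durations in the media timescale. This layout is
    hypothetical, chosen only for the example.
    """
    totals = [sum(durations) for durations in fragments]
    # Running sum of all earlier fragments: the first fragment starts at 0,
    # and a fragment's own samples are never included in its value.
    return list(accumulate(totals, initial=0))[:-1]


# Three fragments at timescale 90000 and 30 fps (3000 ticks per sample).
fragments = [[3000] * 4, [3000] * 4, [3000] * 2]
print(base_media_decode_times(fragments))  # [0, 12000, 24000]
```

Under this reading, any fragment whose first sample arrives with a DTS that
differs from the computed value forces the muxer to adjust a duration or
drop a sample somewhere.
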
Modern transcoding engines (for instance, YouTube and Vimeo) split video
into chunks and transcode them in parallel. Parallel transcoding and frame
rate conversion cause DTS/PTS fluctuation, so it becomes non-trivial to
keep TFDT::BaseMediaDecodeTime consistent, and in most cases sample duration
correction or sample dropping is required to follow this rule strictly. The
most complicated thing is that in order to perform this correction for the
current fragment (MOOF+MDAT pair), we have to know the DTS of the first
sample of the next fragment. We cannot really get this information because
of parallel processing, so we try to guess this value. We can do it for
constant frame rate, but for variable frame rate it’s a real issue. Frankly,
the current solution is flaky.

The current TFDT::BaseMediaDecodeTime requirement also doesn’t address frame
dropping and stream errors which may happen during live
streaming/transcoding.

What do you think about TFDT::BaseMediaDecodeTime?

TRUN::SampleDuration

Media Source Extensions
<http://www.w3.org/TR/media-source/> (MSE) specification authors think that
TRUN::duration in the ISO BMFF spec is the "sample duration". In other words,
TRUN::duration[n] = PTS[n+1] - PTS[n]. I have always thought that
TRUN::duration is calculated as DTS[n+1] - DTS[n].

In most cases delta DTS is equal to delta PTS, but because of timescale
conversion rounding and the DTS/PTS fluctuations caused by parallel
processing and frame rate conversion, it’s not true all the time. This
mismatch causes holes in the MSE playback timeline. Holes cause a poor user
experience.

I went through the ISO spec and found that it is not clear on this point. It
explicitly says that STTS entries are decoding deltas:

8.6.1.1 Time to Sample Boxes

The composition times (CT)
and decoding times (DT) of samples are derived from the Time to Sample
Boxes, of which there are two types. The decoding time is defined in the
Decoding Time to Sample Box, giving time deltas between successive decoding
times.

8.6.1.2 Decoding Time to Sample Box

The Decoding Time to Sample Box contains decode time delta's:
DT(n+1) = DT(n) + STTS(n) where STTS(n) is the (uncompressed) table entry
for sample n.

As you can see, the ISO spec is very clear about regular MP4 files.
Unfortunately, it's not so clear about fragmented MP4 and the TRUN atom:

8.8.8 Track Fragment Run Box

...

The following flags are defined:

...

0x000100 sample-duration-present: indicates that each sample has its own
duration, otherwise the default is used.

...

I reviewed the FFmpeg MOV muxer code and it calculates TRUN::SampleDuration
as the DTS delta:

http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavformat/movenc.c;h=a43752a01173eb8a37fb459f8325d516daf2e74a;hb=HEAD#l860

What do you think about TRUN::SampleDuration?*
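
For reference, the two readings of the duration field can be compared with a
small sketch (Python, not FFmpeg code; the (dts, pts) tuples, the timestamps
and the function names are assumptions made only for illustration). It
computes the durations both ways and then shows what happens if a player
places each sample at its PTS and lets it last for its DTS delta, which is
one possible model of the mismatch and not a statement about how any
particular player is implemented:

```python
def dts_deltas(samples):
    """Durations as DTS[n+1] - DTS[n], samples given in decode order."""
    dts = [s[0] for s in samples]
    return [b - a for a, b in zip(dts, dts[1:])]


def pts_deltas(samples):
    """Durations as PTS[n+1] - PTS[n], taken in presentation order."""
    pts = sorted(s[1] for s in samples)
    return [b - a for a, b in zip(pts, pts[1:])]


def timeline_gaps(samples):
    """Gap (+) or overlap (-) after each sample when sample n is placed at
    PTS[n] and lasts DTS[n+1] - DTS[n]. Assumes no frame reordering."""
    durations = dts_deltas(samples)
    pts = [s[1] for s in samples]
    return [pts[n + 1] - (pts[n] + durations[n]) for n in range(len(durations))]


# Hypothetical (dts, pts) pairs in a 1000-tick timescale; the third sample's
# PTS was rounded differently from its DTS.
samples = [(0, 0), (33, 33), (67, 66), (100, 100)]
print(dts_deltas(samples))     # [33, 34, 33]
print(pts_deltas(samples))     # [33, 33, 34]
print(timeline_gaps(samples))  # [0, -1, 1]
```

The positive entry in the last list is a one-tick hole of the kind described
above; the negative entry is an overlap.
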
Thank you