[FFmpeg-devel] Question about edit-list+fragment explanation from "[PATCH] mov.c: read fragment start dts from fragmented mp4"

Yusuke Nakamura muken.the.vfrmaniac at gmail.com
Sun Oct 26 21:07:00 CET 2014


2014-10-26 9:36 GMT+09:00 Bryan Huh <bryan at box.com>:

> Sorry I am not posting to the same thread directly, I joined the list after
> this discussion finished a few weeks ago.
>
> I was reading Yusuke's explanation (pasted below) regarding the
> relationship between PTS, CTS, and DTS in the context of fragments, and
> there are some parts that I don't understand or don't agree with (I am also
> a little new to these concepts).
>
> The part where I get confused is:
>
> > The presentation starts with CTS=2002 because of media_time of the second
> > edit, so the PTS of the sync sample corresponds to X - 1001.
> > From this, X is equal to 49001, and the DTS of trun.sample[0] is equal
> > to X - trun.sample[0].sample_duration = 49001 - 1001 = 48000.
>
> It looks like X is supposed to represent the DTS of the sync-sample. I see
> that (1001 + X) - 2002 = (X - 1001) represents the amount of time, in
> mdhd.timescale, into the presentation of actual media. Earlier, we had
> computed that the 48000 'time' from tfra corresponded to 600 into the
> presentation of actual media, in mvhd.timescale. So the way I would compute
> X would be:
>
> 600*mdhd.timescale/mvhd.timescale = X-1001
>
> So X = 600*24000/600 + 1001 = 25001
>
> Which does not agree with the original calculation of X = 49001. Do I have
> a misunderstanding?
>
> Thanks,
> Bryan
>

You're right.
Apparently I made a mistake in the computation while writing the
explanation.


>
> > Would you mind explaining how edit lists and fragment times are
> > supposed to work together?
>
> The tfra box is designed so that the player can seek and find a sync sample
> on the presentation timeline, i.e. by PTS in units of mdhd.timescale.
> PTS comes from CTS via edit list, and CTS comes from DTS. So, basically you
> can't get DTS directly from the 'time' field in the tfra box.
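As a rough illustration of the chain above (DTS to CTS via the trun composition offset, then CTS to PTS via the edit list), here is a minimal Python sketch. It assumes one optional empty edit followed by a single edit with media_rate=1, and it ignores where that edit's segment_duration ends; the helper names are hypothetical and are not FFmpeg functions. The numbers are those of the example below, using the corrected X = 25001 from this reply.

```python
from fractions import Fraction


def dts_to_cts(dts, ct_offset):
    """CTS = DTS + sample_composition_time_offset (from the trun)."""
    return dts + ct_offset


def cts_to_pts(cts, media_time, empty_duration_mvhd, mvhd_timescale, mdhd_timescale):
    """Hypothetical helper: map a CTS to a PTS, both in mdhd.timescale units,
    through an edit list made of one empty edit (duration in mvhd.timescale)
    followed by one edit starting at media_time with media_rate = 1.
    A CTS earlier than media_time would give a PTS before the edit and is
    not actually presented; that case is not handled here."""
    # The empty edit delays the presentation by its duration, converted
    # exactly from mvhd.timescale to mdhd.timescale.
    empty_mdhd = Fraction(empty_duration_mvhd * mdhd_timescale, mvhd_timescale)
    return cts - media_time + empty_mdhd


# Sync sample trun.sample[1]: DTS X = 25001, composition offset 1001.
X = 25001
cts = dts_to_cts(X, 1001)
pts = cts_to_pts(cts, media_time=2002, empty_duration_mvhd=600,
                 mvhd_timescale=600, mdhd_timescale=24000)
print(cts, pts)  # 26002 48000
```

The resulting PTS, 48000 in mdhd.timescale, equals the 'time' of the tfra entry, which is the property the tfra lookup relies on.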
>
> Let's say mvhd.timescale=600, mdhd.timescale=24000 and the edit list
> contains two edits (edit[0] and edit[1]),
>   edit[0] = {segment_duration=600, media_time=-1, media_rate=1}; // empty edit
>   edit[1] = {segment_duration=1200, media_time=2002, media_rate=1};
> and the track fragment run in the track fragment that you locate through an
> entry in the tfra box whose 'time' is equal to 48000 is as follows.
>   trun.sample[0].sample_is_non_sync_sample = 1
>   trun.sample[0].sample_duration=1001
>   trun.sample[0].sample_composition_time_offset=1001
>   trun.sample[1].sample_is_non_sync_sample = 0
>   trun.sample[1].sample_duration=1001
>   trun.sample[1].sample_composition_time_offset=1001
> Then, time/mdhd.timescale*mvhd.timescale=1200, that is, the PTS of the sync
> sample is equal to 1200 in mvhd.timescale.
>
> And, the first edit is an empty edit, so the presentation of actual media
> starts with 1200 - 600 = 600 in mvhd.timescale.
>

Also, the last sentence is somewhat wrong and/or incomplete. It should read
as follows.

"And, the first edit is an empty edit, so the presentation of actual media
starts at 600 in mvhd.timescale. Thus the sync sample is presented
1200 - 600 = 600 (in mvhd.timescale) after the actual media starts."
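The two steps above (converting the tfra 'time' to mvhd.timescale, then removing the empty edit) can be sketched in Python as follows. Exact fractions are used so that the conversion never rounds; the constants and the rescale() helper are made up for illustration. In FFmpeg itself this kind of conversion is typically done with av_rescale_q().

```python
from fractions import Fraction

MVHD_TIMESCALE = 600    # movie timescale (mvhd), from the example
MDHD_TIMESCALE = 24000  # media timescale (mdhd), from the example


def rescale(value, from_timescale, to_timescale):
    """Hypothetical helper: exact conversion between timescales."""
    return Fraction(value * to_timescale, from_timescale)


tfra_time = 48000  # 'time' of the tfra entry, in mdhd.timescale
t_prime = rescale(tfra_time, MDHD_TIMESCALE, MVHD_TIMESCALE)  # PTS in mvhd units
empty_edit = 600   # edit[0].segment_duration, in mvhd.timescale
into_media = t_prime - empty_edit  # offset of the sync sample into actual media
print(t_prime, into_media)  # 1200 600
```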


> The CTS of the second sample, trun.sample[1], is equal to 1001 + X, where
> X is the sum of the durations of all preceding samples.
> The presentation starts with CTS=2002 because of media_time of the second
> edit, so the PTS of the sync sample corresponds to X - 1001.
> From this, X is equal to 49001, and the DTS of trun.sample[0] is equal to
> X - trun.sample[0].sample_duration = 49001 - 1001 = 48000.
>
> |<--edit[0]-->|<---------edit[1]--------->|
> |<-------------|-------------|-------------|---->presentation timeline
> 0              D             T'
>                |-------------|------------------>composition timeline
>            media_time        T
>          |-----|-------|-----|------------------>decode timeline
>          0 media_time  X     T
>                        |<--->|
>                       ct_offset
>
> D = edit[0].segment_duration = 600
> T' = time/mdhd.timescale*mvhd.timescale = 1200
> media_time = edit[1].media_time = 1001
> ct_offset = trun.sample[1].sample_composition_time_offset = 1001
> T=ct_offset + X
> T-media_time = (T'-D)*mdhd.timescale/mvhd.timescale
>

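From this figure and these equations (with media_time = edit[1].media_time
= 2002; the 1001 given for media_time in the list above is a typo),
X = (T' - D)*mdhd.timescale/mvhd.timescale + media_time - ct_offset
  = T'*mdhd.timescale/mvhd.timescale - D*mdhd.timescale/mvhd.timescale + media_time - ct_offset
  = 48000 - 600*24000/600 + 2002 - 1001
  = 25001
and therefore the DTS of trun.sample[0] is
X - trun.sample[0].sample_duration = 25001 - 1001 = 24000.
(Note that T' need not be computed directly here, since the 'time' of the
tfra box is equal to T'*mdhd.timescale/mvhd.timescale.)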
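The corrected derivation can be checked with a small Python sketch that solves the equation for X and then steps back one sample duration to get the DTS of trun.sample[0]. The function name and parameters are hypothetical; it assumes the same layout as the example (one empty edit, then one edit with media_rate=1).

```python
from fractions import Fraction


def solve_sync_sample_dts(tfra_time, empty_duration, media_time, ct_offset,
                          mvhd_timescale, mdhd_timescale):
    """Hypothetical helper: return X, the DTS of the sync sample, from
    X = (T' - D)*mdhd/mvhd + media_time - ct_offset,
    where tfra_time = T'*mdhd/mvhd is already in mdhd.timescale."""
    d_mdhd = Fraction(empty_duration * mdhd_timescale, mvhd_timescale)
    x = tfra_time - d_mdhd + media_time - ct_offset
    if x.denominator != 1:
        raise ValueError("X is not a whole number of mdhd.timescale ticks")
    return int(x)


x = solve_sync_sample_dts(tfra_time=48000, empty_duration=600,
                          media_time=2002, ct_offset=1001,
                          mvhd_timescale=600, mdhd_timescale=24000)
dts_sample0 = x - 1001  # minus trun.sample[0].sample_duration
print(x, dts_sample0)  # 25001 24000
```

Going forward again, the CTS of the sync sample is X + ct_offset = 26002, which is 24000 ticks (one second at mdhd.timescale=24000) past media_time = 2002, i.e. 600 in mvhd.timescale, as required.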

Thanks.

