[FFmpeg-devel] Microsoft Smooth Streaming

Wed Oct 26 13:20:19 CEST 2011

On Tue, Oct 25, 2011 at 4:25 PM, Marcus Nascimento <marcus.cps at gmail.com> wrote:
> Please check the answers below.
>
> Thank you very much in advance.
>
>
> On Tue, Oct 25, 2011 at 3:54 PM, Nicolas George <
> nicolas.george at normalesup.org> wrote:
>
>> Le quartidi 4 brumaire, an CCXX, Marcus Nascimento a écrit :
>> > I'd like to extend FFMpeg to support Microsoft Smooth Streaming (streaming
>> > playback), the same way it has been done by all the available Silverlight
>> > players.
>>
>> Contributions are always welcome on principle.
>>
>> > For now I do not intend to dump data to a file to be played locally or
>> > anything like that, and I probably never will. I just want to play it.
>>
>> If it can play it, then it can also dump it to a file. I hope you were not
>> counting otherwise.
>>
>>
> Definitely not. I was only worried about legal issues. I don't want to cause
> trouble for FFMpeg or anything like that.
>
>
>> > I did some research on this mailing list and found some posts that talked
>> > about this before.
>> > However, I couldn't find in-depth information or anything beyond the point
>> > where I'm stuck.
>> >
>> > I've done a lot of research on MS Smooth Streaming theory of operation,
>> > studied some ISOFF (and PIFF) and some more.
>> > It is pretty clear to me how MS Smooth Streaming works. Now it is time to
>> > focus on how to do that in the FFMpeg way.
>> >
>> > First things first, I'd like to know how a stream should be processed in
>> > order to be played by FFMpeg.
>>
>> I believe you would receive more relevant replies faster if you took a few
>> minutes to describe an overview of how the protocol works.
>>
>>
> Right away. I'll give as many details as necessary here. Prepare yourself
> for some reading!
>
> First of all, the basic idea of Microsoft Smooth Streaming is to encode the
> same video at multiple bitrates. The client decides which bitrate to use, and
> at any time it can switch to another bitrate based on bandwidth
> availability and other measurements.
> Each encoding bitrate produces an independent ISMV file (IIS Smooth
> Media Video, I suppose).
> The encoding relies on the fragmented structure that ISOFF (ISO
> File Format, the MP4 file format) offers. Keyframes are generated regularly
> and equally spaced in all ISMV files (every 2 s).
> This is more restrictive than regular encoding procedures, which allow some
> flexibility in keyframe intervals (I believe so, though I'm not a specialist
> on that).
> It is important to note that every fragment starts with a keyframe.
> Each ISOFF fragment is perfectly aligned across the different bitrates (in
> terms of time, of course; data size may vary drastically). That alignment
> allows the client to request one bitrate for one fragment and switch
> to another bitrate for the next fragment.
>
> The ISMV file format is called PIFF and is based on ISOFF with a few
> additions. There are 3 uuid box types dedicated to DRM purposes (I
> won't touch them here), hence the name PIFF: Protected Interoperable
> File Format. The PIFF brand (ftyp box value) is "piff".
> More on PIFF format here: http://go.microsoft.com/?linkid=9682897
>
> The server side (in the MS implementation) is just an extension to IIS
> called IIS Media Services.
> It is just a web service that accepts HTTP requests with a custom
> formatted URL.
> The base URL is something like http://domain.com/video.ism (note that it is
> .ism, not .ismv), which is never requested directly.
>
> When the client wants to play a video, it requests the Manifest
> file. The URL is <baseUrl>/Manifest.
> The Manifest is just an XML file that provides information about the
> different streams, among other things.
> Here is a basic example (modified parts of the original found here:
> http://playready.directtaps.net/smoothstreaming/SSWSS720H264/SuperSpeedway_720.ism/Manifest
> ):
>
> <SmoothStreamingMedia MajorVersion="2" MinorVersion="1"
> Duration="1209510000">
> <StreamIndex Type="video" Name="video" Chunks="4" QualityLevels="2"
> MaxWidth="1280" MaxHeight="720" DisplayWidth="1280" DisplayHeight="720"
> Url="QualityLevels({bitrate})/Fragments(video={start time})">
> <QualityLevel Index="0" Bitrate="2962000" FourCC="H264" MaxWidth="1280"
> MaxHeight="720"
> CodecPrivateData="000000016764001FAC2CA5014016EFFC100010014808080A000007D200017700C100005A648000B4C9FE31C6080002D3240005A64FF18E1DA12251600000000168E9093525"/>
> <QualityLevel Index="1" Bitrate="2056000" FourCC="H264" MaxWidth="992"
> MaxHeight="560"
> CodecPrivateData="000000016764001FAC2CA503E047BFF040003FC52020202800001F480005DC03030003EBE8000FAFAFE31C6060007D7D0001F5F5FC6387684894580000000168E9093525"/>
> <c d="20020000"/>
> <c d="20020000"/>
> <c d="20020000"/>
> <c d="6670001"/>
> </StreamIndex>
> <StreamIndex Type="audio" Index="0" Name="audio" Chunks="4"
> QualityLevels="1" Url="QualityLevels({bitrate})/Fragments(audio={start
> time})">
> <QualityLevel FourCC="AACL" Bitrate="128000" SamplingRate="44100"
> Channels="2" BitsPerSample="16" PacketSize="4" AudioTag="255"
> CodecPrivateData="1210"/>
> <c d="20201360"/>
> <c d="19969161"/>
> <c d="19969161"/>
> <c d="8126985"/>
> </StreamIndex>
> </SmoothStreamingMedia>
>
> We can see it gives the version of the Smooth Streaming media and the
> duration (measured in units of 1/10,000,000 of a second, i.e. 100 ns).
> Next we see the video section, which says each quality level has 4 chunks
> (fragments), with 2 quality levels available. It also gives the video
> dimensions and the URL format.
> Next it gives information about each bitrate, with codec information and
> codec private data (I believe it is used to configure the codec in an opaque
> way).
> Next it lists each fragment's duration (the d attribute, in the same 100 ns
> units; 20020000 is about 2 s, matching the keyframe interval). The first
> fragment starts at time 0 (zero), and each later fragment starts at the sum
> of the previous fragments' durations.
> Next we have the same structure for the audio description.
>
> After getting the Manifest file, the client must decide which quality level
> is best suited for the device and its resources.
> It is not clear to me which parameters it bases its decisions on. I have
> heard about screen size and resolution, computing power, download
> bandwidth, etc.
> As soon as the quality level is chosen, I suppose the decoder has to be
> configured accordingly, using the CodecPrivateData information
> provided.
> The client then starts requesting fragments following the URL pattern
> given in the Manifest, with {bitrate} replaced by the quality level's
> Bitrate attribute and {start time} by the fragment's start time.
> To request the first fragment of the first quality level, it would use
> <baseUrl>/QualityLevels(2962000)/Fragments(video=0).
> To request the fourth fragment of the second quality level, it would use
> <baseUrl>/QualityLevels(2056000)/Fragments(video=60060000).
> It is also possible to request just the audio following the same idea. For
> instance: <baseUrl>/QualityLevels(128000)/Fragments(audio=20201360).
>
> Each fragment received is in PIFF wire format. In other words, it
> contains exactly one moof box and exactly one mdat box and nothing
> more (check the MP4 specs for more info).
> Of course, those boxes contain child boxes where applicable. A fragment may
> also contain custom uuid boxes designed to allow DRM protection. Let's not
> consider them here.
> I'm not sure which information I can get from the moof boxes, but I assume
> it would be relevant to the demuxer only, since the codec would only work
> on the opaque data contained in the mdat. Correct me if I'm wrong, please.
>
> The client would apply some heuristics while requesting fragments, and
> at some point it may decide to switch to another quality level. I suppose it
> would then have to reconfigure the decoder, repeating this process until the
> end of the stream.
>
> I'm not sure how a decoder works, but I believe there is a way to configure
> it to receive future "injected" data.
>
> If you got all the way here, I really thank you!
> I wonder how to fit all this into the ffmpeg structure.
> If anyone can point me in some direction, I'd be very thankful.
> There are still a few comments below...
>
>
>> For the rest, I am just shooting in the dark, as I know nothing of the
>> protocol.
>>
>> > I see two possible scenarios:
>> >
>> > 1 - External code makes all the HTTP requests to obtain the manifest XML
>> > file and uses it to configure the decoder. Then it makes further HTTP
>> > requests to obtain fragments that will be parsed by the demuxer (probably
>> > a custom one based on the ISOM one already available).
>>
>> This looks like the manifest XML file has a role similar to the SDP file
>> with RTP streams. You could look at how that works to see if that suits
>> you.
>>
>>
> I'm not that familiar with RTP, but from what I've read in the past few
> minutes it sounds similar.
>
>
>> > 2 - Very simple external code just asks FFMpeg to play a Smooth Streaming
>> > media. FFMpeg detects that this is HTTP-based media and requests its
>> > manifest file (I believe I'd have to create a custom HTTP-based solution
>> > for that). Once the manifest is available, ffmpeg configures the decoder,
>> > then makes further HTTP requests the same way as in 1.
>>
>> There is already HTTP client code, as surely you know.
>>
>>
> Yes. I've seen something about it. It looks suitable for this case.
> It may be my starting point for studying. But I still feel I need
> the big picture of how ffmpeg works in general.
>
>
>> Regards,
>>

I think a close match for this would be the RTMP support in FFmpeg. The
complexity of negotiating with the server is handled by an external
library, librtmp (when FFmpeg is compiled with librtmp support enabled),
which feeds data in chunks of a particular size to be decoded. The RTMP
protocol supports negotiating a bandwidth as part of its handshake between
the server and the client (based on network load), and servers usually
serve three different quality levels, but they are not as scalable as what
you described (that would be H.264 SVC, AFAIK). Negotiations are handled
within librtmp; FFmpeg's wrapper around this external library is in
libavformat/librtmp.c, and the resulting FLV data is demuxed by
libavformat/flvdec.c.