[FFmpeg-devel] [RFC] AST subtitles

Clément Bœsch ubitux at gmail.com
Sat Nov 24 22:08:33 CET 2012

Hi there,

I wrote a new prototype for storing text subtitles, replacing the custom
ASS line we currently use. It tries to be flexible enough to deal with any
kind of text subtitle markup, while staying as simple as possible for our
users, and also for decoders and encoders.

Of course, we will have to deal with backward compatibility. The simplest
approach I found was to introduce a new AVSubtitleType (SUBTITLE_AST) and
to use a new field AVSubtitleRect->ast instead of AVSubtitleRect->ass.

I used "AST" initially, but it's actually not an AST; so far it's just a
kind of list. Feel free to propose a better name. I picked this one
because it expresses the fact that it's an arbitrary structure layout, and
not a data buffer like the current one.

Anyway, here are the basic structures:

    typedef struct AVSubtitleASTChunk {
        int type;           ///< one of the AVSUBTITLE_AST_CHUNK_* values
        int reset;          /**< this chunk restores the setting to the default
                                 value (or disable the previous one in nested
                                 mode) */
        union {
            char *s;        ///< must be a av_malloc'ed string if string type
            double d;
            int i;
            int64_t i64;
            uint32_t u32;
            void *p;        /**< pointer to allocated data of an arbitrary
                                 size (chunk type dependent) */
        };
        int p_nb;           /**< number of entries in p, can be used for
                                 variable sized data */
    } AVSubtitleASTChunk;

    typedef struct AVSubtitleASTSettings {
        char *name;             ///< optional settings name reference
        int nb;                 ///< number of allocated chunks
        AVSubtitleASTChunk *v;  ///< array of nb chunks
    } AVSubtitleASTSettings;

    typedef struct AVSubtitleAST {
        const AVSubtitleASTSettings *g_settings;  /**< pointer to one of the global
                                                       settings for the subtitle event */
        AVSubtitleASTChunk *chunks;               ///< styles and text chunks
        int nb_chunks;                            ///< number of chunks
    } AVSubtitleAST;

A decoder will output an AVSubtitleAST for one event (we can imagine
multiple events at the same time in different AVSubtitleRect).

The main functions are:

    AVSubtitleAST *av_sub_ast_alloc(void);
    int av_sub_ast_add_chunk(AVSubtitleAST *sub, AVSubtitleASTChunk chunk);
    void av_sub_ast_free(AVSubtitleAST *sub);

    AVSubtitleASTSettings *av_sub_ast_settings_alloc(const char *name);
    int av_sub_ast_add_setting(AVSubtitleASTSettings *settings, AVSubtitleASTChunk chunk);
    void av_sub_ast_settings_free(AVSubtitleASTSettings *settings);

    int av_sub_ast_nested_to_flat(AVSubtitleAST *sub);
    void av_sub_ast_cleanup(AVSubtitleAST *sub); // assume flat
    void av_sub_ast_dump(const AVSubtitleAST *sub);

Note that, unlike the structures, all these functions are private. They are
only needed by decoders; users and encoders will browse the structures
directly. So please don't mind the "av_" prefix.
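
To make the intended behaviour of the list helpers concrete, here is a
minimal sketch of alloc/add/free. It is not the prototype itself: the
Sketch* types are reduced, hypothetical mirrors of the structures above,
the union is given a name (v), and plain malloc()/realloc() stand in for
the av_malloc() family.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))
#define SKETCH_TEXT   SKETCH_TAG('t', 'e', 'x', 't')
#define SKETCH_ITALIC SKETCH_TAG('i', 't', 'a', 'l')

/* Reduced, hypothetical mirror of AVSubtitleASTChunk (two payloads only). */
typedef struct SketchChunk {
    uint32_t type;
    int reset;
    union {
        char *s;  /* owned by the list for SKETCH_TEXT chunks */
        int i;
    } v;          /* named here; the union in the post is anonymous */
} SketchChunk;

typedef struct SketchAST {
    SketchChunk *chunks;
    int nb_chunks;
} SketchAST;

SketchAST *sketch_ast_alloc(void)
{
    return calloc(1, sizeof(SketchAST));
}

/* Append one chunk by value. Returns 0 on success, -1 on allocation
 * failure (the caller then still owns chunk.v.s). */
int sketch_ast_add_chunk(SketchAST *sub, SketchChunk chunk)
{
    SketchChunk *tmp;

    assert(sub);
    tmp = realloc(sub->chunks, ((size_t)sub->nb_chunks + 1) * sizeof(*tmp));
    if (!tmp)
        return -1;
    sub->chunks = tmp;
    sub->chunks[sub->nb_chunks++] = chunk;
    return 0;
}

/* Free the list and every string payload it owns. */
void sketch_ast_free(SketchAST *sub)
{
    if (!sub)
        return;
    for (int i = 0; i < sub->nb_chunks; i++)
        if (sub->chunks[i].type == SKETCH_TEXT)
            free(sub->chunks[i].v.s);
    free(sub->chunks);
    free(sub);
}

/* Build a text chunk holding a private copy of s. */
SketchChunk sketch_text(const char *s)
{
    SketchChunk c = { .type = SKETCH_TEXT };
    size_t len = strlen(s) + 1;

    c.v.s = malloc(len);
    if (c.v.s)
        memcpy(c.v.s, s, len);
    return c;
}

/* Build an integer-valued style chunk. */
SketchChunk sketch_int(uint32_t type, int i)
{
    SketchChunk c = { .type = type };

    c.v.i = i;
    return c;
}
```

A decoder would call sketch_ast_alloc() once per event, then
sketch_ast_add_chunk(sub, sketch_text("hello")) and similar calls while
parsing, and hand the result over until sketch_ast_free().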

And finally, here is a (not yet exhaustive) list of chunk types:

    enum {
        AVSUBTITLE_AST_CHUNK_RAW_TEXT      = MKBETAG('t','e','x','t'),  // s
        AVSUBTITLE_AST_CHUNK_COMMENT       = MKBETAG('c','o','m',' '),  // s
        AVSUBTITLE_AST_CHUNK_TIMING        = MKBETAG('t','i','m','e'),  // i64
        AVSUBTITLE_AST_CHUNK_KARAOKE       = MKBETAG('k','a','r','a'),  // i
        AVSUBTITLE_AST_CHUNK_FONTNAME      = MKBETAG('f','o','n','t'),  // s
        AVSUBTITLE_AST_CHUNK_FONTSIZE      = MKBETAG('f','s','i','z'),  // i
        AVSUBTITLE_AST_CHUNK_COLOR         = MKBETAG('c','l','r','1'),  // u32
        AVSUBTITLE_AST_CHUNK_COLOR_2       = MKBETAG('c','l','r','2'),  // u32
        AVSUBTITLE_AST_CHUNK_COLOR_OUTLINE = MKBETAG('c','l','r','O'),  // u32
        AVSUBTITLE_AST_CHUNK_COLOR_BACK    = MKBETAG('c','l','r','B'),  // u32
        AVSUBTITLE_AST_CHUNK_BOLD          = MKBETAG('b','o','l','d'),  // i
        AVSUBTITLE_AST_CHUNK_ITALIC        = MKBETAG('i','t','a','l'),  // i
        AVSUBTITLE_AST_CHUNK_STRIKEOUT     = MKBETAG('s','t','r','k'),  // i
        AVSUBTITLE_AST_CHUNK_UNDERLINE     = MKBETAG('u','n','l','n'),  // i
        AVSUBTITLE_AST_CHUNK_BORDER_STYLE  = MKBETAG('b','d','e','r'),  // i
        AVSUBTITLE_AST_CHUNK_OUTLINE       = MKBETAG('o','u','t','l'),  // i
        AVSUBTITLE_AST_CHUNK_SHADOW        = MKBETAG('s','h','a','d'),  // i
        AVSUBTITLE_AST_CHUNK_ALIGNMENT     = MKBETAG('a','l','g','n'),  // i
        AVSUBTITLE_AST_CHUNK_MARGIN_L      = MKBETAG('m','a','r','L'),  // i
        AVSUBTITLE_AST_CHUNK_MARGIN_R      = MKBETAG('m','a','r','R'),  // i
        AVSUBTITLE_AST_CHUNK_MARGIN_V      = MKBETAG('m','a','r','V'),  // i
        AVSUBTITLE_AST_CHUNK_ALPHA_LEVEL   = MKBETAG('a','l','p','h'),  // i
        AVSUBTITLE_AST_CHUNK_POSITION      = MKBETAG('p','o','s',' '),  // p (2 x i32: x, y)
        AVSUBTITLE_AST_CHUNK_MOVE          = MKBETAG('m','o','v','e'),  // p (4 x i32: x1, y1, x2, y2)
        AVSUBTITLE_AST_CHUNK_LINEBREAK     = MKBETAG('l','b','r','k'),  // i
    };

(Note: using named chunks is handy for debugging, and makes it possible to
add or re-order styles without breaking the API, since they will be exposed
to the user.)
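
As an illustration of why four-character tags help with debugging, the
sketch below packs a tag the way MKBETAG does (first character in the most
significant byte) and turns it back into a printable string, as a dump
function could. SKETCH_MKBETAG and sketch_tag_to_str() are names made up
for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Big-endian tag packing, equivalent in effect to libavutil's MKBETAG. */
#define SKETCH_MKBETAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Write the four tag characters and a terminating NUL into buf.
 * Returns buf so the call can be used directly in a printf(). */
char *sketch_tag_to_str(uint32_t tag, char buf[5])
{
    assert(buf);
    buf[0] = (char)((tag >> 24) & 0xff);
    buf[1] = (char)((tag >> 16) & 0xff);
    buf[2] = (char)((tag >>  8) & 0xff);
    buf[3] = (char)( tag        & 0xff);
    buf[4] = '\0';
    return buf;
}
```

With this, a chunk type can be shown as "[clr1]" or "[fsiz]" without any
lookup table, and new chunk types keep their value wherever they are added
in the enum.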

Here is what a decoder will basically do:

 - If the markup needs it, the decoder will create default style profiles.
   To do so, one or more AVSubtitleASTSettings can be allocated using
   av_sub_ast_settings_alloc(), with a name for each one. Each of them is
   then filled with av_sub_ast_add_setting() and contains a list of
   AVSubtitleASTChunk, one for each custom style:

     "default" [italic=1][bold=1][fontface="Arial"]
     "fancy"   [color=red][underline=1][fontface="Comic Sans"]

 - Each time a decoder receives a subtitle buffer, a new AVSubtitleAST is
   allocated with av_sub_ast_alloc(). If necessary, it can be associated
   with one of the global AVSubtitleASTSettings for the default values.
   Then, while parsing the buffer, the decoder will insert chunks of text
   or style using av_sub_ast_add_chunk().
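
One plausible way a consumer could use the named profiles is to look up the
default value of a given chunk type in the settings attached to an event.
The sketch below assumes that reading; the Sketch* types, the two static
profiles (integer styles only) and both lookup functions are hypothetical
and not part of the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))
#define T_ITAL SKETCH_TAG('i', 't', 'a', 'l')
#define T_BOLD SKETCH_TAG('b', 'o', 'l', 'd')
#define T_UNLN SKETCH_TAG('u', 'n', 'l', 'n')

typedef struct SketchChunk {
    uint32_t type;
    int i;                  /* reduced payload: integer styles only */
} SketchChunk;

typedef struct SketchSettings {
    const char *name;       /* profile name, e.g. "default" */
    int nb;                 /* number of entries in v */
    const SketchChunk *v;
} SketchSettings;

/* Two profiles loosely modelled on the "default" and "fancy" lines above. */
static const SketchChunk default_v[] = { { T_ITAL, 1 }, { T_BOLD, 1 } };
static const SketchChunk fancy_v[]   = { { T_UNLN, 1 } };
const SketchSettings sketch_profiles[] = {
    { "default", 2, default_v },
    { "fancy",   1, fancy_v   },
};

/* Return the profile called name, or NULL if there is none. */
const SketchSettings *sketch_find_profile(const SketchSettings *all, int nb,
                                          const char *name)
{
    assert(all && name);
    for (int k = 0; k < nb; k++)
        if (!strcmp(all[k].name, name))
            return &all[k];
    return NULL;
}

/* Return the profile's entry for type, or NULL if the profile is NULL or
 * does not set that style. */
const SketchChunk *sketch_find_setting(const SketchSettings *s, uint32_t type)
{
    if (!s)
        return NULL;
    for (int k = 0; k < s->nb; k++)
        if (s->v[k].type == type)
            return &s->v[k];
    return NULL;
}
```

An encoder meeting a reset chunk could call sketch_find_setting() on the
event's profile to learn which value "default" stands for, and fall back to
its own built-in default when the lookup returns NULL.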


To test whether this can work, I've rewritten the SubRip decoder, which is
a bit special since it has nested markup, while the AVSubtitleAST is always
treated as flat (since that is easier for users to deal with).

Let's take an example of how it works with the following marked-up event:

    00:00:00,000 --> 00:00:30,000
              hello<font color="red">
    bar<font size="3" color="blue">bla</font>
    <i><font size="5"             >yyyy</font>xxx</i>

So first, the decoder allocates a new AVSubtitleAST and fills it with
chunks. This is what it looks like at the end of the parsing:

    AST subtitle dump 0x258a900:
      [text] '          hello'
      [clr1] 00FF0000
      [lbrk] 1
      [text] 'bar'
      [fsiz] 3
      [clr1] 000000FF
      [text] 'bla'
      [fsiz] (RESET/CLOSE)
      [clr1] (RESET/CLOSE)
      [lbrk] 1
      [ital] 1
      [fsiz] 5
      [text] 'yyyy'
      [fsiz] (RESET/CLOSE)
      [text] 'xxx'
      [ital] (RESET/CLOSE)
      [lbrk] 1
      [clr1] (RESET/CLOSE)
      [lbrk] 1

Note that the decoder inserted some "reset" chunks with the "close"
meaning: these chunks say to close the latest open chunk of the same type.
In a flat representation, however, a reset means going back to the default
style. That is why this decoder must call av_sub_ast_nested_to_flat() after
parsing, which changes the AVSubtitleAST into:

    AST subtitle dump 0x258a900:
      [text] '          hello'
      [clr1] 00FF0000
      [lbrk] 1
      [text] 'bar'
      [fsiz] 3
      [clr1] 000000FF
      [text] 'bla'
      [fsiz] (RESET/CLOSE)
      [clr1] 00FF0000
      [lbrk] 1
      [ital] 1
      [fsiz] 5
      [text] 'yyyy'
      [fsiz] (RESET/CLOSE)
      [text] 'xxx'
      [ital] (RESET/CLOSE)
      [lbrk] 1
      [clr1] (RESET/CLOSE)
      [lbrk] 1

Now the reset chunks really mean a fallback to the default.
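
The conversion can be sketched as follows. This is one possible way to get
from the first dump to the second, not necessarily how the prototype does
it: it keeps a stack of the currently open style chunks, and when a close
chunk leaves an outer chunk of the same type open, the close chunk is
rewritten into a copy of that outer value. The Sketch* type, the single
32-bit payload and the choice of text and line breaks as the only
non-style chunks are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))
#define T_TEXT SKETCH_TAG('t', 'e', 'x', 't')
#define T_LBRK SKETCH_TAG('l', 'b', 'r', 'k')
#define T_CLR1 SKETCH_TAG('c', 'l', 'r', '1')
#define T_FSIZ SKETCH_TAG('f', 's', 'i', 'z')
#define T_ITAL SKETCH_TAG('i', 't', 'a', 'l')

typedef struct SketchChunk {
    uint32_t type;
    int reset;       /* 1: close (nested) or reset to default (flat) */
    uint32_t value;  /* reduced payload; unused for text in this sketch */
} SketchChunk;

/* Text and line breaks carry content, everything else is a style. */
static int is_content(uint32_t type)
{
    return type == T_TEXT || type == T_LBRK;
}

/* Rewrite "close" chunks in place so that every remaining reset chunk
 * means "back to the default". Returns 0, or -1 on allocation failure. */
int sketch_nested_to_flat(SketchChunk *c, int nb)
{
    int *open;       /* stack of indexes of the open style chunks */
    int nb_open = 0;

    assert(c || nb == 0);
    open = malloc((size_t)(nb > 0 ? nb : 1) * sizeof(*open));
    if (!open)
        return -1;

    for (int i = 0; i < nb; i++) {
        int j;

        if (is_content(c[i].type))
            continue;
        if (!c[i].reset) {           /* a style is opened */
            open[nb_open++] = i;
            continue;
        }

        /* Close the latest open chunk of the same type, if any. */
        for (j = nb_open - 1; j >= 0; j--)
            if (c[open[j]].type == c[i].type)
                break;
        if (j >= 0) {
            for (int k = j; k < nb_open - 1; k++)
                open[k] = open[k + 1];
            nb_open--;
        }

        /* If an outer chunk of that type is still open, restore its value
         * instead of falling back to the default. */
        for (j = nb_open - 1; j >= 0; j--) {
            if (c[open[j]].type == c[i].type) {
                c[i].value = c[open[j]].value;
                c[i].reset = 0;
                break;
            }
        }
    }

    free(open);
    return 0;
}
```

On the event above, the close of the blue <font> finds the red one still
open and becomes "[clr1] 00FF0000", while the closes of fsiz and ital have
no outer value and stay as resets.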

Another nice thing we can do now is to clean up the whole thing with
av_sub_ast_cleanup():

    AST subtitle dump 0x258a900:
      [text] 'hello'
      [clr1] 00FF0000
      [lbrk] 1
      [text] 'bar'
      [fsiz] 3
      [clr1] 000000FF
      [text] 'bla'
      [fsiz] (RESET/CLOSE)
      [clr1] 00FF0000
      [lbrk] 1
      [ital] 1
      [fsiz] 5
      [text] 'yyyy'
      [fsiz] (RESET/CLOSE)
      [text] 'xxx'

That function trims what's not text at the end (style tags and line
breaks). It also trims the initial spaces. Now this event can be perfectly
represented in a flat markup (such as ASS), and it's pretty easy to write
an ASS encoder from this.
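
A minimal sketch of such a clean-up pass, covering only the two effects
described above, could look like this. It assumes a reduced chunk type with
a caller-owned, writable string, and it strips the leading spaces of the
first text chunk in the list; the real function may well do more (for
instance freeing the dropped chunks).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))
#define T_TEXT SKETCH_TAG('t', 'e', 'x', 't')
#define T_LBRK SKETCH_TAG('l', 'b', 'r', 'k')
#define T_CLR1 SKETCH_TAG('c', 'l', 'r', '1')
#define T_ITAL SKETCH_TAG('i', 't', 'a', 'l')

typedef struct SketchChunk {
    uint32_t type;
    int reset;
    char *s;     /* writable text for T_TEXT chunks, NULL otherwise */
} SketchChunk;

/* Clean up a flat chunk list in place and return the new chunk count. */
int sketch_cleanup(SketchChunk *c, int nb)
{
    assert(c || nb == 0);

    /* Drop trailing style and line break chunks: they affect no text. */
    while (nb > 0 && c[nb - 1].type != T_TEXT)
        nb--;

    /* Strip leading spaces from the first text chunk. */
    for (int i = 0; i < nb; i++) {
        if (c[i].type == T_TEXT && c[i].s) {
            char *s = c[i].s;
            size_t skip = strspn(s, " ");

            memmove(s, s + skip, strlen(s + skip) + 1);
            break;
        }
    }
    return nb;
}
```

The function only shortens the list logically by returning the new count,
so the caller is still responsible for whatever the dropped chunks own.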

Now the other way around (encoding to a nested markup such as SubRip)
isn't that complicated either. The encoder just needs to make sure all the
tags are closed at the end by browsing the list in reverse. There is, on
the other hand, a small overhead problem: one chunk can contain only one
style, which means the output will have multiple <font> tags with one
attribute each instead of one <font> with multiple attributes. I'm not
sure it's worth trying to "compress" this though, given the complexity it
might add for simple cases. The SubRip encoder can of course have its own
heuristics to deal with this.
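
To give an idea of the encoder side, here is a small sketch that turns a
flat list into nested tags. It is an illustration under several
assumptions: it only knows <i>, <b> and <u>, treats any non-reset chunk of
those types as "on", and tracks open tags with a small stack instead of
browsing the chunk list in reverse. All names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))
#define T_TEXT SKETCH_TAG('t', 'e', 'x', 't')
#define T_ITAL SKETCH_TAG('i', 't', 'a', 'l')
#define T_BOLD SKETCH_TAG('b', 'o', 'l', 'd')
#define T_UNLN SKETCH_TAG('u', 'n', 'l', 'n')

#define MAX_OPEN 16

typedef struct SketchChunk {
    uint32_t type;
    int reset;
    const char *s;   /* text for T_TEXT chunks, NULL otherwise */
} SketchChunk;

/* Markup name for a style type, or NULL if this sketch ignores it. */
static const char *tag_name(uint32_t type)
{
    switch (type) {
    case T_ITAL: return "i";
    case T_BOLD: return "b";
    case T_UNLN: return "u";
    }
    return NULL;
}

/* Append the three strings to out, truncating if the buffer is full. */
static void append3(char *out, size_t size,
                    const char *a, const char *b, const char *c)
{
    size_t len = strlen(out);

    if (len + 1 < size)
        snprintf(out + len, size - len, "%s%s%s", a, b, c);
}

void sketch_srt_encode(const SketchChunk *c, int nb, char *out, size_t size)
{
    uint32_t open[MAX_OPEN];   /* types of the currently open tags */
    int nb_open = 0;

    assert(out && size > 0);
    out[0] = '\0';
    for (int i = 0; i < nb; i++) {
        const char *name = tag_name(c[i].type);
        int j;

        if (c[i].type == T_TEXT) {
            append3(out, size, "", c[i].s ? c[i].s : "", "");
        } else if (!name) {
            continue;                       /* style not handled here */
        } else if (!c[i].reset) {
            if (nb_open < MAX_OPEN) {
                append3(out, size, "<", name, ">");
                open[nb_open++] = c[i].type;
            }
        } else {
            for (j = nb_open - 1; j >= 0; j--)
                if (open[j] == c[i].type)
                    break;
            if (j < 0)
                continue;                   /* nothing to close */
            /* Close down to the reset style, then reopen the inner ones. */
            for (int k = nb_open - 1; k >= j; k--)
                append3(out, size, "</", tag_name(open[k]), ">");
            for (int k = j + 1; k < nb_open; k++) {
                append3(out, size, "<", tag_name(open[k]), ">");
                open[k - 1] = open[k];
            }
            nb_open--;
        }
    }
    /* Close whatever is still open, innermost first. */
    for (int k = nb_open - 1; k >= 0; k--)
        append3(out, size, "</", tag_name(open[k]), ">");
}
```

Because a flat reset can target a style that is not the innermost open one,
the sketch closes the inner tags and reopens them afterwards, which keeps
the output properly nested at the cost of a few extra tags.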

Anyway, I still have various integration problems because of the API and
ABI constraints we have, but I think we can do something with this stuff.


Clément B.