[FFmpeg-devel] [PATCH] [WIP] avformat/assdec: UTF-16 support

wm4 nfxjfg at googlemail.com
Sun Mar 30 11:22:07 CEST 2014


On Sun, 30 Mar 2014 10:29:35 +0200
Clément Bœsch <u at pkh.me> wrote:

> On Fri, Mar 28, 2014 at 07:33:25PM +0100, wm4 wrote:
> > This attempts to add UTF-16 subtitle support in the most simple way
> > possible. It does so by replacing avio_r8() with ff_text_r8(), which
> > converts UTF-16 on the fly to UTF-8. If the source is not UTF-16,
> > it practically wraps avio_r8() without change.
> > 
> > This uses the BOM to recognize UTF-16 files. In practice, all UTF-16
> > text files have a BOM. (I planned to use a somewhat more robust method
> > to ddtect UTF-16, similar to MPlayer's subreader, but libavformat's
> > architecture doesn't allow this easily.)
> > 
> > This also takes care of skipping the BOM properly in the UTF-8 case.
> > Skipping the BOM is somewhat hard, because AVIOContext does not allow
> > any readahead. Since ff_text_r8() includes its own read buffer (in case
> > bytes were read that don't belong to a BOM), this becomes trivial.
> > 
> > The functionality added with this patch could be used to extend other
> > subtitle formats with UTF-16 support.
> > 
> > It might be possible to implement the functionality provided by
> > FFTextReader as custom AVIOContext, but I refrained from that
> > because it's not easily possible to to return the correct stream
> > position with this, and it also seemed too roundabout.
> > ---
> >  libavformat/assdec.c    | 21 ++++++++++++------
> >  libavformat/subtitles.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  libavformat/subtitles.h | 13 +++++++++++
> >  3 files changed, 84 insertions(+), 7 deletions(-)
> > 
> 
> This sounds good to me mostly. Now the thing is, ff_subtitles_read_chunk()
> (used in mpsub, srt, and webvtt) should be adjusted to use that API. Same
> for ff_smil_extract_next_chunk() which is used for SAMI and realtext (we
> have utf16 sami samples). Finally, the most common one, ff_get_line()
> should be adjusted as well. The problem with this last one is that it
> doesn't use an AVBPrint buffer, so it might be a bit more tricky.

I'm not sure about ff_get_line(). It works on an AVIOContext. So I see
the following possibilities (pick your favorite one):

a) Don't use ff_get_line() in subtitle parsers, and use a new function
   based on ff_subtitles_next_line. (Also, ff_get_line() doesn't handle
   Mac or DOS line breaks.)
b) Actually replace FFTextReader by a "converting" AVIOContext (either
   by implementing FFTextReader as a custom AVIOContext backend, or
   changing AVIOContext directly).

The simplest would be a), while b) sounds a bit involved. b) also makes
it harder to get the original file position for AVPacket.pos.

> I'm not asking for those to be changed for the patch to be accepted, but
> that's how I'd see a complete UTF-16 support for subtitles.
> 
> This code you introduce is definitely an improvement, thanks.
> 
> Your patch states a work in progress. What's your plan for the next
> iteration?

Extending it to the other subtitle parsers. And maybe adding doxygen
and splitting the commit.

By the way, for which formats do you think UTF-16 is needed? I've
spotted only ASS and SAMI UTF-16 files in the wild. Somehow I'd also
expect to find such SRT and MicroDVD files, but I think I haven't seen
such yet. For all other formats, it's probably a waste of time.


More information about the ffmpeg-devel mailing list