[FFmpeg-devel] [PATCH] [WIP] avformat/assdec: UTF-16 support

Clément Bœsch u at pkh.me
Sun Mar 30 10:29:35 CEST 2014


On Fri, Mar 28, 2014 at 07:33:25PM +0100, wm4 wrote:
> This attempts to add UTF-16 subtitle support in the most simple way
> possible. It does so by replacing avio_r8() with ff_text_r8(), which
> converts UTF-16 on the fly to UTF-8. If the source is not UTF-16,
> it practically wraps avio_r8() without change.
> 
> This uses the BOM to recognize UTF-16 files. In practice, all UTF-16
> text files have a BOM. (I planned to use a somewhat more robust method
> to detect UTF-16, similar to MPlayer's subreader, but libavformat's
> architecture doesn't allow this easily.)
> 
> This also takes care of skipping the BOM properly in the UTF-8 case.
> Skipping the BOM is somewhat hard, because AVIOContext does not allow
> any readahead. Since ff_text_r8() includes its own read buffer (in case
> bytes were read that don't belong to a BOM), this becomes trivial.
> 
> The functionality added with this patch could be used to extend other
> subtitle formats with UTF-16 support.
> 
> It might be possible to implement the functionality provided by
> FFTextReader as custom AVIOContext, but I refrained from that
> because it's not easily possible to return the correct stream
> position with this, and it also seemed too roundabout.
> ---
>  libavformat/assdec.c    | 21 ++++++++++++------
>  libavformat/subtitles.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
>  libavformat/subtitles.h | 13 +++++++++++
>  3 files changed, 84 insertions(+), 7 deletions(-)
> 
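
For reference, the on-the-fly conversion described above can be sketched
roughly as follows. This is a minimal, self-contained illustration, not the
code from the patch: TextReader, text_init() and text_r8() are hypothetical
names, the input is an in-memory buffer rather than an AVIOContext (so
peeking at the BOM is trivial here), and signalling end of input with -1 is
an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader; NOT the FFTextReader from the patch. */
typedef struct TextReader {
    const uint8_t *src;     /* input bytes */
    size_t len, pos;        /* input size and read position */
    int type;               /* 0: pass-through, 1: UTF-16LE, 2: UTF-16BE */
    uint8_t buf[4];         /* UTF-8 bytes of the current code point */
    int buf_pos, buf_len;
} TextReader;

/* Detect and skip a BOM; without a BOM the bytes are passed through as-is. */
static void text_init(TextReader *r, const uint8_t *src, size_t len)
{
    r->src = src; r->len = len; r->pos = 0;
    r->type = 0; r->buf_pos = r->buf_len = 0;
    if (len >= 2 && src[0] == 0xFF && src[1] == 0xFE) {
        r->type = 1; r->pos = 2;            /* UTF-16LE BOM */
    } else if (len >= 2 && src[0] == 0xFE && src[1] == 0xFF) {
        r->type = 2; r->pos = 2;            /* UTF-16BE BOM */
    } else if (len >= 3 && src[0] == 0xEF && src[1] == 0xBB && src[2] == 0xBF) {
        r->pos = 3;                         /* UTF-8 BOM: just skip it */
    }
}

/* Read one 16-bit unit; the caller guarantees two bytes are available. */
static uint32_t read16(TextReader *r)
{
    uint32_t a = r->src[r->pos], b = r->src[r->pos + 1];
    r->pos += 2;
    return r->type == 1 ? (b << 8 | a) : (a << 8 | b);
}

/* Return the next UTF-8 byte, or -1 at end of input. */
static int text_r8(TextReader *r)
{
    uint32_t cp;
    int n = 0;

    if (r->buf_pos < r->buf_len)            /* drain pending UTF-8 bytes */
        return r->buf[r->buf_pos++];
    if (r->type == 0)                       /* not UTF-16: plain byte read */
        return r->pos < r->len ? r->src[r->pos++] : -1;
    if (r->len - r->pos < 2)                /* no complete 16-bit unit left */
        return -1;

    cp = read16(r);
    if (cp >= 0xD800 && cp <= 0xDFFF) {     /* surrogate range */
        uint32_t hi = cp;
        cp = 0xFFFD;                        /* unpaired: replacement char */
        if (hi < 0xDC00 && r->len - r->pos >= 2) {
            size_t save = r->pos;
            uint32_t lo = read16(r);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            else
                r->pos = save;              /* not a low surrogate: re-read */
        }
    }

    /* Encode the code point as UTF-8 into the small internal buffer. */
    if (cp < 0x80) {
        r->buf[n++] = cp;
    } else if (cp < 0x800) {
        r->buf[n++] = 0xC0 | (cp >> 6);
        r->buf[n++] = 0x80 | (cp & 0x3F);
    } else if (cp < 0x10000) {
        r->buf[n++] = 0xE0 | (cp >> 12);
        r->buf[n++] = 0x80 | ((cp >> 6) & 0x3F);
        r->buf[n++] = 0x80 | (cp & 0x3F);
    } else {
        r->buf[n++] = 0xF0 | (cp >> 18);
        r->buf[n++] = 0x80 | ((cp >> 12) & 0x3F);
        r->buf[n++] = 0x80 | ((cp >> 6) & 0x3F);
        r->buf[n++] = 0x80 | (cp & 0x3F);
    }
    r->buf_len = n;
    r->buf_pos = 1;
    return r->buf[0];
}
```

With this sketch, the input bytes FF FE 41 00 E9 00 (UTF-16LE with BOM)
come out of text_r8() as 41 C3 A9, and a UTF-8 file starting with EF BB BF
is returned unchanged minus the BOM.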

This mostly sounds good to me. Now the thing is, ff_subtitles_read_chunk()
(used in mpsub, srt, and webvtt) should be adjusted to use that API. Same
for ff_smil_extract_next_chunk() which is used for SAMI and realtext (we
have UTF-16 SAMI samples). Finally, the most common one, ff_get_line(),
should be adjusted as well. The problem with this last one is that it
doesn't use an AVBPrint buffer, so it might be a bit more tricky.

I'm not asking for those to be changed for the patch to be accepted, but
that's what I'd consider complete UTF-16 support for subtitles.

This code you introduce is definitely an improvement, thanks.

Your patch is marked as a work in progress. What's your plan for the next
iteration?

[...]

-- 
Clément B.