[FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles

Carl Eugen Hoyos ceffmpeg at gmail.com
Sat Mar 24 18:05:41 EET 2018


2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
> Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> ASCII characters) were not handled correctly. The spec mandates that
> styling start/end ranges are in "characters". It's not quite clear what
> a "character" is supposed to be, but maybe they mean unicode codepoints.
>
> FFmpeg's decoder treated the style ranges as byte idexes, which could
> lead to UTF-8 sequences being broken, and the common code dropping the
> whole subtitle line.
>
> Change this and count the codepoint instead. This also means that even
> if this is somehow wrong, the decoder won't break UTF-8 sequences
> anymore. The sample which led me to investigate this now appears to work
> correctly.

Could you confirm that this is also what QT does?

Or is it impossible that the patch breaks something?

[...]

> @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char
> *text, const char *text_end,
>              }
>          }
>
> -        switch (*text) {
> -        case '\r':
> -            break;
> -        case '\n':
> -            av_bprintf(buf, "\\N");
> -            break;
> -        default:
> -            av_bprint_chars(buf, *text, 1);
> -            break;
> +        len = get_utf8_length_at(text, text_end);
> +        if (len < 1) {
> +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in
> subtitle\n");
> +            len = 1;
> +        }
> +        for (i = 0; i < len; i++) {

> +            switch (*text) {
> +            case '\r':
> +                break;
> +            case '\n':
> +                av_bprintf(buf, "\\N");
> +                break;
> +            default:
> +                av_bprint_chars(buf, *text, 1);
> +                break;

Imo, the reindentation is not ok but this isn't my code.

Carl Eugen


More information about the ffmpeg-devel mailing list