[FFmpeg-devel] [PATCH] lavc: make invalid UTF-8 in subtitle output a non-fatal error

wm4 nfxjfg at googlemail.com
Wed Jun 26 16:00:15 CEST 2013

On Wed, 26 Jun 2013 15:23:06 +0200
Nicolas George <nicolas.george at normalesup.org> wrote:

> On Octidi, 8 Messidor, year CCXXI, wm4 wrote:
> > The intention of this error was preventing the generation of broken
> > files. But in practice, dropping the output is not very helpful.
> IMHO, dropping is wrong, transcoding should stop altogether.

Some people actually use libav* for playback. There is no transcoding
in this case.

> The correct way of dealing with this error is to provide a valid
> -sub_charenc option.

The program using libavcodec will not necessarily have a -sub_charenc
option (though it may have an equivalent). Yes, having the correct
codepage would be ideal, but you are too optimistic about the number of
broken files out there. Also, do you really expect users to open
subtitles in a text editor first to figure out the codepage? Currently,
if there happens to be a subtitle event with, say, a broken umlaut, the
user can't see the line (will he even see the error messages?). And if
he does notice that something is wrong, he has to stop playback, guess
the correct codepage, restart, and repeat until ffmpeg is happy. Even if
he knows that most of the text would be readable and doesn't consider it
worth the effort to fix, ffmpeg will simply get in the way.

Auto-detection can return incorrect results too. Even worse,
auto-detection as well as conversion with iconv could succeed without
indication that something is wrong, even if they produce garbage. This
actually does happen in some cases. And then you have broken files
again. (Not technically broken as they're valid UTF-8, but useless.)
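
To illustrate the silent-failure case, here is a minimal Python sketch
(not ffmpeg code; the sample text and both codepage choices are
assumptions picked for illustration). Converting from the wrong
codepage raises no error and yields valid UTF-8 that is still garbage:

```python
# Illustration only: the sample text and the two codepages are assumptions.
original = "Grüße"
raw = original.encode("cp1252")   # bytes as stored in the subtitle file

# Wrong guess: treat the bytes as cp1251 (Cyrillic). Each of these bytes
# maps to some character there, so the conversion reports no error.
wrong = raw.decode("cp1251")
utf8_out = wrong.encode("utf-8")

# A UTF-8 validity check passes, yet the text is not what was written.
roundtrip = utf8_out.decode("utf-8")
print(repr(original), "->", repr(roundtrip))
```

A validity check on the output cannot catch this, since the result is
well-formed UTF-8.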

> You assume that the lost subtitles lines are few and unimportant,
> because that suits your point, but you could just as easily assume
> that they are important enough as to make the movie unwatchable,
> while few enough to go unnoticed on a casual check after encoding.

My point is that displaying broken data is slightly better than
displaying nothing at all. On the other hand, I don't really get your
point. If everything is completely broken, it will be immediately
obvious to the user. Why would you drop the subtitle events at all?
Character salad is an obvious hint that it's a codepage issue, while
missing subs will make it harder for the user to figure out what
exactly went wrong.
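
As a rough sketch of the "show broken data instead of nothing" idea
(Python, not the patch itself; the function name is hypothetical),
invalid bytes can be replaced with U+FFFD while the rest of the line is
kept, and the caller is told that the input was not valid so it can
still log a warning:

```python
def decode_subtitle_line(raw):
    """Hypothetical helper: return (text, was_valid_utf8).

    Invalid bytes are replaced with U+FFFD rather than dropping the line.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace"), False


# A line stored in a legacy 8-bit codepage but labelled as UTF-8:
text, ok = decode_subtitle_line(b"Gr\xfc\xdfe")
if not ok:
    print("warning: invalid UTF-8 in subtitle event:", repr(text))
```

The readable part of the line survives, and the replacement characters
mark exactly where the encoding went wrong.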

> > Keeping the broken output will also make it easier for the user to
> > locate the error, such as by knowing the subtitle event time.
> Any decent text editor will give you the information too.
> Regards,
