[FFmpeg-devel] [PATCH] lavc: make invalid UTF-8 in subtitle output a non-fatal error

Michael Niedermayer michaelni at gmx.at
Fri Jun 28 19:23:10 CEST 2013

On Fri, Jun 28, 2013 at 11:00:12AM +0200, Nicolas George wrote:
> On Decadi 10 Messidor, year CCXXI, wm4 wrote:
> > Such as?
> Depends on your input and what you want to do with it.
> > I get them from libavformat demuxers, but also elsewhere. I actually
> > can perform codepage auto-detection on subs read by libavformat
> > demuxers (it's really awkward: read a number of subtitle packets,
> > concatenate their contents, then run the charset detector on it). But
> > it's disabled by default
> Then enable it.
> > and doesn't guarantee success anyway.
> Success is not guaranteed in that you can not be sure to get the right
> encoding, but you will always succeed in finding at least one encoding that
> can work, since there are common encodings, including plain ISO-8859-1, that
> can accept any byte sequence.
> > In some
> > cases, subtitles might be demuxed from interleaved files, in which
> > auto-detection can't be reasonably performed.
> Do you have any such file where conversion fails? If so, share it.
> Also, you have only answered half the question: what do you intend to do
> with the decoded subtitles. If garbage output suits you, do not bother
> decoding the subtitles, read them directly from /dev/urandom.
> > I have the impression that you still believe the charset problem can
> > be solved perfectly. This is not the case. Such problems are very common
> > even today, and just showing an error message (or even dropping broken
> > text) won't help.
> Please provide a realistic scenario where you believe the encoding problem
> can not be "solved perfectly" and where your proposal would have helped.

I can't answer the question about the scenarios wm4 and others had in
mind, but one scenario that I see is a simple player that displays
subtitles:

The user starts the player by clicking on a media file.
It displays subs as partial gibberish, like messed-up umlauts. The
user sees that and hits the "cycle encoding" or "choose encoding" button.
The user is able to guess from the messed-up subtitles what the correct
encoding is.

If there's just an error return, the user lacks feedback. If there's
an error in the terminal window, it will not be seen by 99% of the users:
people use touch screens on their phones, where there are no terminal
windows, and the rest use mice with GUI-based players.
A player would need specific code to detect this error case and
notify the user or take some automated action. That means extra code
in the player, while for the above case of simply displaying gibberish
no extra code is needed in the player.
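
To illustrate the "cycle encoding" scenario above, here is a minimal
Python sketch (not lavc code). The candidate list and the function names
decode_for_display and next_encoding are hypothetical. It assumes the
player keeps the raw subtitle bytes and re-decodes them on each button
press, replacing undecodable bytes instead of returning an error.

```python
# Hypothetical list of encodings a player might cycle through.
CANDIDATE_ENCODINGS = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-2"]


def decode_for_display(raw: bytes, encoding: str) -> str:
    """Decode subtitle bytes for display without ever raising.

    Bytes that are invalid in `encoding` become U+FFFD, so the user
    still sees (partially garbled) text and can judge the encoding.
    """
    return raw.decode(encoding, errors="replace")


def next_encoding(current: str) -> str:
    """Return the encoding that follows `current`, wrapping around."""
    i = CANDIDATE_ENCODINGS.index(current)
    return CANDIDATE_ENCODINGS[(i + 1) % len(CANDIDATE_ENCODINGS)]


# Usage: Latin-1 bytes first shown as UTF-8, then after one "cycle" press.
raw = "Grüße".encode("iso-8859-1")
encoding = "utf-8"
first_try = decode_for_display(raw, encoding)    # umlauts come out as U+FFFD
encoding = next_encoding(encoding)
second_try = decode_for_display(raw, encoding)   # decoded as ISO-8859-1
```

The point of the sketch is that the display path needs no error
handling at all; the only extra piece is the button that picks the
next candidate.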

Also, slightly off topic, but I like software that just works:
a simple "player file.ext" should autodetect the charset/encoding.
If lavc is smart enough to figure out that it's not UTF-8, then it
should pick something else instead of failing (unless that's damaged
UTF-8, in which case it should do some kind of concealing and stay with
UTF-8). I don't think that check-and-fail is a good default for players.
It's fine if the user forces UTF-8 and asks for hard failures (which a
transcoding application might do by default to prevent generating
broken files).
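
A rough Python sketch of the default suggested above, not an existing
lavc API. The function name decode_subtitle, the strict flag, the
ISO-8859-1 fallback and the "more good than bad non-ASCII characters"
heuristic are all assumptions made for illustration. It tries strict
UTF-8 first, conceals damage if the data still looks mostly like UTF-8,
and otherwise falls back to an encoding that accepts any byte sequence.

```python
def decode_subtitle(raw: bytes, strict: bool = False,
                    fallback: str = "iso-8859-1"):
    """Return (text, encoding_used) for one subtitle payload.

    strict=True models a transcoder that wants hard failures;
    strict=False models a player that prefers to show something.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        if strict:
            raise  # caller explicitly asked for check-and-fail

    # Not valid UTF-8. Decide between "damaged UTF-8" and "other charset".
    concealed = raw.decode("utf-8", errors="replace")
    bad = concealed.count("\ufffd")
    good = sum(1 for ch in concealed if ord(ch) > 0x7F and ch != "\ufffd")

    # Assumed heuristic: mostly well-formed multi-byte sequences means
    # damaged UTF-8, so stay with UTF-8 and conceal the broken bytes.
    if good > bad:
        return concealed, "utf-8"

    # ISO-8859-1 maps every byte to a character, so this cannot fail.
    return raw.decode(fallback), fallback


# Usage: a player calls it with the default, a transcoder with strict=True.
text, used = decode_subtitle("Grüße".encode("iso-8859-1"))
```

A real implementation would likely use a proper charset detector in
place of the counting heuristic; the sketch only shows where the three
policies (strict, conceal, fall back) would sit.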


Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus