[FFmpeg-devel] [PATCH 1/3] lavu: add av_is_valid_utf8().

Sun Apr 7 10:54:32 CEST 2013

On Sun, Apr 07, 2013 at 10:36:19AM +0200, Nicolas George wrote:
> On Octidi 18 Germinal, year CCXXI, Reimar Döffinger wrote:
> > I don't think this macro improves readability, quite the opposite.
> 
> I find they do, but I will not insist on it.
> 
> > Why should a normal byte-order-mark be accepted in UTF-8?
> > It really has no place or purpose in it either.
> > (though it might be hard to detect, since the code point is really just
> > a special non-breaking space I think)
> 
> Exactly: U+FEFF is utterly useless, but it is a valid Unicode character,
> while U+FFFE is guaranteed to never exist, precisely to detect endianness
> errors in conversions.
> 
> Also, there are quite a lot of text files out there in UTF-8 with a BOM; I
> believe some editors take it as a hint to open the file in UTF-8 rather than
> a legacy 8-bit encoding.

Yes, for pure text files (though I'd consider that a bad implementation as well,
mostly caused by the amazingly incompetent Unicode handling in Windows).
However, IMHO those BOMs should be stripped away by the subtitle demuxer.
Inside a (e.g. MKV) subtitle stream I'd say we usually shouldn't
encounter a BOM (except in the very unlikely case that it really means
the U+FEFF whitespace codepoint).
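
To make the stripping idea concrete, here is a minimal sketch of what a
demuxer could do on the first bytes of a text file. The helper names
skip_utf8_bom() and is_utf8_fffe() are hypothetical and not part of lavu or
of this patch; the only facts assumed are the UTF-8 byte sequences
(U+FEFF is EF BB BF, U+FFFE is EF BF BE).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an existing libavutil function): return the
 * number of leading bytes to skip, i.e. 3 if the buffer starts with the
 * UTF-8 encoding of U+FEFF (EF BB BF), 0 otherwise. */
static size_t skip_utf8_bom(const uint8_t *buf, size_t size)
{
    if (size >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}

/* Hypothetical helper: return 1 if the 3 bytes at buf encode U+FFFE
 * (EF BF BE), the noncharacter that a byte-swapped BOM would produce. */
static int is_utf8_fffe(const uint8_t *buf, size_t size)
{
    return size >= 3 && buf[0] == 0xEF && buf[1] == 0xBF && buf[2] == 0xBE;
}
```

A demuxer would call skip_utf8_bom() once, at the very start of the file, and
advance its read pointer by the returned count. A U+FEFF that shows up later
inside a packet would be left alone, since there it can only be the
whitespace codepoint.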