[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Michael Niedermayer michaelni at gmx.at
Wed Nov 13 19:04:34 CET 2013


On Wed, Nov 13, 2013 at 05:51:32PM +0100, Nicolas George wrote:
> Le tridi 23 brumaire, an CCXXII, Stefano Sabatini a écrit :
> > Yes, on the other hand developers would easily forget to update their
> > commit log, resulting in missing entries in the resulting output (and
> > we're not allowed to change commit logs). I don't know if git allows
> > marking up specific commits after they have been committed.
> 
> There is git-notes, which allows attaching a note that will be displayed along
> with the commit message. Unfortunately, notes are not cloned by default. Another
> solution would be to add an empty commit to the history with the APIchanges
> tag; unfortunately, in that case the commit would not appear in the log for
> the corresponding files.
> 
> But enough of these digressions.
> 
> > Changed both, but with the only difference that endp points to the
> > last byte in the buffer, in order to avoid overflow issues.
> 
> The C standard specifically allows pointers to the first byte after an
> object, probably exactly for this kind of situation. And it is easier to
> write:
> 
>     end = buf + size;
> 
> ... than to subtract one, because you must check size for 0 (C does not
> allow a pointer to the byte before an object, and anyway size is probably
> unsigned).
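
To illustrate the one-past-the-end idiom described above, here is a minimal
sketch. The helper name count_non_ascii() is hypothetical and not part of the
patch; it only shows that end = buf + size bounds the loop correctly, including
for size == 0, without any subtraction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): count the bytes in
 * buf[0..size) that have the high bit set. */
static size_t count_non_ascii(const uint8_t *buf, size_t size)
{
    const uint8_t *p, *end;
    size_t n = 0;

    assert(buf != NULL);
    p   = buf;
    end = buf + size; /* one past the last byte; valid even if size == 0 */

    while (p < end) { /* end is only compared against, never dereferenced */
        if (*p & 0x80)
            n++;
        p++;
    }
    return n;
}
```

With a last-byte pointer instead (buf + size - 1), the size == 0 case would
need a separate check before the pointer could even be formed.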
> 
> > I implemented the code < (1<<31) check in the patch. I don't know what
> > you exactly mean by "Unicode range check", indeed there is a lot of
> > documentation about which code points should be considered valid, and
> > for some it is not entirely clear (for example surrogates).
> 
> There is absolutely no doubt about surrogates: they are only valid in
> UTF-16.
> 
> The most ambiguous issue is the upper bound: it was initially 0xFFFF, then
> became 0x7FFFFFFF when thousands of ideograms were found in old books, and
> then was lowered to 0x10FFFF when it became apparent that Microsoft and Sun
> had once again made a mess with UTF-16.
> 
> > Which flags do you propose to support?
> 
> Default, accept any code that is structurally valid in current Unicode:
> 0x000000-0x10FFFF except the surrogate range (0xD800-0xDFFF) and 0xFFFE and 0xFFFF.
> 
> Flag #1: accept any code that is structurally possible in UTF-8, i.e.
> 0x00000000-0x7FFFFFFF.
> 
> Flag #2: reject codes that would make XML choke.
> 
> (Flag #3: toggle the default check for overlong encodings.)
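
The default check and flags #1 and #2 above could be sketched roughly as
follows. The function and flag names are made up for illustration and are not
the API of the patch. The XML test follows the Char production from the XML 1.0
spec. Flag #3 (overlong encodings) is left out because it concerns the byte
sequence rather than the decoded code point.

```c
#include <stdint.h>

/* Hypothetical flag names, for illustration only. */
#define SKETCH_FLAG_ACCEPT_BIG_CODES   1 /* flag #1: anything UTF-8 can carry */
#define SKETCH_FLAG_REJECT_XML_INVALID 2 /* flag #2: only XML 1.0 Char codes  */

/* Return 1 if the decoded code point is acceptable under flags, else 0. */
static int sketch_code_is_acceptable(uint32_t code, unsigned flags)
{
    /* Structural limit of the original 6-byte UTF-8 scheme. */
    if (code > 0x7FFFFFFF)
        return 0;

    if (!(flags & SKETCH_FLAG_ACCEPT_BIG_CODES)) {
        /* Default: current Unicode range, minus surrogates and
         * the two codes 0xFFFE and 0xFFFF. */
        if (code > 0x10FFFF)
            return 0;
        if (code >= 0xD800 && code <= 0xDFFF)
            return 0;
        if (code == 0xFFFE || code == 0xFFFF)
            return 0;
    }

    if (flags & SKETCH_FLAG_REJECT_XML_INVALID) {
        /* XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
         *                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
        int ok = code == 0x9 || code == 0xA || code == 0xD ||
                 (code >= 0x20    && code <= 0xD7FF) ||
                 (code >= 0xE000  && code <= 0xFFFD) ||
                 (code >= 0x10000 && code <= 0x10FFFF);
        if (!ok)
            return 0;
    }
    return 1;
}
```

In this sketch the XML flag is the stricter one, so combining it with the
big-codes flag still rejects everything outside the XML ranges.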
> 
> > I cheated, indeed this list is directly taken from the XML specs:
> > http://www.w3.org/TR/xml/#charsets
> > 
> > after much time spent browsing various Unicode documents. Thus I
> > suppose these ranges should be universally accepted by XML parsers.
> 
> Ok.
> 
> > On the other hand I'm not sure what we should really disallow by
> > default, for example JSON parsers are usually much less strict than
> > XML parsers with regard to accepted code-points.
> 
> I agree, but surrogates, 0xFFFE, 0xFFFF and codes beyond 0x10FFFF should
> really not be there.

Slightly off topic, but it is necessary to support non-UTF-8 sequences
in UTF-8. Surrogates are what Python uses to transport them through
UTF-8 code, AFAIK.

The issue is that filenames are not necessarily UTF-8 in POSIX, even if
your locale is UTF-8.
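
A rough sketch of that transport scheme, assuming it is Python's
"surrogateescape" error handler (PEP 383), which maps each undecodable byte
0x80-0xFF to the lone low surrogate U+DC80-U+DCFF. The function names are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: map an undecodable byte (0x80..0xFF) to the
 * lone surrogate that stands in for it. */
static uint32_t sketch_escape_byte(uint8_t b)
{
    return 0xDC00u + b;
}

/* Hypothetical helper: recover the original byte from such a code,
 * or return -1 if the code is not in the escape range. */
static int sketch_unescape_code(uint32_t code)
{
    if (code >= 0xDC80 && code <= 0xDCFF)
        return (int)(code - 0xDC00);
    return -1;
}
```

A decoder that rejects all surrogates by default would need some way to let
these through in order to round-trip such filenames.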

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates