[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Stefano Sabatini stefasab at gmail.com
Wed Nov 20 13:02:16 CET 2013


On date Saturday 2013-11-16 11:57:07 +0100, Nicolas George encoded:
> On Quartidi 24 Brumaire, year CCXXII, Stefano Sabatini wrote:
> > Another interface optimization would be to document that *code is set
> > whenever the sequence is structurally valid, even if the code range is
> > not accepted.
> 
> Not sure how it can be done, since you still need to be able to distinguish
> cases where the UTF-8 is really invalid.

I can set the output code only when the sequence is structurally
valid, and mark it as unset otherwise. Check the updated test
program.
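
To make that contract concrete, here is a minimal standalone sketch,
not the code from the attached patch. The names sketch_utf8_decode,
SKETCH_CODE_UNSET and SKETCH_ACCEPT_SURROGATES are hypothetical, the
-1 sentinel and the plain -1 error return are assumptions, and only
the surrogate range check is shown:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; none of them come from the patch. */
#define SKETCH_CODE_UNSET        (-1) /* assumed sentinel for "no code decoded" */
#define SKETCH_ACCEPT_SURROGATES 4    /* same bit value as the patch's flag */

/*
 * Decode one code point. Returns 0 on success, -1 on error.
 * *codep receives the decoded value whenever the bytes are structurally
 * valid, even if the range check then rejects it; otherwise it is set to
 * SKETCH_CODE_UNSET. *bufp always advances past the bytes consumed.
 */
static int sketch_utf8_decode(int32_t *codep, const uint8_t **bufp,
                              const uint8_t *buf_end, unsigned int flags)
{
    const uint8_t *p = *bufp;
    uint32_t top;
    uint64_t code;
    int ret = -1;

    *codep = SKETCH_CODE_UNSET;
    if (p >= buf_end)
        return 0;                     /* empty input: nothing to decode */

    code = *p++;
    /* a lead byte cannot be 10xx-xxxx, 0xFE or 0xFF */
    if ((code & 0xC0) == 0x80 || code >= 0xFE)
        goto end;

    top = (code & 0x80) >> 1;
    while (code & top) {
        uint8_t b;
        if (p >= buf_end)             /* truncated sequence */
            goto end;
        b = *p++;
        if ((b & 0xC0) != 0x80)       /* not a continuation byte */
            goto end;
        code = (code << 6) + (b & 0x3F);
        top <<= 5;
    }
    code &= (top << 1) - 1;           /* strip the lead byte's length bits */

    if (code >= (1U << 31))           /* does not fit in int32_t */
        goto end;

    *codep = (int32_t)code;           /* structurally valid: report the code */
    ret = 0;
    if (code >= 0xD800 && code <= 0xDFFF &&
        !(flags & SKETCH_ACCEPT_SURROGATES))
        ret = -1;                     /* range rejected, *codep still set */

end:
    *bufp = p;
    return ret;
}
```

With this shape a caller that gets an error can still tell the two
cases apart: a rejected but well-formed sequence leaves the decoded
value in *codep, while a malformed one leaves the sentinel.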


> OTOH, a flag to automatically return a replacement character when an invalid
> sequence is detected could be useful, but that can come later.

The problem is that you need to specify which code, or which
sequence, to use as the replacement. This would also need a separate
function, and it's not clear how much control that function should
provide before it becomes too bloated.
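
As a rough illustration of what such a separate function might look
like, here is a minimal sketch in which the caller passes the
replacement code explicitly. Everything in it is hypothetical:
sketch_decode_fn only mimics the av_utf8_decode() signature from the
patch, sketch_decode_or_replace is not a proposed API, and
sketch_ascii_decode is a stand-in decoder used to keep the example
self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder signature, modelled on av_utf8_decode() in the patch. */
typedef int (*sketch_decode_fn)(int32_t *codep, const uint8_t **bufp,
                                const uint8_t *buf_end, unsigned int flags);

/*
 * Hypothetical wrapper: decode one code point and return it, or return the
 * caller-chosen replacement if the decoder reports an error. The caller is
 * expected to check *bufp < buf_end before calling.
 */
static int32_t sketch_decode_or_replace(sketch_decode_fn decode,
                                        const uint8_t **bufp,
                                        const uint8_t *buf_end,
                                        unsigned int flags,
                                        int32_t replacement)
{
    int32_t code = replacement;

    if (decode(&code, bufp, buf_end, flags) < 0)
        return replacement;
    return code;
}

/* Stand-in decoder for the example only: accepts ASCII, rejects the rest. */
static int sketch_ascii_decode(int32_t *codep, const uint8_t **bufp,
                               const uint8_t *buf_end, unsigned int flags)
{
    (void)flags;
    if (*bufp >= buf_end)
        return 0;
    if (**bufp >= 0x80) {
        (*bufp)++;                    /* skip the offending byte */
        return -1;
    }
    *codep = *(*bufp)++;
    return 0;
}
```

Even this small version has to take the replacement as a parameter,
and it still cannot substitute a multi-code sequence, which is the
kind of growth in the interface I am worried about.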

> 
> > From 40a1b7a61d509efe64fdd1c1047fdd1507ab181e Mon Sep 17 00:00:00 2001
> > From: Stefano Sabatini <stefasab at gmail.com>
> > Date: Thu, 3 Oct 2013 01:21:40 +0200
> > Subject: [PATCH] lavu/avstring: add av_utf8_decode() function
> > 
> > ---
> >  doc/APIchanges       |  3 +++
> >  libavutil/Makefile   |  1 +
> >  libavutil/avstring.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  libavutil/avstring.h | 35 ++++++++++++++++++++++++++++
> >  libavutil/utf8.c     | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  libavutil/version.h  |  2 +-
> >  6 files changed, 170 insertions(+), 1 deletion(-)
> >  create mode 100644 libavutil/utf8.c
> > 
> > diff --git a/doc/APIchanges b/doc/APIchanges
> > index dfdc159..b292d19 100644
> > --- a/doc/APIchanges
> > +++ b/doc/APIchanges
> > @@ -15,6 +15,9 @@ libavutil:     2012-10-22
> >  
> >  API changes, most recent first:
> >  
> > +2013-11-12 - xxxxxxx - lavu 52.53.100 - avstring.h
> > +  Add av_utf8_decode() function.
> > +
> >  2013-11-xx - xxxxxxx - lavc 55.41.100 / 55.25.0 - avcodec.h
> >                         lavu 52.51.100 - frame.h
> >    Add ITU-R BT.2020 and other not yet included values to color primaries,
> > diff --git a/libavutil/Makefile b/libavutil/Makefile
> > index 7b3b439..19540e4 100644
> > --- a/libavutil/Makefile
> > +++ b/libavutil/Makefile
> > @@ -155,6 +155,7 @@ TESTPROGS = adler32                                                     \
> >              sha                                                         \
> >              sha512                                                      \
> >              tree                                                        \
> > +            utf8                                                        \
> >              xtea                                                        \
> >  
> >  TESTPROGS-$(HAVE_LZO1X_999_COMPRESS) += lzo
> > diff --git a/libavutil/avstring.c b/libavutil/avstring.c
> > index eed58fa..8ff953e 100644
> > --- a/libavutil/avstring.c
> > +++ b/libavutil/avstring.c
> > @@ -307,6 +307,70 @@ int av_isxdigit(int c)
> >      return av_isdigit(c) || (c >= 'a' && c <= 'f');
> >  }
> >  
> > +int av_utf8_decode(int32_t *codep, const uint8_t **bufp, const uint8_t *buf_end,
> > +                   unsigned int flags)
> > +{
> > +    const uint8_t *p = *bufp;
> > +    uint32_t top;
> > +    uint64_t code;
> > +    int ret = 0;
> > +
> > +    if (p >= buf_end)
> > +        return 0;
> > +
> > +    code = *p++;
> > +
> > +    /* first sequence byte starts with 10, or is 1111-1110 or 1111-1111,
> > +       which is not admitted */
> > +    if ((code & 0xc0) == 0x80 || code >= 0xFE) {
> > +        ret = AVERROR(EILSEQ);
> > +        goto end;
> > +    }
> > +    top = (code & 128) >> 1;
> > +
> > +    while (code & top) {
> > +        int tmp;
> > +        if (p >= buf_end) {
> > +            ret = AVERROR(EILSEQ); /* incomplete sequence */
> > +            goto end;
> > +        }
> > +
> > +        /* we assume the byte to be in the form 10xx-xxxx */
> > +        tmp = *p++ - 128;   /* strip leading 1 */
> > +        if (tmp>>6) {
> > +            ret = AVERROR(EILSEQ);
> > +            goto end;
> > +        }
> > +        code = (code<<6) + tmp;
> > +        top <<= 5;
> > +    }
> > +    code &= (top << 1) - 1;
> > +
> > +    if (code >= 1U<<31) {
> > +        ret = AVERROR(EILSEQ);  /* out-of-range value */
> > +        goto end;
> > +    }
> > +
> > +    *codep = code;
> > +
> > +    if (code > 0x10FFFF &&
> > +        !(flags & AV_UTF8_CHECK_FLAG_ACCEPT_INVALID_BIG_CODES))
> > +        ret = AVERROR(EILSEQ);
> > +    if (code < 0x20 && code != 0x9 && code != 0xA && code != 0xD &&
> > +        flags & AV_UTF8_CHECK_FLAG_EXCLUDE_XML_INVALID_CONTROL_CODES)
> > +        ret = AVERROR(EILSEQ);
> > +    if (code >= 0xD800 && code <= 0xDFFF &&
> > +        !(flags & AV_UTF8_CHECK_FLAG_ACCEPT_SURROGATES))
> > +        ret = AVERROR(EILSEQ);
> > +    if ((code == 0xFFFE || code == 0xFFFF) &&
> > +        !(flags & AV_UTF8_CHECK_FLAG_ACCEPT_NON_CHARACTERS))
> > +        ret = AVERROR(EILSEQ);
> > +
> > +end:
> > +    *bufp = p;
> > +    return ret;
> > +}
> > +
> >  #ifdef TEST
> >  
> >  int main(void)
> > diff --git a/libavutil/avstring.h b/libavutil/avstring.h
> > index 438ef79..9a8aadf 100644
> > --- a/libavutil/avstring.h
> > +++ b/libavutil/avstring.h
> > @@ -22,6 +22,7 @@
> >  #define AVUTIL_AVSTRING_H
> >  
> >  #include <stddef.h>
> > +#include <stdint.h>
> >  #include "attributes.h"
> >  
> >  /**
> > @@ -295,6 +296,40 @@ enum AVEscapeMode {
> >  int av_escape(char **dst, const char *src, const char *special_chars,
> >                enum AVEscapeMode mode, int flags);
> >  
> 
> > +#define AV_UTF8_CHECK_FLAG_ACCEPT_INVALID_BIG_CODES 1 ///< accept codepoints over 0x10FFFF
> > +#define AV_UTF8_CHECK_FLAG_ACCEPT_NON_CHARACTERS    2 ///< accept non-characters - 0xFFFE and 0xFFFF
> > +#define AV_UTF8_CHECK_FLAG_ACCEPT_SURROGATES        4 ///< accept UTF-16 surrogates codes
> > +#define AV_UTF8_CHECK_FLAG_EXCLUDE_XML_INVALID_CONTROL_CODES 8 ///< exclude control codes not accepted by XML
> 
> I still think that CHECK is redundant with ACCEPT and EXCLUDE, but that is
> your call.

Removed CHECK, hope we won't need to change it later.
 
> > +
> > +/**
> > + * Read and decode a single UTF-8 code point (character) from the
> > + * buffer in *bufp, and update *bufp to point to the next byte to
> > + * decode.
> > + *
> > + * In case of an invalid byte sequence, the pointer will be updated to
> > + * the next byte after the invalid sequence and the function will
> > + * return an error code.
> > + *
> > + * Depending on the specified flags, the function will also fail in
> > + * case the decoded code point does not belong to a valid range.
> > + *
> > + * @note For speed-relevant code a carefully implemented use of
> > + * GET_UTF8() may be preferred.
> > + *
> > + * @param codep pointer used to return the parsed code in case of success
> > + * @param bufp     pointer to the first byte of the sequence to decode
> > +
> > + * @param buf_end mark the end of the buffer, points to the next byte
> > + *                past the last in the buffer. This is used to avoid
> > + *                buffer overreads (in case of an unfinished UTF-8
> 
> > + *                sequence towards the end of the buffer).
> > + * @param flags    a collection of AV_UTF8_CHECK_FLAG_* flags
> 
> Nit: broken alignment.
[...] 
> The patch looks very fine to me now, thanks for bearing with me.

Updated.
-- 
FFmpeg = Frenzy and Fancy Mythic Powered Evil Goblin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-lavu-avstring-add-av_utf8_decode-function.patch
Type: text/x-diff
Size: 8385 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20131120/865b818d/attachment.bin>