[FFmpeg-devel] [PATCH] avutil/avstring: add av_strreplace API into avstring

Steven Liu lingjiujianke at gmail.com
Sun Apr 2 14:45:34 EEST 2017


2017-04-02 17:41 GMT+08:00 Nicolas George <george at nsup.org>:

> Le decadi 10 germinal, an CCXXV, Steven Liu a écrit :
> > refer to: http://creativeandcritical.net/str-replace-c
> > add av_strreplace API for replace string operations.
> >
> > Signed-off-by: Steven Liu <lq at chinaffmpeg.org>
> > ---
> >  libavutil/avstring.c | 77 ++++++++++++++++++++++++++++++
> ++++++++++++++++++++++
> >  libavutil/avstring.h |  5 ++++
> >  2 files changed, 82 insertions(+)
> >
> > diff --git a/libavutil/avstring.c b/libavutil/avstring.c
> > index 1787a1ef54..52e6e6cd13 100644
> > --- a/libavutil/avstring.c
> > +++ b/libavutil/avstring.c
> > @@ -231,6 +231,83 @@ int av_strncasecmp(const char *a, const char *b,
> size_t n)
> >      return c1 - c2;
> >  }
> >
>
> > +char *av_strreplace(const char *str, const char *from, const char *to)
>
> Before starting: Steven, you seem not completely familiar with the
> English language. Sometimes, I make complex sentences. If you need, ask
> and I will rephrase.
>
> First, I think the "strreplace" name should have been reserved for the
> case-sensitive version, using possibly strireplace for the
> case-insensitive version.
>
> Since it was only applied very recently, and is not yet used even in the
> rest of FFmpeg, I say we fix the name right now.
>
> (Note: that does not mean this minute, that means this week, leaving a
> chance for other to give their opinion. And of course, you yourself are
> allowed to disagree.)
>
> Second, I notice quite a few potential integer overflows. They would
> need to be fixed if the code is to stay that way. Fortunately, I think
> we can get rid of all that.
>
> Third, and most important for the rest of my discourse: this code is
> guilty of the sin of premature optimization. It builds a rather complex
> cache mechanism, but possibly cripples it by introducing an arbitrary
> limit, and more importantly, it uses av_stristr() repeatedly;
> av_stristr() itself is not optimized at all, and there are further
> optimizations for repeated use with the same needle; see the
> Knuth-Morris-Pratt algorithm. Keep that in mind if anything I write
> below seems "less efficient" than what is already there.
>
> I told that it duplicated the BPrint API, I was wrong: the API it
> duplicates dynarray. But it could have been BPrint, and they are the
> same anyway.
>
> Because there are three obvious ways of implementing this function.
>
> 1. (the present implementation) Find and keep track of all occurrences
>    of the needle, allocate the target string with the exact size, and
>    build it.
>
> 2. Find and count all occurrences of the needle, allocate the target
>    string with the exact size, and then re-find all occurrences of the
>    needle to build it.
>
> 3. Build the target string on the fly.
>
> (Lexicon: strstr(haystack, needle): to find a needle in a haystack.)
>
> By itself, the simplest solution to implement in C is #2 because it uses
> only a single dynamic allocation and very simple loops. It requires
> twice as much needle searches, but if they are optimized it may well be
> a non-issue.
>
> Solution #1 addresses the "twice as much needle searches" issue by
> storing the results. I believe it is a very bad idea, because storing
> the results requires a rather complex dynamic reallocation mechanism: in
> this code, the dynamic array to store the results is about one third of
> the whole code, and most of the integer overflows. One of the
> av_dynarray API can make it simpler, but it is still a little tricky to
> use. And I suspect it is not even worth the effort; only a benchmark
> could tell: premature optimization.
>
> I say: let us NOT use solution #1.
>
> Solution #3 takes a completely different road. To implement it in plain
> C would require the same kind of tricky dynamic reallocation as solution
> #1, for the target string instead of the positions. But we have the
> BPrint API that makes it very simple.
>
> My advice would be: go with solution #3, and if performance is a problem
> start by implementing KMP.
>
> If per chance you prefer solution #2, you can achieve it by removing all
> the "cache"-related code and replacing access to the "cache" with a new
> call to av_stristr().
>
>
> > +{
> > +    /* Adjust each of the below values to suit your needs. */
> > +    /* Increment positions cache size initially by this number. */
> > +    size_t cache_sz_inc = 16;
> > +    /* Thereafter, each time capacity needs to be increased,
> > +     * multiply the increment by this factor. */
> > +    const size_t cache_sz_inc_factor = 3;
> > +    /* But never increment capacity by more than this number. */
> > +    const size_t cache_sz_inc_max = 1048576;
>
> > +    uintptr_t *pos_cache_tmp, *pos_cache = NULL;
> > +    size_t cache_sz = 0;
>
> > +        /* Increase the cache size when necessary. */
> > +        if (cache_sz < count) {
> > +            cache_sz += cache_sz_inc;
> > +            pos_cache_tmp = av_realloc(pos_cache, sizeof(*pos_cache) *
> cache_sz);
> > +            if (!pos_cache_tmp) {
> > +                goto end_strreplace;
> > +            } else pos_cache = pos_cache_tmp;
> > +            cache_sz_inc *= cache_sz_inc_factor;
> > +            if (cache_sz_inc > cache_sz_inc_max) {
> > +                cache_sz_inc = cache_sz_inc_max;
> > +            }
> > +        }
> > +        pos_cache[count-1] = pstr2 - str;
> > +        pstr = pstr2 + fromlen;
>
> All this is the code needed to keep track of the positions of the
> needle. One way or another, it must go away.
>
> > +        retlen = orglen + (tolen - fromlen) * count;
>
> This is a potential integer overflow, for example.
>
> > +        /* Otherwise, duplicate the string whilst performing
> > +         * the replacements using the position cache. */
> > +        pret = ret;
> > +        memcpy(pret, str, pos_cache[0]);
> > +        pret += pos_cache[0];
> > +        for (i = 0; i < count; i++) {
> > +            memcpy(pret, to, tolen);
> > +            pret += tolen;
> > +            pstr = str + pos_cache[i] + fromlen;
> > +            cpylen = (i == count-1 ? orglen : pos_cache[i+1]) -
> pos_cache[i] - fromlen;
> > +            memcpy(pret, pstr, cpylen);
> > +            pret += cpylen;
> > +        }
>
> If there are n copies of the needle, there are n+1 pieces of string
> around them, and one of them has to be taken out of the loop. This
> version isolates the one before the first needle. It would have been
> much simpler to isolate the one after the last needle. The formula for
> cpylen, in particular, is a symptom of this non-optimal choice. The loop
> could look that way:
>
>         while (needle) {
>             copy from pos to needle
>             update pos
>             add replacement string
>         }
>         copy from pos to the end
>
> The BPrint version is actually exactly the same:
>
>         av_bprint_init()
>         while (needle) {
>             av_bprint_append_data() to copy from pos to needle
>             update pos
>             av_bprint_append_data() to add the replacement string
>         }
>         av_bprint_append_data() to copy from pos to the end
>         av_bprint_is_complete() to check malloc errors
>         av_print_finalize()
>
> All in all, I think the BPrint version is the easiest to implement right
> now.
>

I'm not sure if i misunderstand, but i will try to do it with your
suggestion.
I will try to understand BPrint and try to use it.

>
> Regards,
>
> --
>   Nicolas George
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
>


More information about the ffmpeg-devel mailing list