[FFmpeg-devel] [PATCH 2/4] avcodec/samidec: use ff_htmlmarkup_to_ass()

Clément Bœsch u at pkh.me
Sat Aug 15 20:24:52 CEST 2015


On Sat, Aug 08, 2015 at 09:24:03PM -0700, Yayoi Ukai wrote:
> On Sat, Aug 8, 2015 at 8:23 AM, Clément Bœsch <u at pkh.me> wrote:
> > On Fri, Aug 07, 2015 at 11:03:29PM -0700, Yayoi wrote:
> >> ---
> >>  libavcodec/samidec.c    | 59 +++++++++++++++++++++++++++++--------------------
> >>  tests/ref/fate/sub-sami | 18 +++++++--------
> >>  2 files changed, 44 insertions(+), 33 deletions(-)
> >>
> >> diff --git a/libavcodec/samidec.c b/libavcodec/samidec.c
> >> index 47850e2..41826a7 100644
> >> --- a/libavcodec/samidec.c
> >> +++ b/libavcodec/samidec.c
> >> @@ -27,6 +27,7 @@
> >>  #include "ass.h"
> >>  #include "libavutil/avstring.h"
> >>  #include "libavutil/bprint.h"
> >> +#include "htmlsubtitles.h"
> >>
> >>  typedef struct {
> >>      AVBPrint source;
> >> @@ -41,11 +42,12 @@ static int sami_paragraph_to_ass(AVCodecContext *avctx, const char *src)
> >>      char *tag = NULL;
> >>      char *dupsrc = av_strdup(src);
> >>      char *p = dupsrc;
> >> +    char *pcopy = NULL;
> >> +    int index = 0;
> >> +    int second_paragraph = 0;
> >>
> >> -    av_bprint_clear(&sami->content);
> >>      for (;;) {
> >>          char *saveptr = NULL;
> >> -        int prev_chr_is_space = 0;
> >>          AVBPrint *dst = &sami->content;
> >>
> >>          /* parse & extract paragraph tag */
> >> @@ -77,37 +79,46 @@ static int sami_paragraph_to_ass(AVCodecContext *avctx, const char *src)
> >>              goto end;
> >>          }
> >>
> >
> >> -        /* extract the text, stripping most of the tags */
> >> +        /* check for the second paragrph */
> >
> > Why change the comment? What does "check" mean here? What is the "second
> > paragraph"?
> 
> 
> I answer it below with the other questions you have because you are
> basically asking the same things.
> And I will document better too.
> 
> >
> >> +        pcopy = av_strdup(p);
> >>          while (*p) {
> >>              if (*p == '<') {
> >> -                if (!av_strncasecmp(p, "<P", 2) && (p[2] == '>' || av_isspace(p[2])))
> >> +                if (!av_strncasecmp(p, "<p", 2) && (p[2] == '>' || av_isspace(p[2]))) {
> >> +                    second_paragraph = 1;
> >>                      break;
> >> -                if (!av_strncasecmp(p, "<BR", 3))
> >> -                    av_bprintf(dst, "\\N");
> >> -                p++;
> >> -                while (*p && *p != '>')
> >> -                    p++;
> >> -                if (!*p)
> >> -                    break;
> >> -                if (*p == '>')
> >> -                    p++;
> >> -                continue;
> >> +                }
> >>              }
> >> -            if (!av_isspace(*p))
> >> -                av_bprint_chars(dst, *p, 1);
> >> -            else if (!prev_chr_is_space)
> >> -                av_bprint_chars(dst, ' ', 1);
> >> -            prev_chr_is_space = av_isspace(*p);
> >>              p++;
> >> +            index++;
> >> +        }
> >> +        p = p - index;
> >> +        if (second_paragraph) {
> >> +            p[index] = 0;
> >>          }
> >> -    }
> >>
> >> -    av_bprint_clear(&sami->full);
> >> -    if (sami->source.len)
> >> -        av_bprintf(&sami->full, "{\\i1}%s{\\i0}\\N", sami->source.str);
> >> -    av_bprintf(&sami->full, "%s", sami->content.str);
> >> +        ff_htmlmarkup_to_ass(avctx, dst, p);
> >> +
> >> +        /* add the source if there are any. */
> >> +        av_bprint_clear(&sami->full);
> >> +        if (sami->source.len) {
> >> +            av_bprintf(&sami->full, "{\\i1}%s{\\i0}\\N", sami->source.str);
> >> +            av_bprintf(&sami->full, "%s", sami->content.str);
> >> +            if (second_paragraph) {
> >> +                second_paragraph = 0;
> >> +                p = pcopy;
> >> +                p += index;
> >> +                index = 0;
> >> +                continue;
> >> +            }
> >> +        } else {
> >> +            av_bprintf(&sami->full, "%s", sami->content.str);
> >> +        }
> >> +        av_bprint_clear(&sami->content);
> >> +
> >> +    }
> >>
> >
> > This looks clumsy at best: you are finalizing the subtitle event inside
> > the paragraph loop when it should be outside. It also seems there is a
> > duplicating "second paragraph" logic even though the loop is supposed to
> > handle one paragraph at a time.
> 
> I know. It is clumsy.. I explain it below as well.
> 
> >
> > If you are uncomfortable with the current logic or believe it's badly
> > designed for the logic you're trying to mangle, feel free to rewrite it;
> > it's better than trying to inhibit the old behaviour with hacks.
> 
> It's not that your code was badly designed to begin with. I feel like
> it is the nature
> of this format.
> 
> So let me explain what was going on your old code and what I had changed.
> Here is the example: (This one is from current fate sample.)
> 
> <SYNC Start=73000>
>       <P Class=ENUSCC ID=Source>End of:
>       <P Class=ENUSCC>President John F. Kennedy Speech
> 
> So this is one subtitle event and there are two paragraphs. One
> paragraph has a new source and
> the other paragraph has its content.
> Since it is decided that this code uses html
> parser(ff_htmlmarkup_to_ass) and use it to parse the contents of the
> paragraph. Each paragraph content needs to go to to html parser,
> regardless whether it is source or subtitle content.
> 
> So I need to tell html_parser, ff_htmlmarkup_to_ass that where the
> destination is.. (whether it is source or content).
> I can't no longer specify the destination which I was able to do in
> your old code. I mean maybe I can change the function header and add
> additional flag to tell our parser where to put in the dst but I don't
> want to because it may require to change srtdec and I do not want to
> do that..
> 
> Anyways, so that's the basic idea. I honestly think it is good that
> subtitle event handle is also in loop because one loop iteration per
> one subtitle event and subtitle event does not guarantee that it
> contains only one paragraph to begin with. It was a possibility either
> adding additional loop or having goto to honor your original logic but
> after I noticed that subtitle event contains more than one paragraph,
> I gave up on the idea to keep your logic. The comment says "check
> second paragraph" but it should be able to handle multiple paragraphs
> more than 2 in one subtitle event. (I definitely tested while I was
> working on this part.. I can modify sample or attach test file if that
> makes you feel happier. And I make a better comment then.)
> 
> So I want to leave it as it is if you do not see major flaws or you do
> not have very very strong opinion about how it needs to be modified.
> Well, given the information, if you have some ideas, you can tell me
> too. Besides, I can't cross out the possibility if you or me or
> someone wants to add more enhancement (maybe one day you decided to
> add support for img tag? who knows..) , and it may need to change code
> structure again. And as it is now, it is still enhancement that works.
> ;-).
> 

I think it's breakable one way or another without that much effort.

Here is a suggestion based on the original code:

 - make sure you have 5 bprint buffers: source_sami, content_sami,
   source_encoded, content_encoded, full (you already have 3 of them in
   the context)

 - remove the current markup processing from the loop and make sure the
   loop just adds the content of each paragraph where it belongs (that is,
   in either source_sami or content_sami). The code already does that, so
   there is probably no change to do (code already concats in each bprint
   buffer if more than 2 paragraphs happen).

 - after the loop ends, you now have the SAMI data of your packet
   split into content_sami and eventually in source_sami, without the SAMI
   crap (such as <P ...>). So what you need to do is just call the markup
   on content_sami (and eventually source_sami if there is data) to create
   content_encoded (and eventually source_encoded).

 - the existing code will then finish things up by concatenating source &
   content as it is already doing

This is probably way more solid than trying to interlace another partial
sub-loop in a loop of the same purpose like you proposed.

If you do not want to do that, I will have to push your version and do the
above or similar before pushing the whole set.

[...]

-- 
Clément B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 473 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20150815/ceda11f3/attachment.sig>


More information about the ffmpeg-devel mailing list