[FFmpeg-devel] [PATCH] libavcodec: Do not return encoding errors when -sub_charenc_mode is do_nothing

Nicolas George nicolas.george at normalesup.org
Sun Sep 8 19:38:15 CEST 2013


On nonidi 19 Fructidor, year CCXXI, Eelco Lempsink wrote:
> Ah, but this is where it gets interesting.  The fact that there is no
> perfect solution means there is an opportunity for design.  That is where
> user feedback and naming of options and choosing the right defaults
> matter.  And that is the kind of discussion I wish we could have.  I’ll
> start.

Ok, good.

> I think FFmpeg should try to do character encoding detection when no
> character encoding is specified using a library that uses a heuristic
> based on statistics.  As with analyzing video streams, FFmpeg can analyze
> a bit of the subtitles and report the most probable encoding.  When trying
> to run FFmpeg on a subtitle stream whose encoding could not be detected
> (with enough confidence) and without specifying an encoding by hand, it
> will exit with an error.

There are several issues to discuss.

First of all, analyzing audio and video to find the resolution or number of
channels requires only a few frames, possibly just one. OTOH, analyzing text
encoding based on statistical methods requires a lot of text.

Therefore, it cannot work for embedded subtitles, since it would hit the
probesize limit long before it had anywhere near enough text.

Fortunately, AFAIK, all formats that allow embedded text subtitles specify
the character encoding, making probing unnecessary. (Broken files probably
exist; requiring user intervention to handle broken files seems acceptable.)

That leaves the plain text subtitles files.

The second point I want to make is that when lavc/lavf probe the audio and
video streams, they do so using the codecs in lavc, not external libraries
(except for a few codecs), and the result is reliable. FFmpeg is not
intended to host a complex encoding-detection library, nor should such a
library be a mandatory dependency.

Therefore, probing using a statistical library should be optional, and for
consistency of behaviour, it should be optional both at build time and at
run time.
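
To illustrate what "optional both at build time and at run time" could look
like, here is a minimal sketch. Every name in it (CONFIG_CHARDET, the
sub_detect_mode enum, select_detector) is hypothetical and not an existing
FFmpeg symbol; it only shows the two switches interacting.

```c
#include <assert.h>

/* Hypothetical build-time switch: would be set to 1 by configure when the
 * third-party detection library is enabled. Not an existing FFmpeg symbol. */
#ifndef CONFIG_CHARDET
#define CONFIG_CHARDET 0
#endif

/* Hypothetical run-time choice, e.g. selected through a user option. */
enum sub_detect_mode {
    SUB_DETECT_NAIVE       = 0, /* built-in simple check, always available */
    SUB_DETECT_STATISTICAL = 1, /* needs the optional external library */
};

/* Returns the mode to use, or -1 if the user asked at run time for a
 * detector that was not compiled in. */
static int select_detector(int requested)
{
    assert(requested == SUB_DETECT_NAIVE ||
           requested == SUB_DETECT_STATISTICAL);
    if (requested == SUB_DETECT_STATISTICAL && !CONFIG_CHARDET)
        return -1;
    return requested;
}
```

With the default shown here (library not built in), asking for the
statistical detector fails explicitly instead of silently falling back.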

That brings us back to my original proposal: use by default a simple and
naïve (but good enough for a lot of cases) algorithm, and leave the option
of compiling FFmpeg with support for third-party libraries doing a smarter
detection.
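
The naive algorithm is not spelled out in this message. One plausible
reading, given here purely as an assumption, is: if the whole file is
well-formed UTF-8, treat it as UTF-8, otherwise use a fallback encoding that
the user or the build chose. The helper names below are hypothetical, not
FFmpeg APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8. */
static int is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;

    assert(buf || len == 0);
    while (i < len) {
        uint8_t c = buf[i];
        size_t n;           /* number of continuation bytes expected */
        uint32_t cp, min;   /* decoded code point, smallest legal value */

        if (c < 0x80)                { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation or invalid lead byte */

        if (n > len - i - 1)
            return 0;       /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;   /* expected a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;       /* overlong, out of range, or surrogate */
        i += n + 1;
    }
    return 1;
}

/* Hypothetical: pick an encoding name for a plain-text subtitle file. */
static const char *guess_sub_charenc(const uint8_t *buf, size_t len,
                                     const char *fallback)
{
    return is_valid_utf8(buf, len) ? "UTF-8" : fallback;
}
```

Text in a legacy 8-bit encoding that contains non-ASCII characters almost
never validates as UTF-8 by accident, which is why such a check is good
enough for a lot of cases; it cannot tell one 8-bit encoding from another.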

> Fully agreed.  
> 
> Talking about simplification, would it be useful to simplify the current
> situation first before introducing new stuff?  I reviewed the code and I
> fail to see the need for the ‘sub_charenc_mode’ option:

The reason is that sub_charenc_mode is needed by parts of the code that
have not been written yet.

> I would therefore argue that the -sub_charenc_mode option should be
> removed (also preventing confusion over what the options mean exactly).

As I explained once before, sub_charenc_mode is there to allow the recoding
to happen before the decoder, for example in the demuxer. Without it, either
the demuxer sets sub_charenc, and the recoding is done twice, or it does not
set it and loses information.
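
A simplified sketch of that logic, to make the double-recoding problem
concrete. The enum below is a local stand-in loosely modelled on the
do_nothing / automatic / pre_decoder modes; the names and the helper
functions are assumptions for illustration, not the actual lavc code.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the three modes discussed in this thread. */
enum charenc_mode {
    MODE_DO_NOTHING  = -1, /* someone upstream (e.g. the demuxer) recoded */
    MODE_AUTOMATIC   =  0, /* let the library decide */
    MODE_PRE_DECODER =  1, /* recode packets just before the decoder */
};

/* In automatic mode, a known encoding means "recode before decoding". */
static enum charenc_mode resolve_mode(enum charenc_mode mode,
                                      const char *sub_charenc)
{
    if (mode == MODE_AUTOMATIC && sub_charenc && sub_charenc[0])
        return MODE_PRE_DECODER;
    return mode;
}

/* Returns 1 if the pre-decoder step should convert the text to UTF-8. */
static int decoder_should_recode(enum charenc_mode mode,
                                 const char *sub_charenc)
{
    assert(mode >= MODE_DO_NOTHING && mode <= MODE_PRE_DECODER);
    if (!sub_charenc || !sub_charenc[0])
        return 0; /* no known source encoding, nothing to convert from */
    return resolve_mode(mode, sub_charenc) == MODE_PRE_DECODER;
}
```

In this sketch a demuxer that has already recoded the text can still report
the original encoding in sub_charenc and set the mode to do_nothing: the
information is kept and the text is not converted a second time.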

Do you have a better proposal?

Regards,

-- 
  Nicolas George