[FFmpeg-trac] #2431(undetermined:new): ffmpeg subtitle encoding of special characters does not working correctly

FFmpeg trac at avcodec.org
Thu Apr 4 22:07:35 CEST 2013

#2431: ffmpeg subtitle encoding of special characters does not working correctly
             Reporter:  Nick         |                    Owner:
                 Type:  defect       |                   Status:  new
             Priority:  normal       |                Component:
              Version:  git-master   |  undetermined
             Keywords:  sub srt      |               Resolution:
             Blocking:               |               Blocked By:
Analyzed by developer:  0            |  Reproduced by developer:  0

Comment (by Nick):

 You are right, the UTF-8 BOM is optional, but there are various software
 tools that can detect the encoding type (meaning ANSI text, UTF-8 with BOM
 or UTF-8 without BOM, but not the code page).
 I tested MP4Box with *.srt files in ANSI, UTF-8 and UTF-8 w/o BOM. MP4Box
 seems to detect the encoding type and produces the same result in all
 three cases! It is possible!
 Another example is the open source tool Notepad++, which can also detect
 the encoding type. Maybe the source code of such tools contains methods
 for detecting the right encoding type.
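
 As a rough illustration of how such a detection could work, here is a
 minimal Python sketch. It is not taken from MP4Box or Notepad++; the
 function name and the three labels it returns are made up for this
 example. It checks for a BOM first, then tries a strict UTF-8 decode, and
 treats everything else as text in some unknown single-byte code page.

```python
import codecs


def detect_subtitle_encoding(data: bytes) -> str:
    """Classify raw subtitle bytes as 'utf-8-bom', 'utf-8' or 'legacy'.

    Hypothetical helper for illustration only.
    """
    # A leading BOM (EF BB BF) marks the file explicitly as UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: some single-byte code page, but which one is unknown.
        return "legacy"
    # Note: pure ASCII input also ends up here, since ASCII is valid UTF-8.
    return "utf-8"
```

 Text in a legacy code page that contains special characters almost never
 forms valid UTF-8 by accident, which is why the strict decode works as a
 test. The sketch cannot tell CP-1252 from any other code page, though.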

 ISO-8859-1 and CP-1252 are not exactly the same, but the special
 characters used in my "subtitle_test.srt" are the same in both! Hence the
 little comment in my srt file ;-) ...
 ''"These are printable characters of ISO-8859-1:
 (*str >= 32 && *str < 128) || (*str >= 160 && *str <= 255)"''
 ... for this range the two are exactly the same.
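
 The claim about this range can be checked with a few lines of Python. This
 is only a sketch that relies on Python's built-in "iso-8859-1" and
 "cp1252" codecs; the helper names are invented for the example.

```python
def in_quoted_range(byte: int) -> bool:
    """Mirror the range check quoted from the srt file comment."""
    return 32 <= byte < 128 or 160 <= byte <= 255


def decodes_identically(byte: int) -> bool:
    """True if ISO-8859-1 and CP-1252 map this byte to the same character."""
    raw = bytes([byte])
    try:
        return raw.decode("iso-8859-1") == raw.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes in 0x80..0x9F undefined.
        return False


# Every byte inside the quoted range decodes the same in both encodings.
same_in_range = all(decodes_identically(b) for b in range(256) if in_quoted_range(b))

# The bytes where the two encodings disagree all lie in 128..159.
differing = [b for b in range(256) if not decodes_identically(b)]
```

 The only bytes that differ are 128 to 159: CP-1252 puts characters such as
 the euro sign and typographic quotes there, while ISO-8859-1 has control
 codes in that block.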

 For most European languages such as French, German, Italian, Spanish and
 others, it is enough to use CP-1252 or ISO-8859-1 as the default.
 '''More important for the imported subtitle file is the question:
 "Is it plain text or is it already UTF-8?"'''

 My proposal for selecting a default code page for every subtitle stream:
 - If no language is defined for the subtitle stream or the language is
 unknown: [[BR]]
  --> use CP-1252 as default (or ISO-8859-1)
 - If a language is defined (e.g. with '''-metadata:s:s:0 language=ger'''):
 [[BR]] --> use a selection table to set the code page automatically
 - If a dedicated code page is selected by an option like
 "''-sub_charenc''": [[BR]] --> use that setting instead of the others
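
 The three rules above could be sketched like this in Python. The table
 contents, the function name and the parameter names are assumptions made
 for illustration; no such table exists in ffmpeg as far as this comment is
 concerned, and a real one would need far more entries.

```python
from typing import Optional

DEFAULT_CHARENC = "CP1252"

# Hypothetical selection table keyed by ISO 639-2 language code.
# The entries are illustrative guesses, not an agreed mapping.
LANGUAGE_CHARENC = {
    "ger": "CP1252",
    "fre": "CP1252",
    "ita": "CP1252",
    "spa": "CP1252",
    "pol": "CP1250",
    "rus": "CP1251",
    "gre": "CP1253",
    "tur": "CP1254",
}


def choose_charenc(language: Optional[str] = None,
                   sub_charenc: Optional[str] = None) -> str:
    """Pick a code page for one subtitle stream."""
    # Rule 3: an explicitly selected code page overrides everything else.
    if sub_charenc:
        return sub_charenc
    # Rule 2: a known language is looked up in the selection table.
    if language:
        return LANGUAGE_CHARENC.get(language.lower(), DEFAULT_CHARENC)
    # Rule 1: no language (or an unknown one) falls back to the default.
    return DEFAULT_CHARENC
```

 For example, choose_charenc(language="ger") would pick CP1252 from the
 table, while choose_charenc(language="ger", sub_charenc="CP1250") would
 honour the explicit setting.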

Ticket URL: <https://ffmpeg.org/trac/ffmpeg/ticket/2431#comment:8>
FFmpeg <http://ffmpeg.org>
FFmpeg issue tracker
