[FFmpeg-trac] #6021(avcodec:new): tx3g / mov_text subtitles are not encoded correctly in some specific cases
FFmpeg
trac at avcodec.org
Wed Dec 14 08:25:49 EET 2016
#6021: tx3g / mov_text subtitles are not encoded correctly in some specific cases
-------------------------------------+-------------------------------------
Reporter: erikbs | Type: defect
Status: new | Priority: normal
Component: avcodec | Version: 3.2.1
Keywords: utf8 mov_text ttxt tx3g subtitles mp4 | Blocked By:
Blocking: | Reproduced by developer: 0
Analyzed by developer: 0 |
-------------------------------------+-------------------------------------
Consider the following command:
{{{
ffmpeg -i input.mp4 -i input.srt -c:a copy -c:v copy -c:s mov_text output.mp4
}}}
for converting SRT subtitles into 3GPP timed text (TTXT) and embedding
them inside an MPEG4 container. Until recently, ffmpeg ignored any
formatting/styling in the SRT file and converted only the raw text and
timestamps. The resulting files played without problems.
In SRT files, the start and end points of the text to be formatted are
determined by tags, e.g. <i> and </i>. In TTXT subtitles, the start and
end points are instead saved as numbers. Currently ffmpeg measures these
values in bytes, but it looks like they should be measured in characters
instead. For example, the Chinese character 我 is encoded as three bytes
in UTF-8, but is considered a single character.
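To make the byte/character distinction concrete, here is a minimal C sketch. The helper name `utf8_char_count` is my own and is not an FFmpeg function. It assumes valid UTF-8 input and treats one Unicode code point as one "character", which may not match every player's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of FFmpeg): counts Unicode code points in a
 * NUL-terminated UTF-8 string by skipping continuation bytes, i.e. bytes of
 * the form 10xxxxxx. Assumes the input is valid UTF-8. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the UTF-8 encoding of 我 (the bytes E6 88 91), `strlen` reports 3 while `utf8_char_count` reports 1.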
The problem arises when I try to convert an SRT where a line contains
multibyte characters and a formatted string, and there are fewer than X
characters between the formatted string and the end of the line, where X
is the difference between the length of the line in bytes and the length
in characters. Take for example this SRT line:
{{{
1
00:00:01,000 --> 00:00:02,000
The character 我 consists of three bytes
<i>this string will cause problems</i>
}}}
Measured in characters, the formatted string starts at position 40 and
ends at position 71 (i.e. the character at position 71 is not part of the
string). Measured in bytes (excluding the tags of course), the string
starts at position 42 and ends at position 73, which are the values ffmpeg
stores inside the output file. When I open this file in QuickTime Player
on Mac OS X, it seems to expect that these numbers be measured in
characters. Since 我 is counted as one character, it proceeds to read off
the end of the line (i.e. past the 73rd byte), resulting in an instant,
brutal crash. VLC, which appears to handle errors better, either tries to
correct the error or just ignores the formatting when invalid data is
found.
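The offsets quoted above can be checked with a short C sketch. `sample_cue` and `utf8_byte_to_char_offset` are names I made up for illustration. The cue text is the sample line with its tags removed and a single `\n` between the two lines, which is what the numbers in this report imply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The sample cue with the <i> tags removed. U+6211 is spelled out as its
 * three UTF-8 bytes so the source encoding does not matter. */
static const char sample_cue[] =
    "The character \xE6\x88\x91 consists of three bytes\n"
    "this string will cause problems";

/* Hypothetical helper: returns the number of code points that start within
 * the first byte_off bytes of s (assumes valid UTF-8). */
static size_t utf8_byte_to_char_offset(const char *s, size_t byte_off)
{
    size_t i, n = 0;

    assert(s != NULL && byte_off <= strlen(s));
    for (i = 0; i < byte_off; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

With this text, byte offsets 42 and 73 map to character offsets 40 and 71. The whole cue is 73 bytes but only 71 characters long, so an end position of 73 read as a character index points two characters past the end of the text.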
I used MP4Box to convert the SRT to TTXT and to extract the TTXT from the
MP4 generated by ffmpeg using
{{{
mp4box -ttxt input.srt # convert SRT to TTXT
mp4box -ttxt 3 output.mp4 # extract the third stream (subtitles)
}}}
When I compared the output files, it immediately became clear that MP4Box
counts characters while ffmpeg counts bytes. During testing I was able to
confirm that VLC counts in the same way as MP4Box and QuickTime: in
characters, not bytes. It should also be mentioned that the standalone
TTXT files, which are XML files, contain the properties ''fromChar'' and
''toChar'', further indicating that we should count characters and not
bytes. When stored inside an MPEG4 container, the TTXT files are
“compressed” into some binary format I do not fully understand instead of
using XML style tags. By replacing the correct bytes in the file produced
by ffmpeg (byte count --> character count) using a hex editor, the file
played correctly in VLC and QuickTime (with the correct letters
italicized), and I also got MP4Box to extract an SRT that looked correct
(without the hex editing, the SRT generated by MP4Box from the file
produced by ffmpeg had the tags in the wrong place).
The erroneous data seems to be written by the function ''encode_styl'' in
the file ''libavcodec/movtextenc.c'', at lines 108-109 to be precise. Here
the raw byte positions are written to the file. These are passed to the
function through a ''MovTextContext'' struct, which has a member called
''style_attributes'' – an array where each element corresponds to a
formatted string. Each element in this array is another struct, having
members such as ''style_start'' and ''style_end''. At the moment I have no
idea where these values are produced, but the ''encode_styl'' function
writes them to the file.
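For orientation, this is roughly what a single style record looks like on disk, as far as I understand the 3GPP TS 26.245 layout: 12 bytes, with the start and end positions stored as 16-bit big-endian values. The struct and function names below are hypothetical and this is not the code in ''movtextenc.c''; treat the field layout as my reading of the specification rather than a verified reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of one tx3g style record (assumed layout:
 * startChar, endChar, font-ID as 16-bit values, then face-style-flags,
 * font-size, and a 4-byte RGBA colour). */
typedef struct {
    uint16_t start_char;  /* first styled character */
    uint16_t end_char;    /* one past the last styled character */
    uint16_t font_id;
    uint8_t  face_flags;  /* assumed bit flags: 1 bold, 2 italic, 4 underline */
    uint8_t  font_size;
    uint8_t  rgba[4];
} StyleRecordSketch;

/* Stores a 16-bit value in big-endian byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFF);
}

/* Serializes one record into a 12-byte buffer. */
static void write_style_record(uint8_t out[12], const StyleRecordSketch *r)
{
    int i;

    assert(out != 0 && r != 0);
    put_be16(out + 0, r->start_char);
    put_be16(out + 2, r->end_char);
    put_be16(out + 4, r->font_id);
    out[6] = r->face_flags;
    out[7] = r->font_size;
    for (i = 0; i < 4; i++)
        out[8 + i] = r->rgba[i];
}
```

Under that assumption, the first four bytes of the record for the sample line should read `00 28 00 47` (40 and 71), whereas a byte-based encoder would write `00 2A 00 49` (42 and 73). These are the bytes one would look for in a hex editor.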
'''To correct the bug''', the code that counts bytes and writes these
values to the struct that is eventually passed to ''encode_styl'' should
be corrected, so that it counts characters instead. I guess ffmpeg already
has code for counting UTF-8 characters, but if it does not, then
[http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html this
article] presents various ways of doing it.
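As a sketch of the idea only: the position counter should advance once per character rather than once per byte while the text is being copied. The example below strips a single `<i>`...`</i>` pair and records the span in characters. `ItalicSpan` and `strip_italic_tags` are invented for this illustration, the real encoder's parsing path is different, and valid UTF-8 input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; not FFmpeg's style struct. */
typedef struct {
    size_t start_char; /* index of the first italic character */
    size_t end_char;   /* index one past the last italic character */
    int    found;      /* non-zero if an <i> tag was seen */
} ItalicSpan;

/* Copies src to dst without <i> and </i> tags and records where the italic
 * text starts and ends, measured in characters (code points), not bytes.
 * dst must have room for at least strlen(src) + 1 bytes. */
static ItalicSpan strip_italic_tags(const char *src, char *dst)
{
    ItalicSpan span = {0, 0, 0};
    size_t chars = 0; /* characters copied so far */

    assert(src != NULL && dst != NULL);
    while (*src) {
        if (strncmp(src, "<i>", 3) == 0) {
            span.start_char = chars;
            span.found = 1;
            src += 3;
        } else if (strncmp(src, "</i>", 4) == 0) {
            span.end_char = chars;
            src += 4;
        } else {
            /* Count only bytes that begin a code point. */
            if (((unsigned char)*src & 0xC0) != 0x80)
                chars++;
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return span;
}
```

Run on the sample cue from this report, the sketch yields a span of 40 to 71 while the copied text is 73 bytes long, matching the values MP4Box produces.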
--
Ticket URL: <https://trac.ffmpeg.org/ticket/6021>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker