[FFmpeg-devel] [PATCH 2/5] lavu: add text_file API.

wm4 nfxjfg at googlemail.com
Sat Aug 10 16:54:58 CEST 2013

On Sat, 10 Aug 2013 15:03:50 +0200
Nicolas George <nicolas.george at normalesup.org> wrote:

> Le tridi 23 thermidor, an CCXXI, wm4 a écrit :
> > This does not work. You're trying to detect legacy 8-bit codepage
> > encodings by testing whether conversion to iconv works.
> > For example, as far as I remember, ISO-8859-1 is at best a
> > subset of WINDOWS-1252, so your code will never detect ISO-8859-1.
> You are wrong, once again. windows-1252 uses some codes in the 128-159 range
> for extra characters, and the others are undefined, while ISO-8859-1 defines
> the whole range as control characters.
> ISO-8859-1 is placed in the list of encodings by default as a catch-all case,
> since any binary sequence is valid ISO-8859-1.

I was wrong here about which of these charsets has the narrower range. But
this test doesn't give meaningful results anyway. For most inputs, even
trying to distinguish ISO-8859-1 from WINDOWS-1252 will easily fail. What
about other charsets? iconv supports literally hundreds of charsets, and
most of them are 8 bit. How can you hope to get any meaningful results?
It's near impossible; at best you can try statistical methods. The
east-asian multibyte character sets are worse, btw.

If all you care about is that the strings you read are valid UTF-8, you
might as well insert Unicode replacement characters for the invalid
bytes. This is actually what Unicode recommends, and it's much simpler.
It also doesn't pretend the input charset was detected correctly. (I
think GNU iconv can do this with //TRANSLIT - although it's probably GNU
specific, and I don't know if it really uses U+FFFD.)

> > This kind of detection will most likely lead to broken text, which is
> > silently accepted without even printing a warning anywhere.
> If ISO-8859-1 is in the list of encodings, that is the normal and expected
> behaviour. As Reimar pointed out some time ago, users of exotic encodings
> are usually familiar with how their language looks when improperly read as
> ISO-8859-1, and know how to deal with it.

Weren't you arguing for raising an error instead of producing a character
mess? This behavior doesn't even allow checking the result
programmatically.

Why again were you vehemently fighting against removing the dropping
of subtitle lines, instead of just implementing (or even just
suggesting) some sort of transliteration of the output? I still insist
that the application can do a better job at handling broken input than
libavcodec (the fact alone that the application knows whether it's
transcoding or displaying counts a lot).

I guess an application using ffmpeg will have to do some stupid things
like special-casing subtitle formats, and trying to reopen the file
a second time with the correct encoding, or so. That would be quite a
PITAful collection of hacks and workarounds though. I just keep asking
myself: why?

Some months ago I suggested that MPlayer use ffmpeg's subtitle code, but
that sounds like a joke now. MPlayer's charset handling code is not very
good, but it's easily better than ffmpeg's, or than what you're proposing
here.
> > Additionally, the application (let alone an ffmpeg.c user) doesn't have
> > the slightest control over how character set detection and conversion
> > is handled
> Again, you are wrong, that is exactly what the AVTextFile.encodings field is
> there for.

That doesn't allow the application to do its own detection. And the list
of encodings can only be set _before_ opening the file. It doesn't allow
retrieving the packetized file contents losslessly either, unless the
file is in UTF-8.

> > Not checking for av_calloc result.
> Indeed, thanks.
> > Why does it create an array of lines? This seems slow, complex, and
> > unnecessary.
> It makes further line-based processing much easier. That is the whole point
> of this kind of API.

Easier and less efficient? Introducing an API function that reads a
line of text would be easy enough.

> > Separate \n and \r processing also prevents you from treating MacOS
> > line endings correctly.
> I believe this kind of file disappeared more than a decade ago. If it is
> not so, adding a flag to deal with them is easy enough.
> > Why does this use its own callback facility instead of using avio?
> Because avio is in lavf, not lavu.

Then move the code instead of creating awkward workarounds for a
library separation that exists in the heads of ffmpeg developers only?

> > All in all, this patch adds lots of code without actually improving
> > anything, except testing for UTF-16.
> Hopefully you will forgive me for not trusting your judgement after the many
> technical mistakes you made.

I hereby question your technical competence as well.
