[FFmpeg-devel] [PATCH] Support for UTF8 filenames on Windows

Sat Jul 18 19:11:02 CEST 2009

Ramiro Polla wrote:
> On Sat, Jul 18, 2009 at 2:00 AM, Karl Blomster<thefluff at uppcon.com> wrote:
>> Ramiro Polla wrote:
>>> On Thu, Jul 16, 2009 at 2:55 PM, Karl Blomster<thefluff at uppcon.com> wrote:
>>>> Ramiro Polla wrote:
>>>>> On Thu, Jul 16, 2009 at 11:20 AM, Karl Blomster<thefluff at uppcon.com>
>>>>> wrote:
>>>>>> Unless I am severely missing something in your updated patch (thanks
>>>>>> for the nice work, by the way!) it will not work with the FFmpeg
>>>>>> commandline program. If you want a Unicode commandline in Windows you
>>>>>> need to use wmain() or _tmain() instead of plain old main(), AFAIK. As
>>>>>> I said earlier, my original patch was only intended to let the API
>>>>>> support Unicode. Working it into ffmpeg.c would be a lot more work, I
>>>>>> think.
>>>>> How do you test UNICODE support?
>>>>>
>>>>> I used attached shell file with msys (sh test_unicode.sh) and it works
>>>>> as expected (only the unicode filename without FF_WINUTF8 fails). I
>>>>> also tested with an app that used Find(First,Next)FileA() and passed
>>>>> the unicode filenames as ascii string to ff_winutf8_open() and it also
>>>>> worked as expected.
>>>> Plain old cmd.exe (both with and without the chcp 65001 trick). I can do
>>>> stuff like notepad.exe <unicode filename> and it'll work fine, but with
>>>> ffmpeg it just says file not found (and prints a bogus filename). It
>>>> works fine with mingw's sh; MinGW probably does some kind of black magic
>>>> there to get Unix apps to work without having to patch in the Windows
>>>> mess. The API works fine, of course.
>>> Do you know of any real example where a codepage->utf8 conversion
>>> fails? I only see some possible theoretical references scattered
>>> around the web, but no real examples.
>> Not sure what you mean here. A given character string in a known codepage
>> should always be possible to convert to UTF8, assuming that all the glyphs
>> have UTF8 equivalents. I'm not sure if any codepages that aren't fully
>> translatable actually exist.
>>
>>> I'm tempted to do the following:
>>> - Always expect filenames in Windows to be passed in UTF8.
>> This is dangerous for the reasons I mentioned earlier; namely that it isn't
>> possible to reliably detect if a given string is UTF8 or not. Lots of
>> applications using the ffmpeg API will pass strings in the local codepage,
>> and it's theoretically quite possible that a given string in some unknown
>> codepage could translate as valid UTF8 while not actually being that. For
>> example, the ISO8859-1 string 0xC3 0xA1 (capital letter A with tilde +
>> inverted exclamation mark) will translate as valid UTF8, but the result will
>> be the single character U+00E1 (small letter a with acute), which is
>> obviously wrong.
>>
>> While the likelihood of this actually happening in a real-world filename may
>> be low, it's definitely there. In my humble opinion it's big enough to
>> justify not turning UTF8 mode on always (despite how much I would like for
>> everyone to switch to Unicode), but you're the maintainer, not I.
> 
> Oh, I wouldn't want to guess between codepage or UTF-8 or whatever,
> that would be a nightmare. I was thinking about documenting "all file
> names in Windows *must* be UTF-8 encoded[, unless environment variable
> FOO is set]", and let the user of libavformat take care of that
> conversion[ or set that variable]. I'm still unsure about the
> environment variable.

Oh okay, yeah, that would work.
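
To make the detection problem from earlier in the thread concrete, here is a
minimal sketch of a UTF-8 well-formedness check. The helper names
utf8_decode() and utf8_is_valid() are made up for illustration and are not
part of libavformat or of the patch. The point is that the byte pair
0xC3 0xA1 passes the check and decodes to one code point (U+00E1), even if
the caller meant it as two ISO8859-1 characters, so validity alone cannot
tell the two encodings apart.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s (at most len bytes available).
 * Returns the number of bytes consumed, or 0 if the bytes are not
 * well-formed UTF-8. Hypothetical helper, not an FFmpeg function.
 */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        n = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        n = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        n = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }

    if (len < n)
        return 0;   /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates and out-of-range values */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}

/* Returns 1 if the whole buffer is well-formed UTF-8, 0 otherwise. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    while (len > 0) {
        uint32_t cp;
        size_t n = utf8_decode(s, len, &cp);
        if (n == 0)
            return 0;
        s   += n;
        len -= n;
    }
    return 1;
}
```

Many codepage strings with high bytes will fail such a check (a lone 0xE9
followed by ASCII, for instance), but some will pass, which is why requiring
UTF-8 by documentation is safer than guessing.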

>>> - Always get the Unicode command line and convert it to UTF8.
>> By all means, go for this if you feel up to it. Personally I was too lazy to
>> do it since I didn't really need it myself (I submit patches mostly to
>> scratch my own itches) but it would be a nice improvement.
> 
> Then assuming filenames that come through the command line are in
> UTF-8, we could choose between:

How were you going to make sure commandlines are received as UTF8? Force the 
user to use the MinGW shell instead of cmd.exe?

> 1 - lavf takes in UTF-8. lavf users must convert. No environment
> variables. API breakage.
> 2 - lavf takes in UTF-8 by default, with environment variable to
> select system codepage. ffmpeg always overrides that variable to use
> UTF-8. API breakage. This would be a nuisance to lavf users who want
> to pass filenames from system codepage.
> 3 - lavf takes in system codepage by default, with environment
> variable to select UTF-8. ffmpeg always overrides that variable to use
> UTF-8. No API breakage. This would be a nuisance for lavf users who
> want to pass UTF-8 filenames.
> 
> They're all better than the current "0 - no unicode support".
> 
> I'm thinking now of aiming towards 3.

Yeah, 3 would be my preference as well, especially since 1 and 2 would lead to a 
rather subtle API breakage: it won't result in something obvious like your 
application failing to compile against the new ffmpeg, but rather it'll just 
mysteriously fail to open certain files.
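
To check that I understand option 3, here is roughly how I picture the open
path. This is only a sketch under assumptions: open_maybe_utf8() is a
made-up name, and I am assuming the variable is called FF_WINUTF8 as in your
test script and that merely being set is enough to enable UTF-8 mode.
MultiByteToWideChar() and _wopen() are the real Windows calls; everything
else is illustration.

```c
#include <fcntl.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif

/*
 * Open a file whose name is in the system codepage by default, or in
 * UTF-8 when the FF_WINUTF8 environment variable is set (option 3).
 * Hypothetical function; name and variable semantics are assumptions.
 */
static int open_maybe_utf8(const char *filename, int flags, int mode)
{
#ifdef _WIN32
    if (getenv("FF_WINUTF8")) {
        wchar_t *wname;
        int fd;
        /* first call only computes the required length, NUL included */
        int n = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);

        if (n <= 0)
            return -1;
        wname = malloc(n * sizeof(*wname));
        if (!wname)
            return -1;
        MultiByteToWideChar(CP_UTF8, 0, filename, -1, wname, n);
        fd = _wopen(wname, flags, mode);
        free(wname);
        return fd;
    }
#endif
    /* default: pass the name through untouched (system codepage) */
    return open(filename, flags, mode);
}
```

With something like this, existing lavf users see no change unless the
variable is set, and ffmpeg.c would set it itself at startup before opening
anything.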

Regards,
Karl Blomster