[FFmpeg-devel] [PATCH] Support for UTF8 filenames on Windows

Karl Blomster thefluff
Sat Jul 18 19:11:02 CEST 2009

Ramiro Polla wrote:
> On Sat, Jul 18, 2009 at 2:00 AM, Karl Blomster<thefluff at uppcon.com> wrote:
>> Ramiro Polla wrote:
>>> On Thu, Jul 16, 2009 at 2:55 PM, Karl Blomster<thefluff at uppcon.com> wrote:
>>>> Ramiro Polla wrote:
>>>>> On Thu, Jul 16, 2009 at 11:20 AM, Karl Blomster<thefluff at uppcon.com>
>>>>> wrote:
>>>>>> Unless I am severely missing something in your updated patch (thanks
>>>>>> for the nice work, by the way!) it will not work with the FFmpeg
>>>>>> command-line program. If you want a Unicode command line on Windows
>>>>>> you need to use wmain() or _tmain() instead of plain old main(),
>>>>>> AFAIK. As I said earlier, my original patch was only intended to let
>>>>>> the API support Unicode. Working it into ffmpeg.c would be a lot more
>>>>>> work, I think.
>>>>> How do you test UNICODE support?
>>>>> I used attached shell file with msys (sh test_unicode.sh) and it works
>>>>> as expected (only the unicode filename without FF_WINUTF8 fails). I
>>>>> also tested with an app that used Find(First,Next)FileA() and passed
>>>>> the Unicode filenames as ASCII strings to ff_winutf8_open(), and it
>>>>> also worked as expected.
>>>> Plain old cmd.exe (both with and without the chcp 65001 trick). I can
>>>> do stuff like notepad.exe <unicode filename> and it'll work fine, but
>>>> with ffmpeg it just says file not found (and prints a bogus filename).
>>>> It works fine with MinGW's sh; MinGW probably does some kind of black
>>>> magic there to get Unix apps to work without having to patch in the
>>>> Windows mess. The API works fine, of course.
>>> Do you know of any real example where a codepage->utf8 conversion
>>> fails? I only see some possible theoretical references scattered
>>> around the web, but no real examples.
>> Not sure what you mean here. A given character string in a known codepage
>> should always be possible to convert to UTF-8, assuming all of its
>> characters have Unicode equivalents. I'm not sure if any codepages that
>> aren't fully translatable actually exist.
>>> I'm tempted to do the following:
>>> - Always expect filenames in Windows to be passed in UTF8.
>> This is dangerous for the reasons I mentioned earlier; namely that it
>> isn't possible to reliably detect whether a given string is UTF-8 or not.
>> Lots of applications using the FFmpeg API will pass strings in the local
>> codepage, and it's quite possible for a given string in some unknown
>> codepage to be valid UTF-8 without actually being UTF-8. For example, the
>> ISO 8859-1 string 0xC3 0xA1 (capital letter A with tilde followed by an
>> inverted exclamation mark) is also valid UTF-8, but decodes to the single
>> character U+00E1 (small letter a with acute), which is obviously wrong.
>> While the likelihood of this actually happening in a real-world filename may
>> be low, it's definitely there. In my humble opinion it's big enough to
>> justify not turning UTF8 mode on always (despite how much I would like for
>> everyone to switch to Unicode), but you're the maintainer, not I.
> Oh, I wouldn't want to guess between codepage or UTF-8 or whatever,
> that would be a nightmare. I was thinking about documenting "all file
> names in Windows *must* be UTF-8 encoded[, unless environment variable
> FOO is set]", and let the user of libavformat take care of that
> conversion[ or set that variable]. I'm still unsure about the
> environment variable.

Oh okay, yeah, that would work.

>>> - Always get the Unicode command line and convert it to UTF8.
>> By all means, go for this if you feel up to it. Personally I was too lazy to
>> do it since I didn't really need it myself (I submit patches mostly to
>> scratch my own itches) but it would be a nice improvement.
> Then assuming filenames that come through the command line are in
> UTF-8, we could choose between:

How were you going to make sure command lines are received as UTF-8? Force 
the user to use the MinGW shell instead of cmd.exe?

> 1 - lavf takes in UTF-8. lavf users must convert. No environment
> variables. API breakage.
> 2 - lavf takes in UTF-8 by default, with environment variable to
> select system codepage. ffmpeg always overrides that variable to use
> UTF-8. API breakage. This would be a nuisance to lavf users who want
> to pass filenames from system codepage.
> 3 - lavf takes in system codepage by default, with environment
> variable to select UTF-8. ffmpeg always overrides that variable to use
> UTF-8. No API breakage. This would be a nuisance for lavf users who
> want to pass UTF-8 filenames.
> They're all better than the current "0 - no unicode support".
> I'm thinking now of aiming towards 3.

Yeah, 3 would be my preference as well, especially since 1 and 2 would lead to a 
rather subtle API breakage: it won't result in something obvious like your 
application failing to compile against the new ffmpeg, but rather it'll just 
mysteriously fail to open certain files.

Karl Blomster
