[FFmpeg-devel] [PATCH] "Mojibake" in Japanese

Tetsuya Yoshida tetu at eth0.jp
Thu Feb 23 23:01:25 CET 2012


Hi Michael Niedermayer!

> It is maybe easier for the end user of a package if it could be
> selected at runtime.
> For example with an environment variable.

Thanks for your idea!
I rewrote the patch to use an environment variable.
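
For readers without the attachment, here is a rough sketch of what runtime
selection through an environment variable could look like. It is not the
attached patch. The variable name ID3_TAG_CHARSET and the function names are
made up for illustration, and the fallback to ISO-8859-1 is an assumption
based on the default text encoding of ID3 tags.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical variable name, chosen for this sketch only. */
#define TAG_CHARSET_ENV "ID3_TAG_CHARSET"

/* Convert in[0..in_len) from the given charset to UTF-8.
 * Returns a malloc'd, NUL-terminated string, or NULL on any failure. */
static char *tag_to_utf8_from(const char *charset, const char *in, size_t in_len)
{
    assert(charset != NULL && in != NULL);

    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;                      /* unknown charset name */

    size_t out_size = in_len * 4 + 1;     /* generous bound for UTF-8 output */
    char *out = malloc(out_size);
    if (!out) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;               /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size - 1;

    if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
        free(out);                        /* invalid or truncated input */
        iconv_close(cd);
        return NULL;
    }
    *dst = '\0';
    iconv_close(cd);
    return out;
}

/* Pick the source charset at runtime; fall back to ISO-8859-1 if unset. */
static char *tag_to_utf8(const char *in, size_t in_len)
{
    const char *charset = getenv(TAG_CHARSET_ENV);
    if (!charset || !*charset)
        charset = "ISO-8859-1";
    return tag_to_utf8_from(charset, in, in_len);
}
```

With this, a user with Shift_JIS tags would set the variable to a charset
name that their iconv understands before starting the program, and nothing
changes for users who do not set it.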


> Still better would be if autodetection could be done, is there some
> readily available software (like iconv) that can guess from a char*
> what encoding is used?

Do you know juniversalchardet?
universalchardet is the encoding detector used in Mozilla Firefox, and
juniversalchardet is universalchardet packaged as a library.
I have never used it, though.

I do not think its automatic detection would be accurate here,
because ID3 tags are very short strings.

http://code.google.com/p/juniversalchardet/
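
To illustrate why short tags are hard to detect, here is a small sketch with
two simplified validity checks. They only test byte ranges, not the full
character tables, and the function names are made up for this example. The
two bytes 0xA4 0xA2 are the hiragana "a" in EUC-JP, but they are also two
valid half-width punctuation marks in Shift_JIS, so a validity check alone
cannot tell the two encodings apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Structural check only: byte ranges, not the full Shift_JIS tables. */
static bool looks_like_shift_jis(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = s[i];
        if (c < 0x80 || (c >= 0xA1 && c <= 0xDF))
            continue;                     /* ASCII or half-width katakana */
        if ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC)) {
            if (i + 1 >= n)
                return false;             /* truncated double-byte char */
            unsigned char t = s[++i];
            if (t < 0x40 || t == 0x7F || t > 0xFC)
                return false;             /* invalid trail byte */
            continue;
        }
        return false;                     /* byte is never valid here */
    }
    return true;
}

/* Structural check only: byte ranges, not the full EUC-JP tables. */
static bool looks_like_euc_jp(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = s[i];
        if (c < 0x80)
            continue;                     /* ASCII */
        if (c == 0x8E) {                  /* half-width katakana prefix */
            if (i + 1 >= n || s[i + 1] < 0xA1 || s[i + 1] > 0xDF)
                return false;
            i += 1;
            continue;
        }
        if (c == 0x8F) {                  /* three-byte JIS X 0212 char */
            if (i + 2 >= n ||
                s[i + 1] < 0xA1 || s[i + 1] > 0xFE ||
                s[i + 2] < 0xA1 || s[i + 2] > 0xFE)
                return false;
            i += 2;
            continue;
        }
        if (c >= 0xA1 && c <= 0xFE) {     /* two-byte JIS X 0208 char */
            if (i + 1 >= n || s[i + 1] < 0xA1 || s[i + 1] > 0xFE)
                return false;
            i += 1;
            continue;
        }
        return false;
    }
    return true;
}
```

Both functions accept the byte pair 0xA4 0xA2, so a detector has to fall back
on statistics, and a tag of a few characters gives it very little to count.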

Yoshida Tetsuya


2012/2/15 Michael Niedermayer <michaelni at gmx.at>

> Hi Vladimir
>
> On Wed, Feb 15, 2012 at 02:52:23AM +0400, Vladimir Mosgalin wrote:
> > Hi Michael Niedermayer!
> >
> >  On 2012.02.14 at 22:36:04 +0100, Michael Niedermayer wrote next:
> >
> > > It is maybe easier for the end user of a package if it could be
> > > selected at runtime.
> > > For example with an environment variable.
> > >
> > > Still better would be if autodetection could be done, is there some
> > > readily available software (like iconv) that can guess from a char*
> > > what encoding is used?
> >
> > I can answer this, and the answer is "no". From experience, even
> > detection among a few Cyrillic encodings (based on letter frequencies) is
> > often impossible if all you have is two or three words. It just won't
> > work correctly. And that's when you are sure it's a single-byte encoding;
> > if you add the possibility of multibyte Japanese and some others into the
> > mix, it becomes impossible to detect anything on something as short as
> > a typical ID3 tag.
>
> It could in theory be done by looking the decodings up in a database
> of artists and titles. I doubt a false encoding would lead to a better
> match on all fields of an ID3 tag than the correct encoding.
>
>
> >
> > Browsers choke on detecting encoding if none is specified sometimes even
> > when whole page of text is provided; of course, if you have a page you
>
> No doubt they do, but this isn't a strong statement on the difficulty
> of the problem. We don't know how much time and effort the browser
> developers have put into writing the encoding guessing code.
> Someone could just as well argue that it's hard because ffmpeg doesn't
> get it right.
>
> That said, ATM I don't see many reasonable ways for us to guess the
> encoding either.
> Using an offline database of music titles is a clear no-go, as much as
> querying an online database is (privacy issues here).
> One thing that could be tried would be feeding trial-decoded strings
> to a spell checker if one is installed. If its word list is complete
> enough, it might work out in detecting the correct encoding.
> I am not sure this is reasonable, but if the code is clean and
> obviously optional, I'd apply a patch that adds such a guessing feature.
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Let us carefully observe those good qualities wherein our enemies excel us
> and endeavor to excel them, by avoiding what is faulty, and imitating what
> is excellent in them. -- Plutarch
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ffmpeg-0.10_iconv_v2.patch
Type: application/octet-stream
Size: 5509 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20120224/1ab0adcf/attachment.obj>