[FFmpeg-devel] [BUG] UTF-8 decoder vulnerable to character spoofing attacks

Rich Felker
Mon Oct 22 19:00:18 CEST 2007

On Mon, Oct 22, 2007 at 05:31:18PM +0200, Michael Niedermayer wrote:
> Hi
> On Mon, Oct 22, 2007 at 10:24:41AM -0400, Rich Felker wrote:
> > The UTF-8 decoder in libavutil has a classic bug where it incorrectly
> > decodes illegal sequences as aliases for valid characters. For
> > instance, the sequence "C0 80" will decode to NUL, and the sequence
> > "C0 AF" will decode to '/'. Aside from possible direct vulnerabilities
> > (of which there are probably none at present, but I have not checked),
> > this can lead to indirect problems by allowing illegal sequences to
> > get into files generated by ffmpeg, causing problems for other
> > processes interpreting the files.
> well, i dont see how any change to GET_UTF8 could affect that.
> UTF8 generated by ffmpeg should be valid, UTF8 input into ffmpeg gets
> stored in the output files without ever being touched by GET_UTF8

*nod* okay. Actually it would be a good idea to verify it though since
lots of users have their locales set wrong, etc., but this is a
separate issue from the bug I reported of course.

We still have a bug with GET_UTF8 though because it's used for writing
data into containers that use UTF-16 or UCS-2/4. I suspect writing a 0
into the container when "C0 80" is encountered is fairly bad...
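
To make the concern concrete, the re-encoding step such a container muxer performs might look like this sketch (put_utf16 is a hypothetical helper name, not an FFmpeg API): whatever code point the UTF-8 decoder hands it, including a bogus 0 decoded from "C0 80", gets written into the container verbatim.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: encode one code point as UTF-16.  Returns the number of
 * 16-bit code units written (1 for the BMP, 2 for a surrogate pair).
 * Note there is no validity check here -- garbage in, garbage out. */
static int put_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {            /* BMP: single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                 /* supplementary plane: surrogate pair */
    out[0] = 0xD800 | (cp >> 10);  /* high surrogate */
    out[1] = 0xDC00 | (cp & 0x3FF);/* low surrogate */
    return 2;
}
```

So if GET_UTF8 decodes the overlong "C0 80" as U+0000 instead of erroring out, this step happily emits a NUL code unit into the output.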

> we could add some code to check the validity of UTF8 strings but iam
> not sure if this is ffmpegs job, if the user feeds ffmpeg with invalid
> data and asks it to copy it why shouldnt it?

Because we're contributing to the existence of invalid (and horribly
broken) files. Do you really want to see NUT files with Latin-1
strings in them because some idiot's locale is latin-1-based? :) And
then for lame apps to start playing the "charset guessing" game.....

> should we also decode the video stream for stream copy to ensure its not
> damaged?

No, but there are various levels of this. We certainly should check
things that are semantic at the container level, like validity of
timestamps. IMO checking text is inexpensive and saves us from a HUGE
risk of invalid files due to the input coming directly from
luser-entered bytes rather than machine-generated crap... :)
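
To show that "checking text is inexpensive" really is cheap, here is a minimal validator sketch (the function name is made up, this is not FFmpeg code) that rejects exactly the forms discussed above: overlong sequences like "C0 80" / "C0 AF", stray continuation or FE/FF lead bytes, UTF-16 surrogates, out-of-range code points, and truncated tails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..n) is well-formed UTF-8 per RFC 3629, else 0. */
static int utf8_is_valid(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        if (b < 0x80) { i++; continue; }       /* ASCII */
        int len; uint32_t cp, min;
        if      ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80;    }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800;   }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation byte, or FE/FF lead */
        if (i + len > n) return 0;              /* truncated sequence */
        for (int k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                 /* overlong, e.g. "C0 80" */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;            /* beyond Unicode range */
        i += len;
    }
    return 1;
}
```

A single pass, a handful of compares per character -- trivial next to anything else a muxer does.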

BTW, does ffmpeg even do anything with strings input from the command
line to convert them to UTF-8? AFAIK it's generating bogus files
unless the user's locale is UTF-8-based.

> > In addition, the code fails to detect illegal sequences beginning with
> > more than 4 bits equal to 1. I have attached a naive, inefficient
> > patch for fixing these issues, but someone should really write a
> > better fix.
> i would first like to understand under what circumstances the current code
> is causing a real problem (security or normal bug)

It could actually cause a crash, e.g. if a string is right at the top of
the heap and contains nothing but 0xff, then the UTF-8 decoder will
read 8 bytes and crash.
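
A sketch of that failure mode (illustrative only, not the actual GET_UTF8 macro): if the sequence length is derived purely from the count of leading 1 bits in the lead byte, with no check that the count is at most 4, then 0xFF yields length 8, and the decoder walks 7 "continuation" bytes past the lead byte -- past the end of the string.

```c
#include <assert.h>
#include <stdint.h>

/* Naive length computation: count leading 1 bits of the lead byte.
 * Buggy by design: nothing rejects counts of 5..8, so 0xFF gives 8. */
static int naive_seq_len(uint8_t lead)
{
    int ones = 0;
    while (ones < 8 && (lead & (0x80 >> ones)))
        ones++;
    return ones == 0 ? 1 : ones;   /* missing: if (ones > 4) error */
}
```

With the string "\xff" at the very end of a mapped page, a decoder trusting this length reads 7 bytes of unmapped memory.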

As I said, it will also incorrectly decode illegal aliases for
characters rather than signalling an error. I don't think this will lead
to vulns in the current code (since the UTF-8 decoder is hardly used
anyway), but it's bad to have buggy code that could cause problems
when someone uses it in the future. Even if not, it teaches people who
read it extremely bad practices.

> > Also, a few other 'bugs' in the code:
> [...]
> > - it's not wrapped in the proper do { ... } while(0) construct to make
> >   it behave as a single statement.
> i wouldnt call that "bug" but rather feature, id like to keep the code
> readable ...

If you look at the code, it's already inside { } except for one
statement at the beginning that could be moved inside. Then it's just
a matter of adding the do and while keywords on their own lines
before/after, so IMO there's no readability issue here.
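
As a toy illustration of why the wrapper matters at all (hypothetical macro names, nothing from libavutil): a brace-only multi-statement macro breaks when used as an if branch followed by else, while the do { ... } while(0) form behaves as a single statement.

```c
#include <assert.h>

/* Brace-only version: the ';' the caller naturally writes after the
 * macro terminates the if, orphaning any following else:
 *
 *     if (x > y)
 *         BAD_SWAP(x, y);   // syntax error when an else follows
 *     else
 *         ...
 */
#define BAD_SWAP(a, b)  { int t_ = (a); (a) = (b); (b) = t_; }

/* do/while(0) version: swallows the trailing ';' and acts as exactly
 * one statement, so the same if/else compiles and behaves as expected. */
#define GOOD_SWAP(a, b) do { int t_ = (a); (a) = (b); (b) = t_; } while (0)
```

So the change is purely mechanical -- two extra keywords -- and costs nothing at runtime.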

