[FFmpeg-devel] GSoC with FFmpeg, what a combination!

Uoti Urpala uoti.urpala
Sun Mar 23 02:31:15 CET 2008

On Sun, 2008-03-23 at 00:19 +0100, Michael Niedermayer wrote:
> On Sun, Mar 23, 2008 at 12:33:04AM +0200, Uoti Urpala wrote:
> > On Sat, 2008-03-22 at 13:29 +0100, Michael Niedermayer wrote:
> > > On Sat, Mar 22, 2008 at 12:37:39AM -0400, Ronald S. Bultje wrote:
> > > > Will you accept patches that add internationalization-support to
> > > > ffmpeg/lav*?
> > > > 
> > > > It's high on my fairy-list.
> > > 
> > > Sure, just keep in mind that you will be flamed if this is done with gettext :)
> > > The reason being
> > > * gettext duplicates english strings all over the place
> > 
> > I'm not sure exactly what gettext duplicates, but how much space would
> > this waste? It'd need to have many copies of every string to make a
> > difference for FFmpeg.
> I would estimate that gettext needs roughly twice as much disk space as an
> integer-based system and about 3 times as much in memory
> (the English string in _() and as key, as well as the translated string).
> The disk space matters only for embedded systems of course. There it can
> matter a lot though, especially with few codecs and many languages.

I don't think that amount is too large. Also note that
1) Any additional memory/disk usage is only for actually translated
strings, not for untranslated obscure strings.
2) This is not a comparison of the basic approaches of calculating a
string hash at compile time versus runtime; neither has an inherent
advantage in disk usage, and the memory advantage depends on which
language is being used (details later).

> > > * gettext uses strings as keys (very inefficient, requiring O(log n) lookups)
> > 
> > O(log n) is NOT inefficient for text lookups!
> If the area where the strings are has been paged out to disk then O(log n)
> can be a lot slower than O(1). And I would expect it to be paged out in
> practice, as ffmpeg doesn't print the overwhelming majority of these strings
> regularly. If it were in memory I'd immediately agree with you that O(log N)
> is irrelevant.

If it's used rarely enough to need paging, does the speed really matter?
Also, it's not obvious that there would be much of a speed difference.
Even a directly hardcoded string without any lookup can be swapped out,
and a lookup structure is unlikely to need a separate seek and read for
every accessed page.

Also note that this isn't an inherent difference between compile-time
and runtime hash calculation, both have the same O() behavior.

> > Also your claim that using
> > strings as keys would necessarily require O(log n) lookups is not true.
> > Hash tables require O(1) on average, and your own suggested method needs
> > an equal lookup.
> Yes, but if you do use a hash table, why calculate the hash values at runtime?
> Why store the English strings twice instead of the corresponding hash values?
> I don't believe you consider this good design. It is a plain waste of space.

Having the strings in the binary means you don't need a separate
translation file to use the program. If you use a translation but it
doesn't contain all needed strings, you always have at least the English
version available. If you manage to find some use case where performance
actually matters, then it allows you to turn off translation to get
optimal behavior (direct access to the string with no lookup step).

English strings do not need to be stored twice if you're willing to rely
on the uniqueness of the hashes (as you did in your suggested
implementation). Here's a list of the memory, disk and CPU usage
differences between this kind of runtime (R) and compile-time (C)
hashing:

When using English:
Memory: R has no overhead; C needs to store the hash of every
translatable string twice, plus some extra lookup-structure overhead.
Disk: R has no overhead; C needs at least one extra copy of the hashes
(in the binary), and another unless the English translation file is
special-cased.
CPU: R has no overhead; C needs to do a hash lookup.

When translating to another language:
Memory/disk: for actually translated strings R also stores the English
string; for untranslated strings R has no overhead. C stores a hash for
every string. Translation files for other languages are identical under
both schemes.
CPU: R needs to calculate the hash of printed strings.

> > The only calculation your suggestion can save is
> > calculating the hash at runtime, which is O(length of string) and thus
> > cannot affect O() behavior (assuming the result is of similar length and
> > has to be output).
> The .gmo files do not contain hashes; they contain 2 lists of pointers
> to arrays of sorted strings. This structure is designed for an O(log N)
> binary search. One could of course convert that to a sane hash table on
> load, but then you lose the file backing this hash table, which means
> it needs more time if it's paged out.
> Also, the .gmo files are 50% larger than they have to be, and you still
> have your key strings from _() in memory wasting space.

In the above I was talking about calculating the hash at compile-time
rather than runtime, not comparing to gettext. I'm not sure whether
gettext should be used as is, but I am pretty sure that calculating
hashes at compile time is not worth doing.

> > > * no duplicate strings in the final binary files
> > 
> > I assume you mean that the English strings are not used at all when
> > using a translation. This is also a nonsensical requirement, as the
> > total amount of translated text is not large enough to justify it. There
> > are easier ways to reduce FFmpeg binary size (and by a larger amount)
> > than the complexity required for this.
> If you can reduce ffmpeg's binary size by a large amount and with little
> complexity then please do it, but judging from your past claims this is
> just hot air; at best you will sketch something very complex that no one,
> and especially not you, will ever implement.

I assume you're referring to what I said about the motion_est bloat. I
already explained that the bloat doesn't bother me enough for me to work
on removing it. And I did give a very simple example patch that was
enough to reduce binary size by 10 kB. That's already enough to
translate quite a few of the most common strings even in a way that
completely duplicates them.

> Also, even if you can reduce ffmpeg's size significantly, this is hardly
> an argument not to attempt to implement other unrelated improvements.

But it is an argument not to add complexity or do extra work for the
sake of relatively insignificant savings elsewhere.
