[FFmpeg-devel] GSoC with FFmpeg, what a combination!

Uoti Urpala uoti.urpala
Mon Mar 24 22:18:00 CET 2008


On Mon, 2008-03-24 at 21:16 +0100, Michael Niedermayer wrote:
> On Mon, Mar 24, 2008 at 08:08:43PM +0200, Uoti Urpala wrote:
> > On Mon, 2008-03-24 at 00:41 +0100, Michael Niedermayer wrote:
> > > On Sun, Mar 23, 2008 at 03:31:15AM +0200, Uoti Urpala wrote:
> > > > Also it's not that obvious that there would be much of a speed
> > > > difference. Even a directly hardcoded string without any lookup can be
> > > > swapped out, and a lookup structure is unlikely to need a separate seek
> > > > and read for every accessed page.
> > > 
> > > binary search needs O(log N) seeks.
> > 
> > That's the asymptotic behavior as you approach billions of translated
> > messages, but in practice the number of disk seeks needed during the
> > binary search is significantly less than log2(N), because the last
> > several steps fall within a single page and are read with one seek.
> 
> A 64k string table with an average of 10 bytes per string behaves as
> follows under a binary search:
> 
> read 16-20k (ptr to string)
> read 48-52k
> read  8-12k (ptr to string)
> read 40-44k
> read  4- 8k (ptr to string)
> read 36-40k
> read  0- 4k (ptr to string)
> read 32-36k
> 
> that's 8 seeks

That's a fucked up table... there are 32 KiB of pointers to strings and
32 KiB of actual strings (I assume you meant Ki, not k)? Assuming 4-byte
pointers, that means there are 8192 translated strings and each
translated string is on average 4 bytes long (!!), even shorter than the
already unrealistically short 10 bytes you claimed to be using. Even in
this example it's unlikely that all 8 areas would be read separately
(and 8 is still less than log2(8192) = 13).
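
To make the page-coalescing point concrete, here is a minimal C sketch
(mine, not from either mail) that simulates a binary search over 8192
sorted 4-byte offsets (the pointer half of the 64 KiB table above) and
counts how many distinct 4 KiB pages one search path actually touches:

/* Minimal sketch: count the distinct 4 KiB pages a binary search
 * touches in an array of N sorted 4-byte string offsets. N and the
 * page size match the 64 KiB example above. */
#include <stdio.h>
#include <string.h>

#define N     8192   /* translated strings     */
#define PAGE  4096   /* page size in bytes     */
#define PTRSZ 4      /* bytes per offset entry */

int main(void)
{
    char faulted[(N * PTRSZ) / PAGE];  /* one flag per page */
    memset(faulted, 0, sizeof(faulted));

    int lo = 0, hi = N - 1, steps = 0, faults = 0;
    while (lo <= hi) {
        int mid  = lo + (hi - lo) / 2;
        int page = (mid * PTRSZ) / PAGE;
        if (!faulted[page]) {          /* new page -> one page-in */
            faulted[page] = 1;
            faults++;
        }
        steps++;
        hi = mid - 1;                  /* always descend left */
    }
    printf("%d comparisons, %d distinct pages touched\n", steps, faults);
    return 0;
}

On this path it does 13 comparisons but touches only 3 pages, because
every step below index 1024 lands on page 0; a path that always
descends right touches 4 pages. Either way, the last several steps
collapse into a single page, so the seek count stays well under
log2(N).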

> A hashtable which is 50% full would need <2 seeks on average even
> with billions of strings.

Getting to 2 seeks on average would cost extra space, though: you'd
need to store enough information in each hash table cell to verify
whether it's the correct one without extra seeks, and that space would
be wasted in the unused cells.
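
As a sketch of that trade-off (a hypothetical cell layout, not
gettext's actual format): each slot carries the full 32-bit hash of
its key, so a probe can be rejected without seeking to the string
data, at the cost of 8 bytes per slot including the empty ones.

#include <stdint.h>

/* Hypothetical on-disk hash cell: the stored hash lets a lookup
 * reject a slot without a seek; empty slots waste the space. */
struct cell {
    uint32_t key_hash;    /* 0 = empty; full hash verifies the slot */
    uint32_t str_offset;  /* where the key/translation pair lives   */
};

/* Linear probing over 'size' slots (a power of two, kept ~50% full).
 * Returns the string offset, or 0 if the key is absent.            */
static uint32_t lookup(const struct cell *table, uint32_t size,
                       uint32_t hash)
{
    for (uint32_t i = hash & (size - 1); table[i].key_hash;
         i = (i + 1) & (size - 1)) {
        if (table[i].key_hash == hash)   /* match: one final seek */
            return table[i].str_offset;  /* fetches the text      */
    }
    return 0;
}

At 50% load the expected number of probes stays near 1-2, and only the
matching slot costs a second seek, to the string itself.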

> Also, for each of these pages read in, another page must be paged
> out, needing another seek; thus it's 16 vs 4.

Your calculation for binary search exaggerated the number of seeks
needed, and comparing those numbers directly is misleading because
there is other overhead (such as loading the lookup code and the
actual translated string), which means the overall ratio would differ
less.

> And as for your idea of reading much more than 4 kB: well, you can,
> but other equally large blocks would be paged out.

That's not "my idea", that is what operating systems ACTUALLY DO.
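
(As an aside, a process can also request the larger read explicitly.
Here is a sketch using the POSIX readahead hint; the file name is
hypothetical:)

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("strings.tab", O_RDONLY);  /* hypothetical table file */
    if (fd < 0)
        return 1;
    /* Ask the kernel to page in the whole 64 KiB table at once, so
     * later lookups hit the page cache instead of seeking per page. */
    posix_fadvise(fd, 0, 64 * 1024, POSIX_FADV_WILLNEED);
    /* ... lookups proceed against a warm cache ... */
    close(fd);
    return 0;
}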

> And just to see how idiotic this argument is, consider an embedded
> system with 1 MB of RAM; this thing is not going to page half of its
> memory out on each page fault.

Such an embedded system is unlikely to be running FFmpeg, much less one
with 10k translated strings.

> > Anyway, this isn't too important if even gettext doesn't use binary
> > search in practice.
> 
> Which shows that even the gettext developers consider binary searches
> problematic. I assume they have more practical experience with them
> with respect to translation than you.
> Of course that will not stop you from explaining to us why you are
> still correct and the world is wrong.

The .mo file format description mentioned earlier says that the hash
table is optional because it doesn't gain that much speed.
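
For reference, the header that description documents, written out as a
C struct (field names here are mine): a hash table size of zero is
legal, and lookup then falls back to binary search over the original
strings, which the format keeps sorted for exactly that purpose.

#include <stdint.h>

/* GNU gettext .mo file header, per the gettext manual. */
struct mo_header {
    uint32_t magic;          /* 0x950412de (or byte-swapped)       */
    uint32_t revision;       /* file format revision               */
    uint32_t num_strings;    /* number of string pairs             */
    uint32_t orig_tab_off;   /* offset of original-strings table   */
    uint32_t trans_tab_off;  /* offset of translated-strings table */
    uint32_t hash_tab_size;  /* slots in the hash table; may be 0  */
    uint32_t hash_tab_off;   /* offset of hash table (unused if 0) */
};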




