[Beowulf] since we are talking about file systems ...

Jim Lux James.P.Lux at jpl.nasa.gov
Sun Jan 22 12:17:36 PST 2006


At 10:23 AM 1/22/2006, Robert G. Brown wrote:
>On Sun, 22 Jan 2006, PS wrote:
>
>>Indexing is the key;  observe how Google accesses millions of files in 
>>split seconds; this could easily be achieved in a PC file system.
>
>I think that you mean the right thing, but you're saying it in a very
>confusing way.
>
>1) Google doesn't access millions of files in a split second, it AFAIK
>accesses relatively few files that are hashes (on its "index server")
>that lead to URLs in a split second WITHOUT actually traversing millions
>of alternatives (as you say, indexing is the key:-).  File access
>latency on a physical disk makes the former all but impossible without
>highly specialized kernel hacks/hooks, ramdisks, caches, disk arrays,
>and so on.  Even bandwidth would be a limitation if one assumes block
>I/O with a minimum block size of 4K -- 4K x 1M -> 4 Gigabytes/second
>(note BYTES, not bits) exceeds the bandwidth of pretty much any physical
>medium except maybe memory.
>
>2) It cannot "easily" be achieved in a PC file system, if by that you
>mean building an actual filesystem (at the kernel level) that supports
>this sort of access.  There is a lot more to a scalable, robust,
>journaling filesystem than directory lookup capabilities.  A lot of
>Google's speed comes from being able to use substantial parallelism on a
>distributed server environment with lots of data replication and
>redundancy, a thing that is impossible for a PC filesystem with a number
>of latency and bandwidth bottlenecks at different points in the dataflow
>pathways towards what is typically a single physical disk on a single
>e.g.  PCI-whatever channel.
>
>I think that what you mean (correctly) is that this is something that
>"most" user/programmers would be better off trying to do in userspace on
>top of any general purpose, known reliable/robust/efficient PC
>filesystem, using hashes customized to the application.  When I first
>read your reply, though, I read it very differently as saying that it
>would be easy to build a linux filesystem that actually permits millions
>of files per second to be accessed and that this is what Google does
>operationally.


This is almost certainly true.  Typically, the user knows a bit about their
application, and can come up with a "good" way to hash or structure the
directories/filenames so that performance stays decent on the underlying
OS filesystem.  It's also easy to test: write a small program that
generates the zillions of files you expect and measure how access holds up.
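For example, a throwaway script along these lines (a Python sketch; the
directory fan-out, file names, and sizes are just illustrative choices, not
anything from a real system) will show you roughly where a given filesystem
starts to hurt:

#!/usr/bin/env python
# Sketch: populate a directory tree with lots of small files, then time
# random opens to see how the filesystem copes as the count grows.
import os, random, time

def make_test_tree(root, n_dirs=100, files_per_dir=100):
    # n_dirs subdirectories, each holding files_per_dir small files
    for d in range(n_dirs):
        subdir = os.path.join(root, "%02d" % d)
        os.makedirs(subdir, exist_ok=True)
        for f in range(files_per_dir):
            rec = d * files_per_dir + f
            with open(os.path.join(subdir, "rec%06d.dat" % rec), "w") as fh:
                fh.write("x" * 512)        # small record-sized payload

def time_random_opens(root, n_dirs=100, files_per_dir=100, samples=1000):
    # open a random sample of the files; report mean latency per open
    start = time.time()
    for _ in range(samples):
        rec = random.randrange(n_dirs * files_per_dir)
        subdir = "%02d" % (rec // files_per_dir)
        path = os.path.join(root, subdir, "rec%06d.dat" % rec)
        with open(path) as fh:
            fh.read()
    return (time.time() - start) / samples

if __name__ == "__main__":
    make_test_tree("testdb")
    print("mean open+read: %.6f s" % time_random_opens("testdb"))

Crank n_dirs and files_per_dir up and down and you can watch where the
underlying filesystem's directory handling falls over.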

However, if you are writing software for eventual distribution to others, 
make sure you explain how you do it, and be aware that other file systems 
may not see it the same way. An anecdote to illustrate:  Back in the late 
80s/early 90s, I built a software system which provided a database of around 
10,000 industrial real estate properties.  The amount of information for 
each property was highly variable (if for no other reason than that we stored 
all the historical change information), so I stored all the data for each 
property in its own (MS-DOS) file.  All those files were stored 
(originally) in a directory called something like "database".  Early 
testing worked really well, with a database of a few hundred records (i.e. 
files), but when we loaded up several thousand, it slowed to a crawl.  A bit 
of experimentation enabled me to figure out where the "breakpoints" in the 
MS-DOS internal directory caches were, so we could come up with an 
appropriate set of directories: do you do 100 directories of 100 files, or 
10 directories of 1000 files, or 10 directories of 10 directories of 100 
files, etc.?  As it happens, we wound up with 100 directories and hashed 
based on the low-order digits of the property's id number (the numbers 
being a legacy of the original manual system, and guaranteed unique: most 
of the time <grin>).
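In today's terms, that hashing scheme amounted to something like this (a
Python sketch for illustration; the function and path names are mine, not
the original MS-DOS code):

import os

def record_path(root, property_id):
    # 100 buckets, keyed on the last two digits of the property id
    bucket = "%02d" % (property_id % 100)        # "00" .. "99"
    return os.path.join(root, bucket, "%d.dat" % property_id)

# e.g. property 123456 lands in database/56/123456.dat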

All was well, even through many revs.  People would ask about it when they 
did backups ("what are all those directories for?").

Enter Novell Networking... Apparently, Netware had its own scheme for 
caching file directory information on its servers, with very different 
properties from that in MS-DOS, AND the default installations never 
contemplated the possibility that someone might have, gasp, 10,000 files 
that they needed regular access to.  Much wailing, gnashing of teeth, and 
Novell CNEs who needed to go and talk to Novell customer support about how 
to reconfigure the server (which essentially required reformatting and 
rebuilding the server, a long and tedious process involving many 5 1/4" 
floppies to back up the existing contents, etc.).

--- so, if you DO implement one of these hashing schemes, document it well.

BTW, you're still better off figuring out how to work WITH the OS's existing 
architecture.  I built a replacement file system for RSX-11M-PLUS back in 
the late 70s that supported shared disks, and it was a royal pain to make 
work.  Things like cache concurrency, write-behind, and the like are 
tricky to deal with.



