Large FOSS filesystems, was Re: [Beowulf] 512 nodes Myrinet cluster Challenges

Sun May 7 11:43:17 PDT 2006

On Friday 05 May 2006 11:36, Craig Tierney wrote:
> My concern with this setup isn't xfs; it would be the stability of
> the storage.  Also, if there is a disk hiccup (which will happen),
> repairing a 16 TB filesystem takes a long time.  With a distributed
> filesystem (PVFS2, Ibrix, etc.) you would only have to fix the one
> volume, not the entire filesystem.  There may be some filesystem
> consistency checks after repair, but not to the extent of a full
> filesystem check.

We have a single 35TB Ibrix filesystem, served by 16 fileservers and backed 
by 64 SAN LUNs on a DataDirect Networks storage array.  The fsck protocol 
today is to do a full filesystem check first, and then do fixes if 
necessary.  The LUN filesystems are modified ext3, so the "Phase I" fsck is 
64 ext3 fscks run in parallel.  The check-only Phase I run takes quite a while 
(ext3 fsck is fairly slow).  Once the damaged LUN filesystems are 
identified, the repairs can optionally be restricted to the damaged LUNs; 
fewer LUNs being accessed in parallel means that the repair run can 
actually go much faster than the check-only run.  Usually a post-repair 
consistency check is not necessary (Ibrix tech support advises us what to 
do in each case, depending on what the logs show).  There are two more fsck 
phases that are run separately; the second phase is somewhat faster than 
the first, and the third is very fast.
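
To make the check-then-repair flow above concrete, here is a rough sketch in 
Python.  It is not the Ibrix tooling (which I'm not describing here); it 
assumes stock e2fsck can be pointed at each LUN, which may not hold for a 
modified ext3.  The function names, the device paths, and the worker count 
are hypothetical.  The only real interface used is e2fsck's -f/-n/-y flags 
and its exit status bit 4 ("errors left uncorrected").

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# e2fsck's exit status is a bit mask; bit 4 means "errors left uncorrected",
# which is what a check-only (-n) run reports when it finds damage.
ERRORS_UNCORRECTED = 4


def run_e2fsck(device, repair=False):
    """Run stock e2fsck on one device and return its exit status.

    Assumption: the LUN filesystem is checkable with plain e2fsck.
    -f forces a check, -n answers "no" to everything (check-only),
    -y answers "yes" to everything (repair).
    """
    flag = "-y" if repair else "-n"
    return subprocess.run(["e2fsck", "-f", flag, device]).returncode


def check_then_repair(devices, runner=run_e2fsck, check_workers=64):
    """Check every device in parallel, then repair only the damaged ones.

    Returns a dict mapping each damaged device to its repair exit status.
    """
    devices = list(devices)

    # Check-only pass: all LUNs at once, nothing is modified.
    with ThreadPoolExecutor(max_workers=check_workers) as pool:
        statuses = list(pool.map(lambda d: runner(d, repair=False), devices))

    damaged = [d for d, s in zip(devices, statuses) if s & ERRORS_UNCORRECTED]
    if not damaged:
        return {}

    # Repair pass: restricted to the damaged LUNs, so far fewer devices
    # are being accessed in parallel than during the check-only pass.
    with ThreadPoolExecutor(max_workers=len(damaged)) as pool:
        repaired = list(pool.map(lambda d: runner(d, repair=True), damaged))

    return dict(zip(damaged, repaired))
```

Passing the runner in as a parameter means the flow can be exercised with a 
stand-in function, without touching any real block device.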

I'll leave out any further details of the Ibrix filesystem architecture and 
fsck, since I'm not entirely clear how much they want to keep private in 
their conversations with their supported or pre-sales customers.  You can 
talk to them yourself. :)

David