Because XFS is BETTER (Re: opinion on XFS)

Thu May 9 18:13:33 PDT 2002

On Thu, 9 May 2002, Eray Ozkural wrote:
> On Thursday 09 May 2002 04:48, Donald Becker wrote:
> > On Thu, 9 May 2002, Eray Ozkural wrote:

> Okay. Now that may indeed be the case, because I never used XFS code
> prior to the one based on 2.4.x release. It does seem to be very
> stable at the moment, though, so perhaps you can give it a whirl
> again.

> I trust your knowledge of the kernel more than any other person on the
> list, so maybe you can tell us, in your opinion, which filesystem is
> truly the best in an I/O intensive environment (parallel database/IR
> algorithms, etc.)

Oh, I'm not the right person to ask about filesystems.  I used to know
about them, but that was long ago.  There are two things that I do know:
    Last year's conventional wisdom is now completely wrong, and
    We don't yet have a good general-purpose cluster file system.

The critical issues of the '80s, such as using the disk geometry
information, optimizing the packing of small files, and bitmaps vs. free
lists are all irrelevant today.  NFS is slow because it was designed to
work around file servers that crashed several times a day.  NFS killed
the faster RFS protocol just as Sun file servers changed into
exceptionally reliable machines.

In the mid-90s there were people loudly supporting UFS and its
synchronous metadata updates, even while their filesystems were silently
losing data during crashes.  But the directory structure was consistent...

For the Scyld cluster system we decided to be completely filesystem
agnostic.  That's not because we consider the filesystem unimportant.
It's because we consider it vitally important, both for performance and
scalability.

The problem is that there is no single file system that can give us the
single-system consistency combined with broad high performance on
low-cost hardware.  We decided that the filesystems would have to be matched
to the hardware configuration and application's needs. The systems
integrator and administrator can configure the system, without changing
the architecture or user interaction.  We make this easy with
single-point driver installation (kernel drivers exist only on the
master) and single-point, one-time administration (/etc/beowulf/fstab
supports macros, the "cluster" netgroups and per-node exceptions).

While our base slave node architecture is diskless, we recommend
   A local disk for swap space and a node-local filesystem.
   NFS mounting master:/home
   PVFS for large temporary intermediate files.
   Sistina GFS with a fibre channel SAN for consistent databases.

The reasoning behind this is that
    Local disk is the least expensive bandwidth you can get.
       But it has a version skew problem if you use it for persistent data.
    NFS (especially v2) is great for _reading_ small (<8KB) configuration
       files.  But avoid it for writing, executables and large files.
    PVFS is the world's fastest cluster file system, but only works well
      for carefully laid out large files.
    GFS is great for transaction consistency and general purpose
      large-site storage, but there is an (inevitable?) $ and
      scalability/performance cost for its semantics.
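
The reasoning above can be read as a rough decision rule. What follows is
only a minimal sketch of that rule in Python, not anything Scyld ships.
The Workload fields, the pick_storage name, the order of the checks, the
64 MiB "large file" cutoff, and the fallback to GFS for other persistent
data are all assumptions made for illustration; only the <8KB figure comes
from the text.

```python
from dataclasses import dataclass

SMALL_FILE_LIMIT = 8 * 1024          # the "<8KB" figure quoted above
LARGE_FILE_MIN = 64 * 1024 * 1024    # hypothetical cutoff for "large" files


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one kind of file traffic on a node."""
    size_bytes: int
    writes: bool
    persistent: bool
    needs_transactions: bool = False


def pick_storage(w: Workload) -> str:
    """Return 'local', 'nfs', 'pvfs' or 'gfs' for a workload (a sketch only)."""
    # Transaction consistency is the one thing only GFS is recommended for.
    if w.needs_transactions:
        return "gfs"
    # Temporary data avoids the version skew problem of local disk:
    # large intermediate files go to PVFS, everything else stays node-local.
    if not w.persistent:
        return "pvfs" if w.size_bytes >= LARGE_FILE_MIN else "local"
    # NFS is recommended only for reading small configuration files.
    if not w.writes and w.size_bytes < SMALL_FILE_LIMIT:
        return "nfs"
    # Assumption: remaining persistent data lands on general-purpose storage.
    return "gfs"


if __name__ == "__main__":
    print(pick_storage(Workload(4096, writes=False, persistent=True)))
    print(pick_storage(Workload(1 << 30, writes=True, persistent=False)))
```

A real site would tune the cutoffs and the fallback to its own hardware;
the point is only that the choice is made per workload, not per cluster.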

Bottom line: until we have the do-it-all cluster filesystem, we have to
provide a reasonable default and interchangeable tools to do the job.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993