[Beowulf] integrating node disks into a cluster filesystem?

Fri Sep 25 15:09:11 PDT 2009

Hi all,
I'm sure you've noticed that disks are incredibly cheap, obscenely large and
remarkably fast (at least in bandwidth).  the "cheap" part is the only one 
of these that's really an issue, since the question becomes: how to keep
storage infrastructure cost (overhead) from dominating the system cost?

the backblaze people took a great swing at this - their solution is really 
centered on the 5-disk port-multiplier backplanes.  (I would love to hear
from anyone who has experience with PMs, btw.)

but since 1U nodes are still the most common HPC building block, and most 
of them support 4 LFF SATA disks with very little added cost (esp using the 
chipset's integrated controller), is there a way to integrate them into a 
whole-cluster filesystem?

- obviously want to minimize the interference of remote IO to a node's jobs.
   for serial jobs, this is almost moot.  for loosely-coupled parallel jobs
   (whether threaded or cross-node), this is probably non-critical.  even for
   tightly-coupled jobs, perhaps it would be enough to reserve a core for
   admin/filesystem overhead.

- iscsi/ataoe approach: export the local disks via a low-level block protocol
   and raid them together on dedicated fileserving node(s).  not only does
   this address the probability of node failure, but a block protocol might
   be simple enough to avoid deadlock (ie, job does IO, allocating memory for
   pagecache then network packets, which may by chance wind up triggering
   network activity back to the same node, and more allocations for the
   underlying disk IO.)

- distributed filesystem (ceph?  gluster?  please post any experience!)  I
   know it's possible to run oss+ost services on a lustre client, but not
   recommended because of the deadlock issue.

- this is certainly related to more focused systems like google/mapreduce.
   but I'm mainly looking for more general-purpose clusters - the space would
   be used for normal files, and definitely mixed read/write with something
   close to normal POSIX semantics...
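
to make the "reserve a core" idea from the first point concrete, here is a
rough sketch of how a node could split its cores between jobs and
admin/filesystem work on linux.  the names split_cores and pin are made up
for illustration, and handing the highest-numbered core(s) to the admin side
is an arbitrary assumption.  note that this only confines user processes:
kernel-side work (interrupts, pagecache writeback) is not bound by it, so it
is at best a partial answer.

```python
import os


def split_cores(cpus, reserved=1):
    """Split CPU ids into (job_cpus, admin_cpus).

    Hypothetical helper: the highest-numbered `reserved` cores are set
    aside for admin/filesystem overhead, the rest are left to jobs.
    """
    ordered = sorted(cpus)
    if not 0 < reserved < len(ordered):
        raise ValueError("need at least one core on each side")
    return set(ordered[:-reserved]), set(ordered[-reserved:])


def pin(pid, cpus):
    """Restrict process `pid` (0 means the caller) to `cpus`. Linux only."""
    os.sched_setaffinity(pid, cpus)
```

a job launcher could call split_cores(os.sched_getaffinity(0)) once per node,
then pin(job_pid, job_cpus) for each job and start the block-export or
filesystem daemon after pin(0, admin_cpus).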

thanks, mark hahn.