[Beowulf] scratch File system for small cluster

Joe Landman landman at scalableinformatics.com
Thu Sep 25 07:19:26 PDT 2008


Glen Beane wrote:
> I am considering adding a small parallel file system (~5-10TB) to my small
> cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
> regular users.  Currently the only storage accessible to all nodes is home
> directory space which is provided by the Lab's IT department (this is a SAN
> volume connected to the head node by 2x FC links, and NFS exported to the
> compute nodes). I don't have to "worry" about the IT provided SAN space -
> they back it up, provide redundant hardware, etc.  The parallel file system
> would be scratch space (and not backed up by IT).  We have a mix of home
> grown apps doing a pretty wide range of things (some do a lot of I/O, others
> don't), and things like BLAST and BLAT.

Hi Glen:

   BLAST uses mmap'ed IO.  This has some interesting ... interactions 
... with parallel file systems.
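
To make the mmap point concrete, here is a minimal Python sketch (not 
BLAST's actual code; the function names are hypothetical) that scans a 
file through a memory mapping and, for comparison, through explicit 
read() calls.  With mmap the kernel pages data in on demand as the 
mapping is touched, so the file system sees page-fault driven access 
rather than application-sized read requests.

```python
import mmap


def count_byte_mmap(path, byte):
    """Count occurrences of a one-byte pattern by scanning a memory-mapped file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ gives a read-only mapping.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(byte)
            while pos != -1:
                count += 1
                pos = m.find(byte, pos + 1)
            return count


def count_byte_read(path, byte, chunk=1 << 20):
    """Same count, using explicit buffered read() calls of `chunk` bytes."""
    count = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            count += buf.count(byte)
    return count
```

Both functions return the same count for the same file; the difference 
is only in how the bytes reach the process, which is where a parallel 
file system's mmap support comes into play.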

> 
> Can anyone out there provide recommendations for a good solution for fast
> scratch space for a cluster of this size?

   Yes, but we are biased, as this is in part what we 
design/build/sell/support.  Link in the .sig.

> Right now I was thinking about PVFS2. How many I/O servers should I have,
> and how many cores and RAM per I/O server?

   It turns out that PVFS2 sadly has a significant problem with BLAST 
and mpiBLAST due to the mmap'ed files.  We found this out when trying 
to help a customer with a small tier-1 cluster deal with file system 
instability.  We saw this in PVFS2 2.6.9 and 2.7.0, on both 32 and 64 bit 
platforms.  The customer was going to update the PVFS2 group; I haven't 
heard whether they have had a chance to trace this down and fix it 
(I don't think it is a priority, as BLAST doesn't use MPI-IO, which 
PVFS2 is quite good at).

> Are there other recommendations for fast scratch space (it doesn't have to
> be a parallel file system, something with less hardware would be nice)

Pure software:  GlusterFS currently, Ceph in the near future.  GFS won't 
give you very good performance (metadata shuttling limits what you can 
do).  You could go Lustre, but then you need to build MDS/OSS setups, so 
this is hybrid.

Pure hardware:  Panasas (awesome kit, but not for the light-of-wallet), 
DDN, BlueArc (same comments for these as well).

Reasonable cost HW with good performance:  us and a few others.  Put any 
parallel FS atop this, or pure NFS.  We have measured NFS over RDMA speeds 
(on SDR IB at that) of 460 MB/s, on an RDMA adapter reporting 750 MB/s 
(in a 4x PCIe slot, so ~860 MB/s max is what we should expect for this).  
Faster IB hardware should result in better performance, though you 
still have to walk through the various software stacks, and they ... 
remove efficiency ... (nice PC way to say that they slow things down a 
bit :( )
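
As a rough illustration of how much each layer costs, the figures above 
can be turned into efficiency ratios.  This is only a sketch: the three 
throughput numbers come from the post, while the PCIe 1.x per-lane rate 
(250 MB/s per lane after 8b/10b encoding) is an assumption about the 
slot generation, which the post does not state.

```python
# Figures quoted in the post (MB/s).
NFS_OVER_RDMA = 460.0   # measured NFS over RDMA throughput on SDR IB
RAW_RDMA = 750.0        # throughput reported by the RDMA adapter itself
SLOT_EXPECTED = 860.0   # approximate usable maximum for the 4x PCIe slot

# Assumption (not from the post): a PCIe 1.x slot, 250 MB/s per lane.
PCIE1_LANE_MBPS = 250.0
LANES = 4


def efficiency(achieved, ceiling):
    """Fraction of the ceiling that was actually achieved."""
    return achieved / ceiling


raw_slot = PCIE1_LANE_MBPS * LANES  # theoretical x4 slot bandwidth

if __name__ == "__main__":
    print(f"NFS over RDMA vs raw RDMA:   {efficiency(NFS_OVER_RDMA, RAW_RDMA):.0%}")
    print(f"NFS over RDMA vs slot limit: {efficiency(NFS_OVER_RDMA, SLOT_EXPECTED):.0%}")
    print(f"Raw RDMA vs slot limit:      {efficiency(RAW_RDMA, SLOT_EXPECTED):.0%}")
    print(f"Slot limit vs raw x4 lanes:  {efficiency(SLOT_EXPECTED, raw_slot):.0%}")
```

Under these numbers, the NFS layer delivers a bit over 60% of what the 
RDMA adapter reports, which is the "removed efficiency" of the software 
stack.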

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


