[Beowulf] scratch File system for small cluster
landman at scalableinformatics.com
Thu Sep 25 07:19:26 PDT 2008
Glen Beane wrote:
> I am considering adding a small parallel file system (~5-10TB) to my small
> cluster (~32 2x dual core Opteron nodes) that is used mostly by a handful of
> regular users. Currently the only storage accessible to all nodes is home
> directory space which is provided by the Lab's IT department (this is a SAN
> volume connected to the head node by 2x FC links, and NFS exported to the
> compute nodes). I don't have to "worry" about the IT provided SAN space -
> they back it up, provide redundant hardware, etc. The parallel file system
> would be scratch space (and not backed up by IT). We have a mix of home
> grown apps doing a pretty wide range of things (some do a lot of I/O, others
> don't), and things like BLAST and BLAT.
BLAST uses mmap'ed IO. This has some interesting ... interactions
... with parallel file systems.
> Can anyone out there provide recommendations for a good solution for fast
> scratch space for a cluster of this size?
Yes, but we are biased, as this is in part what we
design/build/sell/support. Link in .sig.
> Right now I was thinking about PVFS2. How many I/O servers should I have,
> and how many cores and RAM per I/O server?
It turns out that PVFS2 sadly has a significant problem with BLAST
and mpiBLAST due to the mmap'ed files. We found this out when trying
to help a customer with a small tier-1 cluster deal with file system
instability. We saw this in PVFS2 2.6.9 and 2.7.0, on both 32 and 64 bit
platforms. The customer was going to update the PVFS2 group; I haven't
heard whether they have had a chance to trace this down and fix it
(I don't think it is a priority, as BLAST doesn't use MPI-IO,
which PVFS2 is quite good at).
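For anyone who wants to check how a candidate scratch file system copes
with mmap'ed reads before putting BLAST on it, a minimal sketch along
these lines may help. It is not BLAST's actual access pattern and not a
reproduction of the PVFS2 problem; it only writes a file, maps it
read-only, and compares the mapped bytes with an ordinary read(). The
function name, file name and size are made up for illustration.

```python
import mmap
import os
import tempfile


def mmap_matches_read(directory, size=4 * 1024 * 1024):
    """Write a test file in `directory`, then compare mmap'ed and read() views.

    Hypothetical helper for illustration only; point `directory` at the
    scratch mount you want to exercise.
    """
    path = os.path.join(directory, "mmap_smoke_test.bin")
    payload = os.urandom(size)
    try:
        # Write the payload and force it out to the file system.
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Read it back twice: once with read(), once through a read-only map.
        with open(path, "rb") as f:
            via_read = f.read()
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                via_mmap = m[:]
        return via_read == payload and via_mmap == payload
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # tempfile.gettempdir() is only a stand-in; use the scratch mount instead.
    print(mmap_matches_read(tempfile.gettempdir()))
```

Running this from several compute nodes at once, each against its own
directory on the shared mount, would be a closer (though still crude)
approximation of a cluster workload.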
> Are there other recommendations for fast scratch space (it doesn't have to
> be a parallel file system, something with less hardware would be nice)
Pure software: GlusterFS currently, Ceph in the near future. GFS won't
give you very good performance (metadata shuttling limits what you can
do). You could go Lustre, but then you need to build MDS/OSS setups, so
this is hybrid.
Pure hardware: Panasas (awesome kit, but not for the light-of-wallet),
DDN, BlueArc (same comments for these as well).
Reasonable cost HW with good performance: us and a few others. Put any
parallel FS atop this, or pure NFS. We have measured NFS over RDMA at
460 MB/s (on SDR IB at that), on an RDMA adapter reporting 750 MB/s
(in a 4x PCIe slot, so ~860 MB/s is the most we should expect for this).
Faster IB hardware should result in better performance, though you
still have to walk through the various software stacks, and they ...
remove efficiency ... (nice PC way to say that they slow things down a
bit :( )
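
To put the figures quoted above in perspective, here is a small
arithmetic sketch. It uses only the three numbers from this message
(460, 750 and ~860 MB/s); the function and constant names are made up
for illustration.

```python
def efficiency(measured_mb_s, ceiling_mb_s):
    """Return measured throughput as a fraction of a given ceiling."""
    return measured_mb_s / ceiling_mb_s


# Figures taken from the message; names are hypothetical.
NFS_RDMA_MB_S = 460.0   # measured NFS over RDMA on SDR IB
ADAPTER_MB_S = 750.0    # what the RDMA adapter itself reports
PCIE_X4_MB_S = 860.0    # approximate ceiling of the 4x PCIe slot

if __name__ == "__main__":
    print(f"NFS over RDMA vs adapter: {efficiency(NFS_RDMA_MB_S, ADAPTER_MB_S):.0%}")
    print(f"adapter vs PCIe slot:     {efficiency(ADAPTER_MB_S, PCIE_X4_MB_S):.0%}")
    print(f"NFS over RDMA vs slot:    {efficiency(NFS_RDMA_MB_S, PCIE_X4_MB_S):.0%}")
```

That works out to roughly 61%, 87% and 53% respectively, so most of the
loss is in the software stack above the adapter rather than in the slot.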
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615