[Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance?

Wed Sep 9 23:24:13 PDT 2009

Mark Hahn wrote:
>> Our new cluster aims to have around 300 compute nodes. I was wondering
>> what is the largest setup people have tested NFS with? Any tips or
> 
> well, 300 is no problem at all.  though if you're talking to a single
> Gb-connected server, you can't hope for much BW per node...

True.  Then again in the days of 20+ GB/sec memory systems, 8GB/sec pci-e
busses, fast raid controllers and quad ethernet pci-e/motherboards there's
not much reason to connect a single GigE to a file server these days.  Hell
even many of the cheap 1U systems come with quad ethernet.  Sun ships it on
many (all?) of their systems.  I was just looking at a single socket xeon
lynnfield board with 4 GigE from supermicro.  I suspect the supermicro quad
port motherboard costs less than an intel quad port pci-e card ($422 at newegg).

If you have 4 48 port GigE switches connected via the interconnect cable it
would make sense to give each switch a subnet and direct connect one GigE per
switch.  Of course file servers themselves are cheap, at anything much more
than 16 disks I usually build multiple servers.  In the comparisons I've made
so far 3 16 disk servers compare rather well against a single 48 disk server.
Usually cheaper as well, granted it does take 9U instead of 3-4U.
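
A minimal sketch of the subnet-per-switch idea, in Python. The addressing is purely hypothetical (one /24 per switch, with the file server's NIC at .1 on each subnet); the point is only that a node can work out which server address is local to its own switch, so NFS traffic never crosses the switch interconnect.

```python
import ipaddress

# Hypothetical addressing: one /24 per 48-port switch, and the file
# server has one GigE port at host address .1 on each of those subnets.
SWITCH_SUBNETS = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(4)]


def server_address_for(node_ip):
    """Return the file server address on the same subnet as the node."""
    addr = ipaddress.ip_address(node_ip)
    for net in SWITCH_SUBNETS:
        if addr in net:
            # First host address on the subnet is the server's local port.
            return str(net.network_address + 1)
    raise ValueError(f"{node_ip} is not on any known switch subnet")


if __name__ == "__main__":
    print(server_address_for("10.0.2.37"))
```

A node at 10.0.2.37 would then mount from 10.0.2.1, the server port plugged into its own switch.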

Granted without a fancy parallel file system you can't load balance file
servers or network links.  Having more file servers and multiple uplinks per
server can certainly substantially improve your average throughput.  With that
said 3 16 disk servers make a good building block for a parallel file system
of your choice if you change your mind later on.

Often we have different groups of users contributing to a cluster and
politically it's nice to separate them off on their own server that by
definition gets 100% of the disk they paid for, and their use of their uplink
or disk doesn't affect the other file servers.  This gives said group
options that they wouldn't have with a larger shared server.

>> comments? There seems no way for me to say if it will scale well or
>> not.
> 
> it's not too hard to figure out some order-of-magnitude bandwidth
> requirements.  how many nodes need access to a single namespace at
> once?  do jobs drop checkpoints of a known size periodically?
> faster/more ports on a single NFS server gets you fairly far (hundreds
> of MB/s), but you can also aggregate across multiple NFS
> servers (if you don't need all the IO in a single directory...)

Indeed, this kind of thing works quite well for us with 180-230 node clusters.
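
To put rough numbers on that kind of estimate, here is a back-of-envelope sketch in Python. Every input is an illustrative assumption (300 nodes, a made-up 1000 MB checkpoint, raw GigE line rate with no protocol overhead), and it assumes load spreads evenly across servers and uplinks, which plain NFS won't do for you. Treat the results as best-case bounds, not predictions.

```python
# Back-of-envelope NFS bandwidth estimate. All inputs are illustrative
# assumptions, not measurements from any real cluster.

GIGE_MB_PER_S = 125.0  # 1 Gb/s is 125 MB/s raw, before protocol overhead


def per_node_share(nodes, servers, uplinks_per_server,
                   link_mb_per_s=GIGE_MB_PER_S):
    """MB/s per node if every node hits the servers at the same time."""
    aggregate = servers * uplinks_per_server * link_mb_per_s
    return aggregate / nodes


def checkpoint_seconds(nodes, checkpoint_mb, servers, uplinks_per_server,
                       link_mb_per_s=GIGE_MB_PER_S):
    """Seconds for all nodes to write one checkpoint, uplink-limited only."""
    aggregate = servers * uplinks_per_server * link_mb_per_s
    return nodes * checkpoint_mb / aggregate


if __name__ == "__main__":
    # One server on a single GigE link vs. three servers with four links each.
    print(per_node_share(300, 1, 1))
    print(per_node_share(300, 3, 4))
    # Hypothetical 1000 MB checkpoint written by every node at once.
    print(checkpoint_seconds(300, 1000, 1, 1))
    print(checkpoint_seconds(300, 1000, 3, 4))
```

Disk throughput, NFS overhead and uneven load will all make the real numbers worse, but the ratio between the two layouts is the useful part.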

>> Assume each of my compute nodes have gigabit ethernet AND I specify
>> the switch such that it can handle full line capacity on all ports.
> 
> but why?  your fileservers won't support saturating all nodes' links at
> once, so why a full-bandwidth fabric?  the fabric backbone only needs to

Agreed, it's a waste unless MPI or related needs it.

> match the capacity of the storage (I'd guess 10G would be reasonable,

I've been looking at 10G for NFS/file servers (we already use sdr and ddr for
MPI), but so far the cost seems to favor not putting too many disks in a
single box and using more than one uplink.  So instead of 1 48 disk file
server with a 10G uplink we end up with 3 16 disk servers with 4 GigE uplinks
each.  It also makes each individual file server less mission
critical.  I'd be a bit nervous if an expensive cluster depended on a single
piece of equipment.  My theory goes: if you have just one, you spent too much on
that piece of equipment.  That way we leave the switch <-> switch connections
entirely for MPI and not for trying to spread the single 10G connection for
all I/O across 4 switches.
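
The trade-off between the two layouts can be written down as a small Python sketch. The `Layout` class and its fields are hypothetical names for illustration; the only inputs are the disk and uplink counts mentioned above, with nominal link speeds and no pricing data.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    """A hypothetical description of one file server layout."""
    servers: int
    disks_per_server: int
    uplinks_per_server: int
    gbit_per_uplink: float

    def total_disks(self):
        return self.servers * self.disks_per_server

    def aggregate_gbit(self):
        # Nominal uplink capacity summed over all servers.
        return self.servers * self.uplinks_per_server * self.gbit_per_uplink

    def fraction_lost_on_one_failure(self):
        # Share of disks and uplinks that disappears if one server dies.
        return 1.0 / self.servers


if __name__ == "__main__":
    one_big = Layout(servers=1, disks_per_server=48,
                     uplinks_per_server=1, gbit_per_uplink=10.0)
    three_small = Layout(servers=3, disks_per_server=16,
                         uplinks_per_server=4, gbit_per_uplink=1.0)
    for layout in (one_big, three_small):
        print(layout.total_disks(), layout.aggregate_gbit(),
              layout.fraction_lost_on_one_failure())
```

Both layouts hold 48 disks; the three-server one has 12 Gb/s of nominal uplink against 10 Gb/s, and losing one box takes out a third of the storage rather than all of it.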

> unless you really ramp up the number of fat fileservers.)
> or do you mean the fabric is full-bandwidth to optimally support MPI?

Hopefully.

>> If not NFS then Lustre etc options do exist. But the more I read about
> 
> yes - I wouldn't resort to Lustre until it was clear that NFS wouldn't do.
> Lustre does a great job of scaling content bandwidth and capacity all
> within a single namespace.  but NFS, even several instances, is indeed
> a lot simpler...

Agreed.  NFS works well in the 200 node range if you aren't too I/O intensive.
Sometimes it's a bit more complicated because you end up staging to a local
disk for better performance.  But it's stable and keeps our users happy.  I'm
watching the parallel file system space closely and would certainly design
around it if we ended up with a significantly larger (or more I/O intensive)
cluster.