[Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes

Wed Sep 2 20:54:17 PDT 2009

On Wed, 2 Sep 2009 at 10:29pm, Rahul Nabar wrote

> That brings me to another important question. Any hints on speccing
> the head-node? Especially the kind of storage I put in on the head
> node. I need around 1 Terabyte of storage. In the past I've used
> RAID5+SAS in the server. Mostly for running jobs that access their I/O
> via files stored centrally.
>
> For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I
> boost the RAM up? Or any other comments. It is tricky to spec the
> central node.
>
> Or is it more advisable to go for a storage box external to the server
> for NFS stores and then figure out a fast way of connecting it to the
> server? Fiber perhaps?

Speccing storage for a 300 node cluster is a non-trivial task and is 
heavily dependent on your expected access patterns.  Unless you anticipate 
vanishingly little concurrent access, you'll be very hard pressed to 
service a cluster that large with a basic Linux NFS server.  About a year 
ago I had ~300 nodes pointed at a NetApp FAS3020 with 84 spindles of 10K 
RPM FC-AL disks.  A single user could *easily* flatten the NetApp (read: 
100% CPU and multi-second/minute latencies for everybody else) without 
even using the whole cluster.

Whatever you end up with for storage, you'll need to be vigilant regarding 
user education.  Jobs should store as much in-process data as they can on 
the nodes (assuming you're not running diskless nodes) and large jobs 
should stagger their access to the central storage as best they can.
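
To make the "work locally, stagger the central access" advice concrete, 
here is a minimal Python sketch.  Everything in it is an assumption for 
illustration: the function names (staggered_delay, run_with_local_scratch), 
the use of the node's default temp directory as local scratch, and the idea 
that each task knows its own index and the total task count.  It is not 
tied to any particular scheduler or to how any real site does this.

```python
import os
import random
import shutil
import tempfile
import time


def staggered_delay(task_index, window_seconds, num_tasks, jitter=0.1, rng=random):
    """Return how long a task should wait before touching central storage.

    The window is split into num_tasks equal slots; each task gets its own
    slot plus a small random jitter so tasks do not all hit the server at once.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    slot = window_seconds / num_tasks
    base = (task_index % num_tasks) * slot
    return base + rng.uniform(0, jitter * slot)


def run_with_local_scratch(work, central_dir, result_name, delay=0.0, scratch_root=None):
    """Run work() against node-local scratch, then copy one result file back.

    work is a hypothetical callable that writes its output to the path it is
    given.  scratch_root=None means "use the system default temp directory".
    """
    scratch = tempfile.mkdtemp(prefix="job-", dir=scratch_root)
    try:
        local_result = os.path.join(scratch, result_name)
        work(local_result)            # all intermediate I/O stays on the node
        time.sleep(delay)             # wait for this task's slot
        dest = os.path.join(central_dir, result_name)
        shutil.copy2(local_result, dest)  # single write to central storage
        return dest
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
```

A job would call staggered_delay() with whatever task index its scheduler 
hands it, then pass the result as delay to run_with_local_scratch().  With 
300 tasks and a 300 second window, each task gets roughly a one second 
slot, so the central server sees a steady trickle rather than 300 
simultaneous writers.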

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF