[Beowulf] Need recommendation for a new 512 core Linux cluster

Bill Broadley bill at cse.ucdavis.edu
Wed Nov 7 16:29:52 PST 2007


Steven Truong wrote:
> Hi, all.  I would like to know for that many cores, what kind of file
> system should we go with? 

That would depend on how you plan to use it, how big it is, how much
money you have, local expertise, performance expectations, reliability
expectations, expected parallelism, average file size, locking requirements
(POSIX? byte-range? relaxed? none), usage (MPI-IO? database? multi-writer?
append? put/get?), and other variables.

> Currently we have a couple of clusters with
> around 100 cores and NFS seems to be ok but not great.

Quantifying "not great" would be useful.  Quantifying exactly what bandwidth
or latency your I/O system delivers under the workloads you actually run is
even better.  Even something as crude as looking at the ifconfig packet and
byte counters can be useful.  Many cluster packages (Rocks and others) include
Ganglia, which lets you look at things like network bandwidth used per hour,
week, or month.  If those graphs happen to include your NAS nodes, all the
better.
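As a minimal sketch (assuming a Linux NFS server, the standard /proc/net/dev
format, and an interface named eth0 -- nothing specific to your setup), you
can turn those counters into a rough throughput number without installing
anything:

#!/usr/bin/env python
# Sample the kernel's per-interface byte counters twice and report average
# throughput over the interval -- the same numbers ifconfig shows, as a rate.
import time

def rx_tx_bytes(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                # field 0 = receive bytes, field 8 = transmit bytes
                return int(fields[0]), int(fields[8])
    raise ValueError("interface %s not found" % iface)

def throughput(iface="eth0", interval=10.0):
    rx0, tx0 = rx_tx_bytes(iface)
    time.sleep(interval)
    rx1, tx1 = rx_tx_bytes(iface)
    mb = 1024.0 * 1024.0
    print("%s: rx %.1f MB/s, tx %.1f MB/s over %.0fs"
          % (iface, (rx1 - rx0) / interval / mb,
             (tx1 - tx0) / interval / mb, interval))

if __name__ == "__main__":
    throughput()

Run it on the NAS box while a typical job mix is going; if the numbers are
nowhere near the wire or disk limits, NFS itself may not be your problem.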

>  We definitely
> need to put in place a parallel file system for this new cluster and I
> do not know which one I should go with?  Lustre, GFS, PVFS2 or what
> else?  Could you share your experiences regarding this aspect?

This has come up multiple times; I suggest looking at the Beowulf list archives.

If you are writing your own applications, I'd look at Hadoop/GoogleFS if
put/get-style behavior looks like a good fit.  http://www.dcache.org/ looks
rather promising for some workloads as well, and (corrections welcome) offers
NFS 4.1 protocol support, so clients don't have to know the details of the
distributed magic at work.

In general I'd recommend figuring out your performance and reliability
requirements FIRST, then deciding whether you really need a parallel
filesystem.  It's not rocket science these days to put 16-48 disks in a
single box with a few 600-800 MB/sec RAID controllers and a 10GbE uplink
to a switch, which gives reasonable performance for many uses.
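A quick back-of-envelope calculation (the numbers below are illustrative
assumptions, not measurements or a recommendation) shows what a single
well-built box can hand each core:

# How much sequential I/O does each of 512 cores get from one NFS box?
cores = 512
uplink_MBps = 1250           # one 10GbE link at wire speed, ignoring overhead
raid_MBps = 3 * 700          # e.g. three RAID controllers at ~700 MB/s each

box_MBps = min(uplink_MBps, raid_MBps)   # the uplink is the bottleneck here
print("aggregate: %d MB/s" % box_MBps)
print("per core if all 512 stream at once: %.1f MB/s"
      % (box_MBps / float(cores)))
# If only a fraction of the jobs do heavy I/O at any one time, the per-core
# number scales up accordingly -- which is why measuring first matters.

Whether ~2.4 MB/s per core under a full-cluster streaming load is a disaster
or a non-issue depends entirely on the workload, which is the point.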

The biggest pitfall I've seen with Lustre and PVFS2 (corrections welcome)
is that they depend on the reliability of the underlying hardware, so you
end up either feeling the pain when a storage node dies, or hooking pairs
of storage nodes up to one array and handling failover (and hoping the
array doesn't die).

ceph.sourceforge.net looks interesting in that it doesn't depend on reliable
storage nodes, but it doesn't look production-ready quite yet (corrections
welcome).

> I also would like to know how many head nodes should I need to manage
> jobs and queues.  And what else should I have to worry about?

I'd say one for the jobs and queue; 512 cores is usually only on the order
of 64-128 nodes, which is not particularly hard to keep track of unless you
have some strange setup like a million one-second jobs you want to submit
hourly, or something.  Again, more detail is needed.

> 
> Thank you very much for sharing any experiences.



