[Beowulf] 512 nodes Myrinet cluster Challenges

Mark Hahn hahn at physics.mcmaster.ca
Fri Apr 28 05:04:53 PDT 2006

> Does anyone know what types of problems/challenges arise for big clusters?

cooling, power, manageability, reliability, delivering IO, space.

> we are considering having a 512 node cluster that will be using
> Myrinet as its main interconnect, and would like to do our homework

how confident are you about addressing the issues above, especially the
physical ones?  cooling and power happen to be prominent in my awareness
right now because of a 768-node cluster I'm working on.  but even ~200-node
clusters need some careful thought applied to manageability (cleaning up
dead jobs, or making sure the scheduler doesn't let jobs hang around
consuming myrinet ports, for instance.)  reliability is a fairly
cut-and-dried issue, IMO - either you make the right hardware decisions at
purchase, or you don't.
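
to make the "cleaning up dead jobs" point concrete, here is a minimal sketch
of the kind of per-node check involved.  it is an illustration, not any
particular scheduler's mechanism: the function name, the tuple layout, and
the uid cutoff for system accounts are all assumptions, and a real setup
would get the process list from /proc and the active users from whatever
scheduler is in use.

```python
def find_stray_processes(processes, active_users, system_uid_max=999):
    """Return (pid, user) pairs for user processes with no job on this node.

    processes:      iterable of (pid, uid, username) tuples (hypothetical
                    layout; in practice gathered from /proc or ps)
    active_users:   set of usernames the scheduler reports as having a job
                    running on this node (hypothetical input)
    system_uid_max: uids at or below this are treated as system accounts
                    and skipped (an assumption; the cutoff varies by distro)
    """
    strays = []
    for pid, uid, user in processes:
        if uid <= system_uid_max:
            continue  # leave daemons and root-owned processes alone
        if user not in active_users:
            strays.append((pid, user))  # leftover from a dead or finished job
    return strays


# made-up example data: alice has a job on this node, bob does not
procs = [(1, 0, "root"), (4021, 1001, "alice"), (4388, 1002, "bob")]
print(find_stray_processes(procs, {"alice"}))
```

the sketch only reports candidates; whether to kill them automatically (and
how to release any interconnect ports they still hold) is a site policy
decision.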

> The cluster is meant to run an in-house fluid simulation application
> that is I/O intensive, and requires large memory models.

what parallel cluster filesystem are you planning to run?  how many fileservers?
(or can the IO load be handled using per-node disks?)
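
a rough way to approach the "how many fileservers" question is to compare
aggregate demand against what one server can sustain.  the sketch below is
back-of-envelope arithmetic only; the function name is made up, and the
per-node and per-server rates in the example are placeholder numbers, not
measurements of this application or of any real hardware.

```python
import math


def fileservers_needed(nodes, mb_per_s_per_node, mb_per_s_per_server):
    """Estimate the fileserver count from aggregate sustained IO demand.

    Assumes all nodes do IO at the same time and that server throughput
    scales linearly, both of which are simplifications.
    """
    total_demand = nodes * mb_per_s_per_node  # MB/s across the whole cluster
    return math.ceil(total_demand / mb_per_s_per_server)


# placeholder numbers: 512 nodes at 10 MB/s each, 400 MB/s per server
print(fileservers_needed(512, 10, 400))  # 5120 / 400 = 12.8, rounds up to 13
```

if the per-node rate turns out to be small, or the data is mostly scratch,
local disks may be the simpler answer.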
