[Beowulf] 512 nodes Myrinet cluster Challenges
hahn at physics.mcmaster.ca
Fri Apr 28 05:04:53 PDT 2006
> Does anyone know what types of problems/challenges come up for big clusters?
cooling, power, manageability, reliability, delivering IO, space.
> we are considering having a 512 node cluster that will be using
> Myrinet as its main interconnect, and would like to do our homework
how confident are you about addressing the issues above, especially the physical ones?
cooling and power happen to be prominent in my awareness right now because
of a 768-node cluster I'm working on. but even ~200 node clusters need to
have some careful thought applied to manageability (cleaning up dead jobs,
making sure the scheduler doesn't let jobs hang around consuming Myrinet
ports, for instance.) reliability is a fairly cut and dried issue, IMO -
either you make the right hardware decisions at purchase, or not.
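
to make the dead-job cleanup concrete, here is a minimal sketch, not anyone's production tool. it assumes you can already get, per node, a list of running processes and the set of users the scheduler says have a job there. the names Proc, SYSTEM_USERS and find_stray_processes are hypothetical, and so is the list of system accounts.

```python
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    """One process on a compute node (hypothetical record layout)."""
    pid: int
    user: str
    command: str


# Assumption: accounts allowed to run processes without a scheduler job.
SYSTEM_USERS = frozenset({"root", "daemon", "nobody"})


def find_stray_processes(
    procs: Iterable[Proc],
    users_with_jobs: Set[str],
    system_users: Iterable[str] = SYSTEM_USERS,
) -> List[Proc]:
    """Return processes whose owner has no active job on this node.

    These are candidates for cleanup: leftovers from jobs that died or
    were cancelled but left ranks behind.
    """
    allowed = set(system_users) | set(users_with_jobs)
    return [p for p in procs if p.user not in allowed]


if __name__ == "__main__":
    node_procs = [
        Proc(1, "root", "init"),
        Proc(4021, "alice", "fluid_sim"),
        Proc(4388, "bob", "fluid_sim"),
    ]
    # The scheduler says only alice has a job on this node.
    for p in find_stray_processes(node_procs, {"alice"}):
        print(f"stray: pid={p.pid} user={p.user} cmd={p.command}")
```

in practice you'd want to report first and kill later, since matching on user alone is wrong on nodes that allow interactive logins.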
> The cluster is meant to run an inhouse fluid simulation application
> that is I/O intensive, and requires large memory models.
what parallel cluster filesystem are you planning to run? how many fileservers?
(or can the IO load be handled using per-node disks?)
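
a back-of-envelope way to approach the fileserver count, as a sketch under stated assumptions: the per-node and per-server bandwidth figures below are placeholders, not measurements of any real system, and fileservers_needed is a hypothetical helper. it ignores metadata load, network contention and burstiness.

```python
import math


def fileservers_needed(
    nodes: int,
    per_node_mb_s: float,
    per_server_mb_s: float,
    active_fraction: float = 1.0,
) -> int:
    """Estimate how many fileservers can sustain the aggregate IO demand.

    nodes            -- number of compute nodes
    per_node_mb_s    -- sustained IO each node wants (assumed figure)
    per_server_mb_s  -- sustained IO one fileserver delivers (assumed figure)
    active_fraction  -- share of nodes doing IO at the same moment
    """
    if per_server_mb_s <= 0:
        raise ValueError("per_server_mb_s must be positive")
    aggregate_mb_s = nodes * per_node_mb_s * active_fraction
    return math.ceil(aggregate_mb_s / per_server_mb_s)


if __name__ == "__main__":
    # Placeholder numbers: 512 nodes at 10 MB/s each, 200 MB/s per server.
    print(fileservers_needed(512, 10.0, 200.0))        # all nodes at once
    print(fileservers_needed(512, 10.0, 200.0, 0.25))  # a quarter at once
```

if the answer comes out uncomfortably large, that is an argument for looking at per-node disks for scratch IO.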