[Beowulf] 512 nodes Myrinet cluster Challenges

Dan Stromberg strombrg at dcs.nac.uci.edu
Tue May 2 14:44:05 PDT 2006


I think IPMI sounds pretty worthwhile, although I don't have any
first-hand experience with it yet.  The ability to reboot a hung system,
or to get decent statistics about this, that and the other thing, seems
worth the cost in many cases, and my management has decided to require
it on all of our new internal equipment.

On the other hand, if you don't have the $$ for IPMI, some of you might
find a program I wrote of interest:
http://dcs.nac.uci.edu/~strombrg/fallback-reboot/  It allows you to
reboot a system remotely, even if that system's disk or filesystem
layers are temporarily messed up.  It does this by mlockall()'ing itself
into physical memory and reading any data it may need up front, rather
than at reboot time.  Usually, if the machine is pingable, you can reboot
it with fallback-reboot.
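
For anyone curious what "lock yourself into RAM and read everything
early" can look like, here is a rough sketch in Python.  It is not the
actual fallback-reboot code; the function names lock_process_memory()
and preload() are made up for illustration, and the MCL_* values are
the usual Linux ones on x86 and ARM, so check <sys/mman.h> on other
platforms.  mlockall() normally needs root or a generous memlock limit.

```python
import ctypes
import os

# Assumed Linux values from <sys/mman.h> on x86/ARM; other platforms may differ.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_process_memory():
    """Try to pin all current and future pages in RAM; return True on success."""
    try:
        # CDLL(None) exposes the C library symbols already loaded in this process.
        libc = ctypes.CDLL(None, use_errno=True)
        rc = libc.mlockall(MCL_CURRENT | MCL_FUTURE)
    except (OSError, AttributeError):
        return False  # no usable libc or no mlockall() on this platform
    if rc != 0:
        print("mlockall failed:", os.strerror(ctypes.get_errno()))
        return False
    return True


def preload(paths):
    """Read each file fully into memory now, so no disk access is needed later."""
    cache = {}
    for path in paths:
        try:
            with open(path, "rb") as handle:
                cache[path] = handle.read()
        except OSError:
            cache[path] = None  # missing or unreadable; record that and move on
    return cache
```

A daemon built this way would call lock_process_memory() once at
startup, then preload() whatever files it expects to need, and only
after that start listening for reboot requests.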

On Tue, 2006-05-02 at 14:02 -0700, Bill Broadley wrote:
> Mark Hahn said:
> > moving it, stripped them out as I didn't need them.  (I _do_ always require
> > net-IPMI on anything newly purchased.)  I've added more nodes to the cluster
> 
> Net-IPMI on all hardware?  Why? Running a second (or 3rd) network isn't
> a trivial amount of additional complexity, cables, or cost.  What do
> you figure you pay extra on the nodes (many vendors charge to add IPMI,
> sun, tyan, supermicro, etc), cables, switches, etc.  As a data point on
> a x2100 I bought recently the IPMI card was $150.
> 
> Collecting fan speeds and temperatures in-band seems reasonable,
> after all much of the data you want to collect isn't available via IPMI
> anyways (cpu utilization, memory, disk I/O, etc.).
> 
> Upgrading a 208V 3-phase PDU to a switched PDU seems like it costs on the
> order of $30 per node list.  As a side benefit you get easy to query
> load per phase.  The management network ends up being just one network
> cable per PDU (usually 2-3 per rack).
> 
> After dealing with a few clusters with PDUs in the airflow blocking
> airflow and physical access to parts of the node I now specify the
> zero-u variety that are outside the airflow.
> 



