[Beowulf] 512-node Myrinet cluster Challenges

David Kewley kewley at gps.caltech.edu
Tue May 2 19:56:43 PDT 2006


On Tuesday 02 May 2006 14:02, Bill Broadley wrote:
> Mark Hahn said:
> > moving it, stripped them out as I didn't need them.  (I _do_ always
> > require net-IPMI on anything newly purchased.)  I've added more nodes
> > to the cluster
>
> Net-IPMI on all hardware?  Why?  Running a second (or 3rd) network adds a
> non-trivial amount of complexity, cabling, and cost.  What do you figure
> you pay extra on the nodes (many vendors charge to add IPMI: Sun, Tyan,
> Supermicro, etc.), cables, switches, etc.?  As a data point, on an X2100
> I bought recently the IPMI card was $150.

On our Dell PE1850s, the IPMI controller (the BMC: Baseboard Management 
Controller) is built on the baseboard, and it piggybacks on one of the 
built-in ethernet ports.  So over one cable, I get GigE to the OS and 100Mb 
to the BMC.  Apparently there's an ethernet switch built into the 
baseboard.

No extra cable, no extra cost.  (Or consider that you're paying for it 
whether you use it or not. :)
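
For anyone curious how that shared port looks from the OS side: the BMC has 
its own IP and MAC, which you can inspect or set with ipmitool once the 
OpenIPMI driver is loaded.  A rough sketch (the channel number and the 
addresses below are just illustrative, not Dell specifics):

  # show the BMC's LAN configuration on channel 1 (in-band, via /dev/ipmi0)
  ipmitool lan print 1

  # give the BMC a static address of its own on the shared port
  ipmitool lan set 1 ipsrc static
  ipmitool lan set 1 ipaddr 10.1.2.101      # placeholder address
  ipmitool lan set 1 netmask 255.255.255.0  # placeholder netmask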

> Collecting fan speeds and temperatures in-band seems reasonable; after
> all, much of the data you want to collect isn't available via IPMI
> anyway (CPU utilization, memory, disk I/O, etc.).

It probably is reasonable, but that's not why I use the BMC / IPMI 
capability.  I use it all the time for the following (rough ipmitool 
equivalents are sketched after the list):

* querying whether the node is powered up or down
* powering up the node
* powering down the node
* power-cycling the node
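
Roughly, those map onto one-line ipmitool calls over the network.  A sketch 
(the hostname, user, and password are placeholders, and depending on the 
BMC you may want -I lanplus instead of -I lan):

  ipmitool -I lan -H node42-bmc -U root -P XXXX chassis power status
  ipmitool -I lan -H node42-bmc -U root -P XXXX chassis power on
  ipmitool -I lan -H node42-bmc -U root -P XXXX chassis power off
  ipmitool -I lan -H node42-bmc -U root -P XXXX chassis power cycle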

I also occasionally use it to get the System Event Log (also known as the 
Embedded System Management log), which tells me about ECC errors, PCI 
errors, processor errors, temperature excursions, fan failures, etc.  
Normally I get the SEL / ESM log via Dell's in-band OpenManage software, 
but if the node is down, I can also get the information using net-IPMI.
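
Same idea for the SEL when the node is dead (again, the host and 
credentials are placeholders):

  ipmitool -I lan -H node42-bmc -U root -P XXXX sel elist

or in-band on a running node, just "ipmitool sel elist".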

I don't use net-IPMI to get fan speeds, temps, etc.  I use OpenManage for 
that, when I care about it.  But I seldom care.  That's in contrast to 
power management, which, as I said, I use *all the time*.

> Upgrading a 208 3phase PDU to a switched PDU seems like it costs on the
> order of $30 per node list.  As a side benefit you get easy to query
> load per phase.  The management network ends up being just one network
> cable per PDU (usually 2-3 per rack).

Yeah, we have APC AP7960s, which have the advanced capabilities you name.  I 
thought I'd network them all, but I've never gotten around to it (except on 
our fileservers, which use the AP7960 for fencing), because the net-IPMI 
methods work so well.  Someday...
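
If I ever do network them, the "query load per phase" part is just an SNMP 
read against APC's enterprise subtree.  I don't have the exact PowerNet MIB 
object name handy, so the lazy version would be something like this (the 
PDU hostname and community string are placeholders):

  snmpwalk -v 1 -c public pdu-r01-a .1.3.6.1.4.1.318

and then pick out the per-phase load/current entries, or load APC's 
PowerNet MIB and use the symbolic names instead.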

The AP7960 is about $650 street, and supports up to 5.7kW per unit.  That's 
24 outlets, individually switched, but you can only get about 16 high-power 
computers on one PDU (16*350W is 5.6kW).  We have 3 PDUs ~evenly handling 
40 nodes per rack, and each node is ~320W max, so we're fine.
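
The back-of-the-envelope check, using those numbers:

  40 nodes x 320 W   = 12.8 kW per rack
  3 PDUs  x 5.7 kW   = 17.1 kW of switched capacity
  12.8 kW / 3 PDUs   = ~4.3 kW per PDU, comfortably under 5.7 kW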

> After dealing with a few clusters where PDUs sat in the airflow, blocking
> both the airflow and physical access to parts of the node, I now specify
> the zero-U variety that sit outside the airflow.

We have 3 AP7960s per rack, all mounted in the rear (of course).  One is in 
the "zero-U" space on the left side at the rear.  The right "zero-U" space 
is taken up by network cable routing.  So the other two units are hung from 
the left rear door (the rear doors are a pair of clamshell-opening 
half-doors).

The Dell rack doors (and hinges) can easily take that load.  And if you 
mount them as close to the center edge of the door as possible, the 
computer power supply fans have a straight route to blow air out the door 
grillwork.  There's some airflow obstruction due to the power cords, and 
Dell didn't mount the AP7960s at the center edges of the doors, so I'm not 
entirely happy with it.  But in fact our computers' ambient and internal 
temps are absolutely fine.  Of course it helps that the ambient is 50-55 
degrees F... ;)

David


