[Beowulf] 512 nodes Myrinet cluster Challenges
kewley at gps.caltech.edu
Tue May 2 19:56:43 PDT 2006
On Tuesday 02 May 2006 14:02, Bill Broadley wrote:
> Mark Hahn said:
> > moving it, stripped them out as I didn't need them. (I _do_ always
> > require net-IPMI on anything newly purchased.) I've added more nodes
> > to the cluster
> Net-IPMI on all hardware? Why? Running a second (or 3rd) network isn't
> a trivial amount of additional complexity, cables, or cost. What do
> you figure you pay extra on the nodes (many vendors charge to add IPMI,
> sun, tyan, supermicro, etc), cables, switches, etc. As a data point on
> a x2100 I bought recently the IPMI card was $150.
On our Dell PE1850s, the IPMI controller (the BMC: Baseboard Management
Controller) is built on the baseboard, and it piggybacks on one of the
built-in ethernet ports. So over one cable, I get GigE to the OS and 100Mb
to the BMC. Apparently there's an ethernet switch built into the
No extra cable, no extra cost. (Or consider that you're paying for it
whether you use it or not. :)
> Seems like collecting fan speeds and temperatures in-band seems
> reasonable, after all much of the data you want to collect isn't
> available via IPMI anyways (cpu utilization, memory, disk I/O, etc.).
It probably is reasonable, but that's not why I use the BMC / IPMI
capability. I use it all the time for:
* querying whether the node is powered up or down
* powering up the node
* powering down the node
* power-cycling the node
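
For illustration, the four operations above could be scripted along these lines with the ipmitool command-line client. This is only a sketch under assumptions, not the setup described in this message: the BMC host name, user name and password are placeholders, and the `lan` interface is assumed (newer BMCs generally need `lanplus` instead).

```python
import subprocess

# The power actions listed above, as ipmitool "chassis power" subcommands.
POWER_ACTIONS = ("status", "on", "off", "cycle")


def ipmi_power_cmd(bmc_host, user, password, action):
    """Build the ipmitool command line for one chassis power action.

    bmc_host, user and password are placeholders for illustration.
    """
    if action not in POWER_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lan", "-H", bmc_host,
            "-U", user, "-P", password,
            "chassis", "power", action]


def ipmi_power(bmc_host, user, password, action):
    """Run the command against the BMC and return ipmitool's output text."""
    result = subprocess.run(ipmi_power_cmd(bmc_host, user, password, action),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Calling `ipmi_power("node001-bmc", "admin", "secret", "status")` would query one node (the names are hypothetical); looping over a host list gives a cluster-wide power check.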
I also occasionally use it to get the System Event Log (also known as the
Embedded System Management log), which tells me about ECC errors, PCI
errors, processor errors, temperature excursions, fan failures, etc.
Normally I get the SEL / ESM log via Dell's in-band OpenManage software,
but if the node is down, I can also get the information using net-IPMI.
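
As a rough sketch of the out-of-band route, the SEL can be pulled with ipmitool's `sel list` subcommand and filtered for the events of interest. The host name and credentials are placeholders, and the parser assumes the usual `|`-separated layout of `sel list` output; the exact fields vary by BMC, so treat this as an assumption rather than a description of Dell's format.

```python
import subprocess


def ipmi_sel_cmd(bmc_host, user, password):
    """Build the ipmitool command that lists the System Event Log."""
    return ["ipmitool", "-I", "lan", "-H", bmc_host,
            "-U", user, "-P", password, "sel", "list"]


def parse_sel(text):
    """Split each non-empty output line on '|' into stripped fields."""
    return [[field.strip() for field in line.split("|")]
            for line in text.splitlines() if line.strip()]


def matching_events(entries, keyword):
    """Keep the entries in which any field mentions keyword (case-insensitive)."""
    key = keyword.lower()
    return [e for e in entries if any(key in field.lower() for field in e)]


def fetch_sel(bmc_host, user, password):
    """Run ipmitool against the BMC and return the parsed SEL entries."""
    out = subprocess.run(ipmi_sel_cmd(bmc_host, user, password),
                         capture_output=True, text=True, check=True).stdout
    return parse_sel(out)
```

For example, `matching_events(fetch_sel("node001-bmc", "admin", "secret"), "ecc")` would pick out memory error entries on a node that is otherwise unreachable.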
I don't use net-IPMI to get fan speed, temps, etc. I use OpenManage for
that, when I care about it. But I seldom care. That's in contrast to
power management, which, as I said, I use *all the time*.
> Upgrading a 208 3phase PDU to a switched PDU seems like it costs on the
> order of $30 per node list. As a side benefit you get easy to query
> load per phase. The management network ends up being just one network
> cable per PDU (usually 2-3 per rack).
Yeah, we have APC AP7960s, which have the advanced capabilities you name. I
thought I'd network them all, but I've never gotten around to it (except on
our fileservers, which use the AP7960 for fencing), because the net-IPMI
methods work so well. Someday...
The AP7960 is about $650 street, and supports up to 5.7kW per unit. That's
24 outlets, individually switched, but you can only get about 16 high-power
computers on one PDU (16*350W is 5.6kW). We have 3 PDUs ~evenly handling
40 nodes per rack, and each node is ~320W max, so we're fine.
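
The arithmetic behind those numbers can be checked with a short sketch. The 5.7kW capacity and the per-node wattages come from the figures above; splitting the nodes as evenly as possible across the PDUs is a simplifying assumption.

```python
import math

PDU_CAPACITY_W = 5700  # AP7960 capacity quoted above


def max_nodes_per_pdu(node_watts, capacity_w=PDU_CAPACITY_W):
    """Largest whole number of nodes one PDU can carry at node_watts each."""
    return capacity_w // node_watts


def busiest_pdu_load_w(nodes, node_watts, pdus):
    """Load on the fullest PDU when nodes are split as evenly as possible."""
    return math.ceil(nodes / pdus) * node_watts


print(max_nodes_per_pdu(350))          # 16 nodes at 350W each
print(busiest_pdu_load_w(40, 320, 3))  # 4480 W, under the 5700 W limit
```

With 40 nodes on 3 PDUs, the fullest PDU carries 14 nodes, or 4480W at 320W each, which leaves roughly 1.2kW of headroom per unit.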
> After dealing with a few clusters with PDUs in the airflow blocking
> airflow and physical access to parts of the node I now specify the
> zero-u variety that are outside the airflow.
We have 3 AP7960s per rack, all mounted in the rear (of course). One is in
the "zero-U" space on the left side at the rear. The right "zero-U" space
is taken up by network cable routing. So the other two units are hung from
the left rear door (the rear doors are a pair of clamshell-opening
The Dell rack doors (and hinges) can easily take that load. And if you
mount them as close to the center edge of the door as possible, the
computer power supply fans have a straight route to blow air out the door
grillwork. There's some airflow obstruction due to the power cords, and
Dell didn't mount the AP7960s at the center edges of the doors, so I'm not
entirely happy with it. But in fact our computers' ambient and internal
temps are absolutely fine. Of course it helps that the ambient is 50-55
degrees F... ;)