[Beowulf] followup on 1000-node Caltech cluster

David Kewley kewley at gps.caltech.edu
Sat Jun 18 16:04:56 PDT 2005


Hi all,

I wrote to this list on 6/6 about the large cluster that we expect to 
install at Caltech.  I got a bunch of great replies (on- and off-list), 
and wrote a brief followup on 6/8.  Here's another, more definite & 
detailed followup.

It is now confirmed that we will receive this cluster around 
7/20.  The shipment will be 26 fully-loaded racks plus various other 
stuff.  I'm working out the receiving details; I believe at this point 
that with eight workers we can get all 30 tons or so of pallets off the 
truck(s) and into the subbasement room inside one workday.  I have been 
working closely with the Caltech Transportation department to plan 
this, and I will work with Dell & the shipper as much as required until 
we have a very good plan.

Our goal is to have the cluster running well enough to show off our 
near-realtime earthquake application in late August or September.  
There is of course an enormous amount of work that needs to happen both 
before and after that point.

I was informed yesterday that there will be only one technical staff 
member supporting the cluster.  We will work very closely with our 
vendors to get things working right, and we will hire additional help 
in the first couple of months if needed.  But after the initial period, 
I'll be the only support person.  I'm excited & honored to work with 
this cluster, but I'm not at all certain that my best efforts are 
capable of giving the results they want.  We'll see; I've made my 
opinions known quite clearly.  I've also advised them to consider what 
happens if I get hit by a bus or otherwise become unavailable.

Many of your replies focused on the need for as much automation as 
possible.  We will in fact have a lot of automation capability 
available; the challenge will be integrating it all.  The room has a 
500kVA/400kW Liebert UPS for the computing equipment, a 130kVA UPS for 
the HVAC, and six ?-ton Liebert chilled-water HVAC units.  The cooling 
is adequate; the numbers suggest that we'll be close to maxing out the 
400kW computer UPS.  All of this equipment will be networked, and our 
automation systems (to be built) will have access to lots of data and 
control parameters on this equipment.

Our compute node racks will dissipate up to ~14kW each.  Each rack's 
power strips are three 5.7kW 208V 3-phase APC networked/switched "rack 
PDUs".  The 
intakes for the HVAC units are ~3 feet from the backs of the racks.  We 
expect nearly laminar air flow, but with good local & roomwide mixing 
due to the up-and-down airflow pattern from the supply ducts.

We also have a BTU meter hooked up to the chilled water supply/return 
lines.  Combined with the UPS data, our automated systems will be able 
to continuously calculate the energy balance -- is the cooling system 
removing as much energy as is going in?  The sign of the 
instantaneous energy balance is a very good predictor of the 
temperature trend in the room.  Thus we will be able to initiate alarms 
and/or shutdown procedures long before the temperature rise is 
noticeable.  This is all to be automated *before* the systems are 
permanently powered up -- we will not have 24/7 human coverage, so the 
automated response systems are critical.
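
To make that concrete, here's a minimal sketch of the kind of watchdog 
I have in mind.  Both polling functions are hypothetical placeholders 
returning dummy values; the real ones will be whatever SNMP/Modbus 
queries we end up using against the UPS and the BTU meter, and the 
thresholds are made up too.

# Minimal sketch of the energy-balance watchdog, assuming we can poll
# the UPS for output power and the BTU meter for heat-removal rate.
# The polling functions below are placeholders returning dummy values.

import time

BTU_PER_HR_PER_KW = 3412.14   # 1 kW == 3412.14 BTU/hr
ALARM_AFTER = 5               # consecutive deficit samples before alarming
POLL_SECONDS = 10

def poll_ups_output_kw():
    """Electrical power going into the room, in kW (placeholder)."""
    return 380.0

def poll_btu_meter():
    """Heat carried away by the chilled water, in BTU/hr (placeholder)."""
    return 1.3e6

def alarm(msg):
    """Page someone and/or kick off the shutdown sequence (placeholder)."""
    print("ALARM: %s" % msg)

def watch():
    deficit_count = 0
    while True:
        power_in = poll_ups_output_kw()
        heat_out = poll_btu_meter() / BTU_PER_HR_PER_KW   # convert to kW
        balance = heat_out - power_in      # negative => room is heating up
        if balance < 0:
            deficit_count += 1
        else:
            deficit_count = 0
        if deficit_count >= ALARM_AFTER:
            alarm("cooling deficit of %.1f kW for %d straight samples"
                  % (-balance, deficit_count))
        time.sleep(POLL_SECONDS)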

It was calculated that if the air cooling suddenly stopped while the 
power into the room was 400kW, but the fans kept the air circulating, 
the average rate of temperature increase would be 4 degrees F per 
second.  I remain a bit skeptical of that number personally -- for 
example, there was no allowance for the heat capacity of the 
now-stationary chilled water in the heat-transfer coils, nor of the 
walls, floors, ceiling, air ducts, and other surfaces in the room.  But 
the Planetary Science folks here know their atmospheric modeling (it's 
what they do for a living), so I'm pretty confident that their results 
are correct given their simple assumptions.
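
For what it's worth, the back-of-the-envelope version of that 
calculation looks like the following.  The room air volume is purely my 
own guess for illustration (it is not their figure), and like their 
model it ignores every bit of thermal mass except the air itself:

# Back-of-the-envelope check of the 4 F/s figure, air-only.
power_w   = 400e3     # W of heat going into the room
rho_air   = 1.2       # kg/m^3, air density
cp_air    = 1005.0    # J/(kg K), specific heat of air
volume_m3 = 150.0     # ASSUMED air volume; the real room may differ a lot

heat_capacity_j_per_k = rho_air * volume_m3 * cp_air   # ~1.8e5 J/K
rate_k_per_s = power_w / heat_capacity_j_per_k         # ~2.2 K/s
rate_f_per_s = rate_k_per_s * 1.8                      # ~4 F/s

print("%.1f degrees F per second, air-only" % rate_f_per_s)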

We plan to configure our automated systems to initiate a compute-node 
shutdown immediately upon loss of power or loss of chilled water flow, 
to be completed in ~30 seconds, with enforcement at the 30-second mark 
via our separate controls on the individual power lines.  Shutdown of 
the more critical systems may be delayed, but will also occur 
automatically in a continuing loss situation.
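
In rough Python, the sequence I'm picturing is below.  Every helper is 
a placeholder for the real ssh / IPMI / SNMP plumbing, and all the 
names are invented:

# Rough sketch of the loss-of-power / loss-of-chilled-water response.

import time

GRACE_SECONDS = 30    # compute nodes get this long to shut down cleanly

def request_clean_shutdown(node):
    """Ask the node to power itself off (e.g. remote 'shutdown -h now')."""
    pass  # placeholder

def outlet_off(node):
    """Hard-cut the node's outlet on its switched APC PDU (e.g. via SNMP)."""
    pass  # placeholder

def still_drawing_power(node):
    """True if the PDU reports the node's outlet is still loaded."""
    return False  # placeholder

def emergency_compute_shutdown(nodes):
    for node in nodes:
        request_clean_shutdown(node)
    time.sleep(GRACE_SECONDS)
    # Enforce at the 30-second mark: anything still up loses its outlet.
    for node in nodes:
        if still_drawing_power(node):
            outlet_off(node)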

We'll also deal with events from the smoke alarm / precharge fire 
suppression system and the subfloor water sensors in automated, 
appropriate ways.  We'll have a programmatic trigger for the Emergency 
Power Off circuit for the room, as well as a number of 
shielded-pushbutton and emergency-break-glass buttons in the room 
itself.

I'm hoping to recruit five volunteers to stand at the six AC units with 
me and test what happens if all the fans stop at once with the 
computers running full-tilt.  Hopefully we won't cook too quickly to 
turn the units back on.  The person who did the calculation refuses to 
help with this test, and thinks we shouldn't do it at all. :)

We've already had a flood in the subbasement that left at least 2" of 
water on the sub-raised-floor slab (cause of the flood: plumber errors 
in a different area of the subbasement).  The machines stayed up until 
I shut them down manually (this happened on a Friday night; I got a 
7:13 AM Saturday phone call), but the power outlets were on the slab 
sitting in (clean) water for a few hours, so the electrical contacts 
were completely corroded.  The outlets are now raised 13", and the 
conduits to the outlets are waterproof.

We also have calcium deposits on the unsealed cement slab, which sits on 
soil.  The cause is water impregnation of the slab (from the flood and 
from failure of the under-slab sealing layer) bringing lime to the 
surface where the water evaporates, leaving the lime.  I will be 
investigating how best to prevent this in the future, and will be 
arranging for cleaning.  The sub-raised-floor area is a high-speed air 
conduit for the facility, and we've already had a couple of lime 
snowstorms when previously-idle HVAC units were turned on.

Regarding automation capabilities of the computing equipment itself: We 
will have console access via the Dell IPMI BMC (baseboard management 
controller), over the Ethernet network.  I have at least a hint that 
BIOS management and updates can be automated to some degree on these 
machines, but I haven't verified the details.  The BMC will give us 
power-on/off/reset control, plus we have networked switched power 
strips as another mechanism to control power to individual nodes.
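
As an example of what I mean, a thin wrapper around ipmitool ought to 
be enough for scripted power control.  The hostname and credentials 
below are made-up examples, and the exact interface option ('-I lan' 
here) will depend on what the Dell BMC firmware actually supports:

# Thin wrapper around ipmitool for out-of-band power control of one node.

import subprocess

def bmc_power(bmc_host, user, password, action):
    """action is one of: status, on, off, cycle, reset."""
    return subprocess.call(["ipmitool", "-I", "lan", "-H", bmc_host,
                            "-U", user, "-P", password,
                            "chassis", "power", action])

# Example: hard power-cycle a wedged compute node.
# bmc_power("compute-0-17-bmc", "root", "not-the-real-password", "cycle")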

Rocks is installed on our existing 160-node cluster, but I am far from 
sufficiently familiar with it.  I will be learning a lot about it in 
the near future, and I've gotten expressions of interest & support from 
Rocks authors & users.  One thing Rocks will do (combined with the Dell 
PXE support) is automatic re-imaging of the compute nodes.  So that's 
taken care of.
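
If I remember right, Rocks exposes this through its shoot-node command 
(reinstall a node over PXE/kickstart on its next boot), so a mass 
re-image should be a short loop along these lines -- node names here 
are just examples, and I still need to verify the details on Rocks 3.2:

# Sketch: force a PXE/kickstart reinstall of a batch of compute nodes
# from the Rocks frontend using shoot-node.  Node names are examples.

import subprocess

def reimage(nodes):
    for node in nodes:
        # shoot-node tells the node to reinstall itself on its next boot
        subprocess.call(["shoot-node", node])

# reimage(["compute-0-0", "compute-0-1"])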

We have next-business-day (NBD) onsite service on the compute nodes and 
4-hour onsite service 
on the critical equipment.  I fully intend and expect to get Dell to 
help me tap this support efficiently.  I am confident that will work OK 
-- if it doesn't, they'll hear about it.  We do have a few spare 
compute nodes; I believe we'll also have a spare 48-port GigE Nortel 
switch (these stackable switches form our GigE network).  The node hard 
disks are 10k & 15kRPM SCSI.  Out of the 161 machines I have now, a 
handful of disks have gone south (in ~6 months), and a few more nodes 
have failed for other reasons.

We will have support for LSF and for Ibrix; I expect good support 
whenever I need it.  Of course, I'll have to become intimately familiar 
with these technologies myself.

I have Myrinet running fine on our 160-node cluster.  I had no 
significant hardware issues, and the only software issues were getting 
the required versions of the software for our hardware (not included 
with Rocks 3.2), and doing several days of work to convert e.g. gm into 
a proper rpm.  GM's build tools don't conform well to rpm's design 
assumptions, but I got a very good rpm in the end.  Upgrades will now 
be easy.

I will get very good support from Myricom -- they have been very helpful 
already, and are very interested in seeing this cluster work well.

I have a good deal of Linux experience, and I have a good number of 
immediate colleagues whose experience I tap regularly (and they mine).  
We have good Linux resources on campus.  Even so, I will be treading on 
territory that no one on campus has seen before.  This machine room 
will be rivaled at Caltech only by the CACR facility, which is a 
collection of a large number of systems in one large room.

Our data storage will be a top-notch DataDirect SAN with 30-40 TB total 
available after redundancy, built on Fibre Channel disks.  I expect to 
have a single filesystem served by sixteen Ibrix segment nodes with 
high-availability failover of the servers (and RAID and other 
redundancy on the SAN side).

I am unsure whether I'll send all the storage data over the Nortel-based 
GigE network (our initial design), or whether I'll kick some compute 
nodes off the Myrinet and put the Ibrix segment servers on that.  It 
will depend on the performance and other issues we see, and I'll have 
some experts to draw on when deciding these issues.

We will not support user storage of non-scratch data on the node-local 
hard drives.  A select subset of the multi-TB main data store will be 
backed up using a 4.8TB-native LTO2 tape library.

The user pool is already fairly experienced on a couple of mid-sized 
clusters, and I expect we'll keep that experience pool growing as 
students etc. come and go.  They will be expected to help each other 
with the majority of helpdesk and howto issues.

Regarding mission creep, the problem is that I have a science background 
and enjoy programming, so I'd *like* to work with users on their code 
issues. :)  I've been told, though, that application support is not my 
responsibility.  I will have to exert a lot of self-discipline not to 
get scattered.

There ya go, a more complete description of our situation.  I'll be 
contacting our vendors (including those on this list) for help in 
planning and execution in the next month and beyond.  Beyond that, I 
may not have sufficient time to respond to each person who writes to 
me.  All the same, I very much appreciate any feedback you send my way.

Thanks!
David


