[Beowulf] Cooling vs HW replacement

Luc Vereecken Luc.Vereecken at chem.kuleuven.ac.be
Tue Jan 18 05:41:36 PST 2005


Hi list,

I usually just lurk on this mailing list, but I think it is time to share some 
experience about not having cooling...

I have been running a cluster (of variable size depending on the season) in 
an average room without AC for several years. Not by choice, I must say, 
but my request for AC was rejected, and it took years before the necessary 
infrastructure was present to move to another room that already had AC 
installed.
It is a horror story. During summer (yes, I'm on the sunny side of the 
building) temperatures were in excess of 35° Celsius. During winter, even 
with the (small) window open while it was freezing outside, I couldn't get 
the temperature below 20°C. I just could not get rid of the generated heat, 
even though this is a chemistry building and the ventilation replaces the 
air 7 times per hour (or is it 15 times? can't remember). Note that other 
rooms in this part of the building tend to be chilly in winter because it's 
so hard to heat them with the ventilation taking out the heated air.

The first summer I had a failure rate of over 60%. Some motherboards 
failed, plenty of power supplies failed, and I had 10 brand-new disks that 
ran so hot at times that I couldn't keep my hand on them at these ambient 
temperatures. 5 of them failed in the first 6 months, the other 5 a few 
months later. Some CPUs just stopped working. Some memory modules burned 
out. I have 2 or 3 nodes where I can reproducibly crash certain jobs or get 
faulty results 
just depending on the temperature of the room. I found that during the hot 
season, new computers ran for about 3 months, then started to go awry.

In an attempt to get rid of the hot air, I attached flexible air ducts to 
the exhaust of the power supplies (where most of the hot air comes out) and 
the ventilation sucked the hot air out directly. This idea actually works 
pretty well for a DIY solution, especially as we have uber-ventilation 
given that this room used to be a chemical lab. I might actually implement 
this also for our new cluster (in an AC-ed room) just to reduce the 
AC requirements. On average, it reduced the temperature in the room by 
several degrees, but I had to let go of the idea because I still had to fix 
nodes too often and handling the air ducts was a bit too cumbersome to do 
on a daily basis (some nodes are still attached to this system).

The next summer, I wised up and preemptively turned off some of the 
slower nodes. This time the failure rate of the remaining nodes was _only_ 
45%, but I think this is partly because I just stopped fixing nodes at a 
certain point (ran out of spare power supplies )-: ) and left the faulty nodes 
turned off waiting for the colder season.

After more than two years, I now have access to an AC-ed room; I plan on 
building a completely new cluster, as all the current hardware has been 
overheating and is prone to producing faulty results because of this, even 
though the room is cooler now than in summer.

My advice: don't even think about trying HW replacement instead of cooling.
- Failure rates are horrible at ambient temperatures above 30°C: I lost 
thousands of euros to failures.
- Downtime is also killing you: my scientific output has dropped to less 
than 30% compared to before these heating problems started. With a bit of 
bad luck, I won't get a new grant because of this.
- TCO is horrible due to the operator time: you have to manually walk over 
to the cluster, take out the node, figure out the problem, fix it, get 
spare parts or contact the warranty supplier,... this takes too much time 
especially given the high failure rates. I never worked as hard as during 
these last 2 years, and as I said, scientific output was still strongly 
reduced.
- You will have difficulty with warranty. Some components you can get 
replaced without questions as they show no obvious signs of abuse, but I 
had MoBos with several components blown and blackened. No way you can claim 
this is not caused by overheating. After you send in a couple of these, you 
start to get questions...
- The biggest killer of all is the non-visible problems. At certain moments 
I started to get different results on different nodes for the same job. A 
job would crash at 28°C, but run OK at 24°C. I get inexplicable outliers 
when calculating what should be a smooth trend. Rerunning exactly the same 
calculations gives different results. You just can't trust your results 
anymore.
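
One way to catch this kind of silent error is to run the same deterministic 
job on several nodes and compare the numbers. Below is a minimal sketch in 
Python, assuming each node returns a single floating-point result; the 
function name, the node names, the tolerances and the example values are all 
made up for illustration and are not taken from this cluster.

```python
import math
from statistics import median


def flag_suspect_nodes(results, rel_tol=1e-9, abs_tol=0.0):
    """Return the nodes whose result disagrees with the median result.

    results: dict mapping a node name to the number that node produced
             for the same deterministic job.
    rel_tol, abs_tol: how much difference to accept as ordinary
             floating-point noise (assumed values; tune per application).
    """
    reference = median(results.values())
    return sorted(
        node
        for node, value in results.items()
        if not math.isclose(value, reference, rel_tol=rel_tol, abs_tol=abs_tol)
    )


# Hypothetical results of one job run on four nodes.
example = {
    "node01": -152.1234567,
    "node02": -152.1234567,
    "node03": -152.1234567,
    "node04": -152.0987654,  # the sort of outlier an overheated node could give
}
print(flag_suspect_nodes(example))
```

The median is used as the reference so that a single bad node cannot drag 
the reference toward itself. This only works when a majority of the nodes 
are still healthy, and jobs that crash outright have to be tracked 
separately.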

OK. Better stop here, as I don't intend to rival rgb in length of a post :-)
Anyway: going the HW-replacement road would be, in my view and based on 
extensive experience, a wrong decision.

Luc Vereecken





