[Beowulf] Cooling vs HW replacement
Luc.Vereecken at chem.kuleuven.ac.be
Tue Jan 18 05:41:36 PST 2005
I usually just lurk on this mailing list, but I think it is time to share some
experience about not having cooling...
I have been running a cluster (of variable size depending on the season) in
an average room without AC for several years. Not by choice, I must say,
but my request for AC was rejected, and it took years before the necessary
infrastructure was present to move to another room that already had AC.
It is a horror story. During summer (yes, I'm on the sunny side of the
building) temperatures exceeded 35° Celsius. During winter, even with
the (small) window open while it was freezing outside, I couldn't get the
temperature below 20°C. I just could not get rid of the generated heat,
even though this is a chemistry building and the ventilation replaces the
air 7 times per hour (or is it 15 times? can't remember). Note that other
rooms in this part of the building tend to be chilly in winter because it's
so hard to heat them with the ventilation taking out the heated air.
The first summer I had a failure rate of over 60%. Some motherboards
failed, plenty of power supplies failed, I had 10 brand-new disks that ran so
hot at times that I couldn't put my hand on them at these ambient temperatures.
5 of them failed in the first 6 months, the other 5 a few months later.
Some CPUs just stopped working. Some memory modules burned out. I have 2 or
3 nodes where I can reproducibly crash certain jobs or get faulty results
just depending on the temperature of the room. I found that during the hot
season, new computers ran for about 3 months, then started to go awry.
In an attempt to get rid of the hot air, I attached flexible air ducts to
the exhaust of the power supplies (where most of the hot air comes out) and
the ventilation sucked the hot air out directly. This idea actually works
pretty well for a DIY solution, especially as we have uber-ventilation
given that this room used to be a chemical lab. I might actually implement
this also for our new cluster (in an AC-ed room) just to reduce the
AC requirements. On average, it reduced the temperature in the room by several
degrees, but I had to let go of the idea because I still had to fix nodes
too often and handling the air ducts was a bit too cumbersome to do on a daily
basis (some nodes are still attached to this system).
The next summer, I wised up and preemptively turned off some of the
slower nodes. This time the failure rate of the remaining nodes was _only_ 45%,
but I think this is partly because I just stopped fixing nodes at a certain
point (ran out of spare power supplies )-: ) and left the faulty nodes
turned off waiting for the colder season.
After more than two years, I now have access to an AC-ed room; I plan on
building a completely new cluster, as all the current hardware has been
overheating and is prone to producing faulty results because of this, even
though the room is cooler now than in summer.
My advice: don't even think about trying HW replacement instead of cooling.
- Failure rates are horrible at temperatures above 30° ambient: I lost
thousands of euros to failures.
- Downtime is also killing you: my scientific output has dropped to less
than 30% compared to before these heating problems started. With a bit of
bad luck, I won't get a new grant because of this.
- TCO is horrible due to the operator time: you have to manually walk over
to the cluster, take out the node, figure out the problem, fix it, get
spare parts or contact the warranty supplier,... this takes too much time
especially given the high failure rates. I never worked as hard as during
these last 2 years, and as I said, scientific output was still strongly
reduced.
- You will have difficulty with warranty. Some components you can get
replaced without questions as they show no obvious signs of abuse, but I
had MoBos with several components blown and blackened. No way you can claim
this is not caused by overheating. After you send in a couple of these, you
start to get questions...
- The biggest killers of all are the non-visible problems. At certain moments
I started to get different results on different nodes for the same job. A
job would crash at 28°, but run OK at 24°. I get unexplainable outliers
when calculating what should be a smooth trend. Rerunning the calculations
exactly the same gives different results. You just can't trust your results.
OK. Better stop here, as I don't intend to rival rgb in length of a post :-)
Anyway: going the HW-replacement road would be, in my view and based on
extensive experience, a wrong decision.