[Beowulf] Cooling vs HW replacement

Robert G. Brown rgb at phy.duke.edu
Mon Jan 17 21:21:06 PST 2005


On Mon, 17 Jan 2005, George Georgalis wrote:

> Also, I should clarify, I've not set up a site like this; by experience I
> really meant exposure. I know the hot room and cold room setup does make
> a difference though.

Most of my experience has been inadvertent.  ACs that fail.  People
who turn off the AC for a silly reason like its being winter outside
and cold (so why would you need air conditioning?).  Painters working
unsupervised in the server room who helpfully cover the servers with
plastic.  Failing cooling fans.  A purchase
decision that (as it turned out) left us with a pile of some of the most
temperature-temperamental boxes on the planet.

One thing to make clear is that this isn't just about running ambient
air a bit warmer than you should.  It is about setting up your facility
to remove heat.  Remember, the nodes GENERATE heat.  It can be cold
outside, dead of winter, -20C with a cold wind blowing, and a
midsize room with 64 nodes in it will still be burning between 6 and 15
kilowatts.  That's enough heat to keep a small log cabin chinked with
paper towels toasty warm in the middle of winter.
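
If you want to sanity-check numbers like that, the conversion from
node count to required cooling capacity is straightforward arithmetic.
A quick sketch (the per-node wattages are my assumptions, not
measurements):

    # Rough cooling-load arithmetic for a 64-node room.  Per-node draw
    # varies a lot with hardware and load; 100-235 W is an assumed range.
    NODES = 64
    for watts_per_node in (100.0, 235.0):
        kw = NODES * watts_per_node / 1000.0   # heat dumped into the room
        btu_per_hr = kw * 3412.0               # 1 kW ~= 3412 BTU/hr
        tons = btu_per_hr / 12000.0            # 1 ton of AC = 12000 BTU/hr
        print(f"{watts_per_node:.0f} W/node -> {kw:.1f} kW "
              f"= {btu_per_hr:.0f} BTU/hr = {tons:.1f} tons of AC")

That's roughly two to four tons of refrigeration that has to run no
matter what the weather is doing outside.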

We have at this point somewhere between 100 and 200 nodes in one
medium-sized, fairly well-insulated room.  When the AC fails, we have a time
measured in minutes (usually around 15-20) before the room temperature
goes from maybe 15C to 30C (on its way through the roof), independent of
the temperature outside.
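
In fact the air alone would heat up even faster than that.  A crude
estimate (room dimensions and load are assumed numbers; it ignores the
thermal mass of the racks, machines, and walls, which is what actually
buys you those 15-20 minutes):

    # How fast does sealed air heat when the AC dies?  Air-only estimate;
    # real rooms heat slower because racks and walls soak up heat too.
    room_volume_m3 = 120.0    # assumed ~8 m x 5 m x 3 m room
    air_density = 1.2         # kg/m^3
    cp_air = 1005.0           # J/(kg*K), specific heat of air
    heat_load_w = 20000.0     # assumed ~150 nodes at ~130 W each

    air_mass_kg = room_volume_m3 * air_density
    rate_k_per_s = heat_load_w / (air_mass_kg * cp_air)
    minutes_15c_to_30c = (30.0 - 15.0) / rate_k_per_s / 60.0
    print(f"{rate_k_per_s:.2f} K/s for the air alone; "
          f"15C to 30C in {minutes_15c_to_30c:.1f} minutes")

The air by itself would blow through 30C in about two minutes;
everything past that is borrowed time from the room's thermal mass.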

No matter WHAT your design, you'll have to have enough AC to be able to
remove the heat you are releasing into the room as fast as you release
it, and this is by far the bulk of your engineering requirement as far
as AC is concerned.  So I'm not certain what you are thinking about.
You cannot really run with no AC at all, and whatever AC you have will
still have to remove all that heat.  What you're really comparing, then,
is the MARGINAL cost of conditioning the air at a (too) high temperature
vs conditioning the air to a safe operating temperature.

In my estimation (which could be wrong) the amount you save keeping the
room at 30C instead of the far safer 20C will be trivial, maybe
$0.05-0.10/watt/year -- a small fraction of your total expenditure on
power for the nodes (in the US, roughly $0.60/watt/year), the AC
hardware itself (can be anywhere from tens to hundreds of thousands of
dollars), and the power required to remove the heat you MUST remove just
to keep the room temperature stable at ANY temperature (perhaps
$0.20/watt/year).
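
Those per-watt figures aren't magic; they fall straight out of the
electric rate and the AC's coefficient of performance.  With assumed
2005-ish numbers ($0.07/kWh and a COP of 3):

    # Where the $/watt/year figures come from.  Rates are assumptions.
    PRICE_PER_KWH = 0.07     # assumed average US commercial rate, $/kWh
    COP = 3.0                # assumed AC coefficient of performance

    kwh_per_watt_year = 24 * 365 / 1000.0            # 8.76 kWh per watt-year
    node_cost = PRICE_PER_KWH * kwh_per_watt_year    # ~$0.61/watt/year
    ac_cost = node_cost / COP                        # ~$0.20/watt/year
    print(f"power: ${node_cost:.2f}/W/yr, cooling: ${ac_cost:.2f}/W/yr")

    # Per-node stakes, using assumed 150-300 W nodes: power alone runs
    # roughly $90-180/yr, while the hot-room "savings" is only $7-30/yr,
    # in the same ballpark as the $5-20 vs $100-200 ranges above.
    for w in (150, 300):
        print(f"{w} W node: power ${node_cost * w:.0f}/yr, "
              f"savings at $0.05-0.10/W/yr: ${0.05 * w:.0f}-{0.10 * w:.0f}/yr")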

So you are risking all sorts of catastrophic meltdown-type situations to
save maybe $5-20 per node operating cost per year against an inevitable
budget for power for the nodes of $100-200 each per year. I don't even
think you'll break even on the additional costs of the hardware that
breaks from running things hot, let alone the human and downtime costs.

To give you an idea of the magnitude of the problem, the ONE TIME our
server room overheated for real, reaching 30-35C for an extended period
of time (many hours -- the thermal warning system that was supposedly in
place, but never tested, did not actually work the way it was supposed
to) we had node crashes galore, and a string (literally) of
hardware failures over the next three months -- some immediate and
"obviously" due to immediate overheating, some a week later, two weeks
later, four weeks later.  Nowadays if the room gets hot we respond
immediately, typically getting nodes shut down within minutes of a
reported failure and incipient temperature spike.
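
This isn't our actual monitoring setup, just a minimal sketch of the
sort of watchdog that makes that response automatic.  It assumes some
way to read room temperature (the sensor path and node names below are
hypothetical stand-ins for whatever lm_sensors output, SNMP
thermostat, etc. you actually have) and ssh access sufficient to halt
the nodes:

    #!/usr/bin/env python
    # Minimal thermal watchdog sketch -- NOT production code.
    import subprocess
    import time

    SHUTDOWN_AT_C = 28.0                              # assumed trip point
    NODES = [f"node{n:02d}" for n in range(1, 65)]    # hypothetical names

    def read_room_temp_c():
        # Placeholder: replace with a real read of your sensor hardware.
        with open("/var/run/room_temp") as f:         # hypothetical path
            return float(f.read())

    while True:
        if read_room_temp_c() >= SHUTDOWN_AT_C:
            for node in NODES:
                # Halt the nodes before the heat does it for you.
                subprocess.run(["ssh", node, "shutdown", "-h", "now"])
            break
        time.sleep(30)    # poll every half minute; heat moves fast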

When the overheating occurred I had 15 nodes racked that had run
perfectly for a year.  3 blew during the event.  4 more failed over the
next few months.  2 more failed after that.  Power supplies,
motherboards, memory chips -- that kind of heat "weakens" components so
that forever afterward they are more susceptible to failure, not just
during the event.  The overheating need only occur once, for a few
minutes, and you'll be cursing and bitching for months afterward,
dealing with all the stuff that got almost-damaged, including the stuff
that isn't actually broken, just bent out of spec so that it fails,
sometimes, under load.

Something else to think about: server room temperature is rarely uniform.
EVEN if you are running it at 20C, there will be places in the room that
are 15C (right in front of the output vents) and other places in the
room that are 25-30C (right behind the nodes).  Any unexpected mixing or
circulation of the air in a room running at "30C" and you could have
35-40C ambient air entering some nodes some of the time, and at those
temperatures I'd expect failure in a matter of days to weeks, not years.

The warmest I'd ever run ambient air is 25C in a workstation
environment, 22C in a server/cluster environment (where hot spots are
more likely to occur).

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




