[Beowulf] Cooling vs HW replacement
josip at lanl.gov
Tue Jan 18 08:30:03 PST 2005
At my old job, we had the unfortunate experience of AC failing on the
hottest days of the year. Despite providing plenty of circulating fresh
35-40 deg. C air, we lost hardware, mainly disks. In fact, we'd start
losing hard drives (even high quality SCSI drives in our servers) any
time the ambient temperature approached 30 deg. C.
Based on this experience, I'd say that keeping the ambient temperature
under about 25-27 deg. C is a good policy. As Robert has pointed out,
the cost of lost productivity while the system is down for hard drive
replacement and reconstruction, not to mention the manpower required,
can make an unreliable system "AWESOMELY expensive."
In fact, I'd recommend installing a temperature-activated kill switch in
any cluster computing room.
Remember: dissipating 5-10 kW in a small enclosed space can overheat
your expensive cluster within minutes of AC failure, certainly faster
than your system administrator can respond to an alarm triggered on a
Sunday at 2am. Even a forced shutdown (when ambient temperature exceeds
about 30 deg. C for more than a few minutes) is cheaper to fix than
replacing and rebuilding several failed hard drives.
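
For illustration, here is a minimal sketch of the decision logic such a kill switch might use: trip only when the ambient reading has stayed above the limit continuously for a hold time. The class name OverTempTrip, the 30 deg. C limit and the 5-minute hold are assumptions chosen to match the rough figures above, not a tested product. Reading the sensor and actually cutting power are site-specific and are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OverTempTrip:
    """Trips once ambient temperature stays above a limit for a hold time.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    limit_c: float = 30.0           # assumed trip threshold (about 30 deg. C)
    hold_s: float = 300.0           # assumed "few minutes" = 5 minutes
    _since: Optional[float] = None  # time the current excursion began

    def update(self, temp_c: float, now_s: float) -> bool:
        """Feed one reading; return True when a shutdown should be forced."""
        if temp_c <= self.limit_c:
            self._since = None      # back under the limit: reset the timer
            return False
        if self._since is None:
            self._since = now_s     # first reading above the limit
        return now_s - self._since >= self.hold_s
```

A polling loop would call update() with each new sensor reading and a monotonic timestamp, and start an orderly shutdown the first time it returns True. Requiring a sustained excursion rather than a single hot reading keeps one noisy sample from taking the whole cluster down.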