[Beowulf] Approach For Diagnosing Heat Related Failure?
billycrook at gmail.com
Tue Jul 21 14:33:46 PDT 2009
On Tue, Jul 21, 2009 at 15:42, Bill Broadley<bill at cse.ucdavis.edu> wrote:
> I'd suggest doing a visual inspection. Make sure no fans are blocked by
> cables and all are spinning. If that looks normal pull the CPU heat sinks and make
> sure they have good coverage with the heat sink goo, but not so much that it
> leaks over the edge of the chip. When you put the heat sink back on make sure
> the heat sink mount works as intended, especially on the (mostly intel?) 4
> post system where an unclicked post can result in uneven heat sink pressure.
> Be careful, fans moving != spinning. I've seen some that just vibrate enough
> to look like they are spinning at a casual glance and are actually not moving
> much air and are contributing a fair bit of heat to the system (i.e. very hot
> to the touch).
Use the thin end of a zip tie to slowly interrupt and stop each fan
while it is spinning. The pitch of the sound it makes gives a
(very) rough comparison of the RPM, even in a noisy room. It will be
obvious if it's turning normally or not. You might find one blowing
backwards. Don't forget about double-rotor fans.
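
As a rough sketch of why the pitch tracks RPM: each blade hits the zip
tie once per revolution, so the click frequency is about RPM / 60 times
the blade count. The function names and the 7-blade, 350 Hz figures
below are made up for illustration, not measurements from any real fan.

```python
def rpm_from_click_frequency(click_hz: float, blades: int) -> float:
    """Estimate fan speed from how often blades strike the zip tie."""
    if blades <= 0:
        raise ValueError("blades must be positive")
    # Each blade hits the tie once per revolution:
    #   click_hz = (rpm / 60) * blades
    return click_hz * 60.0 / blades


def click_frequency_from_rpm(rpm: float, blades: int) -> float:
    """Inverse relation: the pitch to expect from a fan at a given RPM."""
    if blades <= 0:
        raise ValueError("blades must be positive")
    return rpm / 60.0 * blades


if __name__ == "__main__":
    # Hypothetical 7-blade fan whose clicking sounds like roughly 350 Hz.
    print(rpm_from_click_frequency(350.0, 7))
```

Two fans of the same model should give about the same pitch, so a fan
that sounds clearly lower than its neighbours is the one to look at.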
> If that looks normal then I'd start swapping parts till you find the heat
> sensitive one.
He might swap his desk with that overheating node to help balance out
the heat load...
Or use something more intense than Memtest in your office. Try ACT
Breakin. Once it's booted all the way, a machine with a heatsink ajar
will usually be powered off by thermal protection in < 5 seconds, even
in an ice cold room.
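
If you want to watch how fast the temperature climbs before the cutoff,
something like the sketch below could be run alongside the load. It
assumes a Linux node that exposes thermal zones under
/sys/class/thermal, where each zone's temp file holds millidegrees
Celsius; not every board does, and read_zone_temps is just a name
picked for this example.

```python
from pathlib import Path
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for each readable thermal zone."""
    temps = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            millideg = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            # Some zones refuse reads or return junk; skip them.
            continue
        temps[temp_file.parent.name] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    # One timestamped sample; an empty dict means no zones were found.
    print(time.strftime("%H:%M:%S"), read_zone_temps(), flush=True)
```

Run it in a loop once a second and send the output to another machine
(over ssh, say), since the node's own log may not be flushed to disk
before it powers off.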
Try swapping its power supply with another node that doesn't power off.
P.S. And please do not spray liquid spray air upside down at hot