[Beowulf] Re:hardware question: building a cluster node/ student

David Mathog mathog at caltech.edu
Thu Jul 26 08:48:35 PDT 2007


"Nathan Moore" <ntmoore at gmail.com> wrote

> Earlier this summer, the case fan on one of the machines failed, and the
> result seems like a cooked motherboard (erratic errors with the integrated
> NIC).

There should be an automatic shutdown script running to detect
temperature events and shut down the machine before it is damaged. 
This is what I use on some machines:

ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/sensor_monitor.tar.gz

You'd need to edit the actual sensor_monitor.sh script to make it
match your machine, and to add any other conditions you want to monitor.
The machine this came from uses an Athlon MP processor, and those have
no automatic overtemp shutdown, so it's essential to keep an eye
on the CPU fan and CPU temp and shut down ASAP if either indicates a
problem.
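
For anyone who wants the general shape of such a monitor before
downloading anything, here is a minimal sketch. It is not the contents
of sensor_monitor.sh. It assumes lm-sensors is installed and that
`sensors` prints lines labelled "CPU Temp" and "CPU Fan"; those labels,
the thresholds, the 10 second poll interval, and the MONITOR_RUN switch
are all placeholders I made up for illustration, so check them against
your own board's output. Treating a missing reading as a fault is also
just one possible policy.

```shell
#!/bin/sh
# Hypothetical thresholds; tune these for your own hardware.
MAX_CPU_TEMP=70     # degrees C
MIN_CPU_FAN=2000    # RPM

# Read `sensors`-style text on stdin and print the first number that
# follows the colon on the line whose label is exactly $1,
# e.g. "CPU Temp:    +45.0 C  (high = +70.0 C)" prints 45.0.
read_value() {
    awk -F: -v label="$1" '
        $1 == label {
            if (match($2, /[0-9]+(\.[0-9]+)?/)) {
                print substr($2, RSTART, RLENGTH)
                exit
            }
        }'
}

# Succeed (exit status 0) when the node should be shut down: the CPU is
# too hot, the CPU fan is too slow, or either reading is missing.
needs_shutdown() {
    temp=$1
    fan=$2
    if [ -z "$temp" ] || [ -z "$fan" ]; then
        return 0
    fi
    awk -v t="$temp" -v f="$fan" \
        -v tmax="$MAX_CPU_TEMP" -v fmin="$MIN_CPU_FAN" \
        'BEGIN { exit !(t > tmax || f < fmin) }'
}

# The polling loop only runs when MONITOR_RUN=1, so the functions above
# can be sourced and tried out without risk of halting the machine.
if [ "${MONITOR_RUN:-0}" = 1 ]; then
    while :; do
        out=$(sensors)
        temp=$(printf '%s\n' "$out" | read_value "CPU Temp")
        fan=$(printf '%s\n' "$out" | read_value "CPU Fan")
        if needs_shutdown "$temp" "$fan"; then
            logger "sensor monitor: temp=$temp fan=$fan, shutting down"
            /sbin/shutdown -h now
            exit 0
        fi
        sleep 10
    done
fi
```

To add another condition (a case fan, say), you would read one more
label and extend the test in needs_shutdown. Running it as root with
MONITOR_RUN=1 from an init script is one way to start it at boot.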

These 2U nodes are packed full of fans (one case exhaust, one exhaust on
the power supply about 6" upstream of the case exhaust, and two intake
fans).  There's no way to monitor the PS fan, and I found by
experiment that unplugging any one of the others did not lead to
overheating inside the case, so those aren't monitored either,
although they could be.  So far this system has worked properly: we've
lost a couple of CPU fans, and the nodes shut down promptly.  We've
also lost a few case fans, and the nodes did not shut down, which was
also correct, as they were not overheating when down one case fan.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


