[Beowulf] Re:hardware question: building a cluster node/ student

Lombard, David N dnlombar at ichips.intel.com
Fri Jul 27 07:44:17 PDT 2007


On Thu, Jul 26, 2007 at 08:48:35AM -0700, David Mathog wrote:
> "Nathan Moore" <ntmoore at gmail.com> wrote
> 
> > Earlier this summer, the case fan on one of the machines failed, and the
> > result seems like a cooked motherboard (erratic errors with the integrated
> > NIC).
> 
> There should be an automatic shutdown script running to detect
> temperature events and shut down the machine before it is damaged. 
> This is what I use on some machines:
> 
> ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/sensor_monitor.tar.gz

Depending on the board and kernel, ACPI will also provide these services.  On
an FC4 (2.6.14) system, I had to do the following to get that to work:

	echo 90           > /proc/acpi/thermal_zone/THRM/polling_frequency
	echo 80:0:70:65:0 > /proc/acpi/thermal_zone/THRM/trip_points

The first echo enabled polling, which is what made the auto shutdown work; the
second set the trip points I wanted, i.e., shutdown at 80C.  Some ACPI
cognoscenti said the fact that I had to "manually enable" the polling/shutdown
was a bug in that version of the kernel.
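
For anyone who wants to apply the same two settings from a boot script, here is
a minimal sketch rather than a tested recipe.  The function name
set_thermal_limits is made up for illustration.  The zone name (THRM on my
board) varies by board, and the /proc/acpi/thermal_zone files may not exist at
all on other kernels, so the function checks that both files are writable
before touching them.  As far as I understand it, the trip_points fields are
critical:hot:passive:active0:active1, but treat that ordering as an assumption
and check it against your own kernel.

```shell
#!/bin/sh
# Sketch only: write the polling and trip-point settings for one ACPI
# thermal zone.  set_thermal_limits is a hypothetical helper name.
#
#   $1  zone directory, e.g. /proc/acpi/thermal_zone/THRM (board-specific)
#   $2  value for polling_frequency, e.g. 90
#   $3  value for trip_points, e.g. 80:0:70:65:0
#       (assumed order: critical:hot:passive:active0:active1)
set_thermal_limits() {
    zone_dir=$1
    poll_value=$2
    trip_value=$3

    # Refuse to do anything unless both control files are writable.
    for f in polling_frequency trip_points; do
        if [ ! -w "$zone_dir/$f" ]; then
            echo "cannot write $zone_dir/$f" >&2
            return 1
        fi
    done

    # Same order as the two echo commands: polling first, then trip points.
    echo "$poll_value" > "$zone_dir/polling_frequency" || return 1
    echo "$trip_value" > "$zone_dir/trip_points" || return 1
}
```

Run as root, the call matching the commands above would be:

	set_thermal_limits /proc/acpi/thermal_zone/THRM 90 80:0:70:65:0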

I discovered all this when I came home to that sickening overly-hot electronics
smell, a case *very* hot to the touch, and the CPU at 104C due to a dead CPU
fan.  Happily, it took a licking and kept on ticking.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


