[Beowulf] Monitoring crashing machines
Robert G. Brown
rgb at phy.duke.edu
Tue Sep 9 11:26:36 PDT 2008
On Tue, 9 Sep 2008, Carsten Aulbert wrote:
> We did get a few messages, albeit not from the kernel when an error
> happened. I'll have another look today, maybe I did something wrong.
If your kernel is out and out crashing, you might not get anything at
all. In that case, let me add:
"putting a cheap monitor on a suspect or crashed node"
Or even after a crash. If the primary graphics card is being used as a
console, the frame buffer will probably retain the last kernel oops
written to it (if any) even after the system proper has locked up. Just
plug a monitor into the video output of a machine that has crashed and
see if there is anything there.
One last method (from back in the dark ages):
"putting a tty-output printer on as a console printer"
This was actually standard practice for servers through the end of
the '80s, anyway, because it was COMMON for servers to crash -- or be
cracked -- and a hard copy of syslog/console output was often your only
clue as to the cause, or your only evidence of the intrusion.
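
A rough modern stand-in for the console printer, offered only as a sketch:
a second machine reads the suspect node's console output and commits every
line to disk the moment it arrives, so the last message before a lockup
survives. The function name log_console and the log file layout are
illustrative assumptions, not anything from the original setup.

```python
import os
import time


def log_console(stream, log_path):
    """Append each line read from `stream` to `log_path`, forcing it to disk.

    `stream` is any iterable of text lines (an open serial device, a pipe,
    a file). Returns the number of lines written.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for line in stream:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.rstrip()}\n")
            log.flush()
            # Commit immediately: the next line may never arrive.
            os.fsync(log.fileno())
            count += 1
    return count
```

To use it, open whatever device carries the node's console (for example a
serial port such as /dev/ttyS0, which is an assumed name and must already be
configured) in text mode and pass it in as `stream`.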
You will still have the problem that a kernel crash is, not infrequently,
well, "instant death". Some problems just lock up your system
"now", without passing go or collecting $200. Nothing will then help
you, although modern kernels have settings and setups that SHOULD make
them die with an oops and some sort of message, most of the time. Or some
of the time. Or heck, who knows how much of the time?
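
As one example of the kind of settings meant here, the sketch below reports
two standard Linux sysctl entries, kernel.panic and kernel.panic_on_oops, by
reading them from /proc/sys/kernel. Whether these are the particular settings
that matter for a given crash is an assumption; the function name
read_crash_settings is made up for illustration.

```python
from pathlib import Path

# Settings that influence what the kernel does after an oops or panic.
# Both are standard Linux sysctl entries under /proc/sys/kernel.
SETTINGS = {
    "panic": "seconds before rebooting after a panic (0 = do not reboot)",
    "panic_on_oops": "1 = turn every oops into a panic, 0 = try to continue",
}


def read_crash_settings(base="/proc/sys/kernel"):
    """Return {name: int value or None} for each setting found under `base`."""
    values = {}
    for name in SETTINGS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # missing, unreadable, or not a number
    return values


if __name__ == "__main__":
    for name, value in read_crash_settings().items():
        shown = "unavailable" if value is None else value
        print(f"{name} = {shown}  # {SETTINGS[name]}")
```

Run on the suspect node itself, this only shows how the kernel is configured
to react; it does nothing to guarantee that a message gets out.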
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977