[Beowulf] Monitoring crashing machines
Robert G. Brown
rgb at phy.duke.edu
Tue Sep 9 11:19:24 PDT 2008
On Tue, 9 Sep 2008, Carsten Aulbert wrote:
> My question now, is there a cute little way to gather all the console
> outputs of more than 1000 nodes? The nodes don't have physical serial cables
> attached to them - nor do we want to use many concentrators to achieve
> this - but the off-the-shelf Supermicro boxes all have an IPMI card
> installed and SoL works quite ok.
Syslog-ng? Popping a USB flash disk on them to use as an alternative
log location (if the kernel doesn't actively lock up on the disk error)?
Booting from a USB flash image or diskless, so that a disk crash is just
a disk crash?
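To make the central-log idea concrete, here is a minimal sketch in Python
of a log host that accepts plain UDP syslog datagrams and appends them to
one file per sending node. It is a stand-in for illustration, not
syslog-ng itself. LOG_DIR, the port 5514, and the names split_priority,
NodeLogHandler and main are all assumptions made up for this example
(classic syslog listens on UDP port 514, which needs root). Note that UDP
delivery is not guaranteed, which may be one reason some messages never
arrive.

```python
import re
import socketserver
from pathlib import Path

LOG_DIR = Path("node-logs")   # hypothetical directory on the log host
LISTEN = ("0.0.0.0", 5514)    # assumed unprivileged port; classic syslog uses UDP 514

_PRI = re.compile(r"^<(\d{1,3})>")


def split_priority(line):
    """Split a '<PRI>message' syslog line into (facility, severity, rest).

    PRI encodes facility * 8 + severity. Lines without a PRI prefix are
    returned unchanged with facility and severity set to None.
    """
    m = _PRI.match(line)
    if not m:
        return None, None, line
    pri = int(m.group(1))
    return pri >> 3, pri & 7, line[m.end():]


class NodeLogHandler(socketserver.BaseRequestHandler):
    """Append each received datagram to a log file named after the sender's IP."""

    def handle(self):
        data, _sock = self.request  # for UDP servers, request is (bytes, socket)
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        facility, severity, rest = split_priority(text)
        LOG_DIR.mkdir(exist_ok=True)
        with open(LOG_DIR / f"{self.client_address[0]}.log", "a") as fh:
            fh.write(f"{facility}.{severity} {rest}\n")


def main():
    # Runs until interrupted; one datagram is handled at a time.
    with socketserver.UDPServer(LISTEN, NodeLogHandler) as server:
        server.serve_forever()
```

Calling main() on the log host and pointing the nodes at that host and
port would give one file per node. This only captures what the node's
syslog daemon manages to send before it dies.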
> Initially, conserver.com looked nice and we also found an IPMI interface
> for it, but that comes with two downsides: (1) it blocks IPMI access (I
> have yet to find out if a secondary user can use SoL when another user
> is using this already, but I doubt it) and (2) it simply does not catch
> messages appearing in dmesg (simple ones like plugging in a USB
> keyboard), but that may be a configuration problem on our side.
> Also we tried (r)syslog but somehow this does not get all the messages
> either, even when using something like *.* @loghost.
> For the time being we are experimenting with running "script" inside many
> "screen" sessions, which should be able to monitor ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.
> So, my question boils down to: How do people solve this problem?
> Thanks a lot
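The "script" inside "screen" approach described above could, as a rough
sketch, be folded into a single Python process that starts one
"ipmitool ... sol activate" per node on its own pseudo-terminal and
appends whatever appears to a per-node file. The function names, the
sol-logs directory, and the user and password-file arguments are
hypothetical. The sketch assumes ipmitool is installed, that the BMCs
speak the lanplus interface, and that ipmitool wants a terminal on stdin
(hence the pty). It does not get around the one-SoL-session-per-node
limitation raised earlier in the message.

```python
import os
import pty
import selectors
import subprocess


def sol_command(host, user, password_file):
    """Build the ipmitool argv for one node's SoL session."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-f", password_file, "sol", "activate"]


def start_capture(host, user, password_file, log_dir="sol-logs"):
    """Start ipmitool on a pseudo-terminal; return (process, master fd, log file)."""
    os.makedirs(log_dir, exist_ok=True)
    master, slave = pty.openpty()
    proc = subprocess.Popen(sol_command(host, user, password_file),
                            stdin=slave, stdout=slave, stderr=slave,
                            close_fds=True)
    os.close(slave)  # the child keeps its own copy of the slave end
    log = open(os.path.join(log_dir, host + ".log"), "ab")
    return proc, master, log


def collect(hosts, user, password_file):
    """Multiplex all SoL sessions and append their output to per-node logs."""
    sel = selectors.DefaultSelector()
    for host in hosts:
        proc, master, log = start_capture(host, user, password_file)
        sel.register(master, selectors.EVENT_READ, (proc, log))
    while sel.get_map():  # loop until every session has ended
        for key, _events in sel.select():
            proc, log = key.data
            try:
                chunk = os.read(key.fd, 4096)
            except OSError:  # on Linux, EIO means the child closed the pty
                chunk = b""
            if chunk:
                log.write(chunk)
                log.flush()
            else:
                sel.unregister(key.fd)
                os.close(key.fd)
                log.close()
                proc.wait()
```

Something like collect(["node0001", "node0002"], "admin", "/root/.ipmipass")
would run until every session ends. With more than 1000 nodes the
open-file limit of the collecting process would have to be raised, since
each node costs a pty and a log file.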
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977