[Beowulf] Monitoring crashing machines

Reuti reuti at staff.uni-marburg.de
Tue Sep 9 01:59:14 PDT 2008


On 09.09.2008 at 09:53, Carsten Aulbert wrote:

> Hi all,
> I would tend to guess this problem is fairly common and many solutions
> are already in place, so I would like to enquire about your solutions
> to the problem:
> In our large cluster we have certain nodes going down with hard disk I/O
> errors. We have some suspicion about the causes but would like to
> investigate this further. However, the log files show little, if
> anything at all (which is understandable, given that the log files reside
> on disk and we are hitting disk I/O errors). The console does show
> some interesting messages, but we cannot scroll back far enough.
> My question now: is there a neat way to gather all the console
> output of more than 1000 nodes? The nodes don't have physical serial cables
> attached to them - nor do we want to use many concentrators to achieve
> this - but the off-the-shelf Supermicro boxes all have an IPMI card
> installed and SoL works quite ok.

I set up syslog-ng on the nodes to log to the headnode. There, each
node gets its own file, e.g. "/var/log/nodes/node42.messages".
If you are interested, I could post my configuration files for
headnode and clients.
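
The per-node file layout can be illustrated with a small sketch. This is
Python, not syslog-ng, and it is not the configuration offered above; it only
shows the idea of a headnode writing each sender's messages to its own file.
The directory default, the UDP port, and all function names are assumptions
made for illustration, and files are keyed by the sender's IP address rather
than a resolved hostname.

```python
import os
import re
import socketserver

LOG_DIR = "/var/log/nodes"  # assumed location, matching the example path above
UNSAFE = re.compile(r"[^A-Za-z0-9._-]")  # characters not allowed in file names


def log_path_for(host, log_dir=LOG_DIR):
    """Return the per-node log file, e.g. /var/log/nodes/node42.messages."""
    name = UNSAFE.sub("_", host) or "unknown"
    return os.path.join(log_dir, name + ".messages")


def append_message(host, message, log_dir=LOG_DIR):
    """Append one message to the file belonging to `host`; return its path."""
    os.makedirs(log_dir, exist_ok=True)
    path = log_path_for(host, log_dir)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(message.rstrip("\n") + "\n")
    return path


class SyslogHandler(socketserver.BaseRequestHandler):
    """Write every received datagram to the sending node's file."""

    def handle(self):
        data, _sock = self.request  # for UDP servers: (bytes, socket)
        text = data.decode("utf-8", errors="replace")
        append_message(self.client_address[0], text, self.server.log_dir)


def serve(bind="0.0.0.0", port=514, log_dir=LOG_DIR):
    """Listen for syslog datagrams (port 514 normally requires root)."""
    with socketserver.UDPServer((bind, port), SyslogHandler) as server:
        server.log_dir = log_dir
        server.serve_forever()
```

Calling serve() on the headnode and pointing the nodes' syslog at it would
produce one file per sender. Because the files live on the headnode, they
survive a node whose local disk has stopped accepting writes.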

-- Reuti

> Initially, conserver.com looked nice and we also found an IPMI interface
> for it, but that comes with two downsides: (1) it blocks IPMI access (I
> have yet to find out whether a second user can use SoL while another user
> is already using it, but I doubt it) and (2) it simply does not catch
> messages appearing in dmesg (simple ones like plugging in a USB
> keyboard), but that may be a configuration problem on our side.
> Also we tried (r)syslog but somehow this does not get all the messages
> either, even when using something like *.* @loghost.
> For the time being we are experimenting with running "script" inside many
> "screen" sessions, which should be able to capture ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.
> So, my question boils down to: How do people solve this problem?
> Thanks a lot
> Cheers
> Carsten
> -- 
> Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
> Callinstrasse 38, 30167 Hannover, Germany
> Phone/Fax: +49 511 762-17185 / -17193
> http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31