[Beowulf] Monitoring crashing machines

Reuti reuti at staff.uni-marburg.de
Tue Sep 9 01:59:14 PDT 2008


Hi,

On 09.09.2008 at 09:53, Carsten Aulbert wrote:

> Hi all,
>
> I would tend to guess this problem is fairly common and many solutions
> are already in place, so I would like to enquire about your solutions
> to the problem:
>
> In our large cluster we have certain nodes going down with hard disk I/O
> errors. We have some suspicion about the causes but would like to
> investigate this further. However, the log files don't show much, if
> anything at all (which is understandable given that the log files reside
> on disk and we are hitting disk I/O errors). The console does show
> some interesting messages, but we cannot scroll back far enough.
>
> My question now: is there a cute little way to gather all the console
> output of > 1000 nodes? The nodes don't have physical serial cables
> attached to them - nor do we want to use many concentrators to achieve
> this - but the off-the-shelf Supermicro boxes all have an IPMI card
> installed and SoL works quite ok.

I set up syslog-ng on the nodes to log to the headnode. There, each
node gets its own file, e.g. "/var/log/nodes/node42.messages".
If you are interested, I could post my configuration files for the
headnode and the clients.

-- Reuti
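
For illustration only, and not the syslog-ng configuration offered above: a minimal Python sketch of the same idea, a collector on the headnode that appends each incoming syslog datagram to a per-node file. The names `log_path_for`, `append_message`, `SyslogHandler` and `serve` are hypothetical, and the sketch assumes the nodes forward plain UDP syslog (as with "*.* @loghost") and that the sender's IP address is an acceptable stand-in for the node name.

```python
import os
import re
import socketserver

LOG_DIR = "/var/log/nodes"  # directory taken from the example path in the message


def log_path_for(host, log_dir=LOG_DIR):
    """Return the per-node log file, e.g. /var/log/nodes/node42.messages."""
    # Replace anything that is not safe in a file name (assumption: simple host names).
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", host)
    return os.path.join(log_dir, safe + ".messages")


def append_message(host, message, log_dir=LOG_DIR):
    """Append one syslog line to the file belonging to `host`."""
    os.makedirs(log_dir, exist_ok=True)
    with open(log_path_for(host, log_dir), "a", encoding="utf-8") as log:
        log.write(message.rstrip("\n") + "\n")


class SyslogHandler(socketserver.BaseRequestHandler):
    """Handle one UDP datagram sent by a node's syslog daemon."""

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        text = data.decode("utf-8", errors="replace")
        # Assumption: the sender's IP address identifies the node.
        append_message(self.client_address[0], text)


def serve(address="0.0.0.0", port=514):
    """Listen for syslog datagrams until interrupted (port 514 needs root)."""
    with socketserver.UDPServer((address, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on the headnode would start the collector. Because the files live on the headnode, messages a node manages to send before it dies survive that node's disk errors; anything the node cannot hand to its syslog daemon in time is still lost.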

>
> Initially, conserver.com looked nice and we also found an IPMI interface
> for it, but that comes with two downsides: (1) it blocks IPMI access (I
> have yet to find out whether a second user can use SoL while another
> user is already using it, but I doubt it) and (2) it simply does not
> catch messages appearing in dmesg (simple ones like plugging in a USB
> keyboard), but that may be a configuration problem on our side.
>
> Also we tried (r)syslog but somehow this does not get all the messages
> either, even when using something like *.* @loghost.
>
> For the time being we are experimenting with running "script" in many
> "screen" sessions, which should be able to monitor ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.
>
> So, my question boils down to: How do people solve this problem?
>
> Thanks a lot
>
> Cheers
>
> Carsten
>
> -- 
> Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
> Callinstrasse 38, 30167 Hannover, Germany
> Phone/Fax: +49 511 762-17185 / -17193
> http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf



