[Beowulf] Monitoring crashing machines
Lawrence Stewart larry.stewart at sicortex.com
Tue Sep 9 05:29:15 PDT 2008
Carsten Aulbert wrote:
> Hi all,
>
> I would tend to guess this problem is fairly common and many solutions
> are already in place, so I would like to enquirer about your solutions
> to the problem:
>
> In our large cluster we have certain nodes going down with I/O hard disk
> errors. We have some suspicion about the causes but would like to
> investigate this further. However, the log files don't show much if
> anything at all (which is understandably given that the log files reside
> on disk and we are hitting I/O disk errors). Albeit the console shows
> some interesting messages but cannot scroll back long enough.
>
> My question now, is there a cute little way to gather all the console
> outputs of > 1000 nodes? The nodes don't have physical serial cables
> attached to them - nor do we want to use many concentrators to achieve
> this - but the off-the-shelf Supermicro boxes all have an IPMI card
> installed and SoL works quite ok.
>
> Initially, conserver.com looked nice and we also found an IPMI interface
> for it, but that comes with two downsides: (1) it blocks IPMI access (I
> have yet to find out if a secondary user can use SoL when another user
> is using this already, but I doubt it) and (2) it simply does not catch
> messages appearing in dmesg (simple ones like plugging in a USB
> keyboard), but that may be a configuration problem on our side.
>
> Also we tried (r)syslog but somehow this does not get all the messages
> either, even when using something like *.* @loghost.
>
> For the time being we are experimenting with using "script" in many
> "screen" environment which should be able to monitor ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.
>
> So, my question boils down to: How do people solve this problem?
>
> Thanks a lot
>
> Cheers
>
> Carsten
>

We use conserver here at SiCortex, but it doesn't talk to node consoles directly.
Instead, we've written a kind of intermediary between conserver and the
real console access. The situation isn't exactly parallel, but if you
wind up writing your own "intermediary" the structure and code might be
useful.

Node Linux
  -> custom char device driver
  -> scan chain hardware
  -> embedded uClinux board-level microprocessor
  -> "scan daemon", which concentrates the terminals from 27 nodes
  -> TCP/IP socket
  -> x86 service processor
  -> "scconserver", which speaks the idiosyncratic terminal protocol on
     one side, and demultiplexes the consoles into individual TCP sockets
  -> conserver, which does all the usual conserver stuff.

This works well enough at the 972 node scale.

In your situation, the intermediary could export IPMI sockets which it
would multiplex in with its connection to the real IPMI access on the
node.

We used libevent to write scconserver, which makes all the book-keeping
for a zillion connections fairly straightforward.

If you head this way, you might get some benefit from

http://downloads.sicortex.com/distfiles/sicortex-scconserver-5.0.0.9.50831.tbz2

All open source.

Regarding dmesg vs console, this is all according to node logging
settings, which I don't know much about.

-- 
-Larry / Sector IX
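
To make the "intermediary" idea a bit more concrete, here is a minimal sketch of the fan-out half of such a program. It is not taken from scconserver (which is written in C on top of libevent); it uses Python's asyncio instead, and it assumes the node's console output is already reachable as a plain TCP byte stream. The names ConsoleFanout, serve, upstream_host, upstream_port and listen_port are all made up for illustration. The intermediary holds the single upstream session, keeps a bounded scrollback in memory, and copies every chunk to any number of attached read-only clients, so one more watcher does not need a second session to the node.

```python
import asyncio
from collections import deque


class ConsoleFanout:
    """Holds one upstream console stream and shares it with many clients.

    Hypothetical helper for illustration; not part of conserver or scconserver.
    """

    def __init__(self, scrollback_bytes=65536):
        self.scrollback = deque()      # recent chunks of console output
        self.scrollback_size = 0       # total bytes currently held
        self.scrollback_limit = scrollback_bytes
        self.clients = set()           # StreamWriter objects of attached clients

    def record(self, chunk):
        """Append a chunk, dropping the oldest chunks once over the limit."""
        self.scrollback.append(chunk)
        self.scrollback_size += len(chunk)
        while (self.scrollback_size > self.scrollback_limit
               and len(self.scrollback) > 1):
            self.scrollback_size -= len(self.scrollback.popleft())

    async def pump(self, upstream_reader):
        """Read the single real console stream and copy it to every client."""
        while True:
            chunk = await upstream_reader.read(4096)
            if not chunk:              # upstream closed the connection
                break
            self.record(chunk)
            for writer in list(self.clients):
                try:
                    writer.write(chunk)
                except Exception:
                    # A broken client must not stall the other watchers.
                    self.clients.discard(writer)

    async def handle_client(self, reader, writer):
        """Replay the scrollback to a new client, then keep it attached."""
        writer.write(b"".join(self.scrollback))
        self.clients.add(writer)
        try:
            # Read-only watcher: discard whatever the client sends until EOF.
            while await reader.read(1024):
                pass
        finally:
            self.clients.discard(writer)
            writer.close()


async def serve(upstream_host, upstream_port, listen_port):
    """Attach once to the upstream console and export it on a local port.

    Assumes the console is available as raw TCP at upstream_host:upstream_port.
    """
    fanout = ConsoleFanout()
    reader, _writer = await asyncio.open_connection(upstream_host, upstream_port)
    server = await asyncio.start_server(
        fanout.handle_client, "127.0.0.1", listen_port)
    async with server:
        await fanout.pump(reader)
```

Something like asyncio.run(serve("node042-console", 7000, 9042)) would start it, where the host name and both ports are placeholders. A real intermediary would also have to pass keystrokes back upstream for interactive use and write the stream to a log file on another machine, which is the part that helps when the node's own disk is the thing failing.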
