[Beowulf] Ganglia showing dead node as live

David Mathog mathog at caltech.edu
Thu Nov 15 11:33:14 PST 2007


So, one of our Tyan S2466 nodes finally gave up the ghost.  PS is ok
(tried a known good spare too), replaced battery on Mobo, the fans spin,
the ethernet flashes, but it won't so much as beep and there's no BIOS
video, let alone disk activity.  Probably a blown CPU or motherboard.
Anyway, the failed hardware is another story.

The odd thing was that I found this when a submitted job blew up because
it couldn't connect by PVM to the dead node.  Couldn't ping it either.  On
logging into another node, gstat still showed the dead one, looking just
like the others.  Here the first one is dead and the second is live:

monkey02.cluster    1 (    0/   54) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0, 100.0,   0.0] ON
monkey03.cluster    1 (    0/   50) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0, 100.0,   0.0] ON

also

 Dead Hosts: 0
Gexec Hosts: 20

Now normally when I shut down ganglia, or shut down a node, the values
in gstat are correct, yet here they were not.  The dead node probably
rolled over and died none too gracefully, so it never TOLD ganglia it
was going away.  Odd though that ganglia seems not to have figured it out
for itself.  The ganglia version is ganglia-core-3.0.4-1mdv2007.1.
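
For anyone who wants to cross-check gstat independently, here is a minimal
Python sketch.  It assumes gmond is answering on its usual TCP port (8649
unless gmond.conf says otherwise) and that each HOST element in the XML
dump carries TN (seconds since last heard from) and TMAX attributes.  The
"TN greater than 4 x TMAX" staleness rule is an assumption borrowed from
how the web frontend appears to judge hosts, not something confirmed for
3.0.4.  The function names are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump that gmond sends when a client connects.

    The default port 8649 is an assumption; check gmond.conf.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the socket when the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_data, factor=4):
    """Return (name, TN) for hosts whose last report is older than factor * TMAX.

    The factor of 4 is an assumed threshold, not a documented constant.
    """
    root = ET.fromstring(xml_data)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", 0))
        tmax = int(host.get("TMAX", 0))
        if tmax > 0 and tn > factor * tmax:
            stale.append((host.get("NAME"), tn))
    return stale


if __name__ == "__main__":
    for name, tn in stale_hosts(fetch_gmond_xml()):
        print(f"{name}: last heard from {tn} s ago")
```

Run on any node with a working gmond, this lists the hosts that gmond
itself has not heard from recently, which can then be compared with what
gstat reports as live.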

I then ran "service gmond restart" on that one live node, and it came up
showing itself as a gexec host, but none of the others.  It was necessary
to restart gmond on all nodes to pick up the expected 19 gexec hosts.
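
The restart across all nodes can be scripted.  The sketch below is only an
illustration: it assumes passwordless ssh to every node, and the host names
passed in are whatever your cluster uses (the helper names here are
hypothetical).  By default it only prints the commands it would run.

```python
import subprocess


def restart_commands(hosts, service="gmond"):
    """Build one 'ssh HOST service gmond restart' command per host."""
    return [["ssh", host, "service", service, "restart"] for host in hosts]


def restart_all(hosts, dry_run=True):
    """Print the commands (dry run) or run them one host at a time."""
    for cmd in restart_commands(hosts):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False so one unreachable node does not stop the loop
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Host names taken from the gstat output above; extend as needed.
    restart_all(["monkey02.cluster", "monkey03.cluster"])
```

Leaving dry_run=True first makes it easy to eyeball the command list
before touching the nodes.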

Seems like that one node exiting abnormally did a number on ganglia.

Anybody else seen this before?

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


