[Beowulf] Ganglia showing dead node as live
David Mathog mathog at caltech.edu
Thu Nov 15 11:33:14 PST 2007
So, one of our Tyan S2466 nodes finally gave up the ghost. The PS is OK (tried a known good spare too), I replaced the battery on the mobo, the fans spin, and the ethernet flashes, but it won't so much as beep and there's no BIOS video, let alone disk activity. Probably a blown CPU or motherboard.

Anyway, the failed hardware is another story. The odd thing was that I found this when a submitted job blew up because it couldn't connect by PVM to the dead node. Couldn't ping it either. On logging into another node, gstat still showed the dead one, looking just like the others. Here the first one is dead and the second live:

monkey02.cluster 1 ( 0/ 54) [ 0.00, 0.00, 0.00] [ 0.0, 0.0, 0.0, 100.0, 0.0] ON
monkey03.cluster 1 ( 0/ 50) [ 0.00, 0.00, 0.00] [ 0.0, 0.0, 0.0, 100.0, 0.0] ON

also

Dead Hosts: 0
Gexec Hosts: 20

Now normally when I shut down ganglia, or shut down a node, the values in gstat are correct, yet here they were not. The dead node probably rolled over and died none too gracefully, so it never TOLD ganglia it was going away. Odd, though, that ganglia seems not to have figured it out for itself. The ganglia version is ganglia-core-3.0.4-1mdv2007.1.

I then ran "service gmond restart" on that one node, and it came up showing itself as a gexec host, but none of the others. It was necessary to restart gmond on all nodes to pick up the expected 19 gexec hosts.

Seems like that one node exiting abnormally did a number on ganglia. Anybody else seen this before?

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
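
The cross-check described above (gstat lists a host, but ping cannot reach it) could be scripted roughly as follows. This is only a sketch under assumptions: the regular expression is fitted to the two host lines quoted in the message, the field names (cpus, running, total, load, cpu, gexec) are guesses at what the columns mean, and the helper names parse_gstat_line, is_reachable, and stale_hosts are hypothetical. The ping flags assume a Linux iputils ping.

```python
import re
import subprocess

# Fitted to the host lines quoted in the message. The field names are
# assumptions about what each gstat column means, not taken from Ganglia.
HOST_LINE = re.compile(
    r"^\s*(?P<host>\S+)\s+(?P<cpus>\d+)\s+"
    r"\(\s*(?P<running>\d+)/\s*(?P<total>\d+)\)\s+"
    r"\[(?P<load>[^\]]*)\]\s+\[(?P<cpu>[^\]]*)\]\s+(?P<gexec>ON|OFF)\s*$"
)


def parse_gstat_line(line):
    """Return a dict for one host line, or None if the line is not a host line."""
    m = HOST_LINE.match(line)
    if m is None:
        return None
    return {
        "host": m.group("host"),
        "cpus": int(m.group("cpus")),
        "running": int(m.group("running")),
        "total": int(m.group("total")),
        "load": [float(x) for x in m.group("load").split(",")],
        "cpu": [float(x) for x in m.group("cpu").split(",")],
        "gexec": m.group("gexec") == "ON",
    }


def is_reachable(host, timeout_s=2):
    """Send one ICMP echo request; assumes Linux ping (-c count, -W timeout)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def stale_hosts(lines, probe=is_reachable):
    """List hosts that appear in the gstat output but fail the probe."""
    stale = []
    for line in lines:
        record = parse_gstat_line(line)
        if record is not None and not probe(record["host"]):
            stale.append(record["host"])
    return stale
```

Feeding the saved gstat output to stale_hosts(text.splitlines()) would list the hosts that ganglia still reports but that do not answer ping. Summary lines such as "Dead Hosts: 0" do not match the pattern and are skipped.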
