[Beowulf] Node Drop-Off
Chris Samuel csamuel at vpac.org
Mon Nov 13 05:36:26 PST 2006
- Previous message: [Beowulf] Node Drop-Off
- Next message: [Beowulf] onboard Gb lan: any opinion, sugestion or impression?
On Sunday 12 November 2006 16:13, Tim Moore wrote:

> Has anyone ever seen such behavior?

Others have mentioned attaching consoles, etc., but it's also worth trawling through any logs in /var/log to see if anything is showing up there too.

Check dmesg whilst the node is under load; if you're seeing machine check problems, ECC parity problems or SCSI errors then you might catch them then (though they should also be in the logs).

If the node supports IPMI, try to use that to get to any hardware logs, and if you use Ganglia to monitor the cluster, have a look at that and see if there's anything there that could show whether a user space program is causing it.

I know users shouldn't be able to crash nodes, but we have seen that on some kernels where the OOM killer is not very good at getting things right and the machine deadlocks when the user's program runs it out of RAM.

Another possibility is bad blocks in the swap partition, which might only show up in low memory conditions (yes, using swap is bad, but people write bad code too) and corrupt something essential that's been paged out.

What does uname -a say on the box?

cheers!
Chris

-- 
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
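
As a rough illustration of the log trawling suggested above, here is a minimal Python sketch that scans log files (or saved dmesg output) for the kinds of messages mentioned: machine checks, ECC/parity problems, SCSI or I/O errors, and OOM killer activity. The regular expressions and the names PATTERNS, scan_lines and scan_file are assumptions made for this example, not anything from the original post; kernel message wording varies between versions, so the patterns would need adjusting for a given box.

```python
import re
import sys
from collections import defaultdict

# Illustrative patterns only (assumed, not exhaustive): actual kernel
# message wording differs between kernel versions and drivers.
PATTERNS = {
    "machine_check": re.compile(r"machine check|\bmce\b", re.IGNORECASE),
    "ecc": re.compile(r"\becc\b|\bedac\b|parity", re.IGNORECASE),
    "scsi_io": re.compile(r"scsi.*error|i/o error", re.IGNORECASE),
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
}


def scan_lines(lines):
    """Return {category: [matching lines]} for an iterable of log lines."""
    hits = defaultdict(list)
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return dict(hits)


def scan_file(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as fh:
        return scan_lines(fh)


if __name__ == "__main__":
    # Print every suspicious line found in the files named on the command line.
    for path in sys.argv[1:]:
        for category, matched in scan_file(path).items():
            for line in matched:
                print(f"{path}: [{category}] {line}")
```

It could be run against whatever log files the node keeps under /var/log, or against dmesg output captured to a file while the node is under load. A match is only a hint about where to look, not a diagnosis.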
