Hi, <br><br>I've managed to put together a simple 2-node cluster using Debian etch, OpenMPI, FAI and Cfengine. <br><br>I'm looking for ideas that can help me build a better self-healing cluster. Right now I'm writing rule files for cfengine and would appreciate any input: sample files, and important configurations that need to be in place for the cluster's health. (I know these are site-specific, but I'm sure I can get good hints out of them.)

<br><br>I'd also be glad to hear about any monitoring system that can work alongside cfengine on the maintenance side. I've looked briefly into Ganglia and Nagios so far. It seems Ganglia is mostly meant for large clusters (or groups of clusters) and focuses on hardware resources. Nagios seems better suited to my needs, but the gurus on the cfengine mailing list believe that cfenvd and cfexecd can provide equivalent monitoring and recovery capability (in terms of response time).

<br>What's your take on these options?<br><br>Thanks in advance for any input.