[Beowulf] Configuration Management and Monitoring of a Debian Etch Beowulf Cluster
Julien Leduc julien.leduc at lri.fr
Wed Sep 12 06:16:01 PDT 2007
Hi,

> I've managed to put together a simple 2-node cluster using Debian etch,
> OpenMPI, FAI & Cfengine.

Do you mean 2 nodes + 1 server, or 1 server and 1 node?

> I'm looking for ideas that can help me with building a better
> self-healing cluster. Right now I'm making rule files for cfengine and
> would acknowledge any input on sample files and important configurations
> that need to be made for the cluster's health. (Although it's
> site-specific but I'm sure I can get good hints out of them)

Since everything depends on your configuration and every cluster is different, I can't give specific hints there. Here is what I think anyway:

FAI is a good starting point: it gives you a fully automated install on all your nodes, and you can then grow incrementally from it with cfengine. BUT if you intend to use your cluster for a long period of time, restarting from FAI every time is expensive (it means zero data on the disk and a brand-new install); an image-based deployment system would be more efficient.

Start with FAI -> fresh install -> incremental updates with cfengine, then manage some snapshots of the system with an image-based deployment system and resynchronize your nodes from time to time with a fresh deployment of the latest snapshot. This costs less in terms of time and resource consumption than starting from scratch; moreover, recovery is faster (and safer) than replaying the full process from the start.

If you are using Debian, you have to be sure your package repository is synchronized for all your nodes (between 2 cluster snapshots). Setting up an apt-cacher for your cluster (or a more general HTTP proxy server) lets you enforce package synchronization across the cluster and make fair use of the external Debian repository. OK, this is really important for huge clusters, but all clusters are conceived to grow.
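To make the package synchronization point concrete, here is a minimal Python sketch of how drift between nodes could be detected. It assumes you have already collected, for each node, a list of tab-separated "name version" lines such as `dpkg-query -W` prints; the function names, node names and version strings are hypothetical and only serve as illustration.

```python
from typing import Dict, Optional


def parse_dpkg_list(text: str) -> Dict[str, str]:
    """Parse 'name<TAB>version' lines into a {package: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("\t")
        packages[name] = version
    return packages


def find_package_drift(
    nodes: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the packages whose version (or presence) differs between nodes."""
    all_names = set()
    for pkgs in nodes.values():
        all_names.update(pkgs)

    drift = {}
    for name in sorted(all_names):
        # None marks a package that is missing on a given node.
        versions = {node: pkgs.get(name) for node, pkgs in nodes.items()}
        if len(set(versions.values())) > 1:
            drift[name] = versions
    return drift


if __name__ == "__main__":
    # Made-up package lists for two hypothetical nodes.
    node1 = parse_dpkg_list("openmpi-bin\t1.1-2.3\ncfengine2\t2.1.20-1\n")
    node2 = parse_dpkg_list("openmpi-bin\t1.1-2.5\ncfengine2\t2.1.20-1\n")
    for pkg, versions in find_package_drift({"node1": node1, "node2": node2}).items():
        print(pkg, versions)
```

An empty result between two snapshots would suggest the nodes were installed from the same repository state; any entry points at a node that pulled packages at a different time.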
Package synchronization is critical, since installing packages on different nodes at different times can lead to many different results (and lots of failures).

> However I'd also be glad to see if you have any monitoring system in
> mind that can cooperate with cfengine in the maintenance job. I've
> looked briefly into Ganglia and Nagios so far. It seems Ganglia is
> mostly meant for large (groups of) clusters and focuses on hw resources.
> Nagios seems to be better-suited for my job, but the gurus at cfengine
> mailing list believe that cfenvd & cfexecd can provide equal monitoring
> & recovery capability (in terms of response time).
> What's your take on either of them?

Ganglia is OK and can be used to quickly check your cluster usage; Nagios deals with the services critical to the cluster (NFS, DNS, DHCP server, TFTP...). Be careful about the load added to your network by all the monitoring tools you are setting up (using cfenvd & cfexecd could give you better control of the additional load on your infrastructure). To monitor the cluster status, you should use your batch scheduler interface (to look for free nodes, dead ones...).

Hope this helped,
Julien Leduc
