Beowulf and Big Brother
canon at pookie.nersc.gov
Tue Nov 12 23:17:15 PST 2002
We are using netsaint/nagios to monitor our cluster (a little over 300 nodes). Netsaint works well for monitoring services and basic host responsiveness. The notification is very tunable, and it can all be controlled from a web interface. It comes with scripts for monitoring most of the standard services (mail, web, snmp, etc.), and it's simple to extend to new services. I'm not sure how well it would scale past 500-1000 nodes, especially if you were monitoring several services or snmp strings on each node. We monitor nfs daemons on our disk servers and mainly host responsiveness on the compute nodes. We also collect the batch scheduler (LSF) host list and monitor that as well.

We just recently started using ganglia to monitor performance. I highly recommend it. It's a breeze to install and configure. It proved useful right off the bat for spotting some problems on the cluster. During a linpack run we used it to monitor the performance of the run and determine if things were looking good. It's worth taking an hour or so to try it out. The web interface does require php.

So far, these two separate systems provide most of the monitoring that we need. I've also developed a hardware database that we use to track other issues such as inventory and hardware repairs. In fact, I first started it because I needed a way to track outstanding disk repairs (which happen with 800+ drives). Eventually I hope to integrate some of the monitoring data into the database so that we have one central location to view the cluster.

--Shane Canon

> I just realized that this question is better to ask the Big Brother
> people. But maybe you have some comments about it too. So:
>
> We have 2000 computing nodes and 96 monitoring computers. There is a
> possibility that we have 96 different beowulf clusters there, each
> having about 20 PCs, but you never know (no decisions yet in that
> matter) ;) I was just wondering if it is reasonable or smart to monitor
> these master nodes with Big Brother? Or are there even ready-made
> shell scripts for that?
>
> I was thinking of something like this: A script runs every once in a
> while, gathering data on the status of each slave node, on each master
> node. Then that data is sent to the Big Brother server whenever it is
> asked. So every master would be running a BB client.
>
> Is there any sense in doing things like that?
>
> Thanks,
> Olli Laaksonen
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
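
The scheme in the quoted message (each master node periodically checks its own slave nodes and hands a single aggregated status to the central monitor) could be sketched roughly as below. This is only an illustration, not something taken from Big Brother, netsaint, or either poster's setup: the function names `probe` and `summarize`, the node names, the TCP port 22 check, and the 10% warning threshold are all assumptions. Delivery of the result to the monitoring server is deliberately left out, since it depends on the client in use.

```python
import socket


def probe(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time.

    Port 22 is an assumed stand-in for "the node is alive"; any
    service the slave nodes are known to run would do.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(results, warn_fraction=0.1):
    """Collapse per-node results into one (color, message) pair.

    results maps node name -> True (up) or False (down).
    The color is green if every node is up, yellow if at most
    warn_fraction of the nodes are down, and red otherwise.
    """
    down = sorted(name for name, up in results.items() if not up)
    total = len(results)
    if not down:
        color = "green"
    elif len(down) <= warn_fraction * total:
        color = "yellow"
    else:
        color = "red"
    message = f"{total - len(down)}/{total} nodes up"
    if down:
        message += "; down: " + ", ".join(down)
    return color, message


if __name__ == "__main__":
    # Hypothetical names for the ~20 slave nodes behind one master.
    nodes = [f"node{i:02d}" for i in range(1, 21)]
    results = {name: probe(name) for name in nodes}
    color, message = summarize(results)
    print(color, message)
```

Run from cron on each master, a script like this keeps the central server's load proportional to the 96 masters rather than the 2000 compute nodes, which is the main attraction of the two-level layout.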
