Beowulf and Big Brother

canon at pookie.nersc.gov
Tue Nov 12 23:17:15 PST 2002


We are using Netsaint/Nagios to monitor our cluster (a little over
300 nodes).  Netsaint works well for monitoring services and basic
host responsiveness.  The notification is very tunable, and it can
all be controlled from a web interface.  It comes with scripts
for monitoring most of the standard services (mail, web, SNMP, etc.),
and it's simple to extend to new services.  I'm not sure
how well it would scale past 500-1000 nodes, especially if
you were monitoring several services or SNMP strings on
each node.  We monitor the NFS daemons on our disk servers
and mainly host responsiveness on the compute nodes.  We also
collect the batch scheduler (LSF) host list and monitor that
as well.
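
Extending Netsaint/Nagios to a new check mostly means writing a small
plugin that prints one status line and exits with 0 (OK), 1 (WARNING),
2 (CRITICAL) or 3 (UNKNOWN).  Here is a minimal sketch of a "host
responds" check in Python.  It assumes that a plain TCP connect is a
good enough test; the port, the thresholds and the script name are
made up for illustration and are not what we actually run.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style 'host responds' check (hypothetical example)."""
import socket
import sys
import time

# Exit codes that Netsaint/Nagios plugins are expected to return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(elapsed, warn_secs=1.0, crit_secs=5.0):
    """Map a connect time (seconds, or None on failure) to (exit_code, message).

    The thresholds are illustrative assumptions, not recommended values.
    """
    if elapsed is None:
        return CRITICAL, "CRITICAL - no response"
    if elapsed >= crit_secs:
        return CRITICAL, "CRITICAL - connect took %.2fs" % elapsed
    if elapsed >= warn_secs:
        return WARNING, "WARNING - connect took %.2fs" % elapsed
    return OK, "OK - connect took %.2fs" % elapsed


def connect_time(host, port=22, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


# Run as: check_responds.py HOST
if __name__ == "__main__" and len(sys.argv) == 2:
    code, message = classify(connect_time(sys.argv[1]))
    print(message)
    sys.exit(code)
```

Keeping the threshold logic in classify() separate from the network
call makes it easy to try different limits without touching a real host.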

We recently started using Ganglia to monitor performance.
I highly recommend it.  It's a breeze to install and configure.
It proved useful right off the bat for spotting some problems
on the cluster.  During a Linpack run we used it to monitor
the performance of the run and determine whether things were looking
good.  It's worth taking an hour or so to try it out.  The web
interface does require PHP.

So far, these two separate systems provide most of the monitoring
that we need.  I've also developed a hardware database that
we use to track other issues such as inventory and hardware
repairs.  In fact, I first started it because I needed a way
to track outstanding disk repairs (which happen with 800+ drives).
Eventually I hope to integrate some of the monitoring
data into the database so that we have one central location
to view the cluster.
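
To give an idea of how little is needed to start tracking repairs,
here is a rough sketch using Python's sqlite3 module.  The table
layout, column names and function names are all invented for this
example and are not the schema of our actual database.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a repair-tracking database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS repairs (
               id        INTEGER PRIMARY KEY,
               node      TEXT NOT NULL,
               component TEXT NOT NULL,
               opened    TEXT NOT NULL,   -- ISO date the repair was requested
               closed    TEXT             -- NULL while still outstanding
           )"""
    )
    return conn


def open_repair(conn, node, component, opened):
    """Record a new repair request and return its id."""
    cur = conn.execute(
        "INSERT INTO repairs (node, component, opened) VALUES (?, ?, ?)",
        (node, component, opened),
    )
    conn.commit()
    return cur.lastrowid


def close_repair(conn, repair_id, closed):
    """Mark a repair as finished on the given date."""
    conn.execute("UPDATE repairs SET closed = ? WHERE id = ?", (closed, repair_id))
    conn.commit()


def outstanding(conn):
    """Return (node, component, opened) for every repair not yet closed."""
    return conn.execute(
        "SELECT node, component, opened FROM repairs "
        "WHERE closed IS NULL ORDER BY opened"
    ).fetchall()
```

Monitoring results could later go into the same database as extra
tables keyed on the node name.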


--Shane Canon


> I just realized that this question would be better asked of the Big Brother
> people, but maybe you have some comments about it too. So:
> 
> We have 2000 computing nodes and 96 monitoring computers. We may
> end up with 96 different Beowulf clusters, each having about 20 PCs,
> but you never know (no decisions yet on that matter) ;) I was just
> wondering whether it is reasonable or smart to monitor these master
> nodes with Big Brother? Or are there even ready-made shell scripts
> for that?
> 
> I was thinking of something like this: a script runs every once in a
> while on each master node, gathering the status of each of its slave
> nodes. That data is then sent to the Big Brother server whenever it is
> asked for. So every master would be running a BB client.
> 
> Is there any sense doing things like that?
> 
> Thanks,
> Olli Laaksonen
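
The per-master summary described above could be sketched roughly as
follows.  Each master reduces the up/down state of its slave nodes to
one color plus a short text line.  The thresholds and function name
are assumptions for illustration, and handing the result to the BB
client is deliberately left out because it depends on the local setup.

```python
def summarize(node_up, yellow_down=1, red_down=3):
    """Collapse per-node up/down flags into one (color, text) status.

    node_up maps a node name to True (up) or False (down).
    yellow_down and red_down are hypothetical thresholds: the number of
    down nodes at which the master reports yellow or red.
    """
    down = sorted(name for name, up in node_up.items() if not up)
    if len(down) >= red_down:
        color = "red"
    elif len(down) >= yellow_down:
        color = "yellow"
    else:
        color = "green"
    text = "%d/%d nodes up" % (len(node_up) - len(down), len(node_up))
    if down:
        text += "; down: " + ", ".join(down)
    return color, text
```

With about 20 slaves per master, the monitoring server would only have
to look at 96 summary lines rather than 2000 individual hosts.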


