[Beowulf] using Nagios to monitor compute nodes: NRPE vs check_by_ssh

Mon Dec 22 17:28:51 PST 2008

I just installed Nagios to try to monitor my 256 compute nodes
centrally. It seems to work like a charm for all the public services
(ping, ssh, etc.), but now I am getting more ambitious and want to
monitor the private services too (disk usage, process loads,
torque, pbs, etc.).

I am unsure whether to (1) use the NRPE plugin (seems like a
pain to deploy onto all 256 nodes) or (2) go the check_by_ssh
route. (I already have passwordless logins from master-nodes to
slave-nodes.)

I'd prefer (2) because it is more secure and seems easier to deploy, but
I'm a bit afraid that it will overtax my central server.
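
To put a rough number on that worry, here is a back-of-envelope sketch in
Python. Only the 256 nodes come from the setup above. The other figures are
assumptions for illustration: one check per private service listed (four
in total), a 5-minute check interval, and every check_by_ssh check opening
its own SSH session from the central server.

```python
# Back-of-envelope estimate of the SSH load on the central Nagios server.
NODES = 256                # from the post
CHECKS_PER_NODE = 4        # assumption: disk usage, process load, torque, pbs
CHECK_INTERVAL_S = 5 * 60  # assumption: each check is scheduled every 5 minutes


def ssh_sessions_per_second(nodes, checks_per_node, interval_s):
    """Average rate of new SSH sessions if every check opens its own session."""
    return nodes * checks_per_node / interval_s


total_checks = NODES * CHECKS_PER_NODE
rate = ssh_sessions_per_second(NODES, CHECKS_PER_NODE, CHECK_INTERVAL_S)
print(f"{total_checks} checks per cycle, about {rate:.2f} new SSH sessions/s")
```

Under those assumptions that is 1024 checks per cycle, or a bit over three
new SSH sessions per second on average. Whether that is too much depends on
the server, so the numbers are worth replacing with real ones.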

Any suggestions? Are other users using Nagios here?

--
Rahul