[Beowulf] Monitoring and Metrics

Sat Oct 7 22:19:43 PDT 2017

> On 10/7/2017 8:21 AM, Josh Catana wrote:
>
> This may have been brought up in the past, but I couldn't find much in my
> message archive.
> What are people using for HPC cluster monitoring and metrics lately? I've
> been low on time to add features to my home grown solution and looking at
> some OTS products.
> I'm looking for something that can do monitoring, alert on condition,
> broken hardware, etc.
> Also something that does system resource utilization metrics. If it has a
> plug-in for a scheduling system like PBS where I can correlate a job ID to
> the metrics of the systems it is currently running on or previously ran on
> at the time, that would be an amazing plus.
> Any of you beowulfers have any suggestions?
>
>
We use XDMoD and Zabbix for per-machine monitoring, and Logwatch as well,
but not as comprehensively.

We also tried Grafana and InfluxDB with this plugin (
http://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf ), but we
didn't find it as useful as we would have liked. It's a great plugin; we
just didn't need it.

cheers
L.

------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together."

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857