[Beowulf] Monitoring and Metrics
benson.muite at ut.ee
Sun Oct 8 02:24:16 PDT 2017
May also be of interest:
JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis
Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev,
Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy
On 10/07/2017 04:13 PM, Paul Edmon wrote:
> So for general monitoring of the cluster usage we use:
> and pipe to Grafana. We also use XDMoD:
> As for specific node alerting, we use the old standby of Nagios.
> -Paul Edmon-
> On 10/7/2017 8:21 AM, Josh Catana wrote:
>> This may have been brought up in the past, but I couldn't find much in
>> my message archive.
>> What are people using for HPC cluster monitoring and metrics lately?
>> I've been low on time to add features to my home-grown solution and
>> am looking at some OTS products.
>> I'm looking for something that can do monitoring, alert on condition,
>> broken hardware, etc.
>> Also something that does system resource utilization metrics. If it
>> has a plug-in for a scheduling system like PBS where I can correlate a
>> job ID to the metrics of the systems it is currently running on or
>> previously ran on at the time, that would be an amazing plus.
>> Any of you beowulfers have any suggestions?
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Research Fellow of Distributed Systems
Institute of Computer Science
University of Tartu
J. Liivi 2, 50409