[Beowulf] Monitoring and Metrics

Sun Oct 8 02:24:16 PDT 2017

May also be of interest:

JobDigest – Detailed System Monitoring-Based Supercomputer Application 
Behavior Analysis

Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev, 
Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy

http://russianscdays.org/files/pdf17/185.pdf

On 10/07/2017 04:13 PM, Paul Edmon wrote:
> So for general monitoring of the cluster usage we use:
> 
> https://github.com/fasrc/slurm-diamond-collector
> 
> and pipe to Grafana. We also use XDMod:
> 
> http://open.xdmod.org/7.0/index.html
> 
> As for specific node alerting, we use the old standby of Nagios.
> 
> -Paul Edmon-
> 
> 
> On 10/7/2017 8:21 AM, Josh Catana wrote:
>> This may have been brought up in the past, but I couldn't find much in 
>> my message archive.
>> What are people using for HPC cluster monitoring and metrics lately? 
>> I've been low on time to add features to my home-grown solution and 
>> am looking at some off-the-shelf (OTS) products.
>> I'm looking for something that can do monitoring, alert on condition, 
>> broken hardware, etc.
>> Also something that does system resource utilization metrics. If it 
>> has a plug-in for a scheduling system like PBS where I can correlate a 
>> job ID to the metrics of the systems it is currently running on, or 
>> ran on at the time, that would be an amazing plus.
>> Any of you beowulfers have any suggestions?
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

----
Hajussüsteemide Teadur
Arvutiteaduse Instituut
Tartu Ülikool
J. Liivi 2, 50409
Tartu
http://kodu.ut.ee/~benson
----
Research Fellow of Distributed Systems
Institute of Computer Science
University of Tartu
J. Liivi 2 50409
Tartu, Estonia
http://kodu.ut.ee/~benson