<div dir="ltr">UGE is used on clusters with thousands of nodes, and health checks are done via load sensors, an SGE/UGE feature. However, I am not aware of any public repo of shared health checks. As for overheating, in one cluster it was handled at the BIOS/firmware level by asking the vendor to set thresholds at which the machine shuts itself off. The event is logged in syslog.</div>
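<div dir="ltr">To give an idea of what a temperature load sensor might look like, here is a minimal sketch in Python. It is not taken from any production cluster. It assumes the usual Grid Engine load sensor convention (the execution daemon writes a line to the script's stdin whenever it wants a report, the script answers with a begin/end block of host:name:value lines, and "quit" ends the loop), and it assumes temperatures are readable from the kernel hwmon sysfs files in millidegrees Celsius. The complex name max_temp_c is hypothetical, and whether an empty begin/end report is acceptable when no sensor is readable is also an assumption.</div>

```python
import glob
import socket
import sys

# Assumption: hwmon sysfs files report temperature in millidegrees Celsius.
HWMON_PATTERN = "/sys/class/hwmon/hwmon*/temp*_input"
# Hypothetical complex name; it would have to be defined in the cluster's
# complex configuration before the scheduler could use it.
COMPLEX_NAME = "max_temp_c"


def read_max_temp_c(pattern=HWMON_PATTERN):
    """Return the hottest reading in degrees C, or None if nothing is readable."""
    readings = []
    for path in glob.glob(pattern):
        try:
            with open(path) as fh:
                readings.append(int(fh.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # skip sensors that are unreadable or report garbage
    return max(readings) if readings else None


def format_report(host, value, name=COMPLEX_NAME):
    """Build one begin/end report block; empty if there is no reading."""
    lines = ["begin"]
    if value is not None:
        lines.append("{}:{}:{:.1f}".format(host, name, value))
    lines.append("end")
    return "\n".join(lines) + "\n"


def main(stdin=sys.stdin, stdout=sys.stdout, pattern=HWMON_PATTERN, host=None):
    """Answer each request line with a report until EOF or "quit"."""
    # Assumption: the local hostname matches the name the scheduler knows.
    host = host or socket.gethostname()
    while True:
        line = stdin.readline()
        if not line or line.strip() == "quit":
            break
        stdout.write(format_report(host, read_max_temp_c(pattern)))
        stdout.flush()  # the daemon reads from a pipe, so flush every report
```

<div dir="ltr">To use it as an actual sensor, the script would end with a call to main() and be registered through the load_sensor setting in the host or global configuration. The hostname format and the complex definition would need to match what your cluster expects.</div>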

<div class="gmail_extra"><br><br><div class="gmail_quote">On 11 January 2014 00:22, Adam DeConinck <span dir="ltr">&lt;<a href="mailto:ajdecon@ajdecon.org" target="_blank">ajdecon@ajdecon.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr"><div><div><div><div><div><div>Hi Reza,<br><br></div><div>The "common stack" seems to vary depending on what industry you're looking at. For example, Grid Engine seems to be a really popular job scheduler in bioinformatics, even though I get the impression that it's on the way out in a lot of other industries.<br>


<br></div><div>I think most cluster management tools are fairly mature right now. Some are more actively developed than others, but I don't think "what's hot" is necessarily a good way to choose your tools. <br>


<br>More important is whether someone on your team is familiar with those tools, or with the languages they're written in; or whether you can get support easily if you don't have expertise yourself.<br></div><div>


<br></div><div>For what it's worth, my current "favorites" for scheduling and monitoring include:<br><br></div>* Job scheduler: SLURM<br></div>* Light-weight health checks between jobs: Warewulf NHC<br></div>


* Detailed performance monitoring: Ganglia<br></div><br></div>Neither NHC nor Ganglia does temperature monitoring out of the box (last I checked), but both are really easy to extend with something as simple as a bash script.<br>


<br></div>Adam<br><br></div><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div class="h5">On Fri, Jan 10, 2014 at 12:36 PM, reza azimi <span dir="ltr">&lt;<a href="mailto:reza.c.azimi@gmail.com" target="_blank">reza.c.azimi@gmail.com</a>&gt;</span> wrote:<br>


</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">Hello guys, <div><br></div><div>I'm looking for a state-of-the-art job scheduler and health monitoring tool for my Beowulf cluster. My research turned up so many options that I'm now confused. Can you help, or recommend the ones that are popular and actually used in industry? </div>


<div>I have the lm-sensors package on my servers and want a health monitoring program that records temperature as well; all the ones I found mainly record resource utilization. </div><div>Our workloads are mainly MPI-based benchmarks, and we want to test some Hadoop benchmarks in the future.</div>


<div><br></div><div><br></div><div>Regards</div><span><font color="#888888"><div>Reza</div></font></span></div>

<br></div></div>_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

<br></blockquote></div><br></div>


<br></blockquote></div><br></div>