[Beowulf] job scheduler and health monitoring system

Sat Jan 11 01:14:17 PST 2014

UGE runs on clusters with thousands of nodes. Health checks are done via
load sensors, an SGE/UGE feature. However, I am not aware of any public
repo of shared health checks. As for overheating, in one cluster it was
handled at the BIOS/firmware level: we asked the vendor to set temperature
thresholds at which the machine shuts itself off, and the event is logged
in syslog.
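
For the temperature question, a load sensor along the following lines
might be a starting point. This is only a rough sketch, not something
taken from a production cluster. It follows the usual load sensor
convention: wait for a line on stdin, answer with a begin / host:name:value
/ end block, and stop on "quit". The complex name "max_temp", the function
names, and the choice of reading the kernel hwmon files under /sys (rather
than parsing lm-sensors output) are all assumptions made for illustration.

```python
"""Sketch of a Grid Engine style load sensor reporting the hottest sensor."""
import glob
import socket
import sys

# Assumption: the kernel hwmon sysfs files are present on the node.
HWMON_PATTERN = "/sys/class/hwmon/hwmon*/temp*_input"


def read_max_temp_celsius(pattern=HWMON_PATTERN):
    """Return the hottest reading in degrees Celsius, or None if unreadable."""
    readings = []
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                # hwmon temp*_input files hold millidegrees Celsius.
                readings.append(int(handle.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # Skip sensors that vanish or return garbage.
    return max(readings) if readings else None


def format_report(host, complex_name, value):
    """Build one report block: begin, host:name:value, end."""
    return "begin\n{}:{}:{:.1f}\nend\n".format(host, complex_name, value)


def run_load_sensor(stdin=None, stdout=None, host=None,
                    complex_name="max_temp", pattern=HWMON_PATTERN):
    """Answer each input line with a report until 'quit' or end of input."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # Assumption: this name matches the host name the scheduler knows.
    host = socket.gethostname() if host is None else host
    for line in stdin:
        if line.strip() == "quit":
            break
        temp = read_max_temp_celsius(pattern)
        if temp is None:
            # Assumption: an empty report is acceptable when nothing is readable.
            stdout.write("begin\nend\n")
        else:
            stdout.write(format_report(host, complex_name, temp))
        stdout.flush()  # The caller reads the report as soon as it is written.
```

To use it as an actual load sensor, call run_load_sensor() at the bottom of
the script. The "max_temp" complex would also have to be defined in the
scheduler configuration and the script registered as a load sensor; check
the exact steps against the documentation for your SGE/UGE version.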

On 11 January 2014 00:22, Adam DeConinck <ajdecon at ajdecon.org> wrote:

> Hi Reza,
>
> The "common stack" seems to vary depending on what industry you're looking
> at. For example, Grid Engine seems to be a really popular job scheduler in
> bioinformatics, even though I get the impression that it's on the way out
> in a lot of other industries.
>
> I think most cluster management tools are fairly mature right now. Some
> are more actively developed than others, but I don't think "what's hot" is
> necessarily a good way to choose your tools.
>
> More important is whether someone on your team is familiar with those
> tools, or with the languages they're written in; or whether you can get
> support easily if you don't have expertise yourself.
>
> For what it's worth, my current "favorites" for scheduling and monitoring
> include:
>
> * Job scheduler: SLURM
> * Light-weight health checks between jobs: Warewulf NHC
> * Detailed performance monitoring: Ganglia
>
> Neither NHC nor Ganglia does temperature monitoring out of the box (last I
> checked), but both are really easy to extend with something as simple as
> bash scripts.
>
> Adam
>
>
>
> On Fri, Jan 10, 2014 at 12:36 PM, reza azimi <reza.c.azimi at gmail.com> wrote:
>
>> Hello guys,
>>
>> I'm looking for a state-of-the-art job scheduler and health monitoring
>> system for my Beowulf cluster. My research turned up so many of them that
>> I am confused. Can you recommend the ones that are popular and used in
>> industry?
>> I have the lm-sensors package on my servers and want a health monitoring
>> program that records temperature as well; everything I found mainly
>> records resource utilization.
>> Our workloads are mainly MPI-based benchmarks, and we want to test some
>> Hadoop benchmarks in the future.
>>
>>
>> Regards
>> Reza
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>