[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.


Rahul Nabar rpnabar at gmail.com
Tue Aug 31 07:51:20 PDT 2010


My scheduler, Torque, flags compute nodes as "busy" when the load rises
above a threshold, the "ideal load". On my 8-core compute nodes
ideal_load is set to 8, but I am wondering whether this is
appropriate.

$max_load 9.0
$ideal_load 8.0

I do understand the "ideal load = # of cores" heuristic, but in at least
30% of our jobs (if not more) I find the load average greater than
8, sometimes even in the 9-10 range. Does this mean there is
something wrong, or do I take this to be the "happy" scenario for HPC,
i.e. not only are all CPUs busy but the queue of processes waiting
for their CPU slice is also relatively full? After all, an
"under-loaded" HPC node is a waste of an expensive resource.

On the other hand, if there truly were something wrong with a node[*]
and I were to use a high load average as one of the signs of
impending trouble, what would be a good threshold? Above what
load average on a compute node do people actually get worried? It
would make sense to set PBS's "busy" threshold to that limit instead
of just "8".

I'm ignoring the 1/5/15 min load average distinction. I'm assuming
Torque is using the most appropriate one!

[*] e.g. a runaway process, an infinite loop in user code, multiple jobs
accidentally assigned to the same node, etc.
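
To make the question concrete, here is a minimal sketch of the kind of
warning check I have in mind. It is not Torque's own logic: the function
name classify_load and the labels "ok", "busy" and "overloaded" are made
up for illustration, and the default thresholds simply reuse the 8.0 and
9.0 values from my config above. It picks the 5-minute load average,
which is an arbitrary choice on my part.

```python
import os


def classify_load(load, ideal_load=8.0, max_load=9.0):
    """Classify a load average against two thresholds.

    Hypothetical helper for illustration only. The defaults mirror the
    $ideal_load / $max_load values quoted above for an 8-core node.
    """
    if ideal_load > max_load:
        raise ValueError("ideal_load must not exceed max_load")
    if load > max_load:
        return "overloaded"
    if load > ideal_load:
        return "busy"
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages
    # (available on Unix-like systems).
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    status = classify_load(five_min)
    print(f"cores={cores} load5={five_min:.2f} status={status}")
```

The open question is what max_load should be so that "overloaded" only
fires when something is actually wrong, rather than on every healthy
job that happens to sit at 9-10.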

-- 
Rahul


