[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
mm at yuhu.biz
Wed Sep 1 03:15:55 PDT 2010
On Wednesday 01 September 2010 11:47:29 Reuti wrote:
> On 01.09.2010 at 09:34, Christopher Samuel wrote:
> > On 01/09/10 01:58, Reuti wrote:
> >> With recent kernels, (kernel) processes in D state also
> >> count as running.
> > I wouldn't say recent, that goes back as far as I can
> > remember.
> > For instance I've seen RHEL3 (2.4.x - sort of) NFS servers
> > with load averages in the 80's where they were run with a lot
> > of nfsd's that were blocked waiting for I/O due to ext3.
> My impression was always (as there is a similar setting for the
> load_threshold in OGE) that it should limit the number of jobs on a big
> SMP machine when you oversubscribe intentionally, as not all parallel jobs
> really use all the CPU power over their lifetime (maybe such a
> machine was even operated without any NFS). Then allowing e.g. 72 slots for
> jobs on a 60-core machine might get the most out of it with a load near 100%.
> Well, now that newer CPUs have 12 cores and are assembled into 24- or 48-core
> machines, such a setting would be useful again. Maybe the load sensor
> should honor only the scheduled jobs' load.
> -- Reuti
> > cheers!
> > Chris
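As a rough illustration of the point quoted above, that tasks in D state add to the load figure alongside runnable ones, here is a minimal Python sketch that tallies process states from /proc/<pid>/stat lines. The function names are made up for this example, and it only looks at the main thread of each process, so it approximates what the kernel counts rather than reproducing it.

```python
from collections import Counter
import glob


def state_from_stat(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    after_comm = stat_line.rpartition(")")[2]
    return after_comm.split()[0]


def count_states(stat_lines):
    """Tally process states (R = running/runnable, D = uninterruptible sleep, ...)."""
    return Counter(state_from_stat(line) for line in stat_lines)


def read_stat_lines():
    """Read the stat line of every process currently visible under /proc (Linux only)."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
    return lines
```

On a node like the NFS server described above, count_states(read_stat_lines()) would be expected to show a large "D" count and a small "R" count, which is a hint that the load comes from blocked I/O rather than from CPU contention.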
I believe that the load threshold should be set depending on the type of jobs
you run on your compute nodes.
The load is not always tied only to CPU and disk/network I/O;
sometimes the jobs do a lot of in-memory changes, which carry more weight than
the actual CPU or disk/network I/O. So, for example, a load average of 15 can
also be considered normal load, as long as the system is still responsive
and the jobs' run times don't degrade.
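
To make the idea of a per-job-type threshold concrete, here is a minimal Python sketch of a warning check that normalizes the load average by core count and picks its limits from the job type. The profile names (cpu_bound, memory_heavy) and all threshold values are assumptions for illustration only, not recommendations; they would have to be tuned against how the jobs on a given cluster actually behave.

```python
import os

# Hypothetical per-core (warning, critical) thresholds for each job type.
# These numbers are placeholders to be tuned per site.
THRESHOLDS = {
    "cpu_bound": (1.2, 2.0),
    "memory_heavy": (2.0, 4.0),
}


def load_status(load, cores, job_type="cpu_bound"):
    """Classify a load average relative to the core count and job type."""
    warn, crit = THRESHOLDS[job_type]
    per_core = load / cores
    if per_core >= crit:
        return "CRITICAL"
    if per_core >= warn:
        return "WARNING"
    return "OK"


def node_status(job_type="cpu_bound"):
    """Classify the local node using its 5-minute load average."""
    _one_min, five_min, _fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    return load_status(five_min, cores, job_type)
```

With these placeholder values, a load of 15 on an 8-core node is flagged as WARNING under the cpu_bound profile but is still OK under memory_heavy, which is the kind of distinction described above.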