[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
Reuti reuti at Staff.Uni-Marburg.DE
Tue Aug 31 08:58:37 PDT 2010
On 31.08.2010 at 16:51, Rahul Nabar wrote:

> My scheduler, Torque, flags compute nodes as "busy" when the load gets
> above a threshold "ideal load". My settings on 8-core compute nodes
> have this ideal_load set to 8 but I am wondering if this is
> appropriate or not?
>
> $max_load 9.0
> $ideal_load 8.0
>
> I do understand the "ideal load = # of cores" heuristic but in at least

Yep.

> 30% of our jobs (if not more) I find the load average greater than
> 8. Sometimes even in the 9-10 range. But does this mean there is
> something wrong or do I take this to be the "happy" scenario for HPC:
> i.e. not only are all CPUs busy but the pipeline of processes waiting
> for their CPU slice is also relatively full. After all, an
> "under-loaded" HPC node is a waste of an expensive resource?

With recent kernels, processes (including kernel threads) in D state, i.e. uninterruptible sleep, also count toward the load average, just like running processes. Hence the load appears higher than you would expect from adding up only the running processes.

-- Reuti

> On the other hand, if there truly were something wrong with a node[*]
> and I was to use a high load average as one of the signs of
> impending trouble what would be a good threshold? Above what
> load average on a compute node do people actually get worried? It
> makes sense to set PBS's default "busy" warning to that limit instead
> of just "8".
>
> I'm ignoring the 1/5/15 min load average distinction. I'm assuming
> Torque is using the most appropriate one!
>
> *e.g. runaway process, infinite loop in user code, multiple jobs
> accidentally assigned to some node etc.
>
> --
> Rahul
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
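
To make the D-state point concrete, here is a minimal Python sketch (not part of Torque, and not a recommendation from this thread) that separates the load average into tasks in R state and tasks in D state, then compares the load against the core count. The function names and the warn_factor of 1.5 are hypothetical, chosen only for illustration; the thread itself does not settle on a threshold. It reads the Linux /proc/loadavg and /proc/<pid>/stat files, which report one state per process rather than per thread, so the counts are only approximate.

```python
import glob

def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])

def count_states(stat_lines):
    """Count processes by state letter, given lines from /proc/<pid>/stat."""
    counts = {}
    for line in stat_lines:
        # The command name is in parentheses and may contain spaces,
        # so take the first field after the last ')'.
        state = line.rsplit(")", 1)[1].split()[0]
        counts[state] = counts.get(state, 0) + 1
    return counts

def classify(load, cores, warn_factor=1.5):
    """Hypothetical policy: 'ok' up to one task per core, 'busy' up to
    warn_factor * cores, 'warn' above that. warn_factor is an assumption."""
    if load <= cores:
        return "ok"
    if load <= cores * warn_factor:
        return "busy"
    return "warn"

def snapshot(cores):
    """Read the live values on a Linux node (not called automatically)."""
    with open("/proc/loadavg") as f:
        one_min, _, _ = parse_loadavg(f.read())
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            pass  # the process exited between listing and reading
    states = count_states(lines)
    return one_min, states.get("R", 0), states.get("D", 0), classify(one_min, cores)
```

Calling snapshot(8) on an 8-core node returns the 1-minute load, the R and D counts, and the label. If the load sits at 9-10 while the R count stays near 8 and the remainder is in D state, the extra load comes from tasks waiting on I/O rather than from oversubscribed CPUs.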
