[Beowulf] user stats on clusters

Mark Hahn hahn at mcmaster.ca
Fri Feb 27 14:25:31 PST 2009


> A general question: What're folks using for stats, including queue wait, 
> execution times, hours/month?  Any suggestions?

we run ~20 clusters, some large, and collect all the stats into a single db,
with a custom web interface, etc.  users and PIs can see tables and
graphs of usage.  we don't by default do anything with per-job pend times,
though the data is there.  we also don't do anything with hours/month - the closest
would be graphs which show ncpus across time (i.e., over the past 2 weeks,
the y-axis would probably be cpu-hours-per-hour, summed over all jobs,
but possibly partitioned by user/cluster/queue/etc).
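
roughly, the hourly binning can be done like the sketch below - this is just
an illustration, and the record layout (user, start, end, ncpus) is an
assumption for the example, not our actual schema:

    from collections import defaultdict

    HOUR = 3600

    def bin_cpu_hours(jobs, t0, t1):
        """sum cpu-hours per hour bin, keyed by (hour, user).

        jobs is an iterable of (user, start, end, ncpus) tuples in epoch
        seconds; this layout is hypothetical, not the real record format.
        """
        bins = defaultdict(float)
        for user, start, end, ncpus in jobs:
            # clip the job to the window we're graphing
            s, e = max(start, t0), min(end, t1)
            h = (s // HOUR) * HOUR
            while h < e:
                # seconds of this job that fall inside the hour [h, h+HOUR)
                overlap = min(e, h + HOUR) - max(s, h)
                if overlap > 0:
                    bins[(h, user)] += ncpus * overlap / HOUR
                h += HOUR
        return bins

summing the same bins over cluster or queue instead of user gives the other
partitionings mentioned above.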

I don't know how much this code/etc would be of interest to anyone else.
I, at least, have not talked to other cluster people who have quite the
same take on these issues.  for instance, each of our jobs is stored with
user (we have a single ldap), command, cluster, queue, flags, pend time,
and seconds allocated/utime/stime.  users are either sponsors (PIs) or sponsored,
and there's another level of ID intended to harmonize with a pan-Canadian
"people" database.  the current database receives job info from a variety
of schedulers - RMS on our original Alphas, LSF, my open-source minimalist
scheduler, torque/maui and SGE.  having a comprehensive DB like this has
led to some interesting optimizations in how we ship batches of job
records around (cron, ssh, rsync, etc) and how we bin usage to make
generating dynamic graphs feasible.
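
for concreteness, a job-record table along these lines would hold the fields
listed above - table and column names here are guesses for illustration, not
the real schema, and sqlite is just a stand-in for whatever db you use:

    import sqlite3

    conn = sqlite3.connect("jobs.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            jobid      INTEGER PRIMARY KEY,
            username   TEXT,     -- from the single ldap
            sponsor    TEXT,     -- the sponsoring PI, if the user is sponsored
            command    TEXT,
            cluster    TEXT,
            queue      TEXT,
            flags      TEXT,
            pend_secs  INTEGER,  -- per-job pend (queue wait) time
            alloc_secs INTEGER,  -- seconds allocated
            utime_secs INTEGER,
            stime_secs INTEGER,
            submitted  INTEGER   -- epoch timestamp
        )
    """)
    conn.commit()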

if you're OK with a per-cluster interface, aren't nagios and similar 
packages pretty interchangeable?
