[Beowulf] Cluster Metrics? (Upper management view)

Fri Aug 20 23:06:28 PDT 2010

> I think measuring a cluster's success based on the number of jobs run
> or CPUs used is a bad measure of true success.  I would be more
> inclined to consider a cluster a success by speaking with the people
> who use it and finding out not only whether they can use it effectively
> but also what new science having the cluster enables for them.

now try that with a large user-base ;)

I think there are two broad categories of cluster: dedicated and shared.
dedicated clusters are easy: limited number of codes, users, etc.
straightforward metrics are appropriate, such as pend time (perhaps 
as a fraction of wallclock), job fail rates, fraction-of-peak measures.
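
as a rough sketch of the first two metrics, assuming you can export
per-job submit/start/end times and an exit status from your scheduler's
accounting logs (the Job record and its field names here are made up
for illustration, not any scheduler's actual format):

```python
from dataclasses import dataclass

@dataclass
class Job:
    # hypothetical record; times are seconds since some common epoch
    submit: float
    start: float
    end: float
    exit_ok: bool

def pend_fraction(jobs):
    """total pend (queue) time divided by total wallclock (run) time."""
    pend = sum(j.start - j.submit for j in jobs)
    run = sum(j.end - j.start for j in jobs)
    return pend / run if run > 0 else float("nan")

def fail_rate(jobs):
    """fraction of jobs that did not exit cleanly."""
    if not jobs:
        return float("nan")
    return sum(not j.exit_ok for j in jobs) / len(jobs)

jobs = [
    Job(submit=0, start=60, end=660, exit_ok=True),
    Job(submit=0, start=300, end=900, exit_ok=False),
    Job(submit=100, start=100, end=400, exit_ok=True),
]
print(pend_fraction(jobs))  # 360 s pending over 1500 s running
print(fail_rate(jobs))      # 1 failure out of 3 jobs
```

fraction-of-peak needs per-code flop counts, so it's left out here.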

past this, things are harder and fuzzier.  we try pretty hard to get 
research outcomes from our users (lit citations, grants, grad student
and postdoc counts.)  we try other metrics too, such as looking for researchers
who get an account, generate minimal usage, then stop ("frustrated").
for bigger, shared facilities, the simple metrics become less useful - 
for instance, pend:wallclock is meaningful as long as cluster contention
doesn't "shape" user behavior.  once users start reacting to contention
(by submitting fewer jobs, or maybe more), the metric's spoiled.
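
one possible way to flag "frustrated" accounts, as a sketch only: the
thresholds (10 cpu-hours, 90 idle days) and the usage layout are
assumptions for illustration, and counting never-ran accounts as
frustrated is a judgment call too.

```python
from datetime import date, timedelta

def frustrated_users(usage, today, max_cpu_hours=10.0, idle_days=90):
    """usage maps user -> list of (job_date, cpu_hours) records.

    a user is flagged if their total usage is small AND they have
    had no jobs within the last idle_days (or never ran anything).
    """
    cutoff = today - timedelta(days=idle_days)
    flagged = []
    for user, records in usage.items():
        total = sum(hours for _, hours in records)
        last = max((d for d, _ in records), default=None)
        stopped = last is None or last < cutoff
        if total <= max_cpu_hours and stopped:
            flagged.append(user)
    return sorted(flagged)

usage = {
    "alice": [(date(2010, 1, 5), 2.0), (date(2010, 1, 9), 1.5)],    # tried, then quit
    "bob":   [(date(2010, 1, 5), 500.0), (date(2010, 8, 1), 800.0)],  # heavy user
    "carol": [(date(2010, 8, 10), 0.5)],                            # new, still active
}
print(frustrated_users(usage, today=date(2010, 8, 20)))
```

the list is only a starting point: the useful part is then asking those
people what went wrong.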

> the only thing i find most of the metrics below really useful for is
> figuring out whether or not we need a bigger cluster.  which i guess

it's a little hard to imagine a case where metrics wouldn't call for 
a larger cluster - does anyone really have persistently underutilized 
clusters?

> I also think you need to ask the "business" people by what measure they
> would consider a cluster a worthwhile investment; it doesn't sound
> as if you have that from your email.

my guess is that suits should be talked to about opportunity cost,
and not given a bunch of stats about utilization.  that means you need 
to get some info from users about what they're doing, and also 
to figure out whether there's more they could do.  and really, talking
to the users is important to do anyway.

>> clusters?  Upper management is asking for us to define and provide
>> some sort of "numbers" which can be used to gauge the success of our
>> cluster project.

take a look at your cluster stats: do you have different groups with 
bursty activity, but which interleaves on the cluster?  that's obviously
better than multiple groups each having (probably smaller) clusters 
with lower utilization over time...
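
a toy comparison of the two cases, assuming every cluster is sized to
its own peak demand and nothing ever queues (both big simplifications),
with made-up per-group demand in cores per time slot:

```python
def compare(groups):
    """groups maps group name -> list of cores demanded per time slot.

    returns capacity and utilization for one-cluster-per-group versus
    a single shared cluster, each sized to its peak demand.
    """
    slots = len(next(iter(groups.values())))
    combined = [sum(d[t] for d in groups.values()) for t in range(slots)]
    work = sum(combined)                            # total core-slots used
    separate_cores = sum(max(d) for d in groups.values())
    shared_cores = max(combined)
    return {
        "separate_cores": separate_cores,
        "shared_cores": shared_cores,
        "separate_util": work / (separate_cores * slots),
        "shared_util": work / (shared_cores * slots),
    }

# hypothetical bursty groups whose bursts happen not to overlap much
groups = {
    "chem": [100, 0, 0, 100],
    "bio":  [0, 100, 0, 0],
    "phys": [0, 0, 100, 0],
}
result = compare(groups)
print(result)
```

with real accounting data the bursts will overlap some, so the gap is
smaller than in this contrived example, but the same calculation shows
how much smaller.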

>> - 90/95th percentile wait time for jobs in various queues.  Is smaller
>> better meaning the jobs don't wait long and users are happy?  Is

wait time is kind of tricky.  if you have low wait, then either the cluster
is underutilized, or it's magically rightsized (perhaps a perfectly steady,
predictable workload).  once you have contention, the question is why - 
is there a user who queues 10k jobs every monday?  do users submit chained
(dependent) jobs, where the second is counted as waiting while the first
runs?  do you have 
fairshare turned on, or any kind of static limits or partitioning?
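
to illustrate the chained-job effect, here's a sketch that computes a
nearest-rank percentile of wait time two ways: from submit time, and
from the time the job actually became eligible to run.  the job dicts
and the "dep_end" field are hypothetical, not any scheduler's format.

```python
import math

def percentile(values, p):
    """nearest-rank percentile of a non-empty list, integer p in 1..100."""
    s = sorted(values)
    k = math.ceil(p * len(s) / 100)   # 1-based rank
    return s[max(k, 1) - 1]

def waits(jobs, use_eligible=True):
    """wait per job; optionally measured from when its dependency finished."""
    out = []
    for j in jobs:
        t0 = j["submit"]
        dep_end = j.get("dep_end")
        if use_eligible and dep_end is not None:
            t0 = max(t0, dep_end)     # couldn't have started before this
        out.append(j["start"] - t0)
    return out

jobs = [
    {"submit": 0, "start": 10},
    {"submit": 0, "start": 20},
    {"submit": 0, "start": 30},
    {"submit": 0, "start": 1000, "dep_end": 990},  # chained behind another job
]
naive_p95 = percentile(waits(jobs, use_eligible=False), 95)
eligible_p95 = percentile(waits(jobs), 95)
print(naive_p95, eligible_p95)
```

one chained job is enough to move the naive 95th percentile a long way,
which is the sort of thing to check before reporting the number upward.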

>> - Availability during scheduled hours (ignoring scheduled maintenance
>> times).  Common metric, but how do people actually measure/compute
>> this?  What about down nodes?  Some scheduled percentage (5%?) assumed
>> down?

I don't think it makes sense to obsess about this - yes, it's an easy number,
but it doesn't tell you much from the user's perspective.
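
if you do have to report it, one common-sense way (an assumption on my
part, not a standard) is to count node-hours, so that a few down nodes
show up as a small loss rather than being ignored or counted as a full
outage:

```python
def availability(node_count, scheduled_hours, unplanned_down_node_hours):
    """fraction of scheduled node-hours that were actually available.

    scheduled_hours should already exclude planned maintenance windows.
    """
    total = node_count * scheduled_hours
    if total <= 0:
        raise ValueError("no scheduled node-hours")
    return (total - unplanned_down_node_hours) / total

# hypothetical month: 100 nodes, 700 scheduled hours, 3500 node-hours lost
a = availability(100, 700, 3500)
print(a)
```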