Reliability analysis was RE: Windows HPC (@ Cornell)
James.P.Lux at jpl.nasa.gov
Wed Nov 6 16:05:16 PST 2002
>** Reliability. We ran a 256-processor Dell cluster with Windows 2000
>and collected all errors (OS, I/O, hardware) on a secure web site for 6
>months. MIT analyzed and independently verified the up-time--99.9986%.
Couldn't find the original email to respond to, but the above excerpt
prompts a question:
Has anyone running a cluster done some real reliability analysis and
published the data and analysis? Not necessarily peer reviewed; even a
good web page description would do.
For instance, Paul at Cornell has claimed roughly 99.999% reliability
or uptime, but hasn't provided any numerical backup for the assertion,
except to claim that MIT analyzed some unspecified set of data. Is Cornell
going to publish the data and details of the analysis? For instance, what
was the reliability model used? What statistical failure distribution was
assumed (exponential? Weibull?)? What's defined as "failure" or "up time"? I
searched the entire Cornell site using their search engine, and all I found
was a couple of marketing-speak presentations that didn't provide any
numerical backup for the assertions.
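
To show why the definitions matter, here is a rough sketch of how much downtime the quoted 99.9986% figure would imply. It assumes "6 months" means 182.5 days and shows two possible readings of the number (whole-cluster time versus processor-time summed over 256 processors); neither reading is confirmed by anything Cornell has published, and the function name is mine.

```python
def downtime_minutes(availability_pct, period_days, units=1):
    """Downtime (in unit-minutes) implied by an availability percentage.

    units=1 treats the cluster as a single system; units=256 would treat
    the figure as an average over 256 separately counted processors.
    """
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60 * units


PERIOD_DAYS = 182.5  # assumption: "6 months" taken as half a year

# Reading 1: the whole cluster counted as one system.
whole_cluster = downtime_minutes(99.9986, PERIOD_DAYS)

# Reading 2: downtime summed over 256 processors (assumed interpretation).
per_processor_total = downtime_minutes(99.9986, PERIOD_DAYS, units=256)

print(f"whole cluster:      {whole_cluster:.2f} minutes")
print(f"summed over 256:    {per_processor_total / 60:.1f} processor-hours")
```

Under the first reading the whole cluster was down for less than four minutes in six months, which is shorter than a typical reboot; under the second, the figure allows roughly 16 processor-hours of outage. The two readings describe very different systems, which is why the definition of "up time" needs to be stated.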
I think this would be a very useful thing as a point of discussion for
clusters in general, since the terms "high availability", "high
reliability", "MTBF" and so forth are bandied about pretty freely, without
any unambiguous definitions. There's a lot of literature and discussion on
performance and how to fairly evaluate it in terms of BogoMIPS, or GFLOPS,
or bisection bandwidth, etc., but not nearly as much on other aspects of
running a cluster.
I would think that a reasonably rigorous analysis would need to address
things like (re)boot time, mean time to repair, the difference between
"operating system up and ready" and "actually running user code", and so
forth. Maybe a good start would be to establish a common terminology for
things, and then we can argue/discuss how to boil down
measurements/predictions to a single "figure of merit".
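
As a starting point for that discussion, here is a minimal sketch of the textbook steady-state availability formula, A = MTBF / (MTBF + MTTR), applied with two different definitions of "repaired". All the numbers are hypothetical placeholders, not measurements from any real cluster, and the series model assumes independent, exponentially distributed failures where any single unit failing takes the job down.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is 'up'."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_mtbf(unit_mtbf_hours, n_units):
    """MTBF of a system that fails whenever any one of n_units fails.

    Valid only for independent units with exponential (constant-rate)
    failures; a Weibull model would not reduce to this simple division.
    """
    return unit_mtbf_hours / n_units


# Hypothetical inputs, chosen only for illustration.
UNIT_MTBF_HOURS = 50_000.0    # assumed MTBF of one unit
N_UNITS = 256                 # assumed number of units the job depends on
REBOOT_HOURS = 5 / 60         # assumed time until "OS up and ready"
JOB_RESTART_HOURS = 30 / 60   # assumed extra time until user code is running again

cluster_mtbf = series_mtbf(UNIT_MTBF_HOURS, N_UNITS)

# Definition 1: "up" means the operating system is up and ready.
a_os = availability(cluster_mtbf, REBOOT_HOURS)

# Definition 2: "up" means user code is actually running again.
a_user = availability(cluster_mtbf, REBOOT_HOURS + JOB_RESTART_HOURS)

print(f"cluster MTBF:              {cluster_mtbf:.1f} hours")
print(f"availability (OS ready):   {a_os:.5%}")
print(f"availability (user code):  {a_user:.5%}")
```

The same hardware gives a noticeably different "availability" depending on which repair time is counted, so any single figure of merit would have to say which definition it uses.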
RGB.. maybe another chapter for your book?