Reliability analysis was RE: Windows HPC (@ Cornell)
Jim Lux James.P.Lux at jpl.nasa.gov
Wed Nov 6 16:05:16 PST 2002
- Previous message: Linux Business For Sale
- Next message: Reliability analysis was RE: Windows HPC (@ Cornell)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> > >** Reliability. We ran a 256-processor Dell cluster with Windows 2000
>and collected all errors (OS, I/O, hardware) on a secure web site for 6
>months. MIT analyzed and independently verified the up-time--99.9986%.

Couldn't find the real email to respond to here, but the above excerpt captures it...

Has anyone running a cluster done some real reliability analysis and published the data and analysis? Not necessarily peer reviewed.. even a good web page description would do.

For instance, Paul at Cornell has claimed better than 99.999% reliability or uptime, but hasn't provided any numerical backup for the assertion, except to claim that MIT analyzed some unspecified set of data. Is Cornell going to publish the data and details of the analysis? For instance, what was the reliability model used? What failure statistical distribution was implied (Exponential? Weibull?) What's defined as "failure" or "up time"?

I searched the entire Cornell site using their search engine, and all I found was a couple of marketing-speak presentations that didn't provide any numerical backup for the assertions.

I think this would be a very useful thing as a point of discussion for clusters in general, since the terms "high availability", "high reliability", "MTBF" and so forth are bandied about pretty freely, without any unambiguous definitions. There's a lot of literature and discussion on performance and how to fairly evaluate it in terms of BogoMIPS, or GFlops, or bisection bandwidth, etc., but not nearly as much on other aspects of running a cluster.

I would think that a reasonably rigorous analysis would need to address things like (re)boot time, mean time to repair, the difference between "operating system up and ready" and "actually running user code", and so forth.

Maybe a good start would be to establish a common terminology for things, and then we can argue/discuss how to boil down measurements/predictions to a single "figure of merit".

RGB.. maybe another chapter for your book?
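
To make the definitional point concrete, here is a rough back-of-envelope sketch (not anything Cornell or MIT published) of what a quoted 99.9986% figure would imply under two different readings of "up time". The function names are made up for illustration, "6 months" is assumed to mean 182.5 days, and the MTBF/(MTBF + MTTR) formula is the usual steady-state availability definition, which may or may not be the model that was actually used.

```python
def downtime_hours(availability, period_hours, units=1):
    """Downtime implied by an availability fraction over a period.

    `units` is the number of independently counted items (1 for the
    cluster as a whole, 256 if every processor is counted separately).
    """
    return (1.0 - availability) * period_hours * units


def steady_state_availability(mtbf_hours, mttr_hours):
    """Textbook steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumption: "6 months" is taken as 182.5 days.
PERIOD_HOURS = 182.5 * 24
QUOTED_AVAILABILITY = 0.999986  # the 99.9986% from the excerpt

# Reading 1: the figure describes the cluster as a single unit.
cluster_level = downtime_hours(QUOTED_AVAILABILITY, PERIOD_HOURS)

# Reading 2: the figure is averaged over 256 processors counted separately.
processor_level = downtime_hours(QUOTED_AVAILABILITY, PERIOD_HOURS, units=256)

print(f"cluster-level reading:   {cluster_level * 60:.1f} minutes of downtime")
print(f"processor-level reading: {processor_level:.1f} processor-hours of downtime")
```

The same percentage gives answers that differ by a factor of 256 depending on what is being counted, and neither reading says anything about reboot time or whether user code was actually running, which is why the definitions need to come before the numbers.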
More information about the Beowulf mailing list
