Reliability analysis was RE: Windows HPC (@ Cornell)

Tim Wait waitt at
Thu Nov 7 15:34:47 PST 2002

One aspect I haven't seen mentioned in this thread, except for
Greg's oblique reference to Mosix, is that many (most?)
of our clusters run parallel apps. Regardless of HA, if you have
a node fail while running a parallel job, you have just blown your
(supposed) 5 nines away; in my experience, it takes the user O(12+ hours)
to restart the job. Is this thread deteriorating into HA rather than Beowulf?
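To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the ~12 hours a user needs to restart the job counts as downtime against a one-year window, and the helper names are made up for illustration:

```python
# Rough arithmetic only: compare the yearly downtime budget implied by
# "five nines" with a single failed parallel job that costs ~12 hours.
# Assumption: restart time counts as downtime; one non-leap year window.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_budget_hours(availability: float) -> float:
    """Hours of downtime per year permitted at the given availability."""
    return HOURS_PER_YEAR * (1.0 - availability)


def availability_after(lost_hours: float) -> float:
    """Effective yearly availability after losing `lost_hours`."""
    return 1.0 - lost_hours / HOURS_PER_YEAR


# Five nines allows only a few minutes of downtime per year.
five_nines_budget = downtime_budget_hours(0.99999)

# One 12-hour restart in a year already lands below three nines.
one_restart = availability_after(12.0)

if __name__ == "__main__":
    print(f"five-nines budget: {five_nines_budget * 60:.2f} minutes/year")
    print(f"availability after one 12 h restart: {one_restart:.5f}")
```

So a single node failure during a long parallel run eats well over a hundred years' worth of five-nines budget, whatever the HA setup does for the rest of the cluster.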

5 nines? Yeah, right ;)

Even those $50k hand-built Cray disks die.


Tim Wait       waitt at
SAIC - Advanced Systems Group
PO Box 41, Sumerduck VA 22742
Phone: 540-439-0193
