Reliability analysis was RE: Windows HPC (@ Cornell)

Greg Lindahl lindahl at keyresearch.com
Thu Nov 7 20:19:08 PST 2002


On Thu, Nov 07, 2002 at 06:34:47PM -0500, Tim Wait wrote:

> One aspect I haven't seen mentioned in this thread, except for
> Greg's oblique reference to Mosix, is that many (most?)
> of our clusters run parallel apps. Regardless of HA, if you have
> a node fail while running a parallel job, you have just blown your
> (supposed) 5 nines away; in my experience, it takes the user O(12+ hours)
> to restart the job. Is this deteriorating to HA vice beowulf?

It's not that hard for queue systems like PBS to detect and restart
jobs that fail due to machines dying -- this is a major
quality-of-implementation issue.

It still hurts your utilization, because the resources spent on the
failed run are wasted. But at least the user doesn't have to do
anything to get their answer; they just get it later.
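
To put rough numbers on that trade-off, here is a minimal sketch of
the restart-and-waste accounting. It is a toy model, not PBS code:
the Job class, run_with_requeue(), and the failure schedule are all
hypothetical names invented for illustration, and it assumes the job
has no checkpointing, so every restart begins from scratch.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical parallel job; not a PBS data structure."""
    name: str
    nodes: int
    runtime_hours: float
    attempts: int = 0
    wasted_node_hours: float = 0.0


def run_with_requeue(job, failure_schedule):
    """Run a job, restarting from scratch whenever a node dies.

    failure_schedule lists, per attempt, the hour at which a node
    dies, or None if that attempt sees no failure. Returns the
    wall-clock hours until the job finally completes.
    """
    elapsed = 0.0
    for failed_at in failure_schedule:
        job.attempts += 1
        if failed_at is None or failed_at >= job.runtime_hours:
            # This attempt ran to completion.
            return elapsed + job.runtime_hours
        # Node died mid-run: all work so far is lost (no checkpoint),
        # and the queue system resubmits without user involvement.
        elapsed += failed_at
        job.wasted_node_hours += failed_at * job.nodes
    raise RuntimeError("job never completed within the given schedule")


def utilization(job):
    """Fraction of consumed node-hours that produced the final answer."""
    useful = job.runtime_hours * job.nodes
    return useful / (useful + job.wasted_node_hours)


# Example: a 16-node, 10-hour job loses a node 4 hours into its
# first attempt and completes on the automatic second attempt.
job = Job("example", nodes=16, runtime_hours=10.0)
wall_clock = run_with_requeue(job, [4.0, None])
```

In this example the answer arrives after 14 hours instead of 10, and
64 node-hours are thrown away, so utilization for the job drops to
about 71%. The user did nothing; the cost shows up only in the
cluster's accounting.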

-- greg




