Reliability analysis was RE: Windows HPC (@ Cornell)
Greg Lindahl lindahl at keyresearch.com
Thu Nov 7 20:19:08 PST 2002
On Thu, Nov 07, 2002 at 06:34:47PM -0500, Tim Wait wrote:

> One aspect I haven't seen mentioned in this thread, except for
> Greg's oblique reference to Mosix, is that many (most?)
> of our clusters run parallel apps. Regardless of HA, if you have
> a node fail while running a parallel job, you have just blown your
> (supposed) 5 nines away; in my experience, it takes the user O(12+ hours)
> to restart the job. Is this deteriorating to HA vice beowulf?

It's not that hard for queue systems like PBS to detect and restart
jobs that fail due to machines dying -- this is a major quality of
implementation issue. It still hurts your utilization, because you have
wasted resources. But at least the user doesn't have to do anything to
get their answer; they just get it later.

-- greg
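
The trade-off in the reply above (automatic restart saves the user's effort but not the lost compute) can be illustrated with a rough Python sketch. This is not how PBS or any real queue system does it: `run_with_restart`, `run_once`, `flaky_job`, and all the numbers are hypothetical stand-ins chosen for illustration, and the sketch assumes a restarted job begins again from scratch.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Outcome:
    attempts: int         # how many times the job was started
    wasted_hours: float   # node-hours spent on attempts that died
    succeeded: bool


def run_with_restart(run_once: Callable[[int], Tuple[bool, float]],
                     nodes: int,
                     max_attempts: int = 3) -> Outcome:
    """Rerun a job until it finishes or max_attempts is reached.

    run_once(attempt) is a hypothetical stand-in for the queue system
    starting the job; it returns (finished_ok, wall_hours_used).
    """
    wasted = 0.0
    for attempt in range(1, max_attempts + 1):
        ok, hours = run_once(attempt)
        if ok:
            return Outcome(attempt, wasted, True)
        # A node died: assume all work from this attempt is lost.
        wasted += hours * nodes
    return Outcome(max_attempts, wasted, False)


def flaky_job(attempt: int) -> Tuple[bool, float]:
    # Illustrative numbers only: the first attempt loses a node after
    # 5 hours; the second runs the full 12 hours to completion.
    return (False, 5.0) if attempt == 1 else (True, 12.0)


result = run_with_restart(flaky_job, nodes=16)
print(result)
```

In this made-up case the user gets the answer without resubmitting anything, but 5 hours on 16 nodes are thrown away, which is the utilization cost the reply mentions.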
More information about the Beowulf mailing list
