[Beowulf] HPC fault tolerance using virtualization
John Hearns hearnsj at googlemail.com
Tue Jun 16 02:02:11 PDT 2009
- Previous message: [Beowulf] HPC fault tolerance using virtualization
- Next message: [Beowulf] Re: HPC fault tolerance using virtualization
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
2009/6/16 Kilian CAVALOTTI <kilian.cavalotti.work at gmail.com>

> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and the
> OS would only notice a bad disk while trying to write on it, likely asked to
> do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything hardware
> losing bits mainly manifests itself in software errors. There is very little
> chance to spot a bad DIMM until something (like a job) tries to write to it.

What you say is very true. However, you could look for correctable ECC errors, and for disks run a smartctl test and see if a disk is showing symptoms which might make it fail in future.

Or maybe look at the error rates on your ethernet or infiniband interface - you might want to take that node out till it can be investigated (read: reseating the cable!)
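The checks suggested above could be scripted roughly as follows. This is only a sketch under assumptions that are not part of the original message: it assumes a Linux node with an EDAC driver loaded (so that ce_count files exist under /sys/devices/system/edac/mc/), it reads the kernel's per-interface rx_errors and tx_errors counters, and the function names and thresholds are made up for illustration. Disk health is left to a separate smartctl run (for example `smartctl -t short` followed later by `smartctl -H` on the device), and InfiniBand port counters are not covered.

```python
import glob
import os


def read_counter(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def correctable_ecc_errors(sysfs_root="/sys"):
    """Sum ce_count over all EDAC memory controllers found under sysfs_root."""
    pattern = os.path.join(sysfs_root, "devices/system/edac/mc/mc*/ce_count")
    counts = [read_counter(p) for p in glob.glob(pattern)]
    return sum(c for c in counts if c is not None)


def interface_errors(iface, sysfs_root="/sys"):
    """Return rx_errors + tx_errors for one interface (0 if it does not exist)."""
    stats = os.path.join(sysfs_root, "class/net", iface, "statistics")
    total = 0
    for name in ("rx_errors", "tx_errors"):
        total += read_counter(os.path.join(stats, name)) or 0
    return total


def node_is_suspect(iface="eth0", ce_threshold=0, net_threshold=100,
                    sysfs_root="/sys"):
    """Hypothetical policy: flag the node if either counter exceeds its threshold."""
    return (correctable_ecc_errors(sysfs_root) > ce_threshold
            or interface_errors(iface, sysfs_root) > net_threshold)


if __name__ == "__main__":
    print("correctable ECC errors:", correctable_ecc_errors())
    print("eth0 rx+tx errors:", interface_errors("eth0"))
    print("take node out for investigation:", node_is_suspect())
```

Note that these counters are cumulative since boot (or since the driver was loaded), so to get an error rate you would sample them twice and compare the difference. A scheduler prologue or health-check hook could then mark the node offline when node_is_suspect() returns True.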
