[Beowulf] HPC fault tolerance using virtualization
hearnsj at googlemail.com
Tue Jun 16 02:02:11 PDT 2009
2009/6/16 Kilian CAVALOTTI <kilian.cavalotti.work at gmail.com>
> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and the
> OS would only notice a bad disk while trying to write on it, likely asked to
> do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything
> losing bits mainly manifests itself in software errors. There is very little
> chance to spot a bad DIMM until something (like a job) tries to write to it.
What you say is very true.
However, you could look for correctable ECC errors, and for disks run a
smartctl test and see whether a disk is showing symptoms which might make
it fail in future.
Or maybe look at the error rates on your Ethernet or InfiniBand interface -
you might want to take that node out till it can be investigated (read:
reseating the cable!)
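
As a rough illustration of the kind of pre-job check described above, here is a
minimal Python sketch. It assumes a Linux node with the EDAC driver loaded, so
that correctable error counts appear under /sys/devices/system/edac/mc/mc*/ce_count,
and it reads the per-interface counters under /sys/class/net/<iface>/statistics.
The function names, the thresholds and the sysfs_root parameter are illustrative
choices of mine, not part of any existing tool.

```python
from pathlib import Path


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def correctable_ecc_errors(sysfs_root="/sys"):
    """Sum ce_count over all EDAC memory controllers (0 if EDAC is absent)."""
    mc_dir = Path(sysfs_root) / "devices/system/edac/mc"
    if not mc_dir.is_dir():
        return 0
    counts = (read_counter(p) for p in mc_dir.glob("mc*/ce_count"))
    return sum(c for c in counts if c is not None)


def interface_errors(iface, sysfs_root="/sys"):
    """Return the rx/tx error counters for one network interface."""
    stats = Path(sysfs_root) / "class/net" / iface / "statistics"
    return {name: read_counter(stats / name) for name in ("rx_errors", "tx_errors")}


def node_is_suspect(iface, max_ecc=0, max_net=0, sysfs_root="/sys"):
    """True if any counter exceeds its (hypothetical) threshold."""
    if correctable_ecc_errors(sysfs_root) > max_ecc:
        return True
    # Unreadable counters are treated as zero here, which is an assumption.
    return any((v or 0) > max_net for v in interface_errors(iface, sysfs_root).values())


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    print("suspect" if node_is_suspect("eth0") else "looks ok")
```

The counters are cumulative since boot, so in practice you would probably
compare against a previous sample rather than against zero. For disks,
"smartctl -H /dev/sdX" reports the overall health assessment and
"smartctl -t short /dev/sdX" starts a short self-test; those are left out of
the sketch because they need root and real hardware.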