[Beowulf] Re: HPC fault tolerance using virtualization
Dave Love d.love at liverpool.ac.uk
Sun Jun 28 04:17:50 PDT 2009
- Previous message: [Beowulf] HPC fault tolerance using virtualization
- Next message: [Beowulf] Re: HPC fault tolerance using virtualization
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
John Hearns <hearnsj at googlemail.com> writes:

> However, you could look for correctable ECC errors,

On the systems with which I'm familiar, they either won't show up in
the IPMI SEL or will apparently be inconsistent with the kernel
mcelog -- mcelog typically displays many more events. (I don't know
why this is, though I'm overly familiar with memory errors.)

> and for disks run a smartctl test and see if a disk is showing
> symptoms which might make it fail in future.

What I typically see from smartd is alerts when one or more sectors
have already gone bad, although that tends not to be something that
will clobber the running job. How should it be configured to do
better (without noise)?
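
For illustration only, here is a minimal Python sketch of how a node health check might poll the SMART attribute table directly rather than wait for smartd mail. It assumes the usual ATA table layout printed by `smartctl -A`; the function names and the choice of watched attributes are hypothetical and not from this thread. These counters largely report sectors that have already gone bad, so this shows the mechanics of polling rather than an answer to the configuration question.

```python
import subprocess

# Attributes often treated as early-warning signs; this choice is an assumption.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Map attribute name -> leading integer of RAW_VALUE from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        # Table rows have 10 columns; RAW_VALUE (the last) may contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers, banners, and blank lines
        raw_first = fields[9].split()[0]
        if raw_first.isdigit():
            attrs[fields[1]] = int(raw_first)
    return attrs


def early_warnings(attrs, watched=WATCHED):
    """Return the watched attributes whose raw count is non-zero."""
    return {name: attrs[name] for name in watched if attrs.get(name, 0) > 0}


def check_disk(device):
    """Run `smartctl -A` on a device (usually needs root) and report warnings."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return early_warnings(parse_smart_attributes(result.stdout))
```

A scheduler prologue or epilogue could call something like `check_disk("/dev/sda")` and take the node out of service when the result is non-empty; whether that is less noisy than smartd depends on the thresholds chosen.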
