[Beowulf] Re: HPC fault tolerance using virtualization
Greg Lindahl lindahl at pbm.com
Sun Jun 28 17:21:26 PDT 2009
On Sun, Jun 28, 2009 at 12:17:50PM +0100, Dave Love wrote:

> > and for disks run a smartctl test and see if a disk is showing
> > symptoms which might make it fail in future.
>
> What I typically see from smartd is alerts when one or more sectors has
> already gone bad, although that tends not to be something that will
> clobber the running job. How should it be configured to do better
> (without noise)?

That isn't noise, that's signal. You're just lucky that your running
job doesn't need the data off the bad sector. You can try waiting
until the job finishes before taking the node out of service; from the
sounds of it, you will usually win.

But if you don't have application-level end-to-end checksums of your
data, how do you know if you won or not?

In my big MapReduce cluster (800 data disks), about 2/3 of the time
I'll see an I/O error in my application, or checksum failure, and 1/3
of the time I will see a smartd error and no application error.

-- greg
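
For readers wondering what "application-level end-to-end checksums" might look like in practice, here is a minimal Python sketch. It is not the scheme used on the cluster described above; the sidecar ".sha256" file and the function names write_with_checksum and verify are assumptions made for illustration. The digest is computed from the bytes the application intended to write, so a later mismatch (or a read error) means the storage path returned something different.

```python
import hashlib
import os

CHUNK = 1 << 20  # read files back in 1 MiB pieces


def file_digest(path):
    """Return the SHA-256 hex digest of the bytes currently stored at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def write_with_checksum(path, data):
    """Write data to path and record its digest in a sidecar file.

    The sidecar name (path + ".sha256") is a hypothetical convention.
    The digest comes from the in-memory data, not from a read-back,
    so it reflects what the application meant to store.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    digest = hashlib.sha256(data).hexdigest()
    with open(path + ".sha256", "w") as f:
        f.write(digest + "\n")
    return digest


def verify(path):
    """Return True only if the file reads back cleanly and matches its digest."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    try:
        return file_digest(path) == expected
    except OSError:
        # An I/O error while reading (for example, a bad sector) is a failure.
        return False
```

Calling verify() on a job's output files before the node is returned to service is one way to tell whether the job "won" after a smartd alert. It only covers files that were written with a recorded digest.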
