Disk reliability (Was: Node cloning)
Robert G. Brown rgb at phy.duke.edu
Wed Apr 11 15:23:25 PDT 2001
On Wed, 11 Apr 2001, Josip Loncaric wrote:

> "[...] For example if during testing of your hard drive DFT reports a
> error code of 0x70 as shown on page 14, this indicates that your hard
> disk drive has one or more bad sectors. In most of these cases the
> drive can heal itself of these errors. To do this first back-up all
> your data from the problem drive (if possible) then run DFT again and
> select the Erase Disk option which is under the Utilities heading.
> [...] Once erase disk has completed you can then run one of the test
> options Quick or Advance to confirm htat the drive has been healed. The
> result code, which should be displayed, is 0x00 if the test returns
> another code then you should check with your drive/system vendor if the
> drive can be return for warranty replacement." (sic!)

The only way I can imagine for this to actually work to heal the disk is if the drive's low-level formatting is somehow faulty. There are two "generic" low-level causes of bad blocks. One is simply imperfect plating or physical damage or anything else that results in an area of the disk that won't hold its ferromagnetic magnetization. This is the kind of error that Greg talks about -- erasing or reformatting or whatever won't fix this -- the only thing that will "fix" it is marking it out as bad.

The other kind of error is a dynamic mechanical or electrical error -- a write head starts to write a tiny bit early during a move and overwrites a track boundary or other "soft" format data that defines and stabilizes the disk geometry. In the old days this was pretty common, disks were awesomely expensive, and most disks came with a "low level format" utility that would redraw all the tracks and mark out all the bad blocks (with an optional feature that would look for bad blocks in the event that you accidentally trashed the bad block list on the disk itself). I spent many a happy hour waiting for these utilities to finish, and sometimes they would even work.
I would assume the "erase" option is really a name for a new low level reformat that fixes the latter kind of error and MIGHT even help with the former, if the bad blocks are "bad enough". However, ferromagnetism is nastily nonlinear and a bad block can very gradually lose its information -- be "almost" stable.

Another bad thing is that a disk that generates dynamical errors that screw up low level formatting and hence "blocks" -- not quite perfectly synchronizing its read/write activity on certain patterns of use, or (as was the documented case for certain disks some years ago) writing before it fully spins up to speed -- can ALSO "work" after being "fixed" with a low level format or badblocks run, but the problem is generally fundamental and will simply come back again later. On some disks that did this, the disks gradually deteriorated until not even badblocks could repair them. There are definitely disks that are just plain "lemons", although IBM disks are admittedly pretty good.

Disk errors make me nervous enough that I'm mostly with Greg on this one -- if one can get them replaced for free, do it, and if you value your own time consider spending money (preferably other people's money, of course:-) to replace them if necessary. There is always a subset of possible errors that will pass a CRC test or slip past error detection routines, and a bit error in a binary or data file is as undesirable as a bit error in memory. In a way you're lucky if it just causes immediate system failure.

Back when low level recovery tools were ubiquitous (and disks cost thousands of dollars, which is WHY they were ubiquitous:-) I certainly used them a lot, but my "three year survival rate" with disks repaired this way has overall been very low. In some cases a semi-recoverable failure has occurred near the warranty boundary, and the delay has cost me the opportunity to replace the disk under warranty.
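The point above that some errors will always pass a CRC test can be illustrated with a toy example. This is only a sketch: it uses a hand-rolled 8-bit CRC (generator polynomial 0x07, zero initial value), which is an assumption chosen for brevity and is not the error-checking scheme of any particular drive. The names crc8, original, error and corrupted are hypothetical. With a zero initial value the CRC is linear, so any error pattern that is itself a multiple of the generator polynomial leaves the checksum unchanged.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, zero initial value, no final XOR (toy parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


# Hypothetical block contents, for illustration only.
original = b"beowulf node 17"

# The generator polynomial x^8 + x^2 + x + 1 is 0x107; written as two bytes
# it is 0x01 0x07. XORing this pattern into the data is an "error" that the
# CRC cannot see, because the pattern's own CRC is zero.
error = bytes([0x01, 0x07])
corrupted = original[:-2] + bytes(a ^ b for a, b in zip(original[-2:], error))

# A single flipped bit, by contrast, always changes this CRC.
single_bit = bytes([original[0] ^ 0x01]) + original[1:]

print("data changed:           ", corrupted != original)
print("CRC still matches:      ", crc8(corrupted) == crc8(original))
print("single-bit flip caught: ", crc8(single_bit) != crc8(original))
```

Real drives use much stronger codes than this, so undetected patterns are far rarer, but the same argument applies: a check value of finite width cannot distinguish every possible corruption of a larger block.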
Of course Moore's law for disks has been, if anything, more aggressive than ML for other system components (shorter constant cost doubling time) so perhaps this isn't a big deal, but nowadays if a disk fritzes under warranty I just take it back and get a new one right away. Often a bigger one, since small disks are discontinued so aggressively.

   rgb

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
More information about the Beowulf mailing list
