Disk reliability (Was: Node cloning)
becker at scyld.com
Wed Apr 11 16:58:40 PDT 2001
On Wed, 11 Apr 2001, Robert G. Brown wrote:
> On Wed, 11 Apr 2001, Josip Loncaric wrote:
> > "[...] In most of these cases the
> > drive can heal itself of these errors."
> The only way I can imagine for this to actually work to heal the disk is
> if the drive's low-level formatting is somehow faulty. There are two
> "generic" low-level causes of bad blocks.
> One is simply imperfect plating or physical damage or...
> The other kind of error is a dynamic mechanical or electrical error...
> I would assume the "erase" option is really a name for a new low level
> reformat that fixes the latter kind of error and MIGHT even help with
When they say "heal", they actually mean "remap to substitute disk
blocks reserved for this purpose". They must have thought that the
concept of remapping disk blocks was too confusing.
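As a rough illustration of what "remap" means here, the toy model below
keeps a pool of reserved blocks and a table that redirects a bad logical
block to one of them. The class name, method names, and block numbers are
all made up for the sketch; real firmware does this internally and the
host never sees it.

```python
class RemapTable:
    """Toy model of spare-block remapping (hypothetical, not real firmware)."""

    def __init__(self, spare_blocks):
        # Physical blocks held in reserve; the host never addresses them directly.
        self.spares = list(spare_blocks)
        # Logical block number -> substitute physical block.
        self.remapped = {}

    def physical(self, logical):
        # A block that was never remapped maps to itself.
        return self.remapped.get(logical, logical)

    def remap(self, logical):
        # Redirect a bad logical block to the next reserved block.
        if not self.spares:
            raise RuntimeError("no spare blocks left")
        self.remapped[logical] = self.spares.pop(0)
        return self.remapped[logical]


table = RemapTable(spare_blocks=[1000, 1001])
table.remap(42)
print(table.physical(42), table.physical(43), len(table.spares))
```

After the remap, reads of logical block 42 land on reserved block 1000,
block 43 is untouched, and one spare remains.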
Most modern disks use a three-level error control scheme. A typical
drive works as follows:
A hardware-based convolutional decoder is applied to the signal
coming off the read heads that picks the most likely value for
marginal signals based on the surrounding bits. The correction/error
level info is usually discarded.
A block check is applied to the resulting data block. If an error is
detected, the drive firmware takes an error correcting step. If few
enough bits have been corrupted, the firmware corrects the error in
software and perhaps writes the corrected data back to the same location.
If too many bits have been corrupted to rely on the software error
correcting code, the drive might return a soft error. Either the
driver or the OS re-tries the read several times. If one of the
re-reads works, the corrected data is written back, perhaps to a newly
remapped disk block. If the re-read doesn't work, the drive returns a
hard error and remaps the bad block to a reserved good block.
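The decision flow of the last two levels can be sketched roughly as below.
This is only a model under stated assumptions: the hardware decoder is
taken as already applied, a re-read "works" when the remaining bit errors
fit within the ECC limit, and the names `read_block`, `ecc_limit`, and
`max_retries` (and their default values) are invented for the example.

```python
from enum import Enum


class Outcome(Enum):
    CLEAN = "block check passed"
    ECC_CORRECTED = "corrected by firmware ECC"
    RETRY_RECOVERED = "recovered on re-read, data written back"
    HARD_ERROR = "hard error, block remapped"


def read_block(bad_bits_per_attempt, ecc_limit=8, max_retries=3):
    """Classify one read.

    bad_bits_per_attempt lists the bit errors left after the hardware
    decoder: first entry for the initial read, later entries for re-reads.
    ecc_limit and max_retries are assumed values, not drive specifications.
    """
    first, *rereads = bad_bits_per_attempt
    if first == 0:
        return Outcome.CLEAN              # block check finds nothing wrong
    if first <= ecc_limit:
        return Outcome.ECC_CORRECTED      # few enough bits: fix in software
    # Too many bad bits: soft error, so the read is retried a few times.
    for bad in rereads[:max_retries]:
        if bad <= ecc_limit:
            return Outcome.RETRY_RECOVERED
    return Outcome.HARD_ERROR             # every re-read failed


for attempts in ([0], [3], [20, 20, 2], [20, 20, 20, 20]):
    print(attempts, "->", read_block(attempts).value)
```

The four sample calls walk through a clean read, a firmware-corrected
read, a read recovered on the second retry, and a hard error.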
You can guess what is happening with a drive by using the SMART data.
If you have plenty of stand-by blocks, you have good disk platters.
If the number of stand-by blocks is decreasing, something is going
wrong. You should think about ordering a replacement drive before you
see a hard error.
If the number of stand-by blocks is approaching zero, buy a new drive
Right Now. You've been lucky to not have encountered a hard error. Or
maybe you just have been lucky not to notice it.
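Those three rules of thumb could be turned into a small check along the
following lines. How the stand-by block count is obtained from the SMART
data varies by drive and is left out; the function name, the 10% cutoff
for "approaching zero", and the sample numbers are assumptions made for
the sketch.

```python
def spare_block_advice(history, total, low_fraction=0.1):
    """Suggest an action from spare-block counts.

    history: spare-block counts over time, oldest first (hypothetical input).
    total: spare blocks the drive started with.
    low_fraction: assumed cutoff for "approaching zero".
    """
    current = history[-1]
    if current <= total * low_fraction:
        return "replace now"
    if len(history) > 1 and current < history[0]:
        return "order a replacement"   # count is falling: something is wrong
    return "healthy"                   # plenty of spares and no decline


print(spare_block_advice([500, 500, 500], total=500))
print(spare_block_advice([500, 492, 480], total=500))
print(spare_block_advice([60, 40], total=500))
```

Run periodically on each node, a check like this would flag a shrinking
spare pool well before the first hard error shows up.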
[[ I fondly remember my first real job, working for Bibb Cain and George
Clark at Harris-ATD. They wrote one of the important books on Error
Correcting Coding. It was enlightening to hear people talk about
algorithms and circuits in terms of dB. ]]
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993