Disk reliability (Was: Node cloning)
josip at icase.edu
Mon Apr 9 07:12:24 PDT 2001
Thanks to several constructive responses, the following picture emerges:
(1) Modern IDE drives can automatically remap a certain number of bad
blocks. As long as this works correctly, the OS should never see a bad
block.
(2) However, the drive can remap only a limited number of bad blocks
(256 or so). Beyond that, the OS will start to see them. To recover
without replacing the hard drive, one can detect and map out the bad
blocks using the 'e2fsck -c ...' and 'mkswap -c ...' commands. The
partition being checked must not be in use: turn swap off first, and
either unmount the file system or reboot after doing
"echo '-f -c' >/fsckoptions".
(3) In general, IDE cables should be at most 18" long with both ends
plugged in (no stubs), and preferably serving only one (master) drive.
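As a rough sketch of the sequence in (2), something like the following
could be used. It only prints the commands instead of running them, so
it is safe to try. The device names /dev/hda5 and /dev/hda2 and the two
function names are placeholders of mine, not anything taken from the
responses; substitute your own partitions.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands from step (2) rather than running them.
# Device names below are hypothetical placeholders.

# Steps for mapping out bad blocks on an ext2 partition.
plan_ext2_check() {
    dev=$1
    echo "umount $dev"           # the file system must not be in use
    echo "e2fsck -f -c $dev"     # -f forces the check, -c scans for bad blocks
}

# Steps for mapping out bad blocks on a swap partition.
plan_swap_check() {
    dev=$1
    echo "swapoff $dev"          # turn swap off first
    echo "mkswap -c $dev"        # -c checks for bad blocks while re-creating swap
    echo "swapon $dev"           # re-enable swap afterwards
}

plan_ext2_check /dev/hda5
plan_swap_check /dev/hda2
```

Once the printed commands look right for your layout, they can be run
by hand as root. The root file system cannot be unmounted this way,
which is where the /fsckoptions route in (2) comes in.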
For IBM drives (IDE or SCSI), one can download and use the Drive Fitness
Test utility (see
http://www.storage.ibm.com/techsup/hddtech/welcome.htm). This program
can diagnose typical problems with hard drives. In many cases, bad
blocks can be 'healed' by erasing the drive using this utility (back up
your data first, and be prepared for the 'Erase Disk' to take an hour or
more). If that fails and your drive is under warranty, the drive ought
to be replaced.
For older existing drives (in less critical applications, e.g. to boot
Beowulf client nodes where the same data is mirrored by other nodes)
mapping out bad blocks as needed is probably adequate.
Finally, the existing Linux S.M.A.R.T. utilities apparently do not
handle every SMART drive correctly. Use with caution.
Dr. Josip Loncaric, Research Fellow    mailto:josip at icase.edu
ICASE, Mail Stop 132C                  PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center           mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA            Tel. +1 757 864-2192  Fax +1 757 864-6134