Disk reliability (Was: Node cloning)

Mon Apr 9 07:12:24 PDT 2001

Thanks to several constructive responses, the following picture emerges:

(1) Modern IDE drives can automatically remap a limited number of bad
blocks.  As long as the drive does this correctly, the OS should never
see a bad block.

(2) However, the drive's capacity to do this is limited to 256 bad
blocks or so.  If more bad blocks exist, then the OS will start to see
them.  To recover from this without replacing the hard drive, one can
detect and map out the bad blocks using the 'e2fsck -c ...' and
'mkswap -c ...' commands.  Obviously, the partition being checked must
not be in use: turn swap off first, and either unmount the file system
or, if it cannot be unmounted, reboot after doing
"echo '-f -c' >/fsckoptions".
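
As a rough illustration of the procedure in (2), here is a minimal
sketch.  The helper names (run, remap_swap, remap_ext2) and the DRY_RUN
switch are my own inventions for this example, and the device names are
placeholders.  By default it only prints the commands it would run; set
DRY_RUN=0 only after double-checking the device and backing up.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical conveniences.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Map out bad blocks on a swap partition.
remap_swap() {
    dev="$1"
    run swapoff "$dev"       # the partition must not be in use
    run mkswap -c "$dev"     # -c: check for bad blocks while making the swap area
    run swapon "$dev"
}

# Map out bad blocks on an ext2 file system that can be unmounted.
remap_ext2() {
    dev="$1"
    run umount "$dev"
    run e2fsck -f -c "$dev"  # -f: force the check; -c: scan for bad blocks
    run mount "$dev"         # assumes the device has an /etc/fstab entry
}
```

For example, with DRY_RUN left at 1, "remap_swap /dev/hdX2" (a
placeholder device) prints the swapoff, mkswap -c and swapon commands
in that order without touching the disk.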

(3) In general, IDE cables should be at most 18" long with both ends
plugged in (no stubs), and preferably serving only one (master) drive.

For IBM drives (IDE or SCSI), one can download and use the Drive Fitness
Test utility (see
http://www.storage.ibm.com/techsup/hddtech/welcome.htm).  This program
can diagnose typical problems with hard drives.  In many cases, bad
blocks can be 'healed' by erasing the drive using this utility (back up
your data first, and be prepared for the 'Erase Disk' to take an hour or
more).  If that fails and your drive is under warranty, the drive ought
to be replaced.

For older existing drives in less critical applications (e.g. booting
Beowulf client nodes where the same data is mirrored by other nodes),
mapping out bad blocks as needed is probably adequate.

Finally, the existing Linux S.M.A.R.T. utilities apparently do not
handle every S.M.A.R.T. drive correctly.  Use them with caution.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134