IDE disk errors

J. G. LaBounty jgl at
Wed Jun 13 08:03:56 PDT 2001

 We are being swamped with disk errors. Most of the errors are logged
 as follows:
 Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7975408, sector=2625696
 Jun 12 01:44:40 scf402n kernel: end_request: I/O error, dev 03:08 (hda), sector 2625696
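
 With this many nodes it can help to tally these messages per host and
 drive rather than read them one by one. The following is a minimal
 sketch, assuming the syslog lines have exactly the layout shown above;
 the function names and the SAMPLE list are illustrative only. The
 "offset" field is an assumption: the difference between LBAsect and
 sector is presumably the starting sector of the partition (dev 03:08
 is major 3, minor 8, i.e. hda8).

```python
import re
from collections import Counter

# Matches the "dma_intr: error=..." lines in the layout quoted above.
ERROR_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<dev>hd[a-z]):\s+dma_intr:\s+error=(?P<code>0x[0-9a-fA-F]+)\s+"
    r"\{\s*(?P<flags>[^}]*?)\s*\},\s+LBAsect=(?P<lba>\d+),\s+sector=(?P<sector>\d+)"
)


def parse_error_line(line):
    """Return a dict for a dma_intr error line, or None for any other line."""
    m = ERROR_RE.match(line.strip())
    if m is None:
        return None
    lba = int(m.group("lba"))
    sector = int(m.group("sector"))
    return {
        "host": m.group("host"),
        "dev": m.group("dev"),
        "code": int(m.group("code"), 16),
        "flags": m.group("flags").split(),
        "lba": lba,
        "sector": sector,
        # Assumption: whole-disk LBA minus partition-relative sector
        # gives the partition's starting sector.
        "offset": lba - sector,
    }


def tally_errors(lines):
    """Count dma_intr error lines per (host, device)."""
    counts = Counter()
    for line in lines:
        rec = parse_error_line(line)
        if rec is not None:
            counts[(rec["host"], rec["dev"])] += 1
    return counts


# The three log lines quoted in the post.
SAMPLE = [
    "Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
    "Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7975408, sector=2625696",
    "Jun 12 01:44:40 scf402n kernel: end_request: I/O error, dev 03:08 (hda), sector 2625696",
]

if __name__ == "__main__":
    for key, n in tally_errors(SAMPLE).items():
        print(key, n)
```

 Feeding it the concatenated /var/log/messages files from all nodes
 would show whether the errors cluster on a few drives or a few sector
 ranges.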
 Everything that I can find says this is a media problem. Our typical recovery
 procedure is to:
 1. Run e2fsck -c -v -y /dev/hdX
    We run this after a disk error, but eventually the system hangs or
    there are so many errors that the check takes too long to complete
    (over 2 hours; with no errors it takes about 45 minutes).
 2. If #1 fails, we run the IBM DFT utility to reformat the drive. After
    reformatting, e2fsck -c finds no errors. If the reformat fails, we
    return the drive for replacement.
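
 The procedure above could be wrapped in a small script. This is only a
 sketch under stated assumptions: the 2-hour cutoff comes from the text,
 but treating a timeout or e2fsck exit bits 4 (errors left uncorrected)
 and 8 (operational error) as "step 1 failed" is my reading, not the
 poster's stated rule. The function names are hypothetical, and the
 IBM DFT reformat remains a manual step. The filesystem must not be
 mounted while e2fsck -c runs.

```python
import subprocess

# The post treats runs of over 2 hours as taking too long.
FSCK_TIMEOUT_S = 2 * 60 * 60


def fsck_command(device):
    """Build the e2fsck invocation used in step 1."""
    return ["e2fsck", "-c", "-v", "-y", device]


def next_step(exit_code, timed_out):
    """Map the outcome of step 1 to the next action (assumed policy)."""
    if timed_out:
        return "reformat with IBM DFT"
    # e2fsck exit status is a bit mask: 4 = errors left uncorrected,
    # 8 = operational error.
    if exit_code & (4 | 8):
        return "reformat with IBM DFT"
    return "return node to service"


def check_disk(device):
    """Run step 1 on an unmounted device and report what to do next."""
    try:
        result = subprocess.run(fsck_command(device), timeout=FSCK_TIMEOUT_S)
    except subprocess.TimeoutExpired:
        return next_step(0, timed_out=True)
    return next_step(result.returncode, timed_out=False)
```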
 Nodes  Motherboard        CPUs         Disks per node                Age        Failures
    34  ASUS P2BD          2 x 600 MHz  2 x Western Digital 26 GB     18 months         6
    50  ASUS P2BD          2 x 800 MHz  2 x IBM Deskstar 30 GB         8 months        21
   150  Tyan 2500          2 x 800 MHz  2 x IBM Deskstar 45 GB         6 months       104
        (disks are attached to a Promise 100 card)
    50  Supermicro 370DLE  2 x 1 GHz    2 x IBM Deskstar 60 GB         2 months        28
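
 To compare the groups, the failure counts can be normalized by drive
 count and age. A rough sketch, assuming each failure is a distinct
 drive (the post does not say whether repeat failures of the same drive
 are counted) and that every node has exactly 2 drives; the group
 labels are mine:

```python
# (label, nodes, disks per node, age in months, failures), from the table above.
GROUPS = [
    ("ASUS P2BD / WD 26 GB", 34, 2, 18, 6),
    ("ASUS P2BD / IBM 30 GB", 50, 2, 8, 21),
    ("Tyan 2500 / IBM 45 GB", 150, 2, 6, 104),
    ("Supermicro 370DLE / IBM 60 GB", 50, 2, 2, 28),
]


def failure_stats(nodes, disks_per_node, age_months, failures):
    """Return (drive count, failures per drive, failures per drive per month)."""
    drives = nodes * disks_per_node
    per_drive = failures / drives
    return drives, per_drive, per_drive / age_months


if __name__ == "__main__":
    for label, nodes, disks, age, failures in GROUPS:
        drives, per_drive, per_month = failure_stats(nodes, disks, age, failures)
        print(f"{label:32s} {drives:4d} drives  "
              f"{per_drive:6.1%} failed  {per_month:6.2%} per month")
```

 Under those assumptions the groups come out at roughly 9%, 21%, 35%
 and 28% failed, and the per-month rate rises with each newer group.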

 All nodes are running Red Hat 6.2 with a 2.2.16 kernel. DMA is enabled in the
 kernel and the Promise 100 patch is installed.
 For some reason most of our failures have been on the root disk. We have 
 tried running with root and swap on 1 disk and application scratch space on the
 second disk.  While this seems to reduce the frequency of the error, it does
 not eliminate it.
 We are also dropping the transfer rate of the device back to a slower speed
 while staying in DMA mode. As a last resort we may try PIO mode, but we
 really don't want to take that performance hit.
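
 The post does not say how the transfer rate is being lowered. One
 common way is hdparm's -X option, whose numeric argument is 8 + n for
 PIO mode n, 32 + n for multiword DMA mode n, and 64 + n for UltraDMA
 mode n. The sketch below only builds the command line; the helper
 names, the choice of UDMA mode 2 and /dev/hda are assumptions for
 illustration. The hdparm manual warns that -X should be used with
 care.

```python
# Base values for hdparm -X, as documented in the hdparm manual page.
XFER_BASE = {"pio": 8, "mdma": 32, "udma": 64}


def xfer_mode_value(kind, mode):
    """Return the numeric -X argument for a transfer mode, e.g. udma 2 -> 66."""
    if kind not in XFER_BASE:
        raise ValueError(f"unknown transfer kind: {kind!r}")
    if mode < 0:
        raise ValueError("mode number must be non-negative")
    return XFER_BASE[kind] + mode


def hdparm_command(device, kind, mode):
    """Build an hdparm command that selects the given transfer mode."""
    # -d1 turns the using_dma flag on, -d0 turns it off (for PIO).
    dma_flag = "-d0" if kind == "pio" else "-d1"
    return ["hdparm", dma_flag, f"-X{xfer_mode_value(kind, mode)}", device]


if __name__ == "__main__":
    # Example: step down to UltraDMA mode 2 before resorting to PIO mode 4.
    print(" ".join(hdparm_command("/dev/hda", "udma", 2)))
    print(" ".join(hdparm_command("/dev/hda", "pio", 4)))
```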
 This may seem like a lot of work for drives under warranty, but IBM no longer
 makes the 45 GB drive and warranty replacements are taking several weeks to
 arrive. We have also found that the replacements are no better than the
 drives that can be reformatted.
 We have looked at moving to SCSI drives of similar size but don't want to take
 the price hit. Adding 2 SCSI drives and a controller would bump our base price
 by 30-50%.
 Has anyone else experienced similar problems? Any suggestions as to what we
 could try to alleviate the problem?
