[Beowulf] SATA(?) errors locks up node

Gebhardt Thomas gebhardt at hrz.uni-marburg.de
Wed May 23 02:13:59 PDT 2007


Hi,

we are running a cluster of 57 dual-Opteron nodes. Once or twice a week
one of these nodes enters an error state and can no longer reach its
I/O subsystem, so I have to reboot it. As far as I can see, the problem
hits any of our nodes at random, i.e., the MTBF of a single node is
about 6-12 months.
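
The MTBF figure is simple arithmetic on the cluster-wide failure rate. A
minimal sketch, assuming failures are independent and spread evenly over
all nodes, and taking a month as 365.25/12 days (the function name is
only for illustration):

```python
def per_node_mtbf_months(nodes, cluster_failures_per_week):
    """Estimate single-node MTBF in months from the cluster-wide failure rate.

    Assumes failures are independent and spread evenly over all nodes.
    """
    weeks_between_failures = nodes / cluster_failures_per_week
    days_per_month = 365.25 / 12  # average month length, an assumption
    return weeks_between_failures * 7 / days_per_month


# 57 nodes with one or two lockups per week across the whole cluster
for rate in (1, 2):
    print(f"{rate}/week -> {per_node_mtbf_months(57, rate):.1f} months")
```

This works out to roughly 6.5 to 13 months per node, in line with the
rough figure quoted above.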

I still don't know whether this is a problem with the Linux kernel SATA driver,
a hardware problem, a flaw in the disk firmware, or something else. I'm
looking for a way to track down the problem without substantially
interfering with the jobs on the cluster.

This is our environment:
TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
2 DualCore Opteron  2216 HE 2.4GHz, 16GByte Mem
Maxtor 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev. 20.06C03
Debian sarge amd64 (custom kernel)

I tried several Linux kernel versions (e.g. 2.6.18.1, currently 2.6.20.3
from kernel.org), which seems to make no difference.

I also tried reducing the SATA bandwidth to 150MB/s with a jumper on
the disk. This does not help either.

NCQ is disabled:
# cat  /sys/block/sda/device/queue_depth
1
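
The same check can be scripted for every disk on a node. This is only a
sketch: it assumes the sysfs layout shown above
(/sys/block/<dev>/device/queue_depth) and treats a depth of 1 as "NCQ not
in use"; the function name is made up for illustration.

```python
from pathlib import Path


def ncq_queue_depths(sysfs_block="/sys/block"):
    """Return {device: queue_depth} for every block device exposing the attribute."""
    depths = {}
    for attr in sorted(Path(sysfs_block).glob("*/device/queue_depth")):
        device = attr.parent.parent.name  # e.g. "sda"
        depths[device] = int(attr.read_text().strip())
    return depths


if __name__ == "__main__":
    for device, depth in ncq_queue_depths().items():
        state = "NCQ off" if depth == 1 else "NCQ possibly on"
        print(f"{device}: queue_depth={depth} ({state})")
```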

Any ideas?

Thanks, Thomas

+++++++++++++++++++

Here is a typical console error log. As far as I can see, it means that
communication between the kernel and the disk is suddenly interrupted.

May 17 04:39:51 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
May 17 04:39:51 ata1.00: cmd ca/00:50:9a:32:7b/00:00:00:00:00/e0 tag 0 cdb 0x0 data 40960 out
May 17 04:39:51          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 04:39:58 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:21 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:21 ata1: soft resetting port
May 17 04:40:28 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:51 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:51 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:41:21 ata1.00: qc timeout (cmd 0xec)
May 17 04:41:22 ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 17 04:41:22 ata1.00: revalidation failed (errno=-5)
May 17 04:41:22 ata1: failed to recover some devices, retrying in 5 secs
May 17 04:41:26 ata1: hard resetting port
May 17 04:41:34 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:41:57 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:41:57 ata1: COMRESET failed (device not ready)
May 17 04:41:57 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:02 ata1: hard resetting port
May 17 04:42:09 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:42:32 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:42:32 ata1: COMRESET failed (device not ready)
May 17 04:42:32 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:37 ata1: hard resetting port
May 17 04:42:45 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:43:08 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:43:08 ata1: COMRESET failed (device not ready)
May 17 04:43:08 ata1: reset failed, giving up
May 17 04:43:08 ata1.00: disabled
May 17 04:43:08 ata1: EH complete
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 8073882
May 17 04:43:08 Buffer I/O error on device sda2, logical block 9189
May 17 04:43:08 lost page write due to I/O error on sda2
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 16099660
May 17 04:43:08 Buffer I/O error on device sda3, logical block 12365
May 17 04:43:08 lost page write due to I/O error on sda3
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 73606884
May 17 04:43:08 Buffer I/O error on device sda3, logical block 7200768
May 17 04:43:08 lost page write due to I/O error on sda3
....
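
To make the log easier to read, the "cmd" line and the status byte can be
decoded by hand. The sketch below assumes the libata field order
command/feature:nsect:lbal:lbam:lbah/<five hob fields>/device, a 28-bit
LBA command and 512-byte sectors; the function names are made up for
illustration, and only the two opcodes seen in this log are named.

```python
import re

# ATA status register bits (classic ATA names)
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}

# Only the two opcodes that occur in the log above
COMMANDS = {0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}

# Twelve two-digit hex fields separated by "/" or ":"
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS.items() if status & bit]


def decode_lba28_cmd(line):
    """Decode a libata 'cmd' line, assuming a 28-bit LBA command."""
    match = CMD_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    tf = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    command, _feature, nsect, lbal, lbam, lbah = tf[:6]
    device = tf[11]
    # LBA28: low nibble of the device field holds address bits 24-27
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    sectors = nsect or 256  # a count of 0 means 256 sectors in LBA28
    return {
        "command": COMMANDS.get(command, hex(command)),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumes 512-byte sectors
    }


line = ("ata1.00: cmd ca/00:50:9a:32:7b/00:00:00:00:00/e0 "
        "tag 0 cdb 0x0 data 40960 out")
print(decode_lba28_cmd(line))
print(decode_status(0xD0))
```

The decoded command is a WRITE DMA of 80 sectors (40960 bytes) at LBA
8073882, which matches both the "data 40960 out" field and the first
failing sector reported by end_request. Status 0xd0 decodes to BSY, DRDY
and DSC. With BSY set the other bits carry no meaning, so the byte mainly
says that the drive stays busy through every reset attempt.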


