[Beowulf] SATA(?) errors locks up node

Mark Hahn hahn at mcmaster.ca
Wed May 23 08:37:12 PDT 2007


> I still don't know whether this is a problem of the linux kernel sata driver,
> a hardware problem, a flaw of the disk firmware or something else. I'm

the logs show that a command times out, and defies recovery.  I don't think
your chipset is the most common - is the SATA controller integrated, or
something like a Promise chip?

do you have any guess about whether your disks are getting enough power?
it seems to be a fairly common occurrence for people to report this kind of
"stops working" bug to the list (linux-ide at vger.kernel.org), only later to 
discover that the problem was a marginal power supply.

> looking for a possibility to track down the problem without substantially
> interfering with the jobs on the cluster.

the sata developers hang out on linux-ide, and seem very responsive.
quite a lot of work has been done on exception handling, but as always,
it's the most common controllers which are best tested/supported.

> I tried several linux kernel versions (eg. 2.6.18.1, currently: 2.6.20.3
> from kernel.org) which seems to make no difference.

well, by kernel standards, 2.6.20.3 is fairly old; there have certainly been
plenty of SATA updates this year.

> I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
> the disk. This does not help either.

it wouldn't, unless you had a noise problem with the cable.

> NCQ is disabled:
> # cat  /sys/block/sda/device/queue_depth
> 1

such features wouldn't cause the fairly low-level hang in your logs - 
to me it looks like power, given that it appears to affect even the phy-level
disk interface.  it wouldn't hurt to see what SMART says about it (overall
health, attributes and even a self-test).  you might also try stressing the
disk with IO to see whether you can repeatably trigger the problem.
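
a rough sketch of what I mean, assuming smartmontools is installed and the
disk is /dev/sda (both are assumptions on my part, adjust to your setup).
stress_read is a made-up helper name, not a standard tool; it only reads,
so it should not touch your data, though it will compete with running jobs
for IO.  older smartmontools may need "-d ata" added for libata disks.

```shell
#!/bin/sh
# Hypothetical helper: read TARGET sequentially PASSES times (default 3),
# stopping at the first read error. TARGET can be a block device such as
# /dev/sda (an assumption) or a regular file.
stress_read() {
    target=$1
    passes=${2:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        # read-only: copy the target to /dev/null in 1 MiB blocks
        if ! dd if="$target" of=/dev/null bs=1024k 2>/dev/null; then
            echo "pass $i: read error on $target" >&2
            return 1
        fi
        echo "pass $i: ok"
        i=$((i + 1))
    done
}

# SMART checks (need root); left commented so nothing runs by accident:
# smartctl -H /dev/sda           # overall health verdict
# smartctl -A /dev/sda           # attributes (reallocated sectors, CRC errors, ...)
# smartctl -t short /dev/sda     # start a short self-test
# smartctl -l selftest /dev/sda  # read the self-test log a few minutes later
# stress_read /dev/sda 3         # then check dmesg for the same timeout
```

if the device is smaller than RAM, repeated passes will mostly be served
from the page cache, so for a real test read something larger than memory
or drop the cache between passes.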

regards, mark hahn.