[Beowulf] SATA(?) errors locks up node
Mark Hahn hahn at mcmaster.ca
Wed May 23 08:37:12 PDT 2007
> I still don't know whether this is a problem of the linux kernel sata driver,
> a hardware problem, a flaw of the disk firmware or something else. I'm

the logs show that a command times out, and defies recovery. I don't think
your chipset is the most common - is the SATA controller integrated, or
something like a Promise chip?

do you have any guess about whether your disks are getting enough power?
it seems to be fairly common for people to report this kind of
"stops working" bug to the list (linux-ide at vger.kernel.org), only later
to discover that the problem was a marginal power supply.

> looking for a possibility to track down the problem without substantially
> interfering with the jobs on the cluster.

the sata developers hang out on linux-ide, and seem very responsive.
quite a lot of work has been done on exception handling, but as always,
it's the most common controllers which are best tested/supported.

> I tried several linux kernel versions (eg. 2.6.18.1, currently: 2.6.20.3
> from kernel.org) which seems to make no difference.

well, by kernel standards, 2.6.20.3 is fairly old; there have certainly
been plenty of SATA updates this year.

> I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
> the disk. This does not help either.

it wouldn't, unless you had a noise problem with the cable.

> NCQ is disabled:
> # cat /sys/block/sda/device/queue_depth
> 1

such features wouldn't cause the fairly low-level hang in your logs -
to me it looks like power, given that it appears to affect even the
phy-level disk interface.

it wouldn't hurt to see what SMART says about the disk (health, metrics
and even a self-test). you might also try stressing the disk with IO
to see whether you can repeatably trigger the problem.

regards, mark hahn.
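The two suggestions that close the message (ask SMART about the disk, then stress it with IO) could be sketched roughly as below. This is only an illustration, not something from the thread: the helper names `read_stress` and `smart_commands` are made up here, the 1-second "slow read" threshold is an arbitrary assumption, and `/dev/sda` is taken from the sysfs path quoted above. The `smartctl` options used (`-H`, `-A`, `-t short`) come from smartmontools, which is assumed to be installed.

```python
import time


def read_stress(path, chunk_size=1 << 20, max_bytes=None, slow_threshold=1.0):
    """Sequentially read `path` and return (bytes_read, slow_reads).

    slow_reads holds (offset, seconds) for every chunk that took longer
    than slow_threshold seconds; a command that times out in the driver
    would be expected to show up here as a very slow read.
    """
    bytes_read = 0
    slow_reads = []
    # buffering=0 disables Python-level buffering only; the kernel page
    # cache still applies, so repeated runs may not touch the disk.
    with open(path, "rb", buffering=0) as dev:
        while max_bytes is None or bytes_read < max_bytes:
            want = chunk_size
            if max_bytes is not None:
                want = min(chunk_size, max_bytes - bytes_read)
            start = time.monotonic()
            data = dev.read(want)
            elapsed = time.monotonic() - start
            if not data:
                break  # end of device or file
            if elapsed > slow_threshold:
                slow_reads.append((bytes_read, elapsed))
            bytes_read += len(data)
    return bytes_read, slow_reads


def smart_commands(device):
    """Build smartctl invocations for health, attributes and a short self-test."""
    return [
        ["smartctl", "-H", device],           # overall health assessment
        ["smartctl", "-A", device],           # vendor attribute table
        ["smartctl", "-t", "short", device],  # start a short self-test
    ]
```

Run as root, something like `read_stress("/dev/sda", max_bytes=10 * 2**30)` while watching the kernel log would show whether plain sequential reads are enough to provoke the timeout; each list from `smart_commands("/dev/sda")` can be passed to `subprocess.run`. Reads are non-destructive, but on a production node they will still compete with running jobs for disk bandwidth.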
