[Beowulf] Solved: SATA(?) errors locks up node


Gebhardt Thomas gebhardt at hrz.uni-marburg.de
Mon Jul 2 07:05:34 PDT 2007


Hello,

thank you all for your advice! 
After a firmware upgrade of the SATA disks (to 20.06C06) we have had no
further incidents so far. So I'm pretty sure that we have caught the bug.

Thanks again, Th. Gebhardt

On Wednesday 23 May 2007 11:13, Gebhardt Thomas wrote:
> we are running a cluster of 57 dual Opteron nodes. Once or twice a week
> one of these nodes gets into an error state and can't connect to the
> I/O subsystem anymore. I need to reboot that node. As far as I can see,
> the problem occurs randomly at any of our nodes, i.e., the MTBF of a single
> node is about 6-12 months.
>
> I still don't know whether this is a problem of the Linux kernel SATA
> driver, a hardware problem, a flaw of the disk firmware or something else.
> I'm looking for a possibility to track down the problem without
> substantially interfering with the jobs on the cluster.
>
> This is our environment:
> TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
> 2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
> Western Digital 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev.
> 20.06C03
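
The per-node MTBF figure quoted above can be checked with a little
arithmetic. The following is only a rough sketch: it assumes that
failures are spread evenly over all 57 nodes and that a month is
52/12 weeks. The function name per_node_mtbf_months is made up for
this illustration.

```python
# Rough check of the per-node MTBF estimate from the quoted post.
# Assumptions (not from the original post): failures are spread evenly
# across all nodes, and one month is taken as 52/12 weeks.

WEEKS_PER_MONTH = 52.0 / 12.0


def per_node_mtbf_months(nodes, failures_per_week):
    """Return the mean time between failures of a single node, in months."""
    # With N nodes and f failures per week cluster-wide, each node
    # fails on average once every N / f weeks.
    node_weeks_per_failure = nodes / failures_per_week
    return node_weeks_per_failure / WEEKS_PER_MONTH


# "Once or twice a week" on a 57-node cluster:
low = per_node_mtbf_months(57, 2)   # two lockups per week
high = per_node_mtbf_months(57, 1)  # one lockup per week
print(f"per-node MTBF: about {low:.1f} to {high:.1f} months")
# prints "per-node MTBF: about 6.6 to 13.2 months"
```

This agrees with the "about 6-12 months" quoted above: a single node
fails rarely, which is why the problem is hard to reproduce on any one
machine even though the cluster as a whole sees it every week.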