Unexplained I/O errors


Steven Timm timm at fnal.gov
Tue Jul 17 08:19:01 PDT 2001


Hi everyone,

We are currently burning in a new cluster and seeing the following
problem:

We see a number of files, usually contiguous in the same directory,
that ls will list as being there, but ls -l will show Input/output error.
An fsck of the system gets rid of the I/O errors but also gets
rid of the files.  There is no error message on the console, nor
in /var/log/messages, to indicate any disk controller problems.
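
The symptom above (names that ls will list, but that fail under ls -l)
can be searched for across a whole tree with a short script. What follows
is only a sketch, assuming Python is available on the nodes; the function
name find_unstatable is made up for illustration and is not part of any
existing tool. It walks a directory tree and reports every entry whose
name is returned by the directory listing but whose lstat() call fails
with EIO, which is the error ls -l prints as "Input/output error".

```python
import errno
import os


def find_unstatable(root):
    """Return paths that are listed in a directory but fail lstat() with EIO.

    Hypothetical helper: it mirrors the difference between plain ls (reads
    names only) and ls -l (must stat every entry).
    """
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk gets names from the directory itself; lstat reads the inode.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
            except OSError as exc:
                # Only collect genuine I/O errors, not e.g. permission problems.
                if exc.errno == errno.EIO:
                    bad.append(path)
    return bad
```

Running find_unstatable on the system disk's mount point on each node and
comparing the counts from day to day would give a rough measure of how
fast the problem spreads. It only reads metadata, so it should not make
the damage worse, but that is an assumption rather than something tested
on the affected machines.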

The problem appears to get worse over time: over a period of a few
days, the majority of our 136 machines exhibit these errors.

Our configuration:  Supermicro 370DLE motherboard, 2x1000MHz Pentium III,
512 MB RAM, Seagate system disk (30 GB) and CD-ROM on IDE primary,
2x40GB IBM drives on IDE secondary.
hda: ST330620A, ATA DISK drive
hdb: CD-ROM 48X/AKH, ATAPI CDROM drive
hdc: IC35L040AVER07-0, ATA DISK drive
hdd: IC35L040AVER07-0, ATA DISK drive

I/O errors happen only on the system disk.

We swapped out a large number of IDE cables for the system disk,
replacing them with a better grade, with no luck.

We have downgraded a few machines to the 2.2.16 kernel, and this
appears to be OK, but it is a bit early to tell.

We have also pulled the CD-ROM drives off of a few machines, and these
also appear to be stable, but we need more data yet.

Any idea what could be causing all of this?

Steve



------------------------------------------------------------------
Steven C. Timm (630) 840-8525  timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations




