Writing/Reading Files

Donald Becker becker at scyld.com
Mon May 13 10:54:22 PDT 2002


On Mon, 13 May 2002, Wheeler.Mark wrote:

> We have a problem writing and then reading files across the nodes.
...
> Following completion of the production code, I run a routine that
> joins up the individual files into one large file. What I discovered is
> that some of these files created by the production code were corrupt
> (i.e. they contained extraneous bytes) which prevented my
> post-processing job from completing.
>
> It seems to me that this problem is somehow related to NFS mounted
> disks and file transfers perhaps under memory load (i.e. even though my
> production code completes BEFORE I execute the rcp).

Are you using NFS v2 or v3?
What network hardware are you using?
Are you seeing any network errors reported in /proc/net/dev?
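
A minimal sketch of checking those counters on a node, assuming the usual
/proc/net/dev layout of two header lines followed by one line per interface
with 8 receive and 8 transmit counters. The names parse_net_dev and
error_report are illustrative only; check the header lines on your own
kernel before trusting the column order.

```python
import os


def parse_net_dev(text):
    """Return {interface: {counter: value}} from the text of /proc/net/dev."""
    rx = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
    tx = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]
    names = ["rx_" + f for f in rx] + ["tx_" + f for f in tx]  # assumed column order
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        # Split on the colon: some kernels print "eth0:1234" with no space.
        iface, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        stats[iface.strip()] = dict(zip(names, values))
    return stats


def error_report(stats):
    """Keep only the interfaces that show nonzero error-related counters."""
    watched = ("rx_errs", "rx_drop", "rx_fifo", "rx_frame",
               "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier")
    report = {}
    for iface, counters in stats.items():
        bad = {k: counters[k] for k in watched if counters.get(k, 0)}
        if bad:
            report[iface] = bad
    return report


if __name__ == "__main__" and os.path.exists("/proc/net/dev"):
    with open("/proc/net/dev") as f:
        print(error_report(parse_net_dev(f.read())))
```

Running this on each node before and after a production run, and comparing
the two reports, shows whether the error counters grew during the run.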

My first guess would be that the data is being corrupted in memory.

The next most likely problem is that you are using network hardware that
computes and checks the UDP/IP packet checksum in the NIC, rather than
having the CPU compute the checksum.
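
For reference, a sketch of the 16-bit ones' complement sum that the
checksum is built on (the algorithm described in RFC 1071). This is an
illustration only: the real UDP checksum also covers a pseudo-header, which
is omitted here, and the function name internet_checksum is made up for the
example. One reading of the concern above is that a checksum computed on
the NIC cannot catch corruption that happens between host memory and the
card.

```python
def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# The sum is order-independent, so swapped 16-bit words go undetected.
a = internet_checksum(b"\x00\x01\xf2\x03")
b = internet_checksum(b"\xf2\x03\x00\x01")
print(a == b)
```

This is a weak check by design, so it is worth knowing where in the path it
is actually computed.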

It's also possible that your disk hardware is corrupting writes during
heavy PCI bus usage.
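
One way to narrow down which of these stages is at fault is to record a
digest of each output file on the node that wrote it, and compare it with a
digest taken after the file has been copied or read over NFS. The sketch
below assumes Python's standard hashlib module; the helper names
file_digest and find_mismatches are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(before, after):
    """Given {name: digest} maps taken at two stages, list names that differ.

    A name missing from `after` is also reported as a mismatch.
    """
    return sorted(name for name, digest in before.items()
                  if after.get(name) != digest)
```

A mismatch between the writing node and the destination points at the
transfer path. If a digest taken on the writing node changes when the same
file is read again later, memory or disk on that node is more suspect.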

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



