[Beowulf] serious NFS problem on Mandrake 10.0

Fri Dec 3 13:05:01 PST 2004

I'm seeing a serious NFS problem with MDK 10.0
(using a 2.6.8-1 kernel.org kernel).  All of the
machines run the same OS version.

In a 20-node cluster each node NFS-mounts /u1 from
the master.  Each node runs a calculation and generates
a file in /tmp of about 26000 lines, coming to about
1.3 MB (both the number of lines and the total size
vary a little). When the calculation completes, the
process on each end node does:

  mv /tmp/blah.$NODENAME . 
  mv -f /tmp/blah.$NODENAME /tmp/SAVEblah

The home directory (".") is a couple of
levels down under /u1, so this effectively performs
a network copy from /tmp on the compute node to /u1
on the master node.  The copies are largely asynchronous
since the end nodes complete at various times.
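
One way to catch a bad copy at the time it happens would be
to copy first, compare checksums, and only then remove the
original. The following is a minimal sketch of that idea,
not something the nodes currently run. It assumes Python is
available on the nodes; the names sha256_of and
copy_and_verify are made up for illustration. Note that
reading the copy back on the same node may be satisfied
from that node's own cache, so computing the checksum of
the /u1 copy on the master is likely the stronger test.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files are not held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then report whether the two digests match.

    The source file is left in place, so the caller can retry
    the copy or keep the original if this returns False.
    """
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

The digest from sha256_of could also be written next to the
file on the compute node and recomputed on the master after
the run, which avoids relying on the client-side read.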

On the master node, one of the copied files occasionally
(defined as: 1 bad line, out of 20 files, every
3rd or 4th run) contains a very long bad line.

Here are four lines from the original file on /tmp:

'4827135'=='-22004070' (3254 9815 3391 9675) 22
'4827135'=='-22004070' (75050 11805 75081 11774) 0
'4827086'=='-22004070' (79588 9817 79809 9594) 28
'4827086'=='-22004070' (34069 11794 34308 11555) 34

Here are the four lines from the copy on /u1:

'4827135'=='-22004070' (3254 9815 3391 9675) 22
'4827135'=='-22004070' (75050 11805 75081 11774) 0
'4827086'=='-22004070' (79588 9817 798<NUL>(MANY times)<NUL>1
'4156131'=='+22004070' (58122 9687 58250 9818) 11

The final line on /u1 does appear in /tmp, but much, much
farther into the file. I very carefully cut out the missing
text from the original file, pasted it into a new file, and found:

% wc deleted.txt
  642  3849 32769 deleted.txt

So it looks like a block of 32768 bytes was lost
(+1 probably for an extra EOL in my deleted.txt file)
during the mv operation and all bytes replaced
with <NUL>.  On repeated runs on the same data (same
output files each time) the problem line never occurs
twice in the same place, and it hops from node to node,
suggesting that it's a rare event somewhere in the
data transport (mv) operation.
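
For anyone who wants to check their own files for the same
kind of damage, here is a minimal sketch that reports every
long run of <NUL> bytes in a file, with its starting offset
and length. It assumes Python is available and that the
file fits in memory (true for these 1.3 MB files); the
function name find_nul_runs and the min_len threshold of
512 bytes are arbitrary choices for illustration.

```python
import re


def find_nul_runs(path, min_len=512):
    """Return a list of (offset, length) for each run of NUL
    bytes in the file that is at least min_len bytes long."""
    with open(path, "rb") as f:
        data = f.read()
    # Match min_len or more consecutive NUL bytes.
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start())
            for m in pattern.finditer(data)]
```

A file with the damage described above should produce one
entry whose length is 32768. Checking whether the offset is
a multiple of 32768 (offset % 32768 == 0) might also show
whether the lost block sits on a 32 KiB boundary.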

This is very, very, VERY bad. 

No relevant messages show up in /var/log/messages.
/u1 is /dev/sde1 and smartctl -a on that device shows
no errors. On the master /u1 is in /etc/fstab as:

LABEL=usrdisk           /u1             ext2    defaults,quota  1 2

and is exported as:

/u1              *.cluster(rw,no_root_squash)

Has anybody else seen this bug?

Is there a patch for it?  Possibly relevant software:

coreutils-5.1.2-1mdk         #/bin/mv
nfs-utils-clients-1.0.6-1mdk #nfs client
kernel 2.6.8-1               #kernel.org

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech