[Beowulf] serious NFS problem on Mandrake 10.0

David Mathog mathog at mendel.bio.caltech.edu
Fri Dec 3 16:07:38 PST 2004


> 
> >   cp /tmp/SAVELASTMEGABLAST.txt /tmp/TESTLAST.txt
> >   mv /tmp/TESTLAST.txt ./TESTLAST.txt.$NODE
> >   set `md5sum TESTLAST.txt.$NODE`
> >   NEWMD=$1
> >   /bin/rm ./TESTLAST.txt.$NODE
> >   if [ "$NEWMD" != "$HOLDMD" ] 
> 
> hmm. you're doing both the writes and reads from the slave node here.
> was that part of your original description?  I'm wondering about 
> bad writes vs bad reads.  what happens if you run the md5sum on 
> the master instead?

It was originally found as corrupted data on the master.  Then
it was confirmed that the data looked corrupted from the slave too,
so the script ran entirely on the slave.  You do have a point
though: presumably the slave rereads the data back across
the net for the md5sum, so there are two passes where it could
go wrong, and the script didn't check whether the corruption
was the same each time.
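
A rough sketch of how the test could tell the two passes apart and
record what kind of damage occurred.  The function name classify_copy
and its output words are made up for illustration; this is not the
script that was actually run.  It assumes md5sum and cmp are available
on the node.

```shell
#!/bin/sh
# Hypothetical helper: compare a copy against a known-good reference.
#   $1 = reference file (e.g. the local /tmp original)
#   $2 = copy to test   (e.g. the file on the NFS mount)
# Prints "ok", "read-unstable", or "mismatch N" (N = differing bytes).
classify_copy() {
    ref=$1
    copy=$2
    want=$(md5sum < "$ref" | cut -d' ' -f1)
    # Read the copy twice; differing sums point at the read path.
    first=$(md5sum < "$copy" | cut -d' ' -f1)
    second=$(md5sum < "$copy" | cut -d' ' -f1)
    if [ "$first" != "$second" ]; then
        echo "read-unstable"
        return 1
    fi
    if [ "$first" != "$want" ]; then
        # cmp -l prints one line per differing byte; count them so
        # separate failures can be compared with each other.
        ndiff=$(cmp -l "$ref" "$copy" 2>/dev/null | wc -l | tr -d ' ')
        echo "mismatch $ndiff"
        return 1
    fi
    echo "ok"
}
```

In the loop this would be called as
classify_copy /tmp/SAVELASTMEGABLAST.txt ./TESTLAST.txt.$NODE
before the rm.  Two matching reads on the slave may both come from the
client's cache, so agreement there does not prove the server's copy is
good; running the same check on the master covers that case.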

I poked around in the kernel.org bugzilla, and the report below
sounds like it may be the same or a closely related problem.  If so,
it's still around in 2.6.9:

http://bugzilla.kernel.org/show_bug.cgi?id=3608

I'll try some of your suggested changes next week - not the sort of
thing to attempt late on a Friday...

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


