[Beowulf] NFS Read Errors

Michael H. Frese Michael.Frese at NumerEx.com
Tue Dec 4 06:55:12 PST 2007


Joe,

Thanks for the suggestions.  Let me make some quick corrections.

At one point I knew about md5sum, but, as they say in Spanish, it 
forgot itself on me.

You are right about the data rate on the 32 bit PCI cards: I meant 300 Mbps.

As for the time for wire speed transmission of 1 GB, at 300 Mbps it 
is only about 30 seconds.  It turns out the biggest file I am dealing 
with is 400 MB, not 1 GB, and the local md5sum takes only 10 seconds, 
indicating that the disk-to-memory speed is at least 40 MBps, which 
is about what I expect from this hardware, and about equal to the 300 
Mbps ethernet speed on the single processor.  But the remote md5sum 
takes almost 6 minutes to get the wrong answer.
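
(For concreteness, the comparison boils down to something like the
following, where /data is the directory on the node that owns the disk
and /mnt/data is the NFS mount of it on the remote node -- both names
made up for illustration:

        # on the node with the local disk
        time md5sum /data/checkpoint.dat

        # on the remote node, reading the same file over NFS
        time md5sum /mnt/data/checkpoint.dat

The digests should match, and the remote run should finish in roughly
the wire-speed time; here the remote one is both slow and wrong.)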

The problem with the disk-system or memory hypotheses is that the 
local md5sum is consistent, and fast.

There are no unexpected messages in /var/log/messages, and there is 
no /var/log/syslog.

The only thing I haven't checked outside the box is the cable, so I 
will do that, but it seems unlikely.
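
Before swapping the cable, I can at least look at the link negotiation
and the error counters on the suspect box, something like the following
(assuming the interface is eth0 and ethtool is installed):

        ethtool eth0        # negotiated speed/duplex and link status
        cat /proc/net/dev   # per-interface error and drop counters

If the error or drop columns climb while a copy is running, that would
point at the cable or NIC rather than at NFS itself.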

And yes, these boxes are old, but they have served me well, and my 
replacements won't be up and running till the end of the month.  I 
also was hoping to find a better configuration choice, if there is one.


Mike




At 06:21 PM 12/3/2007, Joe Landman wrote:
>Hi Michael:
>
>Michael H. Frese wrote:
>>We were having trouble restarting from our homegrown parallel 
>>magnetohydrodynamic code's checkpoint files.  The files could be 
>>read, but funny things happened in the run afterward.  Eventually 
>>we figured out that the restarted parallel run differed from the 
>>serial restarted run from the same checkpoint.
>>After much gnashing of teeth and rending of apparel, we found that 
>>the checkpoint files were being read incorrectly across NFS.  That 
>>let us simplify our search for the problem.  We first found that 
>>the local md5 digest [openssl dgst -md5 (file...)] on an NFS cp'ed 
>>version of the file
>
>         md5sum filename
>
>does the same thing with a slightly simpler syntax.  There is 
>mounting evidence that you should use sha1sum rather than md5sum.
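
Fair enough.  For what it's worth, any of the three will do for
spotting a corrupted copy; each just prints a digest of the file
(checkpoint.dat is a stand-in name here):

        md5sum checkpoint.dat
        sha1sum checkpoint.dat
        openssl dgst -md5 checkpoint.dat    # what we had been using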
>
>>was different from that produced on the original file.  What was 
>>interesting was that the copy either took forEVER -- like 10 or 20 
>>minutes for a 1 GB file -- when the final result was bad, or it took 
>>about a minute when the file was perfect.  I'm guessing that whatever 
>>error checking gets done on the packets was rejecting so many that it 
>>finally got a bad packet it couldn't tell was bad.
>
>Sounds a great deal like a bad disk/disk system or something mucking 
>with your connection to the data.  A 1 GB file, even at 1 MB/s, takes 
>1000 seconds, or about 16 minutes.  If you have a disk which keeps 
>timing out, or has bad blocks, and keeps retrying, well, stuff like 
>this can happen, especially on old kernels (and old hardware).
>
>Could also be a RAM error.
>
>>When we found that doing the md5 digest on a remote file produced a 
>>different result than doing it on the processor on which the disk 
>>was mounted, our tests got simpler.  And shorter, still, after we 
>>found that we could get fairly frequent failures with 10 MB files or smaller.
>>Clearly we had an NFS failure, probably associated with hardware.
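
The small-file test amounted to something like this -- names are
illustrative, with /data exported from the server and mounted at
/mnt/data on the client:

        # on the server: make a 10 MB file of random data and digest it
        dd if=/dev/urandom of=/data/test.bin bs=1024k count=10
        md5sum /data/test.bin

        # on the client: read the same file over NFS and compare
        md5sum /mnt/data/test.bin

Repeating that with fresh files turned up mismatches often enough to
make the failure easy to reproduce.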
>
>Yes.  I would venture a guess that you are seeing *lots* of errors 
>in your /var/log/syslog or /var/log/messages files.
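
As noted above, nothing unusual shows up.  The sort of thing I looked
for was along these lines:

        grep -iE 'nfs|rpc|eth' /var/log/messages
        dmesg | grep -iE 'nfs|eth'

and both come up clean on these machines.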
>
>
>>This was all between two specific nodes of our small cluster.  [Old 
>>hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual 
>>(Tyan...) chip motherboards, both running Red Hat 9 with 2.4.20-8 
>>kernels, though one is the smp version; NetGear GA311 NICs; and a
>
>Owie...
>
>>NetGear GS108 8 port Copper 1 GB/s switch.  The single processor 
>>motherboards have 32-bit PCI slots so their network speeds are 
>>limited to 300 kbps as shown by netpipe.  All of the LEDs at the 
>>ends of the cables show 1000Mb connections.]
>
>300 kbps?  That's 300 kilobits per second (abbreviations are *very* 
>important to get right; kB/s is not the same as kb/s), or about 
>37.5 kB/s -- roughly the speed of a typical DSL line.
>
>I hope you mean 30 MB/s (or 240 Mb/s).
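
For reference, the netpipe number came from the usual two-ended NPtcp
run, roughly:

        # on the receiving node
        NPtcp
        # on the transmitting node (use the receiver's hostname)
        NPtcp -h receiver-node

which reports throughput in Mbps over a range of message sizes -- so
yes, as I said at the top, I meant 300 Mbps.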
>
>>Then we started checking other pairs.  Some were fine.  Some were 
>>bad in the same way.  So we replaced the switch, changing to a 16 
>>port NetGear GS216.  That seemed to cure most of the problem.  But 
>>we continued to
>
>We have seen bad switches a few times.
>
>>have problems copying a file on one particular single processor 
>>machine from the others.
>>That's where we are now.  The md5 digest run on that machine 
>>consistently shows the same result, whereas the digest for that 
>>file produced on a remote machine will be almost stochastic.  In 
>>some cases it will eventually settle in to the right answer, and 
>>then the speed goes WAY up.  I suppose that happens because the 
>>file request can be served from the local machine's cache.  But why 
>>doesn't it happen after it received bad blocks?
>
>I am guessing you are using TCP NFS mounts as well?  TCP forces 
>retries in the event of bad packets.  UDP doesn't force this, but 
>the NFS protocol will still retry.  RAM errors, bad cables, burnt 
>switches, and machines with interrupt problems (old machines often 
>shared interrupts without handling it very well) can all produce 
>symptoms like this.
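
For anyone following along, the transport is picked by the mount
option; a client /etc/fstab line along the lines of

        server:/data  /mnt/data  nfs  rw,hard,intr,tcp  0 0

asks for TCP, while udp (the default on these old 2.4 kernels, as far
as I know) selects UDP, and nfsstat -m on the client shows what is
actually in effect.  The names above are illustrative.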
>
>>Most, if not all, of the original network cards in those machines 
>>went bad and have been replaced in the last few years, so I decided 
>>to try a brand new GA311.  No joy there.  It still gives out the 
>>wrong info.  I guess the motherboard PCI bus controller is hinky, 
>>but I'm far from sure.
>
>Did you try a new cable?  We've had a few cables go bad; usually 
>they are marginal to begin with.
>
>>We are in the process of upgrading and thus replacing all the 
>>machines we have of that configuration due to space limitations and 
>>their age, but I'm still curious what the problem could be.
>
>There are quite a few possibilities unfortunately.  Unless you plan 
>to use these existing machines for quite a while longer, it might be 
>less painful to shut off the malfunctioning node.
>
>>Suggestions?  Comments?
>
>2.4.20?  Athlons?  I would say a serious hardware/OS refresh is in order :)
>
>
>
>--
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web  : http://www.scalableinformatics.com
>        http://jackrabbit.scalableinformatics.com
>phone: +1 734 786 8423
>fax  : +1 866 888 3112
>cell : +1 734 612 4615



