[Beowulf] NFS Read Errors

Michael H. Frese Michael.Frese at NumerEx.com
Tue Dec 4 07:54:24 PST 2007


Thanks for your helpful comments.

At 11:31 PM 12/3/2007, you wrote:
>>I am guessing you are using TCP NFS mounts as well?  TCP forces 
>>retries in the event of bad packets.  UDP doesn't force this, but 
>>the NFS protocol will
>UDP has a checksum as well, though it's only 16b.  then again, the TCP
>checksum isn't all that strong for today's data rates either.

From reading the man page on nfs on the systems with the 2.4 
kernels, it looks like the default for an nfs mount is udp.  It also 
looks like tcp is not really an option until nfs v4, so it may be 
something to try on the 2.6 kernels that I have on some of my newer 
machines at another site.
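
A quick way to see which transport each NFS mount is actually using is to read the option list in /proc/mounts. What follows is only a sketch in Python. The function name nfs_transports is made up for illustration, and it assumes the kernel reports the transport either as a bare "tcp"/"udp" option or as "proto=...". If neither appears, it reports "unknown" rather than guessing at the default.

```python
def nfs_transports(mounts_text):
    """Map each NFS mount point to 'tcp', 'udp', or 'unknown'.

    mounts_text is the contents of /proc/mounts (or text in the same
    format: device, mount point, fs type, comma-separated options, ...).
    """
    found = {}
    for line in mounts_text.splitlines():
        parts = line.split()
        # Only look at NFS client mounts; skip malformed lines.
        if len(parts) < 4 or parts[2] not in ("nfs", "nfs4"):
            continue
        mount_point = parts[1]
        proto = "unknown"
        for opt in parts[3].split(","):
            if opt in ("tcp", "udp"):
                # Assumed older style: transport shown as a bare option.
                proto = opt
            elif opt.startswith("proto="):
                # Assumed newer style: transport shown as proto=<name>.
                proto = opt.split("=", 1)[1]
        found[mount_point] = proto
    return found
```

On a live client this would be called as nfs_transports(open("/proc/mounts").read()).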

>you should definitely examine /proc/net/dev on involved machines.

I hadn't known about /proc/net/dev.  When I check there, I see no 
transmit errors on the server side and no receive errors on the 
client side.  That's odd, because the other thing I see is that the 
average packet size received (bytes received divided by packets 
received) on the client side is 3.9, while on the server side, the 
average packet size sent is 1430.  In other words, there are many 
more packets received than there ought to be.  That's very 
fishy.  It's probably the result of the way the packet count is done 
and reported.  I.e., it may be that all the received packets -- good 
and bad -- are counted, but only the bytes in the good ones are 
counted, with some similar problem on the server side.  I think the 
statistics are aggregate since the last boot, so they may not be just 
from the troublesome tests I was performing, either.
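
Because the counters are cumulative, the average is more meaningful when taken over the difference between two snapshots, one before a test and one after. Here is a minimal sketch in Python. The function names are hypothetical, and it assumes the usual /proc/net/dev layout of eight receive columns followed by eight transmit columns. If the counters are only 32 bits wide on these kernels (an assumption, not something checked here), the byte count could wrap long before the packet count does, so the sketch returns None when a delta goes negative instead of reporting a misleading number.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {interface: counter dict}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = [int(f) for f in rest.split()]
        if len(fields) < 11:
            continue  # not the expected column layout
        stats[name.strip()] = {
            "rx_bytes": fields[0], "rx_packets": fields[1], "rx_errs": fields[2],
            "tx_bytes": fields[8], "tx_packets": fields[9], "tx_errs": fields[10],
        }
    return stats


def mean_packet_size(before, after, direction="rx"):
    """Mean bytes per packet for one interface between two snapshots.

    direction is "rx" or "tx". Returns None if no packets were counted
    or if a counter went backwards (wrapped or reset).
    """
    d_bytes = after[direction + "_bytes"] - before[direction + "_bytes"]
    d_packets = after[direction + "_packets"] - before[direction + "_packets"]
    if d_packets <= 0 or d_bytes < 0:
        return None
    return d_bytes / d_packets
```

Usage would be to read /proc/net/dev once before the test and once after, parse both, and compare mean_packet_size on the client's "rx" side with the server's "tx" side for the same interval.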

>I would attempt to reduce the complexity of your testing.
>for instance, can a node write and verify to its local disk
>without problem?

The local disk read seems rock solid in comparison to the NFS 
one.  The local md5sum produces the same result time after time, 
which is just not the case for the remote.
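
The repeated-checksum comparison can be sketched in Python roughly as follows. The function names and the default of five runs are illustrative choices, not the procedure actually used. One caveat: repeated reads of the same file may be served from the client's cache instead of going over the wire, so this only exercises the network if the file is large compared with memory or the cache is cleared between runs.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, runs=5):
    """Checksum the same file several times and count each digest seen.

    A healthy read path gives a dict with exactly one key; more than
    one key means at least one read returned different bytes.
    """
    counts = {}
    for _ in range(runs):
        d = md5_of_file(path)
        counts[d] = counts.get(d, 0) + 1
    return counts
```

Running repeated_digests on the same file through its local path and through its NFS path gives a direct side-by-side comparison.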

>can it stream data over tcp sockets (netcat or the like) without 
>corruption or obvious problems reflected
>in /proc/net/dev?

netcat is not on my systems.  Looks like I have to get someone to 
download and build it for me, and try the streaming tests you recommend.
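
Where netcat is unavailable but Python is, a rough stand-in for the streaming test could look like the sketch below. All function names are hypothetical. The sender streams random bytes over a TCP socket while hashing them, the receiver hashes what arrives, and the two digests are compared. The loopback_check helper runs both ends on 127.0.0.1 only to show that the pieces fit together; it does not touch the NIC. For a real test the receiver would run on one machine and send_stream on the other, with /proc/net/dev checked before and after.

```python
import hashlib
import os
import socket
import threading


def receive_digest(server_sock, result):
    """Accept one connection, hash everything received, store the result."""
    conn, _ = server_sock.accept()
    digest = hashlib.md5()
    total = 0
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break  # sender closed the connection
            digest.update(data)
            total += len(data)
    result["digest"] = digest.hexdigest()
    result["bytes"] = total


def send_stream(host, port, n_bytes, chunk_size=65536):
    """Send n_bytes of random data to host:port; return its MD5 digest."""
    digest = hashlib.md5()
    with socket.create_connection((host, port)) as s:
        sent = 0
        while sent < n_bytes:
            chunk = os.urandom(min(chunk_size, n_bytes - sent))
            s.sendall(chunk)
            digest.update(chunk)
            sent += len(chunk)
    return digest.hexdigest()


def loopback_check(n_bytes=1 << 20):
    """Run sender and receiver on 127.0.0.1 and compare digests."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_digest, args=(server, result))
    receiver.start()
    sent_digest = send_stream("127.0.0.1", port, n_bytes)
    receiver.join()
    server.close()
    return sent_digest == result["digest"], result["bytes"]
```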

>does ethtool tell you anything about the config of the nic?

Not on the 2.4 systems, though it seems to tell me a little on the 2.6's.

>comparing tcp vs udp NFS would be sensible
>as well - varying the packet size, too.  switching client and/or 
>server to a modern 2.6 kernel may be instructive.

Upgrading the kernel is probably the only way I'll get nfs over 
tcp.  Given that these systems are headed out the door, I'm not sure 
that's a good use of our time.  But it may be worth doing on our new 
and newer systems.

Thanks again!

