[Beowulf] NFS Read Errors
Michael H. Frese
Michael.Frese at NumerEx.com
Tue Dec 4 07:54:24 PST 2007
Thanks for your helpful comments.
At 11:31 PM 12/3/2007, you wrote:
>>I am guessing you are using TCP NFS mounts as well? TCP forces
>>retries in the event of bad packets. UDP doesn't force this, but
>>the NFS protocol will
>UDP has a checksum as well, though it's only 16b. then again, the TCP
>checksum isn't all that strong for today's data rates either.
From reading the man page on nfs on the systems with the 2.4
kernels, it looks like the default for an nfs mount is udp. It also
looks like tcp is not really an option until nfs v4, so it may be
something to try on the 2.6 kernels that I have on some of my newer
machines at another site.
>you should definitely examine /proc/net/dev on involved machines.
I hadn't known about /proc/net/dev. When I check there, I see no
transmit errors on the server side and no receive errors on the
client side. That's odd, because the other thing I see is that the
average packet size received (bytes received divided by packets
received) on the client side is 3.9, while on the server side, the
average packet size sent is 1430. In other words, there are many
more packets received than there ought to be. That's very
fishy. It's probably the result of the way the packet count is done
and reported. I.e., it may be that all the received packets -- good
and bad -- are counted, but only the bytes in the good ones are
counted, with some similar problem on the server side. I think the
statistics are aggregate since the last boot, so they may not be just
from the troublesome tests I was performing, either.
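
Because the counters are aggregate since boot, one way to isolate a single test is to snapshot /proc/net/dev before and after it and work from the difference. The following is a minimal sketch of that idea, not a vetted tool. It assumes the usual layout of two header lines followed by one line per interface with 16 numeric columns (8 receive, 8 transmit), and the function names are hypothetical. If the byte counters on a given kernel are only 32 bits wide (worth checking), they can wrap while the packet counters have not, which would also distort a since-boot average; the sketch does not correct for wrap.

```python
def parse_net_dev(text):
    """Parse the text of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        # Split on the colon first: some kernels print "eth0:1234" with
        # no space between the interface name and the first counter.
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:                # assumed column layout
            continue
        values = [int(v) for v in fields]
        stats[name.strip()] = {
            "rx_bytes": values[0], "rx_packets": values[1], "rx_errs": values[2],
            "tx_bytes": values[8], "tx_packets": values[9], "tx_errs": values[10],
        }
    return stats


def counter_delta(before, after):
    """Per-interface difference between two parse_net_dev() snapshots."""
    return {
        iface: {key: after[iface][key] - before[iface][key]
                for key in after[iface]}
        for iface in after if iface in before
    }


def average_packet_size(nbytes, npackets):
    """Bytes per packet, or None when no packets were counted."""
    return nbytes / npackets if npackets else None
```

To use it, read `open("/proc/net/dev").read()` once before the md5sum run and once after, pass both snapshots through `parse_net_dev`, and feed the `rx_bytes` and `rx_packets` deltas for the interface of interest to `average_packet_size`.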
>I would attempt to reduce the complexity of your testing.
>for instance, can a node write and verify to its local disk
The local disk read seems rock solid in comparison to the NFS
one. The local md5sum produces the same result time after time,
which is just not the case for the remote.
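
The repeated-checksum comparison can be scripted so that the same loop runs against a local path and an NFS path. This is only a sketch, assuming a Python 3 interpreter is available on the client (which may not hold on older installations); `md5_of_file` and `digest_counts` are hypothetical names. Note that repeated reads on the client may be served from its cache rather than from the server, so the number of distinct digests is a lower bound on how often transfers go wrong.

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def digest_counts(path, repeats=10):
    """Hash the same file several times and count each distinct digest."""
    return Counter(md5_of_file(path) for _ in range(repeats))
```

A healthy read path yields a single key in the returned counter; more than one key means at least one read returned different bytes.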
>can it stream data over tcp sockets (netcat or the like) without
>corruption or obvious problems reflected
netcat is not on my systems. Looks like I have to get someone to
download and build it for me, and try the streaming tests you recommend.
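
Where netcat is unavailable, a rough stand-in for the streaming test can be written with plain TCP sockets: one side streams a file, the other hashes what arrives, and the two digests are compared. This is a sketch under assumptions, not a replacement for netcat. It assumes Python 3 on both machines, and the function names are hypothetical.

```python
import hashlib
import socket

CHUNK = 65536


def listen(port, host=""):
    """Open a listening TCP socket on the given port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def receive_once(listener):
    """Accept one connection; return (bytes received, MD5 hex digest)."""
    digest = hashlib.md5()
    total = 0
    conn, _addr = listener.accept()
    with conn:
        while True:
            block = conn.recv(CHUNK)
            if not block:               # sender closed the connection
                break
            digest.update(block)
            total += len(block)
    return total, digest.hexdigest()


def send_file(path, host, port):
    """Stream a file to host:port; return the MD5 hex digest of what was sent."""
    digest = hashlib.md5()
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            sock.sendall(block)
            digest.update(block)
    return digest.hexdigest()
```

On the receiving node, call `receive_once(listen(5001))` (the port number is arbitrary); on the sending node, call `send_file` on a file stored on local disk and compare the two digests. A mismatch here, with NFS out of the picture, would point at the network path rather than at NFS.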
>does ethtool tell you anything about the config of the nic?
Not on the 2.4 systems, though it seems to tell me a little on the 2.6's.
>comparing tcp vs udp NFS would be sensible
>as well - varying the packet size, too. switching client and/or
>server to a modern 2.6 kernel may be instructive.
Upgrading the kernel is probably the only way I'll get nfs over
tcp. Given that these systems are headed out the door, I'm not sure
that's a good use of our time. But it may be worth doing on our new
and newer systems.