[Beowulf] NFSv3 client hangs - tcp v/s udp.
asingh at ideeinc.com
Thu May 4 09:57:32 PDT 2006
Have you enabled jumbo frames on your network? The nfs man page
has a prominent warning against running NFS over UDP.
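
Before comparing transports, it may help to confirm which one each NFS mount is actually using, since the mount options in effect can differ from what was intended. The following is only a rough sketch: it parses text in the /proc/mounts layout and reports the transport per NFS mount point. The function name and the sample lines (host name "head", option values) are made up for illustration, and the exact option spelling varies by kernel version, so both the bare "tcp"/"udp" form and the "proto=" form are accepted.

```python
def nfs_mount_transports(mounts_text: str) -> dict:
    """Map each NFS mount point to the transport named in its options.

    Expects text in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    Returns "unknown" when no transport option is listed.
    """
    transports = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Only look at NFS client mounts; skip everything else.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        transport = "unknown"
        for opt in fields[3].split(","):
            # Accept both the bare form (tcp/udp) and the proto= form.
            if opt in ("tcp", "udp"):
                transport = opt
            elif opt.startswith("proto="):
                transport = opt.split("=", 1)[1]
        transports[fields[1]] = transport
    return transports


# Hypothetical sample; on a real node, read /proc/mounts instead.
SAMPLE_MOUNTS = """\
head:/home /home nfs rw,vers=3,hard,proto=udp,timeo=11,addr=10.0.0.1 0 0
head:/usr/local /usr/local nfs rw,v3,hard,tcp,lock,addr=head 0 0
/dev/sda1 / ext3 rw 0 0
"""

for mount_point, transport in nfs_mount_transports(SAMPLE_MOUNTS).items():
    print(mount_point, transport)
```

On a worker node the same function could be fed open("/proc/mounts").read() in place of the sample text.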
Amitoj G. Singh wrote:
>o 648 single processor Intel P4 worker nodes.
>o single head-node, NFSv3 server
>o OS - RedHat EL 4, kernel 2.6.12
>o Torque 2
>o Maui 3.2
>o all worker nodes NFS mount /home, /usr/local
>After upgrading from Red Hat 7.1 to Red Hat EL 4 we realized that
>1 in 10 user jobs was failing because a worker node NFS mount point
>stopped responding. The NFS mount points on the worker nodes would become
>unresponsive during heavy NFS I/O. A simple "netstat -t" on the
>head-node showed that there were thousands of open TCP nfs sockets on the
>head-node. Worker nodes that had frozen NFS mount points responded with
>the following error message:
>nfs_statfs: error no = 512
>The above error should be handled in kernel space but was somehow
>being reported in user space. The kernel should have handled the
>NFS timeout and reconnected transparently to the user. We realized that NFS
>v3 defaults to TCP if no transport is given explicitly at mount time. The only
>solution for a worker node with a frozen NFS mount point was to reboot the
>node. A "remount" works but you need to stop all services using the NFS
>We recently switched all our NFS mounts to use udp and have had no worker
>nodes with failing or unresponsive NFS mount points.
>Thought I would share this bit of experience with the list. Interestingly
>while googling we did not find a lot of chatter about this issue.
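
The netstat check described in the quoted message can be automated in a small way. What follows is a minimal sketch, not the poster's tooling: it counts TCP sockets on the NFS port, grouped by connection state, from text in the layout that netstat prints. It assumes the server listens on the standard port 2049 (shown as "nfs" when netstat resolves service names); the function name and the sample addresses are invented for illustration.

```python
from collections import Counter

# Assumption: NFS runs on its standard port. netstat prints "nfs" instead
# of "2049" when it resolves service names (that is, without -n).
NFS_PORT_NAMES = ("2049", "nfs")


def count_nfs_tcp_sockets(netstat_output: str) -> Counter:
    """Count TCP sockets whose local port is the NFS port, by state.

    Expects lines in the netstat layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner/header lines and anything that is not a TCP socket.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port in NFS_PORT_NAMES:
            states[fields[5]] += 1
    return states


# Hypothetical sample output; the addresses are made up.
SAMPLE_NETSTAT = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:2049           10.0.1.17:801           ESTABLISHED
tcp        0      0 10.0.0.1:2049           10.0.1.18:799           ESTABLISHED
tcp        0      0 10.0.0.1:2049           10.0.1.19:800           TIME_WAIT
tcp        0      0 10.0.0.1:22             10.0.9.9:51515          ESTABLISHED
"""

for state, count in count_nfs_tcp_sockets(SAMPLE_NETSTAT).items():
    print(state, count)
```

Run periodically on the head-node against real netstat output, a count that keeps growing under heavy I/O would match the symptom described above.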