[Beowulf] NFS HPC survey results.

Wed Jul 20 16:19:27 PDT 2016

Many thanks for all the responses.

Here's the promised raw data:
    https://wiki.cse.ucdavis.edu/_media/wiki:linux-hpc-nfs-survey.csv

I'll summarize the 26 results below.  I'll email a similar summary to those
who asked.

Not everyone answered all questions.

1) cluster OS:
   72% Red Hat/CentOS/Scientific Linux or derivative
   24% Debian/Ubuntu or derivative
    4% SUSE or derivative

2) Appliance/NAS or linux server
    32% NFS appliance
    76% linux server
    12% other (illumos/Solaris)

3) Appliances used (one each, free form answers):
    * Hitachi BlueARC, EMC Isilon, DDN/GPFS, x4540
    * Not sure - something that corporate provided. An F5, maybe...? Also a
        Panasas system for /scratch.
    * NetApp FAS6xxx
    * netapp
    * isilon x and nl
    * Isilon
    * NetApp
    * Synology

4) Which kernel do you use:
    88% one provided with the linux distribution
    12% one that I compile/tweak myself

5) what kernel changes do you make
    * CPU performance tweaking, network performance.
    * raise ARP cache size; a newer kernel than the stock 3.2 was needed for
      newer hardware (3.14 at the moment)
    * ZFS

6) Do you often see problems like nfs: server 192.168.5.30 not responding,
    timed out:
    42.3% Never
    23.1% Sometimes
    19.2% rarely
     7.7% daily
     7.7% often

7) If you see NFS time outs what do you do (free form answers)
   * nothing
   * nothing
   * Restart NFSd, look for performance intensive jobs, sometimes increase NFSd.
   * Look at what's going on on that server. That means looking at what the
     disks are doing, what network flows are going to/from that server and
     determine if the load is something to take action on or to let be.
   * Not much
   * Reboot
   * Resolve connectivity issue if any and run mount command on nodes. If this
     doesn't fix it, then reboot.
   * Ignore them, unless they become a problem.
   * Look for the root cause of the issue; typically the system is suffering
     network issues or is overloaded by user 'abuse/misuse'.
   * diagnose and identify underlying cause
   * Try to figure out who is overloading the NFS server (hard job)
   * Troubleshoot, typically a machine is offline or network saturation

8) which NFS options do you use (free form):
   * tcp,async,nodev,nosuid,rsize=32768,wsize=32768,timeout=10
   * nfsvers=3,nolock,hard,intr,timeo=16,retrans=8
   * hard,intr,rsize=32768,wsize=32768
   * all default
   * async
   * async,nodev,nosuid,rsize=32768,wsize=32768
   * tcp,async, nodev, nosuid,timeout=10
   * -rw,intr,nosuid,proto=tcp (mostly. Could be "ro" and/or "suid")
   * rsize=32768,wsize=32768,hard,intr,vers=3,proto=tcp,retrans=2,timeo=600
   * rsize=32768,wsize=32768
   * -nobrowse,intr,rsize=32768,wsize=32768,vers=3
   * udp,hard,timeo=50,retrans=7,intr,bg,rsize=8192,wsize=8192,nfsvers=3,
     mountvers=3
   * RHEL defaults
   * default ones, they're almost always the best ones
   * rw,nosuid,nodev,tcp,hard,intr,vers=4
   * rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
     port=0,timeo=600,retrans=2,sec=sys, clientaddr=10.5.6.7,local_lock=none,
     addr=10.5.6.1
   * defaults, netdev,vers=3
   * nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2
   * rw,hard,tcp,nfsvers=3,noacl,nolock
   * default rhel6 (+nosuid, nodev, and sometimes nfsver=3)
   * tcp, intr, noauto, timeout, rsize, wsize, auto
   * nfsvers=3,rsize=1024,wsize=1024,cto
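
For anyone who wants to tally these free-form answers, here is a rough Python
sketch that splits an option string into name/value pairs. It assumes
comma-separated options as written above and tolerates stray spaces and a
leading "-"; parse_mount_options is a hypothetical helper, and it makes no
attempt to check that an option is one the NFS client actually accepts:

```python
from collections import Counter


def parse_mount_options(text):
    """Split a comma-separated option string into a {name: value} dict.

    Flags without '=' map to None. Surrounding whitespace and a leading
    '-' (as in '-rw') are stripped.
    """
    options = {}
    for item in text.split(","):
        item = item.strip().lstrip("-")
        if not item:
            continue
        name, sep, value = item.partition("=")
        options[name] = value if sep else None
    return options


if __name__ == "__main__":
    # A few of the answers from the list above.
    answers = [
        "tcp,async,nodev,nosuid,rsize=32768,wsize=32768,timeout=10",
        "hard,intr,rsize=32768,wsize=32768",
        "-rw,intr,nosuid,proto=tcp",
        "nfsvers=3,rsize=1024,wsize=1024,cto",
    ]
    rsizes = Counter(parse_mount_options(a).get("rsize") for a in answers)
    print(rsizes)
```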

9) Any explanations:
   * We have not yet made the change to nfsv4, we use nolock due to various
     application "issues", we do not hard set rsize/wsize as they have been
     negotiating better values for a number of years on their own under v3,
     and the timeout/retrans are a bit of a legacy set of values from working on
     this issue of server overload. Hard was a choice on our end: having
     things hang definitely seemed better than having things fail and go
     stale. We still agree with the choice of hard. Intr just helps to
     "interrupt" stuck things when needed.
   * We like to be able to ctrl-C hung processes. For some systems we use larger
     rsize/wsize if the vendor supports it.
   * works for me without tweaks
   * We didn't use tcp until the last couple of years.
   * Probably needs a revisit; block size was set up for 2.x series kernels
   * default of centos 7
   * nfsv4 was not stable enough last time out, don't fix rsize/wsize as
     client/server usually negotiate to 1M anyway
   * We have frequent power outages (5+ times a year) and noauto helps us not
     to hang on mounting NFS shares. The drawback is that you have to mount
     manually. Timeout helps with this issue as well.
   * These are adjusted if necessary for particular workloads

10) what parts of the file system do you use NFS for (free form):
   * /home
   * /home
   * /home
   * /home
   * /home
   * /home
   * /home
   * /home and /apps
   * We use NFS for the OS (NFSRoot), App tree, $HOME, Group dedicated space, as
     well as some of our scratch spaces. All of these come from different NFS
     servers.
   * /home, /apps
   * /home /opt /etc /usr /boot
   * /home,/apps,
   * /home, /apps, /scratch - all of 'em
   * /home, long term project storage, shared software
   * /cluster/home,/cluster/local,/cluster/scratch,/cluster/data
   * home, apps, shared data
   * /usr/local, /home
   * /home , /apps
   * various
   * /home, /group, /usr/local
   * /home, parts of /opt, some specific top level auto-mountable dirs
   * What above is called /apps and /home for a few medium sized systems
   * /home, /local, /opt, /diskless
   * /home, /opt, diskless node images

11) How many nodes can mount a single NFS server at once:
    24% >= 512 nodes
    20% 65-128 nodes
    16% 1-16 nodes
    12% 17-32 nodes
    12% 257-512 nodes
    12% 129-256 nodes
     4% 33-64 nodes

12) How many NFSd daemons do you run per NFS server
     45.0% 1-16
     13.6% 129-256
     13.6% 65-128
      9.1% 33-64
      4.5% 17-32
      4.5% 256-512
      4.5% 512-1024
      4.5% 2048-4096

13) Do you use NFSd or user space
     81.0% Kernel NFSd
     14.3% User space
      4.8% Both

14) What interconnect do you use with NFS?
     38.5% 10G
     26.9% GigE
     23.1% IB
     11.5% Other

15) If IB what transport (10 responses)
     100% IPoIB
        0% Other

16) If IB, do you use connected mode (8 responses)
     62.5% Connected mode
     37.5% Don't use connected mode

17) Do you use UDP or TCP (25 responses)
     84% TCP
     12% UDP
      4% Other

18) Which other network file systems do you use? (24 responses)
        0% pNFS
     58.3% Lustre
     16.7% Ceph
     12.5% BeeGFS
     12.5% GlusterFS
      8.3% None (Panasas, GPFS, HSM/SAM/QFS, or more than one of the above)

19) Are the other network file systems more or less reliable than NFS?
     58.3% Similar
     16.7% I use only NFS
     12.5% Much more reliable
      4.2% Much less reliable
      4.2% Somewhat less reliable
      4.2% Somewhat more reliable

20) Do you support MPI-IO (not just MPI)
     70.8% no
     20.8% yes
      8.3% (yes, but nobody uses it)

21) Any tips for making NFS perform better or more reliably?
   * We start with the underlying block (RAID/disks) setup that you are going
     to serve data from and plumb up from there. The key thing here is choosing
     your RAID stride/chunk sizes and ensuring your file system is as aware of
     the RAID layout as it can be, for good alignment. We follow the ESnet host
     tuning found at http://fasterdata.es.net/host-tuning/linux/ on both client
     and server systems. We also bump up the rpc.mountd count to help ensure
     successful mounts, as we use autofs to mount a number of the NFS spaces.
     When a larger HPC job started up on many nodes, there was a time when not
     all of them could mount successfully if the server was under load.
     Increasing the rpc.mountd count helped. We also set async and wdelay on
     our exports on the servers.
   * Kernel settings
   * I've heard that configuring IB in RDMA boosts NFS performance
   * We don't use NFS for high performance cluster data. That's Lustre's world.
     Where NFS is used for scientific data, it's in places where there are
     modest numbers of concurrent clients.
   * more disks
   * RPCMOUNTDOPTS="--num-threads=64"
   * Try to optimize /etc/sysconfig/nfs as much as possible.

22) Any tips for making NFS clients perform better or more reliably?
   * Following the above mentioned ESnet info at
     http://fasterdata.es.net/host-tuning/linux/. I should note that for both
     clients and servers that are using IPoIB we use connected mode and set the
     MTU to 64k.
   * Reducing the size of the kernel dirty buffer on the clients makes
     performance much more consistent.
   * use reliable interconnect hw
   * We've tried scripting NFS mounts w/o much success.
   * Educate users on using the right filesystem for the right task

23) Anything you would like to add:

   * We have also seen input from others that they see gains with the client
     option 'nocto'. The man pages suggest this has some risks, so while we
     have tested and can see that certain loads gain from it, we have not yet
     moved forward to deploy this option on our general setup. We are in the
     process of testing our apps to ensure we do not create other issues for
     them if we do use this flag. Another thing we have been looking at is
     cachefilesd and seeing how well that helps for data that can easily be
     cached. For things like our application trees, the OS (we are NFSRoot
     booted), and even some user reference data sets this looks quite
     promising, but we have not gone live with this yet either.
   * We're always looking to improve our environment as well. We don't always
     have TIME to do so, of course.
   * Horses for courses. NFS is great for shared software and home directories.
     It's pretty useless for high performance access from hundreds of compute
     nodes.
   * Every storage system / file system I've ever seen or used has had its
     problems. There is no silver bullet (afaik). Use that which you have the
     competence to handle.
   * We are currently struggling with NFS mounts. We use them extensively
     throughout our department. The problems are that they hang constantly,
     and when one person is using the share heavily it slows down other
     computers. We've done lots of research into optimizing NFS but always come
     back to the same issues (hanging mounts that don't recover w/o admin
     interaction). We would love to know what other people are doing. We are
     experimenting with Ceph at the moment for future large storage needs.