[Beowulf] Troubleshooting NFS stale file handles
pbisbal at pppl.gov
Thu Apr 20 14:14:21 PDT 2017
On 04/19/2017 05:52 PM, Bernd Schubert wrote:
> On 04/19/2017 07:58 PM, Prentice Bisbal wrote:
>> Here's the sequence of events:
>> 1. First job(s) run fine on the node and complete without error.
>> 2. Eventually a job fails with a 'permission denied' error when it tries
>> to access /l/hostname.
> So you don't get ESTALE, but you get EACCES? You *might* be able to fix
> this by setting the 'no_subtree_check' option in your /etc/exports. I
> don't remember the details exactly anymore, but nfsd/exportfs check more
> intensively whether a dentry is valid if this option is not given.
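For anyone following along, here is a minimal sketch of how one might check which export entries do not set the option explicitly. The function name is made up for illustration, and it only looks for the literal string on each line, so an entry listing several clients with different options would need checking by hand. Depending on the nfs-utils version the option may already be the default, so exports(5) is the place to confirm.

```shell
# Hypothetical helper: list export entries that do not explicitly set
# no_subtree_check. Takes the exports file as its first argument and
# defaults to /etc/exports.
exports_missing_no_subtree_check() {
    exports_file="${1:-/etc/exports}"
    # Drop blank lines and comments, then print entries lacking the option.
    grep -Ev '^[[:space:]]*(#|$)' "$exports_file" | grep -v 'no_subtree_check'
}
```

After editing /etc/exports, running `exportfs -ra` as root on the server should re-export everything with the new options.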
I don't remember seeing either ESTALE or EACCES, just that there was a
message about stale file handles. I didn't save the messages I captured with
tcpdump, and I had to delete my /var/log/messages files because when I
turned on all the logging I could with rpcdebug, it filled up /var in less
than a day, and I needed to free up space in /var. In hindsight, I should
have copied them somewhere else instead of just deleting them.
I rebooted the systems yesterday, and the problem has gone away since
the reboot, so I can't reproduce the problem and send you the relevant
messages. I'm not a smart man.
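A rough sketch of what copying the logs elsewhere could look like next time, assuming there is another filesystem with free space. The function name and the destination directory are hypothetical, and it is meant for rotated log files, not the one syslog still has open.

```shell
# Hypothetical helper: compress log files into another directory, then
# remove the originals to free space.
# Usage: archive_logs DEST_DIR FILE...
archive_logs() {
    dest_dir="$1"
    shift
    mkdir -p "$dest_dir" || return 1
    for log_file in "$@"; do
        # Skip arguments that are not regular files (e.g. unmatched globs).
        [ -f "$log_file" ] || continue
        # Delete the original only if the compressed copy was written.
        if gzip -c "$log_file" > "$dest_dir/$(basename "$log_file").gz"; then
            rm -f "$log_file"
        else
            return 1
        fi
    done
}
```

For example, `archive_logs /scratch/nfs-debug /var/log/messages-*`, where /scratch/nfs-debug is a placeholder path. Clearing the debug flags again with `rpcdebug -m nfs -c all` once enough has been captured should also keep /var from filling up.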
> I don't think that networking can be a cause for this, but if a
> dentry/inode is evicted from the server side cache, the NFS file handle
> has to be used to create inode and dentry on the server side on the
> underlying file system. I think EACCES is then used if something goes
> wrong connecting the dentry to the parent-dentry (I need to look up the
> exact details again, it's been a while since I had to deal with this).
Are these meanings of EACCES and ESTALE defined in the NFS RFCs? If so,
I may need to read them.
> You could try to set /proc/sys/vm/vfs_cache_pressure to a very low value
> (don't set it to 0, though). Depending on your file system and kernel
> version this might help to keep dentries/inodes in the cache and to avoid
> running into this (there was a bug until 3.10 that prevented this from
> working properly; I'm not sure if the related patch series has been
> backported into vendor kernels).
Thanks for the tip. I'll keep it in mind.
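For the record, a small sketch of how the current value could be inspected before changing anything. The function name is invented for this example, and the optional path argument exists only so it can be tried against a copy of the file.

```shell
# Hypothetical helper: print the current vfs_cache_pressure and warn if it
# is 0. Reads /proc/sys/vm/vfs_cache_pressure unless another path is given.
show_cache_pressure() {
    pressure_file="${1:-/proc/sys/vm/vfs_cache_pressure}"
    pressure=$(cat "$pressure_file") || return 1
    echo "vfs_cache_pressure=$pressure"
    if [ "$pressure" -eq 0 ]; then
        # With 0 the kernel never reclaims dentries and inodes under
        # memory pressure, which can lead to out-of-memory conditions.
        echo "warning: 0 disables dentry/inode reclaim" >&2
    fi
}
```

Lowering the value would then be something like `sysctl -w vm.vfs_cache_pressure=10` as root on the server, where 10 is only an illustrative number and not a recommendation. The kernel default is 100.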
> Btw, which kernel version and file system is your nfs server running on?
Both servers and clients are running the same exact version of
everything, since they are using the same NFS root filesystem:
$ cat /etc/redhat-release
CentOS release 6.8 (Final)
$ cat /proc/version
Linux version 2.6.32-642.11.1.el6.x86_64
(mockbuild@c1bm.rdu2.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-17) (GCC) ) #1 SMP Fri Nov 18 19:25:05 UTC 2016
$ rpm -qa | grep -i nfs