[Beowulf] NFS lock recovery and diskless nodes

Ansgar Esztermann-Kirchner aeszter at mpibpc.mpg.de
Fri Jan 17 06:44:34 PST 2020


Hello List,

recently, I looked into some dangling-lock problems we had after a
partial power loss.
Here is my analysis of what happens:
-A user application on a compute node requests a lock on a file in an
 NFS-mounted file system;
-the NFS server grants the lock;
-a partial power loss (just one phase affected for a few ms) causes
 the compute node to reboot, while the server keeps running;
-if the compute node is stateful, it will look through the entries in
 /var/lib/nfs/sm (the "monitor list") to discover from which server(s)
 it had mounted NFS shares, and send each of them an NSM notify message;
-notified servers drop locks from the affected compute node.
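
The lookup step on a stateful node can be sketched roughly as follows.
This is an illustration only: on Linux the real work is done by
sm-notify(8), the function name monitored_hosts and its sm_dir parameter
are made up for this sketch, and it assumes the monitor list holds one
file per monitored host, named after that host.

```python
import os
import tempfile


def monitored_hosts(sm_dir="/var/lib/nfs/sm"):
    """Return the sorted host names recorded in a monitor list directory.

    Hypothetical helper: it only mimics the lookup, it sends no
    NSM notify messages.
    """
    try:
        entries = os.listdir(sm_dir)
    except FileNotFoundError:
        # No directory at all behaves like an empty monitor list.
        return []
    # Assumption: one regular file per monitored host, named after it.
    return sorted(
        name for name in entries
        if os.path.isfile(os.path.join(sm_dir, name))
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        # A freshly booted diskless node: nothing to notify.
        print(monitored_hosts(demo_dir))   # []
        # A stateful node that held a lock on one server before the reboot.
        open(os.path.join(demo_dir, "nfs1.example.org"), "w").close()
        print(monitored_hosts(demo_dir))   # ['nfs1.example.org']
```

An empty result means that no server gets notified, which is exactly
the situation a diskless node is in after a reboot.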

However, this does not work for diskless compute nodes: their monitor
list is empty after a reboot, so no notify messages are sent and the
locks are left dangling on the server.

One could work around the problem by triggering a round of notify
messages from the server: all nodes that didn't reboot would then
re-request their pertinent locks, and all other locks would be dropped.
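
As a toy model of that workaround (not actual NLM/NSM code; the function
server_initiated_recovery and both of its arguments are hypothetical),
the effect on the server's lock table could be sketched like this:

```python
def server_initiated_recovery(lock_table, client_view):
    """Model a notify round triggered from the server.

    lock_table:  dict mapping client name -> set of paths the server
                 believes are locked by that client before the round.
    client_view: dict mapping client name -> set of paths the client
                 itself still believes it holds; a rebooted node has
                 forgotten everything and maps to an empty set.
    Returns the server's lock table after the round.
    """
    new_table = {}
    for client, held in client_view.items():
        # Only locks the server granted earlier can be re-requested.
        reclaimed = held & lock_table.get(client, set())
        if reclaimed:
            new_table[client] = reclaimed
    # Anything not re-requested is gone, including the dangling locks.
    return new_table


if __name__ == "__main__":
    before = {
        "node01": {"/data/a.lock"},
        "node02": {"/data/b.lock"},   # node02 is diskless and rebooted
    }
    views = {"node01": {"/data/a.lock"}, "node02": set()}
    print(server_initiated_recovery(before, views))
    # {'node01': {'/data/a.lock'}}
```

In this model node01 keeps its lock because it re-requests it, while
the dangling lock of the rebooted node02 disappears.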
However, a more automatic solution would be nice, especially when more
than one or two NFS servers are involved.

How do you deal with this?

Thanks,

A.
-- 
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
http://www.mpibpc.mpg.de/grubmueller/esztermann