memory leak

Wed Dec 18 10:51:20 PST 2002

On Wed, Dec 18, 2002 at 07:18:21AM -0800, Brian LaMere wrote:
> Yes, I am aware.
> 
> It was not until a month ago that performance started becoming an issue,
> however.  And it was not until yesterday that the cluster almost crippled
> the NFS server.
> 
> The file in particular they were hitting when this occurred has been the
> same since Sep24.
> 
> I am fully aware that the memory is still available.  The problem is that
> the buffers are not - and as such, it grabs the file *each and every time*,
> as I said.  If I reboot them, they do not grab the file each and every time.
> 
> I would love it if the buffers would get released, but they're not.  I
> thought I said this before, however?  Jobs get completed about 2 a minute
> when the memory is listed as "full."  They get completed about 200 a minute
> when the memory isn't.  I'm really not sure how to paint a clearer picture
> than that.  What /should/ occur in theory is not, in fact, occurring.  A
> node can sit there unused for as much as 24 hours, and still exhibit the
> problem.
> 
> When memory is not listed as full, the nodes slam the NFS server for a
> moment, just long enough to grab whatever flatfile database the current tool
> is running against.  Then there is almost no network traffic at all for
> hours, esp to the NFS server.  This behavior changes after about a week.
> This behavior never changed before.
> 
> I'm repeating myself simply because I obviously wasn't clear before.  Yes, I
> know buffers are held until something else is needed to be buffered, based
> on a retention policy.  The only insult to my intelligence is in my lack of
> clarity in the description of what is going on.

We were experiencing a similar symptom (Linux clients crippling an NFS server)
with our Netapp filers. In that case the cause was a bug in Linux's NFS
implementation that leads to flooding of the reassembly queue of the NFS
server, with the same effect that you were describing: basically no traffic
to the NFS server, but the server 100% busy.

A better description of that bug can be found on the Linux NFS mailing
list:

http://marc.theaimsgroup.com/?l=linux-nfs&m=102515480929805&w=2

To my knowledge the bug is fixed in 2.4.20 but present in all previous
2.4.x kernels.

For kernels 2.4.x with x < 20, the workaround is to switch to NFS over tcp
and/or to limit rsize and wsize to 8k (note that RedHat by default uses
NFS over udp with 32k rsize and wsize, thereby triggering the bug by
default). We switched to tcp and the problem disappeared.
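
The rule above can be written down as a small check. This is only a
minimal sketch: the function names are made up for illustration, and it
assumes the kernel release string starts with three dot-separated numbers
(as `uname -r` usually reports) and that "8k" means 8192 bytes. The
strings tcp, rsize and wsize are the usual NFS mount option names.

```python
import re


def is_affected(release):
    """Return True for 2.4.x kernels with x < 20 (the range named above)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        # Unrecognised release string: make no claim either way.
        return False
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor) == (2, 4) and patch < 20


def suggested_nfs_options(release, use_tcp=True):
    """Hypothetical helper: mount options for the workaround, or None.

    Returns None when the kernel is outside the affected range.
    """
    if not is_affected(release):
        return None
    # Assumption: "8k" is taken to mean 8192 bytes.
    options = ["rsize=8192", "wsize=8192"]
    if use_tcp:
        options.insert(0, "tcp")
    return ",".join(options)
```

Calling suggested_nfs_options(platform.release()) on a client would give
a string that can be passed to mount -o or placed in the options field of
the fstab entry, assuming the NFS server accepts tcp mounts.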

I do not know whether this has anything to do with your problem, but
switching NFS to tcp may be worth a try.

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================