blamere at diversa.com
Wed Dec 18 12:37:52 PST 2002
for testing purposes, I can put most of the datafiles on the nodes, yes.
However, these files are smaller than they were before (I ran into a sorta
similar problem in the past when the files became simply too large to be
cached). However, they get updated nightly, and for some of the datafiles,
they get updated as soon as the workers are finished running the first phase
of a job, and they then need access to the same updated datafile to run the
second phase. And unfortunately, there is no good way to pause between the
two phases. Over the course of a day, 60K-120K jobs will be run, using a
total of generally 3-5 large (500-900M) datafiles (though there are more
that are used less often).
Latency doesn't /seem/ to be an issue, since the queue has worked near
flawlessly in its current setup (the only major change being the Scyld
upgrade from the previous version). Also, it runs perfectly for a few days
after a reboot. It's not until there is a buildup - somewhere - that
performance starts taking a hit. Performance also doesn't take a gradual
hit, generally. It's normally a fairly sharp change in the performance curve
(though it can change due to the varying size of the files being requested).
I'm going to be walking around and inspecting cables today, and seeing about
testing the routers. <shrug>
From: Robert G. Brown [mailto:rgb at phy.duke.edu]
Sent: Wed 12/18/2002 9:39 AM
To: Brian LaMere
Cc: beowulf at beowulf.org
Subject: RE: memory leak
On Wed, 18 Dec 2002, Brian LaMere wrote:
> When memory is not listed as full, the nodes slam the NFS server for a
> moment, just long enough to grab whatever flatfile database the current
> job is running against. Then there is almost no network traffic at all for
> hours, esp to the NFS server. This behavior changes after about a week.
> This behavior never changed before.
> I'm repeating myself simply because I obviously wasn't clear before. Yes,
> I know buffers are held until something else needs to be buffered, based
> on a retention policy. The only insult to my intelligence is in my lack of
> clarity in the description of what is going on.
Can you rearrange your program(s) to not use NFS (which sucks in oh, so
many ways)? E.g. rsync the DB to the nodes? Have you tried tuning the
nfsd, e.g. increasing the number of threads? Have you tried tuning the
NFS mounts themselves (rsize, wsize)? Have you considered that the
problem could be with file locking -- if you have lots of nodes trying
to open and read, but especially to write, to the same file, there could
be all sorts of queues and problems created by file locking
(rpc.lockd)? Have you tried to resolve this by (perhaps) maintaining
several copies of the files in contention and spreading the open/close
load across them?
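As a purely illustrative starting point, the first few suggestions might look
like this; the node names, paths, and thread count below are placeholders, not
values from this thread:

```shell
# Hypothetical sketch -- hostnames, paths, and counts are placeholders.

# Replace the NFS read with a nightly push of the DB to local disk:
for node in node1 node2 node3; do
    rsync -a /shared/db/ "${node}:/local/db/"
done

# Bump the number of kernel nfsd threads on the server (many 2.4-era
# distributions defaulted to 8):
rpc.nfsd 32

# Remount on the clients with larger transfer sizes:
mount -t nfs -o rsize=8192,wsize=8192,hard,intr server:/shared /mnt/shared
```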
Have you considered the problems associated with plain old latency --
e.g., suppose that application a on node A opens file X on the server,
reads a bunch of stuff from it, and then writes a bit onto the end, and
closes it. In the meantime, application b on node B is trying to open
it. I >>think<< that NFS is required to flush the modified image
through to disk before it can reissue the image to another request (part
of its being a "reliable" protocol, so that application b doesn't see
the "wrong image" of the file). This can take anything from hundredths
of seconds to seconds, depending on file size and server load, so you
might not see any problem at all as long as demand is lower than some
threshold and then "suddenly" start seeing it as you start to encounter
"collision storms". This used to happen a lot on shared 10 Mbps
ethernet, especially thinwire when the lengths were borderline too long
and too heavily laden with hosts (so the probability of collisions was
relatively high) -- an entire network could be nonlinearly brought to
its knees by a single host inching the total network traffic up over a
critical level, causing error recoveries and retransmissions to start to
pile up with positive feedback (re: "packet storm").
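One cheap way to look for that regime is to watch the cumulative RPC
retransmission counters over an interval (the 60-second window is an
arbitrary choice):

```shell
# Counters are cumulative since boot, so sample twice and compare.
nfsstat -rc        # client-side RPC stats: calls, retrans, badcalls
sleep 60
nfsstat -rc        # retrans climbing much faster than calls suggests
                   # you are in the retransmission/feedback regime
```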
Of course nobody can tell you which of these problems is the critical
one in your particular situation, but maybe the list of the above will
help you debug it. The key thing to do is to try to learn about the
particular subsystem(s) associated with the delays. Sure, maybe it's
just "a kernel bug" (and the kernel list may be the right place to seek
help:-). OTOH, it could very easily be something that is your "fault"
in that you have pushed your network out of the regime where stable
operation can ever be realistically expected for your particular
task architecture. In that case, you'll both have to debug it yourself
(figure out what is failing) and figure out how to re-architect it so
that it no longer is a problem.
Not easy, actually -- takes a lot of trial and effort and can even end
up being something REALLY trivial like a bad network cable or bad switch
port so that errors you thought were "broken kernel" or even "broken
software" were really "bad hardware" and impossible to EVER fix without
replacing the bad hardware. nfsstat, vmstat, cat /proc/stat, plain old
stat, netstat, and perhaps tools like wulfstat/xmlsysd (available at
www.phy.duke.edu/brahma/xmlsysd.html) are your friends. Try clever
experiments. Try to isolate the proximate cause of the problem or the
precise conditions where it occurs.
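A sketch of one such experiment: snapshot the relevant counters periodically
(from cron, say) so the buildup over the "few days after a reboot" can be
seen rather than guessed at. The function name and log directory here are
arbitrary choices:

```shell
# Sketch of a periodic stat snapshot for the server and a few nodes.
log_stats() {
    dir="${1:-/var/log/nodestats}"
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dir"
    # /proc counters: CPU, interrupts, context switches, memory
    for f in /proc/stat /proc/meminfo; do
        [ -r "$f" ] && cat "$f"
    done > "$dir/proc-$stamp"
    # Tool output, only where the tool is installed
    command -v nfsstat >/dev/null 2>&1 && nfsstat    > "$dir/nfs-$stamp"
    command -v netstat >/dev/null 2>&1 && netstat -s > "$dir/net-$stamp"
}
# e.g. run "log_stats /var/log/nodestats" every 10 minutes from cron,
# then diff snapshots from "good" and "bad" days.
```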
> -----Original Message-----
> From: John Hearns [mailto:John.Hearns at cern.ch]
> Sent: Wednesday, December 18, 2002 2:45 AM
> To: Brian LaMere
> Cc: beowulf at beowulf.org
> Subject: Re: memory leak
> Brian, please forgive me if I am insulting your intelligence.
> But are you sure that you are not just noticing the disk
> buffering behaviour of Linux?
> The Linux kernel will use up spare memory as disk buffers -
> leading to an (apparent) lack of free memory.
> This is not really the case - as the memory will be released
> again when needed.
> (Ahem. Was caught out by this too the first time I saw it...)
> Anyway, if this isn't the problem, maybe you could send
> us some of the stats from your system?
> Maybe use nfsstat?
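John's disk-buffering point is easy to check directly. A minimal sketch,
assuming a Linux /proc/meminfo in the usual "Name: value kB" form -- memory
that is merely sitting in buffers/cache is reclaimable and should be counted
as available:

```shell
# Sum MemFree + Buffers + Cached (kB) from a meminfo-style file.
# Memory held as buffers/cache is reclaimable, so it is effectively free.
mem_available_kb() {
    awk '/^(MemFree|Buffers|Cached):/ { sum += $2 } END { print sum+0 }' \
        "${1:-/proc/meminfo}"
}
```

If this number stays healthy on the nodes while "free" memory looks
exhausted, you are seeing the cache behaviour John describes, not a leak.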
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu