memory leak
Brian LaMere blamere at diversa.com
Wed Dec 18 12:37:52 PST 2002
For testing purposes, I can put most of the datafiles on the nodes, yes. However, these files are smaller than they were before (I ran into a sorta similar problem in the past when the files became simply too large to be cached). However, they get updated nightly, and some of the datafiles get updated as soon as the workers are finished running the first phase of a job; the workers then need access to the same updated datafile to run the second phase. And unfortunately, there is no good way to pause between the two phases.

Over the course of a day, 60K-120K jobs will be run, generally using a total of 3-5 large (500-900M) datafiles (though there are more that could potentially be used).

Latency doesn't /seem/ to be an issue, since the queue has worked near flawlessly in its current setup (with the only major change being the Scyld upgrade from the previous version). Also, it runs perfectly for a few days after a reboot. It's not until there is a buildup - somewhere - that performance starts taking a hit. Performance also doesn't take a gradual hit, generally. It's normally a fairly sharp change in the performance curve (though it can change due to the varying size of the files being requested).

I'm going to be walking around and inspecting cables today, and seeing about testing the routers. <shrug>

-----Original Message-----
From: Robert G. Brown [mailto:rgb at phy.duke.edu]
Sent: Wed 12/18/2002 9:39 AM
To: Brian LaMere
Cc: beowulf at beowulf.org
Subject: RE: memory leak

On Wed, 18 Dec 2002, Brian LaMere wrote:

> When memory is not listed as full, the nodes slam the NFS server for a
> moment, just long enough to grab whatever flatfile database the current tool
> is running against. Then there is almost no network traffic at all for
> hours, esp to the NFS server. This behavior changes after about a week.
> This behavior never changed before.
>
> I'm repeating myself simply because I obviously wasn't clear before.
> Yes, I know buffers are held until something else is needed to be buffered,
> based on a retention policy. The only insult to my intelligence is in my
> lack of clarity in the description of what is going on.

Can you rearrange your program(s) to not use NFS (which sucks in oh, so many ways)? E.g. rsync the DB to the nodes?

Have you tried tuning the nfsd, e.g. increasing the number of threads?

Have you tried tuning the NFS mounts themselves (rsize, wsize)?

Have you considered that the problem could be with file locking -- if you have lots of nodes trying to open and read, but especially to write, to the same file, there could be all sorts of queues and problems being created with file locking (rpc.lockd). Have you tried to resolve this by (perhaps) maintaining several copies of the files in contention and spreading the open/close load around?

Have you considered the problems associated with plain old latency -- e.g., suppose that application a on node A opens file X on the server, reads a bunch of stuff from it, and then writes a bit onto the end, and closes it. In the meantime, application b on node B is trying to open it. I >>think<< that NFS is required to flush the modified image through to disk before it can reissue the image to another request (part of its being a "reliable" protocol, so that application b doesn't see the "wrong image" of the file). This can take anything from hundredths of seconds to seconds, depending on file size and server load, so you might not see any problem at all as long as demand is lower than some threshold and then "suddenly" start seeing it as you start to encounter "collision storms".
This used to happen a lot on shared 10 Mbps ethernet, especially thinwire when the lengths were borderline too long and too heavily laden with hosts (so the probability of collisions was relatively high) -- an entire network could be nonlinearly brought to its knees by a single host inching the total network traffic up over a critical level, causing error recoveries and retransmissions to start to pile up with positive feedback (re: "packet storm").

Of course nobody can tell you which of these problems is the critical one in your particular situation, but maybe the list of the above will help you debug it. The key thing to do is to try to learn about the particular subsystem(s) associated with the delays.

Sure, maybe it's just "a kernel bug" (and the kernel list may be the right place to seek help:-). OTOH, it could very easily be something that is your "fault" in that you have pushed your network out of the regime where stable operation can ever be realistically expected for your particular task architecture. In that case, you'll both have to debug it yourself (figure out what is failing) and figure out how to re-architect it so that it no longer is a problem.

Not easy, actually -- takes a lot of trial and effort and can even end up being something REALLY trivial like a bad network cable or bad switch port so that errors you thought were "broken kernel" or even "broken software" were really "bad hardware" and impossible to EVER fix without replacing the bad hardware.

nfsstat, vmstat, cat /proc/stat, plain old stat, netstat, and perhaps tools like wulfstat/xmlsysd (available at www.phy.duke.edu/brahma/xmlsysd.html) are your friends. Try clever experiments. Try to isolate the proximate cause of the problem or the precise conditions where it occurs.
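[Editorial sketch, not part of the original message.] One way to run the kind of experiment suggested above is to sample the NFS client's RPC counters at intervals and watch for a sharp rise in retransmissions. The sketch below assumes the Linux client exposes an "rpc" line in /proc/net/rpc/nfs whose first two numbers are total calls and retransmissions (the figures nfsstat reports for the client). The function names and the sample numbers are hypothetical, chosen only for illustration.

```python
def parse_rpc_line(text):
    """Return (calls, retrans) from the 'rpc' line of /proc/net/rpc/nfs-style text.

    Assumption: the line looks like 'rpc <calls> <retrans> <authrefrsh>'.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'rpc' line found")


def retrans_fraction(before, after):
    """Fraction of RPC calls that were retransmitted between two samples."""
    calls = after[0] - before[0]
    retrans = after[1] - before[1]
    return retrans / calls if calls > 0 else 0.0


# Made-up samples standing in for two reads of /proc/net/rpc/nfs
# taken some minutes apart on one node.
sample_early = "net 0 0 0 0\nrpc 1000 5 0\n"
sample_late = "net 0 0 0 0\nrpc 3000 105 0\n"

early = parse_rpc_line(sample_early)
late = parse_rpc_line(sample_late)
print("retransmitted fraction over interval:", retrans_fraction(early, late))
```

On a real node the two strings would come from reading the file at two points in time. Logging the fraction per node alongside the job rate may show whether the sharp change in performance coincides with retransmissions piling up, which would point toward the network or the server rather than memory.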
HTH,

   rgb

> -----Original Message-----
> From: John Hearns [mailto:John.Hearns at cern.ch]
> Sent: Wednesday, December 18, 2002 2:45 AM
> To: Brian LaMere
> Cc: beowulf at beowulf.org
> Subject: Re: memory leak
>
> Brian, please forgive me if I am insulting your intelligence.
>
> But are you sure that you are not just noticing the disk
> buffering behaviour of Linux?
> The Linux kernel will use up spare memory as disk buffers -
> leading to an apparent lack of free memory.
> This is not really the case - as the memory will be released
> again when needed.
> (Ahem. Was caught out by this too the first time I saw it...)
>
> Anyway, if this isn't the problem, maybe you could send
> us some of the stats from your system?
> Maybe use nfsstat?
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
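[Editorial sketch, not part of the original thread.] The buffering behaviour John Hearns describes can be checked by separating memory that is genuinely free from memory held as buffers and page cache. The sketch below assumes /proc/meminfo provides "MemFree", "Buffers" and "Cached" lines in kB. The function names and sample values are hypothetical. Treating all of Buffers plus Cached as reclaimable is an approximation, since some cached pages (dirty or shared ones, for example) cannot be dropped immediately.

```python
def parse_meminfo(text):
    """Map field name -> integer value from /proc/meminfo-style 'Name: value kB' lines.

    Lines whose first token after the colon is not a number are skipped.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Split memory into truly free versus held as buffers/page cache (kB)."""
    free = info.get("MemFree", 0)
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "free_kb": free,
        "cache_kb": cache,
        # Rough upper bound on what could be made available on demand.
        "free_plus_cache_kb": free + cache,
    }


# Made-up values standing in for the contents of /proc/meminfo on a node.
sample = """MemTotal:  1030000 kB
MemFree:     12000 kB
Buffers:     50000 kB
Cached:     800000 kB
"""

print(memory_summary(parse_meminfo(sample)))
```

If a node that "looks full" shows a small free_kb but a large cache_kb, the memory is most likely being used as disk cache rather than leaked. If both stay small while performance degrades, the buildup is somewhere else and is worth chasing further.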
