memory leak

Robert G. Brown rgb at phy.duke.edu
Thu Dec 19 07:11:58 PST 2002


On Wed, 18 Dec 2002, Brian LaMere wrote:

> The NFS server is running the proprietary OS from EMC named "DART", which
> they use on their Celerras (and possibly other things).  It had a
> firmware-ish update during November to the NAS code to fix a user mapping
> bug, but that's about it.  The Celerra is a cabinet that does nothing other
> than NFS and CIFS.
> While I didn't cripple the whole cabinet, I did cripple a datamover inside
> it (the primary datamover for the filesystems I was accessing).
> 
> I just checked, and there have been no configuration changes on there in the
> last couple months

Just as a matter of extreme humorosity, you might try picking a box with
enough disk to hold the file(s) you are serving, copy them over, export
them, and redirect all your nodes to mount from it instead (which
probably wouldn't take as long as it sounds -- it is pretty trivial to
set up an NFS server and push a hacked /etc/fstab to the nodes).
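
To make the "hacked /etc/fstab" step concrete, here is a minimal sketch
in Python of the rewrite, not a tested deployment tool.  The function
name and the host names are hypothetical, and it assumes the
replacement box exports the files under the same path as the old
server.  Actually pushing the result out to the nodes (rsync, scp,
whatever you use) is left out.

```python
def redirect_nfs_mounts(fstab_text, old_server, new_server):
    """Return fstab_text with NFS mounts from old_server pointed at new_server.

    Assumes the export path is identical on both servers (an assumption,
    adjust if the test box exports from somewhere else).
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave comments, blank lines, and malformed entries untouched.
        if line.lstrip().startswith("#") or len(fields) < 3:
            out.append(line)
            continue
        device, fstype = fields[0], fields[2]
        host, sep, path = device.partition(":")
        # Only touch nfs-type entries that mount from the old server.
        if fstype.startswith("nfs") and sep and host == old_server:
            fields[0] = f"{new_server}:{path}"
            out.append("\t".join(fields))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical host names, purely for illustration.
    sample = (
        "/dev/hda1\t/\text3\tdefaults\t1 1\n"
        "celerra-dm1:/export/data\t/data\tnfs\trw,hard,intr\t0 0\n"
    )
    print(redirect_nfs_mounts(sample, "celerra-dm1", "scratchbox"), end="")
```

Keep a copy of the original fstab on each node so you can switch back
once the experiment is over.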

Run it that way for a day or eight and see whether the problem
resurfaces.

BTW, you've just revealed (if I understand you correctly) that you're
using a proprietary OS, closed source, black box NFS server.
Ordinarily, I'd say that this is a bad idea, for precisely the reasons
you are now in trouble:  it may be IMPOSSIBLE for you to positively
determine where your problem lies, unless you are fortunate enough to
find a trivial problem and fix it.  If it is a very slow memory leak or
other "deep bug", or even a disagreement/incompatibility between your
server and the mounting clients, how could >>you<< ever tell?  How could
you fix it?  How can you even convince the vendor/mfr that the bug
exists and is their fault so THEY can fix it?  The answer is pretty
universally "not without a lot of work and finger-pointing on
everybody's part".

One thing I would NO LONGER suggest that you do is take the problem to
the kernel list.  They tend to be a tiny bit intolerant of bug reports
involving proprietary interfaces (hardware, software, peripheral)
because of the obvious difficulties in determining who owns the problem
and where it has to be fixed.  Sometimes they'll listen, but sometimes
they just don't want to waste their time.

Sigh.  It is going to be very difficult to debug this if it isn't (your)
hardware.  With a black box, it will be very difficult to debug if it IS
hardware -- inside the black box.  With a black box, you'll never debug
it if it is a bug in the black box software -- at best you'll be able to
convince yourself that it isn't a problem with your nodes per se and
find a workaround (e.g. build your own NFS server, which is cheap'n'easy
enough, and give your BB server to the poor) or MAYBE convince the
company that they own the problem and stimulate a fix.

Open source vs closed source, hmmm....;-)

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





