Linux memory leak?

Huntsinger, Reid reid_huntsinger at merck.com
Thu Feb 28 13:54:24 PST 2002


As far as I can tell, on later kernels (2.4.10, 2.4.13, 2.4.17) this is
mostly due to aggressive caching (I think).  You can get an idea of
what's going on by running a program that eats up lots of memory (e.g.,
malloc a large block, then write over it repeatedly) and checking how
long each pass takes.  You should notice that when "free" reports lots
of "used" memory but nothing is really running, the program runs nearly
as fast as it would after a fresh boot.  The "used" pages are easily
given up (not swapped out).  This also has the side effect of making
"free" report a reasonable number.

However, on older kernels this didn't happen; lots of swap-out activity
would ensue and the malloc-and-write program would really bog down.
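
For reference, the eater is just something along these lines (a rough
sketch; set MEGS somewhere near the machine's physical RAM):

/* eat-mem.c: malloc a big block and write over it repeatedly,
 * timing each pass with the wall clock. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define MEGS 256   /* adjust to roughly the machine's RAM size */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    size_t size = (size_t)MEGS * 1024 * 1024;
    char *buf = malloc(size);
    int pass;

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    for (pass = 0; pass < 10; pass++) {
        double t0 = now();
        memset(buf, pass, size);    /* touch every page */
        printf("pass %d: %.2f s\n", pass, now() - t0);
    }
    free(buf);
    return 0;
}

On the later kernels the passes stay fast even when "free" claims the
memory is "used" by cache; on the older ones you can watch them bog
down as the box starts swapping.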

Reid Huntsinger


Date: Thu, 28 Feb 2002 14:58:07 -0500
From: Josip Loncaric <josip at icase.edu>
Reply-To: josip at icase.edu
Organization: ICASE
To: Beowulf mailing list <beowulf at beowulf.org>
Subject: Linux memory leak?

On our heterogeneous cluster, we run Red Hat 7.2 updated to stock i686
Linux kernels 2.4.9-21 or 2.4.9-21smp.  Sometimes (e.g. after 14 days of
normal operation) our nodes report unusually high memory usage even
without any user processes active.  This can happen on both single CPU
and on dual CPU machines, and it used to happen with previous 2.4
kernels.  Here is an example:

# free
             total       used       free     shared    buffers     cached
Mem:        512444     449196      63248          0      70164      76332
-/+ buffers/cache:     302700     209744
Swap:      1060272     285492     774780

If I add up all RSS numbers reported by 'ps -e v' I get only about
20,500 KB, and yet this dual CPU system reports 302,700 KB RAM used
(without even counting buffers or cache).  Apparently, only 'reboot' can
recover the missing 282,200 KB.  Any ideas on tracking down where the
missing memory went?
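
(For anyone who wants to repeat the check without adding up ps output
by hand, a small C program that walks /proc and totals the VmRSS lines
gives the same sort of number -- a rough sketch, assuming a 2.4-style
/proc:)

/* sum-rss.c: total the VmRSS of every process by walking /proc.
 * A sketch only; processes may come and go during the scan. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    long total_kb = 0;

    if (proc == NULL) {
        perror("/proc");
        return 1;
    }
    while ((de = readdir(proc)) != NULL) {
        char path[64], line[256];
        FILE *f;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                        /* not a PID directory */
        snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
        if ((f = fopen(path, "r")) == NULL)
            continue;                        /* process already gone */
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                total_kb += atol(line + 6);  /* value is in kB */
                break;
            }
        }
        fclose(f);
    }
    closedir(proc);
    printf("total RSS: %ld kB\n", total_kb);
    return 0;
}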

Sincerely,
Josip

P.S. Here is more detail:

# cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  524742656 460013568 64729088        0 71929856 369848320
Swap: 1085718528 292343808 793374720
MemTotal:       512444 kB
MemFree:         63212 kB
MemShared:           0 kB
Buffers:         70244 kB
Cached:          76332 kB
SwapCached:     284848 kB
Active:         242464 kB
Inact_dirty:    188960 kB
Inact_clean:         0 kB
Inact_target:   131068 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       512444 kB
LowFree:         63212 kB
SwapTotal:     1060272 kB
SwapFree:       774780 kB

# ps -e v 
  PID TTY      STAT   TIME  MAJFL   TRS   DRS  RSS %MEM COMMAND
    1 ?        S      0:05    139    23  1392  480  0.0 init
    2 ?        SW     0:00      0     0     0    0  0.0 [keventd]
    3 ?        SWN    0:01      0     0     0    0  0.0 [ksoftirqd_CPU0]
    4 ?        SWN    0:01      0     0     0    0  0.0 [ksoftirqd_CPU1]
    5 ?        SW     0:08      0     0     0    0  0.0 [kswapd]
    6 ?        SW     0:00      0     0     0    0  0.0 [kreclaimd]
    7 ?        SW     0:00      0     0     0    0  0.0 [bdflush]
    8 ?        SW     0:00      0     0     0    0  0.0 [kupdated]
    9 ?        SW<    0:00      0     0     0    0  0.0 [mdrecoveryd]
   13 ?        SW     0:13      0     0     0    0  0.0 [kjournald]
   88 ?        SW     0:00      0     0     0    0  0.0 [khubd]
  154 ?        SW     0:01      0     0     0    0  0.0 [kjournald]
  428 ?        S      0:00     41    46  1485  504  0.0 /sbin/pump -i et
  453 ?        S      0:00     79    23  1452  644  0.1 syslogd -m 0
  458 ?        S      0:00     46    18  2077  508  0.0 klogd -2
  478 ?        S      0:00     83    25  1538  604  0.1 portmap
  506 ?        S      0:00    110    21  1590  616  0.1 rpc.statd
  631 ?        SL     0:03     24   234  1705 1936  0.3 ntpd -U ntp
  685 ?        S      0:00     20    12  1439  508  0.0 /usr/sbin/atd
  703 ?        S      0:00     32   232  2451  656  0.1 /usr/sbin/sshd
  736 ?        S      0:00    143   133  2138  820  0.1 xinetd -stayaliv
  795 ?        S      0:00     75    18  1573  624  0.1 crond
  843 tty1     S      0:00    109     6  1381  368  0.0 /sbin/mingetty t
  844 tty2     S      0:00    109     6  1381  368  0.0 /sbin/mingetty t
  845 tty3     S      0:00    109     6  1381  368  0.0 /sbin/mingetty t
  846 tty4     S      0:00    109     6  1381  368  0.0 /sbin/mingetty t
  847 tty5     S      0:00    109     6  1381  368  0.0 /sbin/mingetty t
  848 tty6     S      0:00    109     6  1381  368  0.0 /sbin/mingetty t
  849 ?        S      2:25    162    10  1429  584  0.1 /opt/sbin/cnm -i 
  850 ?        S      1:39    243   484  1747  928  0.1 /bin/bash /opt/s
 1105 ?        SW     0:14      0     0     0    0  0.0 [rpciod]
 1106 ?        SW     0:00      0     0     0    0  0.0 [lockd]
11105 ?        S      0:51    125   149  1794 1072  0.2 /usr/PBS/sbin/pb
24146 ?        S      0:00      9   423  4804 1996  0.3 sendmail: accept
27052 ?        S      0:00      0   400    39  172  0.0 /sbin/dhcpcd -n 
27219 ?        S      0:00    289    12  2243 1064  0.2 in.rlogind
27220 pts/0    S      0:00    288    16  2339 1120  0.2 login -- root
27221 pts/0    S      0:00    288   484  2047 1360  0.2 -bash
27314 ?        S      0:00    168     9  1934  680  0.1 sleep 60
27315 pts/0    R      0:00    175    59  2588  716  0.1 ps -e v

# uptime
  2:53pm  up 14 days, 17:03,  1 user,  load average: 0.00, 0.00, 0.00

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134

--__--__--

Message: 10
Date: Thu, 28 Feb 2002 15:42:09 -0500 (EST)
From: Joshua Baker-LePain <jlb17 at duke.edu>
To: Josip Loncaric <josip at icase.edu>
cc: Beowulf mailing list <beowulf at beowulf.org>
Subject: Re: Linux memory leak?

On Thu, 28 Feb 2002 at 2:58pm, Josip Loncaric wrote:

> # free
>              total       used       free     shared    buffers     cached
> Mem:        512444     449196      63248          0      70164      76332
> -/+ buffers/cache:     302700     209744
> Swap:      1060272     285492     774780
> 
> If I add up all RSS numbers reported by 'ps -e v' I get only about
> 20,500 KB, and yet this dual CPU system reports 302,700 KB RAM used
> (without even counting buffers or cache).  Apparently, only 'reboot' can
> recover the missing 282,200 KB.  Any ideas on tracking down where the
> missing memory went?

I've seen this behavior even after very little uptime.  All you have to
do is have a process swap heavily.  When that process goes away, it
seems as if what's left in swap also stays in memory.  Under further
memory pressure, that data then gets paged *out* of swap.

I tracked it down to an existing bugzilla report:

http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=59002

There doesn't seem to be an official resolution from Red Hat yet, but a
custom-compiled 2.4.17 didn't show this behavior.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


--__--__--

Message: 11
Date: Thu, 28 Feb 2002 16:07:25 -0500 (EST)
From: "Robert G. Brown" <rgb at phy.duke.edu>
To: Beowulf Mailing List <beowulf at beowulf.org>
Subject: Motherboard query...

Dear Listers,

I'd like to request comments on a couple of dual Athlon motherboards.
We are considering both the Tyan Tiger 2466N (760 MPX) and the MSI K7D
Master (MS-6501) (also 760 MPX).  Our local vendor "supports" MSI
motherboards (which just means that we deal with them rather than Tyan
in the event of a return, but which makes it reasonable to use the MSI
all things being equal).  We are going with 760 MPX to get the 64/66 PCI
slots, of course -- we actually have a small stack of 2460 Tigers which
are not totally painless but which we've more or less tamed.

Any experiences yet, good or bad, with either motherboard?  The vendor
is probably going to loan us an MSI-based dual to test, but there's
nothing like the experience of somebody actually running a cluster if
there is anybody out there already doing so.

I'd also like comments on RAID alternatives.  We have a group who needs
about 500 GB of RAID.  We just got a Promise UltraTrak100 TX8 (IDE-SCSI)
RAID chassis that advertised itself as OS-independent plug and play --
attach it to the SCSI bus and go.  The first unit we were shipped
didn't work under any OS.  For the second, we got the vendor (Megahaus)
to verify function before shipping, and it does "work", but it returns
unbelievably poor performance at RAID 5 -- a (very) few MB/sec -- under
bonnie.  From this we learned (among many things:-) that vendors often
quote performance numbers on a RAID from its RAID 0 configuration,
which would be kind of funny if it weren't for the murderous impulses
it creates when you learn that their numbers are some sort of cruel
joke under RAID 5.

We are twisting Megahaus's arm to take it back and give us our money
back (they are complaining that it is more than thirty days since they
delivered the FIRST unit, but we've only had a working unit for about
two weeks and do not want it if its SCSI performance is that abysmal).
We are then stuck looking for an alternative at roughly the same cost.

Our alternatives seem to be:

   a) Another IDE-RAID enclosure, perhaps from a better manufacturer.
However, at this point we're more than a bit concerned about the gap
between vendor performance claims and reality.  There are vendors that
assert 100 MB/sec read rates, but we are concerned that they mean "at
RAID 0", which is useless to us.  We need real-world loaded numbers at
RAID 5 (e.g. multiple instances of bonnie).  Folks we know locally who
have e.g. zero-d chassis report real-world throughput more like 20
MB/sec read/write, but their boxes are a year or two old and may not
reflect current rates.  20 MB/sec is pretty much the LOWEST rate we
could tolerate in this application under multithreaded load, and we'd
like something better.  Any enclosures/controllers out there that give
good-to-excellent performance that you'd care to recommend?

  b) md-raid, either IDE or SCSI, on a straight Linux server.  We know
that this works remarkably well.  We run md RAID in the departmental
server (SCSI, with a stack of 36 GB disks in RAID 5) and get excellent
performance -- ~40 MB/sec write throughput and even better for read.
Unfortunately, large SCSI disks are still excessively expensive and we
don't have the budget to reach 500 GB with SCSI disks for this cluster.
IDE is cheap and easy, but we would like a bit of assurance that Linux
won't have (e.g. DMA) problems when dealing with 6-8 IDE controllers on
one bus.  Is anyone doing this?  Good or bad experiences, hardware
recommendations, or gotchas are all welcome.  (A sketch of the sort of
/etc/raidtab we have in mind follows this list.)

  c) SCSI RAID.  Definitely works, definitely high performance, but also
the most expensive and again, we won't be able to afford to reach our
design spec with the money allocated to this ($5-6K total).
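
To be concrete about option (b): with raidtools, the /etc/raidtab we
have in mind would look something like the sketch below.  The device
names are hypothetical (one drive per IDE channel), the four-disk count
is just for illustration (extend it to match the number of channels),
and the chunk size is only a starting point:

# /etc/raidtab -- software RAID 5, one disk per IDE channel (sketch)
raiddev /dev/md0
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        parity-algorithm        left-symmetric
        chunk-size              64
        device                  /dev/hde1
        raid-disk               0
        device                  /dev/hdg1
        raid-disk               1
        device                  /dev/hdi1
        raid-disk               2
        device                  /dev/hdk1
        raid-disk               3

# then create the array and put a filesystem on it:
#   mkraid /dev/md0
#   mke2fs -j /dev/md0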

If we have to fall back to SCSI we will and will live with a smaller
RAID than we had hoped, but we'd very much like to first find out if
IDE-based RAID solutions (RAID 5 on ~500 GB total disk) with >20 MB/sec
worst case write rates under heavy load exist.

TIA,

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





--__--__--

_______________________________________________
Beowulf mailing list
Beowulf at beowulf.org
http://www.beowulf.org/mailman/listinfo/beowulf


End of Beowulf Digest



