bpsh and memory leak - wien

Florent Calvayrac Florent.Calvayrac at univ-lemans.fr
Wed Oct 2 08:24:07 PDT 2002


Donald Becker wrote:
> 
> On Tue, 1 Oct 2002, Florent Calvayrac wrote:
> 
> > We try to use WIEN97 on our Scyld beowulf cluster, and
> > the following happens : the program lapw1 (more or less
> > pure fortran 77), run interactively on the front node,
> > happily grows to, say, 30 MB and then runs to completion.
> > When run with bpsh on a remote node, the available memory
> > just shrinks down until the system swaps to stall.
> 
> You can use 'top' or 'ps' on the master to monitor memory usage of the
> process

Thanks a lot 8-)!


> What is using the memory?
> 

I do not know!

To summarize:

on the front node, we type

/home/wien/lapw1 lapw1.def 

press Enter, and it runs fine. On a remote node, we instead run

bpsh 0 /home/wien/lapw1 lapw1.def

and "bpsh 0 free -t" shows available memory running down
to 0; then the hard disk's red light comes on
(and "bpsh 0 cat /proc/meminfo" confirms that the system starts
to swap). I cannot make sense of the slabinfo output.
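
To follow these numbers over time instead of reading them by eye, the
/proc/meminfo check above could be scripted roughly as follows. This is
only a sketch: the helper names (parse_meminfo, swap_used_kb, sample_node)
are made up for illustration, and it assumes the node's /proc/meminfo
contains the usual "Name: value kB" lines (MemFree, SwapTotal, SwapFree).
Lines in any other format are simply ignored.

```python
import re
import subprocess

# Matches lines such as "MemFree:   123456 kB"; other lines are skipped.
_LINE = re.compile(r"^(\w+):\s+(\d+)\s+kB\s*$")


def parse_meminfo(text):
    """Return {field name: value in kB} for every 'Name: N kB' line."""
    fields = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def swap_used_kb(fields):
    """Swap currently in use, in kB (SwapTotal - SwapFree)."""
    return fields["SwapTotal"] - fields["SwapFree"]


def sample_node(node):
    """Read /proc/meminfo on a remote node through bpsh, as done above."""
    out = subprocess.run(
        ["bpsh", str(node), "cat", "/proc/meminfo"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    return parse_meminfo(out)
```

Calling sample_node(0) every few seconds while lapw1 runs, and logging
MemFree together with swap_used_kb(...), would show how quickly the
node's memory disappears and when swapping begins.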


"ps -efl" gives the same result on the front node 
as on the remote nodes (via "bpsh 0 ps -efl"): in our case "186888",
amounting to something like 30 MB, given that we have 512 MB on both the
front and remote nodes. However, the process takes about 2 minutes
to turn from "init" into the actual "lapw1" we run,
and it sometimes fails with a "BProc move failed", maybe because
NFS is hit hard in the process.

We looked at the open files with "lsof", and they are the same
on the front node and the remote nodes. No error files seem to
be open on the remote node. I did, however, write
a small C program to fill the ramdisk, and when the ramdisk was full
nothing peculiar happened: the node did not start to swap!
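
For reference, a ramdisk-filling test of this kind amounts to something
like the sketch below. It is a rough Python equivalent, not the original
C program; the function name, the chunk size and the optional byte limit
are illustrative assumptions, and the path given must be a file on the
ramdisk of the node under test.

```python
import errno


def fill_file(path, chunk_size=1 << 20, limit=None):
    """Write zero bytes to `path` until the filesystem is full.

    Returns the number of bytes written. `limit` is a hypothetical
    safety cap: if given, writing stops after that many bytes.
    """
    chunk = b"\0" * chunk_size
    written = 0
    # buffering=0 so each write() reaches the filesystem immediately
    # and an out-of-space error is raised inside the loop.
    with open(path, "wb", buffering=0) as f:
        while limit is None or written < limit:
            n = chunk_size if limit is None else min(chunk_size, limit - written)
            try:
                written += f.write(chunk[:n])
            except OSError as e:
                if e.errno == errno.ENOSPC:  # no space left on device
                    break
                raise
    return written
```

Run on the node with no limit, it stops once the ramdisk reports that no
space is left; watching "free -t" at that point shows whether a full
ramdisk by itself pushes the node into swap.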

By the way, "top" does not run properly: when run on the front node it
reports a size of "0" for remote processes, and it refuses to run on
remote nodes ("bpsh 0 top" fails with an unknown TERM error; I
guess termcap is not installed properly on the
remote nodes, and I even tried TERM=vt100).


Thanks for the help anyway

If anyone has any ideas...

-- 
Florent Calvayrac                          | 
Laboratoire de Physique de l'Etat Condense | 
UMR-CNRS 6087         | http://www.univ-lemans.fr/~fcalvay 
Universite du Maine-Faculte des Sciences   |
72085 Le Mans Cedex 9


