[Beowulf] PVM log limitations

Fri Nov 26 09:44:35 PST 2004

On Thu, 25 Nov 2004, Patrick Begou wrote:

> We are working on a ten-node Beowulf cluster, and the application
> runs under PVM. From time to time one of the slaves crashes (not pvmd).
> On the slave nodes, the /tmp/pvml.xxx file does not contain any
> information; it just says that the process seems to have died.
> 
> On the master node, the /tmp/pvml.xxx file ends with "login truncated" and
> only logs the first hours of calculation. The crash occurs after
> several days of calculation.
> 
> I was unable to find any information about how to keep logging from being
> interrupted on the master node, either in the PVM docs or in the numerous
> Web documents.
> 
> Does anybody know how to keep PVM from ceasing to log the standard
> output in the master's /tmp/pvml.xxx file?

I don't know that offhand, but there are multiple more direct
alternatives open to you for debugging the slave crashes, if the slaves
have local disk resources (e.g. a local /tmp).

One is to open a file (on each slave) and redirect stdout and stderr
from the slave process into the file, possibly wrapping the actual slave
binary in a wrapper shell script for that purpose.  Another is to
instrument your slave to open a log file and direct debugging output
into that file.  A third is to examine any core dumps the slave
application generates, to see what it was doing when it died.  A
fourth thing that MAY help is to keep an eye on e.g. slave memory
consumption while the process is running.  A crash after several days
of otherwise ordinary computation sounds like it could be caused by a
memory leak: if you are allocating an array (for example) in one of
your core loops and then not freeing it before reallocating it, it will
gradually eat all available memory until the application and/or the node
itself crashes.
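
As a rough illustration of the second and fourth suggestions, here is a
minimal C sketch, assuming a POSIX system with a writable local /tmp on
each slave.  The function names and the /tmp/slave-<host>-<pid>.log
naming scheme are made up for the example and have nothing to do with
PVM itself.  Note that getrusage() reports PEAK resident set size (in
kilobytes on Linux; other systems may use different units), so a value
that keeps climbing over hours of steady computation is only a hint of a
leak, not proof.

```c
#define _XOPEN_SOURCE 600   /* expose gethostname() and getrusage() */
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Build a per-host, per-process log path (hypothetical naming scheme).
 * Returns 0 on success, -1 if the hostname is unavailable or buf is too small. */
int make_log_path(char *buf, size_t len)
{
    char host[256];
    int n;

    assert(buf != NULL);
    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';   /* gethostname() may not terminate */
    n = snprintf(buf, len, "/tmp/slave-%s-%ld.log", host, (long)getpid());
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Send everything later written to stdout and stderr into the file at
 * path, appending.  Returns 0 on success, -1 on failure (in which case
 * the stream that failed to reopen is no longer usable). */
int redirect_output(const char *path)
{
    assert(path != NULL);
    if (freopen(path, "a", stdout) == NULL)
        return -1;
    if (freopen(path, "a", stderr) == NULL)
        return -1;
    /* Keep buffering small so little output is lost if the slave dies. */
    setvbuf(stdout, NULL, _IOLBF, 0);
    setvbuf(stderr, NULL, _IONBF, 0);
    return 0;
}

/* Write the peak resident set size to stderr and return it, or -1 if
 * getrusage() fails. */
long log_peak_rss(const char *label)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    fprintf(stderr, "[%s] peak RSS: %ld\n", label, ru.ru_maxrss);
    return ru.ru_maxrss;
}
```

A slave would call make_log_path() and redirect_output() once at
startup, and then log_peak_rss() every so many iterations of its main
loop.  The wrapper-script approach gets the same redirection without
touching the slave source at all.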

I've done a bit of all of these to debug parallel applications, which is
generally NOT a terribly easy thing to do because the fault could be in
the slave (these suggestions assume that it is) or it could be in the
master -- there are plenty of opportunities to misuse buffers and so
forth to cause a slave to crash due to something the master is passing
it.
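
One cheap defense on the slave side, sketched below in plain C, is to
sanity-check any item count the master sends before using it to fill a
fixed-size buffer.  The function name and the capacity constant are
hypothetical; in a PVM slave a check like this would sit between
unpacking the count (e.g. with pvm_upkint) and unpacking the payload
(e.g. with pvm_upkdouble).  It will not find the bug in the master, but
it turns a mysterious crash into a log line.

```c
#include <assert.h>
#include <stdio.h>

#define WORK_CAPACITY 1024   /* hypothetical size of the slave's work buffer */

/* Return 1 if an item count received from the master fits a buffer
 * holding `capacity` items; otherwise complain on stderr and return 0. */
int count_is_safe(int count, int capacity)
{
    assert(capacity >= 0);
    if (count < 0 || count > capacity) {
        fprintf(stderr, "bad item count %d from master (capacity %d)\n",
                count, capacity);
        return 0;
    }
    return 1;
}
```

A slave that gets 0 back can log the offending message and exit
cleanly instead of overrunning its buffer.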

Hope this helps some, although it doesn't actually answer your question.

   rgb

> 
> Thanks for your help
> 
> Patrick
> -- 
> ===============================================================
> |  Equipe M.O.S.T.         | http://most.hmg.inpg.fr          |
> |  Patrick BEGOU           |       ------------               |
> |  LEGI                    | mailto:Patrick.Begou at hmg.inpg.fr |
> |  BP 53 X                 | Tel 04 76 82 51 35               |
> |  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71               |
> ===============================================================

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu