[Beowulf] PVM log limitations
Robert G. Brown rgb at phy.duke.edu
Fri Nov 26 09:44:35 PST 2004
On Thu, 25 Nov 2004, Patrick Begou wrote:

> We are working on a ten nodes beowulf cluster and the application is
> running with PVM. Time to time one of the slave crashes (not pvmd).
> On the slave nodes, the /tmp/pvml.xxx file do not contains any
> information, just saying that the process seems to be died.
>
> On the master node, the /tmp/pvml.xxx file ends by "login truncated" and
> it is only loging the first hours of calculation. The crash occur after
> several days of calculation.
>
> I was unable to find any information about how to force login not to be
> interrupted on the master node, nor in the PVM doc, nor in the numerous
> Web documents.
>
> Does somebody knows how to force pvm not stopping login the standart
> output in the master's /tmp/pvml.xxx file ?

I don't know that offhand, but there are multiple more direct alternatives open to you for debugging the slave crashes, if the slaves have local disk resources (e.g. a local /tmp).

One is to open a file (on each slave) and redirect stdout and stderr from the slave process into the file, possibly wrapping the actual slave binary in a wrapper shell script for that purpose.

Another is to instrument your slave to open a log file and direct debugging output into that file.

A third is to debug the slave application core dumps, if any are generated, to see what they were doing when they died.

A fourth thing that MAY help is to keep an eye on e.g. slave memory consumption while the process is running -- a crash after several days of otherwise ordinary computation sounds like it could be caused by a memory leak, and if you are allocating an array (for example) in one of your core loops and then not freeing it before reallocating it, it would gradually eat all available memory until the application and/or the node itself crashed.
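
As an illustration of the first and fourth suggestions above, here is a minimal sketch of a wrapper that runs a slave binary with stdout and stderr appended to a node-local log file, and periodically records the child's resident memory. It is not part of PVM. The names run_logged and read_rss_kb, the log line format, and the 60 second sampling interval are all hypothetical choices made for this example, and the memory reading assumes a Linux /proc filesystem. Whether a task started through such a wrapper enrolls with pvmd correctly should be checked against your own PVM installation.

```python
import subprocess
import time


def read_rss_kb(pid):
    """Return the resident set size of pid in kB, or None if unavailable.

    Assumption: a Linux-style /proc/<pid>/status file with a VmRSS line.
    """
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return None


def run_logged(argv, log_path, sample_seconds=60.0):
    """Run argv with stdout and stderr appended to log_path.

    Returns the child's exit code; a negative value -N means the child
    was terminated by signal N (subprocess convention on POSIX).
    """
    # Unbuffered append mode, so wrapper and child lines both land at
    # the end of the file and survive a sudden crash.
    with open(log_path, "ab", buffering=0) as log:
        log.write(f"[wrapper] starting {argv!r}\n".encode())
        child = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        while True:
            try:
                code = child.wait(timeout=sample_seconds)
                break
            except subprocess.TimeoutExpired:
                # Child still running: record a memory sample so that a
                # slow leak shows up as steadily growing VmRSS values.
                rss = read_rss_kb(child.pid)
                log.write(
                    f"[wrapper] {time.ctime()} VmRSS={rss} kB\n".encode()
                )
        if code < 0:
            log.write(f"[wrapper] killed by signal {-code}\n".encode())
        else:
            log.write(f"[wrapper] exit code {code}\n".encode())
    return code
```

A call such as run_logged(["./slave"], "/tmp/slave.log") (both paths are placeholders) leaves a local record of whatever the slave printed before it died, the signal that killed it if any, and a memory trace to compare against the leak hypothesis.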
I've done a bit of all of these to debug parallel applications, which is generally NOT a terribly easy thing to do because the fault could be in the slave (these suggestions assume that it is) or it could be in the master -- there are plenty of opportunities to misuse buffers and so forth to cause a slave to crash due to something the master is passing it.

Hope this helps some, although it doesn't actually answer your question.

   rgb

>
> Thanks for your help
>
> Patrick
> --
> ===============================================================
> | Equipe M.O.S.T.      | http://most.hmg.inpg.fr             |
> | Patrick BEGOU        | ------------                        |
> | LEGI                 | mailto:Patrick.Begou at hmg.inpg.fr |
> | BP 53 X              | Tel 04 76 82 51 35                  |
> | 38041 GRENOBLE CEDEX | Fax 04 76 82 52 71                  |
> ===============================================================
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
