[Beowulf] mpirun issue
Reuti reuti at staff.uni-marburg.de
Wed Oct 22 11:17:47 PDT 2008
On 21.10.2008 at 22:21, Luis Alejandro Del Castillo Riley wrote:

> And ps -e f shows that it is running fine until the processes crash
> with the error "broken pipe" and a killing signal

From this I would assume that one process crashed and that you are seeing only the follow-up error. Maybe it ran out of memory or disk space. It might depend on the application and how it distributes the data; maybe with ten nodes some array or similar grew too big over the runtime of the job.

If you can spot the node that crashes, maybe you can find something in /var/log/messages on that node.

-- Reuti

> On Tue, Oct 21, 2008 at 2:50 PM, Luis Alejandro Del Castillo Riley
> <acastillo22 at gmail.com> wrote:
> hi
> yes, I have 10 nodes, each with an Intel Xeon quad core, so basically
> 4 processors per node
>
> On Tue, Oct 21, 2008 at 7:53 AM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> Hi,
>
> On 21.10.2008 at 01:18, Luis Alejandro Del Castillo Riley wrote:
>
> hi fellows, I have a cluster with 1 master and 10 nodes with Intel Xeon
> quad core.
> Fedora Core 6
> PGI 7.0-7
> mpich 1.2.5.2
>
> The last version of MPICH, from 2005, is 1.2.7p1. For newer
> installations I would suggest looking into Open MPI.
>
> machines.x86_64 with 10 node names
>
> Meaning only the 10 nodes?
>
> when I try to run:
> mpirun -v -arch x86_64 -keep_pg -nolocal -np 9 mm5.mpp
>
> I get no error, but when I run with
> mpirun -v -arch x86_64 -keep_pg -nolocal -np 10 mm5.mpp
>
> it takes around 40 min to send me an error:
> bm_list_4667: (1526.781250) wakeup_slave: unable to interrupt slave 0 pid 4666
>
> With so much time, I would suggest logging in to all nodes and checking
> with:
>
> $ ps -e f
>
> (f without a leading -) the distribution and startup of the processes. Is it
> doing nothing for 40 minutes, or running fine until it crashes?
>
> -- Reuti
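
The per-node checks suggested in this thread (ps -e f, free disk space, /var/log/messages) could be scripted roughly as below. This is only a sketch under assumptions that do not come from the thread: passwordless ssh from the master to every node, a machines file with one hostname per line (optionally "host:ncpu"), permission to read /var/log/messages on the nodes, and the grep patterns, which are a guess at what an out-of-memory kill might look like in the log. The function names list_nodes and check_node are made up for illustration.

```shell
#!/bin/sh
# Sketch: run the checks from this thread on every node in a machines file.
# Assumptions (not from the thread): passwordless ssh, one hostname per line
# (optionally "host:ncpu"), and read access to /var/log/messages on the nodes.

MACHINES_FILE="${1:-machines.x86_64}"

# Print one hostname per line, dropping comments, blank lines,
# and an optional ":ncpu" suffix.
list_nodes() {
    awk '{ sub(/#.*/, ""); sub(/:.*/, ""); if ($1 != "") print $1 }' "$1"
}

# Show the process tree, disk usage, and recent out-of-memory hints on one node.
# The grep patterns are assumed examples; adjust them to what the log contains.
check_node() {
    node="$1"
    echo "=== $node ==="
    # -n keeps ssh from swallowing the node list on stdin.
    ssh -n "$node" 'ps -e f; df -h; grep -i -E "out of memory|oom-killer" /var/log/messages | tail -n 20'
}

if [ -r "$MACHINES_FILE" ]; then
    list_nodes "$MACHINES_FILE" | while read -r node; do
        check_node "$node"
    done
else
    echo "cannot read machines file: $MACHINES_FILE" >&2
fi
```

Running it while the -np 10 job is still alive, and again right after the crash, may help narrow down which node fails first; the node whose mm5.mpp processes vanish earliest is the one whose log is worth reading in full.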
