[Beowulf] Weird problem with mpp-dyna
Joe Landman landman at scalableinformatics.com
Wed Mar 14 11:51:29 PDT 2007
- Previous message: [Beowulf] Weird problem with mpp-dyna
- Next message: [Beowulf] What is a "proper" machine count for a cluster
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Joshua Baker-LePain wrote:
>
> Running a simulation via 'mpirun -np 12' works just fine.  Running the
> same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
> session) with -np > 12 leads to the following output:

[...]

>  *** Error the number of solid elements 13929
>  defined on the thermal generation control
>  card is greater than the total number
>  of solids in the model 12985
> connect to address $ADDRESS: Connection timed out
> connect to address $ADDRESS: Connection timed out

When you set up that VM via LAM, you did a lamboot ....

Could you send the output of

    tping -c 3 N

for the larger VM?  Also, what does your machine file look like, and
could you share what

    lamboot -d machinefile

returns for N>12?  Note, that is a big bit of output, so you might want
to send that offline.

> where $ADDRESS is the IP address of the *public* interface of the node
> on which the job was launched.  Has anybody seen anything like this?

Yes, with a borked DNS server on a head node, coupled to an incorrectly
setup queuing system.  We have seen this at a few customer sites.

> Any ideas on why it would fail over a specific number of CPUs?

It doesn't sound like it is failing on a specific number of CPUs, more
like there is a public address, which likely has iptables on it,
preventing that node from reaching back into the private space.

>
> Note that the failure is CPU dependent, not node-count dependent.
> I've tried on clusters made of both dual-CPU machines and quad-CPU
> machines, and in both cases it took 13 CPUs to create the failure.
> Note also that I *do* have a user writing his own MPI code, and he has
> no issues running on >12 CPUs.

What do the machine files look like?  Are they auto generated?

>
> Thanks.
>

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615
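
The machine file and name resolution questions raised in the reply can be checked with a short script. What follows is only a rough Python sketch, not anything from the thread: it assumes a LAM-style boot schema with one host per line, an optional cpu=N field, and # comments. The names parse_machinefile and classify_hosts are hypothetical helpers. It reports whether each host name resolves to a private or a public address, which may help spot a node that is reached through its public interface.

```python
import ipaddress
import socket


def parse_machinefile(text):
    """Return (hostname, cpu_count) pairs from boot-schema text.

    Assumed format: "hostname [cpu=N] [other fields]", with # comments.
    """
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        cpus = 1  # assume one CPU when no cpu=N field is given
        for field in fields[1:]:
            if field.startswith("cpu="):
                cpus = int(field[len("cpu="):])
        hosts.append((fields[0], cpus))
    return hosts


def classify_hosts(hosts, resolve=socket.gethostbyname):
    """Resolve each host and label its address as private or public.

    `resolve` is injectable so the logic can be exercised without DNS.
    """
    report = []
    for name, cpus in hosts:
        try:
            addr = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            report.append((name, cpus, None, "unresolved"))
            continue
        private = ipaddress.ip_address(addr).is_private
        report.append((name, cpus, addr, "private" if private else "public"))
    return report
```

Running classify_hosts(parse_machinefile(open("machinefile").read())) on each node, where "machinefile" stands in for the real path, would show whether any entry resolves differently on the head node than on the compute nodes. Summing the cpu counts in the parsed list also shows which host supplies the 13th CPU.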
