[Beowulf] Weird problem with mpp-dyna


Joshua Baker-LePain jlb17 at duke.edu
Wed Mar 14 06:33:17 PDT 2007


I have a user trying to run a coupled structural-thermal analysis using 
mpp-dyna (mpp971_d_7600.2.398).  The underlying OS is CentOS 4 on x86_64 
hardware.  We use our cluster largely as a COW, so all the cluster nodes 
have both public and private network interfaces.  All MPI traffic is 
passed on the private network.

Running a simulation via 'mpirun -np 12' works just fine.  Running the 
same sim (on the same virtual machine, even, i.e. in the same 'lamboot' 
session) with -np > 12 leads to the following output:

Performing Decomposition -- Phase 3 03/12/2007 11:47:53


*** Error the number of solid elements 13881
defined on the thermal generation control
card is greater than the total number
of solids in the model 12984

*** Error the number of solid elements 13929
defined on the thermal generation control
card is greater than the total number
of solids in the model 12985
connect to address $ADDRESS: Connection timed out
connect to address $ADDRESS: Connection timed out

where $ADDRESS is the IP address of the *public* interface of the node on 
which the job was launched.  Has anybody seen anything like this?  Any 
ideas on why it would fail only above a specific number of CPUs?

Note that the failure depends on the CPU count, not the node count.
I've tried on clusters made of both dual-CPU machines and quad-CPU
machines, and in both cases it took 13 CPUs to trigger the failure.
Note also that I *do* have a user writing his own MPI code, and he has no 
issues running on >12 CPUs.

Thanks.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


