MPICH hangs with dual cpu computers when np is using all cpus
Nina Nitroy nina at microway.com
Thu Aug 2 10:52:27 PDT 2001
I build many different kinds of Beowulf clusters and have noticed a trend of increased problems on clusters using MPICH and dual cpu computers.

The problem presents itself this way: MPI programs with an np parameter asking for the total number of cpus (e.g. for 20 dual cpu computers, using -np 40) will run for several hours and then hang. The errors are along the lines of:

    p4_error Timeout establishing connection to remote process:0 interrupt SIGINT: 2

or

    Trying to receive a message when there are no connections.

A simple restart of the job starts up ok and continues, because network and rsh connectivity is still working fine. You can ping and rsh every node.

A more dramatic example is that one of the nodes will no longer be rsh-able and has to be restarted. In that case, the error messages are more like:

    p4_error net_recv read: probable eof on socket:

I have done tests using different switches, different ethernet cards (Kingstons, Intel Etherpro, etc), different device drivers, and different motherboards. The permutations seem endless. I have tested a variety of 2.4+ kernels and both the GNU and PGI compilers.

In all cases I am using Red Hat 7.0 and rsh.
In all cases, the MPI source is version 1.2.0.
In all cases, using an np amount equal to half the number of cpus instead (e.g. for 20 dual cpu computers, using -np 20), there are NO problems at all.
In all cases, the computers pass rigorous stress testing, and tests on the network itself fail to show a problem.

I have used the Portland Group compilers to build MPI 1.2.0 from scratch as well as using their canned 3.2-4. MPICH was built on a Linux 6.2 system. I haven't tried a fresh MPI build yet.

To reproduce this problem, I am using the standard MPI examples, cpi and srtest, and executing them over and over again in a script. Smaller numbers of computers (5-15 computers) work fine. More than that will fail eventually, and generally, the more computers in the test, the more quickly it will fail.
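For anyone wanting to try the same reproduction, a loop of the following kind might serve as a starting point. This is only a sketch, not the script actually used here: the function name run_until_failure, the run limit, and the 600 second timeout (used to tell a hang apart from a slow run) are all assumptions made for illustration.

```python
import subprocess


def run_until_failure(cmd, max_runs=1000, timeout_s=600):
    """Run cmd repeatedly and stop at the first failure.

    Returns (run_number, reason) for the first run that hangs or exits
    with a nonzero status, or None if every run succeeds.
    The timeout is an assumed stand-in for "the job has hung".
    """
    for run in range(1, max_runs + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # The local child is killed on timeout; processes started on
            # remote nodes may still need to be cleaned up by hand.
            return run, "hang (no exit within %d s)" % timeout_s
        if result.returncode != 0:
            return run, "exit code %d: %s" % (
                result.returncode,
                result.stderr.strip(),
            )
    return None


# Hypothetical usage for 20 dual cpu computers, run from the directory
# holding the cpi example:
#   failure = run_until_failure(["mpirun", "-np", "40", "./cpi"])
#   print(failure or "all runs completed")
```

Running the same loop once with -np 40 and once with -np 20 would give a direct comparison of the failing and non-failing cases described above.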
The machines list is very simple. It is a list of every computer for the test, each listed once.

I will build MPI 1.2.1 to see how it affects the situation, but I fail to understand what is different now compared with previously, when I had dozens of clusters running without fail on all sorts of combinations. This all seemed to start getting worse when I began using Red Hat 7.0, but that could be a coincidence.

Any ideas?

Nina
