Basic channel bonding question/problem
Martin Siegert siegert at sfu.ca
Mon Feb 18 14:05:34 PST 2002
- Previous message: Basic channel bonding question/problem
- Next message: How to tell when a job is swapping?
On Mon, Feb 18, 2002 at 04:22:36PM -0500, rhixon wrote:
>
> I'm trying to compare a 3x100T channel bonding vs. a single 1000T connection. The hardware/software for the channel bonding is: 6 Dell P4 1700/1800 boxes, all with Red Hat 7.2. Each box has 3 3Com 3C905B NIC cards, bonded together. I have them connected via 3 DLink 16-port 100T switches. The program I'm running is a Fortran 90 code using MPI; the MPI I'm using is LAM-6.5.6. The compiler is the Intel IFC 6.0 beta.
>
> Here's the problem: Using blocking sends/recvs (MPI_SEND), the code works fine, and performance is pretty good. However, when using nonblocking sends/recvs (MPI_ISEND), the code runs for a bit and appears to lock up semi-randomly (not at the same place every time).
>
> The code works fine on a single channel 100T or 1000T. It duplexes fine on a single channel, but on the channel bonding, when it does run, the performance is no better than the blocking calls (appears not to be duplexing).
>
> OK, so is this normal? Is there anything in the hardware/software I can fix? If I can't use nonblocking communications, my code is in serious trouble -- better to find out now with only 6 machines before I buy the next 64.
>

1) If you just want to benchmark 3x100T vs. 1000T, the best tool is NetPIPE: www.scl.ameslab.gov/netpipe/

2) If you are more interested in reliable MPI performance: you may have run into a problem that I have not been able to solve, although I have spent a huge amount of time debugging it. Some programs that do nonblocking send/recv just hang under LAM. The problem appears randomly (as in your case); however, just by running the same code repeatedly, you can make a program hang with probability close to 1. The fix is actually quite simple: use MPICH (version 1.2.2 or later). At least in the case I've looked at, the hang never happened with MPICH.

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada V5A 1S6
========================================================================
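
For readers who want to see what a NetPIPE-style measurement boils down to, here is a minimal ping-pong sketch in Python. It is not NetPIPE and not taken from the message above: the function names (`ping_pong`, `recv_exact`, `echo_server`), the message sizes, and the round count are all illustrative assumptions. As written it runs both ends over the loopback interface, so it only exercises the timing logic; comparing 3x100T bonding against 1000T would need the echo side running on a second host, and the real NetPIPE tool remains the better choice for that.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket (recv may return less)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(listener, nbytes, rounds):
    """Accept one connection and echo back `rounds` messages of `nbytes`."""
    conn, _ = listener.accept()
    with conn:
        for _ in range(rounds):
            conn.sendall(recv_exact(conn, nbytes))


def ping_pong(nbytes, rounds=20, host="127.0.0.1"):
    """Return (seconds per one-way trip, one-way throughput in Mbit/s)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server,
                              args=(listener, nbytes, rounds))
    server.start()

    payload = b"x" * nbytes
    with socket.create_connection((host, port)) as client:
        # Disable Nagle's algorithm so small messages are not delayed.
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(rounds):
            client.sendall(payload)      # ping
            recv_exact(client, nbytes)   # pong
        elapsed = time.perf_counter() - start
    server.join()
    listener.close()

    one_way = elapsed / (2 * rounds)     # each round is two one-way trips
    mbits = (nbytes * 8) / one_way / 1e6
    return one_way, mbits


if __name__ == "__main__":
    for size in (64, 1024, 65536, 1048576):
        t, bw = ping_pong(size)
        print(f"{size:>8} bytes  {t * 1e6:10.1f} us  {bw:10.1f} Mbit/s")
```

Sweeping the message size matters because small messages are dominated by latency and large ones by bandwidth, and bonded links may behave differently in the two regimes.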
More information about the Beowulf mailing list
