Basic channel bonding question/problem

Martin Siegert siegert at sfu.ca
Mon Feb 18 14:05:34 PST 2002


On Mon, Feb 18, 2002 at 04:22:36PM -0500, rhixon wrote:
> 
> I'm trying to compare a 3x100T channel bonding vs. a single 1000T connection.  The hardware/software for the channel bonding is:  6 Dell P4 1700/1800 boxes, all with Red Hat 7.2.  Each box has 3 3Com 3C905B NIC cards, bonded together.  I have them connected via 3 DLink 16-port 100T switches.  The program I'm running is a Fortran 90 code using MPI; the MPI I'm using is LAM-6.5.6.  The compiler is the Intel IFC 6.0 beta.
> 
> Here's the problem:  Using blocking sends/recvs (MPI_SEND), the code works fine, and performance is pretty good.  However, when using nonblocking sends/recvs (MPI_ISEND), the code runs for a bit and appears to lock up semi-randomly (not at the same place every time).  
> 
> The code works fine on a single channel 100T or 1000T.  It duplexes fine on a single channel, but on the channel bonding, when it does run, the performance is no better than the blocking calls (appears not to be duplexing).
> 
> OK, so is this normal?  Is there anything in the hardware/software I can fix?  If I can't use nonblocking communications, my code is in serious trouble -- better to find out now with only 6 machines before I buy the next 64.
> 

1) If you just want to benchmark 3x100T vs. 1000T, the best tool is
   NetPIPE:

   www.scl.ameslab.gov/netpipe/

2) If you are more interested in reliable MPI performance: you may have
   run into a problem that I have not been able to solve, although I have
   spent a huge amount of time debugging it. Some programs that do
   nonblocking send/recv just hang under LAM. The hang appears randomly
   (as in your case); however, just by running the same code repeatedly,
   you can make a program hang with probability close to 1.
   The workaround is actually quite simple: use MPICH (version 1.2.2 or
   later). At least in the case I've looked at, the hang never happened
   with MPICH.
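
The "run it repeatedly until it hangs" test in point 2 can be automated.
What follows is only a minimal sketch in Python, not something from the
original debugging: the function name count_hangs, the number of runs, and
the timeout are all assumptions chosen for illustration, and a run is
simply treated as hung when it exceeds the timeout.

```python
import subprocess


def count_hangs(cmd, runs=20, timeout_s=60.0):
    """Run cmd `runs` times and return (finished, hung).

    A run counts as hung when it is still going after timeout_s seconds.
    subprocess.run() kills the child it started before raising
    TimeoutExpired, but only that direct child (e.g. mpirun); ranks left
    behind on other nodes may still need to be cleaned up by hand.
    A run that exits with a non-zero status still counts as finished.
    """
    finished = 0
    hung = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            finished += 1
        except subprocess.TimeoutExpired:
            hung += 1
    return finished, hung
```

A hypothetical call would be count_hangs(["mpirun", "-np", "6", "./mycode"],
runs=50, timeout_s=300.0), where ./mycode and the process count stand in for
your own program and launch line. Pick a timeout well above the normal run
time so that slow runs are not miscounted as hangs, and run the same check
under both MPI implementations to compare.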

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================



More information about the Beowulf mailing list