[Beowulf] tcp error: Need ideas!


Greg Lindahl lindahl at pbm.com
Wed Jan 21 14:49:33 PST 2009


On Wed, Jan 21, 2009 at 04:40:26PM -0600, Gerry Creager wrote:

> We're seeing the following error in WRF compiled with openMPI and the  
> PGI 7.2 compiler:
> mca_btl_tcp_frag_send:writev failed with errno=104

It's unfortunate that OpenMPI is following in the footsteps of MPICH
and doesn't print out that 104 = "Connection reset by peer".

The OpenMPI FAQ has some info about that:

http://open-mpi.basemirror.de/faq/?category=tcp

> While all nodes were accessible prior to the run and returned  
> appropriate "stuff" when queried with, eg., ssh and a command, two nodes  
> now return something like this:
> [gerry@brazos SCOOP12km]$ ssh c0522
> Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.

That's kinda interesting. Perhaps the network chip got into a really
funny state, and is corrupting packets? Power off for a while.
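
One quick check on that "Bad packet length" number: SSH sends the packet length as a 32-bit big-endian integer, so it can be useful to look at the four bytes behind the reported value. The sketch below does only that; the function name length_field_as_bytes is invented for illustration, and the result does not by itself say where the bytes came from.

```python
import struct


def length_field_as_bytes(length):
    """Pack a 32-bit length in big-endian (network) order.

    length_field_as_bytes is a hypothetical helper used only for this check.
    """
    return struct.pack(">I", length)


if __name__ == "__main__":
    # 808464432 is the value from the sshd disconnect message quoted above.
    raw = length_field_as_bytes(808464432)
    print(raw.hex(), raw)
```

The value is 0x30303030, which is the ASCII text "0000". So the peer read four printable zero characters where it expected a binary length, which fits garbage in the stream rather than a merely oversized packet.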

-- greg



