[Beowulf] tcp error: Need ideas!

Greg Lindahl lindahl at pbm.com
Wed Jan 21 14:49:33 PST 2009


On Wed, Jan 21, 2009 at 04:40:26PM -0600, Gerry Creager wrote:

> We're seeing the following error in WRF compiled with openMPI and the  
> PGI 7.2 compiler:
> mca_btl_tcp_frag_send:writev failed with errno=104

It's unfortunate that Open MPI is following in the footsteps of MPICH
and doesn't print out that errno 104 = "Connection reset by peer".
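
For anyone who wants to translate such a number themselves, here is a
minimal sketch in Python. The helper name describe_errno is made up for
illustration, and the mapping of 104 to ECONNRESET is what Linux uses;
other platforms may number their errno values differently.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (code): message' for an errno value on this platform."""
    # errno.errorcode maps numeric values to symbolic names where known.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for the code.
    return f"{name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # 104 is the value from the quoted Open MPI message.
    print(describe_errno(104))
```

Run it on one of the compute nodes so the lookup uses that system's
errno table.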

The Open MPI FAQ has some info about TCP problems:

http://open-mpi.basemirror.de/faq/?category=tcp

> While all nodes were accessible prior to the run and returned  
> appropriate "stuff" when queried with, eg., ssh and a command, two nodes  
> now return something like this:
> [gerry@brazos SCOOP12km]$ ssh c0522
> Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.

That's kinda interesting. Perhaps the network chip got into a really
funny state and is corrupting packets? Try powering the node off for a
while.
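
One thing that can be checked by hand is what that "packet length"
looks like as raw bytes. The sketch below packs the number as a 32-bit
big-endian integer, which is how the SSH transport encodes its
packet length field; the helper name decode_length_field is made up
for illustration.

```python
import struct


def decode_length_field(value: int) -> bytes:
    """Pack a 32-bit value big-endian, as it would appear on the wire."""
    return struct.pack(">I", value)


if __name__ == "__main__":
    raw = decode_length_field(808464432)
    # Show the hex form next to the four raw bytes.
    print(hex(808464432), raw)
```

The value is 0x30303030, i.e. the four ASCII characters "0000", so the
client apparently read text bytes where it expected a binary length.
That alone doesn't say where those bytes came from, and it neither
confirms nor rules out a misbehaving network chip.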

-- greg



