[Beowulf] tcp error: Need ideas!
lindahl at pbm.com
Wed Jan 21 14:49:33 PST 2009
On Wed, Jan 21, 2009 at 04:40:26PM -0600, Gerry Creager wrote:
> We're seeing the following error in WRF compiled with openMPI and the
> PGI 7.2 compiler:
> mca_btl_tcp_frag_send:writev failed with errno=104
It's unfortunate that OpenMPI follows in the footsteps of MPICH and
doesn't print out that errno 104 on Linux is ECONNRESET, "Connection
reset by peer".
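If you want to decode a bare errno yourself, here is a minimal sketch in
Python. The helper name describe_errno is made up for illustration, and
the numeric values are platform-specific: 104 maps to ECONNRESET on
Linux, but other systems number their errnos differently, so run it on
the same kind of machine that produced the error.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric values to symbolic names such as 'ECONNRESET'.
    name = errno.errorcode.get(code, "unknown")
    # os.strerror returns the C library's human-readable message for the code.
    return f"{name}: {os.strerror(code)}"


# The value reported by mca_btl_tcp_frag_send in the quoted error.
print(describe_errno(104))
```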
The OpenMPI FAQ has some info about that:
> While all nodes were accessible prior to the run and returned
> appropriate "stuff" when queried with, eg., ssh and a command, two nodes
> now return something like this:
> [gerry@brazos SCOOP12km]$ ssh c0522
> Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.
That's kinda interesting. Perhaps the network chip got into a really
funny state, and is corrupting packets? Try powering the affected nodes
off for a while.
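One thing that may help when staring at a "Bad packet length": the SSH
binary packet protocol starts each packet with a 4-byte length in
network byte order, so you can turn the reported number back into the
bytes the client actually read. A rough sketch (packet_length_bytes is
just an illustrative name):

```python
import struct


def packet_length_bytes(length: int) -> bytes:
    """Return the four big-endian bytes that encode a uint32 length field."""
    return struct.pack(">I", length)


raw = packet_length_bytes(808464432)
print(raw.hex())  # 30303030
print(raw)        # b'0000'
```

Here the four bytes are all 0x30, the ASCII character "0". That only
tells you what arrived where a length was expected; the number alone
doesn't say what put those bytes on the wire.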