[Beowulf] tcp error: Need ideas!
gerry.creager at tamu.edu
Wed Jan 21 14:40:26 PST 2009
History/background/description of the cluster
* 126-node Dell 1950 cluster with dual quad-core Xeons
* HP 5412zl switch for gigabit cluster backplane and 10GbE interconnect
to selected services (file server, etc.)
* Gigabit interconnect
* Hand compiled 2.6.26 kernel
* bnx2 module loaded for the Broadcom onboard NICs
* Switch, compute nodes, head node set to 9000 byte MTU
We're seeing the following error in WRF compiled with Open MPI and the
PGI 7.2 compiler:
mca_btl_tcp_frag_send:writev failed with errno=104
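For reference, errno values are platform-specific; on Linux, 104 is ECONNRESET ("Connection reset by peer"), meaning the remote end tore down the TCP connection. A minimal sketch for looking up a numeric errno on whatever machine it runs on (the helper name describe_errno is made up for illustration):

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'ECONNRESET'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the C library's human-readable message.
    return name, os.strerror(code)

name, message = describe_errno(104)
print(name, "-", message)
```

Run this on a compute node rather than a workstation with a different OS, since the number-to-name mapping differs between platforms.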
All nodes were accessible prior to the run and returned appropriate
"stuff" when queried with, e.g., ssh and a command, but two nodes
now return something like this:
[gerry@brazos SCOOP12km]$ ssh c0522
Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.
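One thing that may help when reading that message: ssh takes the first four bytes of a packet as a big-endian 32-bit length, so the reported number can be turned back into the bytes that were actually received. A small sketch (the function name is hypothetical) that does the conversion for the value above:

```python
import struct

def decode_packet_length(value):
    """Pack a 32-bit unsigned length as big-endian bytes and render them as ASCII."""
    raw = struct.pack(">I", value)
    return raw, raw.decode("ascii", errors="replace")

raw, text = decode_packet_length(808464432)
print(raw.hex(), repr(text))  # 30303030 '0000'
```

The value is 0x30303030, i.e. the four ASCII characters "0000". That suggests, without proving it, that the stream held unexpected or damaged data where a valid packet header should have been.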
I'm stumped and looking for causes and solutions. And yes, WRF as
compiled did run before the change to jumbo frames.
Should I reduce the MTU to something smaller, like 8800
bytes? 7500? 1500?
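Before changing the MTU, one way to check whether a given frame size actually makes it across the path is a ping with the don't-fragment bit set. The sketch below only builds the commands; it assumes IPv4 without options (20-byte IP header plus 8-byte ICMP header) and the Linux iputils ping, where -M do forbids fragmentation and -s sets the payload size. The helper names are made up for illustration.

```python
IP_HEADER = 20    # IPv4 header without options, in bytes
ICMP_HEADER = 8   # ICMP echo header, in bytes

def ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet of this MTU."""
    return mtu - IP_HEADER - ICMP_HEADER

def probe_command(host, mtu):
    """Build a ping command that sends three don't-fragment probes sized to the MTU."""
    return ["ping", "-M", "do", "-c", "3", "-s", str(ping_payload(mtu)), host]

for mtu in (9000, 8800, 7500, 1500):
    print(mtu, " ".join(probe_command("c0522", mtu)))
```

If the 9000-byte probe (payload 8972) fails between two nodes while the 1500-byte one (payload 1472) succeeds, something on that path is probably not passing jumbo frames.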
I'm not completely out of ideas, but I am stumped.
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843