[Beowulf] tcp error: Need ideas!

Gerry Creager gerry.creager at tamu.edu
Wed Jan 21 14:40:26 PST 2009


History/background/description of the cluster
* 126-node Dell 1950 cluster with dual quad-core Xeons
* HP 5412zl switch for gigabit cluster backplane and 10GbE interconnect to selected services (file server, etc.)
* Gigabit interconnect
* Hand compiled 2.6.26 kernel
* bnx2 module loaded for the Broadcom onboard nics
* Switch, compute nodes, head node set to 9000 byte MTU

We're seeing the following error in WRF compiled with Open MPI and the 
PGI 7.2 compiler:
mca_btl_tcp_frag_send:writev failed with errno=104
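
For readers decoding the message above: on Linux, errno 104 is ECONNRESET ("Connection reset by peer"), meaning the remote end dropped the TCP connection while writev was sending. A minimal sketch for looking up an errno value follows; the helper name describe_errno is hypothetical, and the result depends on the platform where it runs, so it should be run on one of the Linux nodes.

```python
import errno
import os


def describe_errno(code: int) -> tuple[str, str]:
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "ECONNRESET".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 104 is the value reported in the Open MPI message above.
name, message = describe_errno(104)
print(f"errno=104 -> {name}: {message}")
```

This only names the error; it does not say why the peer reset the connection.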

While all nodes were accessible prior to the run and returned 
appropriate "stuff" when queried with, e.g., ssh and a command, two nodes 
now return something like this:
[gerry@brazos SCOOP12km]$ ssh c0522
Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.
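
A note on the number in that message: SSH reads the first four bytes of each packet as a big-endian 32-bit length, so the "bad packet length" can be turned back into the bytes that were actually received. The sketch below does that conversion; the function name is hypothetical.

```python
import struct


def decode_packet_length(value: int) -> bytes:
    """Return the four bytes that were read as a big-endian uint32 length."""
    return struct.pack(">I", value)


raw = decode_packet_length(808464432)
print(hex(808464432), raw)  # 0x30303030 b'0000'
```

The value is the ASCII string "0000", which suggests the peer read text or zero-filled data where it expected a binary SSH packet. That is one possible reading only; it does not by itself identify whether the switch, the bnx2 driver, or something on the node is responsible.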

I'm stumped and looking for causes and solutions.  And yes, WRF as 
compiled did run before the change to jumbo frames.

Do I reduce the size of the frames to something smaller, like 8800 
bytes? 7500?  1500?
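
Before changing the MTU, one way to test whether a given frame size really passes end to end is a ping with fragmentation prohibited. The sketch below computes the ICMP payload size for each candidate MTU and prints a matching command. It assumes IPv4 without IP options and the Linux iputils ping (its -M do option sets the don't-fragment flag); c0522 is simply the node named above.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


# Candidate MTUs taken from the question above.
for mtu in (9000, 8800, 7500, 1500):
    size = icmp_payload_for_mtu(mtu)
    print(f"MTU {mtu}: ping -M do -s {size} c0522")
```

If the 8972-byte ping fails between two nodes while the 1472-byte one succeeds, something on that path is not passing 9000-byte frames.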

I'm not completely out of ideas, but I am stumped.

Thanks, gerry
-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


