[Beowulf] tcp error: Need ideas!
Joe Landman landman at scalableinformatics.com
Wed Jan 21 15:23:17 PST 2009
Hi Gerry

Gerry Creager wrote:
> History/background/description of the cluster
> * 126 node Dell 1950 cluster with dual-quad core Xeons
> * HP 5412zl switch for gigabit cluster backplane and 10GBE interconnect
>   to selected services (file server, etc)
> * Gigabit interconnect
> * Hand compiled 2.6.26 kernel
> * bnx2 module loaded for the Broadcom onboard nics
> * Switch, compute nodes, head node set to 9000 byte MTU

We have had *lots* of problems with Broadcom NICs and jumbo frames, from the 2.6.9 timeframe onwards.

> We're seeing the following error in WRF compiled with openMPI and the
> PGI 7.2 compiler:
> mca_btl_tcp_frag_send:writev failed with errno=104
>
> While all nodes were accessible prior to the run and returned
> appropriate "stuff" when queried with, eg., ssh and a command, two nodes
> now return something like this:
> [gerry@brazos SCOOP12km]$ ssh c0522
> Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.

Hmmm... sounds like a link tried re-negotiating. Can you get on via serial/console and run ethtool on it? For comparison, on one of our machines:

root@lightning:~# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x000000ff (255)
        Link detected: yes

You might want to

        ethtool -s eth0 autoneg off

to force it not to renegotiate its speed. Also, look at

root@lightning:~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             511
RX Mini:        0
RX Jumbo:       0
TX:             511
Current hardware settings:
RX:             200
RX Mini:        0
RX Jumbo:       0
TX:             511

See if you can do something like

        ethtool -G eth0 rx-jumbo 100

if you have zero jumbo ring rx entries.
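If you want to run that ring check across all 126 nodes rather than eyeballing each one, something like the following might help. This is only a rough sketch: it assumes the `ethtool -g` output looks like the sample above (newer ethtool versions may print `n/a` for unsupported values, which this simply skips), and the function names `parse_ring_params` and `jumbo_ring_is_empty` are made up for illustration.

```python
import re

# Sample text copied from the `ethtool -g eth0` output shown above.
SAMPLE = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             511
RX Mini:        0
RX Jumbo:       0
TX:             511
Current hardware settings:
RX:             200
RX Mini:        0
RX Jumbo:       0
TX:             511
"""

# Matches lines such as "RX Jumbo:       0"; non-numeric values are ignored.
FIELD = re.compile(r"^([A-Za-z ]+):\s+(\d+)$")


def parse_ring_params(text):
    """Split `ethtool -g` style output into 'max' and 'current' dicts."""
    sections = {"max": {}, "current": {}}
    active = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Pre-set maximums"):
            active = "max"
        elif line.startswith("Current hardware settings"):
            active = "current"
        elif active is not None:
            match = FIELD.match(line)
            if match:
                sections[active][match.group(1)] = int(match.group(2))
    return sections


def jumbo_ring_is_empty(text):
    """True when the current RX Jumbo ring has zero entries (or is absent)."""
    return parse_ring_params(text)["current"].get("RX Jumbo", 0) == 0


rings = parse_ring_params(SAMPLE)
print("current:", rings["current"])
print("RX Jumbo ring empty:", jumbo_ring_is_empty(SAMPLE))
```

On a live node the text could come from `subprocess.run(["ethtool", "-g", "eth0"], capture_output=True, text=True).stdout` instead of the pasted sample. Note that the helper only looks at the current setting; whether the ring can actually be raised depends on the pre-set maximum the driver reports.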
> I'm stumped and looking for causes and solutions. Yeah, the WRF as
> compiled did run before the change to Jumbos.
>
> Do I reduce the size of the frames to something smaller, like 8800
> bytes? 7500? 1500?

In the past I had heard that jumbo frames may work on Broadcom NICs around 6000 byte length. We haven't tried this in a while ... YMMV.

> I'm not completely out of ideas but stumped.
>
> Thanks, gerry

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
