[Beowulf] tcp error: Need ideas!

Fri Jan 23 05:49:23 PST 2009

First, thanks to all who've responded.  I've been digging a bit this 
morning and am trying to grok the results.

Joe Landman wrote:
> Hi Gerry
> 
> Gerry Creager wrote:
>> History/background/description of the cluster
>> * 126 node Dell 1950 cluster with dual-quad core Xeons
>> * HP 5412zl switch for gigabit cluster backplane and 10GBE 
>> interconnect to selected services (file server, etc)
>> * Gigabit interconnect
>> * Hand compiled 2.6.26 kernel
>> * bnx2 module loaded for the Broadcom onboard nics
>> * Switch, compute nodes, head node set to 9000 byte MTU
> 
> We have had *lots* of problems with Broadcom nics and jumbo frames. From 
> 2.6.9 timeframe onwards.

Marvelous.  I'd prefer to not have to back-rev if I can avoid it...

>>
>> We're seeing the following error in WRF compiled with openMPI and the 
>> PGI 7.2 compiler:
>> mca_btl_tcp_frag_send:writev failed with errno=104
>>
>> While all nodes were accessible prior to the run and returned 
>> appropriate "stuff" when queried with, eg., ssh and a command, two 
>> nodes now return something like this:
>> [gerry@brazos SCOOP12km]$ ssh c0522
>> Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.
> 
> Hmmm... sounds like a link tried re-negotiating.  Can you get on via 
> serial/console and

My guess is that the driver wandered across memory boundaries.  This 
stinks of a buffer problem to me.  Typically, after this happens, I 
can't log into the node over any network interface or on the console.  
It requires an IPMI or physical reboot.
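
One detail that seems to fit the buffer theory: 808464432 is 0x30303030, 
which is the four ASCII characters "0000".  SSH treats the first four 
bytes of each packet as a binary length, so the sshd on the node 
apparently saw text-like junk where a packet header should have been.  A 
minimal sketch for decoding such a value (decode_len is a made-up helper 
name, not part of any tool):

```shell
#!/bin/sh
# decode_len is a hypothetical helper: it renders a 32-bit "packet length"
# as the four bytes it was read from, to see whether they are really text.
decode_len() {
    hex=$(printf '%08x' "$1")            # e.g. 808464432 -> 30303030
    out=""
    for pos in 1 3 5 7; do
        byte=$(printf '%s' "$hex" | cut -c"$pos"-"$((pos + 1))")
        # hex byte -> 3-digit octal -> printf escape -> character
        out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
    done
    printf '%s\n' "$out"
}

decode_len 808464432    # prints 0000
```

This only helps when the bytes happen to be printable; for anything else 
the hex form from printf '%08x' is the thing to look at.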

> root@lightning:~# ethtool eth0

-bash-3.2# ethtool eth1
Settings for eth1:
         Supported ports: [ TP ]
         Supported link modes:   10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Full
         Supports auto-negotiation: Yes
         Advertised link modes:  10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Full
         Advertised auto-negotiation: Yes
         Speed: 1000Mb/s
         Duplex: Full
         Port: Twisted Pair
         PHYAD: 1
         Transceiver: internal
         Auto-negotiation: on
         Supports Wake-on: g
         Wake-on: d
         Link detected: yes

> You might want to
> 
>     ethtool eth0 autoneg off
> 
> to force it not to renegotiate its speed.  Also, look at

-bash-3.2# ethtool -A eth1 autoneg off
autoneg unmodified, ignoring
no pause parameters changed, aborting
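
Note that -A only touches the pause-frame (flow control) settings, which 
is why it reports that nothing changed; link speed/duplex negotiation is 
controlled with -s.  Since 1000BASE-T depends on autonegotiation, a 
gentler option than switching it off may be to leave it on and advertise 
only 1000baseT/Full.  A sketch, assuming eth1; the run wrapper and 
DRY_RUN variable are made up for illustration, and 0x020 is the ethtool 
advertise mask for 1000baseT/Full:

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

IFACE=${IFACE:-eth1}    # assumption: the cluster-facing interface

# -s changes link settings; advertise 0x020 offers 1000baseT/Full only,
# so a renegotiation cannot fall back to 10 or 100 Mb/s.
run ethtool -s "$IFACE" advertise 0x020
```

Run it with DRY_RUN=0 as root to apply the change; whether the bnx2 
driver honours it on this kernel is something to verify on one node first.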

> root@lightning:~# ethtool -g eth0

-bash-3.2# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:             1020
RX Mini:        0
RX Jumbo:       4080
TX:             255
Current hardware settings:
RX:             255
RX Mini:        0
RX Jumbo:       765
TX:             255

> See if you can do something like
> 
>     ethtool  -G eth0 rx-jumbo 100
> 
> if you have zero jumbo ring rx entries.

RX Jumbo is already at 765 entries, so it doesn't look like this 
requires much change.
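
For comparing ring settings across many nodes, something like the 
following could condense the ethtool -g output into "current of maximum" 
pairs.  This is only a sketch that assumes the output layout shown above; 
ring_headroom is a made-up name:

```shell
#!/bin/sh
# ring_headroom is a hypothetical filter: it reads `ethtool -g` output on
# stdin and prints each ring's current size next to its pre-set maximum.
ring_headroom() {
    awk -F: '
        /^Pre-set maximums/          { section = "max"; next }
        /^Current hardware settings/ { section = "cur"; next }
        section != "" && NF == 2 {
            gsub(/^[ \t]+|[ \t]+$/, "", $2)      # trim the value
            val[section, $1] = $2
            if (section == "max") order[++n] = $1
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s: %s of %s\n", order[i],
                       val["cur", order[i]], val["max", order[i]]
        }'
}

# Usage (needs ethtool and a real interface):
#   ethtool -g eth1 | ring_headroom
```

On the output above this would report RX at 255 of 1020 and RX Jumbo at 
765 of 4080, so there is headroom to raise either with ethtool -G if 
receive drops ever show up.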

Also, while I'm in the neighborhood, to respond to Mark's suggestions:

-bash-3.2# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

Hmmm.  Might be worth turning off TCP segmentation offload here.
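
A sketch of how that could be checked and changed, assuming eth1.  The 
tso_state filter is a made-up helper that reads ethtool -k output; it 
accepts both the spelling shown above and the hyphenated one that newer 
ethtool versions print:

```shell
#!/bin/sh
# tso_state is a hypothetical filter: it reads `ethtool -k` output on
# stdin and prints "on" or "off" for TCP segmentation offload.
tso_state() {
    awk -F': *' '/^tcp[- ]segmentation[- ]offload:/ { print $2 }'
}

# Usage (needs ethtool and a real interface):
#   ethtool -k eth1 | tso_state
# To turn TSO off (as root), note the capital -K:
#   ethtool -K eth1 tso off
```

The setting does not survive a reboot or driver reload, so it would need 
to go into the interface bring-up scripts if it turns out to help.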

-bash-3.2# ethtool -S eth1
NIC statistics:
      rx_bytes: 43454
      rx_error_bytes: 0
      tx_bytes: 51103
      tx_error_bytes: 0
      rx_ucast_packets: 231
      rx_mcast_packets: 0
      rx_bcast_packets: 329
      tx_ucast_packets: 250
      tx_mcast_packets: 0
      tx_bcast_packets: 4
      tx_mac_errors: 0
      tx_carrier_errors: 0
      rx_crc_errors: 0
      rx_align_errors: 0
      tx_single_collisions: 0
      tx_multi_collisions: 0
      tx_deferred: 0
      tx_excess_collisions: 0
      tx_late_collisions: 0
      tx_total_collisions: 0
      rx_fragments: 0
      rx_jabbers: 0
      rx_undersize_packets: 0
      rx_oversize_packets: 0
      rx_64_byte_packets: 365
      rx_65_to_127_byte_packets: 166
      rx_128_to_255_byte_packets: 20
      rx_256_to_511_byte_packets: 7
      rx_512_to_1023_byte_packets: 1
      rx_1024_to_1522_byte_packets: 1
      rx_1523_to_9022_byte_packets: 0
      tx_64_byte_packets: 42
      tx_65_to_127_byte_packets: 84
      tx_128_to_255_byte_packets: 31
      tx_256_to_511_byte_packets: 97
      tx_512_to_1023_byte_packets: 0
      tx_1024_to_1522_byte_packets: 0
      tx_1523_to_9022_byte_packets: 0
      rx_xon_frames: 0
      rx_xoff_frames: 0
      tx_xon_frames: 0
      tx_xoff_frames: 0
      rx_mac_ctrl_frames: 0
      rx_filtered_packets: 60
      rx_discards: 0
      rx_fw_discards: 0
-bash-3.2# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:1E:C9:AC:27:FB
           inet addr:192.168.200.154  Bcast:192.168.203.255  Mask:255.255.252.0
           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
           RX packets:574 errors:0 dropped:0 overruns:0 frame:0
           TX packets:265 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:44422 (43.3 KiB)  TX bytes:54606 (53.3 KiB)
           Interrupt:16 Memory:f4000000-f4012100
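
Nothing jumps out of those counters, though with only a few hundred 
packets and nothing in the 1523-to-9022 byte buckets they presumably 
cover only the time since the last reboot, not the failing run.  For 
rechecking on a node that is still up after a failure, a filter along 
these lines could pull out just the nonzero error-type counters 
(nonzero_errors is a made-up name, and the pattern list is a guess at 
which bnx2 counters matter):

```shell
#!/bin/sh
# nonzero_errors is a hypothetical filter: it reads `ethtool -S` output on
# stdin and prints only error-like counters whose value is above zero.
nonzero_errors() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }          # strip the leading indent
        $1 ~ /(error|discard|crc|collision|jabber|fragment|undersize|oversize)/ \
            && $2 + 0 > 0 { print $1 "=" $2 }'
}

# Usage (needs ethtool and a real interface):
#   ethtool -S eth1 | nonzero_errors
```

On the statistics above it prints nothing, since every matching counter 
is zero.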

>> I'm stumped and looking for causes and solutions.  Yeah, the WRF as 
>> compiled did run before the change to Jumbos.
>>
>> Do I reduce the size of the frames to something smaller, like 8800 
>> bytes? 7500?  1500?
> 
> In the past I had heard that jumbo frames may work on Broadcom NICs 
> around 6000 byte length.  We haven't tried this in a while ... YMMV.
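
One way to narrow down a workable frame size before changing the MTU 
everywhere is a don't-fragment ping at each candidate size.  A sketch, 
assuming Linux iputils ping and using c0522 from the ssh example above as 
the peer; ping_payload is a made-up helper, and the loop only prints the 
commands:

```shell
#!/bin/sh
# ICMP payload that exactly fills one frame at a given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload() {
    echo $(( $1 - 28 ))
}

# Print (not run) a don't-fragment ping for each candidate MTU.
# c0522 is just the node named earlier; substitute any peer.
for mtu in 9000 8800 7500 6000 1500; do
    echo "ping -M do -c 3 -s $(ping_payload "$mtu") c0522"
done
```

A clean ping does not prove the NIC holds up under MPI load, but a size 
that fails here is unlikely to be worth testing with WRF.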
> 
>>
>> I'm not completely out of ideas but stumped.
>>
>> Thanks, gerry
> 
> 

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843