[Beowulf] tcp error: Need ideas!

Joe Landman landman at scalableinformatics.com
Sat Jan 24 11:53:04 PST 2009

I wonder if the switch could be implicated.  We have seen some (cheap) 
GbE switches that do not, in practice, support jumbo frames (irrespective of 

Nifty Tom Mitchell wrote:
> On Sat, Jan 24, 2009 at 09:36:09AM -0600, Gerry Creager wrote:
>> Couple of follow-up notes.
>> MTU=4500:  Had one node fall over with the same overflow errors.
>> MTU=3000:  A WRF model is running, but single timesteps are executing  
>> 2.5x slower than MTU=1500

Segmentation offload?  Is TSO on or off?

	ethtool -k eth0

will tell you.  You might also have one very reluctant machine, in the 
sense of not actually having switched its MTU.  Could you run

	ifconfig eth0 | grep MTU

on each machine and verify that everyone is using the right MTU?
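
With more than a handful of nodes, the two checks above are easier to 
script than to eyeball.  What follows is only a rough sketch: the helper 
names are made up for illustration, the parsing assumes typical net-tools 
ifconfig and ethtool -k output (which varies between versions), and the 
node names and MTU in the usage comment are placeholders.

```shell
# Helper names here are hypothetical; the parsing assumes typical
# net-tools ifconfig and ethtool -k output and may need adjusting.

# Read `ifconfig <iface>` output on stdin and print the MTU.
# Accepts both "MTU:1500" (older net-tools) and "mtu 1500" (newer).
parse_mtu() {
    sed -n 's/.*[Mm][Tt][Uu][: ]\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `ethtool -k <iface>` output on stdin and print "on" or "off" for TSO.
# Accepts "tcp segmentation offload: on" and "tcp-segmentation-offload: on".
parse_tso() {
    sed -n 's/^[[:space:]]*tcp[- ]segmentation[- ]offload:[[:space:]]*\(o[nf]*\).*/\1/p' | head -n 1
}

# Read "host mtu" pairs on stdin; print every host whose MTU is not $1.
report_mismatches() {
    awk -v want="$1" '$2 != want { print $1 " has MTU " $2 ", expected " want }'
}

# Example (node names are placeholders):
#   for h in node01 node02; do
#       printf '%s %s\n' "$h" "$(ssh "$h" /sbin/ifconfig eth0 | parse_mtu)"
#   done | report_mismatches 3000
```

The functions only parse text on stdin, so they can be tried against 
saved output before being pointed at live nodes.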

>> I'll go snag the new driver and compile it.  After all: What can it hurt!
>> Thanks, Guy!
>> Regards, Gerry
>> Guy Coates wrote:
>>> Hi,
>>> We have also seen problems with the bnx2 drivers.
>>> I got a more recent set of bnx2 drivers from Broadcom:
> ......
> Has the traffic been snooped to see if all
> is as expected?
> If you are seeing a natural MTU running faster than a jumbo MTU
> then something is fragmenting or causing fragmentation of the data.  
> If MTU=4500 causes overflow errors, it might be related to fragmentation.
> Both the sender and receiver have to keep all the bits on a reliable 
> transfer until the data has been acknowledged.   At one time fragmentation
> could only be done once to a minimum MTU in the life of a packet.
> In addition to snooping packets try "tracepath" to and from all 
> the involved boxes to discover what is going on.
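
Alongside tracepath, one common way to see whether a given MTU really 
survives the path (switch included) is a ping with the don't-fragment bit 
set and a payload sized to fill exactly one packet.  A minimal sketch of 
the arithmetic, assuming IPv4 with no IP options and the Linux iputils 
ping (whose -M do option prohibits fragmentation); the function names and 
the host name in the comment are placeholders:

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Print (do not run) the ping command that probes host $1 at MTU $2.
# -M do sets the don't-fragment bit (Linux iputils ping); -c 3 sends 3 probes.
df_ping_cmd() {
    echo "ping -M do -c 3 -s $(icmp_payload "$2") $1"
}

# Example (node02 is a placeholder host name):
#   df_ping_cmd node02 4500
```

If the probe at the configured MTU fails while the same probe sized for 
MTU=1500 succeeds, something along the path is not passing the larger 
frames.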

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
