[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

Fri Mar 30 16:40:04 PDT 2007

One thing to check is that the switch and NIC are negotiating duplex
correctly. Duplex mis-negotiation (i.e. switch full, NIC half) used to
be a fairly common cause of FCS errors, although it is rare now that
drivers have gotten a lot better. What happens is that the full-duplex
station transmits while the half-duplex station is sending. The
half-duplex station thinks it has seen a collision and ceases
transmission, leaving a packet fragment with no valid FCS. FCS-errored
frames are dropped by the switch, so performance will be horrible.
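
To illustrate why a cut-off frame shows up as an FCS error, here is a
minimal Python sketch. It assumes the usual description of the Ethernet
FCS as a CRC-32 over the frame, sent least significant byte first (the
same CRC-32 that zlib computes). The frame contents and the 40-byte
cut-off point are made up for illustration.

```python
import struct
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Return the frame followed by a 4-byte CRC-32 trailer (assumed byte order: LSB first)."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def fcs_ok(wire_bytes: bytes) -> bool:
    """Receiver-side check: treat the last 4 bytes as the FCS and recompute the CRC."""
    if len(wire_bytes) < 4:
        return False
    body, trailer = wire_bytes[:-4], wire_bytes[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == trailer


frame = bytes(range(64))      # hypothetical frame contents
full = append_fcs(frame)      # what a complete transmission looks like
fragment = full[:40]          # sender stopped early after a perceived collision

print(fcs_ok(full))
print(fcs_ok(fragment))
```

The receiver has no way to know the fragment was cut short. It simply
treats whatever the last four bytes happen to be as the FCS, and the
check fails.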

One easy way to fix this is to set speed and duplex on both ends of the
connection to 1000/full and retest.
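
Before forcing anything, it may help to see what each node thinks it
negotiated. The sketch below parses the text printed by "ethtool eth0"
and flags anything other than the expected speed at full duplex. The
sample text is made up, and the function names are only illustrative.
On a real node you would feed it the actual ethtool output.

```python
import re


def parse_link_settings(ethtool_output: str) -> dict:
    """Pull the Speed, Duplex and Auto-negotiation fields out of ethtool's text output."""
    settings = {}
    for key in ("Speed", "Duplex", "Auto-negotiation"):
        match = re.search(
            rf"^\s*{re.escape(key)}:\s*(.+?)\s*$", ethtool_output, re.MULTILINE
        )
        if match:
            settings[key] = match.group(1)
    return settings


def looks_mismatched(settings: dict, expected_speed: str = "1000Mb/s") -> bool:
    """True if the link is not running at the expected speed with full duplex."""
    return settings.get("Duplex") != "Full" or settings.get("Speed") != expected_speed


# Hypothetical output from a node that fell back to half duplex.
sample = """\
Settings for eth0:
    Speed: 100Mb/s
    Duplex: Half
    Auto-negotiation: off
    Link detected: yes
"""

settings = parse_link_settings(sample)
print(settings, looks_mismatched(settings))
```

Running this across all 29 nodes and comparing the result with the list
of switch ports showing FCS errors would show quickly whether duplex is
the common factor.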

HTH - Steve P

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
On Behalf Of Jon Forrest
Sent: Wednesday, March 28, 2007 4:52 PM
To: beowulf at beowulf.org
Subject: [Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

I've been pulling out what little hair I have left while trying to
figure out a bizarre problem with a Linux cluster I'm running.  Here's a
short description of the problem.

I'm managing a 29-node cluster. All the nodes use the same hardware and
boot the same kernel image (Scientific Linux 4.4, linux 2.6.9). The
owner of this cluster runs a multi-node MPI job over and over with
different input data. We've been seeing strange performance numbers
depending on which nodes the job uses. These variations are not due to
the input data.
In some combinations the performance is an order of magnitude slower
than in others.
Fooling around with replacing the gigabit ethernet switch, replacing two
of the nodes, and running memtest all day long didn't result in anything
interesting.

However, today I took a look at the network statistics as shown on the
ethernet switch (a Netgear GS748T).
What I saw was that 13 of the 29 switch ports had very large numbers of
FCS (Frame Check Sequence) errors. In fact, some had more FCS errors than
valid frames, and I'm talking about frame counts in the billions. All
the other ports showed 0 FCS errors. So, something is clearly wrong.
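
For keeping track of which ports are affected between tests, something
like the following sketch could rank ports by the fraction of frames
with FCS errors. The counter values here are invented for illustration.
The real numbers would have to be copied from the switch's statistics
page.

```python
def suspect_ports(counters: dict, threshold: float = 0.0) -> list:
    """Return (port, error_fraction) pairs above threshold, worst first.

    counters maps port number -> (valid_frames, fcs_errors).
    """
    flagged = []
    for port, (valid, fcs_errors) in counters.items():
        total = valid + fcs_errors
        fraction = fcs_errors / total if total else 0.0
        if fraction > threshold:
            flagged.append((port, fraction))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical counters: port -> (valid frames, FCS errors).
example = {1: (1_000_000, 0), 2: (400_000, 600_000), 3: (900_000, 100_000)}

for port, fraction in suspect_ports(example):
    print(f"port {port}: {fraction:.1%} of frames had FCS errors")
```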

What I'm wondering is what's causing these FCS errors.
The cables are short and the equipment is new.
All the nodes use new SuperMicro H8DCR-3 motherboards with onboard
ethernet controllers so I'm having trouble believing that this problem
is caused by a faulty ethernet controller because this would mean that
13 out of 29 controllers are bad.
Running "ifconfig eth0" on the nodes shows no errors, but I'm not sure if
this kind of error is detectable by the sender, and I'm guessing that
packets with FCS errors are dropped by the switch. Could the switch be
making a mistake while under heavy load when computing the FCS values?
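
One way to look at the node side in more detail than ifconfig gives is
to read the per-interface counters directly. This is only a sketch, and
it assumes a kernel that exposes them under
/sys/class/net/<iface>/statistics/ (the 2.6.9 kernel used here may not).
Any counter that cannot be read is reported as None.

```python
from pathlib import Path

# Counters that are commonly relevant to duplex and framing problems.
COUNTERS = ("rx_errors", "rx_crc_errors", "rx_frame_errors", "tx_errors", "collisions")


def read_nic_counters(iface: str = "eth0", root: str = "/sys/class/net") -> dict:
    """Read selected error counters for one interface; missing ones map to None."""
    stats_dir = Path(root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        try:
            result[name] = int((stats_dir / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_nic_counters("eth0").items():
        print(f"{name}: {value}")
```

Note that frames damaged on their way from a node to the switch are
counted by the receiver, i.e. the switch port, so the node's receive
counters can stay at zero. A nonzero collisions count on a link that is
supposed to be full duplex would be a hint worth following up.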

I'd like to find the definitive cause of the problem before I ask the
vendor to replace massive amounts of hardware. How would you isolate the
cause of this problem?

Cordially,
--
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu