[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?
Douglas Eadline deadline at clustermonkey.net
Sat Mar 31 10:38:50 PDT 2007
Jon,

A few things:

1) You may find MPI Link Checker from Microway helpful in this situation. There was a free beta version floating around at one point. Plus, ethtool can be helpful to check and see how the nodes are connecting to the switch.

2) Also, I have never considered NetGear a high-performance switch. I have seen big improvements in applications by replacing a cheap switch with a slightly more expensive, better-performing switch. Note this may or may not be your problem, so don't run out and buy a new switch, but not all switches hold up under the loads HPC applications throw at them.

<Soapbox>
I am constantly amazed at how many people buy the latest and greatest node hardware and then connect them with a sub-optimal switch (or cheap cables), thus reducing the effective performance of the nodes (for parallel applications). Kind of "penny wise and pound foolish," as they say.
</Soapbox>

--
Doug

> I've been pulling out what little hair I have left while
> trying to figure out a bizarre problem with a Linux
> cluster I'm running. Here's a short description of the
> problem.
>
> I'm managing a 29-node cluster. All the nodes use
> the same hardware and boot the same kernel image
> (Scientific Linux 4.4, linux 2.6.9). The owner of this
> cluster runs a multi-node MPI job over and over with
> different input data. We've been seeing strange performance
> numbers depending on which nodes the job uses. These
> variations are not due to the input data.
> In some combinations the performance
> is an order of magnitude slower than in others.
> Fooling around with replacing the gigabit ethernet switch,
> replacing two of the nodes, and running memtest all
> day long didn't result in anything interesting.
>
> However, today I took a look at the network statistics
> as shown on the ethernet switch (a Netgear GS748T).
> What I saw was 13 of the 29 switch ports had very large
> numbers of FCS (Frame Check Sequence) errors. In fact,
> some had more FCS errors than valid frames, and I'm talking
> about frame counts in the billions. All the other ports
> showed 0 FCS errors. So, something is clearly wrong.
>
> What I'm wondering is what's causing these FCS errors.
> The cables are short and the equipment is new.
> All the nodes use new SuperMicro H8DCR-3 motherboards
> with onboard ethernet controllers so I'm having
> trouble believing that this problem is caused by a
> faulty ethernet controller because this would
> mean that 13 out of 29 controllers are bad.
> Running "ifconfig eth0" on the nodes shows no errors
> but I'm not sure if this kind of error is detectable
> by the sender, and I'm guessing that packets with FCS
> errors are dropped by the switch. Could the switch be making
> a mistake while under heavy load when computing
> the FCS values?
>
> I'd like to find the definitive cause of the problem
> before I ask the vendor to replace massive amounts
> of hardware. How would you isolate the cause
> of this problem?
>
> Cordially,
> --
> Jon Forrest
> Unix Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
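As a rough illustration of the "check how the nodes are connecting to the switch" suggestion, here is a minimal Python sketch that reads the negotiated speed, duplex, and a few receive counters for one interface from Linux sysfs. This is not anything posted in the thread: the function names `read_link_state` and `link_warnings` are made up for the example, the 1000 Mb/s expectation is an assumption based on the gigabit switch mentioned, and the `speed` and `duplex` sysfs files may not be present on older kernels such as the 2.6.9 mentioned above (in which case running `ethtool eth0` by hand gives the same information). Note also that these are node-side counters; the FCS errors in question were reported by the switch, so a clean result here does not rule out a problem.

```python
from pathlib import Path


def read_link_state(iface, sys_root="/sys/class/net"):
    """Collect link settings and error counters for one interface.

    sys_root is a parameter only so the function can be pointed at a
    test directory; on a real node the default is the usual location.
    """
    base = Path(sys_root) / iface

    def read(rel):
        # Missing files (older kernel, link down, no such interface)
        # are reported as None instead of raising.
        try:
            return (base / rel).read_text().strip()
        except OSError:
            return None

    def to_int(text):
        try:
            return int(text)
        except (TypeError, ValueError):
            return None

    return {
        "iface": iface,
        "speed_mbps": to_int(read("speed")),
        "duplex": read("duplex"),
        "rx_packets": to_int(read("statistics/rx_packets")),
        "rx_crc_errors": to_int(read("statistics/rx_crc_errors")),
        "collisions": to_int(read("statistics/collisions")),
    }


def link_warnings(state, expected_speed=1000):
    """Return human-readable reasons this link deserves a closer look."""
    warnings = []
    if state["speed_mbps"] != expected_speed:
        warnings.append(
            f"speed is {state['speed_mbps']}, expected {expected_speed}"
        )
    if state["duplex"] != "full":
        warnings.append(f"duplex is {state['duplex']}, expected full")
    for counter in ("rx_crc_errors", "collisions"):
        if state[counter]:  # skips both None (unknown) and 0
            warnings.append(f"{counter} = {state[counter]}")
    return warnings


if __name__ == "__main__":
    state = read_link_state("eth0")
    for reason in link_warnings(state) or ["nothing obviously wrong"]:
        print(f"{state['iface']}: {reason}")
```

Run on each node, the output could then be compared between the 13 ports showing FCS errors and the 16 that are clean; a node that negotiated half duplex or a lower speed would stand out immediately.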
