Problems with Tulip ethernet on diskless cluster

Thu Sep 7 09:15:30 PDT 2000

I had a similar problem with a similar configuration.  The net card I was
using is the ACER ALN-315 with an Intel (DEC) 21143 chip.  Something strange
was causing the cards to produce 'garbled' packets when they were operated
in full duplex mode.  By 'garbled' I mean that the packets were dropped by
the switch because they failed the checksum tests that the switch was
performing.  The problem stopped when I operated the cards in half duplex
mode, and I never got around to investigating it further.  R.G. Brown (I
think) suggested something about a 'udelay driver' which had to be manually
tuned to your specific setup.

I can't remember the kernel errors that I was getting, but what are your
netperf numbers for full and half duplex TCP?  My netperf numbers (through
the switch) for half duplex were good (for half duplex, that is).  For full
duplex, however, I was getting about 1/100 of wire speed, which was
undoubtedly due to the switch dropping lots of bad packets.

Regards,
Tom Lovie.

-----Original Message-----
From: beowulf-admin at beowulf.org [mailto:beowulf-admin at beowulf.org] On
Behalf Of Franz Marini
Sent: Thursday, September 07, 2000 10:38 AM
To: beowulf at beowulf.org
Subject: Problems with Tulip ethernet on diskless cluster

Hi all,

  We have a cluster of 16 diskless nodes that has been running since Dec
1999.  Trying fftw (a library for FFTs), I noticed some strange behaviour:
running the benchmark under MPI, I get lower performance than running it on
a single machine, even when using 8 nodes.
 The configuration is:

   1 server w/ DLink 4 port fast ethernet card, using 4 tulip chips, 2
UW SCSI-2 IBM hard drives in software RAID 1, PIII 500, 128 MB

  16 diskless nodes w/ 3com fast ethernet card, PIII 500, 128 MB

   1 3com Superstack II 3300 XM switch.

 The ethernet drivers are all updated to the latest version.  We're using
RedHat 6.1 as the Linux distro and LAM/MPI 6.3 for parallel comms (we also
tried mpich, with the same results).

 The only strange thing I noticed is the output from "cat
/proc/net/dev" on the server:

eth0:199467368 1848969 0 0 0 0 0 0 268603237 550016 405696 0 0 0 405696 0
eth1:1015790525 10879129 0 0 0 0 0 0 4262257137 9926748 419750 0 0 0 419750 0
eth2:56208013 216130 0 0 0 0 0 0 73798 312 317023 0 0 0 317023 0
eth3:28227811 118504 0 0 0 0 0 0 99299 1036 254667 0 0 0 254667 0

 That is, on Rx we have no problems at all, while on Tx almost every packet
gets a carrier error.  Note: the switch management software doesn't report
any errors on the ports connected to eth1, eth2 and eth3.
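
 For anyone who wants to check the same counters on their own machine, here
is a minimal Python sketch that splits lines like the ones above into named
receive and transmit fields.  The column order is an assumption (the usual
16-column /proc/net/dev layout: 8 receive counters followed by 8 transmit
counters), and the function names are made up for illustration, so please
verify against the header lines your own kernel prints.

```python
# Assumed column order: 8 receive counters, then 8 transmit counters.
# Check this against the header of /proc/net/dev on your own kernel.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]


def parse_dev_line(line):
    """Split one 'ethN: c1 c2 ...' line into (name, rx dict, tx dict)."""
    name, _, rest = line.partition(":")
    values = [int(v) for v in rest.split()]
    if len(values) != 16:
        raise ValueError("expected 16 counters, got %d" % len(values))
    rx = dict(zip(RX_FIELDS, values[:8]))
    tx = dict(zip(TX_FIELDS, values[8:]))
    return name.strip(), rx, tx


def carrier_summary(text):
    """Return the error-related counters per interface (hypothetical helper)."""
    summary = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip header lines and blanks
        name, rx, tx = parse_dev_line(line)
        summary[name] = {
            "rx_errs": rx["errs"],
            "tx_packets": tx["packets"],
            "tx_errs": tx["errs"],
            "tx_carrier": tx["carrier"],
        }
    return summary


# The counters quoted in this message, with the eth1 line joined back up.
SAMPLE = """\
eth0:199467368 1848969 0 0 0 0 0 0 268603237 550016 405696 0 0 0 405696 0
eth1:1015790525 10879129 0 0 0 0 0 0 4262257137 9926748 419750 0 0 0 419750 0
eth2:56208013 216130 0 0 0 0 0 0 73798 312 317023 0 0 0 317023 0
eth3:28227811 118504 0 0 0 0 0 0 99299 1036 254667 0 0 0 254667 0
"""

if __name__ == "__main__":
    for name, s in carrier_summary(SAMPLE).items():
        print(name, s)
```

 On a live system the same function could be fed the contents of
/proc/net/dev directly, e.g. carrier_summary(open("/proc/net/dev").read()).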

 All ports are configured as 100baseTx-FD (as reported by mii-diag), and
so is the switch.

 I have no clue what is happening, especially considering that the network
apparently is working correctly.

 Any ideas?

 Thank you all in advance,

Franz.

---------------------------------------------
Franz Marini
Sys Admin and Software Analyst,
Dept. of Physics, University of Milan, Italy.
email : marini at pcmenelao.mi.infn.it
---------------------------------------------
