[Beowulf] RX-polling in sk98lin driver

Wed Feb 15 10:08:18 PST 2006

Hi all,

In January we posted a message about the very low performance that we
observed on a home-made cluster of P4 machines (D915PGN
motherboards) with 3c2000t NICs.

http://thread.gmane.org/gmane.comp.clustering.beowulf.general/14343

In brief, we found that for very small packets the time spent
sending the messages was too high. Please note that this is not
related to latency, which is on the order of 50 usec; rather, about once
every 30 packets the round trip takes about 0.1 sec. This reduces the
effective bandwidth so much that our applications (mainly a
self-written finite element code for fluids (CFD),
http://www.cimec.org.ar/petscfem) run slower than on a Fast Ethernet
network.
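
To give an idea of how much such stalls cost, here is a rough estimate
in Python using the approximate figures above. It assumes that a normal
round trip costs about as much as the quoted latency (50 usec), which is
a simplification, and the names are only illustrative.

```python
# Back-of-the-envelope estimate using the approximate figures quoted above.
normal_rtt = 50e-6      # typical round trip, assumed equal to the ~50 usec latency
stalled_rtt = 0.1       # occasional slow round trip, about 0.1 sec
period = 30             # roughly one stall every 30 packets

def mean_rtt(normal, stalled, n):
    """Average round-trip time when 1 packet in n stalls."""
    return ((n - 1) * normal + stalled) / n

avg = mean_rtt(normal_rtt, stalled_rtt, period)
slowdown = avg / normal_rtt                  # relative to a stall-free network
stall_share = stalled_rtt / (period * avg)   # fraction of time lost in stalls

print(f"mean round trip: {avg * 1e3:.2f} ms")
print(f"slowdown vs. no stalls: {slowdown:.0f}x")
print(f"time spent in stalls: {stall_share:.1%}")
```

Under these assumptions the mean round trip is about 3.4 msec, close to
70 times the stall-free value, and nearly all of the time is spent
waiting in the stalls.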

The cluster was set up with the WareWulf diskless package on top of
Fedora Core 3 (kernel 2.6.15), which comes with the `sk98lin' driver
for the 3c2000t NIC. The code used MPICH-1.2.6 and PETSc-2.1.6
(http://www.mcs.anl.gov/petsc).

At that time we mainly suspected the TCP layer of the Linux
kernel, since the symptoms were very similar to those reported at
http://www.icase.edu/coral/LinuxTCP.html
After trying many things, we found that upgrading to MPICH2
(version mpich2-1.0.3) almost fixed the problem.
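
For anyone who wants to look for this kind of stall without MPI in the
way, a minimal sketch of a small-message TCP ping-pong in Python
follows. It is a plain-socket stand-in, not what MPICH does internally.
The function names, the 8-byte message size and the "100 times the
median" stall threshold are all arbitrary choices for illustration. The
demo runs over loopback; on a cluster the echo side would run on another
node.

```python
import socket
import threading
import time

def echo_server(listener):
    """Accept one connection and echo back everything it receives."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

def measure_rtts(host, port, count=200, size=8):
    """Send `count` small messages; return each round-trip time in seconds."""
    payload = b"x" * size
    rtts = []
    with socket.create_connection((host, port)) as sock:
        # Disable Nagle's algorithm so small messages are sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < size:
                chunk = sock.recv(size - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                received += len(chunk)
            rtts.append(time.perf_counter() - start)
    return rtts

def count_stalls(rtts, factor=100):
    """Count round trips slower than `factor` times the median."""
    median = sorted(rtts)[len(rtts) // 2]
    return sum(1 for t in rtts if t > factor * median)

# Loopback demo: the echo server runs in a background thread.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
server.start()

rtts = measure_rtts("127.0.0.1", port)
server.join(timeout=5)
listener.close()
print(f"{len(rtts)} round trips, {count_stalls(rtts)} stalls")
```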

But recently we found a similar fault. When solving large linear
systems in parallel with PETSc, the code was very slow when using the
GMRES method. Again, after trying many things, including tuning
the kernel TCP parameters, we found that upgrading the NIC driver to
sk98lin 8-23 from the 3COM site (or 8-30 from www.syskonnect.com)
and disabling the RX-polling option eliminates the fault.
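
When comparing TCP settings between nodes, a small script like the
following sketch can help. It only reads values from /proc/sys. The
list of parameters is an assumption (commonly tuned buffer settings),
not the set that was actually changed in the experiments above.

```python
from pathlib import Path

# Commonly tuned TCP buffer settings. This selection is an assumption,
# not the list of parameters changed in the experiments described above.
PARAMS = [
    "net/ipv4/tcp_rmem",
    "net/ipv4/tcp_wmem",
    "net/core/rmem_max",
    "net/core/wmem_max",
]

def read_sysctls(names, root="/proc/sys"):
    """Return {name: value or None} for each sysctl found under `root`."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(root) / name).read_text().strip()
        except OSError:
            values[name] = None   # not available on this system
    return values

for name, value in read_sysctls(PARAMS).items():
    print(f"{name.replace('/', '.')} = {value}")
```

Running it on two nodes and comparing the output shows quickly whether
they really have the same settings.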

However, with this kernel (RX polling disabled) the nodes hang randomly,
at a rate of about 2 of the 12 nodes per day. We tried several
things.

* The same kernel with either version of the driver (8-23 or 8-30)
  and RX polling enabled is stable (but slow).

* Using a 2.4 kernel instead of the 2.6 doesn't change things. 

* We tried to disable NFS by building a large VNFS ramdisk with all
  the files needed but couldn't carry out the experiment properly, so we
  are unable to say whether the fault is related to NFS or not. (Note:
  nodes are diskless but NFS traffic is reduced by loading most files
  (almost all except for /usr..) in a ramdisk.)

* Nodes hang even when they are idle, not only under load.

RX polling seems to be an option present in many drivers and (if I
understand correctly) handles incoming packets in batches instead of
taking one interrupt per packet. The help text that `$ make menuconfig'
shows for the RX polling option is the following.

>   Use Rx polling (NAPI) 
>     CONFIG_SK98LIN_NAPI:                                                   
>    NAPI is a new driver API designed to reduce CPU and interrupt load
>    when the driver is receiving lots of packets from the card.
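
To illustrate what the help text means by reducing interrupt load, here
is a toy model in Python. It is deliberately simplified and is not how
the kernel or the sk98lin driver is implemented: the polling interval,
the budget and the traffic pattern are all hypothetical numbers. It
only counts interrupts for bursty traffic in the two modes.

```python
def interrupts_per_packet(arrival_times):
    """Classic mode: the card raises one interrupt for every packet."""
    return len(arrival_times)

def interrupts_with_polling(arrival_times, poll_interval, budget):
    """Polled mode (toy model of NAPI).

    The first packet raises an interrupt and RX interrupts are then masked.
    The driver polls every `poll_interval`, taking up to `budget` packets
    per poll, and unmasks interrupts once a poll finds the queue empty.
    """
    interrupts = 0
    i = 0
    n = len(arrival_times)
    while i < n:
        interrupts += 1                  # interrupt wakes the driver
        now = arrival_times[i]
        while True:
            # Packets that have arrived by `now` are waiting in the queue.
            waiting = 0
            while i + waiting < n and arrival_times[i + waiting] <= now:
                waiting += 1
            if waiting == 0:
                break                    # queue drained: unmask interrupts
            i += min(waiting, budget)    # take at most `budget` per poll
            now += poll_interval
    return interrupts

# Hypothetical traffic: 10 bursts of 100 packets, 1 usec apart inside a
# burst, with a new burst starting every 10 msec. Times are in usec.
arrivals = [b * 10_000 + p for b in range(10) for p in range(100)]

print(interrupts_per_packet(arrivals))                                # 1000
print(interrupts_with_polling(arrivals, poll_interval=20, budget=64)) # 10
```

In this model each burst costs a single interrupt instead of one per
packet. For sparse traffic (packets far apart) both modes give the same
count, since every packet finds interrupts unmasked again.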

Any hints are welcome, TIA, Mario

-- 
-------------------------
Mario Alberto Storti     [cel. +54-342-156144983]
CIMEC (INTEC/CONICET-UNL), Guemes 3450 - 3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1015), Tel/Fax: +54-342-4511169
e-mail: mstorti at intec dot unl dot edu dot ar
http://www.cimec.org.ar/mstorti
-------------------------