eepro100 frame errors with SMP

Rogier Wolff R.E.Wolff@BitWizard.nl
Fri Jun 25 04:39:31 1999


Alan Curry wrote:
> Two different machines, each with an eepro100, are racking up frame errors,
> according to the statistics on the cisco switch they are connected to.
> Swapping cards between these and another machine, and booting many different
> kernels, leads us to believe that the problem only exists when more than one
> processor is being used. This smells like a driver bug to me.
> 
> These errors show up as "TX overruns" from ifconfig on the Linux side and
> frame/CRC errors from `show interfaces' on the cisco side.

There are RX overruns, and TX underruns. I think you're seeing TX
underruns, even though ifconfig labels them "overruns".

The eepro100 gets the data to be transmitted from main memory. TX
underrun happens when the main memory can't keep up with the sending
of the data onto the ethernet.

So, hardware-wise, your machine is misconfigured: at times the eepro100
cannot get 10Mbyte per second of throughput from main memory.

This is NOT a driver problem. 

You could look into decreasing the "max_lat" value of all other
devices on the PCI bus, and increasing it on the eepro100.

> We told cisco, and they're so concerned about their switch they currently
> have a team trying to reproduce the problem, but as far as we know they are
> still in the "how do we install RedHat?" stage.

I suggest telling them not to worry. You've found the problem, and it
is your machine that is misconfigured....
 
> What can I do to further track down this problem?


/*  
 *  perform_memcpy.c
 *
 *
 *
 *  written by R.E.Wolff -- R.E.Wolff@BitWizard.nl
 * 
 *
 *              date          by     what
 *  Written:    Apr 23 1997   REW    Initial revision.
 *  changes:
 *
 * $Log: perform_memcpy.c,v $
 * Revision 1.2  1997/11/13 14:56:59  wolff
 * Created RCS Log.
 *
 *
 *
 *  who-is-who:
 *    initials full name                 Email address
 *    REW      Roger E. Wolff            R.E.Wolff@BitWizard.nl
 *
 * This program allows you to test whether the zoran chip is bothered
 * by a rep; movsl instruction.
 *
 * */

#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#ifndef SIZE 
#define SIZE 0x800000
#endif

unsigned char *makebuf (int size)
{
  unsigned char *t;

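  t = malloc (size);
  if (t == NULL) {
    /* Without this check a failed allocation would crash in memcpy. */
    perror ("malloc");
    exit (1);
  }
  return t;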
}

int main (int argc, char **argv)
{
  unsigned char *p, *q, *r;
  struct timeval start, stop;
  int n=1;
  int count=0;

  if (argc > 1)
    n = atoi (argv[1]);

  p = makebuf (SIZE);
  q = makebuf (SIZE);
 
  while (n--) {
    gettimeofday (&start, NULL);
    __builtin_memcpy (p, q, SIZE);
    gettimeofday (&stop, NULL);
    
    /* tv_sec and tv_usec are not plain ints; cast and print as long. */
    printf ("Elapsed time %d: %ld usecs.\n",
            count++,
            (long) (stop.tv_sec  - start.tv_sec) * 1000000L +
            (long) (stop.tv_usec - start.tv_usec));
    r = p;
    p = q;
    q = r;
  }
  exit (0);
}

------------

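Note: "SIZE" is not a run-time parameter of the program, because it
needs to be a compile-time constant for maximum effect. To try another
size, override it when compiling (for example with -DSIZE=0x400000).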
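
To compare the elapsed times against what the card needs, they can be
turned into a copy rate. This is only a rough sketch: the helper name
copy_rate_mbytes is made up for illustration and is not part of
perform_memcpy.c. It counts only the bytes copied; every byte is both
read and written, so the actual memory traffic is roughly double.

```c
#include <stdio.h>

/* Hypothetical helper (not in perform_memcpy.c): convert one
 * "Elapsed time" figure into Mbyte (10^6 bytes) copied per second. */
double copy_rate_mbytes (unsigned long bytes, long usecs)
{
  if (usecs <= 0)
    return 0.0;         /* timer too coarse to say anything useful */

  /* bytes per microsecond is the same number as Mbyte per second */
  return (double) bytes / (double) usecs;
}
```

Calling copy_rate_mbytes (SIZE, elapsed) inside the loop would print a
rate next to each elapsed time, which is easier to compare between
machines than raw microseconds.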

I expect that even on single processor systems you will see the
problems. I expect that on dual processor systems you will be able to
halt almost all network activity by running two copies of this...
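
To see whether the error count climbs while the copy test runs, the TX
counters can be read from /proc/net/dev before and after. The sketch
below is an assumption-laden illustration, not something from the
driver: the function name tx_counters is made up, and it assumes the
/proc/net/dev layout with eight receive fields followed by tx bytes,
packets, errs, drop and fifo. I believe the fifo column is what
ifconfig reports as TX "overruns", but check that against your kernel.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pull tx errs and tx fifo out of one line of
 * /proc/net/dev.  Assumed layout after "iface:" is 8 receive fields,
 * then tx bytes, packets, errs, drop, fifo.
 * Returns 1 on success, 0 if the line is not for `ifname'. */
int tx_counters (const char *line, const char *ifname,
                 unsigned long *errs, unsigned long *fifo)
{
  const char *colon = strchr (line, ':');
  size_t len = strlen (ifname);
  unsigned long tx_bytes, tx_packets, tx_drop;

  if (colon == NULL)
    return 0;                   /* header line, no interface name */
  while (*line == ' ')
    line++;                     /* names are right-aligned with spaces */
  if ((size_t) (colon - line) != len || strncmp (line, ifname, len) != 0)
    return 0;

  /* Skip the 8 receive fields, then read the first 5 transmit fields. */
  return sscanf (colon + 1,
                 "%*s %*s %*s %*s %*s %*s %*s %*s %lu %lu %lu %lu %lu",
                 &tx_bytes, &tx_packets, errs, &tx_drop, fifo) == 5;
}
```

To use it, open /proc/net/dev, fgets each line, and pass it along with
the interface name (e.g. "eth0"). Take one reading, run the memcpy
test, take another, and compare the two fifo values.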


			Roger.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
------ Microsoft SELLS you Windows, Linux GIVES you the whole house ------