2.0 kernels, tulip driver, crashes and reboots (long)

Al Youngwerth alberty@apexxtech.com
Fri Jan 8 13:31:56 1999


At 10:44 AM 1/8/99 -0500, Robert G. Brown wrote:
>On Thu, 7 Jan 1999, Al Youngwerth wrote:
>
>> We make an embedded system that uses linux and a headless PC. We're trying
>> to qualify a new hardware platform that uses a VIA VPX based motherboard
>> (from Epox), Intel Pentium 133, 16MB RAM, and a PNIC-based 10/100 tulip
>> clone. Part of our qualification testing is to get a bunch of systems
>> running in a room without any crashes or spontaneous reboots for over two
>> weeks. We've been having some trouble.
>...
>
>Two or three remarks:
>
>  a) If the systems crash (lockup or not) with different network cards
>in them (you cited both ne2000's and PNIC tulips) then it is not too
>likely that network cards are the source of the crash.  The two drivers
>don't share much code and a P133 isn't exactly a high stress
>environment.
>
>  b) The PNIC cards with the tulip driver may well be unstable -- I've
>never tried them.  However, NE2K cards and "true" 21140 tulip cards are
>awesomely stable.  I routinely achieve 100+ day uptimes with every
>kernel after 2.0.33, and was well on my way to 100 days with 2.1.131
>before I decided to convert to Red Hat on the system in question and
>haven't spent the time to figure out how to build/install 2.1.x under RH
>since.  This is in a far more demanding environment -- SMP systems with
>very heavy CPU and network loads.  You can always swap in a true tulip
>card or five and get statistics on them.  But as I said, I doubt that
>your problem is the network card.

I mostly agree. Based on our test data though, I strongly believe there is
a lockup problem with the .90f driver, PNIC-based tulip cards and perhaps
the VIA VPX. Others have also reported this. Previously, I had submitted a
patch to Donald Becker for an skbuff kernel panic that happens in
tulip_rx() because the PNIC very occasionally gives the driver a totally
bogus rx packet size and the tulip driver tries to allocate too much
memory. This patch showed up in .89K.
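For anyone curious, the kind of sanity check that patch amounts to can be
sketched in plain C. This is an illustrative sketch only, not the actual
driver code: the names (validate_rx_len, PKT_BUF_SZ) and the exact limits
are my assumptions here, though 1536 bytes matches the 2.0-era tulip
driver's receive buffer size.

```c
/* Sketch of defending against a bogus frame-length field in a receive
 * descriptor.  PKT_BUF_SZ is assumed to be the driver's rx buffer size;
 * without a bound like this, a garbage length from the chip flows
 * straight into the skb allocation and can panic the kernel. */
#define PKT_BUF_SZ 1536

/* Return the payload length (frame length minus the 4-byte CRC),
 * or -1 if the descriptor's length field is clearly bogus. */
static int validate_rx_len(unsigned int desc_len)
{
    /* A legal Ethernet frame, CRC included, is 64..1518 bytes; anything
     * larger than our receive buffer is certainly garbage. */
    if (desc_len < 64 || desc_len > PKT_BUF_SZ)
        return -1;
    return (int)(desc_len - 4);
}
```

A driver would drop the packet (and bump an error counter) on the -1 path
instead of allocating an skb for it.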

The PNIC is definitely a sketchy chipset. Unfortunately, I have a few
thousand of them in the field that are stable with the platform and software
they are currently running. We may not move forward with this chipset, but I
have to stay backwards compatible.

>
>  c) To me, your symptoms sound like a very low level configuration
>problem of one sort or another -- perhaps a BIOS or memory issue, or, as
>you note, an APM issue.  linux runs stably on way too many P5/P6 systems
>to make it likely that there is a serious problem with the kernel
>itself, but certain hardware combinations or BIOS setups can certainly
>destabilize the system.  Is the system caching anything at the
>hardware/bios level?  This can be a problem.  What do your
>/proc/[ioport,interrupt,device,pci] look like?

I agree. We've been over the BIOS settings quite extensively. If there's
anything in particular you think is important, let me know and I'll post.
We are mostly default settings with a few exceptions (don't halt on
keyboard errors, BIOS non-cacheable, internal and external cache enabled,
APM enabled).

In particular, as I noted previously, APM settings seem to make a
difference in the behavior between kernels. Under version 2.0.35, disabling
APM in the BIOS reduced the frequency of lockups significantly. With
version 2.0.36, the APM BIOS settings do not affect lockups (again, 2.0.36
doesn't lock up at all unless I use the .90f tulip driver). I've scanned the
diffs in the APM code, and given how we compile and launch the kernel, the
APM changes between the two kernels are negligible. This is perplexing to
me.

Mostly, I want to focus on the reboot problem with 2.0.36. Here are my proc
files (note that I plugged an ISA video card into this system to review the
BIOS settings again).

TEAMInternet# cat /proc/pci
PCI devices found:
  Bus  0, device  10, function  0:
    Ethernet controller: Lite-on LNE100TX (rev 32).
      Medium devsel.  Fast back-to-back capable.  IRQ 12.  Master Capable.
Latency=64.
      I/O at 0x6b00.
      Non-prefetchable 32 bit memory at 0xe0000000.
  Bus  0, device   7, function  3:
    Bridge: VIA Technologies VT 82C586B Apollo ACPI (rev 16).
      Medium devsel.  Fast back-to-back capable.
  Bus  0, device   7, function  2:
    USB Controller: VIA Technologies VT 82C586 Apollo USB (rev 2).
      Medium devsel.  IRQ 11.  Master Capable.  Latency=64.
      I/O at 0x6700.
  Bus  0, device   7, function  1:
    IDE interface: VIA Technologies VT 82C586 Apollo IDE (rev 6).
      Medium devsel.  Fast back-to-back capable.  Master Capable.  Latency=64.
      I/O at 0x6300.
  Bus  0, device   7, function  0:
    ISA bridge: VIA Technologies VT 82C586 Apollo ISA (rev 65).
      Medium devsel.  Master Capable.  No bursts.
  Bus  0, device   0, function  0:
    Host bridge: VIA Technologies VT 82C585 Apollo VP1/VPX (rev 35).
      Medium devsel.  Fast back-to-back capable.  Master Capable.  Latency=64.

TEAMInternet# cat /proc/devices
Character devices:
 1 mem
 2 pty
 3 ttyp
 4 ttyS
 5 cua
 6 lp
 7 vcs
10 misc

Block devices:
 3 ide0

TEAMInternet# cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-009f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : npu
01f0-01f7 : ide0
02f8-02ff : serial(set)
0378-037f : lp
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(set)
6b00-6b7f : Lite-On 82c168 PNIC

TEAMInternet# cat /proc/interrupts
 0:      45038   timer
 1:          8   keyboard
 2:          0   cascade
 8:          0 + rtc
12:        537   Lite-On 82c168 PNIC
13:          1   math error
14:      15873 + ide0

>
>  d) Another possibility is that your memory itself is marginal.  The
>absolute worst cases of hardware debugging I have encountered have
>centered on bad/marginal memory.  We recently acquired a dual 450 MHz
>PII system that crashed every time we put a significant load on it and
>crashed anyway (after a longer time) even WITHOUT a load on it -- mean
>uptime before a crash of perhaps a day or two (an hour or less under
>load).  We were pulling out our hair -- we tried swapping cards, CPUs,
>and were close to trying another motherboard when we decided to swap
>SDRAM DIMMS instead.  Turned out our "certified" PC100 memory sucked --
>we put in over the counter PC100 memory from a local vendor and have had
>zero crashes under load or otherwise.  Sounds like you got all of those
>motherboards at once from somebody, and presumably got the same memory
>on all of them.  You might try getting some memory from a DIFFERENT
>(EDO?)  vendor and swapping it into a few of the systems and see if they
>crash.  Some motherboards are far less tolerant than others of "bad" or
>marginally spec'd memory.

I doubt memory. I say this because we're running memory from two different
vendors (Legacy and Crucial, our neighbors from Micron). In my experience,
bad memory most of the time (not always) shows itself as kernel page
faults or the infamous sig 11 from gcc. In fact, part of our manufacturing
burn-in process for testing memory is to run a make of some old beta
SpellCaster ISDN drivers in a loop; that particular make seems to catch
memory errors like nothing else I've ever used. All of these systems passed
our burn-in process before we started testing them.
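For anyone without a handy compile loop, the same idea can be crudely
sketched as a userspace pattern test in C. The function name and patterns
below are my own invention for illustration; a userspace walk is far less
thorough than a real burn-in, since the kernel decides which physical pages
you actually touch.

```c
#include <stddef.h>

/* Walk a buffer with alternating bit patterns and verify each pass.
 * Returns the number of mismatched words (0 on a clean pass). */
static size_t pattern_test(unsigned long *buf, size_t nwords)
{
    static const unsigned long patterns[] = {
        0x55555555UL, 0xAAAAAAAAUL, 0x00000000UL, 0xFFFFFFFFUL
    };
    size_t errors = 0;

    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        /* Write the pattern, perturbed by the index so adjacent words
         * differ, then read everything back and count mismatches. */
        for (size_t i = 0; i < nwords; i++)
            buf[i] = patterns[p] ^ (unsigned long)i;
        for (size_t i = 0; i < nwords; i++)
            if (buf[i] != (patterns[p] ^ (unsigned long)i))
                errors++;
    }
    return errors;
}
```

Run in a loop over as large an allocation as the box allows, any nonzero
return on otherwise idle hardware points a finger at the memory.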

>
>  e) The final possibility (that I can think of) is to see if your
>problem is peculiar to the Epox MoBo you are using, if it is at all
>possible to swap it for another "equivalent".  Usually fundamentally
>stable vs unstable is very easy to identify even with small samples.
>Just because a MoBo has been used successfully with Windoze is no reason
>to believe that it is reliable -- Windows typically has a mean uptime
>measured in days under load anyway, so hardware problems are "invisible"
>against the dominant software problems (memory leaks and so forth).
>Even NT is none too stable for the purposes of validating hardware.
>I've used linux on Pentia, AMD's, PPro's, PII's, Celerons in single and
>dual configurations (probably several hundred systems total in a couple
>dozen different hardware configurations) and have literally never
>encountered a system (yet) on which I could not totally stabilize it,
>but there is always a first time.  It may be that your motherboard has
>some feature that just won't work with linux unless/until you hack the
>kernel itself.

Again, we are now running three different vendors' motherboards with the
same results (although all of them are VIA chipsets with Award BIOSes). By
the end of the day, we'll have 10 new systems with Intel chipsets, which
will help us rule out the VIA chips. Also, we booted 10 systems from a DOS
floppy last night and will watch those for reboots.

>
>  Hope this helps, and hang in there.  As I said, if you persevere (and
>eliminate any possible hardware/bios problems by systematic swaps and
>the process of elimination) you have an excellent chance of beating the
>problem.

Thanks for the support. I'm confident that we will solve our problems and
hope our experiences will contribute to the overall stability of the Linux
platform.

Thanks,

Al Youngwerth
alberty@apexxtech.com