[3c509] 3c509B hang after "too much work in interrupt"

Thu Jul 11 01:47:01 2002

On 11 Jul 2002, David Rochberg wrote:

> To: 3c509@scyld.com
> Subject: [3c509] 3c509B hang after "too much work in interrupt"
..
> Under heavy loads I will see an occasional message:
>   Jul  9 15:57:47 farm-1 kernel: eth0: Too much work in interrupt, status e401.

Uhmmm, you likely wanted to send this to the vortex list, not the ISA
3c509 list.
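
For reference, the status word e401 in the quoted message can be read by hand. The sketch below is a user-space decoder, not driver code: the bit names and values are assumed to match the status enum in the 3c59x (vortex) driver source, the top three bits are assumed to hold the current register window, and decode_status() and status_window() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit names/values assumed from the 3c59x driver's status enum. */
static const struct { uint16_t bit; const char *name; } status_bits[] = {
    { 0x0001, "IntLatch" },     { 0x0002, "HostError" },
    { 0x0004, "TxComplete" },   { 0x0008, "TxAvailable" },
    { 0x0010, "RxComplete" },   { 0x0020, "RxEarly" },
    { 0x0040, "IntReq" },       { 0x0080, "StatsFull" },
    { 0x0100, "DMADone" },      { 0x0200, "DownComplete" },
    { 0x0400, "UpComplete" },   { 0x0800, "DMAInProgress" },
    { 0x1000, "CmdInProgress" },
};

/* Assumption: bits 15..13 of the status word are the register window. */
static int status_window(uint16_t status)
{
    return status >> 13;
}

/* Write the names of all set status bits into out, space separated. */
static void decode_status(uint16_t status, char *out, size_t len)
{
    size_t used = 0;

    assert(out != NULL && len > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof status_bits / sizeof status_bits[0]; i++) {
        if (!(status & status_bits[i].bit))
            continue;
        int n = snprintf(out + used, len - used, "%s%s",
                         used ? " " : "", status_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* output truncated, stop here */
        used += (size_t)n;
    }
}
```

Under those assumptions, decode_status(0xe401, buf, sizeof buf) yields "IntLatch UpComplete" and status_window(0xe401) is 7.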

> which seems mostly-benign on its own.  However, occasionally (>3, <10
> times in the last month), we'll come in in the morning to find the
> machine unreachable over its ethernet with "Too much work" as the last
> message in the syslog.  An ifdown/ifup will bring connectivity back.

Run 'vortex-diag' to report the NIC state when this happens.

> My big question is "Is there a fix for this"?  

It depends on what the specific problem is.

> 1.  How do status notifications get turned back on after a 'too much
> work'?

Line 1583 in 0.99Xc:
	if (status & IntReq) {		/* Restore all interrupt sources.  */
		outw(vp->status_enable, ioaddr + EL3_CMD);
		outw(vp->intr_enable, ioaddr + EL3_CMD);
		vp->restore_intr_mask = 0;
	}

> When "too much work" happens, the driver turns off status
> notification ("indications" in 3com's terminology) for all the
> currently-asserted interrupt sources.  The comments say "The timer
> will reenable interrupts".  When I look at vortex_timer(), though, I
> can't figure out how indications get turned back on. I see the
> FakeIntr request, but no obvious SetStatusEnb/SetIndicationEnable.

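The command code (SetStatusEnb, and likewise SetIntrEnb) is stored
together with the sources to be enabled in vp->status_enable and
vp->intr_enable, so writing either value to EL3_CMD issues the command.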
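
A minimal sketch of that encoding, as a user-space illustration rather than driver code. The command codes and status bits are assumed to match the enums in the 3c59x driver source (command in the upper bits, argument in the low 11 bits), and el3_command() is a hypothetical helper that does not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed from the 3c59x driver source; check your driver version. */
enum { SetIntrEnb = 14 << 11, SetStatusEnb = 15 << 11 };
enum {
    IntLatch = 0x0001, HostError = 0x0002, TxComplete = 0x0004,
    TxAvailable = 0x0008, RxComplete = 0x0010, RxEarly = 0x0020,
    IntReq = 0x0040, StatsFull = 0x0080,
    DMADone = 1 << 8, DownComplete = 1 << 9, UpComplete = 1 << 10,
};

/* Hypothetical helper: pack a command code and its source mask into the
 * single 16-bit word that would be written to EL3_CMD. */
static uint16_t el3_command(uint16_t cmd, uint16_t sources)
{
    assert((sources & ~0x07ffu) == 0);  /* argument must fit in 11 bits */
    return (uint16_t)(cmd | sources);
}
```

With these values, el3_command(SetStatusEnb, IntReq | UpComplete) is 0x7c40. A word built this way and kept in a field such as vp->status_enable needs only a single outw() to restore the sources, which is why no separate SetStatusEnb constant appears at the restore site.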

> 2.  Will raising max_interrupt_work help?

It will avoid the message, but there is a reason that this code exists:
to protect the system from an overload that prevents any work from
getting done.
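
The idea of a work limit can be sketched with a small simulation. This is not the driver's loop: struct sim_nic and service_interrupt() are hypothetical names, each "event" stands in for one pass of real interrupt work, and the exact point at which the real driver counts and gives up may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NIC state seen by the handler. */
struct sim_nic {
    int pending;    /* events still waiting for service */
    bool masked;    /* set once the handler gives up and masks its sources */
};

/* Service events until none remain or max_work of them have been handled.
 * Returns true when the limit was hit with work still pending, which is
 * the case the "too much work" message reports. */
static bool service_interrupt(struct sim_nic *nic, int max_work)
{
    int work = max_work;

    assert(max_work >= 0);
    while (nic->pending > 0) {
        if (work-- == 0) {
            nic->masked = true;  /* a real driver defers the rest to a timer */
            return true;
        }
        nic->pending--;          /* handle one event */
    }
    return false;
}
```

With a limit of 32, a backlog of 10 events drains completely, while a backlog of 40 stops with 8 events left and the sources masked. Raising the limit only moves the point at which the handler yields the CPU back to the rest of the system.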

> To that end, I ran some experiments with an instrumented version of
> the driver that printed out the number of iterations through the loop
> in boomerang_interrupt (when the number of iterations exceeded a
> threshold).  I was a little surprised to see that this number was very
> rarely large; even under loads that brought the machine to its knees,
> I rarely saw iteration counts as high as 10, let alone the 32
> necessary to hit the max_interrupt_work threshold and trigger a "too
> much work".  I tried to generate extra interrupts by catting files to
> /dev/null during the tests, and this didn't seem to increase the
> number of loop iterations.

You have duplicated a test that I run to see how the driver behaves.
There is no network load in normal operation that causes the "too much
work" message.  It's caused by some other subsystem blocking the
network interrupt handler from getting work done in a timely manner.
This might be another device driver, or a kernel lock.

> 3.  Could this be caused by the FakeIntr from vortex_timer getting
> dropped somehow?

Perhaps, but it shouldn't be possible to lose interrupts.

> I mention the IPIP encapsulation because my casual reading of the
> source (I've not yet instrumented to make sure) suggests that every IP
> packet that gets IPIP-encapsulated must be copied to make additional
> headroom in the skbuff, and this would further increase kernel CPU
> usage.

This skbuff allocation might be the problem.  The Linux kernel memory
allocation code in 2.4 is unpredictably bad.

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993