[realtek-bug] Another 1.10 bug (more important - with fix)

Paul Campbell paul@taniwha.com
Fri, 14 Jul 2000 15:21:10 -0700


SYMPTOMS

	Periodically, under reasonable NFS load, a random machine
	in my farm would stop talking. You could ping it, but the
	responses were slow, often delayed until after subsequent
	pings were received. You could log in (mine are headless)
	with abysmal response times and poke around. The following
	error was always the last thing in the kernel buffer:

		eth0: Transmit error, Tx status 400820aa.   

	(a transmit abort). I found you could unwedge a stuck
	machine by pinging it with longish packets (say 10k bytes),
	at which point it printed:

		eth0: Transmit timeout, status 0d 0000 media 00.
		eth0: Tx queue start entry 56110  dirty entry 56106, full.
		eth0:  Tx descriptor 0 is 000804ae.
		eth0:  Tx descriptor 1 is 0008042d.
		eth0:  Tx descriptor 2 is 00082442. (queue head)
		eth0:  Tx descriptor 3 is 00082441.
		eth0: MII #32 registers are: 1000 782d 0000 0000 05e1 40a1 0001 0000.     
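
	For readers decoding the dump above, here is a minimal sketch in C.
	The bit positions are my assumption, taken from the RTL8139
	datasheet and the driver's own constant names rather than from
	this message, and the helper names are hypothetical. The ring
	size of 4 comes from the four descriptors shown in the dump.

```c
#include <stdint.h>

/* Assumed Tx status bit layout (RTL8139 datasheet; not from this message). */
#define TX_SIZE_MASK  0x00001fffu  /* bits 0-12: packet size in bytes */
#define TX_HOST_OWNS  0x00002000u  /* bit 13: descriptor owned by host */
#define TX_STAT_OK    0x00008000u  /* bit 15: transmit completed OK */
#define TX_ABORTED    0x40000000u  /* bit 30: transmit aborted */

#define NUM_TX_DESC   4            /* the dump lists descriptors 0..3 */

/* Nonzero if the status word reports a transmit abort. */
int tx_aborted(uint32_t status)
{
    return (status & TX_ABORTED) != 0;
}

/* Nonzero if the status word reports a successful transmit. */
int tx_stat_ok(uint32_t status)
{
    return (status & TX_STAT_OK) != 0;
}

/* Packet size field of the status word. */
unsigned tx_size(uint32_t status)
{
    return (unsigned)(status & TX_SIZE_MASK);
}

/* Map a free-running queue counter onto a descriptor slot. */
unsigned ring_slot(unsigned long entry)
{
    return (unsigned)(entry % NUM_TX_DESC);
}

/* The queue is full when every descriptor is still outstanding. */
int ring_full(unsigned long cur_tx, unsigned long dirty_tx)
{
    return (cur_tx - dirty_tx) >= NUM_TX_DESC;
}
```

	Under these assumptions, status 400820aa has the abort bit set
	and the OK bit clear, with a size field of 0xaa (170 bytes).
	Entries 56110 and 56106 differ by 4, so the queue is full, and
	both map to slot 2, which matches the "(queue head)" marker.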

SOLUTION

	The problem is in the transmit interrupt service routine's response to the
	transmit abort state. At line 1066 of the 1.10 driver it does:

		outl((TX_DMA_BURST<<8)|0x03000001, ioaddr + TxConfig);    

	I believe this should be replaced with:
		
		outl((TX_DMA_BURST<<8), ioaddr + TxConfig);    

	Note the missing constant. I believe, according to the chip's
	docs, that the '1' in it causes the aborted packet to be
	retransmitted, but further down in the ISR the driver assumes
	that the packet is done, discards the buffer and allows the
	xmt entry to be reused. I think this is the cause of the hang;
	the tx timeout clears this and resets the state. The '3' value
	also appears to put the transmitter into a state where it uses
	an illegal interframe gap (another possible cause of problems).
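
	To make the difference between the two register writes concrete,
	here is a small sketch in C that computes both values and the
	bits the proposed change drops. The mask names are hypothetical
	labels for the '1' and the '3' discussed above, and the burst
	value is passed in because this message does not state what
	TX_DMA_BURST is defined as.

```c
#include <stdint.h>

/* Hypothetical names for the two fields discussed in the text. */
#define TXCFG_RETRANSMIT  0x00000001u  /* the '1': resend the aborted packet */
#define TXCFG_IFG_MASK    0x03000000u  /* the '3': interframe gap setting */

/* Value written by the 1.10 driver on a transmit abort. */
uint32_t txconfig_original(uint32_t tx_dma_burst)
{
    return (tx_dma_burst << 8) | 0x03000001u;
}

/* Value written after the proposed change. */
uint32_t txconfig_proposed(uint32_t tx_dma_burst)
{
    return tx_dma_burst << 8;
}

/* Bits that are set by the original write but not by the proposed one. */
uint32_t txconfig_removed_bits(uint32_t tx_dma_burst)
{
    return txconfig_original(tx_dma_burst) & ~txconfig_proposed(tx_dma_burst);
}
```

	For any burst value that fits in the low bits of the shifted
	field (4 is used here only as an example), the removed bits are
	0x03000001: the retransmit bit plus both interframe gap bits.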

NOTES

	This fixed my problem, and it may also fix the mysterious hang
	other people have reported. To get to this point I ported a lot
	of the rtl8139too.c driver into my Linux 2.2 driver (spinlocks,
	the BSD fixes, the extra 4-byte problem, etc.). This was the
	change that fixed my problem, so there may be other stuff that
	should be fixed too. I'm loath to fork a 3rd set of driver
	source. Donald, is it appropriate for me to pass you my
	annotated source for you to pick and choose changes from?


	Paul Campbell
	paul@taniwha.com