Mysterious kernel hangs

Thu Mar 15 06:39:17 PST 2001

On Thu, 15 Mar 2001, Felix Rauch wrote:

> We recently bought a new 16 node cluster with dual 1 GHz PentiumIII
> nodes, but machines mysteriously freeze :-(
>
> The nodes have STL2 boards (Version A28808-301), onboard adaptec SCSI
> controllers (7899P), onboard intel Fast Ethernet adapters (82557
> [Ethernet Pro 100]) and additional Packet Engines Hamachi GNIC-II
> Gigabit Ethernet cards.
>
> We tried kernels 2.2.x, 2.4.1 and now even 2.4.2-ac20, but it seems to
> be the same problem with all kernels: When we run experiments which
> use the network intensively, any of the machines will just freeze
> after a few hours. The frozen machine does not respond to anything and
> up to now we were not able to see any log-entries related to the
> freeze on virtual console 10 :-(   We switched now on all the "Kernel
> Hacking" stuff in the kernel configuration (especially the logging)
> and we will try again, hopefully we will at least see some log outputs.
>
> The freezes do also happen if we let non-network-intensive jobs run on
> the machines (e.g. SETI@home), but clearly they happen less often.
>
> Does anyone of you have any ideas what could go wrong or what we could
> try to find the cause of the problems?

Dear Felix,

If this is happening on all 16 nodes, it sounds very, very much like a
kernel deadlock, although problems with the specific motherboard/chipset
cannot be ruled out.  Can't help you with the motherboard if that turns
out to be a problem, but you might check with the kernel list to see if
there are known problems. To debug the possibility of a bad device
driver or SMP deadlock, try the following:

  a) Boot half the boxen with UP kernels.  See if the freezes still
occur on the UP boxen.  If they don't, you almost certainly have a
deadlock problem within drivers in the SMP kernel(s).  Join (at least
temporarily) the linux SMP kernel list and seek help there and on the
relevant driver list(s).

  b) Since the problem occurs across kernel revisions (and since the
kernels are generally SMP-stable) it is almost certainly in a driver,
whether or not the UP-kernel systems lock up (assuming, that is, that
it isn't in the motherboard).

  c) Of the devices you are running, I'd suspect the onboard Adaptec or
the Gigabit card.  Don can probably offer a more informed opinion on the
eepro, but my general impression is that it is pretty stable (it
certainly works fine for us in many boxes).  Legend has it
that the aic7xxx driver is in a state of upheaval currently as the
entire scsi stack is being rebuilt and fixed in the 2.4.x series -- I
cannot even get the aic7xxx module to load, for example, on a dual PIII
that I've been trying to install with RH 7.1beta/wolverine.  I don't
know if this would affect the 2.2.x kernels, though.

However, I don't know for sure what devices the aic7xxx driver supports
well these days.  I finally left the aic7xxx list because current UDMA
controllers and big, fast drives are obsoleting SCSI for all but the
most demanding server applications -- all I have are a few legacy
Adaptec controllers to support and (knock on wood) they work fine in the
2.2.16 kernels.  I do remember that Doug Ledford added some very handy
debugging features to the driver module to help debug serious (and very
similar) problems I encountered with e.g. the onboard 7890 in our Dell
PowerEdge 2300s two years ago, when the device was first released.  Turn
these on and see if they help at all.

Can't help you at all with the Packet Engines driver.

  d) To identify and repair the "problem child", all I can suggest is
the usual trick of removing components one at a time until the systems
(hopefully) magically stabilize.  Then either replace the component
(which may be the cheapest solution even if you have to throw the bad
components away - time is expensive and replacement is fast and easy) or
(more responsibly) join the relevant device list or kernel list and
communicate with the device/kernel maintainer(s).
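
The bookkeeping behind steps a) and d) can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions, not a
tool from the thread: the node names, the component labels and the
is_stable() callback are all hypothetical.  In practice is_stable()
would mean "boot the node with only these components active, run the
network-heavy job for several hours, and report whether it survived".

```python
def split_up_smp(nodes):
    """Step a): give the first half of the nodes a UP kernel, the rest SMP."""
    half = len(nodes) // 2
    return {n: ("UP" if i < half else "SMP") for i, n in enumerate(nodes)}


def freezes_by_kernel(assignment, frozen_nodes):
    """Tally observed freezes per kernel type from a list of frozen nodes."""
    counts = {"UP": 0, "SMP": 0}
    for node in frozen_nodes:
        counts[assignment[node]] += 1
    return counts


def find_problem_child(components, is_stable):
    """Step d): remove one component at a time and return the first one
    whose removal leaves the system stable, or None if none does.

    is_stable is a hypothetical callback that takes the list of
    components still installed and returns True if the node survived.
    """
    for candidate in components:
        remaining = [c for c in components if c != candidate]
        if is_stable(remaining):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical node names for a 16-node cluster.
    nodes = [f"node{i:02d}" for i in range(1, 17)]
    assignment = split_up_smp(nodes)
    # Made-up observation: three freezes, all on SMP nodes.
    print(freezes_by_kernel(assignment, ["node09", "node12", "node16"]))
    # Made-up outcome: the system is stable once the SCSI controller is out.
    print(find_problem_child(["scsi", "gige", "fast-ethernet"],
                             lambda remaining: "scsi" not in remaining))
```

If freezes show up only in the SMP column, that points toward an SMP
deadlock as in a); if both columns freeze, the one-at-a-time removal in
d) is the next thing to try.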

Remember that IDE drives are cheap, fast, and work just fine for most
local disk needs on a node, so just disabling all your adaptec
controllers (if that turns out to be the problem) and putting IDE drives
in would cost you maybe $1.5-2K but could save you days or even weeks of
systems programming effort screwing around with the onboard controllers.
You can always be a good citizen with one box as a holdout and help Doug
Ledford fix the driver while using all the rest.  In fact, in the short
run you could probably run diskless with 15 nodes (if that stabilized
the systems) and help work on the driver.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu