Request for advice on which kernel? 2.2 or 2.4?

Martin Siegert siegert at sfu.ca
Wed Oct 3 15:15:57 PDT 2001


On Wed, Oct 03, 2001 at 09:57:44AM -0400, Donald Becker wrote:
> On Wed, 3 Oct 2001, Michelle Kuttel wrote:
> 
> > I would like to request some opinions/advice on which kernel is best for
> > my Beowulf cluster.  We have a cluster of 16 Dual processor PentiumIII-866
> > MHz work nodes (head node AMD athlon 1Ghz CPU, single processor).  It has
> > been running for a few months now (computational chemistry CHARMM code
> > principally).
> >  I have installed both 2.2.14-5 kernel (with Loncaric's
> > tcpfix kernel patch)
> 
> We use and recommend this TCP patch.  Josip did excellent work.
> 
> > and the 2.2.4 kernel at different times.
> 
> The biggest advantage of 2.4 kernel is the SMP improvements to the
> network stack.  You'll see less benefit with your single processor
> nodes, with most of the benefit on four processor nodes.

This brings up another issue: the APIC code (bugs?) in the 2.4 series
of kernels. I encounter the following problem: with 2.4 kernels and the
LAM MPI distribution, some MPI programs hang almost every time. I have
tried almost every version, from RedHat's 2.4.3-12 smp kernel through
2.4.5 - 2.4.10, including various ac versions. The affected programs are
mostly parallel FFT jobs (from the fftw library) that use global
communication patterns (MPI_Alltoall). I am using dual Athlon 1.2GHz nodes,
each with 4 3com NICs, three of which are channel bonded.
I make the following observations:

- the program hangs while executing a r = read(sock, buf, nbytes) statement
  over and over again. Typically r=56 or r=696 with nbytes=116765796, i.e.,
  the remaining byte count shrinks only in steps of 56 or 696, so for
  practical purposes the program hangs.

- when using mpich the program does not hang.

- when using the 2.2.19 smp kernel the program does not hang.

- using the append="noapic" setting in /etc/lilo.conf with a 2.4.x kernel
  reduces the failure rate, but the program still hangs often enough to be
  unacceptable for a production environment.
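
To put the read() observation above in numbers, here is a rough sketch (not
taken from LAM or from the test program) that counts how many read() calls
it would take to drain the request if every call returned only 56 or 696
bytes. The helper name reads_needed is made up for illustration, and the
sketch assumes the return size stays constant, which the real program need
not do.

```python
# Byte count passed to read(), as reported in the observation above.
NBYTES = 116765796


def reads_needed(total_bytes: int, bytes_per_read: int) -> int:
    """Return how many read() calls are needed to receive total_bytes
    if every call returns exactly bytes_per_read (hypothetical helper;
    the last call may return fewer bytes)."""
    # Integer ceiling division, avoiding floating point.
    return (total_bytes + bytes_per_read - 1) // bytes_per_read


if __name__ == "__main__":
    for chunk in (56, 696):
        print(f"r={chunk}: {reads_needed(NBYTES, chunk)} read() calls")
```

Under that assumption this comes to roughly 2.1 million calls at r=56 and
roughly 168 thousand calls at r=696 for a single message.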


