[Beowulf] HPL as a learning experience

Gus Correa gus at ldeo.columbia.edu
Tue Mar 16 10:00:07 PDT 2010


Hi Carsten

The problem is most likely mpich 1.2.7.
MPICH-1 is old and no longer maintained.
It is built on the low-level p4 communication library, which doesn't
seem to get along with current Linux kernels and/or
current Ethernet card drivers.

There have been several postings on this list, on the ROCKS Clusters list,
on the MPICH list, etc., reporting errors very similar to yours:
a p4 error followed by a segmentation fault.
The MPICH developers recommend upgrading to MPICH2 because of
these problems, and also for better performance, ease of use, etc.

The easy fix is to switch to another MPI implementation, say, OpenMPI or MPICH2.
I would guess both are available as Debian packages.
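For instance, something along these lines should do it (the package
names below are from memory and may differ on Lenny, so please check
with "apt-cache search openmpi" and "apt-cache search mpich" first):

  # Open MPI runtime, compiler wrappers, and headers
  apt-get install openmpi-bin libopenmpi-dev
  # or MPICH2, if your release packages it
  apt-get install mpich2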

Alternatively, you can build either one very easily
from source using just gcc/g++/gfortran.
Get the source code and documentation,
then read the README and FAQ (OpenMPI),
or the Installer's Guide and User's Guide (MPICH2), for details:

OpenMPI
http://www.open-mpi.org/
http://www.open-mpi.org/software/ompi/v1.4/
http://www.open-mpi.org/faq/
http://www.open-mpi.org/faq/?category=building

MPICH2:
http://www.mcs.anl.gov/research/projects/mpich2/
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs
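Just to sketch the build procedure (the version number and install
prefix below are only examples; use whatever tarball you actually
download, and see the guides above for the full configure options):

  # Open MPI (example tarball name; take the current one from the site)
  tar xzf openmpi-1.4.1.tar.gz
  cd openmpi-1.4.1
  ./configure --prefix=/opt/openmpi CC=gcc CXX=g++ F77=gfortran FC=gfortran
  make all
  make install     # as root, or pick a prefix you can write to

  # MPICH2 follows the same configure / make / make install pattern, e.g.
  # ./configure --prefix=/opt/mpich2 CC=gcc CXX=g++ F77=gfortran FC=gfortran

Afterwards put the chosen prefix's bin directory first in your PATH
(and its lib directory in LD_LIBRARY_PATH), so that mpicc, mpif77,
and mpirun come from the new installation and not from MPICH-1.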

I compiled and ran HPL here with both OpenMPI and MPICH2
(and MVAPICH2 as well), and it works just fine,
both over Ethernet and over InfiniBand.
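One caveat: xhpl has to be rebuilt against whichever MPI you install.
Either point MPdir/MPinc/MPlib in your Make.<arch> at the new
installation, or simply set its mpicc/mpif77 as the compiler and
linker there. After that the launch is essentially what you are
already doing; the paths and hostfile name below just mirror the
ones in your message:

  # Open MPI: -np must match P x Q from HPL.dat (here 1 x 2 = 2)
  /opt/openmpi/bin/mpirun -np 2 -hostfile machines /nfs/xhpl

  # MPICH2 uses mpiexec instead; see its User's Guide for how to
  # start the process manager before launching jobs.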

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Carsten Aulbert wrote:
> Hi all,
> 
> I wanted to run high performance linpack mostly for fun (and of course to 
> learn more about it and stress test a couple of machines). However, so far 
> I've had very mixed results.
> 
> I downloaded the 2.0 version released in September 2008 and managed to 
> compile it with mpich 1.2.7 on Debian Lenny. The resulting xhpl file is 
> dynamically linked like this:
> 
>         linux-vdso.so.1 =>  (0x00007fffca372000)
>         libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000)
>         librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000)
>         libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000)
>         libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000)
>         libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000)
>         libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000)
>         /lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000)
> 
> Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB 
> physical RAM). I used 
> 
> http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
> 
> to generate input files for a single-core and a dual-core test ([1] and [2]).
> 
> Starting the single-core run does not pose any problem:
> /usr/bin/mpirun.mpich -np 1  -machinefile machines /nfs/xhpl
> 
> where machines is just a simple file containing the name of this host four times. 
> So far so good. 
> ============================================================================
> T/V                N    NB     P     Q               Time             Gflops
> ----------------------------------------------------------------------------
> WR11C2R4       14592   128     1     1             407.94          5.078e+00
> ----------------------------------------------------------------------------
> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0087653 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0209927 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0045327 ...... PASSED
> ============================================================================
> 
> When starting the two-core run, I get the following error message after a 
> couple of seconds (once RSS reaches the VIRT value in top):
> 
> /usr/bin/mpirun.mpich -np 2  -machinefile machines /nfs/xhpl
> p0_20535:  p4_error: interrupt SIGSEGV: 11
> rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32
> 
> SIGSEGV with p4_error indicates a seg fault within hpl - that's as far as I've 
> gotten with Google, but right now I have no idea how to proceed. I somehow doubt 
> that this venerable program is so buggy that I'd hit it on my first day ;)
> 
> Any ideas where I might be doing something wrong?
> 
> Cheers
> 
> Carsten
> 
> [1]
> single core test
> HPLinpack benchmark input file
> Innovative Computing Laboratory, University of Tennessee
> HPL.out      output file name (if any) 
> 8            device out (6=stdout,7=stderr,file)
> 1            # of problems sizes (N)
> 14592         Ns
> 1            # of NBs
> 128           NBs
> 0            PMAP process mapping (0=Row-,1=Column-major)
> 1            # of process grids (P x Q)
> 1            Ps
> 1            Qs
> 16.0         threshold
> 1            # of panel fact
> 2            PFACTs (0=left, 1=Crout, 2=Right)
> 1            # of recursive stopping criterium
> 4            NBMINs (>= 1)
> 1            # of panels in recursion
> 2            NDIVs
> 1            # of recursive panel fact.
> 1            RFACTs (0=left, 1=Crout, 2=Right)
> 1            # of broadcast
> 1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> 1            # of lookahead depth
> 1            DEPTHs (>=0)
> 2            SWAP (0=bin-exch,1=long,2=mix)
> 64           swapping threshold
> 0            L1 in (0=transposed,1=no-transposed) form
> 0            U  in (0=transposed,1=no-transposed) form
> 1            Equilibration (0=no,1=yes)
> 8            memory alignment in double (> 0)
> ##### This line (no. 32) is ignored (it serves as a separator). ######
> 0                               Number of additional problem sizes for PTRANS
> 1200 10000 30000                values of N
> 0                               number of additional blocking sizes for PTRANS
> 40 9 8 13 13 20 16 32 64        values of NB
> 
> [2]
> dual core setup
> HPLinpack benchmark input file
> Innovative Computing Laboratory, University of Tennessee
> HPL.out      output file name (if any) 
> 8            device out (6=stdout,7=stderr,file)
> 1            # of problems sizes (N)
> 14592         Ns
> 1            # of NBs
> 128           NBs
> 0            PMAP process mapping (0=Row-,1=Column-major)
> 1            # of process grids (P x Q)
> 1            Ps
> 2            Qs
> 16.0         threshold
> 1            # of panel fact
> 2            PFACTs (0=left, 1=Crout, 2=Right)
> 1            # of recursive stopping criterium
> 4            NBMINs (>= 1)
> 1            # of panels in recursion
> 2            NDIVs
> 1            # of recursive panel fact.
> 1            RFACTs (0=left, 1=Crout, 2=Right)
> 1            # of broadcast
> 1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> 1            # of lookahead depth
> 1            DEPTHs (>=0)
> 2            SWAP (0=bin-exch,1=long,2=mix)
> 64           swapping threshold
> 0            L1 in (0=transposed,1=no-transposed) form
> 0            U  in (0=transposed,1=no-transposed) form
> 1            Equilibration (0=no,1=yes)
> 8            memory alignment in double (> 0)
> ##### This line (no. 32) is ignored (it serves as a separator). ######
> 0                               Number of additional problem sizes for PTRANS
> 1200 10000 30000                values of N
> 0                               number of additional blocking sizes for PTRANS
> 40 9 8 13 13 20 16 32 64        values of NB



