[Beowulf] trouble running the linpack xhpl benchmark
ballen at gravity.phys.uwm.edu
Fri May 5 14:04:43 PDT 2006
I've built three other large clusters in the past, but was never motivated
to do a Top500 linpack benchmark for them. This time around, for our new
Nemo cluster, I want to have linpack results for the Top500 list. So Kipp
Cannon, one of our group's postdocs, has spent a few days setting up and
running the benchmark.
We have 640 dual-core 2.2 GHz Opteron 175 nodes with 2 GB of RAM per node and a
good gigE network.
We're having problems getting xhpl to run on the entire cluster, and are
wondering if someone on this list has insight into what might be going
wrong. At the moment, the software combination is gcc + LAM/MPI + ATLAS +
HPL. Note that in normal use the cluster runs standalone executables
managed via Condor (trivially parallel code!), so this is our first use
of any MPI code in at least three years.
In tests on up to 338 nodes (676 cores), the benchmark runs fine and we
are getting above 60% of peak floating-point performance. But attempting
to use the entire cluster (640 nodes, 1280 cores) seems to trigger the
out-of-memory killer on some nodes. The jobs never really seem to start
running; they are killed before calling MPI_Init (hence the error message
we see from LAM: "job exited before calling mpi_init()").
The jobs die very quickly, so we have not been able to see how much memory
they try to allocate. We are using a spreadsheet given to us by David
Cownie at AMD for calculating the problem size based on the maximum usable
RAM per core, and have found that the spreadsheet works correctly: runs
on 20 cores, 196 cores, and 676 cores with problem sizes chosen by the
spreadsheet all show the predicted RAM usage per core.
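
(For reference, the arithmetic for sizing the problem is simple enough to
sketch in a few lines of Python. The 80% usable-memory fraction and the
NB = 232 block size below are illustrative guesses of ours, not values
taken from David's spreadsheet.)

    # Estimate the HPL problem size N: the N x N matrix of 8-byte
    # doubles must fit in the aggregate usable RAM across all cores.
    import math

    def hpl_problem_size(ram_per_core_bytes, num_cores,
                         usable_fraction=0.8):
        # Leave headroom for the OS, MPI buffers, etc.
        total_bytes = ram_per_core_bytes * num_cores * usable_fraction
        # The matrix needs 8 * N^2 bytes, so N = sqrt(bytes / 8).
        n = int(math.sqrt(total_bytes / 8))
        # Round N down to a multiple of the block size NB.
        nb = 232
        return (n // nb) * nb

    # 1 GB per core (2 GB/node, 2 cores/node), all 1280 cores:
    print(hpl_problem_size(2**30, 1280))   # -> 370504
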
Could there be some threshold in xhpl above which its RAM usage increases
for other reasons?
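
(One way we might check this, since the processes die too fast to watch
by hand: launch xhpl under a small watcher that samples its peak memory
from /proc until it exits. Just a sketch for Linux; the xhpl path and
polling interval are placeholders.)

    # Run xhpl and record its VmPeak from /proc/<pid>/status, to see
    # how much memory it asked for before the OOM killer fired.
    import subprocess, time

    proc = subprocess.Popen(["./xhpl"])        # placeholder path
    peak = "unknown"
    while proc.poll() is None:
        try:
            for line in open("/proc/%d/status" % proc.pid):
                if line.startswith("VmPeak:"):
                    peak = " ".join(line.split()[1:])
        except IOError:
            break                              # process just exited
        time.sleep(0.01)
    print("xhpl exit code %s, VmPeak %s" % (proc.returncode, peak))
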
What about the "PxQ" parameters? For 676 cores we are using a square grid
(P = Q = 26), but that has to change for 1280 cores, since 1280 is not a
perfect square. Does anyone know of problems with running xhpl when
P != Q on x86_64?
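
(For what it's worth, the candidate grids are easy to enumerate; the HPL
tuning notes suggest P <= Q with the grid as close to square as possible.
A quick sketch:)

    # Enumerate P x Q process grids with P <= Q for a given core count.
    def process_grids(num_cores):
        grids = []
        p = 1
        while p * p <= num_cores:
            if num_cores % p == 0:
                grids.append((p, num_cores // p))
            p += 1
        return grids

    # For 1280 cores the closest-to-square grid is the last pair:
    print(process_grids(1280)[-1])   # -> (32, 40)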