[Beowulf] trouble running the linpack xhpl benchmark

Bruce Allen ballen at gravity.phys.uwm.edu
Fri May 5 14:04:43 PDT 2006


I've built three other large clusters in the past, but was never motivated 
to do a Top500 linpack benchmark for them.  This time around, for our new 
Nemo cluster, I want to have linpack results for the Top500 list.  So Kipp 
Cannon, one of our group's postdocs, has spent a few days setting up and 
running linpack/xhpl.

We have 640 dual-core 2.2 GHz Opteron 175 nodes with 2 GB per node and a 
good gigE network.

We're having problems getting xhpl to run on the entire cluster, and are 
wondering if someone on this list might have insight into what's going 
wrong.  At the moment, the software combination is gcc + LAM/MPI + 
ATLAS + HPL.  Note that in normal use the cluster runs standalone 
executables managed via Condor (trivially parallel code!), so this is our 
first use of MPI or any MPI code in at least three years.

Testing on up to 338 nodes (676 cores), the benchmark runs fine and we are 
getting above 60% of peak floating-point performance.  But attempting to 
use the entire cluster (640 nodes, 1280 cores) seems to trigger the 
out-of-memory killer on some nodes.  The jobs never really seem to start 
running; they are killed before calling mpi_init (the error we see from 
LAM is "job exited before calling mpi_init()").
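
For scale, here is the back-of-the-envelope peak we're comparing against; 
the figure of 2 double-precision flops per cycle per core for these 
Opterons is an assumption on my part, not something we've measured:

#!/usr/bin/env python
# Back-of-the-envelope theoretical peak for the cluster.
# 2 double-precision flops/cycle/core is assumed for these Opterons.
cores_per_node  = 2
clock_ghz       = 2.2
flops_per_cycle = 2   # assumed

def peak_gflops(nodes):
    return nodes * cores_per_node * clock_ghz * flops_per_cycle

for nodes in (338, 640):
    p = peak_gflops(nodes)
    print("%4d nodes: peak %7.1f Gflops, 60%% of peak %7.1f Gflops"
          % (nodes, p, 0.6 * p))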

The jobs die so quickly that we have not been able to see how much memory 
they try to allocate.  We are using a spreadsheet given to us by David 
Cownie at AMD to calculate the problem size from the maximum usable RAM 
per core, and it has worked correctly so far: runs on 20, 196, and 676 
cores, with problem sizes chosen from the spreadsheet, all used the 
predicted amount of RAM per core.
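
For reference, the usual rule of thumb for sizing N is that the 8*N^2 
bytes of the double-precision matrix should fill most of the total RAM.  
A minimal sketch of that arithmetic is below; the 80% memory fraction and 
NB = 128 block size are just assumptions, not necessarily what the 
spreadsheet uses:

#!/usr/bin/env python
# Rule-of-thumb HPL problem size: the N x N double-precision matrix
# takes 8*N^2 bytes, so choose N to fill a chosen fraction of total RAM
# and round down to a multiple of the block size NB.
# The 0.8 fraction and NB = 128 are assumptions, not the spreadsheet's.
import math

nodes        = 640
ram_per_node = 2 * 1024**3   # 2 GB in bytes
mem_fraction = 0.8           # assumed usable fraction of RAM
nb           = 128           # assumed block size

n_max = int(math.sqrt(mem_fraction * nodes * ram_per_node / 8))
n     = (n_max // nb) * nb
print("suggested N = %d" % n)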

Could there be some threshold in xhpl where, above some problem size, its 
RAM usage increases for other reasons?

What about the "PxQ" parameters?  For 676 cores we use a square P = Q 
grid, but for 1280 cores we have to change that.  Does anyone know of 
problems running xhpl with P != Q on x86_64?
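
For what it's worth, as far as I recall the HPL TUNING notes suggest 
P <= Q with the grid as close to square as possible.  Here is a quick 
sketch that lists the candidate grids for 1280 processes; which of them 
behaves best on our network is still an open question:

#!/usr/bin/env python
# List the possible P x Q process grids for a given core count,
# nearest-to-square first.  (The P <= Q, near-square advice is my
# recollection of the HPL TUNING notes.)
import math

cores = 1280
grids = [(p, cores // p) for p in range(1, int(math.sqrt(cores)) + 1)
         if cores % p == 0]
for p, q in sorted(grids, key=lambda pq: pq[1] - pq[0]):
    print("P = %4d   Q = %4d" % (p, q))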

Cheers,
 	Bruce Allen


