[Beowulf] trouble running the linpack xhpl benchmark

Craig Tierney ctierney at hypermall.net
Fri May 5 14:24:11 PDT 2006


Bruce Allen wrote:
> I've built three other large clusters in the past, but was never 
> motivated to do a Top500 linpack benchmark for them.  This time around, 
> for our new Nemo cluster, I want to have linpack results for the Top500 
> list.  So Kipp Cannon, one of our group's postdocs, has spent a few days 
> setting up and running linpack/xhpl.
> 
> We have 640 dual-core 2.2 GHz opteron 175 nodes with 2GB per node and a 
> good gigE network.
> 
> We're having problems getting xhpl to run on the entire cluster, and are 
> wondering if someone on this list might have insight into what might be 
> going wrong.  At the moment, the software combination is gcc + lam/mpi + 
> atlas + hpl.  Note that in our normal use the cluster runs standalone 
> executables managed via condor (trivially parallel code!) so this is our 
> first use of MPI or any MPI code in at least three years.

Use Goto's BLAS library.  It is faster than ATLAS.

> 
> Testing on up to 338 nodes (676 cores), the benchmark runs fine and we 
> are getting above 60% of peak floating-point performance. But, 
> attempting to use the entire cluster (640 nodes, 1280 cores) seems to 
> trigger the out-of-memory killer on some nodes.  The jobs never really 
> seem to start running, they are killed before calling mpi_init (which is 
> the error message we see from LAM: "job exited before calling mpi_init()").
> 
> The jobs die very quickly, so we have not been able to see how much 
> memory they try to allocate.  We are using a spreadsheet given to us by 
> David Cownie at AMD for calculating the problem size based on the 
> maximum usable RAM per core, and have found that that spreadsheet works 
> correctly: running on 20 cores, 196 cores, and 676 cores with problem 
> sizes chosen by that spreadsheet show the same, predicted, RAM used per 
> core in all cases.
> 
> Could there be some threshold in xhpl, where above some problem size 
> its RAM usage increases for other reasons?
> 
> What about the "PxQ" parameters?  For 676 cores we are using square P=Q 
> but change this to use 1280 cores.  Does anyone know of problems with 
> running xhpl when P != Q on x86_64?

Have you tried running xhpl on both halves of the system?  This will 
tell you if you have hardware problems on one side of the system.

Also, try setting N to a small number, like 10000, for the entire 
cluster.  That will help you isolate the problem as well.
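
For a sense of scale, here is a rough sketch of the matrix memory involved.
It assumes double precision (8 bytes per element) and counts only the N x N
matrix, not HPL's workspace or MPI buffers.  The function names are made up
for illustration, and the 80% fill fraction is a common rule of thumb, not
something taken from the AMD spreadsheet.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL solves a dense double-precision system


def matrix_bytes_per_core(n, cores):
    """Approximate share of the N x N matrix held by each core."""
    return n * n * BYTES_PER_DOUBLE / cores


def largest_n(total_ram_bytes, fraction=0.8):
    """Largest N whose matrix fits in `fraction` of total RAM.

    The default fraction is an assumed rule of thumb; tune it for
    your own nodes.
    """
    return int(math.sqrt(fraction * total_ram_bytes / BYTES_PER_DOUBLE))


if __name__ == "__main__":
    cores = 1280
    # N = 10000 spread over the whole cluster: well under 1 MB per core.
    print(matrix_bytes_per_core(10000, cores) / 1e6, "MB per core")
    # Upper bound on N for 640 nodes with 2 GB each (taken as GiB here).
    print(largest_n(640 * 2 * 2**30))
```

With N=10000 the matrix takes under a megabyte per core, so if the jobs
still die at that size the problem size is probably not what triggers the
out-of-memory killer.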

Just make sure that P<Q.  Keep it as square as possible.  32x40 should
work well for your system.
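
One way to pick such a grid is sketched below: take the largest divisor of
the process count that does not exceed its square root and use it as P.
The function name is made up for illustration, and this is only a starting
point, not a tuned choice.

```python
import math


def squarest_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs, P <= Q, and P as large as possible."""
    p = math.isqrt(nprocs)
    # Walk down from floor(sqrt(nprocs)) to the nearest divisor.
    while nprocs % p != 0:
        p -= 1
    return p, nprocs // p


if __name__ == "__main__":
    print(squarest_grid(676))   # the square 26 x 26 grid
    print(squarest_grid(1280))  # 32 x 40 for the full cluster
```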

Craig

> 
> Cheers,
>     Bruce Allen
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf



