[Beowulf] HPL on an ad-hoc cluster

Wed Mar 7 08:12:33 PST 2007

I'm currently evaluating the possibility of building an ad-hoc cluster 
(a.k.a. flash mob) at a large computer hobbyist event using Linux live 
CDs. The "cluster" would potentially feature well over a thousand 
personal computers connected by a good GigE network.

While thinking up ideas for potential demos, running HPL naturally came 
up. However, the traditional MPI implementation will not cut it, as the 
"cluster" in question is very volatile. It's fairly certain that a 
number of nodes will drop out from the cluster during the time it would 
take to run a reasonably-sized HPL benchmark on the system. I have 
thought up some possible workarounds for this:

-Making a purpose-built implementation of HPL with an elaborate software 
checkpointing and migration mechanism. Probably too demanding.

-Using FT-MPI to make HPL more resilient to node failures. I don't 
have hands-on experience with FT-MPI so I'm not sure how much effort 
this would take.

-Running a short subset (single iteration of the main loop?) of HPL 
repeatedly until we get lucky and a run completes. Not that elegant, but 
obviously the simplest choice. How representative would a single 
iteration be of running the complete benchmark on the system?
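
To get a rough feel for the third option, here is a minimal sketch of the 
odds that a run completes before any node drops out. It assumes node 
dropouts are independent and exponentially distributed, which is only a 
guess at how people at such an event behave. The function names and the 
numbers in the example (1000 nodes, a 24-hour mean time between dropouts 
per node, and the run lengths) are hypothetical placeholders, not 
measurements.

```python
import math


def completion_probability(n_nodes, run_seconds, node_mtbf_seconds):
    """Probability that no node drops out during one run.

    Assumes independent, exponentially distributed dropouts with the
    same mean time between dropouts on every node (an assumption).
    """
    return math.exp(-n_nodes * run_seconds / node_mtbf_seconds)


def expected_attempts(n_nodes, run_seconds, node_mtbf_seconds):
    """Mean number of attempts until one run completes (geometric)."""
    return 1.0 / completion_probability(n_nodes, run_seconds, node_mtbf_seconds)


if __name__ == "__main__":
    n_nodes = 1000            # placeholder cluster size
    node_mtbf = 24 * 3600.0   # placeholder: one dropout per node per day
    for run_minutes in (1, 10, 60):
        p = completion_probability(n_nodes, run_minutes * 60, node_mtbf)
        tries = expected_attempts(n_nodes, run_minutes * 60, node_mtbf)
        print(f"{run_minutes:3d} min run: P(complete) = {p:.3g}, "
              f"expected attempts = {tries:.3g}")
```

Under these assumptions the success probability falls off exponentially 
with both the node count and the run length, so the retry approach only 
looks practical if each attempt is kept very short.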

So, do you think this is a pipe dream or a feasible project? Which 
path would you take to implement this?

Olli-Pekka
-- 
Olli-Pekka Lehto, Systems Specialist, Systems Services, CSC
PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 4572302
CSC is the Finnish IT Center for Science, www.csc.fi,
e-mail: Olli-Pekka.Lehto at csc.fi