[Beowulf] HPL on an ad-hoc cluster
oplehto at csc.fi
Wed Mar 7 08:12:33 PST 2007
I'm currently evaluating the possibility of building an ad-hoc cluster
(a.k.a. flash mob) at a large computer hobbyist event using Linux live
CDs. The "cluster" would potentially feature well over a thousand
personal computers connected by a good GigE network.
While thinking up ideas for potential demos, running HPL naturally came
up. However, the traditional MPI implementation will not cut it as the
"cluster" in question is very volatile. It's fairly certain that a
number of nodes will drop out from the cluster during the time it would
take to run a reasonably-sized HPL benchmark on the system. I have
thought up some possible workarounds for this:
- Making a purpose-built implementation of HPL with an elaborate
software checkpointing and migration mechanism. Probably too demanding.
- Using FT-MPI to make HPL more resilient to node failures. I don't
have hands-on experience with FT-MPI so I'm not sure how much effort
this would take.
- Running a short subset (single iteration of the main loop?) of HPL
repeatedly until we get lucky and a run completes. Not that elegant but
obviously the simplest choice. How representative would a single
iteration be of running the complete benchmark on the system?
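
To get a rough feel for the "run until we get lucky" option, here is a
back-of-the-envelope sketch. It assumes that nodes drop out
independently and that node lifetimes are exponentially distributed,
which may well not hold at a real event. The 8-hour mean uptime and the
run lengths are made-up illustrative numbers, not measurements.

```python
import math


def completion_probability(nodes, run_seconds, mean_uptime_seconds):
    """Probability that no node drops out during a single run.

    Assumes independent, exponentially distributed node lifetimes,
    so P = exp(-nodes * run_seconds / mean_uptime_seconds).
    """
    return math.exp(-nodes * run_seconds / mean_uptime_seconds)


def expected_attempts(p_success):
    """Mean number of tries until the first success (geometric distribution)."""
    return math.inf if p_success == 0.0 else 1.0 / p_success


if __name__ == "__main__":
    nodes = 1000
    mean_uptime = 8 * 3600.0  # hypothetical: a node stays up 8 hours on average
    for run_seconds in (60, 600, 3600):  # hypothetical run lengths
        p = completion_probability(nodes, run_seconds, mean_uptime)
        print(f"{run_seconds:5d} s run: P(complete) = {p:.3g}, "
              f"expected attempts = {expected_attempts(p):.3g}")
```

Under these assumptions the chance of finishing falls off exponentially
with both node count and run length, so the retry approach only looks
workable if a single run can be kept very short.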
So, do you think this is a pipe dream or a feasible project? Which
path would you take to implement this?
Olli-Pekka Lehto, Systems Specialist, Systems Services, CSC
PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 4572302
CSC is the Finnish IT Center for Science, www.csc.fi,
e-mail: Olli-Pekka.Lehto at csc.fi