[Beowulf] HPL on an ad-hoc cluster
Robert G. Brown
rgb at phy.duke.edu
Thu Mar 8 07:13:19 PST 2007
On Wed, 7 Mar 2007, Olli-Pekka Lehto wrote:
> I'm currently evaluating the possibility of building an ad-hoc cluster (aka.
> flash mob) at a large computer hobbyist event using Linux live CDs. The
> "cluster" would potentially feature well over a thousand personal computers
> connected by a good GigE network.
> While thinking up ideas for potential demos, running HPL naturally came up.
> However the traditional MPI implementation will not cut it as the "cluster"
> in question is very volatile. It's fairly certain that a number of nodes will
> drop out from the cluster during the time it would take to run a
> reasonably-sized HPL benchmark on the system. I have thought up some possible
> workarounds for this:
> -Making a purpose-built implementation of HPL with elaborate software
> checkpointing and migration mechanism. Probably too demanding.
I agree. A whole fairly nasty project there all by itself. Although it
MAY be possible if you use e.g. condor and its special checkpointing
library that "can" permit checkpointing and node migration smoothly -- I
don't know if HPL is in the class of computations that can be recompiled
in this way but somebody probably does on the condor list or this list.
The problem is that even if you can get a degree of node failover,
you'll also have to deal with variability in the cluster size, and that's
a really tough problem with an app that probably expects to be told how
many nodes to run on. Do you tell it to use more nodes than you'll ever
have, or fewer? In the former case it probably won't work; if there
are any node dependencies you'll end up in deadlock. In the latter you
waste resources, although you may become robust. You'd have to keep
the number of nodes used in the computation below the least possible
number of systems participating, though, or you are back in the first
case and its deadlock unless you really, really work hard.
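As a rough illustration of the "use fewer nodes than you'll ever have"
option: HPL lays its processes out on a P x Q grid (the Ps and Qs entries
in HPL.dat), so one has to pick a process count that stays under the
smallest expected node count and also factors into a reasonably square
grid. The sketch below is only a sketch. The function names, the 10%
headroom, and the "Q at most 2P" aspect rule are assumptions made for
illustration, not anything HPL itself prescribes.

```python
import math


def hpl_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs, P <= Q, and P as large as possible."""
    p = math.isqrt(nprocs)
    while nprocs % p != 0:
        p -= 1
    return p, nprocs // p


def conservative_grid(min_expected_nodes, headroom=0.9):
    """Pick a process count safely below the smallest expected node count.

    headroom is a hypothetical safety factor: 0.9 means "plan on using at
    most 90% of the fewest nodes we ever expect to have".
    Returns (nprocs, P, Q).
    """
    budget = max(1, int(min_expected_nodes * headroom))
    # Walk downward from the budget until the count factors into a grid
    # that is not too elongated (assumed rule of thumb: Q <= 2 * P).
    for n in range(budget, max(0, budget // 2), -1):
        p, q = hpl_grid(n)
        if q <= 2 * p:
            return n, p, q
    # Fall back to whatever grid the full budget gives.
    p, q = hpl_grid(budget)
    return budget, p, q


if __name__ == "__main__":
    nprocs, p, q = conservative_grid(1000)
    print(f"use {nprocs} processes on a {p} x {q} grid")
```

The leftover nodes are the "wasted resources" mentioned above; they buy
slack only if something can actually swap a spare in for a node that drops.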
> -Using FT-MPI to make the HPL more resilient to node failures. I don't have
> hands-on experience with FT-MPI so I'm not sure how much effort this would
> -Running a short subset (single iteration of the main loop?) of HPL
> repeatedly until we get lucky and a run completes. Not that elegant but
> obviously the simplest choice. How well would the single iteration be
> representative of running the complete benchmark on the system?
Don't know and don't know.
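One can at least put rough numbers on the "repeat until we get lucky"
idea. The sketch below assumes that nodes drop out independently and at a
constant rate (an exponential model), and that a single dropout kills the
run; both are simplifying assumptions, and the 1000 nodes and 8-hour mean
stay are made-up inputs rather than measurements.

```python
import math


def p_run_completes(nodes, run_minutes, mean_minutes_to_dropout):
    """Probability that no node drops out during one run.

    Assumes independent exponential dropout times with the given mean,
    so each node survives with probability exp(-run / mean).
    """
    return math.exp(-nodes * run_minutes / mean_minutes_to_dropout)


def expected_attempts(nodes, run_minutes, mean_minutes_to_dropout):
    """Expected number of tries until one run completes (geometric mean 1/p)."""
    return 1.0 / p_run_completes(nodes, run_minutes, mean_minutes_to_dropout)


if __name__ == "__main__":
    # Hypothetical inputs: 1000 nodes, each staying 8 hours on average.
    for run_minutes in (1, 5, 30):
        p = p_run_completes(1000, run_minutes, 480)
        tries = expected_attempts(1000, run_minutes, 480)
        print(f"{run_minutes:>3} min run: P(complete) = {p:.3g}, "
              f"expected attempts = {tries:.3g}")
```

Under this model the success probability falls off exponentially in
(nodes x run length), which is why only a short subset of the benchmark
has any chance of finishing on a large, volatile cluster.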
> So, do you think this is a pipe dream or a feasible project? Which path
> would you take to implement this?
I'd look into the condor issue, at least briefly. If somebody has
condorified HPL to make it moderately fault-tolerant, you're golden and
can even play with node drops ahead of time in a sandbox of your own
creation. These days condor has a real open source license, although it
is still a bit of a pain working through the project site because there
are still some legacy features of their dreams of proprietariness or
Two other suggestions. There are other apps out there that are a bit
more coarse grained or even embarrassingly parallel and that are by
design highly resilient to nodes popping in and dropping out. The
venerable RC5 keycrack comes to mind, as does SETI@home. Then there are
the eternal povray and mandelbrot apps, although at this point
mandelbrot has gotten to where it is really hard to challenge it without
rewriting the front end somehow to use higher precision as one can
descend to the "bottom" (in terms of float resolution) of any
interesting part of the pattern in a matter of seconds on any single CPU
nowadays, no cluster need apply. Doug Eadline and clustermonkey may
have some articles or suggestions for good demoware as this is a
perennial issue and there really needs to be a standard drop for this
sort of thing somewhere (clusterdemo-0.0.1-1.i386.rpm anyone?).
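What makes these coarse-grained apps tolerant of nodes popping in and out
is usually some variant of "hand out independent work units, and hand them
out again if nobody reports back". A minimal single-process sketch of that
idea follows, using mandelbrot rows as the work units. The LeaseQueue
class, its method names, and the lease timeout are all hypothetical and
are not taken from RC5, SETI@home, or povray; a real demo would put this
behind a network protocol.

```python
import time
from collections import deque


class LeaseQueue:
    """Hands out tasks on a time-limited lease and re-queues expired ones."""

    def __init__(self, tasks, lease_seconds, clock=time.monotonic):
        self.pending = deque(tasks)
        self.leased = {}    # task -> (worker, lease deadline)
        self.results = {}   # task -> result
        self.lease_seconds = lease_seconds
        self.clock = clock

    def _reclaim_expired(self):
        now = self.clock()
        for task, (_worker, deadline) in list(self.leased.items()):
            if deadline <= now:
                # The worker vanished (or is too slow): offer the task again.
                del self.leased[task]
                self.pending.append(task)

    def checkout(self, worker):
        """Return the next task for this worker, or None if nothing is pending."""
        self._reclaim_expired()
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.leased[task] = (worker, self.clock() + self.lease_seconds)
        return task

    def submit(self, task, result):
        """Record a result; the first answer wins, late duplicates are ignored."""
        if task in self.results:
            return
        self.results[task] = result
        self.leased.pop(task, None)
        try:
            # The task may have been re-queued after its lease expired.
            self.pending.remove(task)
        except ValueError:
            pass

    def done(self):
        return not self.pending and not self.leased


def mandelbrot_row(y, width=64, height=32, max_iter=50):
    """Escape-time iteration counts for one row of a small mandelbrot image."""
    counts = []
    for x in range(width):
        c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
        z, n = 0j, 0
        while n < max_iter and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        counts.append(n)
    return counts
```

Because each row depends on nothing but its own index, a node that
disappears costs only the lease timeout on the rows it was holding, which
is the property HPL lacks.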
The second suggestion is to look at LTSP, warewulf, and virtualization
as an even niftier way to build at least part of your flash cluster. If
you set up a PXE server you can very likely boot systems into a
"standard network appliance" with your cluster on it without touching
their disks OR needing a CD drive -- nearly all systems these days have
PXE-capable NICs. As an extension or slightly different variant of this
idea, look at real virtualization of the environment. vmware currently
distributes vmware-player for free, and there are several "virtual
cluster node" appliances for it to play available on the vmware website, including a
ready-to-run condor node. The advantage there is that people can even
leave their systems running (yuk!) Windows and see linux running a
cluster node in a virtual window. Or leave them running THEIR favorite
distro of linux but see the appliance ditto. Xen and kvm are also very
interesting projects in this space.
Allow me to editorialize for just a minute. In my opinion,
virtualization IS the next "killer app". In particular, Windows killer
app -- it will hurt Microsoft more than all the Linux servers ever
installed because it extends to the holy desktop and gives people
instant choices there. I'm still struggling with just how to implement
this into my own toy clusters and personal networks, but it is rapidly
becoming a major league fact of life in large scale computing
environments because it at long last enables the near-mythical "thin
client" to roll out across corporate desktops everywhere. Managing 100
individual workstations in a Windows LAN type environment is
horrendously expensive: if you manage them one at a time labor and
security kill you, if you manage them using Citrix or Terminal Server
software costs will kill you -- estimate them at perhaps $400 per node
ON TOP OF basic windows licenses, per node, by the time you buy all the
servers, CALs, and so on required. Double that if the nodes need MS
Virtualization has the capacity to "instantly" change the management CBA
forever. Buy a cheap thin client -- no local disk at all, just PXE
and/or flash boot and a barebones motherboard with lots of memory,
really. HP sells them, Neoware sells them, soon evvybody sell them or
you can easily and even more cheaply roll your own. Then boot them into
whatever you want or need. Today. Or this minute, this hour. With
internally virtualized environments you can even boot them into config
A, but run a complete virtualized B inside A for just the specific
applications it provides that you need.
Obviously this has TREMENDOUS impact on the future of one class of
cluster design, which is why I have the temerity to raise the issue
here. I can easily see a day approaching where an organization boots
all its enterprise workstations into Windows, but runs a canned,
completely free linux cluster node as a windows task under e.g. vmware.
Desktop users can do whatever they want -- nothing one can do with a web
browser, email client, or office suite uses more than a tiny bit of CPU
and a moderate block of memory -- and the organization has a built in,
absolutely "free" compute cluster with a large number of nodes for doing
all sorts of numerical or background work. Or invert it, and run linux
desktops (MUCH cheaper to install and manage) but run a virtualized
windows at a few selected seats that really need it. Virtualization can
provide the same kind of lockdown for the environment that now can be
had only with great expense.
Editorial over, but leaving behind the suggestion that thinking out a
demo of this sort would be very fashion-forward of you and in many ways
more exciting than "just" HPL on a flash cluster. Run it on a
>>virtual<< flash cluster.
To "win" the eternal Windows War, Linux doesn't need to engage in a
direct frontal assault on the desktop -- all it needs is a way of
putting the camel's nose into the tent. Virtualizing is capable of
letting the whole damned camel in, but it doesn't LOOK like the camel
has come in as it lives in this adorable little cage on the tent's
primary desktop. Only as time passes will it become apparent that the
camel is actually the thing in the tent and it is the old desktop that
is inside the cage and shrinking, shrinking, shrinking, until it either
is gone or changes in very fundamental ways to survive.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu