Beowulf Questions

Randall Jouett rules at bellsouth.net
Sat Jan 4 23:27:35 PST 2003


On Sat, 2003-01-04 at 11:58, Donald Becker wrote:
> 
> Not at all!  MPI does not handle faults.  Most MPI applications just
> fail when a node fails.  A few periodically write checkpoint files, and
> a subset ;-) of those can be re-run from the last checkpoint.

Checkpoint files? BLAH!!! :^) Admittedly, I'm a total neophyte
when it comes to parallel processing and the Beowulf architecture,
but computin is computin, and I think I might have an "el-cheapo,
ham-operator solution." (Hams are infamous for being TOTAL cheapskates
:^).


Off the top of my head, why couldn't you just plug an old
10Base-T card into each node? Add a server node that specifically
polls each machine via hardware latch and software response.
Just a quick, "Hey, I'm still here." This fault server would
then send the root/head node a quick "we're running, boss!"
message, or it would tell the root/head node that a particular
machine was down. If the root machine sees a fault message,
it parses the packet, ignores the broken node, then reschedules
the task for execution. It could also send an e-mail to the
sysadmin, page him, and even play a "RED ALERT!" sample from
Trek :^).
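
Here is a minimal sketch of the bookkeeping side of this idea, in Python.
It assumes each node's "I'm still here" shows up as a call to heartbeat();
the names FaultMonitor, heartbeat, down_nodes, and reschedule are made up
for illustration, and the hardware latch and the network transport are
left out entirely.

```python
import time


class FaultMonitor:
    """Tracks the last "I'm still here" seen from each node (hypothetical helper)."""

    def __init__(self, nodes, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before a node counts as down
        self.clock = clock          # injectable so the logic can be tested without sleeping
        now = clock()
        self.last_seen = {node: now for node in nodes}

    def heartbeat(self, node):
        """Record that `node` just answered a poll."""
        self.last_seen[node] = self.clock()

    def down_nodes(self):
        """Return the nodes that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(n for n, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def reschedule(assignments, down, live):
    """Move tasks off down nodes onto live ones, round-robin.

    `assignments` maps task name -> node name; a new dict is returned.
    """
    if not live:
        raise RuntimeError("no live nodes left to take over the work")
    down = set(down)
    moved = {}
    i = 0
    for task, node in assignments.items():
        if node in down:
            node = live[i % len(live)]
            i += 1
        moved[task] = node
    return moved
```

The fault server would call down_nodes() on each polling pass and hand the
result to the head node, which could then call something like reschedule()
before e-mailing or paging anybody.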


Now, if you REALLY wanted to be cheap :^), you could do something
like this with a USB hub, although I'm pretty sure it wouldn't
be as fast as the 10Base-T setup. OTOH, 10Base-T gear (e.g. hub,
switch, NICs) can probably be had for the asking at most
institutions, I'd imagine.


BTW, has anyone bothered to calculate all the cycles wasted
on checkpoint files? :^) I guess you could also implement
something like this in software, having the root node poll
each compute node every so often, but I'm pretty sure this
would be kinda "chatty" on the network and a waste of
bandwidth. I guess you could monitor network traffic via
tcpdump or something and set the polling interval to a
reasonable level, though. (Shrug.) Hey,
I'm not paid to do this, so I'm not going to get out the
calculator and strain the brain :^).
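
For anyone who does want to get out the calculator, a rough
back-of-the-envelope sketch in Python follows. It assumes one poll and one
reply per node per interval, each fitting in a minimum-size 64-byte
Ethernet frame, and it ignores preamble, inter-frame gap, and collisions.
The function names and the default values are assumptions for
illustration, not measurements.

```python
def poll_load_bits_per_sec(nodes, interval_s, frame_bytes=64):
    """Rough bits/second for one request and one reply per node per interval."""
    frames_per_sec = 2 * nodes / interval_s   # one poll plus one answer per node
    return frames_per_sec * frame_bytes * 8


def link_fraction(nodes, interval_s, link_bps=10_000_000, frame_bytes=64):
    """Fraction of a link (10 Mbit/s by default) that the polling would occupy."""
    return poll_load_bits_per_sec(nodes, interval_s, frame_bytes) / link_bps
```

Plugging in a node count and polling interval gives a first guess at how
"chatty" the software version would be; tcpdump on the real network would
still be the thing to trust.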

> With the POV-Ray port I used application specific knowledge and explicit
> code to re-issue the work and handle duplicate results.

Well, I like my "mainly hardware" version better :^p. :^) :^)

> You can use the same idea (but unique code) with other MPI
> applications that don't have side effects within the time step.

Kewl, and I'd imagine that time is everything in a beowulf
setup.

> Although the program completes the rendering, there is still much
> ugliness when a partially-failed MPI program tries to finish.

Hmmm. Why aren't folks flagging the node as dead and ignoring
any other output until the node is back up and saying it's
ready to run? This would have to be verified by the sysadmin,
of course.
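
A minimal sketch of what I mean, in Python. ResultCollector and its
methods are hypothetical names, not anything from MPI or the POV-Ray
port; the sketch only assumes the head node sees every result tagged with
the node and task it came from. It also drops duplicate results for a
task that was already re-issued and finished elsewhere.

```python
class ResultCollector:
    """Head-node bookkeeping that ignores output from nodes flagged dead."""

    def __init__(self):
        self.dead = set()     # nodes currently flagged as dead
        self.results = {}     # task name -> first accepted result

    def mark_dead(self, node):
        """Flag a node as dead; anything it sends from now on is ignored."""
        self.dead.add(node)

    def readmit(self, node):
        """Called only after the sysadmin has verified the node is healthy."""
        self.dead.discard(node)

    def accept(self, node, task, result):
        """Store a result unless it comes from a dead node or is a duplicate."""
        if node in self.dead:
            return False      # late output from a node we already wrote off
        if task in self.results:
            return False      # task already completed by another node
        self.results[task] = result
        return True
```

The point of readmit() being a separate call is that a node saying "I'm
back" is not enough by itself; somebody has to sign off on it first.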

Best Regards,

Randall
--
Randall Jouett
Amateur Radio: AB5NI

P.S.

The model I mentioned does have its flaws, of course, such
as a switch or hub going down, or maybe a busted CAT-5 cable
here or there. Something tells me, though, that it HAS to be
infinitely superior to checkpoint files and the like :^).
That is, if I'm understanding what you mean by checkpoint
files. If I'm off base here, Donald, maybe you could clarify?




