Robert G. Brown
rgb at phy.duke.edu
Mon Jan 6 08:27:43 PST 2003
On Mon, 6 Jan 2003, John Burton wrote:
> > BTW, has anyone bothered to calculate all the wasted cycles
> > used up by check-point files? :^).
> Yup, and it is significantly less than the number of cycles that would
> be wasted having to rerun 24 hours worth of processing because a machine
> hiccuped and the process died...
Or to emphasize the point even more strongly, it is easy enough to
estimate a priori and determine empirically when it makes sense to
checkpoint a process and when it makes sense not to, given that it may
be moderately difficult to accomplish.
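One standard back-of-the-envelope tool for exactly that a priori estimate is Young's first-order approximation for the checkpoint interval that minimizes expected lost work. A minimal sketch (the checkpoint cost and cluster MTBF below are made-up illustrative numbers, not figures from this discussion):

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: the interval between checkpoints that
    minimizes expected lost work is t_opt = sqrt(2 * C * MTBF),
    where C is the cost of writing one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: writing a checkpoint costs 30 minutes of wall
# time, and the cluster as a whole fails about once every 10 days.
cost = 30 * 60           # seconds per checkpoint
mtbf = 10 * 24 * 3600    # seconds between cluster-wide failures
print(optimal_checkpoint_interval(cost, mtbf) / 3600.0)  # ~15.5 hours
```

The point of the formula is the square root: as failures get more frequent (smaller MTBF) or checkpoints get cheaper, the optimal interval shrinks, which is exactly the tradeoff being weighed here.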
Don's point about MPI jobs should not be taken lightly. Suppose one is
running a tightly coupled job (one where all the nodes advance
"together" and where failure of any node and the state data it contains
implies failure of the overall job) that will take one month to complete
on 100 nodes. Let us further suppose that (not unreasonably) the
probability that at least one node will "fail" and require at least a
restart during that month is essentially unity.
The time required to complete the project without checkpointing is
basically infinity. The time required to complete the project with a
checkpoint generated once a day, at the cost of 1/30'th of a day's work
(close to an hour!) is likely to be about 31 (best of all worlds, no
failure) and maybe 35 (1-3 failures) days, depending on the number of
actual failures that occur and how rapidly you are able to repair the
downed node(s) and restart the job.
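The arithmetic behind those 31-to-35-day figures can be sketched directly. This assumes (as above) a daily checkpoint costing 1/30 of a day, and that each failure costs on average half a checkpoint interval of rolled-back work plus some repair time; the half-day repair figure is a hypothetical placeholder:

```python
def expected_days(work_days=30.0,
                  checkpoint_overhead=1.0 / 30.0,  # fraction of a day spent checkpointing
                  failures=0,
                  repair_days=0.5):
    """Rough completion-time estimate for a checkpointed run.
    Each failure loses, on average, half a checkpoint interval
    (0.5 day of work) plus the time to repair and restart."""
    base = work_days * (1.0 + checkpoint_overhead)  # ~31 days with no failures
    per_failure = 0.5 + repair_days                 # mean rollback + repair
    return base + failures * per_failure

print(expected_days(failures=0))  # best of all worlds: ~31 days
print(expected_days(failures=3))  # three failures: ~34 days
```

Without checkpointing, the same model gives "restart the whole month from scratch on every failure," which with near-certain failure never converges at all.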
BIG difference between 35 days and infinity...hmmmm
This, BTW, is one of the reasons that there are relatively few WinXX
clusters out there. At least some implementations and installations of
WinXX (where XX is nearly any flavor you like) have reportedly had
reliable uptimes on the order of a day under heavy load. If true, one
would damn near have to checkpoint every fifteen minutes to get through
the aforementioned computation at all and it would take a year. Even a
single failure per day per 100 nodes is enough to significantly affect
time of completion of synchronous tasks.

Without checkpointing (and a lot of folks do run without it as it IS
often a PITA to implement) one is basically gambling that one's cluster
will stay up through a computation cycle, and one sets one's
computational cycle accordingly, making it a form of "checkpointing".
Experience and arithmetic rapidly teach one when this is a good bet --
and when it is not. The first time you run for a month, only to have a
node (and the entire computation!) crash a few hours before completion,
when you were COUNTING on the results to complete the paper you're
presenting at a conference the following week, the work of checkpointing
may not seem like so very much after all...;-)
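The gamble can be quantified: for an uncheckpointed tightly coupled job, every node must stay up for every day of the run. A minimal sketch assuming independent failures, with a hypothetical per-node daily failure probability:

```python
def survival_probability(p_node_fail_per_day, nodes, days):
    """Chance an uncheckpointed tightly coupled run finishes:
    all nodes must survive all days (independence assumed)."""
    p_node_up = 1.0 - p_node_fail_per_day
    return p_node_up ** (nodes * days)

# Hypothetical: each node has a 1-in-1000 chance of dying on any
# given day -- quite reliable hardware.  Over 100 nodes x 30 days:
print(survival_probability(0.001, nodes=100, days=30))  # ~0.05
```

Even with very reliable individual nodes, the month-long 100-node run finishes only about one time in twenty, which is the sense in which the unchecked run's expected completion time is "basically infinity."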
Last remark: Randy, you very definitely should take the time to skim
through the list archives, a book or two on parallel computing and
beowulfery in general, and maybe the howtos or FAQs before making hard
pronouncements on what does and doesn't make sense in cluster computing.
This is for a variety of reasons, and you should learn them. This is
not intended as a flame, just as a suggestion. Note the following Great
Truths:
a) nearly anything simple has been discussed a dozen times in full
detail and is in the list archives not once but a dozen times.
b) a great deal that is very complex and involved indeed has ALSO been
discussed a dozen times in full detail ditto. This list has been around
for what, eight years now (Don?) and good archives go back for at least
three or four.
c) your mileage may vary; your task is unique; the only good benchmark
is your own application; there is no substitute for careful thought
about the parallelization process; there is no simple one-size-fits-all
recipe for building a successful cluster (defined as one that does YOUR
work acceptably well with something approximating a linear speedup in
the number of nodes). These are all "list adages" that in summary mean
that any simple rule for cluster computing is probably "wrong" -- right
for THIS case, but wrong for THAT case -- which is why most of the list
experts will carefully qualify their answers rather than make sweeping
generalizations.
d) In that spirit, parallel computing isn't "like" serial computing in
too many ways. It is a deep and complex subject, and it is well worth
your while to (as suggested) read some books by Real Computer Scientists
on the subject. It's a case where there are often simple, obvious --
and wrong -- implementations of nearly any important numerical task in
parallel form. It's also the case that if you don't understand Amdahl's
Law (or know what it is!) and the related improved estimates associated
with parallel scaling, if you don't know what superlinear speedup is, if
you don't understand how and why both network latency and bandwidth are
important to parallel task completion -- some of the nitty-gritty
associated with homemade parallel computers to accomplish particular
tasks -- you'll end up wasting a lot of the list's time having these
ideas explained to you when you could just as easily read about them and
learn about them yourself.
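For the curious, Amdahl's Law itself is a one-liner: if a fraction s of the work is irreducibly serial, the speedup on N nodes is S(N) = 1 / (s + (1 - s)/N). A quick sketch (the 5% serial fraction is an arbitrary example):

```python
def amdahl_speedup(serial_fraction, nodes):
    """Amdahl's Law: speedup on N nodes when a fraction s of the
    work cannot be parallelized.  S(N) = 1 / (s + (1 - s)/N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# Even 5% serial work caps the speedup far below the node count:
print(amdahl_speedup(0.05, 10))     # ~6.9x on 10 nodes
print(amdahl_speedup(0.05, 100))    # ~16.8x on 100 nodes
print(amdahl_speedup(0.05, 10**6))  # asymptote: just under 20x (1/s)
```

This is why throwing more nodes at a task with any appreciable serial (or communication-bound) fraction yields rapidly diminishing returns, and why latency and bandwidth, which effectively add to that serial fraction, matter so much.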
One of many possible starting places (one that is free, in any event:-)
is http://www.phy.duke.edu/brahma. In particular, check out the online
book. I'm not a Real computer scientist, just a Sears computer
scientist, and I do need to update this book in the light of all sorts
of recent list discussions and wisdom as well as finish it off in terms
of planned topics, but even as it is it will make a lot of this stuff
clear to you. There are also links to other resources including (IIRC)
an online book on parallel algorithms.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu