[Beowulf] non-stop computing

John Hanks griznog at gmail.com
Tue Oct 25 20:45:14 PDT 2016


We routinely run jobs that last for months. Some are codes that have an
endpoint; others are processes that provide some service (SOLR,
ElasticSearch, etc.) and have no defined endpoint. Unless you have
some seriously flaky hardware or ongoing power/cooling issues, there is
nothing special needed to get high uptimes. I'd suggest making NFS mounts
hard, so processes block and then recover after an NFS server reboot
instead of getting I/O errors. Another good idea is to start several
copies of an important run on several different nodes, preferably in
different racks/PDUs/UPSes.
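
A rough sketch of the "several copies on different nodes" idea, in Python.
Everything named here is made up for illustration: the node names, the rack
labels, the inventory dict and the run command are assumptions, and it
assumes passwordless ssh to the nodes. It only picks nodes and builds the
ssh command lines; it does not run anything.

```python
from collections import defaultdict
from itertools import zip_longest


def pick_spread_nodes(inventory, copies):
    """Pick `copies` nodes, reusing a rack only after every rack is used.

    `inventory` is a hypothetical {node_name: rack_label} mapping.
    """
    by_rack = defaultdict(list)
    for node, rack in sorted(inventory.items()):
        by_rack[rack].append(node)

    # Take the first node of every rack, then the second of every rack, ...
    ordered = []
    for tier in zip_longest(*(by_rack[rack] for rack in sorted(by_rack))):
        ordered.extend(node for node in tier if node is not None)

    if copies > len(ordered):
        raise ValueError("not enough nodes for the requested number of copies")
    return ordered[:copies]


def launch_commands(nodes, command):
    """Build one ssh argv list per node (assumes passwordless ssh)."""
    return [["ssh", node, command] for node in nodes]


if __name__ == "__main__":
    # Hypothetical inventory; replace with your own nodes and racks.
    inventory = {"n01": "rackA", "n02": "rackA", "n03": "rackB", "n04": "rackC"}
    for argv in launch_commands(pick_spread_nodes(inventory, 3), "./run_magma.sh"):
        print(" ".join(argv))
```

The same grouping works for PDUs or UPSes if you label nodes by those
instead of by rack.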

The framing of your question as a thought exercise does open up the
possibility for commentary, though. Challenge accepted.

A question like this will pique the interest of anyone seeking to justify
their existence through the application of BS, myself included. To a
project manager, this problem you have is a veritable goldmine of
opportunity. To a vendor, you have opened up Pandora's box and are the
potential sale that will allow them to buy their kid the GI Joe with the
kung-fu grip this Christmas. No solution is too costly or too complex to
apply to this challenge, and the planning and execution must be detailed,
require committees, phone calls, copious emails, a procurement process,
budgetary analysis, more training, most certainly additional staffing....

Fortunately xkcd has a cartoon for this: http://xkcd.com/1445/

You should take this opportunity to perform an experiment for us. You'll
need two groups. First, the experimental group: contact a local project
manager (just follow the trail of Six Sigma motivational posters) and explain
your problem. Be detailed, spend some time googling up as many buzzwords as
possible for your explanation. Look earnest and sincere as you present your
case and drop hints about how you too may be interested in some of that
"wonderful sigma training". Start a timer.

Now, the control group. When you get back to your desk go drag out some of
those old workstations that were destined for surplus. Put them on some
scrounged-up, half-dead desk-side UPSes and start a half dozen copies of your
code. Start a second timer, then go back to your normal duties.

In a few months report back which approach produced the best results as
measured by time-to-run-completion and cost in dollars per completed run.
Don't forget to have the project manager provide a detailed time record of
hours spent by all the people they involve in the process.

I look forward to the results.

jbh

On Wed, Oct 26, 2016 at 1:57 AM Skylar Thompson <skylar.thompson at gmail.com>
wrote:

> Assuming you can contain a run on a single node, you could use
> containers and the freezer controller (plus maybe LVM snapshots) to do
> checkpoint/restart.
>
> Skylar
>
> On 10/25/2016 11:24 AM, Michael Di Domenico wrote:
> > here's an interesting thought exercise and a real problem i have to
> > tackle.
> >
> > i have researchers who want to run magma codes for three weeks or
> > so at a time.  the process is unfortunately sequential in nature and
> > magma doesn't support checkpointing (as far as i know, and i don't
> > know much about magma)
> >
> > So the question is;
> >
> > what kind of a system could one design/buy using any combination of
> > hardware/software that would guarantee that this program would run for
> > 3 wks or so and not fail
> >
> > and by "fail" i mean from some system type error, ie memory faulted,
> > cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
> > bug in magma" which already bit us a few times
> >
> > there's probably some commercial or "unreleased" commercial product on
> > the market that might fill this need, but i'm also looking for
> > something "creative" as well
> >
> > three weeks isn't a big stretch compared to some of the other codes
> > i've heard of around the DOE that run for months, but it's still pretty
> > painful to have a run go for three weeks and then fail 2.5 weeks in
> > and have to restart.  most modern day hardware would probably support
> > this without issue, but i'm looking for more of a guarantee than a
> > prayer
> >
> > double bonus points for anything that runs at high clock speeds >3Ghz
> >
> > any thoughts?
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> >
-- 
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC