[Beowulf] non-stop computing

Prentice Bisbal pbisbal at pppl.gov
Wed Oct 26 06:54:13 PDT 2016


I would be laughing if this wasn't so true.

The sad thing is, the person who took on this convoluted, BS-heavy 
approach would probably get promoted for managing a "large, complicated 
project with many moving parts" while the guy who took Gavin's approach 
would continue to toil away in his basement office for quietly getting 
the job done quickly and saving money.

Prentice

On 10/25/2016 11:45 PM, John Hanks wrote:
> We routinely run jobs that last for months; some are codes that have 
> an endpoint, others are processes that provide some service (SOLR, 
> ElasticSearch, etc.) which have no defined endpoint. Unless you 
> have some seriously flaky hardware or ongoing power/cooling issues 
> there is nothing special needed to get high uptimes. I'd suggest 
> making NFS mounts hard, so processes can recover from an NFS server 
> reboot. Another good idea is to start several copies of an important 
> run on several different nodes, preferably in different racks/PDUs/UPS.
>
> The frame of your question as a thought exercise does open up the 
> possibility for commentary though, challenge accepted.
>
> A question like this will pique the interest of anyone seeking to 
> justify their existence through the application of BS, myself 
> included. To a project manager, this problem you have is a veritable 
> goldmine of opportunity. To a vendor, you have opened up Pandora's box 
> and are the potential sale that will allow them to buy their kid the 
> GI Joe with the kung-fu grip this Christmas.  No solution is too 
> costly or too complex to apply to this challenge, and the planning and 
> execution must be detailed, require committees, phone calls, copious 
> emails, a procurement process, budgetary analysis, more training, most 
> certainly additional staffing....
>
> Fortunately xkcd has a cartoon for this: http://xkcd.com/1445/
>
> You should take this opportunity to perform an experiment for us. 
> You'll need two groups, first the experimental group. Contact a local 
> project manager (just follow the trail of 6 sigma motivational 
> posters) and explain your problem. Be detailed, spend some time 
> googling up as many buzzwords as possible for your explanation. Look 
> earnest and sincere as you present your case and drop hints about how 
> you too may be interested in some of that "wonderful sigma training". 
> Start a timer.
>
> Now, the control group. When you get back to your desk go drag out 
> some of those old workstations that were destined for surplus. Put 
> them on some scrounged-up, half-dead desk-side UPS and start a half 
> dozen copies of your code. Start a second timer, then go back to your 
> normal duties.
>
> In a few months report back which approach produced the best results 
> as measured by time-to-run-completion and cost in dollars per 
> completed run. Don't forget to have the project manager provide a 
> detailed time record of hours spent by all the people they involve in 
> the process.
>
> I look forward to the results.
>
> jbh
>
> On Wed, Oct 26, 2016 at 1:57 AM Skylar Thompson 
> <skylar.thompson at gmail.com> wrote:
>
>     Assuming you can contain a run on a single node, you could use
>     containers and the freezer controller (plus maybe LVM snapshots) to do
>     checkpoint/restart.
>
>     Skylar
>
>     On 10/25/2016 11:24 AM, Michael Di Domenico wrote:
>     > here's an interesting thought exercise and a real problem i have
>     > to tackle.
>     >
>     > i have researchers that want to run magma codes for three weeks or
>     > so at a time.  the process is unfortunately sequential in nature and
>     > magma doesn't support check pointing (as far as i know) and (I don't
>     > know much about magma)
>     >
>     > So the question is:
>     >
>     > what kind of a system could one design/buy using any combination of
>     > hardware/software that would guarantee that this program would run for
>     > 3 wks or so and not fail
>     >
>     > and by "fail" i mean from some system type error, ie memory faulted,
>     > cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
>     > bug in magma" which already bit us a few times
>     >
>     > there's probably some commercial or "unreleased" commercial product on
>     > the market that might fill this need, but i'm also looking for
>     > something "creative" as well
>     >
>     > three weeks isn't a big stretch compared to some of the other codes
>     > i've heard around the DOE that run for months, but it's still pretty
>     > painful to have a run go for three weeks and then fail 2.5 weeks in
>     > and have to restart.  most modern day hardware would probably support
>     > this without issue, but i'm looking for more of a guarantee than a
>     > prayer
>     >
>     > double bonus points for anything that runs at high clock speeds >3GHz
>     >
>     > any thoughts?
>     > _______________________________________________
>     > Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>     > To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
>     >
>
> -- 
> ‘[A] talent for following the ways of yesterday, is not sufficient to 
> improve the world of today.’
>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>