[Beowulf] non-stop computing

Skylar Thompson skylar.thompson at gmail.com
Tue Oct 25 15:56:43 PDT 2016


Assuming you can contain a run on a single node, you could use
containers and the freezer controller (plus maybe LVM snapshots) to do
checkpoint/restart.
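
As a rough illustration of the freezer part only, here is a minimal sketch,
assuming a cgroup v1 freezer hierarchy mounted at /sys/fs/cgroup/freezer and
a cgroup directory that already exists. The group name "magma_run" and the
helper names are hypothetical. It only pauses and resumes the processes;
saving their state to disk (and snapshotting the filesystem) would still
need a separate step.

```python
import os
import time

# Assumption: cgroup v1 freezer controller mounted at the usual location.
FREEZER_ROOT = "/sys/fs/cgroup/freezer"


def _write(path, value):
    with open(path, "w") as f:
        f.write(value)


def add_pid(group, pid, root=FREEZER_ROOT):
    """Move a process into the (already created) freezer cgroup."""
    _write(os.path.join(root, group, "cgroup.procs"), str(pid))


def get_state(group, root=FREEZER_ROOT):
    """Return the current freezer.state: THAWED, FREEZING or FROZEN."""
    with open(os.path.join(root, group, "freezer.state")) as f:
        return f.read().strip()


def set_state(group, state, root=FREEZER_ROOT):
    """Request a state change; only FROZEN and THAWED may be written."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    _write(os.path.join(root, group, "freezer.state"), state)


def freeze(group, root=FREEZER_ROOT, timeout=10.0):
    """Freeze every task in the group and wait until the kernel reports FROZEN."""
    set_state(group, "FROZEN", root)
    deadline = time.monotonic() + timeout
    while get_state(group, root) != "FROZEN":
        if time.monotonic() > deadline:
            raise TimeoutError("cgroup did not reach FROZEN in time")
        time.sleep(0.1)


def thaw(group, root=FREEZER_ROOT):
    """Resume every task in the group."""
    set_state(group, "THAWED", root)
```

Usage would be something like add_pid("magma_run", pid) once at job start,
then freeze("magma_run") before taking a snapshot and thaw("magma_run")
afterwards. This needs root (or delegated cgroup permissions) on the node.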

Skylar

On 10/25/2016 11:24 AM, Michael Di Domenico wrote:
> here's an interesting thought exercise and a real problem i have to tackle.
> 
> i have researchers who want to run magma codes for three weeks or
> so at a time.  the process is unfortunately sequential in nature and
> magma doesn't support checkpointing (as far as i know, and i don't
> know much about magma)
> 
> so the question is:
> 
> what kind of a system could one design/buy using any combination of
> hardware/software that would guarantee that this program would run for
> 3 wks or so and not fail
> 
> and by "fail" i mean from some system type error, i.e. memory faulted,
> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
> bug in magma", which has already bitten us a few times
> 
> there's probably some commercial or "unreleased" commercial product on
> the market that might fill this need, but i'm also looking for
> something "creative" as well
> 
> three weeks isn't a big stretch compared to some of the other codes
> i've heard of around the DOE that run for months, but it's still pretty
> painful to have a run go for three weeks and then fail 2.5 weeks in
> and have to restart.  most modern day hardware would probably support
> this without issue, but i'm looking for more of a guarantee than a
> prayer
> 
> double bonus points for anything that runs at high clock speeds (>3 GHz)
> 
> any thoughts?


