[Beowulf] New member, upgrading our existing Beowulf cluster
prentice at ias.edu
Tue Dec 8 10:52:28 PST 2009
Lux, Jim (337C) wrote:
> On 12/8/09 9:22 AM, "james bardin" <jbardin at bu.edu> wrote:
>> On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal <prentice at ias.edu> wrote:
>>> You'd hope that. Most of my current cluster users are scientific
>>> researchers in academia, not computer scientists. While some are
>>> extremely computer savvy, others have learned just enough about
>>> programming to do their calculations. Expecting the latter to write code
>>> with checkpointing is unrealistic, and working in academia, I can't
>>> force them to. Which is why taking down 4 nodes instead of just one is
>>> less than ideal.
>> I find it's still advantageous to push them to learn it. A researcher
>> working with a tight deadline for a grant will often see the light
>> when a hardware failure loses them a month or more of data processing.
>> It really is in their own best interests to learn about their tools.
> What about some form of "image checkpoint" like "hibernation"... Should be
> application unaware, just snapshots memory.
That's fine when the problem is on one system and there's only one
system image to worry about checkpointing. Once you start spreading the
job around to multiple systems, things get complicated, especially if
your nodes are heterogeneous w.r.t. hardware.
I fear we're straying off the topic of the original post...