[Beowulf] Checkpointing using flash
Ellis H. Wilson III
ellis at cse.psu.edu
Tue Sep 25 05:19:49 PDT 2012
On 09/24/2012 12:57 PM, Andrew Holway wrote:
>> Haha, I doubt it -- probably the opposite in terms of development cost.
>> Which is why I question the original statement on the grounds that
>> "cost" isn't well defined. Maybe the costs just performance-wise, but
>> that's not even clear to me when we consider things at huge scales.
> 40 years ago an army of cheap software developers were needed to
> service a single very expensive box. Now the boxes are super cheap and
> the price for decent software developers is very high.
40 years ago the demand for this type of job was...what? Incredibly
limited, I'd bet, if not a downright niche (supercomputing,
defense-related calculations, business apps, maybe a handful of other
purposes). And the boxes aren't super cheap because things have been
"solved" in hardware rather than software -- the fabs for modern
processors are much, much more expensive than they used to be, but the
laws of sales at scale, if you will, kick in to make things cheap since
so many want PCs.
> With hardware, you just have to solve the problem once. With this
I am totally unconvinced about this...if I solve something in software,
don't I only need to solve it once as well, then open-source my code and
share it? While I agree certain things are downright destined for
hardware (computer vision problems, arithmetic, etc.), it is completely
unclear to me that something as unsolved and as high-level as parallel
programming for exascale computing should even be attempted in
hardware. What are you expecting the developers to write then, if they
cannot understand parallel programming? Serial codes?
Good luck finding or writing a compiler (also software) that will turn a
serial code into a parallel code perfectly. That's many decades down
> Checkpointing to some kind of non volatile disk might work for some
> codes but its not a universal solution. Some MPI tricks might work for
Uhh...I think it's the opposite. We've been discussing checkpointing in
this thread as a general solution that almost always works (you're
literally snapshotting your memory, and I cannot think of an instance
where that would not work), but it's not a solution that we'd like to
continue using for most of our codes in the future. It's just inefficient.
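
To make "snapshotting your memory" concrete, here is a minimal
application-level sketch in Python. It is only an illustration under
stated assumptions: the whole program state fits in one picklable dict,
a single process owns it, and the names save_checkpoint,
load_checkpoint and run are hypothetical helpers made up for this
example, not part of any MPI or checkpointing library. Real
system-level checkpointing of a parallel job also has to coordinate
ranks and in-flight messages, which this does not attempt.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to `path` atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to stable storage
        os.replace(tmp_path, path)  # a crash never leaves a half-written file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, path, checkpoint_every=10, fail_at=None):
    """Toy 'simulation': sum the step indices, checkpointing periodically."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

If run() dies partway through, calling it again with the same path
resumes from the last saved step and redoes only the work since then.
The inefficiency shows up in save_checkpoint: every snapshot writes the
entire state, and that cost grows with memory footprint and node count.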
> another code. What about QCD codes that are almost completely I/O
> bound....I cant wrap my head around how either solution would work in
> that circumstance but then again I am not a computer scientist and
> have a moderately weak grasp on the mechanics.
What does being I/O-bound or CPU-bound have to do with the correctness
of a checkpoint? Do you mean data continues to be streamed in real time,
like from a collider, so we have to deal with that during the
checkpoint? Or are you referring to something else entirely?
> Its easy to underestimate the golden rule of HPC! "Never underestimate
> the crappyness of the code!". It is our task to provide a safe an
> elegant playground for our users so that this crappyness matters a bit
> less :)
On a related note (I assume a majority of your users are scientists),
regarding your or somebody else's post a bit back about how poor
scientists are at coding: I've witnessed the exact opposite. Now,
this is going on limited experience and all, but when I interned at
Argonne National Laboratory near Chicago I saw some absolutely amazing
code written by people without a computer science background that ran
on what was then one of the top supers in the country (Intrepid). The
point is, they need to get their work done, and they know just how
painful and slow poor code will be. Moreover, their careers rest on the
premise that their calculations and resultant code are correct, and
they have deadlines like the rest of us, which means their code has to
finish running in time. My golden rule of HPC is
therefore quite the opposite: "Never underestimate the cleverness of
your users." Their code might do "weird" things, but it's simply
because your framework wasn't adaptive enough. I have supreme respect
for most of the "users" I've dealt with, but as I said before, this is
admittedly going on limited experience and I could be an exceptional case.