updating the Linux kernel

Kathy Haigh Hutchinson K.Haigh-Hutchinson at Bradford.ac.uk
Mon Jun 12 08:35:15 PDT 2000


On Mon, 12 Jun 2000, David Lombard wrote:

> Crutcher Dunnavant wrote:
> > 
> > Now, I might completely miss something here, but shouldn't all *distributed*
> > parallel programs assume that a node may not return. After all, what do you
> > assume about hardware failures? ...
> 

How can you tell the difference between 'never return' and 'take a very
long time to return'?

Assuming you have done that, you could have the master periodically check
whether the whole machine was still there.
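
In practice all the master can really do is put a deadline on the reply.
Something along these lines, just as a sketch in MPI (the tag, rank and
timeout are invented, and most MPI implementations will kill the whole job
when a node dies anyway):

#include <mpi.h>

#define PING_TAG 99   /* invented tag for the heartbeat message */

/* Return 1 if 'rank' reports in before 'timeout' seconds, else 0. */
int worker_alive(int rank, double timeout)
{
    int flag = 0, dummy;
    MPI_Request req;
    double start = MPI_Wtime();

    /* Post a non-blocking receive for the worker's heartbeat. */
    MPI_Irecv(&dummy, 1, MPI_INT, rank, PING_TAG, MPI_COMM_WORLD, &req);

    /* Poll until it arrives or the deadline passes. */
    while (!flag && MPI_Wtime() - start < timeout)
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);

    if (!flag) {
        MPI_Cancel(&req);            /* give up on this worker */
        MPI_Request_free(&req);
    }
    return flag;
}

Even then, a verdict of 'dead' just means 'slower than my deadline', which
is the original problem restated.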


If a node has failed, what do you do about it?

A message-passing parallel program often works by splitting an array
over several machines, each machine then processing its part of the array.
When machines need data from their neighbours, the appropriate segment of
data is passed between them.
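
Concretely, the neighbour exchange usually looks something like this (a
minimal MPI sketch; the slab size N, the tag and the neighbour ranks are
placeholders, not taken from any particular code):

#include <mpi.h>

#define N 1000   /* invented size of the local slab */

/* Each node owns u[1..N] plus one ghost cell at each end. */
void exchange_ghost_cells(double u[N + 2], int left, int right)
{
    /* Send my first real cell left, receive my right ghost cell. */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send my last real cell right, receive my left ghost cell. */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Nodes at the ends of the chain would pass MPI_PROC_NULL as the missing
neighbour.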

The bulk of the data remains on the nodes to which it was originally
distributed. The intermediate results of the computation also stay on those
nodes until the nodes are ready to report back to the master.

If a node goes down it is generally catastrophic, requiring a restart of
the program.

I do have a program which factorises large numbers; each node is
independent, and it can cope with a failed node.
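
The reason it copes is that the ranks never need each other's data; each
one just works through its own slice of trial divisors, so a dead node only
loses work that can be handed out again. Roughly this shape, though not the
real program (the target number here is an invented example):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    unsigned long long n = 600851475143ULL;   /* invented target */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank r tries divisors r+2, r+2+size, r+2+2*size, ...  No rank
     * needs anything from any other, so losing one loses only its slice. */
    for (unsigned long long d = rank + 2; d * d <= n; d += size)
        if (n % d == 0)
            printf("rank %d: %llu divides %llu\n", rank, d, n);

    MPI_Finalize();
    return 0;
}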

But in a finite-difference time-domain problem, if a node crashes then all
the intermediate results for a segment of the problem space are lost. A
node failure requires restarting the entire problem.

How do I solve this?

Do I save all the results to a file at every program statement?
Do I have the slaves send the results back to the master after every step
in the calculation?
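
The first of those would look roughly like this, done every so often rather
than literally at every statement (sketch only; the file names, slab size
and interval are all invented):

#include <stdio.h>

#define N 1000                 /* invented size of the local slab  */
#define CHECKPOINT_EVERY 100   /* invented interval, in time steps */

/* Dump this rank's slab to its own file every CHECKPOINT_EVERY steps. */
void maybe_checkpoint(int rank, int step, const double u[N + 2])
{
    char name[64];
    FILE *f;

    if (step % CHECKPOINT_EVERY != 0)
        return;

    snprintf(name, sizeof name, "ckpt_rank%d.dat", rank);
    f = fopen(name, "wb");
    if (!f)
        return;                          /* a real code would report this */

    fwrite(&step, sizeof step, 1, f);    /* which step this snapshot is   */
    fwrite(u, sizeof u[0], N + 2, f);    /* the whole local slab          */
    fclose(f);
}

Even this only helps if the checkpoint files live somewhere a replacement
node can read them, and the write ought to go to a temporary file first so
a crash mid-write does not corrupt the last good copy.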

Do I duplicate the entire program so that a copy of the data set can be
picked up from another node, and thus effectively use only half of the nodes?
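
A sketch of that option, pairing the nodes up as buddies (the pairing rule
and slab size are invented; it doubles the memory and adds a full copy per
step):

#include <mpi.h>

#define N 1000   /* invented size of the local slab */

/* After each step, swap full slabs with a buddy so each node holds a
 * spare copy of its partner's data.  Pairing is rank 0<->1, 2<->3, ... */
void mirror_to_buddy(double mine[N + 2], double theirs[N + 2],
                     int rank, int size)
{
    int buddy = rank ^ 1;

    if (buddy >= size)   /* odd number of ranks: the last one has no buddy */
        return;

    MPI_Sendrecv(mine,   N + 2, MPI_DOUBLE, buddy, 1,
                 theirs, N + 2, MPI_DOUBLE, buddy, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

And something still has to notice the failure and restart the lost half of
the work on the surviving copy, which is the hard part.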

The way parallelism is done, and the way data is distributed, generally
mean that when a node crashes its data is lost with it. There is then no
way for the rest to carry on.

Task-based parallelism could recover; my factorisation is essentially task
based. But parallelism is more often data parallelism. Preserving copies of
the data every time something changes would cost so much, both in storage
and in the time spent doing the storing, that the benefits of distributing
the data in the first place would be significantly reduced. Or am I missing
something?

Kathy HH





