updating the Linux kernel
Kathy Haigh Hutchinson K.Haigh-Hutchinson at Bradford.ac.uk
Mon Jun 12 08:35:15 PDT 2000
- Previous message: updating the Linux kernel
- Next message: updating the Linux kernel
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 12 Jun 2000, David Lombard wrote:

> Crutcher Dunnavant wrote:
> >
> > Now, I might completely miss something here, but shouldn't all *distributed*
> > parallel programs assume that a node may not return? After all, what do you
> > assume about hardware failures?
...
> How can you tell the difference between 'never return' and 'take a very long
> time to return'?

Assuming you have done that, you could have the master periodically check whether the whole machine is still there. But if a node has failed, what do you do about it?

A message passing parallel program often works by splitting an array over several machines, with each machine then processing its part of the array. When machines need data from their neighbours, the appropriate segment of data is passed between them. The bulk of the data remains on the nodes to which it was originally distributed. The intermediate results of the computation remain on those nodes until they are ready to report back to the master. If a node goes down it is generally catastrophic, requiring a restart of the program.

I do have a program which factorises large numbers; each node is independent, and it can cope with a failed node. But in a finite difference time domain problem, if a node crashes, all the intermediate results for a segment of the problem space are lost. A node failure requires restarting the entire problem.

How do I solve this? Do I save all results to a file at every program statement? Do I have the slaves send the results back to the master after every step in the calculation? Do I duplicate the entire program so that a copy of the data set can be picked up from another node, and thus effectively use only half of the nodes?

The way parallelism is done, and the way data is distributed, generally means that when a node crashes its data is lost. There is then no way the rest can carry on. Task based parallelism could recover; my factorisation is essentially task based. But parallelism is more often about data parallelism.
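To make the data-parallel case concrete, here is a minimal single-process sketch of the decomposition described above. It is only an illustration: the names `split` and `step` are hypothetical, the "nodes" are just list entries, the update is a simple three-point average rather than a real finite difference time domain kernel, and a failed node is modelled by replacing its segment with `None`. It assumes there are at least as many data points as nodes.

```python
# Hypothetical sketch: each "node" owns one contiguous segment of a 1D array.
# No real message passing happens here; neighbour access stands in for it.

def split(data, n_nodes):
    """Split data into n_nodes contiguous segments (sizes differ by at most 1)."""
    base, extra = divmod(len(data), n_nodes)
    segments, start = [], 0
    for rank in range(n_nodes):
        size = base + (1 if rank < extra else 0)
        segments.append(list(data[start:start + size]))
        start += size
    return segments


def step(segments):
    """Apply one three-point averaging update to every segment.

    A segment of None models a crashed node whose data is gone.
    """
    if any(seg is None for seg in segments):
        # The lost segment cannot be rebuilt from what the other nodes hold,
        # and its neighbours can no longer obtain their boundary values.
        raise RuntimeError("a segment is lost; the computation cannot continue")

    new_segments = []
    for rank, seg in enumerate(segments):
        # "Message passing": one boundary value from each neighbour.
        # At the ends of the array the edge value is simply repeated.
        left = segments[rank - 1][-1] if rank > 0 else seg[0]
        right = segments[rank + 1][0] if rank < len(segments) - 1 else seg[-1]
        padded = [left] + seg + [right]
        new_segments.append(
            [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
             for i in range(1, len(padded) - 1)]
        )
    return new_segments


if __name__ == "__main__":
    segments = split([float(x) for x in range(10)], 3)
    segments = step(segments)      # works while every segment is present
    segments[1] = None             # simulate node 1 crashing
    try:
        step(segments)
    except RuntimeError as err:
        print("restart needed:", err)
```

Only one value per neighbour crosses a segment boundary at each step, which is why the bulk of the data never leaves its node, and also why nothing held elsewhere can stand in for a lost segment.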
Preserving copies of the data every time something changes would take so much storage, and so much time to do the storing, that the benefits of distributing the data in the first place would be significantly reduced. Or am I missing something?

Kathy HH
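The trade-off raised above can be sketched by saving a copy only every so many steps instead of at every change. This is a toy model under stated assumptions, not a recommendation: `run_with_checkpoints`, `interval`, and `fail_at` are hypothetical names, the "checkpoint" is an in-memory deep copy standing in for a file on stable storage, and a crash is simulated once, just before step number `fail_at` would run.

```python
import copy


def run_with_checkpoints(state, n_steps, interval, update, fail_at=None):
    """Run n_steps of update(state), saving a copy every `interval` steps.

    Returns (final_state, steps_executed, checkpoints_written). The initial
    state counts as the first checkpoint.
    """
    checkpoint = (0, copy.deepcopy(state))  # (step number, saved state)
    checkpoints_written = 1
    steps_executed = 0
    step = 0
    while step < n_steps:
        if fail_at is not None and step == fail_at:
            fail_at = None  # simulate the crash only once
            # Roll back: work done since the last checkpoint is lost.
            step, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        state = update(state)
        step += 1
        steps_executed += 1
        if step % interval == 0:
            checkpoint = (step, copy.deepcopy(state))
            checkpoints_written += 1
    return state, steps_executed, checkpoints_written


if __name__ == "__main__":
    def add_one(values):
        return [v + 1 for v in values]

    for interval in (1, 4):
        final, executed, written = run_with_checkpoints(
            [0, 0], n_steps=10, interval=interval, update=add_one, fail_at=7
        )
        print(interval, final, executed, written)
```

With `interval=1` nothing is recomputed after the simulated crash, but a copy is written after every step; with `interval=4` far fewer copies are written, at the price of repeating the steps done since the last one. Whether any interval makes the copying cheap enough depends on the size of the data and the speed of the storage, which this sketch does not model.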
More information about the Beowulf mailing list
