Beowulf Questions
Randall Jouett rules at bellsouth.net
Mon Jan 6 08:31:42 PST 2003
Hello again, Donald/gang.

On Sun, 2003-01-05 at 15:13, Donald Becker wrote:
> On 5 Jan 2003, Randall Jouett wrote:
> > On Sat, 2003-01-04 at 11:58, Donald Becker wrote:
> > >
> > > Not at all! MPI does not handle faults. Most MPI applications just
> > > fail when a node fails. A few periodically write checkpoint files, and
> > > a subset ;-) of those can be re-run from the last checkpoint.
> >
> > Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
>
> Application-specific checkpoint files are sometimes the only effective
> way to handle node crashes.
>
> > Off the top of my head, why couldn't you just plug in an old
> > 10Base-T card to each node. Add a server node that specifically
>
> The problem isn't just detecting that a node has failed (which is either
> trivial or impossible, depending on your criteria), the problems are
>   - handling a system failure during a multiple-day run.
>   - handling partially completed work issued to a node
>   - killing processes/nodes that you think have failed, lest they
>     complete their work later.

Ah. Ok. I understand now. Thanks for the info.

> > BTW, has anyone bothered to calculate all the wasted cycles
> > used up by check-point files? :^).
>
> Checkpointing is very expensive, and most of the time the checkpoint
> isn't used. This is why only application-specific checkpointing makes
> sense: the application writer knows which information is critical, and
> when everything is consistent. Machines that save the entire memory
> space have been known to take the better part of an hour to roll out a
> large job.

An hour? Dang.

> > > Although the program completes the rendering, there is still much
> > > ugliness when a partially-failed MPI program tries to finish.
> >
> > Hmmm. Why aren't folks flagging the node as dead and ignoring
> > any other output until the node is back up and saying it's
> > ready to run. This would have to be verified by the sysadmin,
> > of course.
>
> The issue is the internal structure of the MPI implementation: there is
> no way to say "I'm exiting successfully even though I know some processes
> might be dead." Instead what happens is that the library call waits
> around for the dead children to return.

I take it that you're talking about a compute node when you're
saying all of this, and I'm also reading processes here as
"the other nodes." Remember, I'm a neophyte, Donald :^).

Anywho, I was thinking that the lib call was written in an
asynchronous fashion, with various flags being set on the root
node when a compute node completed its computation. Also, the
only way the root would continue on with the application is
when all nodes sent a response saying that they're done.

I also don't see why you couldn't make a few test runs, average
out the response time of each node and the overall process
(if necessary), and stick this info into a database for a given
app on the root node. (Off the top of my head, of course. I'll
have to think about this a while.) To me, this seems like you'd
be adding a certain level of fault tolerance at the software level.

Now, if we set up a response-time window for an individual compute
node and the root node thinks that it has fallen out of the window,
then it seems to me that the root node could flag the node as
having temporary problems, and then it could shift that node's work
over to the first node that has completed its calculations/processing.
Should the problematic node straighten itself out and start
responding again -- let's say it finished its processing -- then
the data is taken from that node verbatim and stored off to disk.
That is, if the recovery node didn't finish its work already,
of course. You'd also have to tell the original node that
straightened itself out "Never mind," of course. (Said with
Lily Tomlin intonations :^) ).

> This brings us back to the liveness test for compute nodes. When do we
> decide that a node has failed? If it doesn't respond for a second? A
> transient Ethernet failure might take up to three seconds to restart
> link (typical is 10 msec.) Thirty seconds?

If it gets out of its window, you set things up so that the
first node to complete its computations takes over its work load.
If the node that's acting up straightens itself out, great. Then
you just kill the request for node recovery and things should just
"keep on trucking." (Wow. That last remark is really showing my
age :^)). At this point, I guess you'd also want to increase the
size of the window on the node that's acting up, too.

Also, if the node doesn't respond within twice the window size,
I guess you could display a message on the console, remove the
node from computations, and let the sysadmin take a look at the
machine to see if anything is awry with the node or the network.
More than likely, hiccups would involve latency, possibly due to
fragmentation or the like. God forbid a memory leak :^).

If you really wanted to get spiffy, I guess you could work
a neural net into the system, having it monitor network traffic
and such. A setup like this might be able to warn you if
glitches were getting ready to rear their ugly heads. :^)

> A machine running a 2.4 kernel before 2.4.17 might take minutes to
> respond when recovering from a temporary memory shortage, but run just
> fine later.

In the model I just described, I don't think this would be
a problem. (Shrug.) I still have to think about all of this,
of course. The one thing I really like about a model like
this is that it would be asynchronous, and you could get away
with simplistic levels of message passing. Just open a socket,
read packets, and write packets.

BTW, I have read and understood everything you've said, Donald,
and I thank you wholeheartedly for the explanation. The way I
responded, though, you'd think that I already knew what I was
talking about. Without any doubts -- I don't! :^). I wrote my
response this way, though, so that you and others can straighten
me out if I'm looking at parallel processing in a bass-ackwards
fashion. If not, then maybe the model I described is exactly how
things work already. And if it isn't, then maybe it's worth
looking into further. (Shrug.)

Ok. I've been up all night with the flu. I need to try to
get some sleep, and answer the rest of my Beowulf e-mails
later on tonight, if I'm feeling better.

73 (Best Regards in Morse code...ham lingo :^),

Randall

--
Randall Jouett
Amateur Radio: AB5NI
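
For anyone who wants to poke at the response-time-window idea from the message above, here is a minimal sketch of the root node's bookkeeping in Python. It is only an illustration under stated assumptions, not how MPI or any existing cluster software works. Every name in it (Root, Node, State, GROW, assign, report, tick) is hypothetical, the 1.5 factor for widening a window is an arbitrary pick, and there is no networking at all: times are passed in as plain numbers so the logic can be followed by hand.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"        # finished its work, free to take over someone else's
    BUSY = "busy"        # working on a task, still inside its window
    SUSPECT = "suspect"  # fell out of its response-time window
    REMOVED = "removed"  # silent for twice its window; sysadmin must check


@dataclass
class Node:
    name: str
    window: float               # expected response time (s), from test runs
    state: State = State.IDLE
    task: Optional[str] = None
    started: float = 0.0


class Root:
    """Hypothetical root-node bookkeeping; no real sockets involved."""

    GROW = 1.5  # assumed factor for widening a flaky node's window

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.results = {}   # task -> (node name, data)
        self.helpers = {}   # task -> name of the node redoing it

    def assign(self, name, task, now):
        node = self.nodes[name]
        node.state, node.task, node.started = State.BUSY, task, now

    def report(self, name, data):
        """A node delivers a result. Returns True if the result was used."""
        node = self.nodes[name]
        if node.state is State.REMOVED:
            return False                  # ignore output from removed nodes
        if node.state is State.SUSPECT:
            node.window *= self.GROW      # it came back: allow more slack
        task, node.task, node.state = node.task, None, State.IDLE
        if task is None or task in self.results:
            return False                  # someone else already delivered
        self.results[task] = (name, data)
        # Tell every other node still on this task "never mind".
        for other in self.nodes.values():
            if other.task == task:
                other.task = None
                if other.state is State.BUSY:
                    other.state = State.IDLE
        self.helpers.pop(task, None)
        return True

    def tick(self, now):
        """Periodic liveness check run on the root node."""
        for node in self.nodes.values():
            late = now - node.started
            if node.state is State.BUSY and late > node.window:
                node.state = State.SUSPECT
            if node.state is State.SUSPECT and late > 2 * node.window:
                node.state = State.REMOVED
                print(f"{node.name}: no response, removed; please check it")
        # Hand each orphaned task to the first idle node, if there is one.
        for node in list(self.nodes.values()):
            orphaned = node.state in (State.SUSPECT, State.REMOVED)
            if orphaned and node.task and node.task not in self.helpers:
                helper = next((n for n in self.nodes.values()
                               if n.state is State.IDLE), None)
                if helper is not None:
                    self.helpers[node.task] = helper.name
                    self.assign(helper.name, node.task, now)
```

In use, the root would call assign() when it hands out work, report() whenever a result arrives, and tick() on a timer. A node that misses its window only loses its exclusive claim on the task; whichever copy of the work comes back first is kept and the other is dropped. The sketch does not touch the hard part Donald raises, namely actually stopping a process on a node that can no longer be reached.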
