Beowulf Questions
John Burton j.c.burton at gats-inc.com
Mon Jan 6 07:36:27 PST 2003
Randall Jouett wrote:
> On Sat, 2003-01-04 at 11:58, Donald Becker wrote:
>
>>Not at all! MPI does not handle faults. Most MPI applications just
>>fail when a node fails. A few periodically write checkpoint files, and
>>a subset ;-) of those can be re-run from the last checkpoint.
>
> Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
> when it comes to parallel processing and the beowulf architecture,
> but computin is computin, and I think I might have a "el-cheapo,
> ham-operator solution." (Hams are infamous for being TOTAL cheapskates
> :^).

Ummm...'all "computin" ain't equal'. While checkpoint files might not be useful for what you do, they save thousands of machine and man-hours in my business. We have gigabytes of raw data from satellites being recorded per day. Processing a day's worth of data requires 2 days on a 2.5 GHz P-4. So, divide the data into orbits and process the orbits in parallel. The mathematical model is such that fine-grained parallel processing is not practical at this time (massive redesign and the scientists don't understand parallel). If a process dies, then we can go back to the logs and correct the problem and restart from the last checkpoint (which was a minute or so ago) instead of starting over at the beginning, which could be as much as 24 hours ago...

> Off the top of my head, why couldn't you just plug in an old
> 10Base-T card to each node. Add a server node that specifically
> polls each machine via hardware latch and software response.
> Just a quick, "Hey, I'm still here." This fault server would
> then send the root/head node a quick "we're running, boss!"
> message, or it would tell the root/head node that a particular
> machine was down. If the root machine sees a fault message,
> it parses the packet, ignores the broken node, then reschedules
> the task for execution. It could also send an e-mail to the
> sysadmin, page him, and even play a "RED ALERT!" sample from
> Trek :^).
Apparently you are not current on cluster technology, or you wouldn't be proposing something that is common knowledge.

> Now, if you REALLY wanted to be cheap :^), you could do something
> like this with a USB hub, although I'm pretty sure it wouldn't
> be as fast as the 10Base-T setup. OTOH, 10Base-T gear (e.g. hub,
> switch, NICs) can probably be had for the asking at most
> institutions, I'd imagine.

10Base-T is too slow for a typical parallel application. Switched 100Base-T is almost as inexpensive.

> BTW, has anyone bothered to calculate all the wasted cycles
> used up by check-point files? :^).

Yup, and it is significantly less than the number of cycles that would be wasted having to rerun 24 hours' worth of processing because a machine hiccuped and the process died...

> Randall
> --
> Randall Jouett
> Amateur Radio: AB5NI
>
> P.S.
>
> The model I mentioned does have its flaws, of course, such
> as a switch or hub going down, or maybe a busted CAT-5 cable
> here or there. Something tells me, though, that it HAS to be
> infinitely superior to check-point files and the like :^).
> That is, if I'm understanding your meaning here of check-point
> files. If I'm off base here, Donald, maybe you could clarify?

In my world a check point file is a "snapshot" of the state of a running process at a given time. This "snapshot" is complete enough to restart the process at that point should it fail at a later point.

John
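
For readers new to the idea, the "snapshot and restart" pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the satellite pipeline: the function names (`save_checkpoint`, `load_checkpoint`, `process_orbit`), the JSON file format, the running sum standing in for the real per-record science, and the `fail_at` crash simulation are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # rename over the old checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0.0}


def process_orbit(records, path, every=100, fail_at=None):
    """Process records in order, checkpointing after every `every` items.

    `fail_at` is a hypothetical knob that simulates a node dying at that index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += records[i]  # stand-in for the real per-record work
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a run dies partway through, calling `process_orbit` again with the same checkpoint path picks up from the last saved index rather than from record zero, so the lost work is bounded by the checkpoint interval. Choosing `every` is the trade-off raised in the thread: more frequent snapshots cost more I/O but lose less work on a failure.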
