updating the Linux kernel
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
j at cnb.uam.es j at cnb.uam.esMon Jun 12 11:23:01 PDT 2000
- Previous message: updating the Linux kernel
- Next message: updating the Linux kernel
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
-----BEGIN PGP SIGNED MESSAGE----- Kathy Haigh Hutchinson <K.Haigh-Hutchinson at Bradford.ac.uk> wrote: > On Mon, 12 Jun 2000, David Lombard wrote: > > Task based parallelism could recover, my factorisation is essentially task > based. But parallelism is more often about data parallelism. Preserving > copies of the data every time something changes would be so much effort in > storage and in time to do the storing, that benefits of distributing the > data in the first place would be significantly reduced, or am I missing > something? > > Kathy HH > I guess the answer is "it depends". It depends on whether you have a copy of the data somewhere else and whether you may do without analyzing some data. Two examples: If the data is kept at a redundant site, you can always restart the job. An example might be a central repository that handles data to workers and wait for answers, or a distributed data set split into pieces replicated on different nodes. You might start with a configuration like data = union of sets 1..n store set i at node i AND node[ (i+1) % n] <AND node...> then you start your job allocating node[i] for set[i]. If node[i] fails, you can then lookup which other nodes have a copy and restart its part at one of those (a suboptimal situation since one node would do double work, but still feasible). Then you would code it as: for i = 1 to n do { * node[i] = allocate_one_node(); * send_data(node[i]); * fd_set(socket[i], myset); } set_timeout(long_enough); do { done = YEPP; select(n, myset, NULL, NULL, long_enough); for i = 1 to n do { if ( ! FD_ISSET(i, myset)) { /* this node hasn't finished yet */ if (! isalive(node[i]) { /* failed to answer, e.g. UDP echo */ * node[i] = allocate_one_node(); * send_data (node[i]); * fd_set(socket[i], myset); done = NOT_YET; } else { /* Node is OK, assume it just takes longer */ done = NOT_YET; } } } } until (done == YEPP); YMMV. You would gain much more efficiency if you splitted data into n*n sets instead of n sets and replicated it such that if one node fails, its start data would be replicated among all other nodes. Then instead of only one node taking up the work of the deceased, all nodes would take 1/n-th of the dead node's work, and load would be split evenly. E.g. simplified: make n pieces for i = 0 to n-1 allocate piece i to to node i divide piece i into n-1 sub-pieces for j = 0 to i-1 allocate sub-piece j to node j for j = i to n allocate sub-piece j to node i+j It may be more elegant, I know and more complex schemes may be devised. Then, on failure of node i you know that its data is n*(i-1) to n*i and that it is split on sub-pieces spread over all other nodes evenly, so, there you go: run job/iter on n nodes with n pieces wait for answers if one node dies, collect all n-1 answers start job again on n-1 nodes with n-1 subpieces wait for sub-answers consolidate n-1 answers with n-1 sub-answers done. And time would be 1/n in the best case, or 1/n+1/n^2 in the second. Then for the next job/iter you may check if the node is back and repeat the scheme, or if it isn't and then increase work dataset of each remaining node with its sub-piece copy. The second example is when you may as well try to survive with a partial answer, e.g. you are looking for an answer and it *might* be close enough or even already found, e.g. in a database search: start_subjobs(); set_timeout(long_enough, my_sig_handler); wait_for_answers when an answer arrives { save_it_for_later } consolidate_all_answers show_user ... my_sig_handler() { check_all_nodes_are_alive if someone_failed { consolidate_answers_to_date show_user if (user_is_glad) OK else die_miserably } } A signal handler might as well have been used in the first case, as an asynchronous multiplexer would as well do on the second. It's up to you to decide which suits you best. A signal handler may be combined with a setjmp/longjmp jump if needed. The point is that there is always a way out. j -----BEGIN PGP SIGNATURE----- Version: 2.6.3i Charset: noconv iQEVAwUBOUUqh7gsTQLvQjxFAQGVRAf/SrfS9NfCvGNMbHpiUXbITA6s5xOycmwx Id9UpymyQm26D73bbF5loAOhmOQ/OBvhK3yfRahyD92Ubugdb7LSp1N6v28NRhBw VdzKq8mPWFl4Wgbpumm2grzJmDZLTRXsRwuYiXq12DtH/Yf+SnlhTbI7/OxLBWxO /J1wEzzPv+VHzeXhpBqZ86nb/97TZ118PPw+zgnvOxDt+SBNzbIrN/SWf4euZaxo engcTE1RV5su8zSr3bbd/gbIsWpPn7tmuz6oVTB5nBU3ioKzdnVI4dyqh1SeZfQ6 yY0v0XOno0Tn5qMW6qJFNpOIxVK3fAZdlx7IU3P0fVUDYE6sy44Lfg== =HuIX -----END PGP SIGNATURE-----
- Previous message: updating the Linux kernel
- Next message: updating the Linux kernel
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
