availability of Memory compression routine

Robert G. Brown rgb at phy.duke.edu
Fri Jul 19 06:32:52 PDT 2002


Another suggestion:  From the sound of things, the master is doing very
little "work" except for farming out the base matrix for each node's
next unit of work.  You also don't indicate whether the work sequence is
independent or dependent (does the second matrix sent out depend on the
results sent back by all the nodes) or synchronous (so that all the
nodes have to finish at a barrier before the next work slice can
begin).  It could be that the base matrix isn't horribly difficult to
compute and is independent for each slice OR that you have a big file on
disk of base matrices and are reading them in and sending them out one
at a time.  In either case you should likely reconsider your code
engineering.

One secret to happy parallelization is to try to push your code
structure towards the "ideal" of embarrassingly parallel.  You should
therefore sit down and think carefully about the architecture of the
problem and whether it is possible to reformulate it (possibly in an
"odd" way) so that it is more nearly EP.

You haven't indicated what the actual physics is of the problem you're
working on, but if the master is computing the matrices it is sending
out (perhaps generating them from a distribution and a random number
seed) and if it takes significantly less time to compute a new matrix
than to send it via the net, you can very likely restructure your
problem so that it is EP, or nearly so.
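
To put rough (purely illustrative) numbers on that: a 1000x1000 matrix
of doubles is 8 MB, which is close to a full second on the wire at
100BT speeds (~10 MB/sec), before any latency or contention.  If a
slave can regenerate that same matrix from a seed and an index in a few
milliseconds of CPU time, shipping it across the network is almost pure
overhead.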

For example, if your slaves had onboard the RNG and initial matrix
computation routines, all the master would need to do is send all the
slaves a starting RNG seed (once, at the very beginning) and an index
indicating where a given slave should start generating its next matrix.
The master would basically do trivial bookkeeping -- send out just
enough information for the slaves to do ALL the work and then kick back
and await results.
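
To make that concrete, here is a rough sketch in MPI-flavored C of what
the communication pattern boils down to.  The matrix size, the RNG, and
the two stub routines are stand-ins for whatever your real code does;
only the traffic pattern is the point.

/*
 * Minimal sketch of the "seed plus index" scheme: the master sends one
 * seed, once, and every node builds and crunches its own matrices.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N        100    /* matrix dimension (illustrative) */
#define PER_NODE 10     /* work units per node (illustrative) */

/* Stand-in matrix generator: fills an N x N matrix from an RNG stream
 * selected by (seed, index), so any node can reproduce any matrix in
 * the sequence without further help from the master.                 */
static void generate_matrix(double *a, unsigned seed, int index)
{
    int i;
    srand(seed + (unsigned) index);     /* crude stream selection */
    for (i = 0; i < N * N; i++)
        a[i] = (double) rand() / RAND_MAX;
}

/* Stand-in for the real physics: reduces one matrix to one number. */
static double process_matrix(const double *a)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < N * N; i++)
        sum += a[i];
    return sum;
}

int main(int argc, char **argv)
{
    int rank, size, k;
    unsigned seed = 12345;      /* chosen on the master, sent out once */
    double local = 0.0, total;
    static double a[N * N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The master's only real communication: one seed, once. */
    MPI_Bcast(&seed, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);

    /* Each node generates and crunches ITS OWN slice of the matrix
     * stream; no matrices ever cross the network.                   */
    for (k = 0; k < PER_NODE; k++) {
        generate_matrix(a, seed, rank * PER_NODE + k);
        local += process_matrix(a);
    }

    /* Only the (small) results travel back. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("accumulated result over %d nodes: %g\n", size, total);

    MPI_Finalize();
    return 0;
}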

A second, even more effective way to make the problem EP exists if you
have large, independent blocks of parameter space that you are
interested in exploring.  Simply divvy it up among the nodes and let
each node run the job as a SERIAL task.  Node 1 runs the job(s) you were
going to do on Monday.  Node 2 does the ones you were going to do on
Tuesday.  Etc.  This gives nearly perfect parallel speedup unless
(again) the data required to initialize a day's work is really, really
big.  This will also work even if a given work sequence for a given day
has dependencies, so that the second matrix sent out depends on the
results sent back from all the nodes from the first one. 
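
A sketch of that style (again in C, with the parameter range and the
per-parameter computation as stand-ins for the real job) might look
like:

/*
 * One block of parameter space per node: a purely SERIAL program, no
 * MPI or PVM at all.  Results land on each node's local disk.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 16     /* total blocks, e.g. one per node (illustrative) */
#define NPARAMS 1600   /* total parameter values to explore (illustrative) */

/* Stand-in for the real serial computation at one parameter value. */
static double do_one_run(double param)
{
    return param * param;
}

int main(int argc, char **argv)
{
    int block, i, lo, hi, per_block;
    double param;
    char fname[64];
    FILE *out;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <block index 0..%d>\n",
                argv[0], NBLOCKS - 1);
        return 1;
    }
    block = atoi(argv[1]);

    /* This node's slice of the parameter space. */
    per_block = NPARAMS / NBLOCKS;
    lo = block * per_block;
    hi = lo + per_block;

    /* Write to local scratch; gather the files when everything is done. */
    sprintf(fname, "results.block%02d", block);
    out = fopen(fname, "w");
    if (out == NULL) { perror(fname); return 1; }

    for (i = lo; i < hi; i++) {
        param = (double) i / NPARAMS;
        fprintf(out, "%g %g\n", param, do_one_run(param));
    }
    fclose(out);
    return 0;
}

You'd start it on node 0 with argument 0, node 1 with argument 1, and
so on (a one-line rsh or ssh loop will do; the node names and the
invocation are only illustrative), then collect the results.block*
files afterwards.  No message passing library required.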

Remember, although PVM and MPI are indeed very cool, if you write code
that REQUIRES their use you have in some sense already lost the parallel
wars, moving from EP or extremely coarse-grained parallelism (which
scales well on nearly any network architecture) into medium- to
fine-grained parallelism, which requires fairly careful thought and co-engineering
of task and beowulf to achieve the best possible scaling.  Master/slave
computations can often be engineered to be EP, and a proper account of
their parallel speedup must include the serial time required to generate
or read in the matrices you are working on.  
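
In symbols (a rough statement of the same thing): if Ts is the serial
time per work slice -- generating or reading in the matrix -- and Tp is
the compute time that actually parallelizes, the best you can hope for
on N nodes is roughly

   S(N) = (Ts + Tp) / (Ts + Tp/N)

which saturates at (Ts + Tp)/Ts no matter how many nodes you throw at
it.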

Just imagine that you have just one computer to do the computations on.
Now somebody gives you a second, but it isn't on the network at all --
it has maybe a cd drive you can use to move code onto it and perhaps
read in a big block of initialization data.  Can you use it
constructively?  Can you crank up a SERIAL computation on it that does
computations that are truly independent of those on the first, but
still means that you finish all the work required for your publication in half
the time?  If so, you win -- architect your parallelized task the same
way, with the convenience of a network instead of a CD drive and you
should be able to achieve near-linear speedup for LOTS of nodes.

Even if the matrices are being read from disk, they had to get to disk
somehow.  If they were sent there from, e.g. an experiment with some
sort of network copy, it would be a simple matter to arrange for them to
get distributed directly to a scratch space on the nodes (disk is so
cheap nowadays it is very likely that each node has over 30 GB of
scratch disk anyway).  This would significantly improve execution time
-- obviously it is "bad" to send the matrices first to the master, which
then sends them to the slaves (two network copies), instead of sending
them directly to the slaves (one network copy).  Again, the master's
role is reduced to bookkeeping and collecting (hopefully much smaller)
results.
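
Again with purely illustrative numbers: 10 GB of matrices at ~10 MB/sec
is on the order of 1000 seconds of wire time per copy.  Staging them
through the master costs roughly 2000 seconds before the slaves have
their data; copying each node's share straight to its scratch disk
costs one copy, and each node only has to wait for its own fraction of
it.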

The only way you're likely to get stuck is if you're doing a single,
monolithic computation with complete dependencies and synchronicity (so
there is just a single thread of matrices you wish to explore and every
new matrix is generated from the results computed by all the nodes in
the previous step), or if the matrices are, e.g., on tapes being sent to
you from Fermilab or the like, in which case you've got a nasty double
bottleneck to consider -- the tape interface (slow) and the network
(also slow).  In the first case, the best you can do is speed up the network.

In the second case you're up against Amdahl's Law, generalized to
account for serial and parallel fractions of work.  Reading from tape is
a serial bottleneck that won't go away.  A high end tape has a data
transfer rate on the order of 6 MB/sec sustained, plus seek times which
can easily be seconds or longer.  Clearly the network (at 9-10 MB/sec)
is no longer a bottleneck, so going to gigabit ethernet would not help
overall.  
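
Plugging illustrative numbers into the speedup expression above: an
8 MB matrix takes about 1.3 seconds to come off the tape at 6 MB/sec
and under a second more to forward at ~10 MB/sec, so call the serial
cost roughly 2 seconds per matrix.  If each matrix then takes 100
seconds of computation, the speedup saturates near (2 + 100)/2, or
about 50, however many nodes you buy -- and gigabit ethernet only
shaves the sub-second network piece, leaving the tape's share
untouched.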

In any event, you definitely need to study your problem carefully,
especially the bottlenecks, to see how to redesign it to run efficiently
on large numbers of nodes.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





