heterogenous clusters

Mon Jul 3 09:47:22 PDT 2000

Thanks for the comments. I think I'll try varying the slice
lengths though - more convenient in my case.

As for Java speed, JDK 1.3 can be about as fast as C (!).
This is unlike older versions of the JVM that I agree were 
relatively slow.

Some hard evidence: these are the results of a 
calculation-intensive program using 64-bit FP, run on 
a single 166 MHz node with no networking.

	C under Linux : 28.66 s
	Java 1.1.6 under Linux : 51.01 s
	Java 1.1.6 under Windows : 85.32 s
	Java 1.3 under Windows : 23.68 s 

These are averages over several runs. Similar results were
obtained on an AMD 400. Though I haven't tried Java 1.3
under Linux yet, it should run quite well.

Best regards,
Alan Ward

----------
De: Luc.Vereecken at chem.kuleuven.ac.be
A: award at mypic.ad; beowulf at beowulf.org
Asunto: Re:  heterogenous clusters
Fecha: dilluns, 3 / juliol / 2000 06:07

I used MPI. It has the advantage of transparantly translating machine-
specific data representation (big/little-endian,...) while still being
fast enough (faster that Java I bet). While the heterogenous approach
doesn't allow vendor-optimised versions (IBM's MPI on SP,...) a
distribution
like MPICH runs on enough platforms to give you the interconnectivity
you want.

>2) When you slice up a job to distribute it between nodes,
>how does slice size affect performance when large speed
>differences exist (i.e. ratios of 1:2 up to 1:5)?
>My experience indicates that (a) too small a slice gives you a
>large communication overhead, and (b) too large a slice makes
>you lose time at overall completion while the slower nodes 
>finish up their last slice.

I use two ways to get around the speed difference problem (which in 
our setup is around 1:10 or 1:15, depending on which machines I use).

1. Don't use slices, but groups of slices, where the slices are fairly 
small (but e.g. fixed in size if it is easier to deal with in your 
application), and the groups contain a variable number of slices. Early in
the 
calculation you distribute groups with a large number of slices, and as the

number of remaining slices decreases, you decrease the number of slices
in the groups. The faster nodes finish faster, get new groups of slices
which early in the calculation are still fairly large (=low communication 
overhead).  As the calculation progresses, the outstanding groups get
smaller 
and smaller, so the slower nodes won't stall the calculation due to the
wait time for the last group (=little loss at end of calculation). If 
you use a good size of "early" group size, and a good function to 
decrease the size of the groups through time (ending with 1-slice groups),
you get most of the benefits of large groups (=low communication
overhead), but strongly reduce the idle time at the end of the global 
process since only small groups are outstanding by then. 
I found you get most of the benefits and little of the drawbacks even with 
fairly rough "initial size" and "decreasing function" guesses.
Example : 240 slices, 10 processes, I start with 10-slice-groups, and
decrease to 1 near the end.

Our program is also build on a Master-Slave paradigm. Just in case it
is run in a homogenous environment, I avoid idling the Master most 
of the time, and then flooding it all of a sudden, by not giving all 
of the Slave processes the same initial number of slices in a group : 
example : 240 slices, 10 processes, I start with 10-slice-groups, of which
I distribute a few, then some 9-slice groups, then some 8-slice groups,
until all processes are busy.  This distributes the number of
Slave-to-Master
requests through time. Since the size of the slice-groups changes through
time, the chances of the processes getting into sync on accident are
remote.
It's a fairly coarse method, I admit, but it works like a charm. 

2. Start a different number of Slave processes on different processors. On
processors which are 5 times faster, I start e.g. 3 Slave processes 
(reducing the relative speed of the processes running on this processor)
When the slaves are idle at the end of the global calculations, they
enter a sleep-mode (blocking communinication usually does this
automatically,
since it waits for a network event). The remaining processes then
get more of the fast processor, so the last Slaves tend to work through
their
last slice faster, again decreasing idle time on the other processors at 
the end of the calculation.
Of course, if you start e.g. polling at the end of the calculation, this 
method won't work of course. For the MPICH version I use (can't remember
the version#) the blocking communication doesn't use polling, so I just
make the Slaves blocking-receive a communication from the Master (which in
our case are the specificiations of the job they have to to next).

Starting 2 or more processes per processor also tends to work nicely
without
additional programming time if your Master has to work a bit on the 
results of the Slave before it can send it its next job. The waiting Slave
does nothing (make sure it doesn't do polling), so the second Slave gets
all (most) of the processor which is then in near 100% productive use.

I don't try to "benchmark" the nodes to determine the chunck of work
they have to work through. The first method above adjusts automatically if 
one of the fast processors becomes "slow" (e.g. due to an interactive
just sucking CPUtime all of a sudden)

Just some idea's that work nicely on our problem.

Luc Vereecken