Actually, right back on topic, not Re: Just slightly off topic?
Robert G. Brown
rgb@phy.duke.edu
Thu, 17 Sep 1998 14:12:01 -0400
On Wed, 16 Sep 1998, Philip J. Matheson wrote:
> > Yes, pending what you mean by "supercomputers". I mean, you have to
> > accept a priori that a current 400 MHz PII is at least 10 times faster
> > (if not considerably more) than a 100 MHz 486. So you have to buy 10
> > obsolete computers AND handle all the hassle of network IPC's and other
> > bottlenecks that have been considerably advanced in the meantime (e.g.
> > cache size, raw memory speed, memory bus bandwidth, etc. etc.) just to
> > break even -- on perfect code.
>
> It is my understanding that the performance gained from performing tasks in
> parallel is not linear as you suggest here. 10 100 MHz 486's should permform
> substantially better than 1 p2 400, shouldn't they?
Alas, you've got it exactly backwards. The nonlinearity in parallel
system design deflects DOWN from per CPU scaling, not UP. This will be
proven in excruciating detail below. In order to get ten times as much
work done, you have to have a task that splits into ten pieces that can
be executed in parallel WITH (little or) NO OVERHEAD. That immediately
eliminates 90% of all software tasks (list purists will argue with "90%"
but who cares:-) -- reading mail, running a text editor, browsing the
web, playing a game, most of this sort of thing is single-threaded at
the CPU level and not worth the effort to multithread even where some
parallelism is possible.
Now, on to class. Let us analyze, however crudely, some completion
times of archetypical parallelizable tasks.
a) Imagine that you have a task that DOES split into ten
neat pieces, each of which can be executed on different CPU's at the
same time. One then has a spectrum of possibilities. If the subtasks
can be executed "independently" with no communication (other than the
inevitable burden of launching the tasks) then the task is said to be
"coarse grained parallel".
We can assume that it takes time t_0 to execute on a (presumed 486)
node, and that it takes time t_1 to load the task across the network and
recover the results when it completes. Since each subtask could be run
on a single CPU, one after the other (where the load time is negligible
compared to t_1 for a native launch), we will take 10*t_0 as the
baseline time for task completion on a 486 with no parallelism, and t_0
on a presumed exactly 10x faster PII.
Running each subtask on 10 separate CPU's will complete in
t_0 + (10-1)*t_1
which is obviously slower than the PII by precisely the burden imposed
by the very limited IPC's required to launch the ten processes and
recover the results of the subtask. As long as t_1 << t_0, the speed
does scale pretty much linearly with the number of CPUs. This more or
less DEFINES coarse grain parallel subtasks. Note that we do have to be
careful if t_0 and (N-1)*t_1 are commensurate. We will BE careful
later on below, and figure out what happens then.
b) Now imagine a set of subtasks that, while they are running, have to
communicate some with each other, but only locally (that is, each node
needs to talk to maybe two other nodes during the execution of the
subtask) and that this communication can be confined to relatively small
and discrete intervals. This is what I would call a medium grained
parallel task, although there is no authoritative definition written
down that I know of.
To analyze this, we need to define some numbers. t_0 is still the time
required to do the actually numerical part of the calculation on a 486
node. t_1 is still the time required to launch the subtasks and recover
their results in the end. There is an additional time t_2 required to
send data between a pair of nodes. Finally, there is a time t_3 that
appears because sometimes when node A wants to send data to node B, node
B is still running the calculation and cannot accept it. node A then
waits idle. Or, sometimes node B finishes a step and cannot proceed
until node A ALSO finishes and sends it the data it needs to proceed.
It waits. During these waits, calculation time is lost.
In a medium grained parallel calculation, with just a couple of nodes
talking, a good programmer can usually keep this number fairly small on
average (by arranging it so the calculation phases remain fairly
synchronous) but there are usually other bits of latency and overhead
that also that scale a bit worse than linearly with the number of nodes
that can be wrapped into t_3. Completion time goes to:
t_0 + (10-1)*t_1 + 2*10*t_2 + 10^x*t_3
where the exponent x is a bit larger than unity to reflect the losses to
latency and contention and asynchronicity. In a sense, medium grained
parallel is DEFINED by x<2 (or at least by x \approx 1) and t_3
relatively small, so that one continues to gain some benefit in speed as
one adds nodes over a fairly wide range of N (the number of nodes).
There is always a strict upper bound on the number of nodes that can be
profitably used, but it is still usually far more nodes than you can
afford (however many that may be) for medium grained tasks:-).
Needless to say, our imaginary PII still completes the task in basically
t_0 time. Memory bus I/O is orders of magnitude faster than network
I/O, and we can run the task "all at once". Indeed, it might run in
less than t_0 -- there may well be extra numerical overhead in breaking
the task up into the subtasks that are ignored in our simplified
treatment. The one place where the 10 486's might win is if the task is
too BIG (in, say, physical memory) to fit on a single PII while it does
fit on 10 486's (or hits some other similar bottleneck). With GB RAM
PII's available and with the technological advantages available in any
PII system, this seems very unlikely, to me anyway.
c) Finally, we imagine subtasks that have to communicate ALL the time,
and have to communicate at LONG RANGE. Node A has to talk to ALL the
other nodes in each step of the calculation, and the subtask has a LOT
of steps and a LOT of data is transferred in each communications. Ack!
Fine Grain Parallelism. t_0 and t_1 are still defined to be the same,
but now there are some (10!/(8!*2!)) = 45 data communication pathways
for EACH t_2 step of the subtask. (This is the number of ways to take
ten objects two at a time). For large N, this is approximately N^2/2,
but we probably need to ship data both ways on each connection, so we
immediately see that completion time goes to:
t_0 + (10-1)*t_1 + (10*(10-1))*t_2 + 10^x*t_3
where alas, in addition to an exponent quadratic in N=10 appearing in
the t_2 scaling, x (the t_3 contention/latency exponent) is likely to be
even bigger than 2 AND the t_2 and t_3 times themselves are much larger.
If t_0 \approx t_2 \approx t_3, the single 10x faster PII blows away the
stack of 486's, completing in orders of magnitude less time even for a
lousy 10 nodes executing a simple fine grained task. The scaling is so
rotten it may actually be faster to run on a PII with too little real
memory and swap to a U2W SCSI disk than to distribute the calculation
over 10 486 nodes.
Putting all this together, we can deduce some approximate forms for the
scaling of parallel tasks. Let us continue to assume that t_0 is the
baseline subtask completion time when the task is broken into N pieces
to be executed in parallel on N nodes, and that N*t_0 is the time to
complete the full task on a single CPU (that is we will continue to
assume that the entire task can be made to fit in memory on a single
machine). Then our "speed factor"
F(N) = (Single CPU completion time)/(N CPU completion time)
for various task grainings has forms something like:
F_C(N) = N*t_0/(t_0 + (N-1)*t_1)
F_M(N) = N*t_0 / (t_0 + (N-1)*t_1 + 2*N*t_2 + N^x*t_3)
F_F(N) = N*t_0 / (t_0 + (N-1)*t_1 + N*(N-1)*t_2 + N^x*t_3)
for Coarse, Medium and Fine grained tasks, with x probably greater than
2 and t_2 and t_3 unfortunately large in the case of fine grain
parallel tasks.
In the case of coarse grain tasks we can safely assume that t_1 << t_0
and that N>1. Factoring t_0 and doing a binomial expansion, we see
that:
F_C(N) \approx N(1 - (N-1)*t_1/t_0 + ...)
Note well the concavity -- this curve quadratically bends AWAY from
linear gain (the leading N term) DOWN towards no gain at all, not ever
upward. And this is the best we can ever hope to do. We will get
linear gain with the number of nodes as long as (N-1)*t_1/t_0 remains
much less than 1. This is true by definition for CG tasks, so your
cluster of 486's will eventually BEAT almost any single CPU on coarse
grained parallel tasks if you get enough nodes. In fact, enough 486
nodes are perfectly capable of beating a Cray T3E for a certain class of
tasks even if t_1 involves carrying a floppy around to launch the
application on each 486, and then carrying it around to collect the
results. Obviously you'd have to make t_0 REALLY BIG to overwhelm a t_1
on the order of minutes, but this can be done.
In the case of medium grain tasks, we can also hope or try to arrange
via program design that t_0>>t_1,t_2 and t_3. If we succeed, we can
still do the binomial expansion trick to get:
F_M(N) \approx N(1 - {(N-1)*t_1/t_0 - 2*N*t_2/t_0 - N^x*t_3/t_0} + ...)
Again, we do ok as long the SUM of the terms in {} remains much less
than 1, but our departure is quadratic or worse because of the doggone
x>1. A big t_3 is especially "bad", and good medium-grain coding will
try to minimize it, while it is ALWAYS wise to minimize t_1 and t_2
relative to t_0.
Finally, for fine grain tasks we get:
F_F(N) \approx N(1 - {(N-1)*t_1/t_0 - N*(N-1)*t_2/t_0 - N^x*t_3/t_0} + ...)
and, given that by hypothesis t_2/t_0 and t_3/t_0 are not particularly
small (although they'd better be less than 1!) we see that our gain
curve is not even approximately linear in N except perhaps for very
small N. The number of nodes over which a given parallelizable task
realizes approximately linear gains depends very much on how small the
t_2 and t_3 times can be maintained relative to t_0.
For a "real" parallel supercomputer, these times are minimal -- a lot of
money and engineering effort is spent insuring that IPC's are fast and
efficient (low latency and unblocked). On a beowulf, typically IPC's
are network based and asynchronous. The network bandwidth is usually an
easy order of magnitude slower than any bus, and communications have an
anomalously high overhead (latency/CPU/interrupt/context switch) cost as
well (relative to shared memory or a single CPU machine). Good coding
practice is still to try to synchronize and distribute communications to
minimize blocking (keeping t_3 small and the exponent reasonable) and to
adjust the subtask breakup to scale t_0/t_2 (basically, the inverse CPU
to IPC ratio) as favorably as possible.
Now, before the real experts start bashing me, I freely acknowledge that
this is an oversimplified treatment and that better ones undoubtedly
exist in the literature. There is ignored N dependence in t_2 and t_3
as defined above, for example -- how MUCH one communicates between nodes
frequently depends on how many nodes one splits a task up on, frequently
with a surface-to-volume kind of N dependence. Also, a proper treatment
very rapidly stops being this general -- one starts to look at the
SPECIFIC rates and bottlenecks for your particular parallel architecture
and task.
Being a numerical kind of guy rather than a proper theorist, I'd be
inclined to write down some sort of general purpose scaling law for a
given cluster or beowulf design like:
F(N) = N/(R + A*N + B*N^2 + C*N^3 + D*N^x + E*N^y + F*N^z)
where R is usually one but who knows and FIT this form empirically to
the observed scaling for a given numerical program and parallel computer
design, using the minimum number of terms that yield decent conditioning
(that is setting A-F to zero whenever they don't really improve the
nonlinear fit) and even adding a logistic term or the like if one
appeared to improve the results. Heck, one could even publish the ten
fit coeficients (and any easily controllable functional variation
thereof on e.g. CPU speed) in a neat little table so that anybody could
take them and do a reasonable job of extrapolating expected performance
on that task or reasonable structural analogs of that task, and directly
measure and compare the effect of different design strategies and
hardware/networking configurations. This, in turn, would give us all a
much better insight into how to BEAT the relevant bottlenecks that make,
for example, B large or y=4 for a particular task and computer design.
This is the kind of thing that the high end beowulf dudes (real computer
scientists like Dr. Ligon) have to struggle with. Most of us on this
list are PROBABLY working in the empirically determined sweet spot of
our personal scaling curve (that is, where the gain is still roughly
linear on whatever architecture we happen to be working on). Still, it
seems very important for engineering purposes that folks realize that a
bit of reasoning lets one build a quantitative description for the
scaling of F(N) and that even a few measurements for various N MAY be
enough to fit the nonlinearity well enough to extrapolate it as far as
one ever would need to go.
rgb
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu