[Beowulf] multi-threading vs. MPI
diep at xs4all.nl
Sat Dec 8 07:09:46 PST 2007
Well, there is a difference between being lazy and quickly writing
something generic that works and is embarrassingly parallel, and
something like a game-tree search, where you really want the maximum
out of it and are prepared to optimize at cycle level.
In the latter case you definitely want a 2-layer parallelism.
Note that multiprocessing is far easier than multithreading under
Linux. Of course, seen from a high level that's not a real difference,
because you can do multiprocessing in a way that you could call
multithreading, and vice versa.
Another concern is that in the megabytes of source code, quite some
communication happens between the different processes,
simply because that speeds up the software exponentially.
Communication ideally happens at every node (position) you search. When
less communication is possible because of the several microseconds it
takes to get a cache line from a remote node, speedup gets worse.
A problem with doing all that communication between the nodes is that
in shared memory it's simply a single pointer you read/write, whereas
with MPI you have to do a lot to get it done. It totally
screws up your source code, so to speak.
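As a rough illustration of that difference in code shape, here is a minimal sketch. It is only an analogy: Python threads sharing one object stand in for shared memory, and a `queue.Queue` stands in for MPI-style messages. The names `SharedBound`, `search_shared` and `search_messages` are hypothetical and have nothing to do with Diep's actual code.

```python
import queue
import threading


class SharedBound:
    """Shared-memory style: every worker reads and writes the same object."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def raise_to(self, candidate):
        # One guarded write is the whole "communication".
        with self.lock:
            if candidate > self.value:
                self.value = candidate


def search_shared(candidates):
    bound = SharedBound()
    workers = [threading.Thread(target=bound.raise_to, args=(c,))
               for c in candidates]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return bound.value


def search_messages(candidates):
    inbox = queue.Queue()

    def worker(candidate):
        # The sender has to package the update explicitly.
        inbox.put(("bound", candidate))

    workers = [threading.Thread(target=worker, args=(c,))
               for c in candidates]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    best = 0
    # The receiver has to drain, decode and apply every message itself.
    while not inbox.empty():
        tag, value = inbox.get()
        if tag == "bound" and value > best:
            best = value
    return best


if __name__ == "__main__":
    print(search_shared([3, 9, 5]), search_messages([3, 9, 5]))
```

Both functions compute the same result; the message version just needs extra packing, draining and decoding logic around every update, which is the kind of clutter described above.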
In the end, it all depends on the software you intend to run.
Yet MPI has a huge overhead, which shared-memory parallelism doesn't
have at all.
This is totally irrelevant when your software is
embarrassingly parallel, though.
If it is relevant, then you'll have to look for creative ways to
parallelize your software well, as the latencies within
1 node are up to a factor 1000 lower
than between nodes.
To compare: over shmem/MPI I'm really happy when I get 50% speedup out
of a few nodes (and about 20-25% when the number of CPUs grows to 512),
and in shared memory my Diep chess program gets about 3.75 out of 4
on a quad-core Intel,
whereas the scaling is roughly 3.8 out of 4. So that's 3.75 / 3.8 =
roughly 98.7%.
There is a huge difference between 95%+ speedup and 50%.
So for example on a 16-node quad-core cluster, 64 CPUs in total, if I
were to use MPI only, I'd get perhaps 25% speedup or so:
25% * 64 = a bit less than 16 out of 64
Now using a 2-layer parallelism the calculation is more like:
3.75 speedup out of 1 node and 50% speedup out of 16 nodes = 3.75 * 8
= 30 out of 64
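The two estimates can be written out as a small calculation. This is only a sketch of the arithmetic in this mail; the function names are made up for illustration, and the efficiency figures are the rough ones quoted above, not measurements.

```python
def flat_speedup(cores, efficiency):
    """MPI-only estimate: every core contributes at the same efficiency."""
    return cores * efficiency


def two_layer_speedup(nodes, per_node_speedup, inter_node_efficiency):
    """Shared memory inside a node, message passing between nodes."""
    return per_node_speedup * nodes * inter_node_efficiency


nodes, cores_per_node = 16, 4

# 25% efficiency over all 64 CPUs (the mail hedges this as "a bit less").
mpi_only = flat_speedup(nodes * cores_per_node, 0.25)

# 3.75 out of 4 inside a node, 50% efficiency across the 16 nodes.
hybrid = two_layer_speedup(nodes, 3.75, 0.50)

print(mpi_only, hybrid)  # 16.0 30.0
```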
Of course, don't forget the huge effort it takes to first build that
parallelism so that it also runs on PCs; years of full-time work
have been put into it.
When you have thousands of nodes, of course it's nearly
irrelevant to make this big effort. Yet you don't get hundreds of nodes.
With the biggest effort you can perhaps buy a 64-core machine
yourself. As soon as big institutes start paying, things change though.
Why make the effort? If you want a bigger speedup, just ask for
more budget and get yourself more CPUs, as 25% out of 256 cores is
more than what a huge effort of years gets you: 30 out of 64 on a cluster.
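To put a number on that trade-off, the sketch below computes how many MPI-only cores would match the 2-layer result. It assumes the 25% efficiency still holds at that core count, which is optimistic given the 20-25% quoted for 512 CPUs; `cores_needed` is a hypothetical helper.

```python
import math


def cores_needed(target_speedup, efficiency):
    """Cores required to reach target_speedup at a fixed parallel efficiency."""
    return math.ceil(target_speedup / efficiency)


# Matching the 2-layer figure of 30 with MPI only at 25% efficiency.
break_even = cores_needed(30, 0.25)

# What 256 cores would give at the same assumed efficiency.
speedup_256 = 256 * 0.25

print(break_even, speedup_256)  # 120 64.0
```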
On Dec 7, 2007, at 1:26 PM, Toon Knapen wrote:
> On this list there is almost unanimous agreement that MPI is the
> way to go for parallelism and that combining multi-threading (MT)
> and message-passing (MP) is not even worth it, just sticking to MP
> is all that is necessary.
> However, in real-life most are talking and investing in MT while
> very few are interested in MP. I also just read on the blog of Arch
> Robison " TBB perhaps gives up a little performance short of
> optimal so you don't have to write message-passing " (here: http://
> environment-and-evolution/ )
> How come there is almost unanimous agreement in the beowulf
> community while the rest is almost unanimously convinced of the
> opposite? Are we just patting ourselves on the back or is MP not
> sufficiently disseminated or ... ?