[Beowulf] RE: programming multicore clusters
landman at scalableinformatics.com
Fri Jun 15 06:57:08 PDT 2007
Toon Knapen wrote:
> Mark Hahn wrote:
>> unless most of your IPC is this kind of async, unsync, passive data
>> reference, I wouldn't think twice: go MPI. the current media frenzy
>> about multicore systems (nothing new!) doesn't change the picture much.
> Because of everybody going multi-core, everybody is pushing to go
> multi-threading to exploit these architectures (e.g. the gaming-world
> and many more). IIUC you're saying that MPI might better exploit these
> architectures? Interesting POV!
Multicore has some interesting upsides. The downside, oversubscription
of the memory bandwidth available out of each socket, reminds me of the
days of larger SMP boxes with big buses in the early/mid 90s.
First, shared memory is nice and simple as a programming model.
Multicore suggests that shared memory should be very easy to exploit.
But you still have to worry about contention, affinity, and everything
else we used to have to worry about a decade ago with the big machines.
The precious resource whose utilization you need to optimize is no
longer CPU cycles, but bandwidth.
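A back-of-the-envelope sketch of why bandwidth, rather than cycles,
becomes the limit. The function names and all of the numbers are
hypothetical, chosen only for illustration, and the model assumes the
cores on a socket split its memory bandwidth evenly.

```python
def per_core_bandwidth(socket_bw_gbs: float, cores: int) -> float:
    """Even share of one socket's memory bandwidth, in GB/s per core."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return socket_bw_gbs / cores


def attainable_gflops(peak_gflops: float, flops_per_byte: float,
                      socket_bw_gbs: float, cores: int) -> float:
    """Per-core rate: the smaller of the compute peak and what memory can feed."""
    memory_bound = per_core_bandwidth(socket_bw_gbs, cores) * flops_per_byte
    return min(peak_gflops, memory_bound)


if __name__ == "__main__":
    # Hypothetical socket: 8 GB/s of memory bandwidth, 8 GFLOP/s peak per core,
    # running a kernel that does 0.5 floating-point operations per byte moved.
    for cores in (1, 2, 4, 8):
        rate = attainable_gflops(8.0, 0.5, 8.0, cores)
        print(f"{cores} cores: {rate:.2f} GFLOP/s per core, "
              f"{rate * cores:.2f} GFLOP/s per socket")
```

Under these assumed numbers the per-socket rate stays flat as cores are
added, because every core is waiting on the same memory pipe.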
Second, MPI is a more complex model. It forces you to reconsider how
the algorithm is mapped to the hardware. And it makes no assumptions
about the hardware, at least in the API. The implementation, though,
might be taught about multi-core, optimizing communication within
boxes via shm sockets and between boxes by other methods. I think a
few of the MPI toolkits do this today (Scali, Intel, OpenMPI, ...).
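To make that concrete, here is a toy sketch of the kind of decision such
an implementation might make internally. It is not the code of any real
MPI library; the function name, the transport labels, and the host names
are all hypothetical.

```python
def choose_transport(rank_hosts: list[str], src: int, dst: int) -> str:
    """Pick a (hypothetical) transport label for a message from src to dst.

    rank_hosts[i] is the name of the box that rank i runs on.
    """
    if src == dst:
        return "self"
    if rank_hosts[src] == rank_hosts[dst]:
        return "shm"      # same box: copy through shared memory
    return "network"      # different boxes: go over the interconnect


if __name__ == "__main__":
    # Hypothetical layout: two dual-core boxes, two ranks on each.
    hosts = ["node0", "node0", "node1", "node1"]
    for dst in range(1, 4):
        print(f"rank 0 -> rank {dst}: {choose_transport(hosts, 0, dst)}")
```

The application-level send and receive calls look the same in every
case; only the path underneath changes.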
Neither of these models takes into account the fact that memory
bandwidth out of a socket is finite. Technically this is an
implementation issue, but as we hit larger and larger core counts, some
codes, well, larger fractions of the parallel code base, are likely to
run into this resource contention issue.
We were seeing contention for fabric interconnects (e.g. bus contention)
with LAMMPS runs for a customer last year, simply comparing single core
with dual core. It was significant enough that the customer opted for
single core. This contention is not going to get better as you increase
the number of cores. Since MPI depends, in part, on a resource that is
itself contended for (the interconnect), it is not at all clear to me
that MPI will be the *best* choice for programming all the cores, though
it certainly would be a simple choice.
Greg is right when he notes that the hybrid model is a challenge.
Unfortunately we appear to be facing a regime with multiple layers of
hierarchy, so this will need resolution. You can create a globally
"optimal" code via MPI that may not be as efficient locally as you
would like, and will likely grow less so with more cores, or a locally
optimal, never-get-out-of-the-box code via shared memory.
Shared memory scales nicely on NUMA machines, assuming 1-2 cores per
memory controller. It won't/doesn't scale with 8 cores and one memory
bus. How well does STREAM run on Clovertown? The NAS Parallel Benchmarks?
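The NUMA-versus-single-bus contrast can be sketched with a toy model.
This is not a STREAM measurement and says nothing about any real
Clovertown box; the function name and the figures (6 GB/s per memory
controller, 3 GB/s demanded per core) are assumptions made up for the
example.

```python
def aggregate_bandwidth(cores: int, cores_per_controller: int,
                        controller_bw_gbs: float,
                        per_core_demand_gbs: float) -> float:
    """Bandwidth (GB/s) delivered to `cores` streaming cores.

    Cores are packed onto memory controllers, cores_per_controller at a
    time; each controller delivers at most controller_bw_gbs.
    """
    full, rest = divmod(cores, cores_per_controller)
    per_full = min(controller_bw_gbs, cores_per_controller * per_core_demand_gbs)
    partial = min(controller_bw_gbs, rest * per_core_demand_gbs)
    return full * per_full + partial


if __name__ == "__main__":
    # Hypothetical figures: 6 GB/s per controller, 3 GB/s wanted per core.
    for cores in (1, 2, 4, 8):
        numa = aggregate_bandwidth(cores, 2, 6.0, 3.0)   # 2 cores per controller
        bus = aggregate_bandwidth(cores, 8, 6.0, 3.0)    # 8 cores on one bus
        print(f"{cores} cores: NUMA {numa:.0f} GB/s, single bus {bus:.0f} GB/s")
```

In this model the NUMA layout keeps adding bandwidth as controllers are
added, while the single bus saturates at two cores and stays there.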
The issue, at the end of the day, is the contended-for resources.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615