[Beowulf] More cores/More processors/More nodes?

Mark Hahn hahn at physics.mcmaster.ca
Sat Sep 30 13:53:01 PDT 2006


> It seems there are at least 3 dimensions for expansion.  What (in your
> opinion) is the right tradeoff between more cores, more processors and
> more individual compute nodes?

I'd claim this is not a matter of opinion, but rather a matter of which
things matter most to you: memory bandwidth or capacity, density,
interconnect bandwidth, perhaps even disk IO bandwidth.

> In particular, I am thinking of in-house parallel finite difference /
> finite element codes, parallel BLAS, and maybe some commercial
> Monte-Carlo codes (the last being an embarrassingly parallel problem).

Monte Carlo, from what I see, is both embarrassingly parallel and tiny, so
it really just wants lots of cores, little memory, a light interconnect, etc.
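
to make "embarrassingly parallel" concrete, here is a minimal sketch (not
taken from any particular commercial code) of a Monte Carlo estimate of pi.
the function names and the use of consecutive integers as per-worker seeds
are assumptions made for illustration only. each worker needs a few bytes of
state and returns a single integer, which is why memory and interconnect
barely matter for this kind of job.

```python
import random

def count_hits(seed, samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private stream per worker, no shared state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(workers, samples_per_worker):
    # every count_hits call is independent, so on a cluster each one could
    # run on its own core; here they simply run one after another.
    # the only "communication" is summing one integer per worker.
    total_hits = sum(count_hits(seed, samples_per_worker) for seed in range(workers))
    return 4.0 * total_hits / (workers * samples_per_worker)

if __name__ == "__main__":
    print(estimate_pi(workers=4, samples_per_worker=20000))
```

in a real run the loop over seeds would be farmed out to separate processes
or nodes; the arithmetic stays the same.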

but that's an extreme; more generally the right choice depends on issues like 
how cache-friendly the code is (cache-friendly code is less sensitive to the
core-to-memory-bandwidth ratio), whether on-node shared memory is 
a big win (still faster than the interconnect, and easier to program), whether 
memory _capacity_ is more of an issue (which with AMD leads to more 
sockets/node), etc.

it does seem like finite-element stuff tends to have a relatively 
high work-to-surface-area ratio, so is not terribly demanding of interconnect
(cheaper interconnect, and less harm from multiple cores per node).
similarly, higher levels of BLAS are less demanding of memory bandwidth.
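
a rough way to see the work-to-surface-area point: assume (purely for
illustration) that each MPI rank owns a cubic block of n x n x n cells and
exchanges a one-cell-thick layer with each of its 6 face neighbours per
step. the function name and the cubic decomposition are assumptions, not a
description of any particular code.

```python
def work_to_surface_ratio(n):
    """Cells computed per cell exchanged for an n x n x n block with a one-cell halo."""
    work = n ** 3         # cells updated locally each step
    surface = 6 * n ** 2  # one face layer sent to each of 6 neighbours
    return work / surface  # simplifies to n / 6

if __name__ == "__main__":
    for n in (12, 60, 120):
        print(n, work_to_surface_ratio(n))
```

under these assumptions the ratio grows linearly with the block edge, so
bigger per-rank blocks mean proportionally less traffic per unit of compute.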

> I have been set the task of building our first cluster for these
> applications.  Our existing in-house codes run on an SGI machine with a
> parallelizing compiler.  They would need to be ported to use MPI on a
> cluster.

would they?  have you considered whether they'd run well on something 
like an 8-socket, 16-core AMD system?  I'm guessing the SGI is an older
mips-based Origin, and thus has dramatically slower CPUs.

by "parallelizing compiler" do you mean OpenMP?

> However, I do not understand what happens when you have
> multi-processor/multi-core nodes in a cluster.  Do you just use MPI
> (with each thread using its own non-shared memory) or is there any way
> to do "mixed-mode" programming which takes advantage of shared memory
> within a node (like, an MPI/OpenMP hybrid?).

sure, all the memory in a node is shared, so you can use threads or other
shared-memory techniques if you want.  but this takes lots of additional
effort.  is it worth it?  bear in mind that any MPI implementation will take
some advantage of faster access to a peer which happens to be on the same
node.  and there are some packages (e.g. GotoBLAS) which can use threads
internally, and thus give you speedup even if you don't explicitly program
the threads.

I don't see anyone bothering with this on our clusters - people who make the 
jump to MPI tend not to care about small factors like 2 vs 4 cores/node,
since they're aiming at 3-digit core counts.  it's also easier to schedule
an n-way MPI job that has no requirements about the layout of workers,
versus one which would require all the cpus on all of its nodes.
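
the scheduling point can be sketched with a toy model (the function names,
the 4-core nodes and the free-core list below are all made up for
illustration; no real scheduler works exactly like this): a layout-free job
only needs enough free cores in total, while a hybrid job that wants whole
nodes has to wait for completely idle ones.

```python
def fits_any_layout(free_cores, n):
    """An n-way MPI job with no layout requirement fits if n cores are free anywhere."""
    return sum(free_cores) >= n

def fits_whole_nodes(free_cores, cores_per_node, n):
    """A job that needs every cpu on each of its nodes fits only on fully idle nodes."""
    nodes_needed = -(-n // cores_per_node)  # ceiling division
    idle_nodes = sum(1 for free in free_cores if free == cores_per_node)
    return idle_nodes >= nodes_needed

if __name__ == "__main__":
    free = [1, 2, 0, 4, 1, 3, 2, 3]  # hypothetical free cores on eight 4-core nodes
    print(fits_any_layout(free, 8))      # 16 cores are free in total
    print(fits_whole_nodes(free, 4, 8))  # needs 2 idle nodes, only 1 is idle
```

in this toy example the 8-way layout-free job can start right away, while
the whole-node version of the same job has to wait.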

for your transition, I would guess you need a combo-cluster: some nice fat
nodes, as well as a decent-sized set of MPI-friendly ones.  you really need
to investigate your workload to figure out whether you can use gigabit
everywhere (surprisingly effective, even for serious MPI that's not
embarrassingly parallel) or whether you need to step up to a real HPC
interconnect (to me, that would be either InfiniPath or Myrinet-10G.)

regards, mark hahn.


