[Beowulf] More cores/More processors/More nodes?
hahn at physics.mcmaster.ca
Sat Sep 30 13:53:01 PDT 2006
> It seems there are at least 3 dimensions for expansion. What (in your
> opinion) is the right tradeoff between more cores, more processors and
> individual compute nodes?
I'd claim this is not a matter of opinion, but rather a matter of which
things matter most to you: memory bandwidth or capacity, density,
interconnect bandwidth, perhaps even disk IO bandwidth.
> In particular, I am thinking of in-house parallel finite difference /
> finite element codes,
> parallel BLAS, and maybe some commercial Monte-Carlo codes (the last
> being an
> embarrassingly parallel problem).
monte carlo, from what I see, is both embarrassingly parallel (emb-par) and
tiny, so really just wants lots of cores, little memory, light interconnect, etc.
but that's an extreme; more generally the right choice depends on issues like
how cache-friendly the code is (thus less sensitive to the
core-to-memory-bandwidth ratio), whether on-node shared memory is
a big win (still faster than interconnect, easier to program), whether
memory _capacity_ is more of an issue (which with AMD leads to more
it does seem like finite-element stuff tends to have relatively
high work-to-surface-area, so is not terribly demanding of interconnect
(cheaper interconnect, and less harm from multiple cores per node).
similarly, higher levels of blas are less demanding of mem-bw.
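to put rough numbers on both points, here is a minimal sketch. it assumes
a cubic n x n x n subdomain per worker with a one-cell-deep halo on each face,
and textbook operation counts for one representative routine per blas level
(axpy, gemv, gemm). the function names are made up for illustration, and the
counts ignore cache effects entirely.

```python
def halo_to_work_ratio(n):
    """Halo (surface) cells exchanged per interior cell updated.

    Assumes a cubic n x n x n subdomain with a one-cell halo on each
    of its six faces (an illustrative assumption, not a measured value).
    """
    work = n ** 3        # cells updated per time step
    halo = 6 * n ** 2    # face cells exchanged with neighbours
    return halo / work   # simplifies to 6 / n


def blas_flops_per_word(level, n):
    """Approximate flops per memory word touched, ignoring caches.

    Uses one representative routine per BLAS level with operands of
    dimension n.
    """
    if level == 1:       # axpy: y = a*x + y
        return (2 * n) / (3 * n)               # read x, read y, write y
    if level == 2:       # gemv: y = A*x + y
        return (2 * n ** 2) / (n ** 2 + 3 * n)  # read A, x, y; write y
    if level == 3:       # gemm: C = A*B + C
        return (2 * n ** 3) / (4 * n ** 2)      # read A, B, C; write C
    raise ValueError("level must be 1, 2 or 3")
```

under these assumptions halo_to_work_ratio(100) is 0.06, so bigger subdomains
spend proportionally less on communication. blas_flops_per_word(3, 1000) is
500.0, against roughly 0.67 for level 1 at any n, which is why the higher
levels lean so much less on memory bandwidth.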
> I have been set the task of building our first cluster for these
> Our existing in-house codes run on an SGI machine with a parallelizing
> They would need to be ported to use MPI on a cluster.
would they? have you considered whether they'd run well on something
like an 8-socket, 16-core AMD system? I'm guessing the SGI is an older
mips-based Origin, and thus has dramatically slower CPUs.
by "parallelizing compiler" do you mean OpenMP?
> However, I do not
> what happens when you have multi-processor/multi-core nodes in a
> cluster. Do you
> just use MPI (with each thread using its own non-shared memory) or is
> there any
> way to do "mixed-mode" programming which takes advantage of shared
> memory within a
> node (like, an MPI/OpenMP hybrid?).
sure, all the memory in a node is shared, so you can use threads or other
shared-memory techniques if you want. but this takes lots of additional
effort. is it worth it? bear in mind that any MPI will take some advantage
of faster access to a peer which happens to be on the same node. and
there are some packages (eg goto-blas) which can use threads internally,
and thus give you speedup even if you don't explicitly program the threads.
I don't see anyone bothering with this on our clusters - people who make the
jump to MPI tend not to care about small factors like 2 vs 4 cores/node,
since they're aiming at 3-digit core counts. it's also easier to schedule
an n-way MPI job that has no requirements about the layout of workers,
versus one which would require all the cpus on all of its nodes.
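the scheduling point can be sketched as a toy placement check. this is only
an illustration: it assumes the scheduler knows the free core count on each
node, and both function names are hypothetical rather than taken from any
real batch system.

```python
def fits_anywhere(free_cores, ranks):
    """A layout-agnostic MPI job fits if enough cores are free in total."""
    return sum(free_cores) >= ranks


def fits_whole_nodes(free_cores, cores_per_node, nodes_needed):
    """A job wanting whole nodes fits only if enough nodes are fully idle."""
    idle_nodes = sum(1 for free in free_cores if free == cores_per_node)
    return idle_nodes >= nodes_needed


# Hypothetical snapshot: five 4-core nodes with this many cores free on each.
free = [1, 3, 0, 4, 2]

flexible_ok = fits_anywhere(free, 8)           # 10 cores free in total
whole_node_ok = fits_whole_nodes(free, 4, 2)   # only one node is fully idle
```

with this snapshot the 8-way job with no layout requirements can start right
away (flexible_ok is True), while the same 8 cores requested as two whole
nodes has to wait (whole_node_ok is False).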
for your transition, I would guess you need a combo-cluster: some nice fat
nodes, as well as a decent-sized set of MPI-friendly ones. you really need
to investigate your workload to figure out whether you can use gigabit
everywhere (surprisingly effective, even for serious MPI that's not emb-par)
or whether you need to step up to a real HPC interconnect (to me, that would
be either InfiniPath or Myrinet-10G.)
regards, mark hahn.