[Beowulf] Mark Hahn's Beowulf/Cluster/HPC mini-FAQ for newbies & some further thoughts

Mark Hahn hahn at mcmaster.ca
Mon Nov 5 01:02:50 PST 2012


thanks for your kind comments.

> of specific clusters.  I couldn't easily find a FAQ suitable for people
> with general computer experience wondering what they can and can't do,
> or at least what would be wise to do or not do, with clusters, High

heh, I guess so.  then again, it's also pretty obvious sometimes.
if you use Ansys to do your CFD, for instance, they don't hide the fact that
you can run it on a cluster.  "standard" HPC codes like Gromacs, Namd,
and the weather codes are very up-front about it.  it's not hard to find out how
R codes (or Matlab, etc) can utilize clusters.

> HPC has quite a long history and my understanding of the Beowulf concept
> is to make clusters using easily available hardware and software, such

well, I would claim it's more of a hacker thing: PCs got cheap for their
own reasons, and Beowulfry is what we call the techniques for "repurposing"
PCs to do supercomputer-like work.

> as operating systems, servers and interconnects.  The Interconnects, I
> think are invariably Ethernet, since anything else is, or at least was
> until recently, exotic and therefore expensive.

there is perhaps a fundamentalist strain of Beowulfry that sticks strictly
to completely commoditized hardware, such as ethernet.  truly mass-market,
many-sourced hardware provides incredible cost benefits.  but it's also true
that you choose your tradeoffs: I built a cluster in 2003 or so which 
used traditional 1U rackmount dual-socket servers and Myrinet interconnect.
Gigabit had gotten reasonably affordable by that time - it was integrated
essentially for free in most servers, though switches were not very cheap.
but Myrinet (2g) delivers 250 MB/s of bandwidth, at a latency of 3-4 us, 
to a potentially large cluster at full-bisection.  at the time, gigabit
could, with a tailwind, do 100 MB/s at a latency of >= 50 us, and was not 
cheap to scale, switch-wise.  these days, Infiniband is pretty ubiquitous,
though you cannot possibly call it many-sourced - I'd say that gigabit 
and Myrinet grew the market size so IB could survive, in spite of having 
very limited adoption outside HPC and being mostly a single-source "standard".
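
just to make the tradeoff concrete, here's a throwaway C sketch of the
usual first-order model (my own toy numbers, not a benchmark): transfer
time is roughly latency + size/bandwidth, so tiny messages are all about
latency and big ones are all about bandwidth.

/* toy model: time(us) = latency(us) + bytes / bandwidth(MB/s)
 * (1 MB/s is ~1 byte/us, so bytes / (MB/s) comes out in microseconds)
 * the figures below are the rough 2003-era numbers quoted above. */
#include <stdio.h>

static double msg_time_us(double lat_us, double bw_MBps, double bytes)
{
    return lat_us + bytes / bw_MBps;
}

int main(void)
{
    double sizes[] = { 64, 4096, 1048576 };            /* bytes */
    for (int i = 0; i < 3; i++)
        printf("%8.0f B: myrinet %7.1f us, gigE %7.1f us\n",
               sizes[i],
               msg_time_us(4.0, 250.0, sizes[i]),      /* myrinet 2g */
               msg_time_us(50.0, 100.0, sizes[i]));    /* tuned gigabit */
    return 0;
}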

> Traditionally - going back a decade or more - I think the assumptions
> have been that a single server has one or maybe two CPU cores and that

well, that's mostly a matter of fab technology - multiprocessors have been
around a long time, and bus-based SMP got to be fairly off-the-shelf in the 
80s.  buses are not scalable, so everything went to point-to-point links.
putting multiple cores on a chip was pretty much the result of Moore's law.
(or at least designers not finding ways to use the abundance of transistors
to produce faster single cores...)

> the only way to get a single system to do more work is to interconnect a
> large number of these servers with a good file system and fast

well, you can always make a single system faster - it just becomes 
outlandishly expensive.  don't forget that the original supercomputers
(say, Cray) were not cheap!  they were arguably also clusters, though
mostly bespoke.

a key issue you haven't touched on is the schism between shared memory
and distributed memory (message passing).

> efficiently) can do the job well.  All these things - power supplies,
> motherboards, CPUs, RAM, Ethernet (one or two 1GBps Ethernet on the
> motherboard) and Ethernet switches are inexpensive and easy to throw
> together into a smallish cluster.

and particular choices are constantly changing - 10G ethernet is becoming
cheap enough to compete, even though it's significantly slower than IB.

> However, creating software to use it efficiently can be very tricky -

I always have trouble with statements like this.  either it doesn't mean
much, or it's pretty arguable.  software to make effective use of clusters
can be quite simple, depending mostly on the granularity and dataflow
patterns of the workload.  (perhaps that YMMV statement is even worse!)

> unless of course the need can be satisfied by MPI-based software which
> someone else has already written.

well, I don't think there's anything fundamentally different about 
cluster software - if your workload suits windows+office, for instance,
you can pretty easily find that software ;)

> For serious work, the cluster and its software needs to survive power
> outages,

well, it's a cost-benefit tradeoff.  my organization has no power protection
on any compute nodes, though we do UPSify the storage.  interestingly, the
power failure rates differ at our various sites - it might have made sense 
to put UPSes at one of our sites, for instance.  (HPC codes normally have 
checkpointing, so that the computation can be restarted after a failure;
historically, this is because HPC calculations often push the boundaries of 
both cluster size and execution size.  if one node has a 1-year mean time 
to failure, then even a small 100-node cluster will fail, in part, every 
few days.  an HPC job that consumes 1000 cpu-years is not extremely big...)
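
for anyone who wants the arithmetic spelled out: assuming independent
node failures with the same per-node MTTF, the cluster sees some node
fail about every MTTF/N.  a trivial C sketch (my own illustration):

/* expected interval between node failures in an N-node cluster,
 * assuming independent failures and identical per-node MTTF. */
#include <stdio.h>

int main(void)
{
    double mttf_days = 365.0;            /* 1-year per-node MTTF, as above */
    int    nodes     = 100;
    printf("some node fails roughly every %.1f days\n",
           mttf_days / nodes);           /* ~3.7 days for 100 nodes */
    return 0;
}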

> failure of individual servers and memory errors, so ECC memory
> is a good investment . . . which typically requires more expensive
> motherboards and CPUs.

well, the biggest problem of running without ECC is that without it,
you can't necessarily tell when a bit has flipped.  you expect more 
ECC-fixable events the more memory you have, and the harder you use it.
you need to do the cost-benefit analysis to decide - if your jobs are 
all small and cache-friendly, or can be verified when complete, it 
might make sense to omit ECC.

quality is a slippery slope, though: once you start selecting 
higher-quality components, the cost savings of omitting, say, ECC,
start to look negligible.

> I understand that the most serious limitation of this approach is the
> bandwidth and latency (how long it takes for a message to get to the
> destination server) of 1Gbps Ethernet.  The most obvious alternatives
> are using multiple 1Gbps Ethernet connections per server (but this is
> complex and only marginally improves bandwidth, while doing little or
> nothing for latency) or upgrading to Infiniband.  As far as I know,
> Infiniband is exotic and expensive compared to the mass market
> motherboards etc. from which a Beowulf cluster can be made.  In other

well, "mass market" is a fuzzy term.  for instance, features like IB or ECC
are produced in large enough volume to enjoy economies of scale.  Gb is 
cheap mainly because it became standard in even the highest-volume consumer
systems.  IB never will, and probably neither will ECC.

> words, I think Infiniband is required to make a cluster work really
> well, but it does not not (yet) meet the original Beowulf goal of being
> inexpensive and commonly available.

higher-priced components are not "against the rules" for Beowulf - 
it's just that once you start upgrading pieces, there is a tendency to 
wind up with all gold-plated components from a gold-plated vendor.
that extreme is pretty non-Beowulf, IMO.

> I think this model of HPC cluster computing remains fundamentally true,
> but there are two important developments in recent years which either
> alter the way a cluster would be built or used or which may make the
> best solution to a computing problem no longer a cluster.  These
> developments are large numbers of CPU cores per server, and the use of
> GPUs to do massive amounts of computing, in a single inexpensive graphic
> card - more crunching than was possible in massive clusters a decade
> earlier.

I would claim that neither is terribly important.  multicore is emphatically
nothing new, since the market has always balanced between shared and
distributed memory.  GPUs are, IMO, mostly a cul-de-sac: significant 
mainly because they provide a way to explore assumptions about
ops-per-watt, data parallelism, and the importance of language support.
I don't really expect to have GPU clusters in 10 years - at least not 
ones as recognizably distinct as Cuda/OpenCL cards in conventional nodes.

> The ideal computing system would have a single CPU core which could run
> at arbitrarily high frequencies, with low latency, high bandwidth,
> access to an arbitrarily large amount of RAM, with matching links to
> hard disks or other non-volatile storage systems, with a good Ethernet
> link to the rest of the world.

talking about ideals is a bit like discussing religion.

the devil is in the tradeoffs, I suppose.  since power dissipation 
is a superlinear function of frequency, we can't have an infinitely 
fast clock.  since clocks are power-limited, they are also space-limited
as well as fanout-limited.  that means inter-cpu links will never be 
as fast as you want, and neither will your ram channel.

maybe it's simpler: the devil is heat ;)

OTOH, it's not hard to imagine heterodox architectures to address this:
stick your cpus in the same package as your ram, and connect lots of 
these units using a scalable mesh.  I've always been a little surprised
that GPUs aren't going in this direction.

> While CPU clock frequencies and computing effort per clock frequency
> have been growing slowly for the last 10 years or so, there has been a

I'm actually skeptical of the conventional story that there has been 
some kind of "march of the clock" driving the industry.  I think there 
have been certain fairly discrete advances that produced clear boosts
in the performance of certain components.  RISC and pipelining, for instance,
changed the picture of practical-to-engineer CPU designs.  a series of 
fab advances (including but not limited to Leff scaling) led to certain
step-wise advances in power efficiency/clock vs devices-per-chip.
moving from buses to single-ended channels was another discrete advance, 
and enabled trainable drivers, ddr/qdr, etc.  PRML recording.  MLC flash.

> Most mass market motherboards are for a single CPU device, but there are
> a few two and four CPU motherboards for Intel and AMD CPUs.

depends on your definition of "mass".  I would claim that hardware has 
outrun consumer demand for compute speed (recent consumer improvements
have been mostly in price and lower power.)  for servers, two-socket 
boards have been the norm for many years.  multicore was a hot topic 
for a while, but these days is just hot (ie, power-limited.)

> It is possible to get 4 (mass market) 6, 8, 12 or sometimes 16 CPU cores
> per CPU device.  I think the 4 core i7 CPUs or their ECC-compatible Xeon
> equivalents are marginally faster than those with 6 or 8 cores.

higher-cored chips tend to run at lower clocks, for power reasons.
"server" chips tend to run at lower clocks because there is always 
*some* frequency-reliability tradeoff.

> In all cases, as far as I know, combining multiple CPU cores and/or
> multiple CPU devices results in a single computer system, with a single
> operating system and a single body of memory, with multiple CPU cores
> all running around in this shared memory.

there have been systems that put multiple cpus onto a single board 
without shared memory.  there have been systems that run multiple OSs
concurrently within the same memory domain, as well.

> I have no clear idea how each
> CPU core knows what the other cores have written to the RAM they are
> using,

well, first of all, cpus only read/write to caches.  it's the caches
that keep track of each other and/or memory. 
google for "cache coherency protocol".
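
if you want to see coherence traffic from user space, the classic demo
is false sharing: two threads hammering counters that happen to live in
the same cacheline, so the line ping-pongs between the two caches.  a
rough pthreads sketch (illustrative only - the point is to time it with
and without the padding):

/* false-sharing demo: both counters sit in one cacheline, so the caches
 * fight over it; insert ~64 bytes of padding between a and b and the
 * slowdown disappears. */
#include <pthread.h>
#include <stdio.h>

static struct { volatile long a; volatile long b; } shared;

static void *bump_a(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared.a, shared.b);
    return 0;
}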

> Magny-Cours Opteron 6100 and Interlagos 6200 devices solve these
> problems better than current Intel chips.

not really.  back in the old days of buses, cpu/cache modules would 
"snoop" all transactions taking place over a shared memory bus.  when 
buses went away, the same basic protocols worked by broadcast within
a fabric of point-to-point-connected caches.  the other basic approach
is for caches to track which other caches have a particular cacheline,
in what state.  Intel and AMD are very broadly comparable.  (AMD has 
an optional directory-accelerated mode (but I'm not sure whether it's 
turned on for 2-node systems at all), and Intel has such support in
some of its higher-end chips such as those for SGI UV.  I think that 
conventional wisdom is that smallish node counts are more efficient 
with broadcast rather than accepting the overhead of the directory.)

> to buy 2 and 4 socket motherboards (though they are not mass-market, and
> may require fancy power supplies)

well, most server boards use the ATX-SSI power standard, which is pretty
high volume, though perhaps not consumer-level.  more sockets use more power
too, but we're not talking about anything crazy.  (if you wanna see crazy,
look at the power-supply-size fetishists in the gamer community!)
basically, servers are usually rack-mounted, and the rack-mount industry
tends to provide the right kind of PSUs.

> I understand the G34 Opterons have a separate Hypertransport link

for a few years, AMD had a real technical advantage in HT,
but they squandered it, and Intel's been using QPI since Nehalem.

> I understand that MPI works identically from the programmer's
> perspective between CPU-cores on a shared memory computer as between
> CPU-cores on separate servers.  However, the performance (low latency
> and high bandwidth) of these communications within a single shared
> memory system is vastly higher than between any separate servers, which
> would rely on Infiniband or Ethernet.

well, I think "vastly" is overstating the case.  it's noticable, sure.
I think sometimes people conflate the numbers - for instance, 1-2 us
IB latency vs <50ns dram latency.  but passing a message in shared memory
takes more time than one dram fetch...  (bandwidth difference is around a
factor of 10x.)
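
the usual way to see those numbers for yourself is a ping-pong
microbenchmark.  a bare-bones MPI sketch (run it with 2 ranks, e.g.
mpirun -np 2; the same binary measures shared memory within a node or
IB/ethernet across nodes):

/* ping-pong between rank 0 and rank 1; half the average round-trip
 * time is the usual "latency" figure. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.2f us\n", (t1 - t0) / iters / 2 * 1e6);

    MPI_Finalize();
    return 0;
}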

> So even if you have, or are going to write, MPI-based software which can
> run on a cluster, there may be an argument for not building a cluster as
> such, but for building a single motherboard system with as many as 64
> CPU cores.

of course!  IMO the strongest argument for shared memory machines is always
that one is easier to manage than many.  and you can always cluster fat
nodes if they are a major win for your code...

> I think the major new big academic cluster projects focus on getting as
> many CPU cores as possible into a single server, while minimising power
> consumption per unit of compute power, and then hooking as many as
> possible of these servers together with Infiniband.

isn't that like saying we prefer really good apple pie over bad apple pie?

> a single-threaded program, but I think it will be practical and a lot
> easier than writing and debugging an MPI based program running either on
> on multiple servers or on multiple CPU-cores on a single server.

the good thing about MPI is that separate ranks have very clear interactions
(ie, via messages).  the good thing about threads is that they can
interact more flexibly - in particular, they can share read-mostly memory.
the fact that threads can write shared memory is basically dual with 
message passing, so equivalent in that sense.

I would claim that the clear/limited interaction among MPI ranks actually
makes it easier to reason about your program, compared to threads.
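
a small example of what "share read-mostly memory" buys you: with
threads, every worker can walk the same big table in place, no copies
and no locking needed once it's built; MPI ranks would each hold a
replica or ship pieces around explicitly.  a pthreads sketch (made-up
workload, just to show the shape):

/* N threads all read the same table; it's read-only after setup,
 * so there is nothing to synchronize and nothing to copy. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE 1000000
#define NTHREADS   4

static double table[TABLE_SIZE];          /* read-mostly shared data */

static void *partial_sum(void *arg)
{
    long id = (long)arg, chunk = TABLE_SIZE / NTHREADS;
    double *sum = malloc(sizeof *sum);
    *sum = 0.0;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        *sum += table[i];                 /* reads only, no locking */
    return sum;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < TABLE_SIZE; i++)
        table[i] = 1.0;                   /* setup, single-threaded */

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        void *ret;
        pthread_join(tid[t], &ret);
        total += *(double *)ret;
        free(ret);
    }
    printf("total = %.0f\n", total);
    return 0;
}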

OTOH, I can't really imagine how music synthesis could use enough 
compute power to keep more than a few cores busy, no matter what
programming model you choose.  I'd expect it to be memory-limited,
if anything.

offhand, I'd guess that GP-GPU would work pretty well for rendering audio,
for obvious reasons.

> Modern mass-market graphics cards have dozens or hundreds of CPUs in
> what I think is a shared, very fast, memory environment.  These CPUs are

hundreds, yes.  you should think of these cores as vector units, though:
they're all doing the same thing to different data.

> The recent thread on the Titan Supercomputer exemplifies this approach -
> get as many CPU-cores and GPU-cores as possible onto a single

GP-GPU is far from new.  the main appeal is ops-per-watt and -per-volume.

regards, mark.


