[Beowulf] Mark Hahn's Beowulf/Cluster/HPC mini-FAQ for newbies & some further thoughts

Sat Nov 3 19:28:42 PDT 2012

Especially informative! Your specific examples shed light on just the sort
of things I was trying to better understand.
A thousand thanks:)

On Sun, Nov 4, 2012 at 9:55 AM, Robin Whittle <rw at firstpr.com.au> wrote:

> On 2012-10-31 CJ O'Reilly asked some pertinent questions about HPC,
> Cluster, Beowulf computing from the perspective of a newbie:
>
>   http://www.beowulf.org/pipermail/beowulf/2012-October/030359.html
>
> The replies began in the "Digital Image Processing via
> HPC/Cluster/Beowulf - Basics"  thread on the November pages of the
> Beowulf Mailing List archives.  (The archives strip off the "Re: " from
> the replies' subject line.)
>
>   http://www.beowulf.org/pipermail/beowulf/2012-November/thread.html
>
> Mark Hahn wrote a response which CJ O'Reilly and I found most
> informative.  For the benefit of other lost souls wandering into this
> fascinating field, I nominate Mark's reply as a:
>
>    Beowulf/Cluster/HPC FAQ for newbies
>    Mark Hahn 2012-11-03
>    http://www.beowulf.org/pipermail/beowulf/2012-November/030363.html
>
> Quite likely the other replies and those yet to come will be well worth
> reading too, so please check the November thread index, and potentially
> December if the discussions continue there.  Mark wrote a second
> response exploring options for some hypothetical image processing problems.
>
> Googling for an HPC or Cluster FAQ led to FAQs for people who are users
> of specific clusters.  I couldn't easily find a FAQ suitable for people
> with general computer experience wondering what they can and can't do,
> or at least what would be wise to do or not do, with clusters, High
> Performance Computing etc.
>
> The Wikipedia articles:
>
>   http://en.wikipedia.org/wiki/High-performance_computing
>   http://en.wikipedia.org/wiki/High-throughput_computing
>
> are of general interest.
>
> Here are some thoughts which might complement what Mark wrote.  I am a
> complete newbie to this field and I hope more experienced people will
> correct or expand on the following.
>
> HPC has quite a long history and my understanding of the Beowulf concept
> is to make clusters using easily available hardware and software, such
> as operating systems, servers and interconnects.  The interconnects, I
> think, are invariably Ethernet, since anything else is, or at least was
> until recently, exotic and therefore expensive.
>
> Traditionally - going back a decade or more - I think the assumptions
> have been that a single server has one or maybe two CPU cores and that
> the only way to get a single system to do more work is to interconnect a
> large number of these servers with a good file system and fast
> inter-server communications so that suitably written software (such as
> something written to use MPI, so that the instances of the software on
> multiple servers and/or CPU cores can work on the one problem
> efficiently) can do the job well.  All these things - power supplies,
> motherboards, CPUs, RAM, Ethernet (one or two 1Gbps Ethernet ports on the
> motherboard) and Ethernet switches are inexpensive and easy to throw
> together into a smallish cluster.
>
> However, creating software to use it efficiently can be very tricky -
> unless of course the need can be satisfied by MPI-based software which
> someone else has already written.
>
> For serious work, the cluster and its software need to survive power
> outages, failure of individual servers and memory errors, so ECC memory
> is a good investment . . . which typically requires more expensive
> motherboards and CPUs.
>
> I understand that the most serious limitation of this approach is the
> bandwidth and latency (how long it takes for a message to get to the
> destination server) of 1Gbps Ethernet.  The most obvious alternatives
> are using multiple 1Gbps Ethernet connections per server (but this is
> complex and only marginally improves bandwidth, while doing little or
> nothing for latency) or upgrading to Infiniband.  As far as I know,
> Infiniband is exotic and expensive compared to the mass market
> motherboards etc. from which a Beowulf cluster can be made.  In other
> words, I think Infiniband is required to make a cluster work really
> well, but it does not (yet) meet the original Beowulf goal of being
> inexpensive and commonly available.
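
The bandwidth and latency trade-off described above can be put in rough
numbers with the usual first-order model: time = latency + size / bandwidth.
The sketch below is in Python for brevity.  The latency figures (50
microseconds for 1Gbps Ethernet, 2 microseconds for Infiniband) and the
32Gbps Infiniband data rate are illustrative assumptions, not measurements
from this thread.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bits_per_s):
    """First-order cost of one message: fixed latency plus serialisation time."""
    return latency_s + (message_bytes * 8) / bandwidth_bits_per_s

# Assumed, illustrative link parameters (not measured values).
GIGE = {"latency_s": 50e-6, "bandwidth_bits_per_s": 1e9}
INFINIBAND = {"latency_s": 2e-6, "bandwidth_bits_per_s": 32e9}

if __name__ == "__main__":
    for size in (64, 1_000_000):
        t_eth = transfer_time(size, **GIGE)
        t_ib = transfer_time(size, **INFINIBAND)
        print(f"{size:>9} bytes: GigE {t_eth * 1e6:9.1f} us, "
              f"Infiniband {t_ib * 1e6:9.1f} us")
```

For small messages the latency term dominates, which is consistent with
the point that extra 1Gbps links add some bandwidth but do little or
nothing for latency.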
>
>
> I think this model of HPC cluster computing remains fundamentally true,
> but there are two important developments in recent years which either
> alter the way a cluster would be built or used or which may make the
> best solution to a computing problem no longer a cluster.  These
> developments are large numbers of CPU cores per server, and the use of
> GPUs to do massive amounts of computing, in a single inexpensive graphics
> card - more crunching than was possible in massive clusters a decade
> earlier.
>
> The ideal computing system would have a single CPU core which could run
> at arbitrarily high frequencies, with low latency, high bandwidth,
> access to an arbitrarily large amount of RAM, with matching links to
> hard disks or other non-volatile storage systems, with a good Ethernet
> link to the rest of the world.
>
> While CPU clock frequencies and the work done per clock cycle
> have been growing slowly for the last 10 years or so, there has been a
> continuing increase in the number of CPU cores per CPU device (typically
> a single chip, but sometimes multiple chips in a device which is plugged
> into the motherboard) and in the number of CPU devices which can be
> plugged into a motherboard.
>
> Most mass market motherboards are for a single CPU device, but there are
> a few two and four CPU motherboards for Intel and AMD CPUs.
>
> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores
> per CPU device.  I think the 4 core i7 CPUs or their ECC-compatible Xeon
> equivalents are marginally faster per core than those with 6 or 8 cores.
>
> In all cases, as far as I know, combining multiple CPU cores and/or
> multiple CPU devices results in a single computer system, with a single
> operating system and a single body of memory, with multiple CPU cores
> all running around in this shared memory.  I have no clear idea how each
> CPU core knows what the other cores have written to the RAM they are
> using, since each core is reading and writing via its own cache of the
> memory contents.  This raises the question of inter-CPU-core
> communications, within a single CPU chip, between chips in a multi-chip
> CPU module, and between multiple CPU modules on the one motherboard.
>
> It is my impression that the AMD socket G34 Opterons:
>
>   http://en.wikipedia.org/wiki/Opteron#Socket_G34
>
> Magny-Cours Opteron 6100 and Interlagos 6200 devices solve these
> problems better than current Intel chips.  For instance, it is possible
> to buy 2 and 4 socket motherboards (though they are not mass-market, and
> may require fancy power supplies) and then to plug in 8, 12 or 16 core
> CPU devices into them, with a bunch of DDR3 memory (ECC or not) and so
> make yourself a single shared memory computer system with compute power
> which would have only been achievable with a small cluster 5 years ago.
>
> I understand that on a 4 CPU-module motherboard, each G34 Opteron has a
> separate Hypertransport link to each of the other three CPU-modules.
>
> I understand that MPI works identically from the programmer's
> perspective between CPU-cores on a shared memory computer as between
> CPU-cores on separate servers.  However, the performance (low latency
> and high bandwidth) of these communications within a single shared
> memory system is vastly higher than between any separate servers, which
> would rely on Infiniband or Ethernet.
>
> So even if you have, or are going to write, MPI-based software which can
> run on a cluster, there may be an argument for not building a cluster as
> such, but for building a single motherboard system with as many as 64
> CPU cores.
>
> I think the major new big academic cluster projects focus on getting as
> many CPU cores as possible into a single server, while minimising power
> consumption per unit of compute power, and then hooking as many as
> possible of these servers together with Infiniband.
>
> Here is a somewhat rambling discussion of my own thoughts regarding
> clusters and multi-core machines, for my own purposes.  My interests in
> high performance computing involve music synthesis and physics simulation.
>
> There is an existing, single-threaded (written in C, can't be made
> multithreaded in any reasonable manner) music synthesis program called
> Csound.  I want to use this now, but as a language for synthesis, I
> think it is extremely clunky.  So I plan to write my own program - one
> day . . .   When I do, it will be written in C++ and multithreaded, so
> it will run nicely on multiple CPU-cores in a single machine.  Writing
> and debugging a multithreaded program is more complex than doing so for
> a single-threaded program, but I think it will be practical and a lot
> easier than writing and debugging an MPI based program running either on
> multiple servers or on multiple CPU-cores on a single server.
>
> I want to do some simulation of electromagnetic wave propagation using
> an existing and widely used MPI-based (C++, open source) program called
> Meep.  This can run as a single thread, if there is enough RAM, or the
> problem can be split up to run over multiple threads using MPI
> communication between the threads.  If this is done on a single server,
> then the MPI communication is done really quickly, via shared memory,
> which is vastly faster than using Ethernet or Infiniband to other
> servers.  However, this places a limit on the number of CPU-cores and
> the total memory.  When simulating three dimensional models, the RAM and
> CPU requirements can easily become extreme.  Meep was written to
> split the problem into multiple zones, and to work efficiently with MPI.
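
The idea of splitting a simulation into zones can be sketched as a
one-dimensional slab decomposition.  This is not Meep's actual
partitioning code, only a minimal illustration in Python; the function
name slab_ranges and the grid sizes are hypothetical.

```python
def slab_ranges(n_planes, n_workers):
    """Split n_planes grid planes into n_workers contiguous, near-equal slabs.

    Returns a list of (start, stop) half-open index ranges, one per worker.
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    base, extra = divmod(n_planes, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional plane.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

if __name__ == "__main__":
    for worker, (start, stop) in enumerate(slab_ranges(100, 8)):
        print(f"worker {worker}: planes {start}..{stop - 1}")
```

In a finite-difference scheme each worker would typically exchange only
its boundary planes with its neighbours at every time step, and that
exchange is the traffic which travels over shared memory, Ethernet or
Infiniband.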
>
> For Csound, my goal is to get a single piece of music synthesised in as
> few hours as possible, so as to speed up the iterative nature of the
> cycle: alter the composition, render it, listen to it and alter the
> composition again.  Probably the best solution to this is to buy a
> bleeding edge clock-speed (3.4GHz?) 4-core Intel i7 CPU and motherboard,
> which are commodity consumer items for the gaming market.  Since Csound
> is probably limited primarily by integer and floating point processing,
> rather than access to large amounts of memory or by reading and writing
> files, I could probably render three or four projects in parallel on a
> 4-core i7, with each rendering nearly as fast as if just one was being
> rendered.  However, running four projects in parallel is of only
> marginal benefit to me.
>
> If I write my own music synthesis program, it will be in C++ and will be
> designed for multithreading via multiple CPU-cores in a shared memory
> (single motherboard) environment.  It would be vastly more difficult to
> write such a program using MPI communications.  Music projects can
> easily consume large amounts of CPU and memory power, but I would rather
> concentrate on running a hot-shot single motherboard multi-CPU shared
> memory system and writing multi-threaded C++ than to try to write it for
> MPI.  I think any MPI-based software would be more complex and generally
> slower, even on a single server (MPI communications via shared memory
> rather than Ethernet/Infiniband) than one which used multiple threads
> each communicating via shared memory.
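
The shared-memory design described above amounts to splitting a render
into independent blocks and handing them to a pool of worker threads.  A
minimal sketch follows, in Python only for brevity: CPython threads will
not actually speed up this arithmetic because of the global interpreter
lock, so the real program would use C++ threads as planned.  The function
names, the block size and the sine-wave "instrument" are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE = 44100  # samples per second (assumed)

def render_block(start, stop, freq_hz=440.0):
    """Render samples [start, stop) of a sine tone; blocks are independent."""
    return [math.sin(2.0 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(start, stop)]

def render_parallel(n_samples, n_threads=4, block_size=4096):
    """Render n_samples by farming fixed-size blocks out to a thread pool."""
    starts = range(0, n_samples, block_size)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() yields results in submission order, so blocks join correctly.
        blocks = pool.map(
            lambda s: render_block(s, min(s + block_size, n_samples)), starts)
        return [sample for block in blocks for sample in block]

if __name__ == "__main__":
    audio = render_parallel(SAMPLE_RATE)  # one second of audio
    print(len(audio), "samples rendered")
```

Because every block depends only on its own sample indices, the threads
need no locking; a real synthesiser with filters or reverbs carrying
state across blocks would need more care.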
>
> Ten or 15 years ago, the only way to get more compute power was to build
> a cluster and therefore to write the software to use MPI.  This was
> because CPU-devices had a single core (Intel Pentium 3 and 4) and
> because it was rare to find motherboards which handled multiple such chips.
>
> Now, with 4 CPU-module motherboards it is totally different.  A starting
> point would be to get a 2-socket motherboard and plug 8-core Opterons
> into it.  This can be done for less than $1000, not counting RAM and
> power supply.  For instance the Asus KGPE-D16 can be bought for $450.  8
> core 2.3GHz G34 Opterons can be found on eBay for $200 each.
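
The "less than $1000" figure follows directly from the prices quoted
above.  A quick check in Python (the variable names are only for
illustration):

```python
motherboard_usd = 450   # Asus KGPE-D16, price quoted above
cpu_usd = 200           # 8-core 2.3GHz G34 Opteron on eBay, price quoted above
sockets = 2
cores_per_cpu = 8

total_usd = motherboard_usd + sockets * cpu_usd
total_cores = sockets * cores_per_cpu

print(f"${total_usd} for {total_cores} cores, "
      f"about ${total_usd / total_cores:.0f} per core")
```

RAM and power supply are excluded, as in the estimate above.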
>
> I suspect that the bleeding edge mass-market 4 core Intel i7 CPUs are
> probably faster per core than these Opterons.  I haven't researched
> this, but they have a much higher clock rate at 3.6GHz and the CPU design
> is the latest product of Intel's formidable R&D.  On the other hand, the
> Opterons have Hypertransport and I think have a generally strong
> floating point performance.  (I haven't found benchmarks which
> reasonably compare these.)
>
> I guess that many more 8 and 12 core Opterons may come on the market as
> people upgrade their existing systems to use the 16 core versions.
>
> The next step would be to get a 4 socket motherboard from Tyan or
> SuperMicro for $800 or so and populate it with 8, 12 or (if money
> permits) 16 core CPUs and a bunch of ECC RAM.
>
> My forthcoming music synthesis program would run fine with 8 or 16GB of
> RAM.  So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron
> machines would do the trick nicely.
>
> This involves no cluster, HPC or fancy interconnect techniques.
>
> Meep can very easily be limited by available RAM.  A cluster solves
> this, in principle, since it is easy (in principle) to get an arbitrary
> number of servers and run a single Meep project on all of them.
> However, this would be slow compared to running the whole thing on
> multiple CPU-cores in a single server.
>
> I think the 4 socket G34 Opteron motherboards are probably the best way
> to get a large amount of RAM into a single server.  Each CPU-module
> socket has its own set of DDR3 RAM, and there are four of these.  The
> Tyan S8812:
>
>   http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=670&SKU=600000180
>
> has 8 memory slots per CPU-module socket.  Populated with 32 x 8GB ECC
> memory at about $80 each, this would be 256GB of memory for $2,560.
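
The memory total and cost above can be checked with a few lines of
Python (the variable names are only for illustration; the figures are
the ones quoted in the text):

```python
sockets = 4
slots_per_socket = 8    # Tyan S8812, as described above
dimm_gb = 8
dimm_usd = 80           # approximate price quoted above

dimms = sockets * slots_per_socket
total_gb = dimms * dimm_gb
total_usd = dimms * dimm_usd

print(f"{dimms} DIMMs, {total_gb}GB, ${total_usd}")
```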
>
> As far as I know, if the problem requires more memory than this, then I
> would need to use multiple servers in a cluster with MPI communications
> via Ethernet or Infiniband.
>
> However, if the problem can be handled by 32 to 64 CPU-cores with 256GB
> of RAM, then doing it in a single server as described would be much
> faster and generally less expensive than spreading the problem over
> multiple servers.
>
> The above is a rough explanation for why the increasing number of
> CPU-cores per motherboard, together with increased number of DIMM slots
> per motherboard with increased affordable memory per DIMM slot means
> that many projects which in the past could only be handled with a
> cluster, can now be handled faster and less expensively with a single
> server.
>
> This "single powerful multi-core server" approach is particularly
> interesting to me regarding writing my own programs for music synthesis
> or physics simulation.  The simplest approach would be to write the
> software in single-threaded C++.  However, that won't make use of the
> CPU power, so I need to use the inbuilt multithreading capabilities of
> C++.  This requires a more complex program design, but I figure I can
> cope with this.
>
> In principle I could write a single-thread physics simulation program
> and access massive memory via a 4 CPU-module Opteron motherboard, but
> any such program would be performance limited by the CPU speed, so it
> would make sense to split it up over multiple CPU-cores with
> multithreading.
>
> For my own projects, I think this is the way forward.  As far as I can
> see, I won't need more memory than 256GB or more CPU power than I can
> get on a single motherboard.  If I did, then I would need to write or
> use MPI-based software and build a cluster - or get some time on an
> existing cluster.
>
>
> There is another approach as well - writing software to run on GPUs.
> Modern mass-market graphics cards have dozens or hundreds of CPUs in
> what I think is a shared, very fast, memory environment.  These CPUs are
> particularly good at floating point and can be programmed to do all
> sorts of things, but only using a specialised subset of C / C++.  I want
> to write in full C++.  However, for many people, these GPU systems are
> by far the cheapest form of compute power available.  This raises
> questions of programming them, running several such GPU boards in a
> single server, and running clusters of such servers.
>
> The recent thread on the Titan Supercomputer exemplifies this approach -
> get as many CPU-cores and GPU-cores as possible onto a single
> motherboard, in a reasonably power-efficient manner and then wire as
> many as possible together with Infiniband to form a cluster.
>
> Mass market motherboards and graphics cards with Ethernet are arguably
> "Beowulf".  If and when Infiniband turns up on mass-market motherboards
> without a huge price premium, that will be "Beowulf" too.
>
> I probably won't want or need a cluster, but I lurk here because I find
> it interesting.
>
> Multi-core conventional CPUs (64 bit Intel x86 compatible, running
> GNU-Linux) and multiple such CPU-modules on a motherboard are chewing
> away at the bottom end of the cluster field.  Likewise GPUs if the
> problem can be handled by these extraordinary devices.
>
> With clusters, since the inter-server communication system (Infiniband
> or Ethernet) - with the accompanying software requirements (MPI and
> splitting the problem over multiple servers which do not share a single
> memory system) is the most serious bottleneck, the best approach seems
> to be to make each server as powerful as possible, and then use as few
> of them as necessary.
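
The "as few servers as necessary" rule can be written down as a small
sizing helper.  This is only a sketch in Python: the function name
servers_needed is hypothetical, and the defaults of 64 cores and 256GB
per server are taken from the 4-socket Opteron configuration discussed
above.

```python
import math

def servers_needed(cores_required, ram_gb_required,
                   cores_per_server=64, ram_gb_per_server=256):
    """Smallest number of identical servers covering both cores and RAM."""
    by_cores = math.ceil(cores_required / cores_per_server)
    by_ram = math.ceil(ram_gb_required / ram_gb_per_server)
    return max(1, by_cores, by_ram)

if __name__ == "__main__":
    print(servers_needed(48, 200))    # fits in a single server
    print(servers_needed(48, 1000))   # RAM-bound: several servers plus MPI
```

Only when the answer exceeds one does the problem need MPI over Ethernet
or Infiniband at all.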
>
>  - Robin         http://www.firstpr.com.au
>
>
