Especially informative! Your specific examples shed light on just the sort of things I was trying to better understand.<br>A thousand thanks :)<br><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Nov 4, 2012 at 9:55 AM, Robin Whittle <span dir="ltr">&lt;<a href="mailto:rw@firstpr.com.au" target="_blank">rw@firstpr.com.au</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 2012-10-31 CJ O'Reilly asked some pertinent questions about HPC,<br>

Cluster, Beowulf computing from the perspective of a newbie:<br>

<br>

  <a href="http://www.beowulf.org/pipermail/beowulf/2012-October/030359.html" target="_blank">http://www.beowulf.org/pipermail/beowulf/2012-October/030359.html</a><br>

<br>

The replies began in the "Digital Image Processing via<br>

HPC/Cluster/Beowulf - Basics"  thread on the November pages of the<br>

Beowulf Mailing List archives.  (The archives strip off the "Re: " from<br>

the replies' subject line.)<br>

<br>

  <a href="http://www.beowulf.org/pipermail/beowulf/2012-November/thread.html" target="_blank">http://www.beowulf.org/pipermail/beowulf/2012-November/thread.html</a><br>

<br>

Mark Hahn wrote a response which I and CJ O'Reilly found most<br>

informative.  For the benefit of other lost souls wandering into this<br>

fascinating field, I nominate Mark's reply as a:<br>

<br>

   Beowulf/Cluster/HPC FAQ for newbies<br>

   Mark Hahn 2012-11-03<br>

   <a href="http://www.beowulf.org/pipermail/beowulf/2012-November/030363.html" target="_blank">http://www.beowulf.org/pipermail/beowulf/2012-November/030363.html</a><br>

<br>

Quite likely the other replies and those yet to come will be well worth<br>

reading too, so please check the November thread index, and potentially<br>

December if the discussions continue there.  Mark wrote a second<br>

response exploring options for some hypothetical image processing problems.<br>

<br>

Googling for an HPC or Cluster FAQ led to FAQs for people who are users<br>

of specific clusters.  I couldn't easily find a FAQ suitable for people<br>

with general computer experience wondering what they can and can't do,<br>

or at least what would be wise to do or not do, with clusters, High<br>

Performance Computing etc.<br>

<br>

The Wikipedia articles:<br>

<br>

  <a href="http://en.wikipedia.org/wiki/High-performance_computing" target="_blank">http://en.wikipedia.org/wiki/High-performance_computing</a><br>

  <a href="http://en.wikipedia.org/wiki/High-throughput_computing" target="_blank">http://en.wikipedia.org/wiki/High-throughput_computing</a><br>

<br>

are of general interest.<br>

<br>

Here are some thoughts which might complement what Mark wrote.  I am a<br>

complete newbie to this field and I hope more experienced people will<br>

correct or expand on the following.<br>

<br>

HPC has quite a long history and my understanding of the Beowulf concept<br>

is to make clusters using easily available hardware and software, such<br>

as operating systems, servers and interconnects.  The interconnects, I<br>

think, are invariably Ethernet, since anything else is, or at least was<br>

until recently, exotic and therefore expensive.<br>

<br>

Traditionally - going back a decade or more - I think the assumptions<br>

have been that a single server has one or maybe two CPU cores and that<br>

the only way to get a single system to do more work is to interconnect a<br>

large number of these servers with a good file system and fast<br>

inter-server communications so that suitably written software (such as<br>

something written to use MPI so the instances of the software on<br>

multiple servers and/or CPU cores can work on the one problem<br>

efficiently) can do the job well.  All these things - power supplies,<br>

motherboards, CPUs, RAM, Ethernet (one or two 1Gbps Ethernet ports on the<br>

motherboard) and Ethernet switches are inexpensive and easy to throw<br>

together into a smallish cluster.<br>

<br>

However, creating software to use it efficiently can be very tricky -<br>

unless of course the need can be satisfied by MPI-based software which<br>

someone else has already written.<br>

<br>

For serious work, the cluster and its software need to survive power<br>

outages, failure of individual servers and memory errors, so ECC memory<br>

is a good investment . . . which typically requires more expensive<br>

motherboards and CPUs.<br>

<br>

I understand that the most serious limitation of this approach is the<br>

bandwidth and latency (how long it takes for a message to get to the<br>

destination server) of 1Gbps Ethernet.  The most obvious alternatives<br>

are using multiple 1Gbps Ethernet connections per server (but this is<br>

complex and only marginally improves bandwidth, while doing little or<br>

nothing for latency) or upgrading to Infiniband.  As far as I know,<br>

Infiniband is exotic and expensive compared to the mass market<br>

motherboards etc. from which a Beowulf cluster can be made.  In other<br>

words, I think Infiniband is required to make a cluster work really<br>

well, but it does not (yet) meet the original Beowulf goal of being<br>

inexpensive and commonly available.<br>
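
<br>

A rough way to see why both figures matter is to model the delivery time<br>

of one message as latency plus size divided by bandwidth.  The Python<br>

sketch below does this.  The 50 microsecond latency is an assumed,<br>

illustrative value, not a measurement of any real network.<br>

<pre>
```python
# Rough model: time to deliver one message = latency + size / bandwidth.
# The latency figure below is an illustrative assumption, not a measurement.

def transfer_time(size_bytes, latency_s, bandwidth_bits_per_s):
    """Return the estimated seconds needed to deliver one message."""
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s

GIGABIT = 1e9             # 1Gbps Ethernet line rate, in bits per second
ASSUMED_LATENCY = 50e-6   # hypothetical 50 microsecond one-way latency

for size in (64, 1500, 1_000_000):
    t = transfer_time(size, ASSUMED_LATENCY, GIGABIT)
    print(f"{size} bytes: {t * 1e6:.1f} microseconds")
```
</pre>

In this model small messages are dominated by the latency term and large<br>

ones by the bandwidth term, which is consistent with extra 1Gbps links<br>

helping bandwidth somewhat while doing little or nothing for latency.<br>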

<br>

<br>

I think this model of HPC cluster computing remains fundamentally true,<br>

but there are two important developments in recent years which either<br>

alter the way a cluster would be built or used or which may make the<br>

best solution to a computing problem no longer a cluster.  These<br>

developments are large numbers of CPU cores per server, and the use of<br>

GPUs to do massive amounts of computing, in a single inexpensive graphics<br>

card - more crunching than was possible in massive clusters a decade<br>

earlier.<br>

<br>

The ideal computing system would have a single CPU core which could run<br>

at arbitrarily high frequencies, with low-latency, high-bandwidth<br>

access to an arbitrarily large amount of RAM, with matching links to<br>

hard disks or other non-volatile storage systems, with a good Ethernet<br>

link to the rest of the world.<br>

<br>

While CPU clock frequencies and computing effort per clock cycle<br>

have been growing slowly for the last 10 years or so, there has been a<br>

continuing increase in the number of CPU cores per CPU device (typically<br>

a single chip, but sometimes multiple chips in a device which is plugged<br>

into the motherboard) and in the number of CPU devices which can be<br>

plugged into a motherboard.<br>

<br>

Most mass market motherboards are for a single CPU device, but there are<br>

a few two and four CPU motherboards for Intel and AMD CPUs.<br>

<br>

It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores<br>

per CPU device.  I think the 4 core i7 CPUs or their ECC-compatible Xeon<br>

equivalents are marginally faster than those with 6 or 8 cores.<br>

<br>

In all cases, as far as I know, combining multiple CPU cores and/or<br>

multiple CPU devices results in a single computer system, with a single<br>

operating system and a single body of memory, with multiple CPU cores<br>

all running around in this shared memory.  I have no clear idea how each<br>

CPU core knows what the other cores have written to the RAM they are<br>

using, since each core is reading and writing via its own cache of the<br>

memory contents.  This raises the question of inter-CPU-core<br>

communications, within a single CPU chip, between chips in a multi-chip<br>

CPU module, and between multiple CPU modules on the one motherboard.<br>

<br>

It is my impression that the AMD socket G34 Opterons:<br>

<br>

  <a href="http://en.wikipedia.org/wiki/Opteron#Socket_G34" target="_blank">http://en.wikipedia.org/wiki/Opteron#Socket_G34</a><br>

<br>

Magny-Cours Opteron 6100 and Interlagos 6200 devices solve these<br>

problems better than current Intel chips.  For instance, it is possible<br>

to buy 2 and 4 socket motherboards (though they are not mass-market, and<br>

may require fancy power supplies) and then to plug in 8, 12 or 16 core<br>

CPU devices into them, with a bunch of DDR3 memory (ECC or not) and so<br>

make yourself a single shared memory computer system with compute power<br>

which would have only been achievable with a small cluster 5 years ago.<br>

<br>

I understand the G34 Opterons have a separate Hypertransport link<br>

between any one CPU-module and the other three CPU modules on the<br>

motherboard on a 4 CPU-module motherboard.<br>

<br>

I understand that MPI works identically from the programmer's<br>

perspective between CPU-cores on a shared memory computer as between<br>

CPU-cores on separate servers.  However, the performance (low latency<br>

and high bandwidth) of these communications within a single shared<br>

memory system is vastly higher than between any separate servers, which<br>

would rely on Infiniband or Ethernet.<br>

<br>

So even if you have, or are going to write, MPI-based software which can<br>

run on a cluster, there may be an argument for not building a cluster as<br>

such, but for building a single motherboard system with as many as 64<br>

CPU cores.<br>

<br>

I think the major new big academic cluster projects focus on getting as<br>

many CPU cores as possible into a single server, while minimising power<br>

consumption per unit of compute power, and then hooking as many as<br>

possible of these servers together with Infiniband.<br>

<br>

Here is a somewhat rambling discussion of my own thoughts regarding<br>

clusters and multi-core machines, for my own purposes.  My interests in<br>

high performance computing involve music synthesis and physics simulation.<br>

<br>

There is an existing, single-threaded (written in C, can't be made<br>

multithreaded in any reasonable manner) music synthesis program called<br>

Csound.  I want to use this now, but as a language for synthesis, I<br>

think it is extremely clunky.  So I plan to write my own program - one<br>

day . . .   When I do, it will be written in C++ and multithreaded, so<br>

it will run nicely on multiple CPU-cores in a single machine.  Writing<br>

and debugging a multithreaded program is more complex than doing so for<br>

a single-threaded program, but I think it will be practical and a lot<br>

easier than writing and debugging an MPI based program running either on<br>

multiple servers or on multiple CPU-cores on a single server.<br>

<br>

I want to do some simulation of electromagnetic wave propagation using<br>

an existing and widely used MPI-based (C++, open source) program called<br>

Meep.  This can run as a single process, if there is enough RAM, or the<br>

problem can be split up to run over multiple processes using MPI<br>

communication between the processes.  If this is done on a single server,<br>

then the MPI communication is done really quickly, via shared memory,<br>

which is vastly faster than using Ethernet or Infiniband to other<br>

servers.  However, this places a limit on the number of CPU-cores and<br>

the total memory.  When simulating three dimensional models, the RAM and<br>

CPU requirements can easily become extreme.  Meep was written to<br>

split the problem into multiple zones, and to work efficiently with MPI.<br>
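
<br>

To make the idea of splitting a problem into zones concrete, here is a<br>

small Python sketch which divides a one dimensional grid into contiguous<br>

zones of nearly equal size.  It is a generic illustration only.  The<br>

function name is made up and I am not claiming this is how Meep itself<br>

divides its grid.<br>

<pre>
```python
# Split a 1-D grid of n_cells into n_parts contiguous zones of near-equal size.
# Generic illustration; not taken from Meep's own source code.

def split_into_zones(n_cells, n_parts):
    """Return a list of (start, stop) index pairs, one per zone."""
    base, extra = divmod(n_cells, n_parts)
    # The first `extra` zones take one additional cell each.
    sizes = [base + 1] * extra + [base] * (n_parts - extra)
    zones = []
    start = 0
    for size in sizes:
        zones.append((start, start + size))
        start += size
    return zones

print(split_into_zones(10, 4))   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```
</pre>

Each process would then hold only its own zone in memory, and would need<br>

to exchange just the values at the zone boundaries with its neighbours.<br>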

<br>

For Csound, my goal is to get a single piece of music synthesised in as<br>

few hours as possible, so as to speed up the iterative nature of the<br>

cycle: alter the composition, render it, listen to it and alter the<br>

composition again.  Probably the best solution to this is to buy a<br>

bleeding edge clock-speed (3.4GHz?) 4-core Intel i7 CPU and motherboard,<br>

which are commodity consumer items for the gaming market.  Since Csound<br>

is probably limited primarily by integer and floating point processing,<br>

rather than access to large amounts of memory or by reading and writing<br>

files, I could probably render three or four projects in parallel on a<br>

4-core i7, with each rendering nearly as fast as if just one was being<br>

rendered.  However, running four projects in parallel is of only<br>

marginal benefit to me.<br>
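
<br>

For what it is worth, running several independent single-threaded jobs<br>

side by side needs no special software.  A minimal Python sketch follows.<br>

The jobs here are placeholders which just start a Python interpreter that<br>

exits immediately.  In practice each entry would be the command line of<br>

one render, which I have not tried to reproduce here.<br>

<pre>
```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_all(commands, max_workers=4):
    """Run each command as a separate OS process, at most max_workers at once.

    Returns the list of exit codes in the same order as `commands`.
    """
    def run_one(cmd):
        return subprocess.run(cmd).returncode

    # The threads only wait on child processes; the real work happens in
    # the children, so each child can occupy its own CPU core.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))

# Placeholder jobs (an assumption for illustration, not real render commands).
jobs = [[sys.executable, "-c", "pass"] for _ in range(4)]
print(run_all(jobs))
```
</pre>

With max_workers set to the number of cores, each job should run at close<br>

to full speed as long as the jobs are limited by CPU rather than by memory<br>

or disk access.<br>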

<br>

If I write my own music synthesis program, it will be in C++ and will be<br>

designed for multithreading via multiple CPU-cores in a shared memory<br>

(single motherboard) environment.  It would be vastly more difficult to<br>

write such a program using MPI communications.  Music projects can<br>

easily consume large amounts of CPU and memory power, but I would rather<br>

concentrate on running a hot-shot single motherboard multi-CPU shared<br>

memory system and writing multi-threaded C++ than try to write it for<br>

MPI.  I think any MPI-based software would be more complex and generally<br>

slower, even on a single server (MPI communications via shared memory<br>

rather than Ethernet/Infiniband) than one which used multiple threads<br>

each communicating via shared memory.<br>
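
<br>

The structure I have in mind can be sketched in a few lines.  I use<br>

Python here only to keep the example short.  Several threads write into<br>

disjoint parts of one shared buffer, with no messages and no copying.  The<br>

function names and the "sample" calculation are made up for illustration.<br>

Note that standard CPython builds normally run only one thread of Python<br>

code at a time, so this shows the shape of the design rather than a real<br>

speed-up.  A C++ version using std::thread would have the same layout.<br>

<pre>
```python
import threading

def fill_block(buffer, start, stop):
    """Write into one slice of the shared buffer; no copying, no messages."""
    for i in range(start, stop):
        buffer[i] = i * i   # stand-in for computing one audio sample

def render_shared(n_samples, n_threads):
    """Fill a buffer of n_samples using n_threads threads, then return it."""
    buffer = [0] * n_samples            # one buffer visible to every thread
    step = -(-n_samples // n_threads)   # ceiling division
    threads = []
    for t in range(n_threads):
        start = min(t * step, n_samples)
        stop = min(start + step, n_samples)
        th = threading.Thread(target=fill_block, args=(buffer, start, stop))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()                       # wait until every slice is written
    return buffer

print(render_shared(10, 4))
```
</pre>

Because each thread writes only to its own slice, no locking is needed in<br>

this particular sketch.  An MPI version would instead have to send each<br>

finished slice back to one process as a message.<br>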

<br>

Ten or 15 years ago, the only way to get more compute power was to build<br>

a cluster and therefore to write the software to use MPI.  This was<br>

because CPU-devices had a single core (Intel Pentium 3 and 4) and<br>

because it was rare to find motherboards which handled multiple such chips.<br>

<br>

Now, with 4 CPU-module motherboards it is totally different.  A starting<br>

point would be to get a 2-socket motherboard and plug 8-core Opterons<br>

into it.  This can be done for less than $1000, not counting RAM and<br>

power supply.  For instance the Asus KGPE-D16 can be bought for $450.  8<br>

core 2.3GHz G34 Opterons can be found on eBay for $200 each.<br>

<br>

I suspect that the bleeding edge mass-market 4 core Intel i7 CPUs are<br>

probably faster per core than these Opterons.  I haven't researched<br>

this, but they have a much faster clock rate at 3.6GHz and the CPU design<br>

is the latest product of Intel's formidable R&D.  On the other hand, the<br>

Opterons have Hypertransport and I think have a generally strong<br>

floating point performance.  (I haven't found benchmarks which<br>

reasonably compare these.)<br>

<br>

I guess that many more 8 and 12 core Opterons may come on the market as<br>

people upgrade their existing systems to use the 16 core versions.<br>

<br>

The next step would be to get a 4 socket motherboard from Tyan or<br>

SuperMicro for $800 or so and populate it with 8, 12 or (if money<br>

permits) 16 core CPUs and a bunch of ECC RAM.<br>

<br>

My forthcoming music synthesis program would run fine with 8 or 16GB of<br>

RAM.  So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron<br>

machines would do the trick nicely.<br>

<br>

This involves no cluster, HPC or fancy interconnect techniques.<br>

<br>

Meep can very easily be limited by available RAM.  A cluster solves<br>

this, in principle, since it is easy (in principle) to get an arbitrary<br>

number of servers and run a single Meep project on all of them.<br>

However, this would be slow compared to running the whole thing on<br>

multiple CPU-cores in a single server.<br>

<br>

I think the 4 socket G34 Opteron motherboards are probably the best way<br>

to get a large amount of RAM into a single server.  Each CPU-module<br>

socket has its own set of DDR3 RAM, and there are four of these.  The<br>

Tyan S8812:<br>

<br>

<br>

<a href="http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=670&SKU=600000180" target="_blank">http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=670&SKU=600000180</a><br>

<br>

has 8 memory slots per CPU-module socket.  Populated with 32 x 8GB ECC<br>

memory at about $80 each, this would be 256GB of memory for $2,560.<br>
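
<br>

The arithmetic behind those figures, as a short Python check.  The price<br>

per module is the approximate figure quoted above.  The memory per core<br>

lines are my own addition, using the 32 and 64 core configurations.<br>

<pre>
```python
# Figures taken from the paragraph above; the price is an approximate quote.
SOCKETS = 4
SLOTS_PER_SOCKET = 8
GB_PER_DIMM = 8
PRICE_PER_DIMM = 80   # US dollars, approximate

dimms = SOCKETS * SLOTS_PER_SOCKET      # 32 memory modules
total_gb = dimms * GB_PER_DIMM          # 256 GB
total_cost = dimms * PRICE_PER_DIMM     # 2560 dollars
print(dimms, "modules,", total_gb, "GB, about", total_cost, "dollars")

# Memory available per core for a 32 core and a 64 core machine.
for cores in (32, 64):
    print(cores, "cores:", total_gb / cores, "GB per core")
```
</pre>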

<br>

As far as I know, if the problem requires more memory than this, then I<br>

would need to use multiple servers in a cluster with MPI communications<br>

via Ethernet or Infiniband.<br>

<br>

However, if the problem can be handled by 32 to 64 CPU-cores with 256GB<br>

of RAM, then doing it in a single server as described would be much<br>

faster and generally less expensive than spreading the problem over<br>

multiple servers.<br>

<br>

The above is a rough explanation for why the increasing number of<br>

CPU-cores per motherboard, together with increased number of DIMM slots<br>

per motherboard with increased affordable memory per DIMM slot means<br>

that many projects which in the past could only be handled with a<br>

cluster, can now be handled faster and less expensively with a single<br>

server.<br>

<br>

This "single powerful multi-core server" approach is particularly<br>

interesting to me regarding writing my own programs for music synthesis<br>

or physics simulation.  The simplest approach would be to write the<br>

software in single-threaded C++.  However, that won't make use of the<br>

CPU power, so I need to use the inbuilt multithreading capabilities of<br>

C++.  This requires a more complex program design, but I figure I can<br>

cope with this.<br>

<br>

In principle I could write a single-thread physics simulation program<br>

and access massive memory via a 4 CPU-module Opteron motherboard, but<br>

any such program would be performance limited by the CPU speed, so it<br>

would make sense to split it up over multiple CPU-cores with multithreading.<br>

<br>

For my own projects, I think this is the way forward.  As far as I can<br>

see, I won't need more memory than 256GB or more CPU power than I can<br>

get on a single motherboard.  If I did, then I would need to write or<br>

use MPI-based software and build a cluster - or get some time on an<br>

existing cluster.<br>

<br>

<br>

There is another approach as well - writing software to run on GPUs.<br>

Modern mass-market graphics cards have dozens or hundreds of CPUs in<br>

what I think is a shared, very fast, memory environment.  These CPUs are<br>

particularly good at floating point and can be programmed to do all<br>

sorts of things, but only using a specialised subset of C / C++.  I want<br>

to write in full C++.  However, for many people, these GPU systems are<br>

by far the cheapest form of compute power available.  This raises<br>

questions of programming them, running several such GPU boards in a<br>

single server, and running clusters of such servers.<br>

<br>

The recent thread on the Titan Supercomputer exemplifies this approach -<br>

get as many CPU-cores and GPU-cores as possible onto a single<br>

motherboard, in a reasonably power-efficient manner and then wire as<br>

many as possible together with Infiniband to form a cluster.<br>

<br>

Mass market motherboards and graphics cards with Ethernet are arguably<br>

"Beowulf".  If and when Infiniband turns up on mass-market motherboards<br>

without a huge price premium, that will be "Beowulf" too.<br>

<br>

I probably won't want or need a cluster, but I lurk here because I find<br>

it interesting.<br>

<br>

Multi-core conventional CPUs (64 bit Intel x86 compatible, running<br>

GNU-Linux) and multiple such CPU-modules on a motherboard are chewing<br>

away at the bottom end of the cluster field.  Likewise GPUs if the<br>

problem can be handled by these extraordinary devices.<br>

<br>

With clusters, the inter-server communication system (Infiniband<br>

or Ethernet), with the accompanying software requirements (MPI and<br>

splitting the problem over multiple servers which do not share a single<br>

memory system), is the most serious bottleneck, so the best approach seems<br>

to be to make each server as powerful as possible, and then use as few<br>

of them as necessary.<br>

<br>

 - Robin         <a href="http://www.firstpr.com.au" target="_blank">http://www.firstpr.com.au</a><br>

<br>

</blockquote></div><br>

</div>