Beowulf: A theoretical approach

Jakob Østergaard jakob at ostenfeld.dtu.dk
Thu Jun 22 04:59:08 PDT 2000


On Thu, 22 Jun 2000, Nacho Ruiz wrote:

> Hi,
> 
> I'm doing my final year project and I'm writing about the Beowulf project
> and the Beowulf clusters.
> I've been reading several documents about Beowulf clusters, but I would
> like to ask all of you some questions about them.
> 
> As I've seen, the main objective behind any Beowulf cluster is the
> price/performance tag, especially when compared to supercomputers. But as
> network hardware and commodity systems are becoming faster and faster
> (getting closer to GHz and Gigabit speeds), could you think of competing
> directly with supercomputers?

Competing on what terms ?

SMP supercomputers are very convenient to work with, because they have a
lot of CPU power and a lot of memory, and you can use either or both as you see
fit.  In other words, a lazy programmer with poor tools can make almost
anything run well on an SMP supercomputer.

Clusters are a different story. They have a lot of CPU power as well, and a lot
of memory, but one CPU can't easily access all the memory, and several CPUs
can't easily share the same memory.    Clusters are a lot less convenient than
one huge SMP machine; they require more effort on the part of the programmer
for general problem solving, if they are to run as well as the supercomputer.
There are special problems, though, which are extremely well suited for
clusters, and I believe that a large number of problems _could_ be solved well
on clusters.  Since we have the Beowulf list, I guess I'm not the only one
thinking that  :)

``Very'' parallel problems (extreme example:  seti@home or distributed.net)
will run as well on a cluster as they will on a traditional SMP
supercomputer.  This kind of problem is well suited for clusters: the
sub-problems solved by each CPU are completely isolated, so the CPUs in the
cluster (or in the supercomputer) need not communicate.   This type of problem
is rare though.  In the following I will consider ``general'' problems:
problems that can be parallelized, but where the sub-problems are not independent.
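
Just to make the distinction concrete, here is a small made-up C sketch (my
own toy, with invented numbers and function names): the first function is the
seti@home-style case where the work units are completely independent, the
second is a one-dimensional stencil where every step needs data owned by a
neighbouring CPU.

/* Made-up illustration of the two kinds of problems discussed above.
 * crunch_unit() stands in for whatever an independent work unit does.    */
#include <stdio.h>

static double crunch_unit(int unit_id)            /* dummy stand-in */
{
    return (double)unit_id * 0.001;
}

/* ``Very'' parallel: every work unit is independent, so each CPU can
 * grind through its own units without talking to any other CPU.          */
static double crunch_independent(int first, int last)
{
    double sum = 0.0;
    for (int u = first; u <= last; ++u)
        sum += crunch_unit(u);                    /* no communication     */
    return sum;
}

/* ``General'' parallel: a 1-D stencil.  Every cell needs its neighbours'
 * old values, so the CPU owning a boundary cell must fetch data from
 * another CPU on every step -- this is where the network starts to hurt. */
static void stencil_step(const double *cur, double *next, int n)
{
    for (int i = 1; i < n - 1; ++i)
        next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
}

int main(void)
{
    double a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }, b[8] = { 0 };
    stencil_step(a, b, 8);
    printf("independent part: %g, one stencil cell: %g\n",
           crunch_independent(0, 99), b[1]);
    return 0;
}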


Large SMP supercomputers are (should I say ``usually'' ?) NUMA architectures.
You can think of them as a cluster with incredibly high network bandwidth,
very low latency, and where the hardware (with some help from the operating
system) emulates shared memory between CPUs.   If you want a cluster to work
the same way, you will not only need some software to help you, you will also
need a very high bandwidth network, and even then you will see that the network
latency becomes a problem.  MOSIX (www.mosix.org) is a piece of software
that tries to emulate a large SMP machine on distributed systems.  It still
lacks functionality as far as I know (especially wrt. memory shared between
threads), but the software can already give you an idea of how hard it
is to compete with SMP supercomputers on _their_ terms.  If you want your
cluster to give any program the impression that it's in fact running on a giant
SMP machine, well, you're in trouble.   It can be done, but it can't be done
well.   Not because the software isn't good (MOSIX _is_ good) but because you
just don't set up switched networks with multiple GByte/s bandwidth and very
low latency.   Gigabit networks are _nothing_ compared to the crossbar switch
in your average Origin system.

Clusters can't fake a large SMP system _generally_.   I mean, you can do
it, yes, but you cannot get good speed - generally.  If the operating system
and the hardware are the parts that work together to give a program the
impression that it can run on 16 CPUs sharing the same memory, you will
need the speed of the backbone in the supercomputers.

Having the abstraction layer at the hardware or operating system level is
sub-optimal.   The application knows (or could know) when it is going
to move data, when it can use spare CPUs, etc.   The hardware and the
operating system can never know.   The only reason why this works so well
in supercomputers is that they have an _incredibly_ fast ``network''
between their CPUs, so the cost of moving information from CPU to CPU in this
suboptimal manner is acceptable.

> 
> As I see it, the Beowulf cluster idea could be based on distributed
> computing and parallel computing: you put in more CPUs to get more speedup,
> but as you can't have all the CPUs in the same machine you use several. So
> the Beowulf cluster could fit in between distributed computing and the
> supercomputers (vector computers, parallel computers, etc.). You have
> advantages from both sides: parallel programming and high scalability; but
> you also have several drawbacks: mainly interconnection problems. Do you
> think that with 10 Gb connections (OC-192 bandwidth), SMP on a chip (Power 4)
> and massive primary and secondary memory devices at low cost, you could
> have a chance to beat most of the traditional supercomputers? Or is that not
> your "goal"?

To me that's not a goal.  It's a game we've already lost if we start playing it.

We could set up 10Gbit/s networks, but your average supercomputer (SMP) has
a 10GByte/s ``network'' _today_.  Wait another five years, set up your 100Gbit/s
network, and guess what your SMP competitor has.   We're not even talking about
latency here...  I don't have latency numbers handy for a typical NUMA machine
interconnect, but I think it's safe to assume that it's pretty damn good and that
we're not going to get there, ever, with a traditional network.  If for nothing
else, then because our wires are longer.

One way to work around some of the shortcomings of clusters in general problem
solving is to make the clusters more SMP-like.  E.g. use two or four CPUs in
each box, then connect a number of those.   Sure, it's one way to go if you
want to improve things, but you will still not be taking the lead.

I don't mean to sound negative about clusters, really.  I just want to make
it clear that I don't think it's wise to spend money and effort trying to
beat the supercomputers on their terms - e.g. having fairly stupid parallel
applications believing they're on shared memory.   I think the way to go is
to build smarter applications that _know_ that they aren't on shared memory.
This is what people do with MPI and PVM.  And I think that could be taken
much further.
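
To make that concrete, here is a minimal MPI sketch in C (my own toy example,
nothing more).  It assumes an MPI implementation such as MPICH or LAM is
installed on the cluster; the point is simply that the data movement is an
explicit, programmer-visible step, not something the hardware has to fake.

/* Minimal MPI sketch: the application knows the memory is not shared,
 * so each node works on its own slice and the results are combined with
 * one explicit communication step.  Toy example, made-up workload.       */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many nodes?  */

    /* Each node computes a partial sum over its own part of the range.  */
    double local = 0.0;
    for (int i = rank; i < 1000000; i += size)
        local += 1.0 / (1.0 + i);

    /* The one place where data crosses the network, and the programmer
     * knows it: combine the partial sums on rank 0.                      */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f (computed on %d nodes)\n", total, size);

    MPI_Finalize();
    return 0;
}

You compile it against the MPI library and start it with something like
mpirun; the program itself decides how the work is split, which is exactly
the kind of knowledge the OS and the hardware can never have.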

> 
> And about the evolution of the Beowulf clusters: do you all follow a kind of
> guideline, or has the project divided into several flavors and objectives?
> Are the objectives of the beginning the same as today, or do you now plan to
> have something like a "super SMP computer" in a distributed way (with good
> communication times)? I've seen that a lot of you are focusing on the GPID
> and whole-machine idea; do you think that is reachable? What are the main
> objectives vs the MPI/PVM message passing idea?
> And what about shared memory (at the HD level or the RAM level), do you take
> advantage of having this amount of resources?

Again, I think GPID and fake shared memory are interesting ideas, and we might
even end up being able to compete with supercomputers in terms of
price/performance, for _some_ problems.

But trying to build a shared memory system from memory that just isn't shared
is hard enough for the SMP Supercomputer people (ccNUMA architectures are
very complicated hardware and software systems).   We can't do that better than
they do (and I'd love to eat those words, but I don't think it will happen).

> 
> Is this idea trying to reach the objective of making parallel programs
> "independent" of the programmer? I mean, that instead of having to program
> keeping in mind that you are using a parallel machine, you can program in a
> "normal" way and the compiler will divide/distribute the code over the
> cluster. Is this reachable or just a dream? Is somebody working on this?

Lots of people are working on new ways to use clusters.  An hour of surfing
from the Beowulf site, or from any national laboratory, should give you plenty
of pointers     :)

And I'm working on something.  It will take time before I have results, but my
basic idea is:
*)   The programmer writes a serial program (in a language designed for this
     purpose).  There must be no way in the language to represent parallelism.
     The programmer should not be concerned with it.  Besides, he doesn't know
     whether there is one idle node or a hundred in the cluster when he runs
     his program.
*)   The program is submitted to a parallel virtual machine running on all
     nodes of the cluster.   This virtual machine can automatically parallelize
     the program, it can be fault tolerant, and it can predict/guess where and
     when data might be needed on which nodes, and move the data before they
     are needed.
*)   The hardware and the operating system should do nothing but provide
     base services (network and disk I/O) on each node in the cluster. The OS
     doesn't know about the flow of the program being run, so it should
     basically just stay out of the way, doing what it already does well.

The goal is to make the network bandwidth/latency less important, by moving
data to remote nodes before they are waited for there.  Also,
parallelizing with regard to the current state of the cluster should improve
the utilization of the cluster, making sure that all nodes are busy at all
times.  And since each sub-task consists of a well-defined set of input data,
the system can simply re-send a task to some other node in case of node failure.

This is fiction today.  Hang on for another year and I might have something  :)

(there's a report available at http://ostenfeld.dtu.dk/~jakob/TONS-1/
 describing some of the work - but a lot has changed since then.  You can also get
 the software from http://sslug.dk/TONS/, but really, unless you like parallel
 Fibonacci number calculations there's little you can do with this software
 today)
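
To illustrate just one point from the list above - a sub-task being nothing
more than a well defined set of input data, so it can simply be re-sent when a
node dies - here is a made-up C sketch.  This is not TONS code; the ``nodes''
are simulated, and run_on_node() is a hypothetical stand-in for whatever
transport the virtual machine would really use.

/* Toy master-side retry loop: because the task carries all of its input
 * data, it can be handed unchanged to the next node when one node fails. */
#include <stdio.h>

struct task {
    int    id;
    double input[4];       /* the complete, self-contained input data */
    double result;
    int    done;
};

/* Stand-in for ``ship the task to a node and wait for the answer''.
 * Node 1 is pretended to be dead so the re-send path actually runs.      */
static int run_on_node(int node, struct task *t)
{
    if (node == 1)
        return -1;                              /* simulated node failure */
    t->result = t->input[0] + t->input[1] + t->input[2] + t->input[3];
    t->done = 1;
    return 0;
}

int main(void)
{
    struct task t = { 7, { 1, 2, 3, 4 }, 0.0, 0 };

    for (int node = 1; node <= 4 && !t.done; ++node) {
        if (run_on_node(node, &t) == 0)
            printf("task %d done on node %d: result %g\n",
                   t.id, node, t.result);
        else
            fprintf(stderr, "node %d failed, re-sending task %d\n",
                    node, t.id);
    }
    return 0;
}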

> 
> And what about the administration of a cluster? Having all the machines of
> the cluster under control, so you know which are available to send some
> work to, is a hazardous but necessary task. It's not as easy as in an SMP
> machine, where you know or assume that all the CPUs inside are working; in a
> cluster you can't do that, as the CPU might work but the HD, NIC or memory
> may fail. How much computational time do you spend on this task? Is somebody
> working on a better way to manage this?

There are various queue systems available that can take some of this into
account.   This is not really my area, so I'll leave that for someone else  :)

> 
> I know that some time ago HP had a machine with several faulty processors
> working and achieving high computational speeds without any error. They used
> some kind of "control algorithm" that manages to use only the good CPUs. Do
> you have something like this, or is there no point? Does it make sense?

Fault tolerance is important in clusters.  If you have 100 nodes (or, like
Google, 4000) then some of them are going to have problems.  A previous
discussion here on the Beowulf list covered that subject pretty well, I think.
In particular, a good point was made that if the cost of fault tolerance is
high, you might as well achieve ``fault tolerance'' by just restarting the
program that failed.
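
Just to show how cheap that kind of ``fault tolerance'' can be, here is a
made-up little supervisor in C that simply re-runs a job until it exits
cleanly.  The job name ./my_simulation is hypothetical; any batch job that
returns a sensible exit status would do.

/* Toy supervisor: restart the job on failure, give up after 5 attempts. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (int attempt = 1; attempt <= 5; ++attempt) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: run the (hypothetical) batch job */
            execlp("./my_simulation", "./my_simulation", (char *)NULL);
            _exit(127);                          /* exec itself failed */
        }
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            printf("job finished on attempt %d\n", attempt);
            return 0;
        }
        fprintf(stderr, "job failed (attempt %d), restarting...\n", attempt);
    }
    return 1;
}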

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:



