Beowulf: A theoretical approach

Walter B. Ligon III walt at parl.ces.clemson.edu
Thu Jun 22 06:43:53 PDT 2000


--------
> Hi,
> 
> I'm doing my final year project and I'm writing about the Beowulf project
> and the Beowulf clusters.
> I've been reading several documents about Beowulf clusters, but I would
> like to ask all of you some questions about them.
> 
> As I've seen, the main objective behind any Beowulf cluster is the
> price/performance tag, especially when compared to supercomputers. But as
> network hardware and commodity systems are becoming faster and faster
> (getting closer to GHz and Gigabit speeds), could you think of competing
> directly with supercomputers?

If it becomes possible to compete with a "supercomputer" in all applications
using COTS HW, that would be great!  I don't see that in the near future, but
we may get there eventually.  It depends on how much Beowulf cuts into the
development of "supercomputers."
 
> As I see it, the Beowulf cluster idea could be based on distributed
> computing and parallel computing: you put in more CPUs to get more speedup,
> but as you can't have all the CPUs in the same machine you use several. So
> the Beowulf cluster could fit in between distributed computing and the
> supercomputers (vector computers, parallel computers, etc.). You have
> advantages from both sides: parallel programming and high scalability; but
> you also have several drawbacks: mainly interconnection problems. Do you
> think that with 10 Gb connections (OC-192 bandwidth), SMP on a chip (Power 4)
> and massive primary and secondary memory devices at low cost, you could
> have a chance to beat most of the traditional supercomputers? Or is that not
> your "goal"?

The "goal" is to do the best we can with COTS HW.  The problem right now really
isn't in link speeds (though better link speeds are good), it's in how close/far
the network interface is from the CPU.  COTS HW doesn't place a high value
on direct access to IO devices - there is a higher value on a standardized
bus interface to allow different system components to be integrated and updated
independently.  A "supercomputer" can have the network engineered directly
into the node architecture.  This is a huge advantage.  Luckily, this advantage
has the most effect in only some programs.  Beowulf attempts to exploit
those programs where that advantage isn't as much of an issue.
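
To make that concrete, here is a minimal ping-pong sketch of the kind commonly
used to measure message latency between two nodes (just an illustration I'm
including here, not code from any particular system; it only assumes a
standard MPI installation):

/* pingpong.c - a minimal sketch of a latency test between two nodes.
 * Build with an MPI compiler wrapper, e.g.:  mpicc -O2 pingpong.c -o pingpong
 * Run with two processes, one per node:      mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const int reps = 1000;
    int rank, i;
    char byte = 0;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            /* bounce one byte off rank 1 and wait for it to come back */
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("approx. one-way latency: %g microseconds\n",
               (t1 - t0) / (2.0 * reps) * 1e6);

    MPI_Finalize();
    return 0;
}

A program that exchanges a few large blocks of data now and then barely
notices that per-message cost; a program that sends lots of small messages
pays it on every one, and that is where a network engineered into the node
architecture wins.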
 
> And about the evolution of the Beowulf clusters, do you all follow a kind of
> guideline, or has the project divided into several flavors and objectives?
> Are the objectives from the beginning the same as today, or do you now plan
> to have something like a "super SMP computer" in a distributed way (with good
> communication times)? I've seen that a lot of you are focusing on the GPID
> and whole-machine idea; do you think that is reachable? What are the main
> objectives vs. the MPI/PVM message passing idea?
> And what about shared memory (at the HD level or the RAM level)? Do you take
> advantage of having this amount of resources?

No, there isn't a single approach.  Beowulf is all about "roll your own"
technology.  There are those who would like to see some kind of
standardization, but no one can quite agree on what that standard should
look like.  This indicates to *me* that we aren't ready for a standard yet.

GPID is really close.  We are running that every day around here and I feel
like it makes a *huge* impact on the usability and programmability of the
system.  I still don't know if the way we are doing it will emerge as THE
way to do it, but that's what this community is all about - do it and then
let the rest of the world decide what it thinks.
 
> Is this idea trying to reach the objective of making parallel programs
> "independent" to the programmer? I mean, that instead of having to program
> having in mind that you are using a parallel machine you can program in a
> "normal" way and the compiler will divide/distribute the code over the
> cluster. Is this reachable or just a dream? Is somebody working on this?

No, that's really a separate issue.  There are (and have been) many, many
people working on this.  My *personal* opinion is we will never have a
perfect - or even really good - parallelizing compiler (no flames guys, this
is MY opinion).  I think in the end we will evolve until programmers are
capable of programming in parallel.
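
To be concrete about what "programming in parallel" means in practice, here
is a trivial sketch (mine, just for illustration) of the classic numerical
integration of 4/(1+x^2) over [0,1] to compute pi, where the programmer, not
the compiler, decides how the index space is divided among the nodes:

/* parpi.c - a trivial explicit decomposition: integrate 4/(1+x^2) over
 * [0,1] (which equals pi), splitting the intervals by rank.
 * Build:  mpicc -O2 parpi.c -o parpi
 * Run:    mpirun -np 4 ./parpi
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000          /* total number of intervals */

int main(int argc, char *argv[])
{
    int rank, size, i, lo, hi;
    double h = 1.0 / N, x, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* the programmer, not the compiler, decides who owns which intervals */
    lo = rank * (N / size);
    hi = (rank == size - 1) ? N : lo + (N / size);

    for (i = lo; i < hi; i++) {
        x = h * (i + 0.5);              /* midpoint of interval i */
        local += 4.0 / (1.0 + x * x);
    }

    /* combine the partial sums on rank 0 */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.10f\n", pi * h);

    MPI_Finalize();
    return 0;
}

The decomposition here is one line of arithmetic, but in real codes it is the
hard part, and it is exactly the part I don't expect a compiler to figure out
in general.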
 
> And what about the administration of a cluster? Having all the machines of
> the cluster under control, so you can know which are available to send some
> work to, is a hazardous task, but a necessary one. It is not as easy as in an
> SMP machine, where you know or assume that all the CPUs inside are working;
> in a cluster you can't do that, as the CPU might work but the HD, NIC or
> memory may fail. How much computational time do you spend on this task? Is
> somebody working on a better way to manage this?

Several people are working on this.  There are some good approaches out there.
I think we'll see some achievements in this in the next couple of years and
this will make a big difference in getting Beowulf off the ground as a "real"
computing technology used in a wide range of applications.
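
At its simplest, node monitoring is just asking each node whether it still
answers, and building up from there. Here is a bare-bones sketch
(hypothetical, not any particular project's tool) that probes a known TCP
service on each node named on the command line:

/* nodecheck.c - probe a TCP service (ssh here) on each node named on the
 * command line and report which nodes answer.
 * Build:  cc -O2 nodecheck.c -o nodecheck
 * Run:    ./nodecheck node01 node02 node03
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT 22            /* any service that is always running will do */

static int node_up(const char *host)
{
    struct hostent *he;
    struct sockaddr_in addr;
    int fd, ok = 0;

    if ((he = gethostbyname(host)) == NULL)
        return 0;                              /* name doesn't resolve */

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        return 0;
    /* a blocking connect is fine for a sketch; a real tool would use a
     * non-blocking connect with a timeout so dead nodes don't stall it */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        ok = 1;
    close(fd);
    return ok;
}

int main(int argc, char *argv[])
{
    int i;
    for (i = 1; i < argc; i++)
        printf("%-20s %s\n", argv[i], node_up(argv[i]) ? "up" : "DOWN");
    return 0;
}

As you point out, a node can answer on the network while its disk or memory
is failing, so a real tool checks much more than this, but that is the basic
shape of the problem.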
 
> I know that some time ago HP had a machine with several faulty processors
> working and achieving high computational speeds without any error. They used
> some kind of "control algorithm" that manages to use only the good CPUs. Do
> you have something like this, or is there no point? Does it make sense?

Good idea - again, this is an area people are (and should be) working in.


So, anyway, this is ONE answer to your questions.  I'm sure you will get
several others.

Walt

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University





