Beowulf: A theoretical approach

Nacho Ruiz iorfr00 at student.vxu.se
Thu Jun 22 00:20:33 PDT 2000


Hi,

I'm doing my final year project and I'm writing about the Beowulf project
and Beowulf clusters.
I've been reading several documents about Beowulf clusters, but I would
like to ask all of you some questions about them.

From what I've seen, the main objective behind any Beowulf cluster is the
price/performance ratio, especially when compared to supercomputers. But as
network hardware and commodity systems become faster and faster
(approaching GHz and Gigabit speeds), do you think you could compete
directly with supercomputers?

As I see it, the Beowulf cluster idea draws on both distributed computing
and parallel computing: you add more CPUs to get more speedup, but since
you can't fit all the CPUs in one machine, you use several. So the Beowulf
cluster sits somewhere between distributed computing and the supercomputers
(vector computers, parallel computers, etc.). You have advantages from both
sides: parallel programming and high scalability; but you also have several
drawbacks, mainly interconnect problems. Do you think that with 10 Gb
connections (OC-192 bandwidth), SMP on a chip (Power4) and massive primary
and secondary storage at low cost, you would have a chance to beat most of
the traditional supercomputers? Or is that not your "goal"?

And about the evolution of Beowulf clusters: do you all follow some kind of
guideline, or has the project split into several flavors and objectives?
Are the objectives from the beginning the same as today's, or do you now
plan to have something like a "super SMP computer" in a distributed way
(with good communication times)? I've seen that a lot of you are focusing
on the GPID and whole-machine idea; do you think that is reachable? What
are its main objectives compared with the MPI/PVM message-passing idea?
And what about shared memory (at the disk level or the RAM level): do you
take advantage of having that amount of resources?
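
For reference, when I say the MPI/PVM message-passing idea I mean explicit
code like the minimal MPI sketch below (my own toy example, not anything
from the Beowulf project itself), where every exchange of data between
nodes has to be spelled out by the programmer:

/* Minimal MPI example: rank 0 sends one integer, rank 1 receives it.
 * Run with something like: mpirun -np 2 ./a.out
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

PVM is similar in spirit, with explicit pack, send and receive calls.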

Is this idea trying to make the parallelism "transparent" to the
programmer? I mean that instead of having to program with a parallel
machine in mind, you could program in a "normal" way and the compiler would
divide and distribute the code over the cluster. Is this reachable or just
a dream? Is anybody working on this?
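
As an example of what I mean by programming in a "normal" way: the dream
would be to write a plain serial loop like the one below and let the
compiler or runtime spread it over the nodes, instead of the explicit sends
and receives shown above (purely illustrative):

/* The "normal" code I would like to write: a plain serial loop with no
 * notion of nodes or messages.  Distributing the independent iterations
 * over a cluster automatically is exactly what I am asking about.
 */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }
    for (int i = 0; i < N; i++) {
        c[i] = a[i] * b[i];   /* every iteration is independent */
        sum += c[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}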

And what about the administration of a cluster? Keeping all the machines of
the cluster under control, so that you know which ones are available to
send work to, is a tedious but necessary task. It is not as easy as in an
SMP machine, where you know (or assume) that all the CPUs inside are
working; in a cluster you can't assume that, since the CPU might work while
the HD, NIC or memory fails. How much computational time do you spend on
this task? Is anybody working on a better way to manage this?
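
The kind of check I have in mind is something as crude as the sketch below,
which just tries to open a TCP connection to each node (the hostnames and
the port are invented for the example) and reports which ones answer:

/* Rough availability check: attempt a TCP connection to each node and
 * report up/down.  Node names and the port (22) are only assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int node_alive(const char *host, int port)
{
    struct hostent *he = gethostbyname(host);
    if (he == NULL)
        return 0;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    int ok = (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    close(fd);
    return ok;
}

int main(void)
{
    const char *nodes[] = { "node01", "node02", "node03" }; /* invented */
    int n = sizeof(nodes) / sizeof(nodes[0]);

    for (int i = 0; i < n; i++)
        printf("%s: %s\n", nodes[i],
               node_alive(nodes[i], 22) ? "up" : "down");
    return 0;
}

Of course this only tells me that a node answers on the network; it says
nothing about a failing disk or memory, which is the harder part of the
question.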

I know that some time ago HP had a machine with several faulty processors
that kept working and achieved high computational speeds without any error.
They used some kind of "control algorithm" that managed to use only the
good CPUs. Do you have something like this, or is there no point? Does it
make sense?

That's all for now, thanks to all of you.
If you know of some sources where I can get more information, please let me
know.

Nacho Ruiz.




