[Beowulf] Re: Linux cluster for my college

Mon Apr 16 03:52:56 PDT 2007

Geoff Jacobs wrote:
> Kyle Spaans wrote:
>> Wait, we can use openMOSIX and MPI at the same time? I thought they
>> were separate ideas? For example, MPI for multithreading and message
>> passing, and openMOSIX for just process migration. Can they be used at
>> the same time?
>>
> 
> 1) Spawn MPI processes on the head node. Must be using tcp/ip for
> interconnect.
> 2) OpenMOSIX migrates processes to dormant computers.
> 3) Sit back and watch the data roll in.

Hello, Geoff.

Sounds nice, doesn't it? But it doesn't scale up very well with large
numbers of MPI processors...

I run a small (64-node) openMosix (oM) Beowulf cluster running my own
port of the Linux-2.4.26-om1 kernel, compiled under Ubuntu 6.06.1 LTS
(the server edition is supported until 2011, for any Ubuntu/Debian
detractors reading this):

	http://www.ubuntu.com/getubuntu/download

We've tried different MPI implementations, including MPICH and LAM. The
problem is that, by default, the same nodes get hammered every time an
MPI program is run. These nodes show high system times because there is
significant overhead when processes are migrated to and from their
'home' node (the node where a process was started) to do i/o. The reason
is that oM has to use the kernel on the 'home' node to do physical i/o.
In fact, I run a patched version that prevents oM migrating a process
back home just to run the time() functions. This makes a BIG difference
to programs that monitor their own progress!
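
For anyone stuck on an unpatched kernel, the same cost can be reduced
from the application side by reading the clock less often. This is only
a minimal sketch, assuming that each clock read is a system call that
has to be served on the 'home' node; the function name, the
report_every parameter and the injected clock are all hypothetical, and
this is not the kernel patch described above.

```python
import time


def run_with_progress(work_items, report_every=1000, clock=time.time):
    """Run each callable in work_items, reading the clock rarely.

    The clock is read once at the start and then once per
    `report_every` completed items, instead of once per item.
    Returns a list of (items_done, elapsed_seconds) tuples.
    """
    start = clock()
    reports = []
    done = 0
    for item in work_items:
        item()  # the CPU-bound work itself
        done += 1
        if done % report_every == 0:
            # Only here do we pay for a clock read.
            reports.append((done, clock() - start))
    return reports
```

With 2500 work items and report_every=1000 the clock is read three
times in total (once at the start, then at 1000 and 2000 items) rather
than 2500 times.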

However, if you use a large number of MPI processors to run a job, you 
would be better off using "Ganglia" to generate a load-balanced list of 
MPI nodes first:

	gstat -am
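
Turning that list into something an MPI launcher can read might look
like the sketch below. It assumes the list arrives as one "host:ncpus"
entry per line, least-loaded node first; check that against the output
of your own gstat before relying on it. Both function names are
hypothetical, and the one-hostname-per-slot output will need adjusting
to whatever machinefile syntax your MPI implementation expects.

```python
def parse_gstat_mpifile(text):
    """Parse 'host:ncpus' lines into (host, ncpus) pairs, keeping order.

    Lines that do not match the assumed format are skipped.
    """
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        host, sep, count = line.rpartition(":")
        if sep and host and count.isdigit():
            nodes.append((host, int(count)))
    return nodes


def machinefile_lines(nodes, nprocs):
    """Return nprocs hostnames, one per CPU slot, in list order.

    Because the input is assumed to be sorted least-loaded first,
    the first nprocs slots are the preferred places to start a job.
    """
    slots = [host for host, ncpus in nodes for _ in range(ncpus)]
    if nprocs > len(slots):
        raise ValueError("not enough CPU slots for %d processes" % nprocs)
    return slots[:nprocs]
```

The text passed to parse_gstat_mpifile() would be the captured output
of the command above, and the returned hostnames can be written out one
per line.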

For small MPI jobs, though, it's true that openMosix automatically adds
load-balancing to MPI. If your jobs are CPU intensive this works very
well, but if they do a lot of i/o the cluster spends a lot of its time
moving the active pages of processes between the kernels running on
different CPUs. BTW, openMosix migrates the active pages of the user
context, not the entire process. This is done very efficiently, but has
a finite cost because the COTS cluster interconnect is Gigabit Ethernet.
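
To put a rough number on that cost, here is a back-of-the-envelope
sketch. It assumes 4 KiB pages and the nominal 1 Gbit/s line rate, and
it ignores protocol overhead and latency, so it gives a lower bound
only; the function name and the efficiency parameter are hypothetical
and none of this is measured on our cluster.

```python
PAGE_SIZE_BYTES = 4096                 # assumed page size (typical for x86)
LINK_BITS_PER_SECOND = 1_000_000_000   # nominal Gigabit Ethernet line rate


def migration_seconds(active_pages, efficiency=1.0):
    """Lower-bound time to push `active_pages` pages over the link.

    `efficiency` is the fraction of the line rate actually achieved
    (1.0 means the full nominal rate, which real traffic never gets).
    """
    payload_bits = active_pages * PAGE_SIZE_BYTES * 8
    return payload_bits / (LINK_BITS_PER_SECOND * efficiency)
```

For example, 25600 active pages (100 MiB) come to about 0.84 seconds at
the full line rate, and twice that at 50% efficiency, every time the
process moves.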

The strategy we're developing now is to use all three approaches: MPI
jobs are started on a "gstat" load-balanced list of processors, and oM
then migrates MPI processes between nodes as the load changes, if the
oM load-balancing algorithm sees a benefit. However, openMosix is best
for 'embarrassingly' parallel tasks, which is the main reason we run it.

Re: the FC vs. Ubuntu/Debian comments here recently, I'm one of those
people who ran RH7.3->RH9 for years longer than I should have because it
was both 'stable' and 'supported'. If it had not been for the excellent
FedoraLegacy project I would have migrated to Debian years ago. However,
the "end of life" statement for RH9 at FL is what forced me into action.
It was, indeed, an effort to upgrade to Ubuntu, but I'm glad that I did.

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687