[Beowulf] Heterogeneous, intermittent beowulf cluster administration

Gus Correa gus at ldeo.columbia.edu
Fri Sep 27 12:08:20 PDT 2013


On 09/26/2013 09:00 AM, Ivan M wrote:
> Hi folks,
>
> I have access to a bunch (around 20) of machines in our lab, each one with
> a particular configuration, usually some combination of Core i5/i7 and
> 4GB/8GB/16GB RAM (the "heterogeneous" part), connected by a 24-port
> Cisco switch with a reasonable backplane. They're end-user machines, but
> at the current lab occupancy only a fraction of them are in constant
> use, and which ones changes every day. They are all running Debian
> stable. I got an idea: why not use the downtime to run some parallel
> simulations, instead of using the university cluster?
>
> The main problems now are:
>
> 1) System administration: for now I'm using clusterssh to
> update/configure/install new software, but this can be very cumbersome:
> a machine may be in use when I need to change its configuration, so I
> have to keep track of which ones have been changed. Maybe
> puppet can help here?
>
> 2) Managing resources: knowing which machine is up and available without
> having to shout, and knowing the available configuration to allocate
> jobs that can fit in that particular machine, etc. There are extreme
> cases when the machine needs to be rebooted to run some Windows program.
>
> 3) Migrating jobs (the intermittent part): any machine can be requested
> by a user at any time, so if I have a parallel job running I would have
> to migrate the job to another machine, preferably without stopping the
> other jobs. We are running mostly ROMS over MPI and some in-house
> simulations that use a combination of OpenMP and MPI.
>
> Does anyone have any experience or pointers on how to address these
> issues? It seems a waste not to use those idle machines...
>
> Ivan Marin
> http://scholar.google.com.br/citations?user=faM0PCYAAAAJ
>

Hi Ivan

Maybe HTCondor (formerly Condor) is the tool you're looking for:

http://research.cs.wisc.edu/htcondor/

It's been around for quite a while.
As others have already commented, the idle CPU cycles
harvested may not offset the sysadmin effort spent.
Also, running jobs may be interrupted by keyboard and other
events too often to get meaningful work done in a reasonable time.
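
To make the idea concrete, here is a minimal Python sketch of the kind
of decision a cycle-scavenging scheduler has to make for each desktop:
is nobody using it right now, and is it big enough for the job?
This is only an illustration, not HTCondor's API or configuration
syntax. The Machine/Job classes, the machine names, and the idle and
load thresholds are all hypothetical, picked just for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    cores: int
    ram_gb: int
    keyboard_idle_s: float  # seconds since the last keyboard/mouse event
    load_avg: float         # load average caused by the desktop user


@dataclass
class Job:
    cores: int
    ram_gb: int


# Hypothetical thresholds, for illustration only.
MIN_IDLE_S = 15 * 60   # consider a desktop free after 15 idle minutes
MAX_LOAD = 0.3         # ...and only if the owner's load is low


def is_idle(m: Machine) -> bool:
    """True if nobody appears to be using the machine."""
    return m.keyboard_idle_s >= MIN_IDLE_S and m.load_avg <= MAX_LOAD


def can_run(m: Machine, job: Job) -> bool:
    """True if the machine is idle and has enough cores and RAM."""
    return is_idle(m) and m.cores >= job.cores and m.ram_gb >= job.ram_gb


def eligible(machines, job):
    """Names of the machines that could accept the job right now."""
    return [m.name for m in machines if can_run(m, job)]


if __name__ == "__main__":
    lab = [
        Machine("pc01", cores=4, ram_gb=8, keyboard_idle_s=3600, load_avg=0.05),
        Machine("pc02", cores=8, ram_gb=16, keyboard_idle_s=30, load_avg=0.10),
        Machine("pc03", cores=4, ram_gb=4, keyboard_idle_s=7200, load_avg=0.00),
    ]
    # pc02 has someone at the keyboard, pc03 has too little RAM.
    print(eligible(lab, Job(cores=4, ram_gb=8)))  # prints ['pc01']
```

Note that the answer changes the moment someone sits down at pc01,
which is exactly why a tightly coupled MPI job spread over such
machines may get interrupted too often.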

My two cents,
Gus Correa


