I've got 8 linux boxes, what now

Martin Siegert siegert at sfu.ca
Fri Dec 7 13:48:59 PST 2001


On Fri, Dec 07, 2001 at 11:59:01AM -0500, Greg Lindahl wrote:
> On Thu, Dec 06, 2001 at 04:40:43PM -0800, Chris Majewski wrote:
> 
> > We're a computer science department investigating, very tentatively,
> > the possibility of installing a linux cluster as our next
> > general-purpose compute server. To date we've been using things like
> > expensive multiprocessor SUN machines.
> 
> Chris,
> 
> You need to think about your user interface. Here are 3 possibilities:
> 
> 1) When your user logs in, they are dumped on 1 of N Linux boxes. All
> their processes run on the box. You can use LVS (Linux Virtual
> Servers) to do this. NFS mount their directories. If a user starts a
> long running job, they have to remember which box they started it on
> if they want to kill it. And there's no guarantee that load averages
> will remain similar, although LVS can stop sending users to a box with
> a higher load average. Fortunately, most of your users don't start
> long running jobs, so for most people, it just works.
> 
> 2) As (1), but users can also use a special scheme for long-running
> jobs: a batch queue. Use Condor for the batch queue. Condor
> allows people to find out where their jobs are, and it will be able to
> migrate some long running jobs to different boxes to balance the
> load. Since only a subset of your users have long running jobs, most
> people don't have to learn about Condor.
> 
> 3) As (1), but use MOSIX. MOSIX can automagically migrate long running
> jobs to a different system, but "ps" still shows the job on the "home"
> system. This is more transparent to the users, but now the job dies if
> either system crashes, so it's less reliable than Condor.
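
(An aside on option (1): the load-based dispatch amounts to "send the
next login to the least loaded box". Below is a toy Python sketch of that
idea, with invented host names, assuming passwordless ssh to every node;
real LVS does this at the IP level, the sketch is only meant to
illustrate the dispatch rule.)

#!/usr/bin/env python3
# Toy illustration only: pick the least loaded box for the next login.
# Host names are invented; assumes passwordless ssh to every node.
# LVS does the real thing at the IP level inside the kernel.

import subprocess

NODES = ["node01", "node02", "node03"]   # hypothetical login boxes

def load_average(node):
    """1-minute load average of a node, or +inf if it cannot be reached."""
    try:
        out = subprocess.run(["ssh", node, "cat", "/proc/loadavg"],
                             capture_output=True, text=True,
                             timeout=5, check=True)
        return float(out.stdout.split()[0])
    except (subprocess.SubprocessError, ValueError, IndexError):
        return float("inf")

def pick_node():
    """Send the next login to the box with the lowest load average."""
    return min(NODES, key=load_average)

if __name__ == "__main__":
    print(pick_node())
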
> 
> With any of the 3, you still need to work out a way of administering
> the system to keep them synchronized.
> 
> TurboLinux has a cluster admin system that helps you keep system disks
> synchronized.
> 
> You can use "rsync" or cfengine, which are traditional Linux sysadmin
> tools.
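
(To make the rsync route concrete: it can be as little as a cron job on
an admin host that pushes a master copy of a few files to every node.
A minimal Python sketch, with invented node names and paths; cfengine
does the same job declaratively.)

#!/usr/bin/env python3
# Minimal sketch of keeping node system files in sync with rsync over ssh.
# Node names and the path list are invented for the example.

import subprocess

NODES = ["node01", "node02", "node03"]               # hypothetical nodes
PATHS = ["/etc/passwd", "/etc/group", "/etc/hosts"]  # files to keep identical

def push(node):
    """Mirror each path from the admin host to the node."""
    for path in PATHS:
        subprocess.run(["rsync", "-a", "-e", "ssh", path, f"{node}:{path}"],
                       check=True)

if __name__ == "__main__":
    for node in NODES:
        push(node)
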
> 
> Scyld Beowulf doesn't really address this situation. However, it
> could, with a modest amount of work. Don, do you have any comments
> about this? Since it's many users running many jobs, including
> interactive ones, it's not really the area that "Beowulf clusters"
> traditionally address. I wish people would work on this, though, as
> I'd love to have a prepackaged solution I could sell in this area.

About two years ago we were in exactly the same position Chris is in now:
we provide research computing for the whole university, i.e., the
system should be able to run all kinds of applications you can think of:
serial jobs, MPI jobs, jobs that require good network performance, jobs
that require huge compute power. And we have a small budget.

Two years ago our SGI SMP boxes were switched off; we bought 8
dual-processor PIII/500MHz boxes and built a Beowulf cluster.

At that time I had to make a decision about load balancing and a batch
queuing system. I decided to use neither.

I am in the position of being a sysadmin and a user at the same time. From
my experience as a user I can only say that batch systems don't work. This
has nothing to do with the software (PBS, etc.), but with the concept:
you have to come up with a set of rules that classify jobs with respect
to their requirements and assign them to certain queues accordingly.
Let's assume (for simplicity) that you use the run time of a job as the
only criterion. Then you set up one queue for jobs with up to 100 min.
run time, one for 1000 min., one for 10000 min., etc. (we very frequently
have jobs that run for over a month). Then you run 10 jobs from the
100 min. queue before you run one from the 1000 min. queue, etc.

The effect will be (in all cases that I have seen) that one queue is
completely filled up (usually the long-run-time queue) whereas the others
are empty. Hence, a job submitted to, say, the 1000 min. queue gets
executed right away, but a job in the 10000 min. (or longer) queue
sometimes has to wait for weeks. This situation has all kinds of bizarre
effects: those users who can will split their jobs into several pieces and
run chain jobs; as a consequence, the situation for the jobs that can't be
split up and are stuck in the long-run-time queue gets even worse. Besides,
rewriting programs so that they can run as chain jobs is a total waste of
time. And there are many more problems you can think of (e.g., what do you
do with MPI jobs? what do you do if a job exceeds its time limit - kill it
and waste even more time? etc.).
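
To make the problem with such a rule set concrete, here is a small sketch
of the classify-and-dispatch scheme described above (queue names and the
10:1 ratio follow the example; the code stands in for no particular batch
system and is only meant as an illustration):

#!/usr/bin/env python3
# Sketch of the rule set described above: queues classified purely by
# declared run time, and roughly 10 jobs from a shorter queue dispatched
# for every job from the next longer one.  Illustration only.

from collections import deque

# queue name -> (run-time limit in minutes, jobs dispatched per cycle)
POLICY = {"100min": (100, 10), "1000min": (1000, 1)}
# extend the same pattern with a 10000min queue, etc.

class Scheduler:
    def __init__(self, policy):
        self.policy = policy
        self.order = sorted(policy, key=lambda q: policy[q][0])  # shortest first
        self.queues = {q: deque() for q in policy}

    def submit(self, job, runtime_minutes):
        """Classify a job by its declared run time (the only criterion here)."""
        for q in self.order:
            if runtime_minutes <= self.policy[q][0]:
                self.queues[q].append(job)
                return q
        raise ValueError("job exceeds every queue's time limit")

    def next_jobs(self):
        """Dispatch 10 short jobs, then 1 longer one, and repeat."""
        while any(self.queues.values()):
            for q in self.order:
                for _ in range(self.policy[q][1]):
                    if self.queues[q]:
                        yield q, self.queues[q].popleft()

if __name__ == "__main__":
    s = Scheduler(POLICY)
    for i in range(5):
        s.submit(f"short-{i}", 90)     # lands in the 100min queue
    for i in range(5):
        s.submit(f"long-{i}", 900)     # lands in the 1000min queue
    for queue, job in s.next_jobs():
        print(queue, job)

The code is not where this breaks down: on a real machine the demand piles
up in the longest queue while the short ones sit empty, so the ratio never
does what it was meant to do.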

Thus: batch systems may work for corporations or small research groups
(where everybody runs very similar jobs), but in a university environment
they just don't work. In other words, it is impossible to come up with a
rule set that is fair to all users.

Load balancing (e.g., MOSIX) may work if you have idle processors, but
if you usually have more jobs than processors it doesn't help much.
Also, my impression is that MOSIX is very bad for MPI jobs that require
good network performance.

If somebody has had different experiences I would like to hear about them,
but for now we do nothing of this kind.

This does not mean chaos - on the contrary. As a sysadmin you
always like to have things regulated in an orderly fashion so that it
is clear how things work (even if they don't work well). We basically
said: let the users talk to each other and let them distribute the
resources among themselves. We only published a few guidelines (for those
who are interested: http://www.sfu.ca/acs/cluster/guidelines.html) and then
let the users agree on the rest themselves (who runs what, where, and for
how long). This works extremely well. Within the last two years there were
only a few instances where I actually had to stop jobs temporarily. In
almost all cases these were new users who didn't know the system yet.

Currently we are setting up a new (larger) cluster, again as a research
facility for the whole university, basically a general-purpose number
cruncher. The design of such a thing is another problem that is rarely
talked about on this list. If you know the application you want to run,
you can design your cluster accordingly, but what do you do if you don't
(or if you want to run "everything")? You have a fixed budget, so do you
buy the best network you can think of (Myrinet or similar) so that you
can basically run everything? But that will roughly halve the number of
nodes you can buy, which will make your Monte Carlo users very unhappy.
You have to find some compromise, but it isn't easy.
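
To put rough numbers on that: if a Myrinet card plus its share of the
switch costs about as much as a commodity node itself, the node count
drops by a factor of about two. A back-of-the-envelope sketch (all prices
are invented placeholders):

#!/usr/bin/env python3
# Back-of-the-envelope arithmetic behind "Myrinet about halves the node
# count".  All prices are invented placeholders; plug in real quotes.

BUDGET = 100_000        # fixed budget, arbitrary units
NODE_COST = 2_000       # commodity dual-CPU node (assumed)
FAST_NIC_COST = 2_000   # Myrinet card + switch port per node (assumed)

nodes_commodity_net = BUDGET // NODE_COST               # cheap Ethernet
nodes_fast_net = BUDGET // (NODE_COST + FAST_NIC_COST)  # Myrinet on every node

print("commodity network:", nodes_commodity_net, "nodes")
print("fast interconnect:", nodes_fast_net, "nodes")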

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


