Image Processing on a BeoWulf

Sun Aug 20 17:48:52 PDT 2000

> Our desire is to allow many jobs to be initiated ad-hoc by the
> production operators or by web based clients and take advantage of
> the parallelism and scaling offered by the cluster.

At this point you need a scheduler. You said you were using PVM. PVM isn't
very smart about handing out nodes when you ask for a lot more than you have.
In addition, how well things work when several communicating jobs share a
node depends on the details. Sometimes (I know this is true for mpich) a job
doing the wrong thing will spin in a busy-wait loop whenever the processes
it's trying to talk to happen not to be running. The resulting performance is
so awful that you'd think the job was stuck. I don't know about PVM; I'd
think that the "non direct route" option should be OK, but I've never tried
it.

> We currently have a work
> around in place - we implemented a queue that lines up the submitted
> jobs sequentially.  This certainly is not optimal as small jobs have
> to wait behind large ones.

That's a scheduler. Another scheduler you could use would be a queueing
system like PBS. However, what you really want is a queueing system that
provides "gang scheduling". With gang scheduling, all the processes of a job
run in the same time slice and only one program at a time is awake on a
node, so you don't have any spin-wait problems.
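
To make the idea concrete, here is a toy simulation of gang scheduling in
Python. It is only a sketch under simple assumptions: time is cut into equal
slices, every job wants every node, and jobs take turns round-robin. The
names gang_schedule, jobs and num_nodes are made up for illustration and do
not come from PBS or any real scheduler.

```python
from collections import deque

def gang_schedule(jobs, num_nodes):
    """Toy gang scheduler.

    jobs: dict mapping a job name to the number of time slices it needs.
    Returns one row per time slice; each row lists the job that is awake
    on every node during that slice.
    """
    ready = deque(jobs.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        # All nodes switch to the same job together, so the job's
        # processes never wait on a peer that is not currently running.
        timeline.append([name] * num_nodes)
        if remaining > 1:
            ready.append((name, remaining - 1))
    return timeline

if __name__ == "__main__":
    for slice_no, row in enumerate(gang_schedule({"big": 3, "small": 1}, 4)):
        print(slice_no, row)
```

In this toy the small job gets the second time slice instead of waiting for
the big job to finish, and no node ever runs two jobs in the same slice.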

The RWCP guys have this for their SCore operating system. But SCore is
pretty big, and I don't think it's easy to extract just that one feature.

Now the T3E has a lovely gang scheduler...

An alternative that might work better for you would be to have all your jobs
use the same set of worker processes, one per CPU. Then multiple jobs would
just send extra work and wait until the workers got around to handling it.
Since PVM has full dynamic process creation, this is fairly easy to write:
N slaves, plus a new master created to hand out work each time you have a
new job to do.
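
The shared-worker layout can be sketched roughly as follows. This is not PVM
code: Python threads and queues stand in for PVM tasks and messages, and
every name here (NUM_WORKERS, task_queue, worker, run_job, demo) is invented
for the illustration. A real version would spawn the workers once with PVM
and start one master per submitted job.

```python
import queue
import threading

NUM_WORKERS = 4             # assumption: one worker per CPU
task_queue = queue.Queue()  # a single work queue shared by every job

def worker():
    """Long-lived worker: take a work unit, process it, reply to its master."""
    while True:
        item = task_queue.get()
        if item is None:    # shutdown sentinel
            break
        func, arg, index, reply_queue = item
        reply_queue.put((index, func(arg)))

def run_job(func, args):
    """Per-job master: hand out work units, then collect results in order."""
    reply_queue = queue.Queue()
    for index, arg in enumerate(args):
        task_queue.put((func, arg, index, reply_queue))
    results = [None] * len(args)
    for _ in args:
        index, value = reply_queue.get()
        results[index] = value
    return results

def demo():
    """Run a large and a small job at once on the same worker pool."""
    workers = [threading.Thread(target=worker, daemon=True)
               for _ in range(NUM_WORKERS)]
    for t in workers:
        t.start()
    outputs = {}
    def master(name, func, args):
        outputs[name] = run_job(func, args)
    masters = [
        threading.Thread(target=master,
                         args=("big", lambda x: x * x, list(range(100)))),
        threading.Thread(target=master,
                         args=("small", lambda x: x + 1, [1, 2, 3])),
    ]
    for m in masters:
        m.start()
    for m in masters:
        m.join()
    for _ in workers:       # one sentinel per worker shuts the pool down
        task_queue.put(None)
    for t in workers:
        t.join()
    return outputs

if __name__ == "__main__":
    print(demo()["small"])
```

Because the shared queue is first-in first-out per work unit rather than per
job, a small job only waits for the units queued ahead of it, not for a
whole large job to finish. The number of running workers never exceeds
NUM_WORKERS, however many jobs are submitted.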

-- greg