[Beowulf] question about enforcement of scheduler use

Mon May 22 05:45:17 PDT 2006

My apologies in advance if this is a FAQ, but I'm reading through the
documentation and tinkering with the problem below simultaneously, and
would appreciate help in at least focusing the problem and avoiding
going down useless paths (given my relative inexperience with clusters).

I'm primarily a Solaris sysadmin (and a somewhat specialized one at
that).  I've been given the task of administering a cluster (40 nodes
+ head) put together by Atipa, and have been scrambling to come up to
speed on Linux on the one hand and the cluster-specific software and
config files on the other.

I was asked by the folks in charge of working with the end users to
help migrate to enforcement of the use of a scheduler (in our case
PBSpro).  In preparation for this I was asked to isolate four nodes
and make those nodes accessible to end users only via PBSpro.

The most promising means I found in my searches was the one used
by Dr. Weisz, of modifying the PAM environment, limits.conf, and the
PBS prologue and epilogue files.  I found his document describing the
approach, but have not found his original prologue and epilogue scripts.

However, I wrote prologue and epilogue scripts that did what he described
(wrote a line of the form "${USER}   hard maxlogins 18  #${JOB_ID}"
to the limits.conf file on the target node, and erased it after the job
completed).
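
To make the mechanism concrete, here is a minimal sketch of what the two
scripts do, written in Python purely for illustration.  It is not the
actual prologue and epilogue (those are not reproduced in this message);
the function names, the `path` parameter, and the rewrite-via-temp-file
step are my assumptions.  Only the line format and the maxlogins value
of 18 come from the description above.

```python
import os
import shutil
import tempfile

# Usual pam_limits location; passed as a parameter so the sketch can be
# tried against a scratch copy instead of the live file.
LIMITS_CONF = "/etc/security/limits.conf"


def limits_line(user, job_id, maxlogins=18):
    """Build the per-job line, tagged with the job id so it can be found later."""
    return f"{user}   hard maxlogins {maxlogins}  #{job_id}\n"


def prologue(user, job_id, path=LIMITS_CONF):
    """Append the user's login allowance for this job (hypothetical helper)."""
    with open(path, "a") as f:
        f.write(limits_line(user, job_id))


def epilogue(job_id, path=LIMITS_CONF):
    """Drop every line tagged with this job id; keep all other lines."""
    tag = f"#{job_id}"
    with open(path) as f:
        kept = [ln for ln in f if not ln.rstrip().endswith(tag)]
    # Write a temp file in the same directory, then rename it over the
    # original so a reader never sees a half-written limits.conf.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.writelines(kept)
    shutil.copymode(path, tmp)  # preserve the original file's permissions
    os.replace(tmp, path)
```

Calling prologue("alice", "116.head", path="/tmp/limits.conf") and then
epilogue("116.head", path="/tmp/limits.conf") against a scratch copy
should leave the file as it started.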

If we limit the job to one node, the prologue and epilogue scripts run
with the intended effect.  The problem is that when we put the other three
target nodes in play, we get a failure on those three nodes, which I
suspect is due to an attempt by the application to communicate via ssh under
the user's id laterally from node to node.

PBS hands the job off to node037, which successfully runs its prologue
file.

Here are the contents of the output file:

Starting 116.head Thu May 18 15:10:48 CDT 2006
Initiated on node037

Running on 4 processors: node037 node038 node039 node040

Here's the error file:

Connection to node038 closed by remote host.
Connection to node039 closed by remote host.
Connection to node040 closed by remote host.
=>> PBS: job killed: walltime 159 exceeded limit 120

To clean up my question a bit I'll break it into four chunks:

1) Is the general approach I'm using appropriate for my intended effect
   (isolating four nodes and enforcing the use of the PBSpro scheduler
   on those nodes)?

2) If so what's the best way of allowing node-to-node communication, if
   indeed that's my likely problem?

3) If not does anyone have any other strategies for achieving what I'm
   after?

4) If the answer is RTFM could someone steer me towards the FMs or parts
   thereof I need to be perusing :-)

Thanks in advance.

Larry
-- 
========================================================
"I learned long ago, never to wrestle with a pig. You 
 get dirty, and besides, the pig likes it."

                              George Bernard Shaw
========================================================