[Beowulf] question about enforcement of scheduler use

Andrew D. Fant fant at pobox.com
Mon May 22 18:05:02 PDT 2006


Larry,
    I'll echo what Chris said about not seeking a technical solution for a 
political problem.   You can't solve the latter by yourself, either.  Get the 
users involved and the people who write their evaluations, and agree to 
acceptable terms of use and some sort of penalty for violating them.

    Having said that, there are various technical things that one can do to 
limit the ability of a casually frustrated user to game the system.  One of my 
favorites involves putting a monitoring system in place that counts logins and 
lets you know when the login count goes over a given number (in my case, I like 
to set it to 1, so that I can get in to fix things in a shell if I have to, 
though you want to minimize this in a well-run cluster).  This relies on the 
batch system neither writing to wtmp nor starting a login session for users, so 
it might not work on PBSPro, but I like it.
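
    As a rough illustration of the login-count idea, here is a minimal Python 
sketch.  It is not the monitoring system I actually run; it only assumes that 
you can feed it the text output of `who` from a compute node, and that batch 
jobs do not show up there.  The names count_logins, over_limit and LOGIN_LIMIT 
are made up for this example.

```python
from collections import Counter

# Assumed threshold: one login is tolerated (the admin's own shell).
LOGIN_LIMIT = 1


def count_logins(who_output):
    """Count login sessions per user from `who`-style text output.

    Each non-empty line is treated as one session, and the first
    whitespace-separated field is taken to be the user name.
    """
    counts = Counter()
    for line in who_output.splitlines():
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def over_limit(who_output, limit=LOGIN_LIMIT):
    """Return True when the total number of sessions exceeds `limit`."""
    total = sum(count_logins(who_output).values())
    return total > limit
```

To use it, collect the output of `who` on each node (for example with 
subprocess.run(["who"], capture_output=True, text=True).stdout from a cron job) 
and send yourself mail whenever over_limit() returns True.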

     One other possibility that you might consider if you are somewhat desperate 
and you have a terminal server and serial port console on all the systems (or a 
separate management host that users cannot access) is to put a firewall rule or 
tcpwrappers rule in place that prevents ssh connections from the head node to 
the compute nodes.  Normally, I don't like firewalls on compute nodes, because 
they add to the failure modes in glorious and obscure ways, but this might be one 
way to buy time in an arms race to get management to understand a problem.  You 
probably will want to put a reverse rule in place as well, to keep users from 
submitting jobs to start ssh or sshd on a compute node and start a tunnel back 
to the head node that they can access.    Again, if it reaches this point, you 
probably already have a problem above your pay grade.
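
     If you do go down this road, it is worth checking that the block is 
actually in effect.  The sketch below is one hypothetical way to do that from 
the head node: it tries a plain TCP connection to the ssh port on each compute 
node and reports the ones that still answer.  The function names and the node 
names in the usage note are invented for illustration.  Note that this only 
detects firewall-level blocking; a tcpwrappers rule typically lets the TCP 
connection open and then drops it, so a node protected that way would still 
show up as reachable here.

```python
import socket


def ssh_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, unreachable, timed-out and unresolvable hosts.
        return False


def unblocked_nodes(nodes, port=22, timeout=2.0):
    """Return the subset of `nodes` that still accept connections on `port`."""
    return [node for node in nodes if ssh_reachable(node, port, timeout)]
```

Run from the head node, something like unblocked_nodes(["node01", "node02"]) 
should come back empty once the rule is in place.  The same check run from a 
compute node against the head node covers the reverse rule.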

     The last bit of advice that I will toss out is that you may want to 
seriously look at LSF or GridEngine for your cluster.  PBSPro does have 
commercial support behind it, but from what I have seen, its salad days are 
behind it.  If you need commercial support and industrial-strength integration, 
LSF is the market leader at this point, and if you are in need of low cost and 
current technologies, GridEngine is seeing consistent growth in user base and 
development.

      I see you are at Georgia State.  If you want to talk to someone face to 
face and have a real conversation about cluster management, email me off-list. 
I know some people in Atlanta who might be willing to give some advice to 
someone who has been thrown to the wulfs, as it were.

HTH,
	Andy

-- 
Andrew Fant    | And when the night is cloudy    | This space to let
Molecular Geek | There is still a light          |----------------------
fant at pobox.com | That shines on me               | Disclaimer:  I don't
Boston, MA     | Shine until tomorrow, Let it be | even speak for myself



