PBS Scheduler

Joe Landman landman at scalableinformatics.com
Sun Sep 29 18:41:12 PDT 2002


Hi Ivan:

  It is possible that you are still running out of slots.  Several
years ago, I found that for an 84-node cluster I needed about 1000
instances (we were using /etc/inetd.conf back then, but it is the same
concept).  This was also due in part to the nature of the code, which
used rsync to move data from the head node.
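
  For reference, the knob I am talking about is the "instances" (and
"cps") setting in /etc/xinetd.conf; it can go in the defaults section so
it applies to all services.  The numbers below are only a starting
point, scale them to your node count and job mix:

        defaults
        {
                instances       = 1000
                cps             = 50 30
        }

  After editing, restart xinetd (e.g. "/etc/init.d/xinetd restart") so
the new limits take effect.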

  It would be instructive to see your logs when the problems arise.
Note that these parameters do not affect the NFS system itself; they
only give PBS more connection slots.  If you are running out of network
bandwidth, you will see a specific set of messages in your
/var/log/messages file, and similarly if you do not have enough NFS
daemons running.  If IO on the head node is the bottleneck, it should
show up as high system times or lots of IO wait in sar, or as large
bi/bo numbers in vmstat.
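
  A couple of quick checks on the head node while the jobs are running
(paths are the usual Linux locations, adjust for your distribution):

        # how many NFS server threads are actually running
        ps ax | grep -c '[n]fsd'

        # any NFS or network complaints in the system log
        grep -i nfs /var/log/messages | tail -20

  On Red Hat-style systems the nfsd thread count is typically set via
RPCNFSDCOUNT in /etc/sysconfig/nfs; the usual default of 8 threads is
often too low for a head node serving 36 writers.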

  What is the load on your head node when you are getting these
messages?  Can you run "sar" or a similar data collection program?  (I
usually run a "vmstat 1 > /big/data/collection/file/path" to help me
understand the problems.)
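
  Something along these lines will catch the problem window (the output
paths are just examples, put the files anywhere with space):

        # leave these running across a few job submissions
        nohup vmstat 1 > /tmp/headnode-vmstat.log 2>&1 &
        nohup sar -u 5 720 > /tmp/headnode-sar.log 2>&1 &

  Then look at the system time and IO wait in the sar output, and the
bi/bo columns in the vmstat output, around the time the scheduler
dropped its connection.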

Joe

On Sun, 2002-09-29 at 21:04, Ivan Oleynik wrote:
> Joseph,
> 
> Thanks very much for your reply. I made the following changes to the
> xinetd.conf file:
> 
> instances 200   (original value: 60)
> cps 50 30       (original values: 25 30)
> 
> The first run of my test went well, with no problem reported from the
> PBS scheduler, but during the second run the same problem appeared:
> pbs_scheduler went down.
> 
> Do you think that the above parameters are not large enough to handle
> the NFS traffic from 36 processors simultaneously writing 300 MB per
> processor?
> 
> Ivan 
> 
> ------------------------------------------------------------------------
> 
> On 27 Sep 2002, Joseph Landman wrote:
> 
> > On Fri, 2002-09-27 at 07:27, Ivan Oleynik wrote:
> > > Hi,
> > > 
> > > I have a problem with the PBS scheduler: every time I run an IO-intensive
> > > series of jobs it goes down.  As a result, the whole PBS queue with other
> > > jobs becomes suspended.
> > > 
> > > I could not see any useful info in the sched_logs and server_logs files
> > > except for uninformative messages:
> > > 
> > > 0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched,Could not contact Scheduler
> > 
> > This is actually quite informative.  What I have experienced in the past
> > with PBS and heavy NFS loads is that the cluster head node runs out of
> > tcp/udp slots as specified in the /etc/inetd.conf or /etc/xinetd.conf
> > files.  Depending upon which one you use, you will need to bump those
> > limits up a bit. 
> > 
> > > For this particular test I ran a bunch of mpich jobs requesting just 1
> > > processor per job, and the number of submitted jobs was 6 times the
> > > number of available nodes.  Each job does intensive IO via NFS running
> > > over Myrinet (writing files of ~300 MB each).
> > 
> > [...]
> > 
> > -- 
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC
> > email: landman at scalableinformatics.com
> >   web: http://scalableinformatics.com
> > phone: +1 734 612 4615
> > 
> > 
> ------------------------------------------------------------------------
> Ivan I. Oleynik                       E-mail : oleynik at chuma.cas.usf.edu
> Department of Physics
> University of South Florida
> 4202 East Fowler Avenue                  Tel : (813) 974-8186
> Tampa, Florida 33620-5700                Fax : (813) 974-5813
> ------------------------------------------------------------------------
-- 
Joseph Landman, Ph.D
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615



