landman at scalableinformatics.com
Fri Sep 27 07:45:47 PDT 2002
On Fri, 2002-09-27 at 07:27, Ivan Oleynik wrote:
> I have a problem with the PBS scheduler: every time I run an IO-intensive
> series of jobs, it goes down. As a result, the whole PBS queue, along with
> the other jobs, becomes suspended.
> I could not see any useful info in the sched_logs and server_logs files
> except for uninformative messages:
> 0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched,Could not contact Scheduler
This is actually quite informative. What I have experienced in the past
with PBS under heavy NFS load is that the cluster head node runs out of
TCP/UDP connection slots, as limited in the /etc/inetd.conf or
/etc/xinetd.conf files. Depending upon which one you use, you will need
to raise those limits.
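
As a rough illustration of what raising the limits might look like on a
head node that uses xinetd, here is a minimal sketch of the defaults block
in /etc/xinetd.conf. The attribute names (instances, cps, per_source) are
standard xinetd attributes, but the numbers are assumptions chosen only
for illustration; they are not values from this thread and should be
sized to your own cluster.

```
# /etc/xinetd.conf (fragment)
# All values below are illustrative assumptions, not recommendations.
defaults
{
        # Maximum number of server instances running at once per service.
        instances       = 200

        # Accept up to 100 incoming connections per second; if that rate
        # is exceeded, disable the service for 10 seconds.
        cps             = 100 10

        # Maximum number of simultaneous instances from one source address.
        per_source      = 50
}
```

xinetd has to re-read its configuration before new limits take effect.
With the classic Linux inetd, the comparable knob is, as far as I know,
the optional ".max" suffix on the wait/nowait field of a service line in
/etc/inetd.conf, which caps how many servers may be spawned per minute.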
> For this particular test I run a bunch of mpich jobs requesting just 1
> processor per job, and the number of the submitted jobs was 6 times the
> number of available nodes. Each job does intensive IO via NFS running over
> Myrinet (writing files ~ 300 Mb each).
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
phone: +1 734 612 4615