[Beowulf] MPI and Redhat9 NFS slow down

Jack Chen chimou at mail.wsu.edu
Mon Aug 23 15:52:20 PDT 2004


Hi all,

I'm not sure if this is the right place to post this question.  If it
is not, please tell me where the best place to get help on this would
be.  Thanks.

We recently built an 8-node PC Linux cluster running RedHat 9 (kernel:
2.4.20-8smp #1 SMP).  We use this system to run EPA's CMAQ
photochemical grid model.  I have installed the latest MPICH 1.2.6
with the Portland Group compiler (5.2-1), using ssh.  Everything worked
fine with the MPI example programs (cpi, pi3p, etc.) and 'make testing'.
However, when I run any program that writes output to other NFS-mounted
drives, I get a very long delay.  I'm not sure where the problem
is.  I know the NFS automount is working fine because if I start the
job with just one processor (mpirun -np 1), I don't experience the
slowdown.

For example:
If I start the job on the master node using 4 processors (mpirun -np 4)
and write to the master node (master2 0), with this
PIxxx file:
master2 0 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
node103 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
node103 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
node104 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a

the run takes 168 sec.

If I start the same job but write the output to any NFS-mounted drive
other than the master node's, the job is extremely slow.  In this case
the same job took 10962 sec, roughly 65 times longer.

I have tried mounting the drive with different parameters (rw,soft
and rw,hard,bg,intr,noac) and increased the number of nfsd daemons
from 8 to 16 on the NFS server, but nothing changed.
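
One way to narrow this down might be to time small synchronous writes
to each target directory outside of MPI, to see whether the NFS mount
alone reproduces the delay.  The following is only a minimal sketch
under that assumption, not part of the setup described above: the
function name time_small_writes, the block count, and the block size
are all hypothetical choices, and the directories to test are passed
on the command line.

```python
import os
import sys
import tempfile
import time


def time_small_writes(directory, count=200, size=4096, sync=True):
    """Write `count` blocks of `size` bytes to a temporary file in
    `directory` and return the elapsed wall-clock time in seconds.

    With sync=True every block is flushed and fsync'ed, which forces
    each write out to the server on an NFS mount.
    """
    block = b"x" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(count):
                f.write(block)
                if sync:
                    f.flush()
                    os.fsync(f.fileno())
        return time.perf_counter() - start
    finally:
        # Always remove the scratch file, even if a write failed.
        os.remove(path)


if __name__ == "__main__":
    # Hypothetical usage: pass one local and one NFS-mounted directory.
    # With no arguments, only the system temp directory is timed.
    for d in sys.argv[1:] or [tempfile.gettempdir()]:
        print(d, round(time_small_writes(d), 3), "sec")
```

Running it from a compute node against a local directory and against
each NFS-mounted directory would give a rough comparison; a large gap
would point toward the mount or the server rather than MPICH.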

If you have any idea what is going on, please help!

Any help or suggestions would be greatly appreciated.

Jack

 Jack Chen
 Laboratory for Atmospheric Research
 Dept.of Civil & Environmental Engineering
 Washington State University
 Pullman, WA 99164-2910
 509.335.5738
 509.335.7632 (FAX)

 


