Writing/Reading Files

Josip Loncaric josip at icase.edu
Wed May 8 09:14:36 PDT 2002


"Robert G. Brown" wrote:
> 
> On Tue, 7 May 2002, Timothy W. Moore wrote:
> 
> > This is getting frustrating.  I have an application where each node
> > creates its own data file.  When I go to process these files with a
> > serial application on the host, it can only read the first timestep
> > contained within the file.  Could this have something to do with NFS...
> 
> I am still having a bit of trouble visualizing your difficulty.

Perhaps this is similar to a problem one of our users had.  His parallel
optimization code works like this (a rough sketch in C follows the list):

(1) process 0 writes many input files, each of which defines a test case
(2) MPI_Barrier(MPI_COMM_WORLD);
(3) many processes compute their own test cases, write own output files
(4) MPI_Barrier(MPI_COMM_WORLD);
(5) process 0 reads output files then selects new test cases
(6) repeat from (1) until some criterion is satisfied
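
In outline, the loop looks something like this.  This is a hedged
sketch only: the file names, the fixed 10-pass loop, and the toy
"computation" are invented for illustration (the real programs were
binary-only), and error checking is kept to the read that the NFS race
actually breaks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, iter, i;
        char name[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (iter = 0; iter < 10; iter++) {      /* (6) stand-in criterion */
            if (rank == 0) {                     /* (1) write input files */
                for (i = 0; i < nprocs; i++) {
                    FILE *f;
                    sprintf(name, "case%03d.in", i);
                    f = fopen(name, "w");
                    fprintf(f, "%d %d\n", iter, i);
                    fclose(f);
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);         /* (2) */

            {                                    /* (3) run own test case */
                FILE *f;
                int it, id;
                sprintf(name, "case%03d.in", rank);
                f = fopen(name, "r");
                fscanf(f, "%d %d", &it, &id);
                fclose(f);
                sprintf(name, "case%03d.out", rank);
                f = fopen(name, "w");
                fprintf(f, "%d\n", it + id);     /* stand-in for real work */
                fclose(f);
            }
            MPI_Barrier(MPI_COMM_WORLD);         /* (4) */

            if (rank == 0) {                     /* (5) read all outputs */
                for (i = 0; i < nprocs; i++) {
                    FILE *f;
                    int val;
                    sprintf(name, "case%03d.out", i);
                    f = fopen(name, "r");        /* this is where NFS bites */
                    if (f == NULL || fscanf(f, "%d", &val) != 1)
                        fprintf(stderr, "output %d incomplete or missing\n", i);
                    if (f != NULL)
                        fclose(f);
                }
            }
        }
        MPI_Finalize();
        return 0;
    }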

The problem was this: MPI_Barrier() completes only a few hundred
microseconds after the last write() returns, while NFS may take much
longer to finish committing that file to the NFS server.  The barrier
synchronizes the processes, not the filesystem: the clients' dirty
pages can still be in flight when the barrier returns.  This coding
style is also particularly hard on NFS, since the server gets bombarded
with 20-30 simultaneous writes just before step (4), which means the
server can (and sometimes does) run out of nfsd threads, forcing many
retransmissions and retries.  Meanwhile, process 0, having passed the
barrier within microseconds, can find that the file it wants to read is
still incomplete or missing.

The solution was to insert a 3-second "sleep" after the MPI_Barrier() in
step (4).
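
In code, the workaround is just this (assuming <unistd.h> is included;
the 3 seconds is an empirical value for that cluster, not a guarantee):

    MPI_Barrier(MPI_COMM_WORLD);  /* (4): every rank has written its file */
    sleep(3);                     /* empirical pause so the NFS server can
                                     finish committing the output files */
    /* ...only now does process 0 go on to read them in step (5) */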

Sincerely,
Josip

P.S.  This NFS behavior is by design, and "noac" does not help much.  In
general, NFS is a slow and unreliable way to pass data between
processes.  MPI is much better -- IF you have source code access.  In
this case, only binary executables that communicate through input/output
files were available...
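
With source access, the file hand-off in steps (3)-(5) could be replaced
by message passing, which removes NFS from the loop entirely.  A minimal
sketch (the "result" computation here is just a stand-in):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        double result, *all = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        result = (double)(rank * rank);      /* stand-in for a test case */
        if (rank == 0)
            all = malloc(nprocs * sizeof(double));

        /* The gather is both the data transfer and the synchronization:
         * no output files, hence no NFS race. */
        MPI_Gather(&result, 1, MPI_DOUBLE,
                   all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (i = 0; i < nprocs; i++)
                printf("case %d -> %g\n", i, all[i]);
            free(all);
        }
        MPI_Finalize();
        return 0;
    }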

P.P.S.  As someone has pointed out here, one should check
/proc/net/rpc/nfsd on the NFS server: the last number on the "th" line
indicates the number of times all nfsd threads were in use at once.  If
this seems high, increase the number of nfsd threads by editing the
RPCNFSDCOUNT=8 line in the /etc/rc.d/init.d/nfs script.  We use
RPCNFSDCOUNT=32 on our main server now.  Moreover, if you use "soft" NFS
mounts on the NFS clients, increase the number of retransmissions (the
default is retrans=3) to something like retrans=10.  Finally, use the
"noac" option when mounting NFS filesystems (it is required by MPI-IO).


-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


