restartability problem

Robert G. Brown rgb at phy.duke.edu
Mon May 6 11:40:05 PDT 2002


On Mon, 6 May 2002, Kelley Wittmeyer wrote:

> When our model is distributed over more than 1 process using MPI, we are 
> not getting the correct output files. The output file is a direct access 
> file in which each process writes to specific records in this single 
> file. This produces output files that are process independent. This 

When you say "direct access file", just what do you mean?  hash?  btree?
flat (recno)?  If it is flat (and I'd guess that it is), it sounds like
the processes aren't successfully locking the file when they attempt
their independent writes -- I get corruption just like this if I
accidentally open and write to an ordinary file that is already open and
being written to by another process.  Writes are not guaranteed to be
atomic, and the two programs can interleave the writing back of their
images and get all sorts of "interesting" results like the ones you
describe.  Collisions will be more likely for larger files, of course,
since there are more writes in flight that can overlap.

For help towards a solution (if you think this is the problem after
looking at it further) see "man open" and read what it says about the
O_EXCL flag and NFS:  O_EXCL can't be relied on for locking over NFS.
It sounds like you may have to play some games with link() and stat()
so that each MPI process gets the lock on the file in turn.
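
The link()/stat() game described in the open(2) man page goes roughly
like this: each process creates a uniquely named file on the same
filesystem as the lock, tries to link() it to the lock name, and then
stat()s its own unique file; a link count of 2 means it holds the lock.
Below is a minimal sketch of that idea in Python (os.link and os.stat
wrap the same system calls).  The function names, the lock file name,
the "owner" tag and the retry/delay values are all made up for
illustration and are not taken from your code.

```python
import os
import socket
import time


def acquire_lock(lockfile, owner=None, retries=50, delay=0.1):
    """Try to take `lockfile` with the link()/stat() trick.

    Returns the path of our unique file on success, None if we gave up.
    `owner` is a hypothetical tag; by default it is hostname.pid.
    """
    if owner is None:
        owner = "%s.%d" % (socket.gethostname(), os.getpid())
    # The unique file must live on the same filesystem as the lock.
    unique = "%s.%s" % (lockfile, owner)
    with open(unique, "w"):
        pass
    for _ in range(retries):
        try:
            os.link(unique, lockfile)
        except OSError:
            # Either someone else holds the lock, or the reply was lost.
            # The link count checked below is the real test.
            pass
        if os.stat(unique).st_nlink == 2:
            return unique
        time.sleep(delay)
    os.unlink(unique)
    return None


def release_lock(lockfile, unique):
    """Drop the lock name first, then our own unique file."""
    os.unlink(lockfile)
    os.unlink(unique)
```

Each MPI process would call acquire_lock() before opening the restart
file, do its writes, close the file, and only then call release_lock().
Note that the default owner tag assumes one lock attempt per process at
a time; two attempts with the same tag would share a unique file and
both see a link count of 2.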

Alternatively, you could have each process write to its own unique file
and do a merge, or you could carefully control the timing of the dbopens,
writes and closes so that two processes can never have the file open at
once.
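
The one-file-per-process route might look something like the sketch
below.  It assumes fixed-length records and that each process knows
which record numbers it owns; the record length, the on-disk layout of
the part files (record number followed by payload) and all of the names
are assumptions for illustration only.  A single process runs the merge
after every part file has been closed, so nothing ever writes to the
final file concurrently.

```python
import struct

RECLEN = 16                   # assumed fixed record length in bytes
HEADER = struct.Struct("<q")  # record number stored ahead of each payload


def write_part(path, records):
    """Run by one process: dump its own (record_number, payload) pairs."""
    with open(path, "wb") as f:
        for recno, payload in records:
            if len(payload) != RECLEN:
                raise ValueError("payload must be exactly RECLEN bytes")
            f.write(HEADER.pack(recno))
            f.write(payload)


def merge_parts(part_paths, out_path):
    """Run once, by a single process, after all part files are closed."""
    with open(out_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as f:
                while True:
                    head = f.read(HEADER.size)
                    if not head:
                        break
                    (recno,) = HEADER.unpack(head)
                    # Place the record at its direct-access position.
                    out.seek(recno * RECLEN)
                    out.write(f.read(RECLEN))
```

The merged file has the same flat layout no matter how many processes
produced the parts, which keeps the output process independent.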

Hope this helps.  Although your problem seems familiar, I don't usually
write to dbopen db's and it might be something else entirely.

   rgb

> technique has been used successfully on IBM SP, SGI O2K, and Apple G4 
> platforms. Problem symptoms are that the restart files are generally too 
> large (too many records) and larger files (say, over 10MB) are always 
> corrupted, but smaller files (say, under 1MB) are generally, but  not 
> always, ok. We suspect this is a problem w/ our file system setup (NFS). 
> Any ideas on how to fix this would be greatly appreciated.
> 
> Thank you in advance!
> Kelley Wittmeyer
> Atmospheric Science
> Colorado State University
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





