restartability problem

Kelley Wittmeyer kelley at atmos.colostate.edu
Mon May 6 10:34:43 PDT 2002


Hi gang.

We've just stepped into the cluster arena and have everything working
except for program restartability.  Hoping someone can shed some
light on the problem.

hardware:	16 nodes, dual-processor AMD Athlon MP on Tyan Thunder
		Intel 10/100/1000 network cards (the switch is currently
			still 100Mb, however)
		lots of RAM, SCSI disks

software:	redhat 7.2
		mpich 1.2.x
		netcdf v3
		f90 (pgi 3.3-1)
		nfs (from the redhat distribution)

When our model is distributed over more than one process using MPI, we
are not getting correct output files. The output file is a direct-access
file, and each process writes to its own specific records in that single
file, so the output files are process-independent. This technique has
been used successfully on IBM SP, SGI O2K, and Apple G4 platforms. The
symptoms are that the restart files are generally too large (too many
records); larger files (say, over 10MB) are always corrupted, while
smaller files (say, under 1MB) are generally, but not always, OK. We
suspect a problem with our file system setup (NFS). Any ideas on how to
fix this would be greatly appreciated.
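
For concreteness, here is a minimal sketch of the record-addressed write
scheme described above, plus the size check that exposes the "too many
records" symptom. It is an illustration only, not our model code (which
is f90): the record length RECL, the round-robin record assignment, and
all function names are assumptions made for the example, and the ranks
are simulated one after another in a single process, so it shows the
intended file layout rather than reproducing the NFS behavior.

```python
import os

RECL = 64  # hypothetical record length in bytes (assumption)


def write_record(path, rec, payload, recl=RECL):
    """Write one fixed-length record at 1-based record number `rec`."""
    if len(payload) > recl:
        raise ValueError("payload longer than record length")
    data = payload.ljust(recl, b"\0")  # pad to the full record length
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Direct access: the byte offset depends only on the record number.
        os.pwrite(fd, data, (rec - 1) * recl)
    finally:
        os.close(fd)


def records_for_rank(rank, nprocs, nrec):
    """Hypothetical round-robin assignment of record numbers to a rank."""
    return range(rank + 1, nrec + 1, nprocs)


def check_size(path, nrec, recl=RECL):
    """True if the file holds exactly `nrec` records, no more and no less."""
    return os.path.getsize(path) == nrec * recl
```

Running check_size on a restart file right after the write phase gives a
quick yes/no on whether extra records have appeared, without having to
read the file back through the model.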

Thank you in advance!
Kelley Wittmeyer
Atmospheric Science
Colorado State University



