[Beowulf] Parallel Programming Question
gus at ldeo.columbia.edu
Wed Jun 24 12:21:14 PDT 2009
Hi Amjad, list
Mark Hahn said it all:
2 is possible, but only works efficiently under certain conditions.
In my experience here, with ocean, atmosphere, and climate models,
I've seen parallel programs with both styles of I/O (not only input!).
Here is what I can tell about them.
#1 is the traditional way, the "master" processor
reads the parameter files (e.g. Fortran namelists) and data file(s),
broadcasts parameters that are used by all "slave" processors,
and scatters any data that will be processed in a distributed fashion
by each "slave" processor.
E.g., the seawater density may be a parameter used by all processors,
whereas the incoming solar radiation on a specific part of the planet
is only used by the particular processor that is handling that
specific area of the world.
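A minimal C sketch of that read-and-broadcast/scatter pattern (the file name, array sizes, and variable names are hypothetical, and error checking is omitted for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NPARAMS 4      /* parameters every rank needs (hypothetical)   */
#define NGRID   1024   /* global grid points, split per rank (ditto)   */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double params[NPARAMS];     /* e.g. seawater density, etc.      */
    double *global = NULL;      /* full field, master rank only     */
    double *local;              /* this rank's piece of the field   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nlocal = NGRID / nprocs;  /* assume divisible, for brevity */
    local = malloc(nlocal * sizeof(double));

    if (rank == 0) {
        /* Only the master touches the input file. */
        global = malloc(NGRID * sizeof(double));
        FILE *fp = fopen("input.dat", "rb");   /* hypothetical file */
        fread(params, sizeof(double), NPARAMS, fp);
        fread(global, sizeof(double), NGRID, fp);
        fclose(fp);
    }

    /* Everybody gets the shared parameters... */
    MPI_Bcast(params, NPARAMS, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* ...and each rank gets only its slice of the distributed data. */
    MPI_Scatter(global, nlocal, MPI_DOUBLE,
                local,  nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... compute on local[] ... */

    free(local);
    if (rank == 0) free(global);
    MPI_Finalize();
    return 0;
}
```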
For output it is about the same procedure in the reverse sense,
i.e., the "master" processor gathers the data from all "slave"
processors, and writes the output file(s).
That always works, there is no file system contention.
In case the data is too big for the master node memory capacity,
one can always split the I/O in smaller chunks.
One drawback of this mode is that it halts the computation
while the "master" is doing I/O.
There is also some communication cost associated with broadcasting,
scattering, or gathering data.
Another drawback is that you need to write more code for the I/O procedure.
In my experience this cost is minimal, and this mode pays off.
Normally funneling I/O through the "master" processor
takes less time than the delays caused by the contention
that would be generated by many processors trying to access the same
files, say, on a single NFS file server.
(If you have a super-duper parallel file system then this may be fine.)
In addition, MPI is in control of everything, so you are less dependent on
the behavior of the underlying file system.
I really prefer this mode #1.
However, I must say that the type of program we run here doesn't do
I/O very often, typically once every 1000 time steps (order of
magnitude), with lots of computation in-between.
For I/O intensive programs you may need strategy #2.
#2 is used by a number of programs we run.
Not because they really need to,
but because it takes less coding, I suppose.
They always cause a problem when the number of processors is big.
Most of these programs take a cavalier approach to I/O, which does not
take into account the actual file system in use (whether local disk,
NFS, or a parallel FS), the size of the files, the number of processors
in use, etc. I.e., they tend to ignore all the points that Mark
mentioned as important.
Often times these codes were developed on big iron machines,
ignoring the hurdles one has to face on a Beowulf.
In general they don't use MPI parallel I/O either
(or other developments built on MPI-IO, such as HDF5 parallel file I/O
or NetCDF-4 parallel file I/O).
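For reference, MPI-IO lets every rank write its slice of a distributed array into one shared file in a coordinated way. A minimal hedged sketch (file name and slice size are hypothetical, error checking omitted):

```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;          /* hypothetical slice size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *local = malloc(nlocal * sizeof(double));
    /* ... fill local[] with this rank's results ... */

    /* All ranks open the SAME file collectively... */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",   /* hypothetical name */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* ...and each writes at an offset computed from its rank,
     * letting the MPI-IO layer coordinate the access. */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, nlocal,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}
```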
These programs may improve in the future,
but the current situation is this: brute force "parallel" I/O.
(The right name would be something else, not "parallel I/O", perhaps
"contentious I/O", "stampede I/O", "looting I/O", or other.)
I have to run these programs on a restricted number of processors,
to keep NFS from collapsing.
Somewhere around 32-64 processors real problems start to happen,
and much earlier than that if your I/O and MPI networks are the same.
You can do some tricks, like increasing the number of NFS daemons,
changing the buffer size, etc, but there is a limit to how far the
tricks can go.
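For the record, on a typical Linux NFS setup those tricks look roughly like this (file locations vary by distribution, and the values here are purely illustrative):

```shell
# Server side: raise the number of nfsd threads
# (RHEL-style location; Debian uses /etc/default/nfs-kernel-server).
echo 'RPCNFSDCOUNT=32' >> /etc/sysconfig/nfs

# Client side: larger read/write buffers via mount options,
# e.g. a line like this in /etc/fstab (sizes illustrative):
#   server:/export  /home  nfs  rsize=32768,wsize=32768,hard,intr  0 0
```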
Parallel file systems presumably handle this situation more smoothly
(but they will cost more than a single NFS server).
However, this second approach should work well if you stage
in and stage out all parameter files and data to/from
**local disks** on each node (although with dual-socket quad-core
systems you may still have 8 processors reading
the same files at the same time).
This is typically done by the script you submit to the
resource manager/queue system.
The script stages in the input files,
then launches the parallel program (mpirun),
and at the end stages out the output files.
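A hedged sketch of such a submission script (PBS-style; the paths, node layout, scratch location, and program name are all hypothetical):

```shell
#!/bin/sh
#PBS -l nodes=4:ppn=8
# Hypothetical layout: $WORK sits on NFS, /scratch is a local disk.
WORK=$HOME/run01
SCRATCH=/scratch/$USER/$PBS_JOBID

# Stage in: copy parameters and input data to the local disk
# of every node (one copy per node, not per process).
for node in $(sort -u "$PBS_NODEFILE"); do
    ssh "$node" mkdir -p "$SCRATCH"
    scp "$WORK"/input*.dat "$WORK"/params.nml "$node:$SCRATCH/"
done

# Run: each rank reads from, and writes to, its node's local disk.
cd "$SCRATCH"
mpirun -np 32 "$WORK/mymodel"

# Stage out: bring the output back to the shared directory.
for node in $(sort -u "$PBS_NODEFILE"); do
    scp "$node:$SCRATCH/output*.dat" "$WORK/"
    ssh "$node" rm -rf "$SCRATCH"
done
```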
However, in the field I work in, stage-in/stage-out of data files has
mostly been phased out and replaced by files on NFS-mounted directories.
(I read here and elsewhere that I/O intensive parallel programs -
genome research, computational biology, maybe computational chemistry -
continue to use "stage-in/stage-out" precisely to avoid contention
over NFS, and to avoid paying more for a parallel file system.)
So, for low I/O-to-computation ratio, I suggest using #1.
For high I/O-to-computation ratio, maybe #2 using local disks and
stage-in/stage-out (if you want to keep cost low).
As usual, YMMV. :)
I hope this helps.
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
Mark Hahn wrote:
>> In an mpi parallel code which of the following two is a better way:
>> 1) Read the input data from input data files only by the master
>> and then broadcast it other processes.
>> 2) All the processes read the input data directly from input data
>> (no need of broadcast from the master process). Is it possible?
> 2 is certainly possible; whether it's any advantage depends too much
> on your filesystem, size of data, etc. I'd expect 2 to be faster only
> if your file setup is peculiar - for instance, if you can expect all
> nodes to have the input files cached already. otherwise, with a FS like
> NFS, 2 will lose, since MPI broadcast is almost certainly more
> time-efficient than N nodes all fetching the file separately.
> but you should ask whether the data involved is large, and whether each
> rank actually needs it. if each rank needs only a different subset of
> data, then reading separately could easily be faster.
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing