[Beowulf] Parallel Programming Question
Nifty Tom Mitchell
niftyompi at niftyegg.com
Thu Apr 9 11:17:00 PDT 2009
On Thu, Apr 09, 2009 at 08:15:07PM +0500, amjad ali wrote:
> Hello All,
> On my 4-node Beowulf Cluster, when I run my PDE solver code (compiled
> with mpif90 of openmpi-installed-with-gfortran) with -np 4 launched
> only on the Head Node (without providing -machinefile), it gives me
> correct results. ONLY one problem is there: when I monitor RAM
> behavior it gets filling at a constant speed throughout the RUN of
> program, till I get final result. So Why the usage of RAM is
> constantly increasing (although this is not the case with relevant
> serial code of the same problem/algorithm/method).
> Secondly when I launch the same compiled code on 4 nodes (with
> -machinefile option). Then I do not get correct result. The
> convergence gets very much slow down after few iterations, ultimately
> resulting in NaN values of problem variables.
> I would be very grateful for having comments for the remedy of above
> two difficulties/confusions.
Without the code there is not too much this group can do.
The constant increase in RAM sounds like a memory leak
or a natural result of the program's structure. You may need
to run a debugger to see what part of your code is triggering
the memory activity.
Getting different results locally and distributed sounds like
a bug in your code. Look for uninitialized data and code
that depends on side effects. Compiler flags can help: start by
dialing optimization down (-O0 -g), then look at -pedantic -Wall
-fbounds-check -fno-range-check -Wsurprising (see man gfortran). Perhaps
first verify that the N hosts have the same runtime libs and
hardware that the local host has. Different results when N changes
may also be a natural result of your code. While algebra tells us
that simple operations such as multiplication or addition are
associative, real floating point arithmetic is not, so a sum taken in
a different order can differ and can prove to be unstable. IEEE arithmetic
and IEEE exceptions control when and how libs return NaN etc. Since
this is sometimes managed via environment variables, look there as well.
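
To make the point about floating point order concrete, here is a small
sketch in Python (not your Fortran/MPI code). The helper names
ordered_sum and partitioned_sum and the sample values are made up for
illustration. It imitates summing the same data as one piece versus as
two "per-rank" partial sums that are then combined, which is roughly
what changes when N changes in a reduction.

```python
import math


def ordered_sum(values):
    """Add values strictly left to right in double precision.

    An explicit loop is used so the result does not depend on how the
    built-in sum() is implemented in a given Python version.
    """
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, nparts):
    """Hypothetical stand-in for a reduction over nparts ranks.

    Each "rank" sums a contiguous chunk, then the partial sums are
    combined in rank order.
    """
    chunk = (len(values) + nparts - 1) // nparts
    partials = [ordered_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return ordered_sum(partials)


# Illustrative data: large and small magnitudes mixed together.
values = [1.0e16, 1.0, -1.0e16, 1.0]

serial = partitioned_sum(values, 1)   # one rank sums everything
two = partitioned_sum(values, 2)      # two ranks, then combine
exact = math.fsum(values)             # correctly rounded reference

print("1 part :", serial)
print("2 parts:", two)
print("fsum   :", exact)

# The textbook three-term case: grouping alone changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("grouping differs:", left != right)
```

The three sums disagree even though the data are identical. In a solver
that iterates, small differences like this can grow, so a mismatch
between 1 host and N hosts is not always a bug. Slowing convergence
that ends in NaN is still more likely a real problem, such as
uninitialized data or a missed boundary exchange.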
T o m M i t c h e l l
Found me a new hat, now what?