[Beowulf] Parallel Programming Question
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Joe Landman landman at scalableinformatics.comTue Jul 17 04:39:43 PDT 2007
- Previous message: [Beowulf] Parallel Programming Question
- Next message: [Beowulf] Parallel Programming Question
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi James:
James Roberson wrote:
> Hello,
>
> I'm new to parallel programming and MPI. I've developed a simulator in
> C++ for which I would
> like to decrease the running time by using a Beowulf cluster. I'm not
> interested in optimizing
> speed, I'm just looking for a quick and easy way to significantly
> improve the speed over running
> the program on a single machine.
I am not sure I grasp "I'm not interested in optimizing speed, I'm just
looking for a quick and easy way to significantly improve the speed".
Basically I think you are saying you want to run it in parallel, in
order to speed up the program. Parallelization is an optimization.
> Basically, I need to learn how to parallelize a C++ function. The
> following functions in particular
> take the longest to run in my simulator. The first implements LU
> decomposition on a large matrix and
> the second implements the backsubstitution method to solve matrix division.
First, how large are your matrices? Second, how many times is this
called?
>
> void NR::ludcmp(Mat_IO_DP &a, Vec_O_INT &indx, DP &d)
> {
> const DP TINY=1.0e-20;
> int i,imax,j,k;
> DP big,dum,sum,temp;
>
> int n=a.nrows();
> Vec_DP vv(n);
> d=1.0;
> for (i=0;i<n;i++) {
> big=0.0;
> for (j=0;j<n;j++)
> if ((temp=fabs(a[i][j])) > big) big=temp;
(warning: coffee impaired, may not be correct in all aspects, notation,
typing, rambling ....)
abs is a slow function. Think about mathematical equivalents. abs(x)
maps x into the [0,) range. Multiplication is much faster than abs. If
your (x) is small enough,
temp=fabs(a[i][j]);
if ( (temp*temp) > (big*big) ) big = temp;
might be a bit faster.
> if (big == 0.0) nrerror("Singular matrix in routine ludcmp");
> vv[i]=1.0/big;
Second, you might want to define delta = some_small_number and compare
big to that. You would get a catastrophic loss of precision if you
divided by a very small number, and amplify error in the process.
Renormalization/scaling can be your friend, as well as row manipulation.
That said, I would suggest avoiding re-inventing wheels. There are a
number of excellent LU-decomposition and back substitution codes out
there ...
>
> (The functions are borrowed from the library provided by "Numerical
> Recipes in C++")
> I'm currently calling these functions from the main loop with the lines:
ah.... that explains it...
>
> NR::ludcmp(c,indx,d);
>
> and
>
> NR::lubksb(c,indx,xv);
>
> where the variable 'c' is a large matrix (holding image pixel values)
> and 'xv' is a vector
> used in backsubstitution.
> All of the variables passed into these functions are of types defined in
> " nr.h"
> 'c' is a Mat_DP (double-precision matrix)
> 'indx' is a Vec_INT (integer vector)
> 'd' is a DP (double precision)
> 'xv' is a Vec_DP (double precision vector)
>
> Is there a simple way to call these functions which will cause the
> cluster to distribute the
> load of operations? Currently, when I run the program with a 50x50
> array on a single machine,
> it takes about 5 minutes to process a single iteration through the
> matrix division.
Huh? What machine? What compiler? What options?
(more lack of coffee induced stupor here)
This should take well under 1 second. O(N**3) operations (50*50*50 =
0.125E+6 operations) on a 1 GFLOP machine ( 1E+9 FLOPS) should be
achievable in well under 1 second (ball park of 0.125/1000 seconds times
a constant)
Maybe I am missing something. This *should* be fast. Unless you are
running on a Palm pilot or something like that ...
In octave, using an interpreted LU decomposition on a hilbert matrix
h=hilb(50);
a=time ; [l,u,p] = lu(h); b=time; b-a
ans = 0.00063705
this is ball-park what I had estimated, within a constant factor
(cough). This is an Opteron 242 with 4 GB ram, so it isn't a speed demon...
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
- Previous message: [Beowulf] Parallel Programming Question
- Next message: [Beowulf] Parallel Programming Question
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
