[Beowulf] Weird problem with mpp-dyna

Joe Landman landman at scalableinformatics.com
Wed Mar 14 11:54:53 PDT 2007


Joshua Baker-LePain wrote:

>> Do you use a statically linked binary or did you relink it with your
>> mpich?
> 
> Agh.  I forgot to mention this little wrinkle.  LSTC software 
> distribution is... interesting.

Yup.  Caused us a lot of fun at some customer sites.

>  For mpp-dyna, they ship dynamically 
> linked binaries compiled against a specific version of LAM/MPI (7.0.3 in 
> this case).

Yup.  Very hard to come by, that particular build.  Very hard.

> They also provide the matching pre-compiled LAM/MPI 
> libraries on their site. For a fun little wrinkle, RHEL/CentOS ships 
> LAM/MPI 7.0.6. However, the spec file in their RPM does *not* include 
> the --enable-shared flag.  IOW, the OS vendor's LAM/MPI package has no 
> .so files.

I rebuilt this (the LAM) for our customer.  Works nicely now.
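
If you want to reproduce the rebuild, it's roughly the following; the
exact spec and package names are from memory, so adjust to whatever
your CentOS release actually ships:

  # grab and install the source RPM
  rpm -ivh lam-7.0.6-*.src.rpm
  # add --enable-shared to the %configure invocation in the spec
  vi /usr/src/redhat/SPECS/lam.spec
  # rebuild, then install the resulting package (now with the .so files)
  rpmbuild -ba /usr/src/redhat/SPECS/lam.spec
  rpm -Uvh /usr/src/redhat/RPMS/*/lam-*.rpm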

> 
> It seems like it'd be worth re-compiling the centos lam RPM to include 
> the shared libraries and run against those to see if it helps.

Try an ldd against mpp-dyna-big-long-name and see what it actually 
resolves (or doesn't).
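
If the shared LAM libraries aren't installed, or aren't the ones the
binary was linked against, you'll see output along these lines (the
library names here are the usual LAM 7.x ones, yours may differ):

  $ ldd mpp-dyna-big-long-name
          libmpi.so.0 => not found
          liblam.so.0 => not found
          liblamf77mpi.so.0 => not found
          ...

Any 'not found' entries are the smoking gun.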

> 
>> We have run LSTC LS-DYNA mpp970 and mpp971 across more than 16 nodes
>> without any issues on Scyld CW4, which is also CentOS 4 based.
> 
> We can run straight structural sims across as many nodes/CPUs as we've 
> tried, and ditto for straight thermal sims.  It's just on coupled 
> structural/thermal sims that this issue crops up.  That, to me, rather 
> points to a bug in dyna itself.  But the fact that the bug manifests 
> itself (at least in part) by the MPI job trying to talk to a different 
> network interface than was 'lamboot'ed is what is throwing me off a bit.
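
One thing worth checking on the interface bit: lamnodes will show you
which node names the universe was actually booted with, and IIRC the
TCP addresses LAM uses follow how those names resolve on each node.
Booting from a schema that names the interconnect side explicitly is a
cheap test (hostnames below are made up):

  $ cat lamhosts
  n001-fast cpu=2
  n002-fast cpu=2
  $ lamboot -v lamhosts
  $ lamnodes

If the coupled runs still wander off to the wrong interface after that,
I'd agree it smells more like a dyna problem than an MPI setup problem.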



-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615



