[Beowulf] Any Gaussian users out there?
landman at scalableinformatics.com
Sun Jan 7 19:49:55 PST 2007
I found a neat ... feature ... of Linux while getting g03 running in SMP
on cluster nodes. Long story, but the folks I am doing this for don't
have/want to use Linda. They asked us to help them get g03 operational
in SMP parallel. This wasn't painful. Have it integrated into SGE and
our SICE interface now as well.
Basic idea is that we are getting a kernel exception in the VFS layer,
but only when running with 2 or more CPUs on an SMP node. It shows up
only on the SuSE 9.3 nodes. The other nodes are RHEL 3 based (2.4
kernel, but hey, it's really stable).
I don't want to post a nasty-looking trap here.
The problem occurs with both xfs and jfs. Haven't had the chance to try
ext3 yet, though if the issue is in the VFS layer, I can't see how
changing the underlying file system is going to alter the layer above
it (VFS).
The net effect of this is that it runs great on the 2.4 based machines,
but gets SIGKILLs when running on the 2.6 based SuSE 9.3 machines.
Looks like the app is tickling an OS bug. I can trigger this trap
repeatably, though the places where it occurs look "random" at first
(well, not really).
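One way to put a number on "repeatably" is a small harness that runs
the same job several times and records which runs died from SIGKILL.
The following is a minimal sketch, not the procedure used here: the
function name count_sigkills, the run count, and the command line you
pass in are all hypothetical placeholders, and it assumes a POSIX
system where Python reports death-by-signal as a negative return code.

```python
import signal
import subprocess


def count_sigkills(cmd, runs=10):
    """Run cmd `runs` times and return the indices of runs ended by SIGKILL.

    cmd is an argument list, e.g. the (site-specific, not shown here)
    command line that launches one SMP job.
    """
    killed = []
    for i in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a return code of -N means the child was ended by signal N.
        if result.returncode == -signal.SIGKILL:
            killed.append(i)
    return killed
```

Running it with the same input on a 2.4 node and on a 2.6 node would
give a simple per-kernel failure count to compare.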
The way Gaussian runs, it has "links", which are binary modules that
each execute a particular portion of the calculation (it's pretty neat
really). Each link is read in from the disk. This VFS bug gets
triggered regardless of whether the file system is local or remote.
Have any Gaussian users out there seen this? Does a kernel upgrade fix it?
Inquiring minds want to know ...
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615