[Beowulf] Any Gaussian users out there?

Joe Landman landman at scalableinformatics.com
Sun Jan 7 19:49:55 PST 2007


I found a neat ... feature ... of Linux while getting g03 running in SMP 
on cluster nodes.  Long story, but the folks I am doing this for don't 
have/want to use Linda.  They asked us to help them get g03 operational 
in SMP parallel.  This wasn't painful.  Have it integrated into SGE and 
our SICE interface now as well.

Basic idea is that we are getting a kernel exception in the VFS layer 
only when running with 2 or more CPUs on an SMP node.  Shows up only on 
SuSE 9.3 nodes.  The other nodes are RHEL 3 based (2.4 kernel, but hey, 
it's really stable).

I don't want to post a nasty-looking trap here.

The problem occurs with both xfs and jfs.  Haven't had the chance to try 
ext3 yet, though if the issue is in the VFS layer, I can't see how 
changing the underlying filesystem is going to alter the layer (VFS) 
above it.

The net effect of this is that it runs great on the 2.4 based machines, 
but gets SIGKILLs when running on the 2.6 based SuSE 9.3 machines. 
Looks like the app is tickling an OS bug.  I can repeatably cause this 
trap, though it seems to occur at "random" places (well, not really 
random).  The way Gaussian runs, it has "links", which are binary modules 
that each execute a particular portion of the calculation (it's pretty 
neat, really).  Each link is read in from the disk.  This VFS bug gets 
triggered regardless of local or remote FS.
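
For anyone trying to reproduce this, it helps to confirm that a failed run 
really died from SIGKILL rather than from an ordinary Gaussian error exit. 
Here is a minimal sketch in Python 3, assuming the job can be launched from 
a wrapper.  The names describe_exit and run_and_describe are made up for 
illustration and are not part of Gaussian or SGE.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated
        # signal number, so -9 means the child was killed by SIGKILL.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by " + name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run argv to completion and describe how it ended."""
    return describe_exit(subprocess.run(argv).returncode)
```

Wrapping whatever command the job script uses to start g03 in 
run_and_describe() would then tell a signal death apart from a normal 
non-zero exit.  The kernel trap itself still has to be read from the 
node's kernel log.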

Any Gaussian users out there see that?  Does a kernel upgrade fix it? 
Inquiring minds want to know ...

-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615



