[Beowulf] Any Gaussian users out there?
Rafael R. Pappalardo
rafapa at us.es
Tue Jan 9 00:33:42 PST 2007
On Monday 08 January 2007 04:49, Joe Landman wrote:
> I found a neat ... feature ... of Linux while getting g03 running in SMP
> on cluster nodes. Long story, but the folks I am doing this for don't
> have/want to use Linda. They asked us to help them get g03 operational
> in SMP parallel. This wasn't painful. Have it integrated into SGE and
> our SICE interface now as well.
> Basic idea is that we are getting a kernel exception in the VFS layer
> only when running with 2 or more CPUs on an SMP node. Shows up only on
> SuSE 9.3 nodes. The other nodes are RHEL 3 based (2.4 kernel, but hey,
> it's really stable).
> I don't want to post a nasty-looking trap here.
> The problem occurs with both xfs and jfs. Haven't had the chance to try
> ext3 yet, though if the issue is in the vfs layer, I can't see how
> changing the underlying block device is going to alter the layers (VFS)
> above it.
> The net effect of this is that it runs great on the 2.4 based machines,
> but gets SIGKILLs when running on the 2.6 based SuSE 9.3 machines.
> Looks like the app is tickling the OS bug. I can repeatably cause this
> trap, though it seems to occur at "random" places, well, not really.
> The way Gaussian runs, it has "links" which are binary modules which
> execute a particular portion of the calculation (it's pretty neat
> really). Each link is read in from the disk. This VFS bug gets
> triggered regardless of local or remote FS.
> Any Gaussian users out there see that? Does a kernel upgrade fix it?
> Inquiring minds want to know ...
I don't know if it's thread-related, but setting LD_ASSUME_KERNEL to
2.4.1 in the environment sometimes solves this kind of problem.
There are other possible values; you can have a look at:
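
For anyone who wants to try this without changing the environment of the whole node, here is a minimal sketch in Python of a wrapper that sets LD_ASSUME_KERNEL only for the child process. The helper names (env_with_assumed_kernel, run_with_assumed_kernel) and the g03 command line in the comment are my own assumptions for illustration, not anything shipped with Gaussian or SGE. The variable asks the dynamic linker to choose libraries built for an older kernel ABI, so whether it helps depends on the glibc installed on the node.

```python
import os
import subprocess


def env_with_assumed_kernel(version="2.4.1", base=None):
    """Return a copy of the environment with LD_ASSUME_KERNEL set.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["LD_ASSUME_KERNEL"] = version
    return env


def run_with_assumed_kernel(cmd, version="2.4.1"):
    """Run `cmd` (a list of arguments) with LD_ASSUME_KERNEL set.

    Only the child process sees the variable. Returns its exit code.
    """
    if not cmd:
        raise ValueError("cmd must contain at least the program name")
    result = subprocess.run(cmd, env=env_with_assumed_kernel(version))
    return result.returncode


# Hypothetical usage; the command line is an assumption, adjust it to
# however g03 is started on your nodes:
#   rc = run_with_assumed_kernel(["g03", "input.com"])
```

If the job then fails to start at all, the installed glibc probably has no libraries matching that ABI, and the variable should simply be dropped again.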
Dr. Rafael R. Pappalardo
Dept. Physical Chemistry, Univ. de Sevilla (Spain)
e-mail: rafapa at us.es