[Beowulf] OOM errors when running HPL


Prentice Bisbal prentice at ias.edu
Mon Dec 22 05:52:44 PST 2008


Alan Louis Scheinine wrote:
> A year ago large memory jobs would cause AMD nodes to crash
> on the cluster for which I was system administrator.
> /var/log/messages showed out of memory errors before the crash.
> I can't say that the problem has been solved, I refer to last
> year because I changed jobs.
> 
> In order to understand if the problem is a known bug (as in the
> case cited above) please specify the main board, the amount of
> memory, the number of cores and the version of the kernel.
> 
> You wrote:
>> I used to run hpl jobs much bigger than this on my cluster w/o a
>> problem.
> 
> How does the amount of memory on the new cluster compare to the cluster
> in which you did not have a problem?  In particular, what is the amount of
> memory per core, assuming all cores were used in your testing?

Alan, thanks for the reply. It's the same cluster - jobs that ran on it
a few weeks ago are no longer running. There have been no hardware
changes, so I don't think it's a hardware problem. The only difference I
can think of is that I'm now using SGE to launch these jobs, which I may
not have been doing the last time I ran a job this big.

The only other possible software changes are kernel package updates that
may have occurred since the last successful run of a job this big.
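
For comparing the two launch paths, a minimal sketch along these lines
could be run once from a login shell and once from inside an SGE job on
the same node, and the two outputs diffed. It reports the details Alan
asked for (kernel version, core count, total memory, memory per core)
plus the per-process resource limits the job actually inherits. The
function name node_report and the particular limits listed are
illustrative choices, not something taken from this thread, and the
memory query assumes a Linux node.

```python
import os
import platform
import resource

# Per-process limits that can differ between a login shell and a batch job.
# This selection is an assumption made for illustration.
LIMITS = {
    "address_space": resource.RLIMIT_AS,
    "data_segment": resource.RLIMIT_DATA,
    "stack": resource.RLIMIT_STACK,
    "locked_memory": resource.RLIMIT_MEMLOCK,
}


def fmt_limit(value):
    """Render a limit as a string, naming the 'no limit' sentinel."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def node_report():
    """Collect kernel, core, memory and rlimit details for this process."""
    cores = os.cpu_count() or 1
    try:
        # Physical memory in bytes; these sysconf names exist on Linux.
        total = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError):
        total = None
    report = {
        "host": platform.node(),
        "kernel": platform.release(),
        "cores": cores,
        "mem_total_bytes": total,
        "mem_per_core_bytes": total // cores if total else None,
        "limits": {},
    }
    for name, rlim in LIMITS.items():
        soft, hard = resource.getrlimit(rlim)
        report["limits"][name] = (fmt_limit(soft), fmt_limit(hard))
    return report


if __name__ == "__main__":
    info = node_report()
    for key in ("host", "kernel", "cores",
                "mem_total_bytes", "mem_per_core_bytes"):
        print(f"{key}: {info[key]}")
    for name, (soft, hard) in info["limits"].items():
        print(f"limit {name}: soft={soft} hard={hard}")
```

If the limits differ between the interactive run and the SGE run, the
launcher is a likely suspect; if they match, the kernel updates become
the more interesting lead.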


-- 
Prentice


