[Beowulf] OOM errors when running HPL

Fri Dec 19 16:49:18 PST 2008

Prentice Bisbal wrote:
> I've got a new problem with my cluster. Some of this problem may be with
> my queuing system (SGE), but I figured I'd post here first.
> 
> I've been using hpl to test my new cluster. I generally run a small
> problem size (Ns=60000) so the job only runs 15-20 minutes. Last night, I
> upped the problem size by a factor of 10 (to Ns=600000). Shortly after
> submitting the job, half the nodes were shown as down in Ganglia.
> 
> I killed the job with qdel, and the majority of the nodes came back, but
> about 1/3 did not. When I came in this morning, there were kernel
> panic/OOM type messages on the consoles of the systems that never came
> back.
> 
> I used to run hpl jobs much bigger than this on my cluster w/o a
> problem. There's nothing I actively changed, but there might have been
> some updates to the OS (kernel, libs, etc) since the last time I ran a
> job this big. Any ideas where I should begin looking?
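
One thing worth checking before anything else is the arithmetic on the
problem size. HPL factors a dense N x N matrix of double-precision
values, so the memory needed grows with N squared: a 10x increase in Ns
is roughly a 100x increase in memory. The sketch below just does that
multiplication for the two Ns values quoted above. It assumes 8 bytes
per element and ignores HPL's workspace and MPI buffers; the function
name is made up for illustration, and the total RAM in the cluster is
not stated in this thread, so compare the result against your own
numbers.

```python
def hpl_matrix_bytes(n, bytes_per_element=8):
    """Bytes needed to hold an n x n matrix (8 bytes per double assumed)."""
    return n * n * bytes_per_element

# The two problem sizes mentioned in the quoted message.
for n in (60_000, 600_000):
    gb = hpl_matrix_bytes(n) / 1e9
    print(f"Ns={n}: about {gb:,.1f} GB for the matrix, summed over all nodes")
```

If the larger figure exceeds total RAM plus swap across the nodes, the
OOM killer firing would not be surprising regardless of OS updates.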

I've run into similar problems, and traced them to the way Linux
overcommits RAM. What are your vm.overcommit_memory and
vm.overcommit_ratio sysctls set to, and how much swap and RAM do the
nodes have?
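
If it helps, here is a minimal sketch for collecting those values on a
node. It reads the standard Linux /proc files; the helper names
(parse_meminfo, commit_limit_kb, report) are my own, not part of any
tool. The commit limit shown is the approximate formula the kernel
applies when vm.overcommit_memory is 2 (swap plus overcommit_ratio
percent of RAM) and it ignores huge pages, so treat it as an estimate.

```python
from pathlib import Path

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio):
    """Approximate commit limit in strict mode (huge pages ignored)."""
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100

def report(proc=Path("/proc")):
    """Print overcommit settings and memory totals for this node."""
    mode = int((proc / "sys/vm/overcommit_memory").read_text())
    ratio = int((proc / "sys/vm/overcommit_ratio").read_text())
    mem = parse_meminfo((proc / "meminfo").read_text())
    # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
    print(f"vm.overcommit_memory = {mode}")
    print(f"vm.overcommit_ratio  = {ratio}")
    print(f"MemTotal  = {mem['MemTotal']} kB")
    print(f"SwapTotal = {mem['SwapTotal']} kB")
    limit = commit_limit_kb(mem["MemTotal"], mem["SwapTotal"], ratio)
    print(f"Estimated commit limit if mode were 2: {limit} kB")

# Only run the report where the Linux sysctl files actually exist.
if __name__ == "__main__" and Path("/proc/sys/vm/overcommit_memory").exists():
    report()
```

Running it on one node that came back and one that did not would show
whether the settings differ between them.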

-- 
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/
