[Beowulf] [OOM killer/scheduler] disabling swap on cluster nodes?
dgruber at univa.com
Mon Feb 9 01:50:45 PST 2015
"schedd_job_info" does not scale by its nature: messages are
generated for every job, and the number of messages per job depends
on the cluster size. It is also questionable whether every scheduler
decision for each job and resource (queue instance) needs to be
recorded temporarily. Hence the recommendation is always to turn it
off (I think we changed the default to that in one of the last Sun
versions). Alternatively, you can use "qalter -w p <jobid>" to find
out why a job is not scheduled (it produces similar messages, but
only for one particular job).
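
A minimal sketch of how the commands mentioned in this thread fit together. The helper name job_info_setting is hypothetical (it is not an SGE command), and it assumes that "qconf -ssconf" prints the scheduler configuration as one "name value" pair per line:

```shell
# Print the current schedd_job_info value from scheduler-config text on stdin.
# job_info_setting is a hypothetical helper name, not part of SGE.
job_info_setting() {
    awk '$1 == "schedd_job_info" { print $2 }'
}

# Usage on a host with the SGE client tools installed:
#   qconf -ssconf | job_info_setting   # show the current setting
#   qconf -msconf                      # edit it, e.g. set schedd_job_info to false
#   qalter -w p <jobid>                # ask why one specific job is pending
```

The helper only reads text, so it can be tried on saved "qconf -ssconf" output without touching the live configuration.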
> On 09.02.2015 at 09:43, Remy Dernat <remy.dernat at univ-montp2.fr> wrote:
> On 09/02/2015 03:56, Christopher Samuel wrote:
>> On 07/02/15 14:57, Alan Louis Scheinine wrote:
>>> Only problem I've seen is that if a user allocates too much memory,
>>> OOM killer can kill maintenance processes such as a scheduler daemon.
>> This is why we disable overcommit. :-)
> I have already seen that problem on our master: the scheduler (SGE) ran out of memory and the OOM killer decided to kill it:
> Dec 1 15:01:07 cluster1 kernel: Out of memory: Kill process 7963 (sge_qmaster) score 948 or sacrifice child
> I resolved that issue by disabling "schedd_job_info" in SGE with "qconf -msconf".
> However, this setting gives significant information about our jobs.
> How should I adjust the OOM killer? Should I set
> = 2
> Best regards,
> Rémy Dernat