[Beowulf] Google CPI process monitoring
john.hearns at mclaren.com
Fri Apr 12 04:32:59 PDT 2013
They're doing job overcommitting, but dynamically. Cool.
Each of our clusters runs a central scheduler and admission controller
that ensures that resources are not oversubscribed among the
latency-sensitive jobs, although it speculatively over-commits
resources allocated to batch ones. Overcommitting resources is a form
of statistical multiplexing, and works because most jobs do not use
their maximum required resources all the time. If the scheduler guesses
wrong, it may need to preempt a batch task and move it to another
machine; this is not a big deal - it's simply another source of the
failures that need to be handled anyway for correct, reliable operation.
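
To make the quoted policy concrete, here is a rough Python sketch of the idea. It is not Google's scheduler: the Task and Machine classes, the 1.5 overcommit factor, and the first-found eviction order are all assumptions made for illustration. Latency-sensitive tasks are admitted only while their reservations fit within real capacity; batch tasks are admitted against an inflated capacity; and if measured usage ever exceeds real capacity, batch tasks are evicted until it fits again.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    reserved: float  # CPUs requested, i.e. the most the task may need
    batch: bool      # True for batch, False for latency-sensitive


@dataclass
class Machine:
    capacity: float           # real CPUs on the machine
    overcommit: float = 1.5   # hypothetical factor, not from the paper
    tasks: list = field(default_factory=list)

    def admit(self, task: Task) -> bool:
        """Admit a task if its class-specific reservation limit allows it."""
        if task.batch:
            # Batch: total reservations may exceed real capacity.
            total = sum(t.reserved for t in self.tasks)
            ok = total + task.reserved <= self.capacity * self.overcommit
        else:
            # Latency-sensitive: never oversubscribed among themselves.
            ls = sum(t.reserved for t in self.tasks if not t.batch)
            ok = ls + task.reserved <= self.capacity
        if ok:
            self.tasks.append(task)
        return ok

    def preempt_if_overloaded(self, usage: dict) -> list:
        """Evict batch tasks until measured usage fits real capacity.

        usage maps task name -> measured CPU use. Returns evicted names,
        which a real scheduler would then restart on another machine.
        """
        load = sum(usage.get(t.name, 0.0) for t in self.tasks)
        evicted = []
        for t in [t for t in self.tasks if t.batch]:
            if load <= self.capacity:
                break
            self.tasks.remove(t)
            load -= usage.get(t.name, 0.0)
            evicted.append(t.name)
        return evicted
```

On an 8-CPU machine with a 6-CPU latency-sensitive task already admitted, a 4-CPU batch task is still accepted (10 <= 12), while a second 4-CPU latency-sensitive task is refused. The batch task only gets evicted if the measured load actually goes over 8 CPUs.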