[Beowulf] disabling bad nodes

Reuti reuti at staff.uni-marburg.de
Mon Mar 27 12:19:34 PST 2006


On 26.03.2006 at 21:07, James Rustad wrote:

> Guys,
> This is a strange question, but is there any way to disable a bad node
> in PBS without being the system administrator?
> I am lining up about 50 jobs in the queue and they fail sequentially
> when they hit the bad node. This often seems to happen on weekends,
> when nobody is around to reboot the node.
> Can I specify within PBS "don't use node015" or something like that?
> Thanks
> Jim Rustad
> P.S. I may be using TORQUE rather than PBS, by the way.

Although I can't answer your question directly: what is causing this
black hole in the cluster? I have run into this from time to time when
/tmp filled up on some nodes. As we are using SGE, I use its load-sensor
facility to check the free space there and put the node into alarm
state when it runs low, which disables the queues on that node. Maybe
something similar could also be implemented with Torque, to get some
self-healing at weekends. - Reuti
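
A minimal sketch of such a load sensor, for illustration only and not the script referred to above. It assumes the usual SGE load-sensor convention: the execution daemon writes a line to the sensor's stdin whenever it wants a report, the sensor answers with a begin / host:name:value / end block, and the word "quit" asks it to exit. The complex name tmpfree is hypothetical and would have to be defined by the SGE administrator, with a matching load threshold on the queue to trigger the alarm state.

```python
import shutil
import socket

COMPLEX_NAME = "tmpfree"  # hypothetical complex name, not from the original post
WATCHED_PATH = "/tmp"     # filesystem whose free space is reported


def free_megabytes(path):
    """Return the free space on the filesystem holding `path`, in whole MB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def format_report(host, complex_name, value):
    """Build one report block in the begin / host:name:value / end form."""
    return "begin\n{}:{}:{}\nend\n".format(host, complex_name, value)


def run(instream, outstream, host=None, path=WATCHED_PATH,
        complex_name=COMPLEX_NAME):
    """Answer each request line with one report until "quit" or end of input."""
    if host is None:
        # Assumption: the scheduler knows the node under this name.
        host = socket.gethostname()
    for line in instream:
        if line.strip() == "quit":
            break
        outstream.write(format_report(host, complex_name, free_megabytes(path)))
        # Flush so the daemon sees the report immediately.
        outstream.flush()
```

To use this as an actual sensor, the script would end with a call such as run(sys.stdin, sys.stdout) after importing sys. Taking the streams as parameters keeps the reporting loop easy to try out by hand before handing it to the scheduler.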

