[Beowulf] disabling bad nodes


Reuti reuti at staff.uni-marburg.de
Mon Mar 27 12:19:34 PST 2006


Hi,

On 26.03.2006 at 21:07, James Rustad wrote:

> Guys
> This is a strange question, but
> Is there any way to disable a bad node in PBS without being the  
> system administrator?
> I am lining up about 50 jobs in the queue and they fail  
> sequentially when they hit
> the bad node.  This often seems to happen on the weekends when nobody
> is around to reboot the node.
>
> Can I specify within PBS "don't use node015" or something like that?
> Thanks
> Jim Rustad
> ps
> I may be using TORQUE rather than PBS, by the way

Although I can't answer your question directly: what is causing this
black hole in the cluster? I have run into this from time to time with
a full /tmp on some nodes. As we are using SGE, I use its load-sensor
facility to check the free space there and put the node into alarm
state when space runs low, i.e. disabling the queues on this node.
Maybe something similar could also be implemented with Torque, to get
some self-healing on weekends. - Reuti
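
The load-sensor idea above could be sketched roughly as follows. This is
a minimal sketch, not the actual script in use: it assumes the usual SGE
load sensor protocol, where the execution daemon writes a line to the
sensor's stdin to request a report and sends "quit" to stop it, and the
sensor answers with host:name:value lines between "begin" and "end". The
complex name "tmpfree" is hypothetical; it would have to be defined in
the complex configuration and given a load threshold on the queue before
the alarm state can trigger.

```python
import shutil
import socket
import sys


def free_mb(path="/tmp"):
    """Return free space on the filesystem holding `path`, in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def format_report(host, name, value):
    """Build one load report in the begin/end framing (assumed SGE protocol)."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Assumption: this host name matches the name the scheduler knows the node by.
    host = socket.gethostname()
    # Each line on stdin is a request for a fresh report; "quit" ends the sensor.
    for line in iter(stdin.readline, ""):
        if line.strip() == "quit":
            break
        # "tmpfree" is a hypothetical complex name chosen for this sketch.
        stdout.write(format_report(host, "tmpfree", free_mb("/tmp")))
        stdout.flush()


if __name__ == "__main__":
    main()
```

With a load threshold on the (hypothetical) tmpfree value, a node whose
/tmp fills up would stop receiving new jobs without anyone having to
intervene. Whether Torque offers a comparable hook is left open here, as
in the message above.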

