Better rsh with timeout option?

alan at dasher.wustl.edu
Fri Feb 8 06:32:12 PST 2002


:(where lilith is a laptop that happens to be entirely off, so it doesn't
:even ping).  In cases where the host pings but sshd happens to be turned
:off it tries four times and then quits -- about four seconds.  I haven't
:tried it when a host is hung in exactly the state you describe (pingable
:so there is a route, but sshd nonresponsive but not quite dead because
:of thrashing) but it seems likely that it would be handled somewhere
:between these two extremes.

    Not always -- there are times when this sort of assumption will fail,
    and ssh will hang forever.  We used to run into this trouble routinely 
    on our cluster when running large G98 jobs, and although the problems
    diminished when we upgraded to the 2.4 kernels, they never entirely
    disappeared.  

    I handled this by wrapping the ssh calls in my Perl script in
    alarms, so that the Perl code enforced the timeout that ssh
    wouldn't.  This approach wasn't foolproof -- for reasons I don't
    entirely understand, every once in a while the Perl code would dump
    core.  However, since it was only a status monitor, I just set up a
    cron job to restart it if it stopped.
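
    For anyone who wants to try the alarm approach, here is a minimal
    sketch of the idea.  It is written in Python rather than Perl and is
    not the code from the status monitor; run_with_alarm and the node
    name in the usage comment are hypothetical names chosen for
    illustration.  It relies on SIGALRM, so it assumes a Unix-like
    system and must be called from the main thread.

```python
import signal
import subprocess


class Timeout(Exception):
    """Raised by the SIGALRM handler when the time limit expires."""


def _on_alarm(signum, frame):
    raise Timeout()


def run_with_alarm(cmd, seconds):
    """Run cmd (a list of arguments).

    Returns the command's exit status, or None if it did not finish
    within `seconds` and had to be killed.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        signal.alarm(seconds)      # request SIGALRM after `seconds`
        try:
            return proc.wait()     # interrupted by Timeout if the alarm fires
        except Timeout:
            proc.kill()            # the child is hung; do not leave it behind
            proc.wait()            # reap it so no zombie remains
            return None
    finally:
        signal.alarm(0)            # cancel any alarm still pending
        signal.signal(signal.SIGALRM, old_handler)


# Hypothetical usage: treat a node as down if ssh takes longer than 10 s.
# status = run_with_alarm(["ssh", "node01", "uptime"], 10)
# if status is None:
#     print("node01 timed out")
```

    The point of the design is that the caller, not ssh, owns the
    deadline: a hung ssh is killed and reported as a timeout rather
    than blocking the monitor forever.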

    If you're interested, you can grab the Perl code for my status monitor
    at <http://dasher.wustl.edu/~alan/software/>.  psh2, which runs
    commands on each node in turn, does not have this timeout code built
    into it, because the user tells it which nodes to run on and can
    manually skip nonresponsive nodes.

    Hope this helps,

Alan Grossfield
----------------------------------------------------------
|  Programming: a pastime similar to banging one's head  |
|  against a wall, but with fewer opportunities for      |
|  reward.           The Jargon File                     |
----------------------------------------------------------



More information about the Beowulf mailing list