[Beowulf] running out of rsh ports

Dan Stromberg strombrg at dcs.nac.uci.edu
Wed May 3 13:07:02 PDT 2006


On Wed, 2006-05-03 at 15:21 -0400, Joe Landman wrote:
> David Simas wrote:
> 
> > Except that it probably won't help with the problem, which I'm
> > guessing is caused by a given host attempting more than 1024
> > RSH connections to a given server in less than TCP TIME_WAIT
> > seconds (minutes, whatever).  If the original correspondent
> 
> Actually it handles exactly these cases.  The FANOUT variable lets you 
> indicate the appropriate parallelism for rsh.  I believe pdsh is in use 
> on the big clusters ( > 1024 nodes at the national labs )

Nod.  I was pleased to learn of pdsh.  FWIW, loop doesn't try to run all
n at once either; its degree of parallelism is set with a command-line
option.
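
To make the "don't run all n at once" idea concrete, here is a minimal
Python sketch of a bounded fan-out. It is not how pdsh or loop are
implemented; the names run_bounded, make_command, and fanout are made up
for illustration, and the demo uses a harmless local command in place of
rsh so it can run anywhere.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_bounded(hosts, make_command, fanout=32):
    """Run one command per host, with at most `fanout` running at once.

    `make_command` (hypothetical helper) maps a host name to an argv list.
    Returns a dict mapping host -> (exit status, captured stdout).
    """
    def run_one(host):
        proc = subprocess.run(make_command(host), capture_output=True, text=True)
        return host, (proc.returncode, proc.stdout)

    # The pool size is the cap on simultaneous child processes.
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        return dict(pool.map(run_one, hosts))


if __name__ == "__main__":
    # Stand-in command so the sketch runs without a cluster; a real run
    # might build something like ["rsh", host, "uptime"] instead.
    def demo_command(host):
        return [sys.executable, "-c", f"print('hello from {host}')"]

    hosts = [f"node{i:03d}" for i in range(8)]
    results = run_bounded(hosts, demo_command, fanout=4)
    for host, (status, out) in sorted(results.items()):
        print(host, status, out.strip())
```

With rsh as the command, keeping fanout well below the number of
reserved ports available on the client is the point of the exercise.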

> > doesn't want to use SSH for RSH, which would fix things 
> 
> True, and you can use ssh with pdsh.  Or rsh.  With no syntax change to 
> the end user.
> 
> > SSH isn't restricted to low-numbered ports, he could try to
> > re-implement his application in MPI.
> 
> The basic question a few of us have is exactly what is Bruce and team 
> doing that is causing them to run out of ports.  Once we see this, we 
> can stop guessing and make better/targeted suggestions.

Yup, and strace/truss/whatever is your friend for that:

http://dcs.nac.uci.edu/~strombrg/debugging-with-syscall-tracers.html

...though based on the message, I'm guessing they are trying to run too
many rsh sessions in parallel, and hence running out of the reserved
(below 1024) ports that the rsh client has to bind to.
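
A rough back-of-the-envelope for that guess, as a Python sketch. The
numbers are assumptions, not measurements from Bruce's cluster: common
rresvport() implementations pick client ports from 512 through 1023, and
an rsh session can hold two of them (the command connection plus the
separate stderr channel). The hold_seconds parameter is likewise
hypothetical: it stands for however long a port stays unusable, i.e.
session length plus any TIME_WAIT the client side incurs.

```python
# Assumed values, not measurements from the cluster in this thread.
RESERVED_LOW = 512    # lowest port a typical rresvport() will try
RESERVED_HIGH = 1023  # highest privileged port
PORTS_PER_RSH = 2     # command connection plus the stderr channel


def max_concurrent_rsh(low=RESERVED_LOW, high=RESERVED_HIGH,
                       per_session=PORTS_PER_RSH):
    """Upper bound on simultaneous rsh sessions from one client host."""
    available = high - low + 1
    return available // per_session


def sustainable_rate(hold_seconds, **kwargs):
    """Sessions per second that can be started indefinitely if every
    session ties up its ports for `hold_seconds` in total."""
    return max_concurrent_rsh(**kwargs) / hold_seconds


if __name__ == "__main__":
    print("max concurrent rsh sessions:", max_concurrent_rsh())  # 256
    print("sessions/sec at a 60 s hold:", sustainable_rate(60.0))
```

Under these assumptions the ceiling is a few hundred sessions per client
host, which is easy to hit with an unthrottled loop over a large cluster.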





