[Beowulf] TCP connect error: ECONNREFUSED.
David Simas dgs at slac.stanford.edu
Tue Mar 31 10:57:03 PDT 2009
On Mon, Mar 30, 2009 at 02:14:50PM +0100, Jörg Saßmannshausen wrote:
> Dear all,
>
> I am having this rather annoying problem with the parallel execution of
> one of the programs (GAMESS US version) on our cluster. The error
> message is:
>
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 0.
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 60.
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 2.
> TCP connect error: ECONNREFUSED.
>
> [ ... ]
>
> Eventually, the ddikick tips over and the whole thing crashes. The
> program is using rsh (yes, I know, security, I did not install the
> cluster!) and I can rsh comp10 -> comp02 and there is no firewall
> installed between the nodes (at least, not that I am aware of). Trying
> to run the same job with the same number of nodes will fail X times and
> at X+1 suddenly work. I could not work out a pattern for that (other
> than that I get exponentially annoyed). Right now, there is only one
> gigabit network connecting the cluster, so nfs, mpi etc. are all running
> over one interface (again, I did not set up the cluster).

How rapidly are these rsh connection attempts occurring? The rsh
protocol requires connections to originate from privileged ports, i.e.
ports below 1024. If a host attempts more than 1024 connections to
another host within the TCP TIME-WAIT interval, it will run out of ports
and the connections will fail. I've seen this occur with parallel
applications using rsh.

David S.

>
> I have run out of ideas of where to look. I checked (as quickly as
> possible) at some nodes with netstat, the ddikick program is actually
> running. Changing to ssh did not solve the problem.
>
> I would appreciate any feedback as it is highly annoying to wait Y days
> to get the job running and then it crashes.
>
> All the best from Glasgow!
>
> Jörg
>
>
> --
> *************************************************************
> Jörg Saßmannshausen
> Research Fellow
> University of Strathclyde
> Department of Pure and Applied Chemistry
> 295 Cathedral St.
> Glasgow
> G1 1XL
>
> email: jorg.sassmannshausen at strath.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
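
The port-exhaustion argument in the reply can be checked with some rough arithmetic. The sketch below is only a simplified model, not a measurement of the cluster in question: it assumes connection attempts are spread evenly over a time window, that each connection is short-lived, and that a source port is unusable for the whole TIME-WAIT interval afterwards. The function names are made up for illustration, and the defaults (1023 usable privileged ports, 60 seconds of TIME-WAIT) are assumptions. The real pool may well be smaller, since some rsh client libraries only draw from part of the privileged range, and the TIME-WAIT length depends on the operating system.

```python
"""Back-of-the-envelope check for source-port exhaustion with rsh-style clients."""


def max_sustained_rate(usable_ports: int, time_wait_s: float) -> float:
    """Highest steady rate (connections per second) to a single destination
    that never needs a source port still held in TIME-WAIT."""
    if usable_ports <= 0 or time_wait_s <= 0:
        raise ValueError("usable_ports and time_wait_s must be positive")
    return usable_ports / time_wait_s


def peak_ports_held(attempts: int, window_s: float, time_wait_s: float) -> float:
    """Estimate how many source ports are tied up at the worst moment.

    Assumes `attempts` short-lived connections spread evenly over `window_s`
    seconds, each leaving its port in TIME-WAIT for `time_wait_s` seconds.
    """
    if attempts < 0 or window_s <= 0 or time_wait_s <= 0:
        raise ValueError("attempts must be >= 0; window_s and time_wait_s > 0")
    # If the whole burst fits inside one TIME-WAIT interval, no port is
    # recycled and every attempt holds its own port. Otherwise only the
    # attempts made during the last `time_wait_s` seconds still hold one.
    fraction_still_held = min(1.0, time_wait_s / window_s)
    return attempts * fraction_still_held


def ports_exhausted(
    attempts: int,
    window_s: float,
    usable_ports: int = 1023,   # assumption: every port below 1024 is usable
    time_wait_s: float = 60.0,  # assumption: OS-dependent TIME-WAIT length
) -> bool:
    """True if the modelled burst needs more ports than the pool provides."""
    return peak_ports_held(attempts, window_s, time_wait_s) > usable_ports


if __name__ == "__main__":
    print("max steady rate:", max_sustained_rate(1023, 60.0), "connections/s")
    print("2000 attempts in 30 s exhausts ports:", ports_exhausted(2000, 30.0))
    print("2000 attempts in 600 s exhausts ports:", ports_exhausted(2000, 600.0))
```

Under these assumptions a job that opens a couple of thousand rsh connections to one node within a few tens of seconds would exceed the pool, while the same number spread over several minutes would not. Counting sockets in the TIME_WAIT state with netstat on the originating node while a job starts is one way to compare the model against what actually happens.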
