[Beowulf] Timeout in making connection to remote process...

jorgegg at sas.upenn.edu
Fri Jun 17 14:03:45 PDT 2005


Hi,
I'm running a Fortran 90 code with the MPI library on a Linux cluster with 7
nodes (I actually use only 6). I can change the "size" of the program, meaning
the number of operations to be performed, although all operations are the same.
The problem is that when I launch the program with mpirun, most of the time but
not always, it won't start and I get the following message (the cluster is
named max, and the node reported is not always number 2):
p0_20621:  p4_error: Timeout in making connection to remote process on maxsl2-d: 0
bm_list_20622:  p4_error: interrupt SIGINT: 2

At other times it runs fine, even with the same number of operations! It isn't
caused by other people using the cluster, because most of the time I'm the only
user. The same problem also sometimes arises after the program has been running
for 3 or 4 hours.
Do you have any idea why this happens? I estimate that with this number of
nodes my code will need around 3 weeks to finish, so I really need to be able
to rely on the computers to keep communicating.
Thank you very much, and please let me know if I didn't explain myself clearly.
Jorge






