[Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster( 72 Cpu's)

Mark Hahn hahn at physics.mcmaster.ca
Wed Jan 3 07:53:35 PST 2007


> "  p1_8544: p4_error: Timeout in Establishing connection to remote process:
> 0  "
> rm_l_1_8667: (359.417969) net_send: could not write to fd=5, errno=104
>
> We have been trying the same for the past two days and we didn't get any
> solution for the above.

but what have you tried?  I would guess that this is a simple rsh config
problem, nothing to do with mpich.
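
(errno=104 on Linux is ECONNRESET, i.e. the remote end dropped the
connection, which fits a remote-shell problem.) A minimal sketch of how one
might check that, assuming the remote shell is rsh and that every node should
accept a no-op command without prompting. The function names, the default
timeout and the machinefile parsing ("host" or "host:ncpus" per line, "#"
for comments) are my own assumptions for illustration, not anything taken
from mpich:

```python
import subprocess


def can_start_remote(host, launcher=("rsh",), timeout=10):
    """Return True if `launcher host true` exits 0 within the timeout."""
    try:
        result = subprocess.run(
            [*launcher, host, "true"],
            stdin=subprocess.DEVNULL,   # never sit waiting at a password prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # launcher binary missing, or the connection hung
        return False
    return result.returncode == 0


def hosts_from_machinefile(lines):
    """Extract host names, assuming 'host' or 'host:ncpus' per line."""
    hosts = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hosts.append(line.split(":", 1)[0])
    return hosts


def failing_hosts(hosts, launcher=("rsh",), timeout=10):
    """Return the hosts on which the no-op remote command could not be run."""
    return [h for h in hosts if not can_start_remote(h, launcher, timeout)]
```

Feeding the lines of the machinefile through hosts_from_machinefile() and
then failing_hosts() gives the nodes to look at first. To test ssh instead,
pass launcher=("ssh", "-o", "BatchMode=yes") so that ssh fails rather than
asking for a password.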

> Also we downloaded the Latest MPICH 1.2.7p1 and configured the same. now for

but why do you think the problem lies with mpich?

> The same testing with LAM/MPI and OPENMPI are working fine.

openmpi being mostly just a later version of lam, and I think inheriting
lam's agent-based process-starting, no?

personally, I'm pretty convinced that MPI implementations should stay
out of the jobstarter business, and go with straight agentless (ssh-based)
job spawning.
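
A rough sketch of what I mean by agentless spawning, assuming passwordless
ssh to every node. RANK and NPROCS are hypothetical variable names made up
for this example; a real MPI launcher also has to pass connection/wire-up
information, which is left out here:

```python
import shlex
import subprocess


def ssh_commands(hosts, program, args=()):
    """Build one ssh command line per rank; nothing is executed here."""
    remote = " ".join(shlex.quote(a) for a in (program, *args))
    commands = []
    for rank, host in enumerate(hosts):
        # RANK/NPROCS are illustrative names, not any MPI library's convention
        env = f"RANK={rank} NPROCS={len(hosts)}"
        commands.append(
            ["ssh", "-o", "BatchMode=yes", host, f"env {env} {remote}"]
        )
    return commands


def launch(hosts, program, args=()):
    """Start every rank directly over ssh and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in ssh_commands(hosts, program, args)]
    return [p.wait() for p in procs]
```

There is no daemon to boot or clean up on the nodes: if ssh to a node works,
the job can start there, and if it doesn't, the failure shows up as an
ordinary ssh error.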


