[Beowulf] Problems with a JS21 - Ah, the networking...

Ivan Paganini ispmarin at gmail.com
Sat Sep 29 12:40:26 PDT 2007


Hello Mark!

2007/9/29, Mark Hahn <hahn at mcmaster.ca>:
> > I sniffed the network on the store nodes' interface, and I got lots of
> > TCP lost fragment, previous lost fragments, ack lost fragments and TCP
> > window size full. The GPFS is now heavily used.
>
> so this indicates that you have a serious ethernet problem, no?

I also think so, and this is the most likely explanation. But IBM does
not accept that there is an error in the hardware, and while I argue
with them about it, I have been trying to search for other causes of the
Ethernet problem.

>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs. Investigating this further,
> > this happened with the simple mpich examples like cpi, cpilog, etc. We
> > are using the mx driver version 1.1.6, and mpich-mx 1.2.7..5. mx_info
> > shows all nodes connected when this happens, and the switch did not
> > overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> > to the nodes, but somehow one (or more) processes go to sleep or never
> > start, and all the other processes just hang. The mx diagnostic tools
> > did not show any problem so far, but we still did not have done a
>
> but spawning myrinet jobs normally involves some use of ethernet,
> which has known problems.  as I recall, the protocol involves a
> rendezvous ethernet socket managed by the rank0 node. couldn't the
> myrinet-starting problem simply be due to the eth problem, rather than
> anything specific to myrinet?
>
> here's an idea: configure ip-over-myrinet, and use it exclusively
> to start the jobs.  if that works, then you know for sure that the
> problem is solely on the eth side (switch, perhaps, or maybe a nic
> that's jabbering or otherwise misbehaving?)

I have configured IP-over-Myrinet, but I'm not sure how to use
Myrinet exclusively. I will have to look into this further.

My configuration is as follows: I am using mpich-mx v 1.2.7..5, and
configured each blade with one IP address using ifconfig, like
ifconfig myri0 192.168.30.101

Then, in a file called list, I put
192.168.30.101:4
(each blade has 4 cores).

and ran it using
mpirun.ch_mx -v -machinefile list -np 4 ./program

Does this still involve Ethernet?
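
For reference, the machinefile in the host:cores format above could be
generated with a small script rather than by hand. This is only a sketch:
the function name make_machinefile, the consecutive numbering of the
blades and the count of 3 are assumptions for illustration; only the
address 192.168.30.101 and the 4 cores per blade come from my setup.

```shell
#!/bin/sh
# Sketch: print one "host:cores" machinefile line per blade.
# Arguments (all hypothetical except the values noted above):
#   $1 = first three octets of the myri0 subnet, e.g. 192.168.30
#   $2 = last octet of the first blade, e.g. 101
#   $3 = number of blades (assumed to be numbered consecutively)
#   $4 = cores per blade
make_machinefile() {
    base=$1
    first=$2
    count=$3
    cores=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "${base}.$((first + i)):${cores}"
        i=$((i + 1))
    done
}

# Write entries for three blades into the file called list.
make_machinefile 192.168.30 101 3 4 > list
```

The resulting file can then be passed with -machinefile list as in the
command above.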

Thank you very much.

-- 
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------