[Beowulf] Problems with a JS21 - Ah, the networking...

Ivan Paganini ispmarin at gmail.com
Sat Sep 29 12:40:26 PDT 2007


Hello Mark!

2007/9/29, Mark Hahn <hahn at mcmaster.ca>:
> > I sniffed the network on the store nodes interface, and I got lots of
> > TCP lost fragment, previous lost fragments, ack lost fragments and TCP
> > window size full. The GPFS is now heavily used.
>
> so this indicates that you have a serious ethernet problem, no?

I think so too, and this is the most likely explanation. But IBM does
not accept that there is an error in the hardware, and while I argue
with them about it, I have been searching for other causes of the
ethernet problem.

>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs. Investigating this further,
> > this happened with the simple mpich examples like cpi, cpilog, etc. We
> > are using the mx driver version 1.1.6, and mpich-mx 1.2.7..5. mx_info
> > shows all nodes connected when this happens, and the switch did not
> > overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> > to the nodes, but somehow one (or more) processes go to sleep or never
> > start, and all the other processes just hang. The mx diagnose tools
> > did not show any problem so far, but we still did not have done a
>
> but spawning myrinet jobs normally involves some use of ethernet,
> which has known problems.  as I recall, the protocol involves a
> rendezvous ethernet socket managed by the rank0 node. couldn't the
> myrinet-starting problem simply be due to the eth problem, rather than
> anything specific to myrinet?
>
> here's an idea: configure ip-over-myrinet, and use it exclusively
> to start the jobs.  if that works, then you know for sure that the
> problem is solely on the eth side (switch, perhaps, or maybe a nic
> that's jabbering or otherwise misbehaving?)

I have configured ip-over-myrinet, but I'm not sure how to use
myrinet exclusively. I will have to look into this further.

My configuration is as follows: I am using mpich-mx v 1.2.7..5, and
I configured each blade with one IP address on the myrinet interface
using ifconfig, for example:

    ifconfig myri0 192.168.30.101

Then, in a file called list, I put

    192.168.30.101:4

(each blade has 4 cores), and ran the job using

    mpirun.ch_mx -v -machinefile list -np 4 ./program

Does this still involve ethernet?
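
One part of this can be checked mechanically: whether every entry in the
machinefile is a literal address on the ip-over-myrinet network rather than a
hostname or an ethernet address. The following is a minimal sketch in Python,
not part of mpich-mx. The helper names are hypothetical, the /24 netmask is an
assumption (only the address 192.168.30.101 is given above), and lines starting
with "#" are assumed to be comments. It says nothing about which interface the
launcher's rendezvous socket uses; it only validates the file.

```python
import ipaddress

# Assumption: the ip-over-myrinet addresses share a /24 network. The
# netmask is not stated in the message, only the address 192.168.30.101.
MYRINET_NET = ipaddress.ip_network("192.168.30.0/24")


def parse_machinefile(text):
    """Return (host, slots) pairs from 'host:slots' lines; slots defaults to 1."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and lines assumed to be comments.
        if not line or line.startswith("#"):
            continue
        host, _, slots = line.partition(":")
        entries.append((host, int(slots) if slots else 1))
    return entries


def hosts_outside_network(entries, network=MYRINET_NET):
    """List hosts that are not literal IP addresses inside the given network."""
    outside = []
    for host, _ in entries:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # A hostname: it may resolve to the ethernet interface.
            outside.append(host)
            continue
        if addr not in network:
            outside.append(host)
    return outside


if __name__ == "__main__":
    entries = parse_machinefile("192.168.30.101:4\n")
    print(entries)
    print(hosts_outside_network(entries))
```

An empty list from hosts_outside_network means every entry in the file is an
address on the assumed myrinet network; any hostname or other address is
reported so it can be looked at by hand.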

Thank you very much.

-- 
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------


