[Beowulf] Problems with a JS21 - Ah, the networking...
Mark Hahn hahn at mcmaster.ca
Sat Sep 29 10:41:29 PDT 2007
> I sniffed the network in the store nodes interface, and i got lots of
> TCP lost fragment, previos lost fragments, ack lost fragments and TCP
> window size full. The GPFS is now heavily used.

so this indicates that you have a serious ethernet problem, no?

> The myrinet connection was working right, but sometimes a user program
> just got stuck - one of the processes was sleeping, and all others
> were running. Then, the program hangs. Investigating this further,
> this happened with the simple mpich examples like cpi, cpilog, etc. We
> are using the mx driver version 1.1.6, and mpich-mx 1.2.7..5. mx_info
> shows all nodes connected when this happens, and the switch did not
> overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> to the nodes, but somehow one (or more) process go to sleep or never
> starts, and all the other processes just hangs. The mx diagnose tools
> did not show any problem so far, but we still did not have done a

but spawning myrinet jobs normally involves some use of ethernet,
which has known problems. as I recall, the protocol involves a
rendezvous ethernet socket managed by the rank0 node. couldn't the
myrinet-starting problem simply be due to the eth problem, rather
than anything specific to myrinet?

here's an idea: configure ip-over-myrinet, and use it exclusively
to start the jobs. if that works, then you know for sure that the
problem is solely on the eth side (switch, perhaps, or maybe a nic
that's jabbering or otherwise misbehaving?)
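
a rough way to compare the two paths before touching the job launcher: open a batch of plain TCP connections to each node over its ethernet name and again over its ip-over-myrinet name, then count failures and the slowest connect. this is only a sketch, not how mpirun.ch_mx itself does its rendezvous. the host names, the "-myri" naming convention and the use of port 22 are all assumptions; substitute whatever your site actually uses.

```python
import socket
import time


def probe(host, port, attempts=20, timeout=2.0):
    """Open `attempts` TCP connections to host:port, one after another.

    Returns (failures, worst_seconds): how many connects failed or timed
    out, and the longest time any single attempt took.
    """
    failures = 0
    worst = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # connect and immediately close; we only care whether the
            # handshake completes and how long it takes
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failures += 1
        worst = max(worst, time.monotonic() - start)
    return failures, worst


def compare_paths(nodes, port=22, attempts=20, timeout=2.0):
    """Probe every node over both of its names.

    `nodes` maps a label to (ethernet_name, myrinet_ip_name). Both names
    are site-specific assumptions, as is the default port.
    """
    results = {}
    for label, (eth_name, myri_name) in nodes.items():
        results[label] = {
            "eth": probe(eth_name, port, attempts, timeout),
            "myri": probe(myri_name, port, attempts, timeout),
        }
    return results
```

run from the rank0 node, something like compare_paths({"node01": ("node01", "node01-myri")}) (hypothetical names) returns a (failures, worst_seconds) pair per path. failures or multi-second connects on the eth side only would point at the ethernet; clean results on both sides don't prove the ethernet is healthy, since a short probe can easily miss intermittent loss under GPFS load.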
