[Beowulf] Problems with a JS21 - Ah, the networking...


Ivan Paganini ispmarin at gmail.com
Sat Sep 29 03:36:42 PDT 2007


Thank you, Bruce, I will try as soon as I have access to the cluster.

I already contacted Myricom support, John, and they are working on
it, but there is no solution yet. mx_counters on the two nodes where
I am running the mpich test programs doesn't show anything unusual:

1 ports
            Lanai uptime (seconds):     766268 (0xbb13c)
         Counters uptime (seconds):     766268 (0xbb13c)
                 Bad CRC8 (Port 0):          0 (0x0)
                Bad CRC32 (Port 0):          0 (0x0)
         Unstripped route (Port 0):          0 (0x0)
         pkt_desc_invalid (Port 0):          0 (0x0)
          recv_pkt_errors (Port 0):          0 (0x0)
            pkt_misrouted (Port 0):          0 (0x0)
                  data_src_unknown:          0 (0x0)
                    data_bad_endpt:          0 (0x0)
                 data_endpt_closed:          0 (0x0)
                  data_bad_session:          0 (0x0)
                   push_bad_window:          0 (0x0)
                    push_duplicate:          0 (0x0)
                     push_obsolete:          0 (0x0)
                  push_race_driver:          0 (0x0)
        push_bad_send_handle_magic:          0 (0x0)
                push_bad_src_magic:          0 (0x0)
                     pull_obsolete:          0 (0x0)
              pull_notify_obsolete:          0 (0x0)
                  pull_race_driver:          0 (0x0)
                  pull_notify_race:          0 (0x0)
                      ack_bad_type:          0 (0x0)
                     ack_bad_magic:          0 (0x0)
                   ack_resend_race:          0 (0x0)
                          Late ack:          0 (0x0)
           ack_nack_frames_in_pipe:          0 (0x0)
                    nack_bad_endpt:          0 (0x0)
                 nack_endpt_closed:          0 (0x0)
                  nack_bad_session:          0 (0x0)
                  nack_bad_rdmawin:          0 (0x0)
                  nack_eventq_full:          0 (0x0)
                  send_bad_rdmawin:          0 (0x0)
                   connect_timeout:          0 (0x0)
               connect_src_unknown:          0 (0x0)
                   query_bad_magic:          0 (0x0)
                   query_timed_out:          0 (0x0)
                 query_src_unknown:          0 (0x0)
                Raw sends (Port 0):     198711 (0x30837)
             Raw receives (Port 0):      84612 (0x14a84)
    Raw oversized packets (Port 0):          0 (0x0)
                  raw_recv_overrun:          0 (0x0)
                      raw_disabled:          0 (0x0)
                      connect_send:        698 (0x2ba)
                      connect_recv:        692 (0x2b4)
                 ack_send (Port 0):       1361 (0x551)
                 ack_recv (Port 0):       1353 (0x549)
                push_send (Port 0):        306 (0x132)
                push_recv (Port 0):          0 (0x0)
               query_send (Port 0):        114 (0x72)
               query_recv (Port 0):         12 (0xc)
               reply_send (Port 0):         12 (0xc)
               reply_recv (Port 0):        114 (0x72)
            query_unknown (Port 0):          0 (0x0)
            query_unknown (Port 0):          0 (0x0)
           data_send_null (Port 0):        382 (0x17e)
          data_send_small (Port 0):        255 (0xff)
         data_send_medium (Port 0):          0 (0x0)
           data_send_rndv (Port 0):         18 (0x12)
           data_send_pull (Port 0):          0 (0x0)
           data_recv_null (Port 0):        434 (0x1b2)
   data_recv_small_inline (Port 0):        174 (0xae)
     data_recv_small_copy (Port 0):         24 (0x18)
         data_recv_medium (Port 0):         19 (0x13)
           data_recv_rndv (Port 0):          0 (0x0)
           data_recv_pull (Port 0):         54 (0x36)
   ether_send_unicast_cnt (Port 0):      15990 (0x3e76)
 ether_send_multicast_cnt (Port 0):         10 (0xa)
     ether_recv_small_cnt (Port 0):      12205 (0x2fad)
       ether_recv_big_cnt (Port 0):       5234 (0x1472)
                     ether_overrun:          0 (0x0)
                   ether_oversized:         19 (0x13)
              data_recv_no_credits:          0 (0x0)
                    Packets resent:          0 (0x0)
  Packets dropped (data send side):          0 (0x0)
              Mapper routes update:         64 (0x40)
         Route dispersion (Port 0):          0 (0x0)
               out_of_send_handles:          0 (0x0)
               out_of_pull_handles:          0 (0x0)
               out_of_push_handles:          0 (0x0)
                  medium_cont_race:          0 (0x0)
                  cmd_type_unknown:          0 (0x0)
                 ureq_type_unknown:          0 (0x0)
                Interrupts overrun:          0 (0x0)
         Waiting for interrupt DMA:          0 (0x0)
         Waiting for interrupt Ack:          0 (0x0)
       Waiting for interrupt Timer:          0 (0x0)
                   Slabs recycling:          0 (0x0)
                    Slabs pressure:          0 (0x0)
                  Slabs starvation:          0 (0x0)
               out_of_rdma handles:          0 (0x0)
                       eventq_full:          0 (0x0)
              buffer_drop (Port 0):          0 (0x0)
              memory_drop (Port 0):          0 (0x0)
    Hardware flow control (Port 0):          0 (0x0)
(Devel) Simulated packets lost (Port 0):          0 (0x0)
   (Logging) Logging frames dumped:          0 (0x0)
                   Wake interrupts:        629 (0x275)
               Averted wakeup race:        326 (0x146)
                 Dma metadata race:          0 (0x0)
                               foo:          0 (0x0)
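
Since the listing is long, it can help to filter it down to the counters
that are not zero before comparing nodes. The following is only a minimal
sketch, assuming every counter line has the "name: decimal (0xhex)" layout
shown above; the function names parse_counters and nonzero are
illustrative and are not part of the MX tools.

```python
import re

# Matches lines such as "   Raw sends (Port 0):     198711 (0x30837)".
# Assumption: every counter line follows this "name: decimal (0xhex)" layout.
LINE_RE = re.compile(
    r"^\s*(?P<name>.+?):\s+(?P<value>\d+)\s+\(0x[0-9a-fA-F]+\)\s*$"
)


def parse_counters(text):
    """Return {counter name: integer value} for every line that matches.

    Lines that do not match (for example "1 ports") are skipped. If a name
    appears twice, the later line overwrites the earlier one.
    """
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


def nonzero(counters):
    """Keep only the counters whose value is not zero."""
    return {name: value for name, value in counters.items() if value != 0}


if __name__ == "__main__":
    # Two lines copied from the output above, used as a tiny sample.
    sample = (
        "                 Bad CRC8 (Port 0):          0 (0x0)\n"
        "                Raw sends (Port 0):     198711 (0x30837)\n"
    )
    for name, value in nonzero(parse_counters(sample)).items():
        print(f"{name}: {value}")
```

Saving the mx_counters output from each node to a file and passing the
file contents to parse_counters would give one dictionary per node to
compare.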

mx_endpoints shows no connections between any nodes when the program
is not running, and the right number of connections when the program
is running but hung.

I am just waiting for some user programs to finish so that I can stress
the Myrinet and try changing the driver from 1.1.6 to 1.2.2.
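
For the stress test, the interesting part is which counters move between
a snapshot taken before the run and one taken after it. Below is a small
sketch of that comparison, assuming the two snapshots are already
available as dictionaries mapping counter name to value; counter_deltas
is a hypothetical helper name, and the "after" numbers in the example are
made up for illustration, not measured.

```python
def counter_deltas(before, after):
    """Return {name: after - before} for every counter whose value changed.

    A counter missing from the "before" snapshot is treated as zero.
    """
    deltas = {}
    for name, new_value in after.items():
        old_value = before.get(name, 0)
        if new_value != old_value:
            deltas[name] = new_value - old_value
    return deltas


if __name__ == "__main__":
    # "before" uses values from the output above; "after" is invented.
    before = {"Packets resent": 0, "ack_send (Port 0)": 1361}
    after = {"Packets resent": 3, "ack_send (Port 0)": 1500}
    for name, delta in counter_deltas(before, after).items():
        print(f"{name}: +{delta}")
```

The uptime counters will always change between snapshots, so they can be
ignored; error counters that start to increase during the test are the
ones worth reporting.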

Thank you.

Ivan

2007/9/29, John Hearns <john.hearns at streamline-computing.com>:
> On Fri, 2007-09-28 at 17:43 -0300, Ivan Paganini wrote:
> > Hello everybody,
> >
> > I am beginning to take care of an IBM's JS21. The cluster consists of
>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs.
> >
> > Any suggestions?
>
> Contact Myricom support?
>
> BTW, if you are doing the debugging by yourself, start from the bottom.
> Take two machines, run mx_info, mx_endpoint (should be nothing if no
> programs running) and mx_counters.
> Then do your pingpong and further stress tests as in the README.
>


-- 
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------


