[Beowulf] RE: [mvapich-discuss] Two problems related to slowness and TASK_UNINTERRUPTABLE process
Tahir Malas tmalas at ee.bilkent.edu.tr
Mon Jun 18 09:23:04 PDT 2007
Hi Sayantan,
We have installed OFED 1.2, and both of our problems are gone! There are no
longer any hanging processes or inconsistent communication times. With
OFED 1.2, our test gives:
PACKAGE SIZE 512 BYTES
1.76
PACKAGE SIZE 4096 BYTES
13.83
With OFED 1.1, the same test gave:
512: 29.434
4096: 16.209
Thanks and regards,
Tahir Malas
Bilkent University
Electrical and Electronics Engineering Department
Phone: +90 312 290 1385
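
The change between the two OFED versions can be expressed as a ratio per package size. The snippet below is only a small sketch using the four figures quoted in the message above; the post does not state their unit, so the ratios are unit-free, and the names `ofed_1_1`, `ofed_1_2` and `speedup` are made up for illustration.

```python
# Per-package timings copied from the message above.
# The post does not state the unit, so only ratios are computed.
ofed_1_1 = {512: 29.434, 4096: 16.209}
ofed_1_2 = {512: 1.76, 4096: 13.83}


def speedup(before, after):
    """Return the before/after ratio for each package size in both runs."""
    return {size: before[size] / after[size] for size in before if size in after}


ratios = speedup(ofed_1_1, ofed_1_2)
for size, ratio in sorted(ratios.items()):
    print(f"{size} bytes: {ratio:.2f}x faster with OFED 1.2")

# With OFED 1.1 the small package was the slower one; with OFED 1.2 it is not.
print("512 slower than 4096 under OFED 1.1:", ofed_1_1[512] > ofed_1_1[4096])
print("512 slower than 4096 under OFED 1.2:", ofed_1_2[512] > ofed_1_2[4096])
```

By this arithmetic the 512-byte case improves by a factor of roughly 16.7 and the 4096-byte case by roughly 1.17, so the upgrade mainly removed the small-package anomaly.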
> -----Original Message-----
> From: Sayantan Sur [mailto:surs at cse.ohio-state.edu]
> Sent: Tuesday, June 12, 2007 6:09 PM
> To: Tahir Malas
> Cc: mvapich-discuss at cse.ohio-state.edu; beowulf at beowulf.org;
> teoman.terzi at gmail.com; 'Ozgur Ergul'
> Subject: Re: [mvapich-discuss] Two problems related to slowness and
> TASK_UNINTERRUPTABLE process
>
> Hi Tahir,
>
> Thanks for sharing this data and your observations. It is interesting.
> We have a more recent release, MVAPICH-0.9.9, which is available from our
> website (mvapich.cse.ohio-state.edu) as well as with the OFED-1.2
> distribution. Could you please try out our newer release and see whether
> the results change or remain the same?
>
> Thanks,
> Sayantan.
>
> Tahir Malas wrote:
> > Hi all,
> > We have an HP cluster of 8 nodes, each with two quad-core processors,
> > connected via InfiniBand. We use Voltaire DDR cards and a 24-port switch,
> > together with OFED 1.1 and MVAPICH 0.9.7. We have two interesting
> > problems that we have not yet been able to solve:
> >
> > 1. In our test program, which mimics the communications in our code, the
> > nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7).
> > We perform one-to-one communications between these pairs of nodes
> > simultaneously, using blocking MPI send and receive calls to communicate
> > integer arrays of various sizes. In addition, we consider different
> > numbers of processes:
> > (a) 1 process per node, 8 processes overall: One link is established
> > between each pair of nodes.
> > (b) 2 processes per node, 16 processes overall: Two links are established
> > between each pair of nodes.
> > (c) 4 processes per node, 32 processes overall: Four links are
> > established between each pair of nodes.
> > (d) 8 processes per node, 64 processes overall: Eight links are
> > established between each pair of nodes.
> >
> > The timings we obtain are reasonable, except for the following
> > interesting comparison:
> >
> > For 32 processes (4 processes per node), the 512-byte arrays are
> > communicated more slowly than the 4096-byte arrays. For both sizes, we
> > send/receive 1,000,000 arrays and take the average to find the time per
> > package. Only the package size changes. We have run many trials and
> > confirmed that this anomaly is persistent. More specifically,
> > communication of 4096-byte packages is 2 times faster than communication
> > of 512-byte packages.
> >
> > The OSU bandwidth and latency tests around these sizes show:
> >
> > Size (bytes)   Bandwidth (MB/s)
> >  256            417.53
> >  512            592.34
> > 1024            691.02
> > 2048            857.35
> > 4096            906.04
> > 8192           1022.52
> >
> > Size (bytes)   Time (usec)
> >  256            4.79
> >  512            5.48
> > 1024            6.60
> > 2048            8.30
> > 4096           11.02
> >
> > So this behavior does not seem reasonable to us.
> >
> > 2. SOMETIMES, after the test with 32 processes overall, one of the four
> > processes on node3 hangs in the TASK_UNINTERRUPTIBLE ("D") state. As a
> > result, the test program prints "done." and then waits for some time. We
> > can neither kill the process nor soft-reboot the node. We have to wait
> > for that process to terminate, which can take a long time.
> >
> > Does anybody have any comments on these issues?
> > Thanks in advance,
> > Tahir Malas
> > Bilkent University
> > Electrical and Electronics Engineering Department
> >
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>
> --
> http://www.cse.ohio-state.edu/~surs
>
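
For readers trying to reproduce the test layout described in the quoted message, the node pairing can be sketched as follows. This is only an illustration, not the original test program: it assumes ranks are placed on nodes in contiguous blocks (ranks 0..k-1 on node 0, and so on), which the post does not state, and the names `partner_rank` and `links_between` are hypothetical.

```python
def partner_rank(rank: int, procs_per_node: int) -> int:
    """Return the peer rank for a given rank.

    Assumption (not stated in the post): ranks are placed on nodes in
    contiguous blocks, and each process talks to the process with the
    same local index on the partner node.
    """
    node, local = divmod(rank, procs_per_node)
    peer_node = node ^ 1  # flips 0<->1, 2<->3, 4<->5, 6<->7
    return peer_node * procs_per_node + local


def links_between(node_a: int, node_b: int, procs_per_node: int,
                  num_nodes: int = 8) -> int:
    """Count distinct rank pairs that connect node_a and node_b."""
    links = set()
    for rank in range(num_nodes * procs_per_node):
        peer = partner_rank(rank, procs_per_node)
        nodes = {rank // procs_per_node, peer // procs_per_node}
        if nodes == {node_a, node_b}:
            links.add(frozenset((rank, peer)))
    return len(links)


# Case (c) from the message: 4 processes per node, 32 processes overall.
for rank in range(8):
    print(f"rank {rank} <-> rank {partner_rank(rank, 4)}")
print("links between node 0 and node 1:", links_between(0, 1, 4))
```

Under this placement assumption, case (c) gives four rank pairs between nodes 0 and 1, matching the "four links" in the description. A different rank placement (for example round-robin across nodes) would need a different mapping.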
