[Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process
Tahir Malas tmalas at ee.bilkent.edu.tr
Wed Jun 13 05:37:08 PDT 2007
> -----Original Message-----
> From: Mark Hahn [mailto:hahn at mcmaster.ca]
> Sent: Tuesday, June 12, 2007 6:15 PM
> To: Tahir Malas
> Cc: mvapich-discuss at cse.ohio-state.edu; beowulf at beowulf.org;
> teoman.terzi at gmail.com; 'Ozgur Ergul'
> Subject: Re: [Beowulf] Two problems related to slowness and
> TASK_UNINTERRUPTABLE process
>
> > For 32 processes (4 process per node), the arrays with 512-Byte size are
> > communicated slower than the 4096-Byte size arrays. For both of them, we
>
> do you mean that this is not the case in other configurations?
> an interconnect _should_ have some steep rise in effective bandwidth
> as packet size is increased. it's a useful metric to know the packet
> size at which half-peak bandwidth is achieved, since this offers some
> "sense of scale" to programmers judging whether their own packet sizes
> are appropriate.
>
> > this abnormal case is persistent. More specifically, communication of
> > 4k-Byte packages are 2 times faster than the communication of 512-Byte
> > packages.
>
> perhaps I'm dense this morning, but what's unexpected about that?

Given the measured latency (us) and bandwidth (MB/s) for each message
size, I expected the following communication times in us
(latency + size / bandwidth):

    512:   5.48 +  512/592.34 =  6.34
    4096: 11.02 + 4096/906.04 = 15.54

Our test gives:

    512:  29.434
    4096: 16.209

The 4096-byte time matches the estimate, but the 512-byte time is more
than four times the estimate. So isn't the communication time for 512
bytes unexpectedly slow?

> >
> > 2. SOMETIMES, after the test with overall 32 processes, one of the four
> > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the
> > test program shows a "done." and waits for sometime. We can neither kill
> > the process nor soft reboot the node. We have to wait for that process to
> > terminate, which can last long.
>
> does /proc/$pid/wchan (on the 'D' state process) tell you anything?
> do all the ranks return from MPI_Finalize?
>

The file shows "__lock_buffer". Yes, all ranks return, but I think the
problematic process (i.e. one of the processes on node3) is always the
last to return.
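The wchan check suggested in the quoted reply can be scripted so that every process in the "D" state on a node is listed together with the kernel function it is waiting in. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names `parse_state` and `uninterruptible_processes` are illustrative and not part of any tool mentioned in this thread.

```python
import os


def parse_state(stat_text):
    """Return the one-letter process state from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so take the first field after the last ')'.
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, wchan) for every process currently in the 'D' state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                state = parse_state(f.read())
            if state != "D":
                continue
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except (OSError, ValueError):
            # The process exited meanwhile, or the file is not readable.
            continue
        found.append((int(entry), wchan))
    return sorted(found)


if __name__ == "__main__":
    for pid, wchan in uninterruptible_processes():
        print(pid, wchan)
```

Run on the affected node while the process is stuck, this would print one line per "D" state process. Reading wchan may require root on some systems, in which case the process is skipped.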
Thanks, and regards, Tahir.
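The estimate in the message above is a simple linear cost model: time = latency + size / bandwidth. A minimal sketch of that arithmetic follows, assuming the latency figures are in us and the bandwidth figures are in MB/s (so 1 MB/s equals 1 byte/us). The numbers are the ones quoted in the message; the names `expected_time_us`, `CASES`, and `slowdown_table` are illustrative only.

```python
def expected_time_us(size_bytes, latency_us, bandwidth_mb_s):
    """Linear cost model: startup latency plus size over bandwidth.

    1 MB/s is taken as 10**6 bytes per 10**6 us, i.e. 1 byte/us.
    """
    return latency_us + size_bytes / bandwidth_mb_s


# Message size (bytes) -> figures quoted in the message.
CASES = {
    512: {"latency_us": 5.48, "bandwidth_mb_s": 592.34, "measured_us": 29.434},
    4096: {"latency_us": 11.02, "bandwidth_mb_s": 906.04, "measured_us": 16.209},
}


def slowdown_table(cases):
    """Return (size, expected_us, measured_us, measured/expected) per size."""
    rows = []
    for size, c in sorted(cases.items()):
        expected = expected_time_us(size, c["latency_us"], c["bandwidth_mb_s"])
        rows.append((size, expected, c["measured_us"], c["measured_us"] / expected))
    return rows


if __name__ == "__main__":
    for size, expected, measured, ratio in slowdown_table(CASES):
        print(f"{size:5d} B  expected {expected:6.2f} us  "
              f"measured {measured:6.3f} us  ratio {ratio:4.2f}x")
```

With these inputs the 4096-byte measurement is within a few percent of the model, while the 512-byte measurement is more than four times the model's estimate, which is the discrepancy the message asks about.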
