[Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process
Tahir Malas tmalas at ee.bilkent.edu.tr
Tue Jun 12 00:25:37 PDT 2007
Hi all,

We have an HP cluster of 8 nodes, each with two quad-core processors, connected via InfiniBand. We use Voltaire DDR cards and a 24-port switch, together with OFED 1.1 and MVAPICH 0.9.7. We have two problems that we have not yet been able to overcome:

1. In our test program, which mimics the communication in our code, the nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We perform one-to-one communication between these pairs of nodes simultaneously. We use blocking MPI send and receive calls to communicate an integer array of various sizes. In addition, we consider different numbers of processes:

(a) 1 process per node, 8 processes overall: one link is established between each pair of nodes.
(b) 2 processes per node, 16 processes overall: two links are established between each pair of nodes.
(c) 4 processes per node, 32 processes overall: four links are established between each pair of nodes.
(d) 8 processes per node, 64 processes overall: eight links are established between each pair of nodes.

The timings we obtain are as expected, except for the following comparison: with 32 processes (4 processes per node), 512-byte arrays are communicated more slowly than 4096-byte arrays. In both cases we send/receive 1,000,000 arrays and take the average to find the time per message; only the message size changes. We have run many trials and confirmed that this anomaly is persistent. More specifically, 4096-byte messages are communicated twice as fast as 512-byte messages.

The OSU bandwidth and latency tests around these sizes show:

    Size (bytes)    Bandwidth (MB/s)
    256             417.53
    512             592.34
    1024            691.02
    2048            857.35
    4096            906.04
    8192            1022.52

    Size (bytes)    Time (usec)
    256             4.79
    512             5.48
    1024            6.60
    2048            8.30
    4096            11.02

So this behavior does not seem reasonable to us.

2. Sometimes, after the test with 32 processes overall, one of the four processes on node3 hangs in the TASK_UNINTERRUPTIBLE ("D") state. As a result, the test program prints "done." and then waits for some time. We can neither kill the process nor soft-reboot the node. We have to wait for that process to terminate, which can take a long time.

Does anybody have any comments on these issues?

Thanks in advance,

Tahir Malas
Bilkent University
Electrical and Electronics Engineering Department
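
As a sanity check on problem 1, the OSU figures quoted above can be turned into an expected cost ratio between 4096-byte and 512-byte transfers. The following is only a minimal Python sketch, not the original test program. The helper names (`partner_rank`, `latency_ratio`, `streaming_time_ratio`) are hypothetical, and the block placement of ranks on nodes is an assumption, since the message does not say how ranks are mapped to nodes.

```python
# OSU figures quoted in the message: message size in bytes -> measured value.
OSU_LATENCY_USEC = {256: 4.79, 512: 5.48, 1024: 6.60, 2048: 8.30, 4096: 11.02}
OSU_BANDWIDTH_MBPS = {
    256: 417.53, 512: 592.34, 1024: 691.02,
    2048: 857.35, 4096: 906.04, 8192: 1022.52,
}


def partner_rank(rank, procs_per_node):
    """Return the peer of `rank` for the node pairing (0,1), (2,3), (4,5), (6,7).

    ASSUMPTION (not stated in the message): ranks are placed node by node,
    so node = rank // procs_per_node, and local slot i on one node talks to
    local slot i on the paired node.
    """
    node, slot = divmod(rank, procs_per_node)
    peer_node = node ^ 1  # flip the lowest bit: 0<->1, 2<->3, 4<->5, 6<->7
    return peer_node * procs_per_node + slot


def latency_ratio(big, small):
    """Ratio of the quoted OSU latencies for two message sizes."""
    return OSU_LATENCY_USEC[big] / OSU_LATENCY_USEC[small]


def streaming_time_ratio(big, small):
    """Ratio of per-message times implied by the bandwidth test (size / bandwidth).

    The bandwidth unit cancels in the ratio, so it does not matter how "MB" is defined.
    """
    return (big / OSU_BANDWIDTH_MBPS[big]) / (small / OSU_BANDWIDTH_MBPS[small])


if __name__ == "__main__":
    # With 4 processes per node, rank 5 is slot 1 on node 1; its peer is slot 1 on node 0.
    print(partner_rank(5, 4))                         # 1
    print(round(latency_ratio(4096, 512), 2))         # 2.01
    print(round(streaming_time_ratio(4096, 512), 2))  # 5.23
```

By either measure, a 4096-byte transfer between two otherwise idle nodes should cost roughly 2 to 5 times as much as a 512-byte transfer, which is the opposite of what the 32-process test reports.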
