Hi all,<br><br>I need/request some help from those who have some experience in debugging/profiling/tuning parallel scientific codes, specially for PDEs/CFD.<br><br>I have parallelized a Fortran CFD code to run on Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is that:<br>

<br>Suppose that the grid/mesh is decomposed for n number of processors, such that each processors has a number of elements that share their side/face with different processors. What I do is that I start non blocking MPI communication at the partition boundary faces (faces shared between any two processors) , and then start computing values on the internal/non-shared faces. When I complete this computation, I put WAITALL to ensure MPI communication completion. Then I do computation on the partition boundary faces (shared-ones). This way I try to hide the communication behind computation. Is it correct?<br>

<br>IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less elements) with an another processor B then it sends/recvs 50 different messages. So in general if a processors has X number of faces sharing with any number of other processors it sends/recvs that much messages. Is this way has "very much reduced" performance in comparison to the possibility that processor A will send/recv a single-bundle message (containg all 50-faces-data) to process B. Means that in general a processor will only send/recv that much messages as the number of processors neighbour to it.  It will send a single bundle/pack of messages to each neighbouring processor.<br>

Is their "quite a much difference" between these two approaches?<br><br>THANK YOU VERY MUCH.<br>AMJAD.<br>