<div>IMHO the hybrid approach (MPI+threads) is interesting when every MPI process has a lot of local data.</div>

<div> </div>

<div>If you have a cluster of quad-core nodes, you can either run one process per node with each process using 4 threads, or put one MPI process per core. The latter is simpler because it only requires MPI parallelism, but if the code is memory-bound and every MPI process holds much of the same data, it is better to share this common data among all cores on the same node and thus use threads within the node.

</div>
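<div>To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch in Python. The function name and the sizes (2 GiB of common data, 0.5 GiB of private data per core) are made up for illustration, and it assumes every process keeps exactly one copy of the common data.</div>

```python
def node_memory_gib(cores_per_node, shared_gib, private_gib, threads_per_process):
    """Estimate memory used on one node (hypothetical model).

    Assumes each process holds one copy of the common data and that the
    private working set scales with the number of cores in use.
    """
    if cores_per_node % threads_per_process != 0:
        raise ValueError("threads_per_process must divide cores_per_node")
    processes_per_node = cores_per_node // threads_per_process
    # One copy of the common data per process, one private working set per core.
    return processes_per_node * shared_gib + cores_per_node * private_gib


# Hypothetical quad-core node: 2 GiB common data, 0.5 GiB private data per core.
mpi_only = node_memory_gib(4, 2.0, 0.5, threads_per_process=1)
hybrid = node_memory_gib(4, 2.0, 0.5, threads_per_process=4)

print(f"one MPI process per core:   {mpi_only} GiB per node")
print(f"one process with 4 threads: {hybrid} GiB per node")
```

<div>Under these assumptions the thread-based layout keeps a single copy of the common data per node instead of four. Whether that saving matters depends on how large the common data really is compared to the private data.</div>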

<div> </div>

<div>toon</div>

<div><br><br> </div>

<div><span class="gmail_quote">On 11/30/07, <b class="gmail_sendername">Mark Hahn</b> &lt;<a href="mailto:hahn@mcmaster.ca">hahn@mcmaster.ca</a>&gt; wrote:</span>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">> Many problems decompose well in large chunks that are well done<br>> with MPI on different nodes, and tight loops that are best done

> locally with shared memory processors.<br><br>I think the argument against this approach is more based on practice than principles.  hybrid parallelism certainly is possible, and in the most abstract sense makes sense.

<br><br>however, it means a lot of extra work, and it's not clear how much benefit.<br>if you had an existing code or library which very efficiently used threads<br>to handle some big chunk of data, it might be quite simple to add some MPI

<br>to handle big chunks in aggregate.  but I imagine that would most likely<br>happen if you're already data-parallel, which also means embarrassingly<br>parallel.  for instance, apply this convolution to this giant image -

<br>it would work, but it's also not very interesting (ie, you ask "then what<br>happens to the image?  and how much time did we spend getting it distributed<br>and collecting the results?")<br><br>for more realistic cases, something like an AMR code, I suspect the code

<br>would wind up being structured as a thread dedicated to inter data movement,<br>interfacing with a thread task queue to deal with the irregularity of intra<br>computations.  that's a reasonably complicated piece of code I guess,

<br>and before undertaking it, you need to ask whether a simpler model of<br>just one-mpi-worker-per-processor would get you similar speed but with<br>less effort.  consider, as well, that if you go to a work-queue for handing

bundles of work to threads, you're already doing a kind of message passing. if we've learned anything about programming, I think it's that simpler mental models are to be desired.  not that MPI is ideal, of course!

<br>just that it's simpler than MPI+OpenMP.<br><br>-mark hahn<br></blockquote></div><br>
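<div>As a rough illustration of the quoted point that a work queue handing bundles of work to threads is already a kind of message passing, here is a minimal Python sketch. The function names and the toy workload (a sum of squares per bundle) are made up for illustration, and it is not taken from any real AMR code. It only shows the structure: in standard CPython the global interpreter lock keeps CPU-bound threads from running in parallel.</div>

```python
import queue
import threading


def worker(tasks, results):
    """Pull bundles of work off the queue until a sentinel arrives."""
    while True:
        bundle = tasks.get()      # "receive" a message
        if bundle is None:        # sentinel: no more work for this thread
            break
        # Toy computation standing in for the irregular intra-node work.
        results.put(sum(x * x for x in bundle))   # "send" the result back


def run(bundles, n_threads=4):
    """Distribute bundles over worker threads and combine their results."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for bundle in bundles:
        tasks.put(bundle)
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    # Exactly one result was produced per bundle.
    return sum(results.get() for _ in range(len(bundles)))


if __name__ == "__main__":
    # Bundles of different sizes mimic irregular work.
    print(run([[1, 2], [3], [4, 5, 6]]))
```

<div>The threads never touch each other's data directly. They only exchange bundles and results through the two queues, which is the same mental model as sending and receiving messages.</div>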