[Beowulf] cluster softwares supporting parallel CFD computing
Eric W. Biederman
ebiederm at xmission.com
Fri Sep 8 21:18:05 PDT 2006
Stuart Midgley <sdm900 at gmail.com> writes:
>> It does apply, however, many parallel algorithms used today are
>> naturally blocking. Why? Well, complicating your algorithm to overlap
>> communication and computation rarely gives a benefit in practice. So
>> anyone who's tried has likely become discouraged, and most people
>> haven't even tried.
>> -- greg
> Your comment about overlapping computation and communication is interesting. As
> the number of cores per address space goes up, the chance that overlapping
> computation with communication actually gives you anything also
> decreases... memory copies require CPU intervention (unless you offload them to
> your NIC, which then means you suffer the normal latencies/message rates etc).
> Sure, you can offload the copy to the NIC on some interconnects (eg. Quadrics)
> but I personally found that the increased latency and decreased bandwidth of
> the copy affected performance more than not overlapping.
But overlapping compute and I/O, while nice, isn't the point. The point
is to have a buffer so your processes don't have to run in rigid lockstep.
That lets you bury OS jitter and communication latency, because you have
some work to do.
If it all happens on one machine that is fine. By receiving asynchronously
you can receive the data when you are ready for it, so none of your processes
needs to block waiting for the other, and ideally you are always busy doing
useful work.
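To make the buffering argument concrete, here is a minimal timing model in Python (not from the original thread). It assumes a single producer handing messages to a single consumer; the function names `lockstep_time` and `buffered_time` and the per-message costs are hypothetical, picked only for illustration. In the lockstep case the producer cannot start its next message until the consumer has accepted the current one; in the buffered case the producer deposits each message and moves on.

```python
def lockstep_time(produce_costs, consume_costs):
    """Finish time when every message is handed over by rendezvous (no buffer)."""
    producer_clock = 0   # time at which the producer may start its next message
    consumer_free = 0    # time at which the consumer can accept another message
    for p, c in zip(produce_costs, consume_costs):
        produced = producer_clock + p
        handoff = max(produced, consumer_free)  # both sides must be ready
        producer_clock = handoff                # producer was blocked until now
        consumer_free = handoff + c
    return consumer_free


def buffered_time(produce_costs, consume_costs):
    """Finish time when messages wait in an unbounded buffer."""
    produced = 0         # producer never blocks, so its clock just accumulates
    consumer_free = 0
    for p, c in zip(produce_costs, consume_costs):
        produced += p
        start = max(produced, consumer_free)    # consumer blocks only if buffer is empty
        consumer_free = start + c
    return consumer_free


# Made-up costs: the consumer is slow early on, the producer has one slow step later.
produce = [1, 1, 4, 1]
consume = [3, 3, 1, 1]
print(lockstep_time(produce, consume), buffered_time(produce, consume))  # prints "10 9"
```

With these made-up costs the buffered pipeline finishes at 9 instead of 10: the producer gets ahead while the consumer is slow, and that head start later hides the producer's own slow step.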
So no, I don't think it is a waste of time when you have lots of cores per
node. There is less latency to bury, and lower odds of getting processes
out of lockstep, so the win is smaller until you go off node, but that
is about it.
What I don't have a solid grasp on is what the data models of current
applications look like, so I don't know how hard it is to have several
messages in flight at any one time. It may be that in that case there is
little difference between a synchronous receive and an asynchronous one,
since the synchronous receive can just take the data out of the buffer
and give it to the application. You would only block if there were no more
messages, but usually there would be a message waiting, so that would work.
The only place I know of where asynchronous message reception makes a lot
of sense is when you have several channels whose messages could arrive at
the same time; then you could arrange for them to be processed in any order.
I need to go back and look and see what the MPI primitives for that kind
of operation are.
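For what it is worth, in MPI this pattern is normally written by posting one `MPI_Irecv` per channel and then calling `MPI_Waitany` or `MPI_Waitsome`, which report whichever requests have completed. The sketch below is only a stand-in for that idea using Python threads and a shared queue, not MPI code; the `channel` and `receive_any_order` helpers and the delays are hypothetical.

```python
import queue
import threading
import time


def channel(chan_id, delay, inbox):
    """Stand-in for a remote sender: deliver one message after `delay` seconds."""
    time.sleep(delay)
    inbox.put((chan_id, f"data from channel {chan_id}"))


def receive_any_order(delays):
    """Start one sender per channel, then handle messages in arrival order."""
    inbox = queue.Queue()
    senders = [
        threading.Thread(target=channel, args=(i, d, inbox))
        for i, d in enumerate(delays)
    ]
    for t in senders:
        t.start()

    handled = []
    for _ in delays:
        # Blocks only when nothing has arrived yet; never waits on one
        # particular channel while another already has data ready.
        chan_id, payload = inbox.get()
        handled.append(chan_id)

    for t in senders:
        t.join()
    return handled


# Channel 1 is likely to arrive first here, but the order depends on scheduling.
order = receive_any_order([0.03, 0.01, 0.02])
print(order)
```

The receiver never commits to a fixed channel order, so a slow channel only delays the handling of its own message.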