[Beowulf] interconnect and compiler ?

Vincent Diepeveen diep at xs4all.nl
Sat Jan 31 04:06:48 PST 2009


Well, it is very relevant for many applications, especially because
all cards have such ugly switch latencies. This matters now that we
will very soon have 64 cores per node, if not more.

Those who rewrote their algorithms to be more bandwidth-bound may not
always volunteer to share that. More complex algorithms optimize more
details and are therefore faster. Restructuring one big part of the
work to be bandwidth-bound is rather logical; all the rest, which is
latency-bound, then gets done with short messages. So you want those
short messages to get through (see the sketch below).
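
To make concrete what I mean, here is a minimal sketch in C with MPI
(ranks, tags and message sizes are made up for illustration, not taken
from any real code): the bulk data goes out as one large non-blocking,
bandwidth-bound transfer, while the short latency-bound messages keep
flowing on a separate tag. The question for the card is what latency
those short messages see while the bulk transfer is in flight.

/* Sketch: overlap one big bandwidth-bound transfer with short
 * latency-bound messages.  Peer rank, tags and sizes are invented. */
#include <mpi.h>
#include <stdlib.h>

#define TAG_BULK  1
#define TAG_SHORT 2

void overlap_example(int peer, MPI_Comm comm)
{
    enum { BULK_BYTES = 64 * 1024 * 1024, SHORT_BYTES = 64 };
    char *bulk = malloc(BULK_BYTES);
    char  small[SHORT_BYTES];
    MPI_Request req;

    /* start the big bandwidth-bound transfer and do NOT wait for it */
    MPI_Isend(bulk, BULK_BYTES, MPI_BYTE, peer, TAG_BULK, comm, &req);

    /* the latency-bound part keeps exchanging short messages; the
       interesting number is the latency these see while the bulk
       transfer is still in flight */
    for (int i = 0; i < 1000; i++) {
        MPI_Send(small, SHORT_BYTES, MPI_BYTE, peer, TAG_SHORT, comm);
        MPI_Recv(small, SHORT_BYTES, MPI_BYTE, peer, TAG_SHORT, comm,
                 MPI_STATUS_IGNORE);
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(bulk);
}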

Switch latencies of that sort are at least a factor of 50-100 worse
than the cards' one-way pingpong latency.

So we are not going to get many replies in this thread, simply because
the vendors fear they're worse than the rest.

My assumption is always: "if a manufacturer doesn't respond, the
numbers must be really bad for his network card".

MPI is quite an ugly way to do communication if you've got just one
card and 64 logical cores.

Note that pingpong latency also tends to get demonstrated in the wrong
manner. The requirement for determining one-way pingpong latency
should be that obtaining it eats no CPU time.
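
For reference, this is how the quoted one-way number is usually
obtained (a sketch in C with MPI; iteration count and message size are
arbitrary): half the round-trip time of a tiny message. Note that in
most MPI implementations the blocking receive busy-polls, so the
"waiting" rank burns a full core the whole time, which is exactly the
CPU time I am saying a fair measurement should not need.

/* Classic pingpong between ranks 0 and 1; the quoted one-way latency
 * is half the measured round trip. */
#include <mpi.h>
#include <stdio.h>

void pingpong(int rank, MPI_Comm comm)
{
    enum { ITERS = 10000 };
    char byte = 0;
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_BYTE, 1, 0, comm);
            MPI_Recv(&byte, 1, MPI_BYTE, 1, 0, comm, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* this blocking receive typically spins on the CPU */
            MPI_Recv(&byte, 1, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_BYTE, 0, 0, comm);
        }
    }

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) / (2.0 * ITERS) * 1e6);
}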

My experience is that network communication can be handled by at most
one thread.

In the future we are again going to see ever more problems for
supercomputers in keeping their networks fast enough.

Power6 nodes are already around 0.6 Tflop/s, whereas the network can
move maybe a few gigabytes per second per node.
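
As rough back-of-the-envelope arithmetic (taking that 0.6 Tflop/s and
assuming, say, 2 GB/s of network bandwidth per node): 2e9 bytes/s
divided by 0.6e12 flop/s is about 0.003 bytes per flop, i.e. one byte
of network traffic for every ~300 floating point operations.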

Calculations that are O(n^3) are of course inefficient algorithms.

Just because no one gets publicly paid to solve that problem, the
calculations still get done that inefficiently at some locations on
this planet, though not all.

On Jan 30, 2009, at 10:28 PM, Håkon Bugge wrote:

> On Jan 30, 2009, at 21:24 , Vincent Diepeveen wrote:
>
>> Now that you're busy with all this, mind quoting the interconnect's
>> switch latency?
>>
>> For example: if one of the cores, c0, on our box is busy receiving
>> a long message from a remote node in the network, a message that
>> will take significant time, can the card in between let through a
>> short message meant for c1, and if so, what latency does it take
>> for c1 to receive it?
>>
>
> Vincent,
>
> This is a good comment. Multiple DMA engines/threads in the HCA
> and/or different priority levels for the DMAs are the issues. I
> would claim, as usual in the HW vs. SW world, that the mechanisms
> are implemented in the hardware, but the ability of software to take
> advantage of this may not be there.
>
> Since small messages often are required to get through in order to
> start a large one, e.g. in rendezvous protocols, your example is
> relevant. You might have an interest in looking at the SPEC MPI2007
> results at http://www.spec.org/mpi2007/results/mpi2007.html
>
> Here you will find different MPI implementations using equal or
> very similar hardware, and/or different interconnects. Out of the
> 13 applications constituting the SPEC MPI2007 medium suite, you
> will find the mileage varies significantly.
>
> Maybe this is related to a response (in software) to your issue?
>
>
>> On Jan 30, 2009, at 6:06 PM, Greg Lindahl wrote:
>>> Even LogP doesn't describe an interconnect that well. It matters how
>>> efficient your interconnect is at dealing with multiple cores,  
>>> and the
>>> number of nodes. As an example of that, MPI implementations for
>>> InfiniBand generally switch over to higher latency/higher overhead
>>> mechanisms as the number of nodes in a cluster rises, because the
>>> lowest latency mechanism at 2 nodes doesn't scale well.
>>>
>
> [slight change of subject]
>
> Greg, we (and your former colleagues at PathScale) have exchanged
> opinions on RDMA vs. Message Passing. Based on SPEC MPI2007, you
> will find that an RDMA-based DDR interconnect using Platform MPI
> performs better than a Message Passing stack using DDR. Looking at
> 16 nodes, 128 cores, Intel E5472, the Message Passing paradigm is
> faster on 5 (out of 13) applications, whereas Platform MPI with its
> RDMA paradigm is faster on 8 of the applications. Further, when the
> Message Passing paradigm is faster, it is never by more than 7% (on
> pop2). On the other hand, when Platform MPI with its RDMA is faster,
> we talk 33, 18, and 17%.
>
> Maybe you will call it a single data point. But I will respond that
> it is 13 applications. And frankly, I didn't have more gear to run
> on ;-)
>
>
> Thanks, Håkon




