[Beowulf] interconnect and compiler ?

Vincent Diepeveen diep at xs4all.nl
Wed Feb 11 20:43:53 PST 2009


Hi Patrick,

Interesting to know that you nowadays market Ethernet cards. You also
still seem to possess some knowledge of other companies' switches.
Congrats.

My faith in the switches and crossbars is actually quite high. Not so
much in the MPI cards, however.
Let's assume for now that I was speaking of the high-end network cards
that receive and send MPI packets in our Myri cluster.

Let's say we've got a cool quad Xeon MP node with 64 logical cores @
3.x GHz. Soon a very popular machine, I'd guess.
There isn't really any other x86 CPU that can take it on head to head.

I mean the upcoming 'Beckton' CPU from Intel, the monster that probably
eats more power than any Intel x86 CPU before it :)

Feeding such a monster node is not exactly easy.

Just look at one node now, please. At the MPI card you've got one big
packet of a few megabytes being received. Likewise, a few other threads
get megabyte-sized packets. At the same time as all this, the card also
gets a packet of a few bytes for thread 42.

How long does it now take for thread 42 to get its packet?

Will the card first handle all the megabyte-sized packets, or will it
hand the quick short packet over 'in between' to our "logical core 42"?
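
A rough way to check this on a given card would be something like the
sketch below: keep the NIC busy with a large transfer and time a tiny
round trip while it is in flight. This is only a minimal sketch in
plain MPI; it needs at least 3 ranks, and the message sizes are made up
for illustration.

/* small_vs_large.c - toy test: latency of a tiny message while a
 * large transfer is in flight (illustrative sketch, not a benchmark).
 * Example run: mpirun -np 3 ./small_vs_large
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BIG   (4 * 1024 * 1024)   /* "a few megabyte" packet           */
#define SMALL 8                   /* the few-byte packet for thread 42 */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 3) {
        if (rank == 0) fprintf(stderr, "needs at least 3 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char *big = malloc(BIG);
    char small[SMALL];
    MPI_Request req;

    if (rank == 0) {
        /* keep the card busy with a large send to rank 2 ... */
        MPI_Isend(big, BIG, MPI_BYTE, 2, 0, MPI_COMM_WORLD, &req);

        /* ... and time a tiny ping-pong with rank 1 meanwhile */
        double t0 = MPI_Wtime();
        MPI_Send(small, SMALL, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
        MPI_Recv(small, SMALL, MPI_BYTE, 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();
        printf("small round trip with big send in flight: %.1f us\n",
               (t1 - t0) * 1e6);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(small, SMALL, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(small, SMALL, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    } else if (rank == 2) {
        MPI_Recv(big, BIG, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    free(big);
    MPI_Finalize();
    return 0;
}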

Thanks,
Vincent

P.s. I expect, of course, the answer '42' :)

On Feb 12, 2009, at 2:30 AM, Patrick Geoffray wrote:

> Vincent Diepeveen wrote:
> >> All such sorts of switch latencies are at least a factor of 50-100
> >> worse than the one-way pingpong latency.
>
> I think you are a bit confused about switch latencies.
>
> There is the crossbar latency, which is the time it takes for a
> packet to be decoded and routed to the right output port. It is
> essentially the difference between the pingpong latency with and
> without the crossbar in the middle, for the smallest packet size.
> Typical crossbar latencies are on the order of 100 ns for recent
> Ethernot, 200 ns for Ethernet. To build a bigger fabric, you need to
> connect multiple crossbars into Clos, fat-tree or torus topologies.
> The end-to-end switch latency then depends on the number of
> crossbars the packet crosses.
>
> There is the PHY/transceiver latency. That only applies to the edge
> of the switch, where a physical cable plugs into a socket. SFP+,
> for example, requires serialization compared to QSFP. With fiber,
> the transceivers add some overhead. Typical overhead is 250 ns per
> port for a serial fiber PHY, almost nothing for parallel copper.
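
(To put rough numbers on that for myself: a packet that enters and
leaves an Ethernet fabric through serial fiber ports and crosses, say,
three crossbars would see about 3 x 200 ns + 2 x 250 ns = 1100 ns of
switching latency end to end. The hop count here is just an example;
the per-hop and per-port figures are the typical ones quoted above.)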
>
> Another overhead is head-of-line blocking. It happens when a packet
> has to wait for another one to pass in order to be switched. This is
> equivalent to two cars turning onto the same road: one will have to
> wait for the other to make the turn. This latency can be high,
> especially if the packets are large (imagine a couple of trains
> instead of cars).
> Is that what you call "ugly switch latency"? HOL blocking will
> reduce your switch efficiency to ~40% with random traffic. That
> means your latency will be about twice as high on average, assuming
> all packets have the same size. Where is the factor of 50-100?
>
> >> My assumption is always: "if a manufacturer doesn't respond, it
> >> must be really bad for his network card".
>
> Maybe they don't respond because the question does not make any sense.
>
> >> Note that pingpong latency also gets demonstrated in the wrong manner.
> >> The requirement for determining one-way pingpong latency should be
> >> that it eats no CPU time while obtaining it.
>
> You mean blocking on an interrupt? When you go to a restaurant, do
> you place your order and go back home to wait for a phone call, or
> do you wait at a table? I, for one, sit down and busy poll.
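
(In code, the two waiting styles being contrasted look roughly like the
sketch below; a minimal sketch in plain MPI, nothing vendor specific,
and the source rank and tag are made up.)

/* Two ways to wait for a small incoming message on the receive side. */
#include <mpi.h>

void recv_blocking(char *buf, int n)
{
    /* Hand the wait to the MPI library: depending on the implementation
     * it spins, yields, or sleeps until an interrupt signals arrival. */
    MPI_Recv(buf, n, MPI_BYTE, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

void recv_busy_poll(char *buf, int n)
{
    /* Burn the core ourselves: post the receive and spin on MPI_Test.
     * Lowest latency, but the polling core gets no other work done. */
    MPI_Request req;
    int done = 0;
    MPI_Irecv(buf, n, MPI_BYTE, 0, 42, MPI_COMM_WORLD, &req);
    while (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
}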
>
> Patrick
>



