[Beowulf] Correct networking solution for 16-core nodes

Vincent Diepeveen diep at xs4all.nl
Fri Aug 4 04:52:01 PDT 2006


Thanks Joachim,

This is indeed the case.

The problem with this mailing list is that there are very technical people, working for 
high-end companies, who emphasize only the best-case performance of their 
solution. For programmers, or even worse for people who just run software on 
these machines, it is the worst case that impacts application performance 
most, not the best case of the card measured without MPI overhead and 
without checking whether it drowns in too many short messages.

One such worst case arises when several processes share a NIC. Nowadays we 
have at least 2 cores per node, in the more luxurious setups 4 cores per node, 
and this number is about to go up; we already saw postings from at least 2 
groups who have 16 cores per node. The relevant figure there is the switch 
latency of the NIC itself: the time the NIC needs to switch between the 
different processes communicating through it.

And that one is ugly, so it is very relevant for programmers to know.
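
To make that concrete, here is a rough ping-pong sketch of the kind of 
microbenchmark I mean. It assumes an even number of ranks with the first half 
placed on one node and the second half on the other (that mapping and the 
constants are just assumptions for illustration, not a vendor benchmark). Run 
it with 1 rank per node, then with 16 ranks per node, and compare what the 
shared NIC does to the latency.

    /* sketch: short-message latency with several ranks sharing one NIC */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000
    #define MSG   8            /* bytes: a "short message" */

    int main(int argc, char **argv)
    {
        int rank, size;
        char buf[MSG] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* assumption: ranks 0..size/2-1 on node A, the rest on node B */
        int half = size / 2;
        int peer = (rank < half) ? rank + half : rank - half;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank < half) {
                MPI_Send(buf, MSG, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            }
        }
        double oneway_us = (MPI_Wtime() - t0) / ITERS / 2.0 * 1e6;

        if (rank < half)
            printf("rank %d: one-way latency %.2f us with %d ranks on the NIC\n",
                   rank, oneway_us, half);
        MPI_Finalize();
        return 0;
    }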

Another bad one used to be certain switches that got flooded by a lot of 
short messages, though that latency is of a different magnitude than the ones 
above, and the more expensive switches that get quoted here usually solve it.

The worst one is the runqueue latency of 10-20 ms; on paper it is 10 ms.
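
You can see it yourself with a tiny test: give up the CPU for a moment and 
measure how much later you actually get it back. This is only a sketch (the 
1 ms request and the iteration count are arbitrary choices of mine); on a 
loaded node the overshoot approaches a full timeslice, which is exactly the 
10-20 ms I mean.

    /* sketch: estimate scheduler wake-up (runqueue) latency */
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        const struct timespec req = { 0, 1000000 };  /* ask for 1 ms */
        double worst = 0.0;

        for (int i = 0; i < 1000; i++) {
            double t0 = now_sec();
            nanosleep(&req, NULL);                   /* give up the CPU */
            double overshoot = (now_sec() - t0) - 0.001;
            if (overshoot > worst)
                worst = overshoot;
        }
        printf("worst wake-up overshoot: %.3f ms\n", worst * 1e3);
        return 0;
    }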

Especially on those SGI machines I had a lot of problems with the runqueue 
latency. An additional problem there was that timing was done centrally: if 
only 1 CPU out of a partition of 512 is the timing CPU, then a process cannot 
simply time its own performance, which is a major problem.

Of course my software times itself, because I want to know how much system 
time was effectively used by the program itself. The rest of the time the 
process is busy polling (you can't idle, because then you suffer the runqueue 
latency when waking up), so only timing within the process itself works.
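
In code the two tricks look roughly like the sketch below (the helper names 
are mine, not from any library): busy-poll with MPI_Iprobe so the process 
never leaves the runqueue, and let each process read its own user+system time 
with getrusage() instead of depending on a central timing CPU.

    /* sketch: busy-polling receive plus per-process self-timing */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/resource.h>

    /* user + system CPU seconds consumed by *this* process */
    static double self_cpu_seconds(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6
             + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1e-6;
    }

    /* spin until a message arrives; never block and get descheduled */
    static void poll_recv(int *msg)
    {
        int flag = 0;
        MPI_Status st;
        while (!flag)
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &st);
        MPI_Recv(msg, 1, MPI_INT, st.MPI_SOURCE, st.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv)
    {
        int rank, msg = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double cpu0 = self_cpu_seconds();
        if (rank == 0)
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            poll_recv(&msg);

        printf("rank %d used %.3f s of CPU time (timed by itself)\n",
               rank, self_cpu_seconds() - cpu0);
        MPI_Finalize();
        return 0;
    }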

At 130 CPUs, without timing, my chess program Diep achieved perfect scaling 
within 10 seconds (not to be confused with the speedup in time to find the 
best move sooner, which is around 20% in the case of such scaling); with 
timing: NEVER.

Luckily, on clusters you don't have the problems that a single system image 
has.

Vincent

----- Original Message ----- 
From: "Joachim Worringen" <joachim at dolphinics.com>
To: <beowulf at beowulf.org>
Sent: Friday, August 04, 2006 10:35 AM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes


> Greg Lindahl wrote:
>> Vincent wrote:
>>
>>> Only quadrics is clear about its switch latency (probably
>>> competitors have a worse one). It's 50 us for 1 card.
>>
>> We have clearly stated that the Mellanox switch is around 200 nsec per
>> hop.  Myricom's number is also well known.
>
> I think Vincent meant another latency, not the per-hop latency in the 
> switches: the time to switch between different processes communicating to 
> the NIC. I never heard of this latency being specified, nor being 
> substantial. Can anybody comment?
>
>  Joachim
>
> -- 
> Joachim Worringen, Software Architect, Dolphin Interconnect Solutions
> phone ++49/(0)228/324 08 17 - http://www.dolphinics.com
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
> 



