[Beowulf] Correct networking solution for 16-core nodes
Vincent Diepeveen diep at xs4all.nl
Fri Aug 4 04:52:01 PDT 2006
- Previous message: [Beowulf] Correct networking solution for 16-core nodes
- Next message: [Beowulf] Correct networking solution for 16-core nodes
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Thanks Joachim,

This is indeed the case.

The problem with the mailing list is that there are very technical people working for high-end companies who emphasize only the best-case performance of their solution. For programmers, or even worse for people who just run software on it, it is the worst case that impacts application performance most, not the best case of the card measured without MPI overhead and without checking whether it overflows from too many short messages.

One such worst case is where several processes on one node communicate through the same NIC. Nowadays we have at least 2 cores a node, in the most luxurious setups some networks have 4 cores a node, and soon this number goes up; we already saw postings from at least 2 groups who have 16 cores a node. The relevant thing then is the switch latency of the NIC itself, that is, the time it needs to switch between the processes talking to it. That one is ugly, so it is very relevant for programmers to know.

Another bad one used to be certain switches when a lot of short messages flood the switch, though this latency is of a different magnitude than the ones above, and the more expensive switches that get quoted here usually solve it.

The worst one is the runqueue latency of 10-20 ms; on paper it is 10 ms. Especially on those SGI machines I had a lot of problems with the runqueue latency. An additional problem there was that timing was done centrally. If only 1 cpu out of a partition of 512 is a timing cpu, then each process can't simply time its own performance, which is a major problem. Of course my software times itself, because I want to know how much system time was effectively used by the program itself. The rest of the time the process is busy polling (you can't idle, because then you suffer the runqueue latency when waking up), so only timing within the process itself works.
At 130 cpu's without timing, my chess program Diep achieved perfect scaling within 10 seconds (not to be confused with the speedup in time to find the best move sooner, which is around 20% in case of such scaling); with timing: NEVER.

Luckily, on clusters you don't have the problems that a single system image has.

Vincent

----- Original Message -----
From: "Joachim Worringen" <joachim at dolphinics.com>
To: <beowulf at beowulf.org>
Sent: Friday, August 04, 2006 10:35 AM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes

> Greg Lindahl wrote:
>> Vincent wrote:
>>
>>> Only quadrics is clear about its switch latency (probably
>>> competitors have a worse one). It's 50 us for 1 card.
>>
>> We have clearly stated that the Mellanox switch is around 200 usec per
>> hop. Myricom's number is also well known.
>
> I think Vincent meant another latency, not the per-hop latency in the
> switches: the time to switch between different processes communicating to
> the NIC. I never heard of this latency being specified, nor being
> substantial. Can anybody comment?
>
> Joachim
>
> --
> Joachim Worringen, Software Architect, Dolphin Interconnect Solutions
> phone ++49/(0)228/324 08 17 - http://www.dolphinics.com
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
