[Beowulf] Correct networking solution for 16-core nodes

Tahir Malas tmalas at ee.bilkent.edu.tr
Wed Aug 2 04:54:10 PDT 2006


> -----Original Message-----
> From: Vincent Diepeveen [mailto:diep at xs4all.nl]
> Sent: Wednesday, August 02, 2006 12:42 PM
> To: Tahir Malas
> Subject: Re: [Beowulf] Correct networking solution for 16-core nodes
> 
> Hi Tahir,
> 
> Perhaps you can describe to the beowulf list what  type of software
> you use, and how latency sensitive your work is?
Hi,
We use a Fortran 90 compiler with LAM/MPI. How latency-sensitive the work is
is not easy to describe. Briefly, I can say that the parallel program is
dominated by point-to-point communications with an irregular pattern (not
like collectives) and with varying message sizes. That's why we want to
increase "SMPness", to benefit from the fast communication within the same
board. At this point the Tyan 8-way system has helped a lot.
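To give a clearer picture, one exchange step looks roughly like the sketch
below. This is only a simplified illustration, not our actual code; the
neighbour list, message counts, and buffers are placeholders made up for the
example:

  ! One irregular point-to-point exchange step (simplified sketch).
  ! nnbr neighbours, with per-neighbour send/receive counts that vary.
  subroutine exchange(nnbr, nbr, scount, rcount, sbuf, rbuf)
    implicit none
    include 'mpif.h'
    integer :: nnbr, nbr(nnbr), scount(nnbr), rcount(nnbr)
    double precision :: sbuf(*), rbuf(*)
    integer :: req(2*nnbr), stat(MPI_STATUS_SIZE, 2*nnbr)
    integer :: i, soff, roff, ierr

    soff = 1
    roff = 1
    do i = 1, nnbr
      ! post a receive and a send for each neighbour; sizes differ per pair
      call MPI_IRECV(rbuf(roff), rcount(i), MPI_DOUBLE_PRECISION, &
                     nbr(i), 0, MPI_COMM_WORLD, req(i), ierr)
      call MPI_ISEND(sbuf(soff), scount(i), MPI_DOUBLE_PRECISION, &
                     nbr(i), 0, MPI_COMM_WORLD, req(nnbr+i), ierr)
      soff = soff + scount(i)
      roff = roff + rcount(i)
    end do
    ! wait until all messages of this step have completed
    call MPI_WAITALL(2*nnbr, req, stat, ierr)
  end subroutine exchange

The point is just that the communicating pairs and the message sizes are
irregular, so the short-message latency matters a lot for us.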
> 
> High-end networks just work over MPI, not TCP of course, and in general
> have good latency.
> 
> Quadrics, for example, can work as direct shared memory among all nodes
> when you program for its shmem, which means that for short messages you
> can simply share an array out of the 64MB RAM on the card across all
> nodes. You just write normally into this array and the cards take care
> that it gets synchronised.
> 
That is how LAM/MPI handles short messages within an SMP node, isn't it? But
we don't change anything in the MPI routines; if the message is short, it is
handled through shared memory.
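As far as I understand, the transport is not selected in the Fortran source
at all. If I remember the LAM 7.x syntax correctly (I may be misremembering
the module name, so treat this as an assumption), the shared-memory RPI can
be requested at run time, e.g.

  mpirun -ssi rpi usysv C ./our_solver

(./our_solver is just a placeholder name), and short intra-node messages then
go through the shared-memory eager path with no change to the code.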

> A very easy way to program, without needing to take care of all those
> calls. It makes things simpler by avoiding a lot of synchronisation,
> and avoids wasting a thread just on the MPI.
> 
> For plain MPI, right now IB is the cheapest and has the biggest
> bandwidth, though the latency they claim sounds a bit 'scientific'.
>

For Voltaire IB, for example, there are several options: single port vs.
dual port, MEMfree vs. 128-256 MB RAM on the card (which sounds similar to
Quadrics), and, more importantly, the bus interface: PCI-X, PCI-E, or AMD
HyperTransport (HTX). HTX is said to provide 1.3us latency by connecting
directly to the AMD Opteron processor via a standard HyperTransport HTX
slot. Does an HTX slot mean the RAM slots on the motherboard? Do we then
have to sacrifice some slots for the NIC? Well, in the end it is still
unclear to me which one, and how many, to choose.
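In any case, I guess we should measure the short-message latency ourselves
on whatever we buy, instead of trusting the data sheets. A little ping-pong
along the lines below (a rough, untested sketch; the message size and the
repetition count are arbitrary) should be enough to compare the numbers
within one board and across the interconnect:

  ! Minimal two-rank ping-pong to estimate one-way short-message latency.
  program pingpong
    implicit none
    include 'mpif.h'
    integer, parameter :: nmsg = 1000, nbytes = 8
    integer :: rank, ierr, i
    integer :: stat(MPI_STATUS_SIZE)
    double precision :: t0, t1
    character :: buf(nbytes)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    buf = 'x'
    t0 = MPI_WTIME()
    do i = 1, nmsg
      if (rank .eq. 0) then
        call MPI_SEND(buf, nbytes, MPI_BYTE, 1, 1, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buf, nbytes, MPI_BYTE, 1, 2, MPI_COMM_WORLD, stat, ierr)
      else if (rank .eq. 1) then
        call MPI_RECV(buf, nbytes, MPI_BYTE, 0, 1, MPI_COMM_WORLD, stat, ierr)
        call MPI_SEND(buf, nbytes, MPI_BYTE, 0, 2, MPI_COMM_WORLD, ierr)
      end if
    end do
    t1 = MPI_WTIME()
    ! half of the average round-trip time is the one-way latency
    if (rank .eq. 0) print *, 'one-way latency (us):', &
        (t1 - t0) / (2.0d0 * nmsg) * 1.0d6
    call MPI_FINALIZE(ierr)
  end program pingpong

Running it with two ranks on the same board and then with one rank per box
should show both numbers directly.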
 
> Other than that latency, you have to realize that the latency of those
> cards is still ugly compared to the latencies within the quad boxes.
> 
> If you have 8 or so threads running in those boxes and you use an IB
> card, then it'll have a switch latency.
> 
> Only Quadrics is clear about its switch latency (probably the
> competitors have a worse one). It's 50 us for 1 card.
> 
But if we connect two boxes directly, without a switch, then we can achieve
this latency, I hope?

> That means that if 2 or more threads need a short message at the same
> time, switching from thread to thread takes 50 us. For example, if
> thread X gets some large message, then the card in between switches to
> give thread Y a short message.
> 
> That takes 50 us.
> 
> I'm not sure whether IB can switch like this at all.
> 
> 50 microseconds is quite a huge delay, though it'll be worse with other
> cards. Not sure whether your software can work with such latencies?
>
It works, but then it requires much more time.
 
> I give this example because the manufacturers let everyone see the
> one-way ping-pong latencies measured on a single core (usually run
> without the MPI overhead) of something like 1.x us.
> 
> That's just not reality.
> 
> With so many cores, the switch latencies of the switch and the cards
> will be a major bottleneck if you ship lots of short messages and
> sometimes very big ones.
> 
> Compare to the 0.2 us it takes to get RAM from any part of the machine
> within a quad.
> 
> Factor 250 difference.
> 
> Probably you need some major software rewrite for this?
> 
> Good luck,
> Vincent

Thanks,
Tahir.