[Beowulf] fast interconnects, HT 3.0 ...

Eugen Leitl eugen at leitl.org
Tue May 23 09:35:01 PDT 2006

On Tue, May 23, 2006 at 11:04:52AM -0500, Richard Walsh wrote:

> >I don't know (like it would stop me), but there are few-port HT switches,
> >and if there are several ports on one chassis one could wire up
> >some topology, which, hopefully, will match the problem. 
> >
> >  
>    HT switches ... ?? ... can you point me to a reference?

Google knows of several, and even some hotplug connector specs.
See also http://www.commsdesign.com/design_corner/showArticle.jhtml?articleID=16503595
for a review.

> >I'm not sure how this is different from vanilla packet-switched
> >MPI network. It's not about maintaining memory coherency.
> >  
>    Well, of course you can run MPI over it as you can on the Cray and Altix,
>    but you are artificially separating memory in software that is in fact
>    closer in hardware.

If I have some 10^3 nodes and the context is not read-only,
I always have to wait to make sure nobody else is trying to write to
the same location. That's the worst case, but in a relativistic universe
maintaining the illusion of coherence over many copies is
expensive: lots of signalling back and forth until you
know for sure that the state has settled. This might work for 8, 16, maybe
32 systems sitting close enough together -- but with 10^3 or 10^6 nodes it
has to give.
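
To put a rough number on the "signalling back and forth": a minimal sketch of the light-speed floor for one request/acknowledge round trip, ignoring all switch and protocol overhead. The distances and the `velocity_factor` parameter are illustrative assumptions, not measurements of any real machine.

```python
# Lower bound on one request/ack round trip, from the speed of light alone.
# Distances below are hypothetical machine-room scales, chosen for illustration.
C_VACUUM_M_PER_S = 299_792_458.0

def min_round_trip_ns(distance_m, velocity_factor=1.0):
    """Minimum time in ns for a signal to go distance_m and come back.

    velocity_factor is the signal speed as a fraction of c (1.0 = vacuum).
    """
    return 2.0 * distance_m / (C_VACUUM_M_PER_S * velocity_factor) * 1e9

for d in (0.3, 3.0, 30.0):
    print(f"{d:5.1f} m: at least {min_round_trip_ns(d):6.1f} ns per round trip")
```

Every extra round trip a coherence protocol needs multiplies that floor, which is why it gets painful as the node count and the physical extent grow.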

>    That is where the pGAS programming models become more efficient.  Remote
>    memory references expressed in the syntax and compiled to instructions for
>    direct puts and gets without management or translation by a NIC.  It

We're talking lunatic-fringe interconnects where the wire or the fibre
is your FIFO, and the switch makes a routing decision after the first few
bits of the header have streamed past -- at a signal speed reasonably close
to c. At data rates of 10 Gbit/s and above that's a quick decision to make.
At 10 Gbit/s a single serial bit lasts 100 ps and is just ~3 cm long -- in
vacuum. It is shorter in glass, and shorter in copper as well. So a very
short message can arrive within a few ns, which is the same order of
magnitude as a RAM access.
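
The ~3 cm / 100 ps figure is easy to check. A small sketch follows; the velocity factor of 0.67 used for glass is my assumption of a typical value, not a number for any specific fibre.

```python
# Duration and physical length of one serial bit at a given line rate.
C_VACUUM_M_PER_S = 299_792_458.0

def bit_time_ps(rate_bit_per_s):
    """Duration of one bit in picoseconds."""
    return 1e12 / rate_bit_per_s

def bit_length_cm(rate_bit_per_s, velocity_factor=1.0):
    """Length of one bit on the medium in cm (velocity_factor = fraction of c)."""
    return C_VACUUM_M_PER_S * velocity_factor / rate_bit_per_s * 100.0

rate = 10e9  # 10 Gbit/s
print(f"bit time:         {bit_time_ps(rate):.0f} ps")
print(f"length in vacuum: {bit_length_cm(rate):.2f} cm")
print(f"length in glass:  {bit_length_cm(rate, 0.67):.2f} cm  (assumed 0.67 c)")
```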

>    would seem that HT 3.0 supports this model across chassis as long as the
>    programmer manages memory synchronization.

You have to bite the bullet and manage synchronization with higher-order
protocols. The physical world at the bottom is fundamentally message-passing.
You might not notice it much if you're working on the µs scale, but
at ns and below you can't ignore it.

> >>   Sounds like the Cray X1E pGAS memory model.  Is there a role for 
> >>    
> >
> >I don't think there is any other model but message passing. It's not
> >like this is a ccHT a la HORUS 
> >http://en.wikipedia.org/wiki/HORUS_interconnect
> >  
>    The inter-chassis, but non-coherent interface that HT 3.0 supports would
>    seem to work very nicely with UPC and CAF.  They run very well on the
>    Cray X1, which provides coherent memory on-board only as well.

Eugen* Leitl http://leitl.org
ICBM: 48.07100, 11.36820            http://www.ativel.com
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE