[Beowulf] fast interconnects, HT 3.0 ...

Richard Walsh rbw at ahpcrc.org
Tue May 23 10:14:08 PDT 2006

Eugen Leitl wrote:
> On Tue, May 23, 2006 at 11:04:52AM -0500, Richard Walsh wrote:
>>> I don't know (like it would stop me), but there are few-port HT switches,
>>> and if there are several ports on one chassis one could wire up
>>> some topology, which, hopefully, will match the problem. 
>>    HT switches ... ?? ... can you point me to a reference?
> Google knows of several, and even some hotplug connector specs.
> See also http://www.commsdesign.com/design_corner/showArticle.jhtml?articleID=16503595
> for a review.
>>> I'm not sure how this is different from vanilla packet-switched
>>> MPI network. It's not about maintaining memory coherency.
>>    Well, of course you can run MPI over it as you can on the Cray and
>>    Altix, but you are artificially separating memory in software that
>>    is in fact closer in hardware.
> If I have some 10^3 nodes, and the context is not read-only
> I always have to wait to make sure nobody is trying to write to
> the same location. It's a worst case, but in a relativistic universe
> maintaining the illusion of coherence over many copies is an
> expensive one. Lots of signalling back and forth, until you 
> know the state is settled for sure. This might work for 8, 16, maybe 32 systems
> in a close enough location -- but with 10^3 or 10^6 nodes it 
> has to give.
    Mmm ... I do not think we are connecting.  Off-board non-coherence
    is managed by the application and is made possible in part by pGAS
    syntax in

    We have some very novel, fine-grained UPC CFD codes running on the
    Cray X1 which do indexed adaptive mesh regeneration to model the
    flapping wings of a model hummingbird to follow its shedding
    vortices.  Performance is good and we manage the off-board
    incoherence/synchrony nicely.  It would have been almost impossible
    to write in MPI and its performance would be poor.  The application
    has reasonably good scaling properties as it is.  It even runs on
    our cluster ... yes, in UPC ... (albeit much more slowly).  It has
    some data locality (not GUPS-like), but the remeshing approach is
    fine grained ... the "messages" are direct remote memory puts and
    gets driven by vector instructions.

    HT 3.0 is presumably more elaborate than the Cray X1 ISA, but it
    can provide similar, more direct, off-chassis, non-coherent memory
    addressing, no?  This is in tune with the UPC and CAF programming
    models.
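
For readers unfamiliar with the model, here is a toy Python sketch of a
partitioned global address space. It is neither UPC nor anything HT 3.0
specifies; the class `ToyPGAS` and its methods are hypothetical names
invented for illustration. It mimics UPC's default cyclic layout, where
element i of a shared array has affinity to thread i % THREADS, and it
simply counts how many references are local and how many are remote.

```python
class ToyPGAS:
    """Toy partitioned global address space (hypothetical, illustration only)."""

    def __init__(self, threads, size):
        self.threads = threads
        self.size = size
        # One private partition per "thread". Element i lives on thread
        # i % threads, mimicking UPC's default cyclic layout.
        self.parts = [
            [0.0] * ((size - t + threads - 1) // threads) for t in range(threads)
        ]
        self.local_refs = 0
        self.remote_refs = 0

    def owner(self, i):
        """Map a global index to (owning thread, offset in its partition)."""
        return i % self.threads, i // self.threads

    def _count(self, me, owning_thread):
        # Track whether the reference stayed in the caller's own partition.
        if me == owning_thread:
            self.local_refs += 1
        else:
            self.remote_refs += 1

    def put(self, me, i, value):
        """Thread `me` writes global element i directly, no message layer."""
        t, off = self.owner(i)
        self._count(me, t)
        self.parts[t][off] = value

    def get(self, me, i):
        """Thread `me` reads global element i directly."""
        t, off = self.owner(i)
        self._count(me, t)
        return self.parts[t][off]


g = ToyPGAS(threads=4, size=8)

# Phase 1: every thread writes the element it owns (all local).
for me in range(4):
    g.put(me, me, float(me))

# In a real PGAS program a barrier would go here: the programmer, not
# the hardware, is responsible for ordering the two phases.

# Phase 2: every thread reads its right-hand neighbour's element (all remote).
neighbours = [g.get(me, (me + 1) % 4) for me in range(4)]
print(neighbours, g.local_refs, g.remote_refs)
```

The point of the sketch is only that a remote reference is an ordinary
indexed access in the source; what it costs on the wire is a separate
question.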
>>    That is where the pGAS programming models become more efficient.  Remote
>>    memory references expressed in the syntax and compiled to instructions for
>>    direct puts and gets without management or translation by a NIC.  It
> We're talking lunatic fringe interconnects where the wire or the fibre
> is your FIFO, and the switch makes a routing decision after a few bits
> of the headers have streamed past -- which is reasonably close to c.
> With 10 GBit data rates and above that's a quick decision to take. 
> At 10 GBit/s your serial bit is just ~3 cm or 100 ps short -- in vacuum. 
> Shorter in glass, and much shorter in copper. So a very short message
> can arrive within a few ns, which is order of magnitude RAM access.
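
The bit-length figures quoted above can be checked with a few lines of
Python. This is a worked-arithmetic sketch only: the function names are
made up here, and the fibre refractive index of 1.5 is an assumed round
number, not a figure from the thread.

```python
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum


def bit_time_ps(rate_bps):
    """Duration of one serial bit, in picoseconds."""
    return 1e12 / rate_bps


def bit_length_cm(rate_bps, velocity_factor=1.0):
    """Physical length of one bit in flight, in centimetres.

    velocity_factor is the signal speed as a fraction of c
    (1.0 for vacuum; lower in glass or copper).
    """
    return 100.0 * C_VACUUM_M_PER_S * velocity_factor / rate_bps


rate = 10e9  # 10 Gbit/s, as in the post

t_ps = bit_time_ps(rate)
vacuum_cm = bit_length_cm(rate)
# Assumption: refractive index of about 1.5 for glass fibre.
fibre_cm = bit_length_cm(rate, velocity_factor=1.0 / 1.5)

print(f"bit time: {t_ps:.0f} ps")
print(f"bit length in vacuum: {vacuum_cm:.2f} cm")
print(f"bit length in fibre (assumed n = 1.5): {fibre_cm:.2f} cm")
```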
    I am talking about improving on the ~1500 ns required by the best
    of today's interconnects for a single, remote 8-byte reference, and
    perhaps further hiding that reduced latency inside a pipelined
    vector load operation.  The question was:  What can HT 3.0 provide
    non-coherently, off board, in this regard?  Maybe the answer is
    nothing ... but I have not heard it cogently argued yet.
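
A back-of-the-envelope model of what pipelining buys, as a Python
sketch. Only the ~1500 ns single-reference latency comes from the text
above; the vector length of 64 and the 10 ns issue gap between
successive references are assumptions picked purely for illustration,
and the function names are hypothetical.

```python
def unpipelined_ns(n_refs, latency_ns):
    """Each remote reference waits for the previous one to complete."""
    return n_refs * latency_ns


def pipelined_ns(n_refs, latency_ns, issue_gap_ns):
    """References are issued back to back; only the first pays full latency."""
    if n_refs == 0:
        return 0.0
    return latency_ns + (n_refs - 1) * issue_gap_ns


LATENCY_NS = 1500.0   # from the post: one remote 8-byte reference
VECTOR_LEN = 64       # assumed number of references in one vector load
ISSUE_GAP_NS = 10.0   # assumed spacing between successive references

serial = unpipelined_ns(VECTOR_LEN, LATENCY_NS)
piped = pipelined_ns(VECTOR_LEN, LATENCY_NS, ISSUE_GAP_NS)

print(f"one at a time: {serial:.0f} ns")
print(f"pipelined:     {piped:.0f} ns")
print(f"effective cost per reference when pipelined: {piped / VECTOR_LEN:.1f} ns")
```

Under these assumed numbers the full latency is paid once per vector
rather than once per element, which is the hiding effect referred to
above; real hardware would add limits on outstanding references that
this model ignores.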
>> would seem that HT 3.0 supports this model across chassis as long as the
>>    programmer manages memory synchronization.
> You have to bite the bullet and manage synchronization by higher-order
> protocols. The physical world at the bottom is fundamentally message-passing. 
> You might not notice it much if you're working at the us scale, but
> at ns and below you can't ignore it.
    OK ... everything is a message ... even a Cray X1 vector write, but
    I am comparing MPI messages with something much smaller and more
    primitive.
>>>>   Sounds like the Cray X1E pGAS memory model.  Is there a role for 
>>> I don't think there is any other model but message passing. It's not
>>> like this is a ccHT a la HORUS 
>>> http://en.wikipedia.org/wiki/HORUS_interconnect
>>    The inter-chassis, but non-coherent interface that HT 3.0 supports
>>    would seem to work very nicely with UPC and CAF.  They run very well
>>    on the Cray X1, which provides coherent memory on-board only as well.


Richard B. Walsh

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org  |  612.337.3467

