[Beowulf] Broadcast - not for HPC - or is it?

Matt Hurd matthurd at acm.org
Tue Oct 5 17:23:36 PDT 2010


> From your description as well as from a quick look at the website, it
> looks and smells like a hub - I mean a dumb hub, like those which
> existed in the '90s before switching hubs (now called switches) took
> over. If so, then HPC might not be a good target for you, as it long
> ago adopted switches for good reasons.

Not as clever as a hub: a hub goes from any one of N ports to any one
or all of them, with carrier sense/collision detection relying on
back-off.

This thing just goes from port A to ports B1 ... Bn using a simple
optical coupler in the core.  There is no contention, as the paths are
direct.  I can't see it being too useful for HPC myself, but, as Kevin
pointed out, perhaps there is a corner case or two.

It does allow one of the B ports to be bi-directional, so a trader
could set up a subscription to a multicast group to be used by all
ports.  However, allowing no client-to-server traffic is a security
benefit, and I guess if an exchange used such a thing it should just
broadcast, or some such, and disable the bi-directional port.
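
For anyone who hasn't played with multicast, the subscription side is
just the usual IGMP join from the receiver; a rough sketch of what a
box hanging off one of the B ports might do (group address and port
are made-up placeholders, error handling omitted):

/* Minimal multicast receiver sketch - group and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* Bind to the feed's UDP port. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(31337);                 /* placeholder port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Join the market data group.  The IGMP join is the only traffic
       that needs the bi-directional port; the data itself just falls
       out of the splitter. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");  /* placeholder */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char buf[2048];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            printf("got %zd bytes of market data\n", n);
    }
    return 0;
}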

>> Primarily focused on low-latency
>> distribution of market data to multiple users as the port to port
>
> HPC usage is a mixture of point-to-point and collective
> communications; most (all?) MPI libraries use low-level point-to-point
> communications to achieve collective ones over Ethernet. Another
> important point is that the collective communications can be started
> by any of the nodes - it's not one particular node which generates
> data and then spreads it to the others; it's also relatively common
> that 2 or more nodes reach the point of collective communication at
> the same time, leading to a higher load on the interconnect, maybe
> congestion.
>
> What might be worth a try is a mixed network config where
> point-to-point communications go through one NIC connected to a switch
> and the collective communications that can use a broadcast go through
> another NIC connected to your packet replicator. However, IMHO it
> would only make sense if the packet replicator makes some guarantees
> about delivery: e.g. that it would accept a packet from node B even if
> a packet from node A is being broadcasted at that time; this packet
> from node B would be broadcasted immediately after the previous
> transmission has finished. This of course means that each link
> NIC-packet replicator needs to be duplex and some buffering should be
> present - this was not the case of the dumb hubs mentioned earlier. I
> think that such a setup would be enough for MPI_Barrier and MPI_Bcast.
>
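
For what it's worth, the collectives above look dead simple from the
application side; a minimal sketch with stock MPI (rank 0 as the root
is just an assumption, nothing here is specific to the replicator):

/* Minimal MPI broadcast/barrier sketch - the sort of collective that
   could, in principle, map onto a one-to-many replicated medium. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double params[1024] = {0};
    if (rank == 0) {
        /* Root fills in the data everyone else needs. */
        for (int i = 0; i < 1024; i++)
            params[i] = i * 0.5;
    }

    /* One-to-all: over plain Ethernet this is typically built from
       point-to-point sends underneath, per the comment above. */
    MPI_Bcast(params, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Everyone waits here until all ranks arrive. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
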
> One other HPC related application that comes to my mind is distributed
> storage. One of the main problems is keeping redundant metadata to
> prevent the whole storage going down if one of the metadata servers
> goes down. With such a packet replicator, the active metadata server
> can broadcast it to the others; this would be just one operation -
> with a switched architecture, this would require N-1 operations (N
> being the total number of metadata servers) and would lose any pretence
> of atomicity and speed.

Not a bad thought, the storage one, but again I reckon a
sub-microsecond switch would be the winner there on the functionality
front.  Switches like the Fulcrum-based ones are pretty impressive and
not too expensive.
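
That said, the quoted point is fair: on a switched network the active
metadata server ends up doing N-1 sends, versus a single send on a
replicated medium.  Roughly (addresses and port are made-up
placeholders, no error handling or framing):

/* Sketch: pushing a metadata update to standby servers.
   Addresses and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Switched network: one unicast send per standby (N-1 sends). */
static void replicate_unicast(int fd, const char *peers[], int npeers,
                              const void *update, size_t len)
{
    for (int i = 0; i < npeers; i++) {
        struct sockaddr_in peer;
        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(4000);                  /* placeholder */
        inet_pton(AF_INET, peers[i], &peer.sin_addr);
        sendto(fd, update, len, 0,
               (struct sockaddr *)&peer, sizeof(peer));
    }
}

/* Replicated medium: one send, the splitter does the fan-out. */
static void replicate_multicast(int fd, const void *update, size_t len)
{
    struct sockaddr_in group;
    memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_port = htons(4000);                     /* placeholder */
    inet_pton(AF_INET, "239.1.1.2", &group.sin_addr); /* placeholder */
    sendto(fd, update, len, 0,
           (struct sockaddr *)&group, sizeof(group));
}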

Along those lines, it's not an HPC app, at least in my head, but
replication is useful for building small fault-tolerant quorums with
microsecond-oriented failover.
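
Sketching that last bit (the numbers and the socket set-up are made up,
and a real quorum needs far more care than this): a standby can just
busy-poll the primary's heartbeats off the replicated feed and start a
failover after a few quiet microseconds.

/* Toy heartbeat watcher.  Assumes 'fd' is an already-open UDP socket
   joined to the heartbeat group; the 50 us threshold is a made-up
   number. */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void watch_primary(int fd)
{
    const uint64_t timeout_ns = 50 * 1000;   /* 50 us of silence */
    uint64_t last = now_ns();
    char buf[64];

    for (;;) {
        /* Busy-poll rather than sleep - that's where the microseconds
           come from. */
        if (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) > 0) {
            last = now_ns();
        } else if (now_ns() - last > timeout_ns) {
            printf("primary silent, starting failover\n");
            /* promote a standby, re-run the quorum election, etc. */
            break;
        }
    }
}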

>> They suggested interest in bigger port counts and mentioned >1000 ports.
>
> Hmmm, if it's only like a dumb hub (no duplex, no buffering), then I
> have a hard time imagining how it would work at these port counts -
> the number of collisions would be huge...

Nope, not a dumb hub, even dumber ;-)  No collisions, just a tree of
optical couplers frantically splitting the photon streams.  The only
real trick, albeit a pretty minor one, is keeping the signal integrity
within the optical budget and suitable for non-thinking plug and play.
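
For a sense of the budget problem at those port counts (back-of-envelope
only, ignoring excess and connector losses): an ideal 1:N split costs
10*log10(N) dB, so 1:1024 is already about 30 dB down before anything
real-world gets a look in.

/* Back-of-envelope ideal splitter loss in dB for a 1:N fan-out.
   Real couplers add excess and connector loss on top. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int n = 2; n <= 1024; n *= 2)
        printf("1:%-4d split ~ %4.1f dB\n", n, 10.0 * log10((double)n));
    return 0;
}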

Regards,

--Matt.


