[Beowulf] [EXTERNAL] IB vs. Ethernet

Scott Atchley e.scott.atchley at gmail.com
Mon Mar 2 14:08:31 UTC 2026


On Wed, Feb 25, 2026 at 9:04 PM Lawrence Stewart <stewart at serissa.com>
wrote:

> Arista has published 10G latency measurements for QSFP-based copper and
> optical links from 1 to 6 meters.
>
> Copper latency looks like about 5 ns per meter while optical is a little
> slower for short cables and a little faster for long ones.
>
>
> For 400G link modules, apparently you can use “analog” optical
> transceivers with 20 ns delays plus fiber delay up to 100 meters.  You can
> also use DSP-based ones that could be 100 ns.
>
> The Optical Analog/Clock and Data Recovery cables are much lower latency
> than the Active Optical Cables with retimers in them and perhaps equalizers.
>
> For connections within a rack, you can also use Direct Attach Copper,
> which is just a twinax parallel cable, up to about 5 meters.  Or there are
> Active Electrical Cables with equalizers that are a bit slower.
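
To put the cable figures above side by side, here is a rough sketch in
Python. The 5 ns/m copper figure and the 20 ns and 100 ns module delays come
from the message above. The 4.9 ns/m fiber figure is my assumption (glass
with a refractive index near 1.47), and I am treating each module delay as a
single per-link total, which the quoted text does not actually say. The
function names are made up for illustration.

```python
# Rough one-way link latency from the figures quoted in this thread.
COPPER_NS_PER_M = 5.0     # "about 5 ns per meter" (quoted)
FIBER_NS_PER_M = 4.9      # assumption: refractive index of about 1.47
ANALOG_MODULE_NS = 20.0   # quoted figure for "analog" optics
DSP_MODULE_NS = 100.0     # quoted "could be 100 ns" for DSP optics


def dac_latency_ns(length_m):
    """Passive Direct Attach Copper: propagation delay only."""
    return COPPER_NS_PER_M * length_m


def optical_latency_ns(length_m, module_ns):
    """Optical link: module delay (assumed per link) plus fiber delay."""
    return module_ns + FIBER_NS_PER_M * length_m


for meters in (1, 3, 5, 30, 100):
    # DAC only reaches "up to about 5 meters" per the quoted text.
    dac = f"{dac_latency_ns(meters):6.1f}" if meters <= 5 else "   n/a"
    analog = optical_latency_ns(meters, ANALOG_MODULE_NS)
    dsp = optical_latency_ns(meters, DSP_MODULE_NS)
    print(f"{meters:4d} m  DAC {dac} ns  analog {analog:6.1f} ns  DSP {dsp:6.1f} ns")
```

Under these assumptions the module delay dominates inside a rack and the
fiber itself dominates once you get to tens of meters.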
>
> The price tags for the optical 400G cables are eye-popping.
>
> I realize that most AI work is bandwidth-focussed, and a microsecond is
> fine, but I have a soft spot for SHMEM 8 byte puts and gets, and there is
> always a role for Barrier and small AllGathers.
>
> -L
>

How much does FEC add? I have been under the impression that it is now
mandatory at ≥100 Gbps.
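
This is not an answer, but one floor is easy to compute: a block decoder
cannot finish before the whole codeword has arrived. The sketch below only
computes that arrival time. RS(528,514) and RS(544,514) with 10-bit symbols
are the Reed-Solomon codes I know from Ethernet; the line rates paired with
them here are my assumptions about typical configurations, and the
`interleave` parameter is a hypothetical knob for designs that interleave
several codewords. Decoder processing time comes on top of this, and I have
no numbers for it.

```python
# Lower bound on FEC latency: time for a full codeword to arrive.

def codeword_time_ns(n_symbols, bits_per_symbol, line_rate_gbps, interleave=1):
    """Arrival time of `interleave` Reed-Solomon codewords, in nanoseconds."""
    bits = n_symbols * bits_per_symbol * interleave
    return bits / line_rate_gbps  # bits / (Gbit/s) = ns


# (label, codeword length in symbols, assumed line rate in Gb/s)
examples = [
    ("RS(528,514) at 103.125 Gb/s", 528, 103.125),
    ("RS(544,514) at 106.25 Gb/s", 544, 106.25),
    ("RS(544,514) at 425 Gb/s", 544, 425.0),
]

for label, n_symbols, rate in examples:
    print(f"{label}: {codeword_time_ns(n_symbols, 10, rate):.1f} ns")
```

So the floor shrinks as the aggregate rate goes up, but it never goes away.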



>
>
> > On Feb 25, 2026, at 19:20, Lux, Jim (US 430E) <james.p.lux at jpl.nasa.gov> wrote:
> >
> >
> >
> > -----Original Message-----
> > From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Lawrence Stewart
> > Sent: Saturday, February 21, 2026 4:34 AM
> > To: beowulf at beowulf.org
> > Cc: Lawrence Stewart <stewart at serissa.com>
> > Subject: [EXTERNAL] Re: [Beowulf] IB vs. Ethernet
> >
> >
> >
> >> On Feb 21, 2026, at 3:28 AM, Greg Lindahl <lindahl at pbm.com> wrote:
> >>
> >> On Thu, Jan 15, 2026 at 08:28:36PM -0500, Lawrence Stewart wrote:
> >>
> >>> I think a 64 byte store at a core should directly become a packet.  No
> >>> on-die-network, no coherence, no root complex, no host-fabric adapter.
> >>> Incoming short messages should be delivered directly to a fifo in the
> >>> relevant core.
> >>
> >> I think that's a great idea!
> >>
> >> — greg
> >>
> >
> >
> > As Greg, I think, is hinting, this idea was a thing that QLogic HFIs
> > did, using the core write combining buffers to good effect.  It seems like
> > it is also the basic idea behind MOVDIR64B, which specifies that a 64-byte
> > write will be atomic all the way down.
> >
> > Using core registers for messaging is much older, with Transputers,
> > Tilera, Dally’s J Machine and arguably Cray E-registers.
> >
> > What this is really about is end-to-end latency. We’ve been stuck at 1
> > microsecond since the Cray T3D 30 years ago, in spite of 100x improvements
> > in link speed.  If we can eliminate all the middlemen and get switches back
> > to 50 ns forwarding, I think we should be able to get 300 ns end to end in
> > a good size system.
> >
> > -Larry
> >
> >
> > Indeed, I suspect the 1 microsecond probably ties to something else that
> > was convenient: if you're not running parallel wires (lanes), then sending
> > 1000 bits at 1 Gbps takes 1 microsecond.
> >
> > And if the actual link gets faster, the messages get bigger, so that
> > they still take 1 microsecond.
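
The arithmetic behind the two paragraphs above, as a small Python sketch.
Only the 1000 bits at 1 Gbps case is from the thread; the other link speeds
and the 64-byte message are values I picked for illustration, and the
function names are made up.

```python
# Serialization time on a single serial link (no parallel lanes).

def serialization_ns(message_bits, link_gbps):
    """Time to clock a message onto the wire: bits / (Gbit/s) = ns."""
    return message_bits / link_gbps


def bits_in_window(link_gbps, window_ns=1000.0):
    """How many bits fit in a fixed time window (default 1 microsecond)."""
    return link_gbps * window_ns


print(serialization_ns(1000, 1.0))      # the case from the thread

for gbps in (1.0, 10.0, 100.0, 400.0):  # illustrative link speeds
    size_bytes = bits_in_window(gbps) / 8
    short_ns = serialization_ns(64 * 8, gbps)
    print(f"{gbps:6.1f} Gb/s: 1 us holds {size_bytes:9.0f} bytes, "
          f"a 64-byte message takes {short_ns:7.2f} ns")
```

The second column is the point of the paragraph above (messages grow to
fill the microsecond); the third is why small puts and gets do not have to.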
> >
> > There are some practical issues. As your symbol rate gets higher on the
> > wire, things like impedance discontinuities causing reflections become more
> > important. You have a transition from die to package, one from package to
> > board, one from board to connector/cable.  And those all have time scales
> > of roughly 1-10 ns.  Stack all those up and it can take a long time for
> > the cascade of reflections to die out.
> >
> > The fix, today, is to put in equalizers (preferably adaptive equalizers)
> > that essentially "undistort" the waveform.  But those equalizers have to
> > look at many symbol times to work (typically, they're implemented as a
> > tapped delay line with weights on each tap and summed, i.e. a FIR filter),
> > which then means that your first bit out is delayed by however many symbols
> > are in the filter's delay line.  I suspect that for "commodity" hardware,
> > there's a particular length of delay line that is long enough to
> > accommodate all possible wiring configurations.
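
A sketch of the tapped delay line described above, in plain Python. The tap
weights, the tap count of 32, and the symbol rates are all invented for
illustration and are not taken from any real SerDes. The timing helper
assumes the output waits for the delay line to fill, as the paragraph above
describes.

```python
# FIR equalizer as a tapped delay line, plus the time to fill the line.

def fir_equalize(samples, taps):
    """y[n] = sum over k of taps[k] * x[n - k]."""
    delay_line = [0.0] * len(taps)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        out.append(sum(w * d for w, d in zip(taps, delay_line)))
    return out


def fill_time_ns(n_taps, symbol_rate_gbaud):
    """Time for n_taps symbols to enter the line: symbols / (Gsym/s) = ns."""
    return n_taps / symbol_rate_gbaud


taps = [-0.1, 1.0, -0.2]          # hypothetical weights, main tap in the middle
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = fir_equalize(impulse, taps)
main_cursor = response.index(max(response))
print("impulse response:", response)
print("main cursor arrives", main_cursor, "symbol(s) after the input")

for gbaud in (1.0, 25.0):         # illustrative symbol rates
    print(f"32 taps at {gbaud} GBd fill in {fill_time_ns(32, gbaud):.2f} ns")
```

The delay is fixed in symbols, so in nanoseconds it depends on how many taps
the design carries and how fast the symbols are.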
> >
> > Let's look at Ethernet: the maximum Ethernet run for GigE is 100
> > meters, which, not so oddly, is about 500 ns long (propagation speed is
> > ~0.66c due to the dielectric and capacitance/inductance of the twisted
> > pair). So the time for a reflection to get back to the sending end is,
> > hmmm, 1 microsecond.
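
Checking that number with a few lines of Python. The 100 m length and the
0.66 velocity factor are from the paragraph above; the function names are
made up.

```python
# Propagation delay on a cable, from its length and velocity factor.

C_M_PER_S = 299_792_458  # speed of light in vacuum


def one_way_ns(length_m, velocity_factor):
    """One-way propagation delay in nanoseconds."""
    return length_m / (velocity_factor * C_M_PER_S) * 1e9


def reflection_return_ns(length_m, velocity_factor):
    """Time for a reflection from the far end to return to the sender."""
    return 2 * one_way_ns(length_m, velocity_factor)


print(f"one way:    {one_way_ns(100, 0.66):.1f} ns")
print(f"round trip: {reflection_return_ns(100, 0.66):.1f} ns")
```

That comes out at roughly 505 ns one way and a little over 1 microsecond for
the round trip, which matches the estimate above.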
> >
> >
>