Beowulf: A theoretical approach

Robert G. Brown rgb at phy.duke.edu
Thu Jun 22 15:30:29 PDT 2000


On Thu, 22 Jun 2000, Walter B. Ligon III wrote:

> --------
> 
> Well, yeah, but it's the PCI interface I'm talking about.  Robert Brown's
> posting was really more to the point.  Build a NIC that interfaces
> directly to the CPU and memory.

Sure, and memory is indeed another way to do it.  Build a small
communications computer that "fits" into a memory chip slot.  I'd guess
that one could make the actual interface a real (but small and very fast
-- SRAM?) memory chip that sits on TWO memory buses: the one in the
computer in question, and the one in the "computer" built into the
interface, whose only function is to manage communications and which
would be strictly responsible for avoiding timing collisions -- possibly
with a harness that allows it to generate interrupts to help even more.
(Can a memory chip per se generate trappable interrupts now?  I don't
know.)  Then accompany it with a kernel module that maps those memory
addresses into a dedicated interface space and manages the interrupts,
so the CPU only tries to write the memory when it is writable and read
it when it is readable.
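
Just to make that concrete, here's a purely hypothetical sketch of the
user-side handshake once such a kernel module exported the window.  The
device node (/dev/commdimm), the offsets, and the status bit are all
invented for the sake of argument -- nothing here is a real driver API:

/* Hedged sketch: mmap the hypothetical shared-SRAM window and honor
 * the "only write when writable" rule described above. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE (64 * 1024)  /* size of the shared SRAM window */
#define STATUS_OFF  0x0000       /* status word kept by the comm computer */
#define DATA_OFF    0x0100       /* payload region */
#define ST_TX_READY 0x1          /* interface will accept a write */

int main(void)
{
    int fd = open("/dev/commdimm", O_RDWR);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint8_t *win = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    volatile uint32_t *status = (volatile uint32_t *)(win + STATUS_OFF);

    /* Touch the window only when the interface says it is writable.
     * In the real thing the kernel module's interrupt handler would put
     * us to sleep here instead of letting us spin. */
    while (!(*status & ST_TX_READY))
        ;

    memcpy((void *)(win + DATA_OFF), "hello, node 2", 14);

    munmap((void *)win, WINDOW_SIZE);
    close(fd);
    return 0;
}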

Control the interface with e.g. headers/trailers on the writes -- write
a block of data to it at memory speed (possibly even with DMA transfers
and the CPU doing something else).  Then write an address into a byte
block that initiates the transfer.  Reads require an interrupt -- the
data comes in and is buffered in a generous buffer (perhaps the
post-PROM leftovers of an onboard 64 MB or 128 MB SDRAM chip -- build
the thing on top of the DIMM it replaces) and at the appropriate time an
interrupt is generated to tell the kernel/system to execute a read DMA
from the memory buffer into "real" memory.  I think this is pretty much
the way memory-mapped, DMA-capable network interfaces operate now,
except that they are bottlenecked at about a Gbps (32-bit PCI at 33 MHz)
and generally have a much higher latency.
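
To pin down the "write an address to kick off the transfer" step, the
send path might look like the following fragment -- again hypothetical:
the DOORBELL offset and the destination node ID written as the "trailer"
are mine, reusing the mapped window from the sketch above:

#include <stdint.h>
#include <string.h>

#define DATA_OFF     0x0100   /* payload region, as above (hypothetical) */
#define DOORBELL_OFF 0x0010   /* writing a destination here starts the send */

/* Copy a block into the shared SRAM window at memory speed, then write
 * the destination node into the doorbell word.  The on-board comm
 * computer sees the doorbell write and streams the block out while the
 * main CPU goes back to work. */
static void comm_send(volatile uint8_t *win, uint32_t dest_node,
                      const void *buf, size_t len)
{
    memcpy((void *)(win + DATA_OFF), buf, len);             /* bulk data */
    *(volatile uint32_t *)(win + DOORBELL_OFF) = dest_node; /* the "kick" */
}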

This is just the same idea in sheep's clothing, except that instead of
pulling the transferred data out of the pseudo-CPU's "cache" SRAM, the
second CPU lives in an entirely different space not directly accessible
from the main CPU, where you can do with it whatever you wish.  This way
there is a single chip of "shared" memory (which is actually probably an
SRAM cache buffer living in both spaces that can convince the main
mobo's CPU that it is an SDRAM DIMM) and an attached communications
co-computer that does nothing but move stuff into and out of the SRAM
and into its own copious comm buffer (128 MB is actually probably gaudy,
but it gets the point across -- this sucker could exchange BIG blocks of
data or lots of little blocks in parallel with the main CPU because it
operates completely independently) and manage the transfers.  With <10
ns SRAM, there is probably time for it to be loaded and kept full (or
emptied) when attached to a relatively slow (presumed SDRAM) interface.
Obviously the interface would ignore or simulate memory refresh and all
that.
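
Back of the envelope (my numbers, not anything from a spec): a 10 ns
cycle time is 100 MHz, and across a 64-bit DIMM interface that's 800
MB/sec -- which happens to be exactly the peak burst rate of PC100
SDRAM (100 MHz x 8 bytes).  So the SRAM side should have no trouble
keeping pace with the bus it's impersonating.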

Even with a hellacious latency (which there is no reason to expect a
priori) one could imagine a parallel fiber connection that gives you a
VERY high bandwidth to other nodes.  For example, 32 optical fibers and
switches (operating and controlled in parallel, each responsible for one
bit) could transfer data quite rapidly.  There are plenty of problems
that would be accessible with very high internode bandwidth even if one
DID have to pay a hundred microsecond hit in latency.  For one thing, it
would be refreshing to have a network that could transfer data faster
than main memory accesses.  One could think about distributed shared
memory paradigms (e.g. NUMA) with or without the underlying cache
coherence, which could be done in software if necessary.  For lots of
problems, or in a (shared-memory-based) message-passing environment, it
wouldn't be needed at all.
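
Back of the envelope again, assuming gigabit-class optics on each
strand: 32 fibers x 1 Gbps = 32 Gbps, or 4 GB/sec -- roughly five times
the 800 MB/sec peak of PC100 SDRAM.  So "faster than main memory
accesses" isn't hyperbole; even after protocol overhead there's
headroom.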

I dunno, Greg, two beers into it I'm still tormented with the thought
that $2 million might be a reasonable investment, especially if one is a
company like Transmeta, with a significant investment in making tiny
computers that could -- given the right harnesses -- be assembled into a
beowulf.  But even for Intel it might make sense.

However, Greg's other point (about PCI-X making it all better) is still
well taken.  Not being an IEEE member, I've been unable to get the PCI-X
specs from the web (and I've expended some effort in that direction --
it drives me bananas to have a closed forum where significant standards
development occurs without even a window where the world can see in),
so it may well be that they've basically snugged it up much closer to
the CPU and memory buses.

Still, in the "old days" of PC development, PCs were so obviously
horrible in many features of their design (my original IBM PC had a 64K
-- yes, K -- motherboard) that lots of small companies got very, very
creative in their designs for enhancements and made decent fortunes for
their founders. Coprocessor boards, transputers, CPU plug-ins, and lots
of multifunction boards with memory and peripherals all hit the market
and sold (sometimes "like crazy").  These days, it seems like this
particular kind of bent-coat-hanger engineering has all but disappeared,
which is a bit of a shame.  Understandable enough -- computers now are
adequate for almost any mainstream application "out of the factory" --
but a shame, as it takes away a lot of creativity.

If the "standard" interface you have is too slow and there is an
interface available that is more than fast enough and standards based,
clever engineering and supporting software SHOULD be able to create a
kludge that uses the faster one (outside of the purpose for which it was
designed, of course).

Sigh.  I'll go down for another beer now.  Maybe three will make this
idea go away.  At least I have the blessed advantage of not being an EE
so I won't be tempted to go drum up the requisite $5M (why take risks)
of VC and go for it.  One can only hope, though, that Intel (or Compaq,
or any of the main computer/CPU/motherboard folks) has a light bulb turn
on and actually designs a motherboard/CPU connection a priori, outside
of the existing bus specs, "just for communications".  Something like
the benighted and hellish AGP slot, but instead of being useless and
devoted to graphics (which tend to work just fine on PCI, for the most
part) devoted to way deep, low latency/high bandwidth communications.

How hard can it be to design a standard API spec for this?  Put it on
the memory bus.  Put EVERYTHING on the memory bus and make all device
latencies the responsibility of the peripheral designer.  Make it the
responsibility of the peripheral designer to provide something that can
be read or written at memory bus speeds and latencies (subject to
interrupt line control), and put the burden on their shoulders to
provide interrupt-driven memory bus control of when the particular
device is ready to read or write again.  All of this can be managed with
a shared memory/co-processor paradigm.  In fact, a dual-CPU system,
running a shared memory interface between a program on CPU A and its
communications program on CPU B, can emulate it now, sometimes even
profitably, except that the comm program has to go through the same damn
bus to get to the NIC, and a lot of NICs do DMA transfers anyway, so
most of CPU B is effectively wasted.
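
To see what I mean, here's a toy pthreads version of that emulation -- a
minimal sketch, nobody's real setup; the "comm" thread just prints where
it would actually feed the NIC:

/* Toy emulation of the shared memory/co-processor idea: a "compute"
 * thread (CPU A) hands blocks to a "comm" thread (CPU B) through a
 * shared buffer guarded by a mutex and condition variable. */
#include <pthread.h>
#include <stdio.h>

#define BLOCKS 4

static char shared_buf[256];              /* the "shared memory interface" */
static int  have_block = 0;               /* 1 => buffer holds unsent data */
static int  done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

static void *comm_thread(void *arg)       /* plays the role of CPU B */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!have_block && !done)
            pthread_cond_wait(&cv, &lock);
        if (!have_block && done) { pthread_mutex_unlock(&lock); break; }
        printf("comm: sending \"%s\"\n", shared_buf); /* would hit the NIC */
        have_block = 0;
        pthread_cond_signal(&cv);         /* buffer is free again */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)                            /* main() plays CPU A */
{
    pthread_t tid;
    pthread_create(&tid, NULL, comm_thread, NULL);
    for (int i = 0; i < BLOCKS; i++) {
        pthread_mutex_lock(&lock);
        while (have_block)                /* wait for comm to drain it */
            pthread_cond_wait(&cv, &lock);
        snprintf(shared_buf, sizeof shared_buf, "block %d", i);
        have_block = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
    pthread_join(tid, NULL);
    return 0;
}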

It's hard (at least for a novice like me) to understand why modern
computers require "a peripheral bus" with separate and distinct
latencies and bandwidths at all anymore.  Put everything on one bus and
let the peripheral itself decide how fast it "can" interface (up to the
actual lat/bw of real memory) and leave the decision of how fast it
"does" interface up to the interrupt-driven kernel software.  I think a
lot of the "peripheral bus" concept is an archaism left over from the
old days, when one had very slow devices that couldn't possibly saturate
a memory channel (so it made sense for several to have to share).

   rgb 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 1-919-660-2525   email: rgb at phy.duke.edu
