Beowulf: A theoretical approach
David S. Greenberg
dsg at super.org
Mon Jun 26 09:06:34 PDT 2000
I'm at least partially responsible for coining the acronym SALC. I got tired
of talking about "the T3E memory model" when I meant something more general
that would also cover systems such as Quadrics.
Several of us (including some hardware architects, compiler writers, and
applications writers) believe that it is important to populate the space in
between pure distributed memory and pure shared memory models. I'll describe
why we want something different by first summarizing our problems with the
extrema models and then describing what we believe is necessary.
"A pox on both your houses"
The pure distributed memory models tend to require message passing, emphasize
portability over all else, and please those whose applications are relatively
loosely coupled -- in a word, Beowulfers. The pure shared memory models tend to
push flat memory spaces, emphasize the need for global cache coherency, and
please those whose applications have not been tailored to parallelism and data
distribution.
The hardware costs (in design, scalability, etc.) of a globally shared and
cache consistent memory can be huge. Most of the mainstream shared memory
systems (SGI, Sun, IBM, Compaq) are, in fact, patinas over distributed memory.
Going beyond 64 processors gets difficult -- the systems tend to become
unstable and not worth the increased design costs. I am convinced that if you
want shared memory of this type then you are better off following more radical
designs such as the Cray/Tera MTA.
On the other hand, as has been pointed out by several people in this thread,
classic ethernet-connected beowulfs can have limited applications scope. They
are fantastic for some applications but quickly top out at 8 to 16 processors
for other applications. Faster (higher bandwidth and lower latency) networks
help somewhat but the processors (and nodes) are becoming more powerful faster
than the interconnects are improving. PCI-X will probably be a big step
forward but almost certainly not enough. (I suspect that InfiniBand, like AGP,
will be more of a marketing coup than a technical breakthrough, except that it
will take all the credit for PCI-X.)
As an example of our plight, consider that the ASCI Red machine and Cray T3E
machines had full-featured, 400 MB/s interconnects four years ago, with
processing nodes which are anemic compared to today's nodes. PCI-X, when it
appears, will maybe get us back to this level of bandwidth, with many fewer nice
features in early PCI-X NICs/systems than were in the MPP machines.
We deserve better but are not greedy.
Which is where the SALC model comes in. It is possible, Quadrics is a good
example, to augment NICs with the ability to perform remote direct memory
access. "Standards" such as VIA make this an option for hardware. I think we
must make it a requirement. Furthermore, we need to, as a community, decide
what attributes are necessary. I believe, based on my experience with Portals
at Sandia and UPC here at CCS, that fairly simple hardware is sufficient. We
do not need all the bells and whistles of Cray's e-registers (though they are
very nice.) We do not need to support in hardware all the notification modes
of MPI-2 one-sided (though every mode has its strong adherents.) We do not
need the on-NIC virtual-to-real address translation of Quadrics or Matt Welsh's
Myrinet control program (though they are certainly nice also.)
Instead, we need the ability to register large chunks of contiguous physical
memory with the NIC and associate them with a tag usable from user-space on all
nodes. The registering is likely to occur in a library or be done by the
compiler. Next we need the ability to issue remote loads and stores by issuing
just a couple of machine instructions (ideally just two loads or a load and a
store). These instructions should take advantage of all the processor
machinery for outstanding loads and delayed writes. Lastly we need a
relatively efficient synchronization primitive which allows us (again usually a
library or the compiler) to create a fence guaranteeing that previous remote
operations have completed. All this should be possible in a $100 NIC and
require little or no increase in switch complexity.
So please help us move forward.
(1) Propose a better name than SALC. We need to keep the notion that addresses
are somehow shared, i.e. you can address most if not all of the memory of
remote nodes with simple, efficient load/store type operations. We also need to
convey the warning that once a datum is fetched there will be no automatic
notification to the fetcher when someone else has written to the fetched
location.
(2) Think about the real needs of your codes. Can you save lots of space by
not copying boundary regions? What are your "memes"? Do you use shared work
queues? Do you use ghost cells? How do you load balance? How do you improve
data locality and reuse? How do you synchronize?
(3) Demand more from compiler writers, system architects, and machine
purchasers. Beowulfs are nice, but you can have more if you push for it.
I think I've passed my "one-screen per posting" limit so I'll stop here but I'd
love to continue the discussion one-on-one with anyone who cares to contact
me. Or come to the Atlanta Linux Showcase and Conference in October.
Greg Lindahl wrote:
> > > But it isn't actually cache coherent. Remember Cray shmem(): To use
> > > data, you fetch it to be close to you, and then it isn't cache
> > > coherent while it's local. It IS cache coherent in that the fetch
> > > gets you the right data.
> > Wrong. In the CS2 it _was_ cache coherent both locally and remotely.
> I wasn't discussing the CS2, I was discussing the T3E and SALC. Sorry if I
> gave a different impression. I don't know of any other machines like the
> CS2, nor do I think they're interesting.
> > Another of the reasons I don't like the SALC description is precisely
> > this confusion about what the "local consistency" is intended to mean.
> You can take it up with Bob Numrich; I'm just the messenger. Bob invented
> the shmem interface in the first place. shmem is interesting mainly because
> it's cheaper to build hardware for it than for things like the CS2.
> > Making the NIC properly cache coherent is one of the main reasons to
> > be on the processor bus, appearing as a second CPU. It allows the NIC
> > to implement the full coherency protocol when accessing data (either
> > bringing it in, or sending it out).
> That's far more expensive and difficult than the other benefit of getting on
> the processor bus: reduced latency.
> -- g