Beowulf in a Box (fwd)

Bob Glamm glamm@ece.umn.edu
Mon, 28 Sep 1998 13:49:27 -0400


>> > http://www.pobox.com/~kragen/sa-beowulf/

I've read through this.  It _could_ be a really nice product,
if it were turned into an SMP/cc-NUMA -type device.  It's not
a bad product right now, but I think it has some overhead that
could be dispensed with.

First, if I recall my PCI spec correctly, it is possible to configure
each PCI device with an address space.  It should be possible to give
each StrongARM module (CPU + PCI interface + memory) its own 
distinct physical address space outside of the memory of the host
machine.  For a first iteration, this address space should probably
also be uncacheable to the host computer.
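
To make the address layout concrete, here is a rough sketch in C of
how one window per module could be laid out above host RAM.  The
window size and the function names are my own assumptions, not
anything from the web page or the PCI spec; the only PCI fact used is
that a device's memory region is a power-of-two size and aligned to
that size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: every StrongARM module exposes one window
 * of the same power-of-two size. */
#define WINDOW_SIZE 0x01000000u   /* 16 MiB per module (hypothetical) */

/* Round addr up to the next multiple of align (a power of two). */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Base address of module `index`, placed above the top of host RAM. */
uint32_t module_base(uint32_t host_ram_top, unsigned index)
{
    return align_up(host_ram_top, WINDOW_SIZE) + index * WINDOW_SIZE;
}

/* Which module owns addr?  Returns -1 for host memory or for an
 * address past the last module's window. */
int module_of(uint32_t host_ram_top, unsigned nmodules, uint32_t addr)
{
    uint32_t first = align_up(host_ram_top, WINDOW_SIZE);
    if (addr < first)
        return -1;
    uint32_t idx = (addr - first) / WINDOW_SIZE;
    return idx < nmodules ? (int)idx : -1;
}
```

With 128 MB of host RAM (top at 0x08000000), module 0 would start at
0x08000000 and module 3 at 0x0B000000 under these assumptions.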

This configuration would give the StrongARM "array" an SMP/NUMA
flavor.  CPU-CPU transfers are simply done by writing into/reading
from the other CPU's address space.  This makes it either SMP
(for processors on the same PCI board) or NUMA (for processors
on a different PCI board).  Of course, this has the added benefit
that you don't need the I2O protocol at all.  Simple bulk transfers
over the PCI bus will handle all your needs.
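
As a sketch of what "writing into the other CPU's address space"
looks like from software, the C below uses plain arrays as stand-ins
for the per-module windows.  On real hardware the pointer would be
the mapped PCI window instead; the array sizes and the names
remote_write/remote_read are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NMODULES     4      /* hypothetical number of StrongARM modules */
#define WINDOW_WORDS 1024   /* hypothetical window size in 32-bit words */

/* Stand-in for the modules' memory windows.  volatile because, on the
 * real thing, another CPU could change these words at any time. */
static volatile uint32_t window[NMODULES][WINDOW_WORDS];

/* Copy nwords words from local memory into module dst's window. */
void remote_write(unsigned dst, size_t offset,
                  const uint32_t *src, size_t nwords)
{
    assert(dst < NMODULES);
    assert(offset <= WINDOW_WORDS && nwords <= WINDOW_WORDS - offset);
    volatile uint32_t *p = &window[dst][offset];
    for (size_t i = 0; i < nwords; i++)
        p[i] = src[i];
}

/* Copy nwords words out of module src's window into local memory. */
void remote_read(unsigned src, size_t offset,
                 uint32_t *dst, size_t nwords)
{
    assert(src < NMODULES);
    assert(offset <= WINDOW_WORDS && nwords <= WINDOW_WORDS - offset);
    const volatile uint32_t *p = &window[src][offset];
    for (size_t i = 0; i < nwords; i++)
        dst[i] = p[i];
}
```

Note there is no message framing or protocol layer here at all, which
is the point: a transfer is just a run of stores or loads.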

Second, it is not necessary to run the entirety of Linux on any
of the StrongARM processors.  Doing so in this configuration (in
any configuration IMHO ;) is a waste of resources.  In fact, why
bother to put ethernet on board these PCI cards?  Well-written
(and scalable) MPI programs don't do a lot of communication; a
single Gigabit ethernet card in the host should be plenty of
network resource for quite a few StrongARMs.

There are some design issues, of course, not the least of which
is atomic memory access/locking.  I haven't read much on the
StrongARM architecture, so I can't say much about this.  Also,
it will be necessary to execute *some* system code on each CPU
(for process scheduling, idle loops, some basic interface code
that sends syscalls to the host CPU).  In addition, some modifications
would be required for Linux to detect, use, and operate such a
module.
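
For the "interface code that sends syscalls to the host CPU", one
possible shape is a small request slot living in the module's window.
The sketch below is only that: the slot layout, the state names and
the single CALL_ADD request are invented for illustration, and it
uses C11 atomics, which order memory on an ordinary coherent machine.
Whether stores crossing the PCI bus between a StrongARM and the host
give the same guarantee is exactly the open locking question above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { SLOT_FREE, SLOT_REQUEST, SLOT_REPLY };  /* hypothetical states */
enum { CALL_ADD = 1 };                         /* hypothetical call   */

struct syscall_slot {
    atomic_int state;    /* who owns the slot right now */
    uint32_t   number;   /* which call is requested     */
    uint32_t   arg[2];
    int32_t    result;
};

/* Module side: publish a request.  Returns -1 if the slot is busy. */
int slot_post(struct syscall_slot *s, uint32_t number,
              uint32_t a0, uint32_t a1)
{
    if (atomic_load_explicit(&s->state, memory_order_acquire) != SLOT_FREE)
        return -1;
    s->number = number;
    s->arg[0] = a0;
    s->arg[1] = a1;
    /* Release: the fields above become visible before the state flips. */
    atomic_store_explicit(&s->state, SLOT_REQUEST, memory_order_release);
    return 0;
}

/* Host side: service one pending request.  Returns 1 if it did work. */
int slot_service(struct syscall_slot *s)
{
    if (atomic_load_explicit(&s->state, memory_order_acquire) != SLOT_REQUEST)
        return 0;
    if (s->number == CALL_ADD)
        s->result = (int32_t)(s->arg[0] + s->arg[1]);
    else
        s->result = -1;   /* unknown call */
    atomic_store_explicit(&s->state, SLOT_REPLY, memory_order_release);
    return 1;
}

/* Module side: pick up the reply, if any, and free the slot. */
int slot_collect(struct syscall_slot *s, int32_t *result)
{
    if (atomic_load_explicit(&s->state, memory_order_acquire) != SLOT_REPLY)
        return 0;
    *result = s->result;
    atomic_store_explicit(&s->state, SLOT_FREE, memory_order_release);
    return 1;
}
```

The module's idle loop would poll slot_collect() after posting, and
the host would poll (or be interrupted into) slot_service() for each
module's slot.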

Personally, I prefer this type of NUMA solution.  It is STILL
possible to run MPI programs on this type of hardware (possibly with
some modifications to MPI), but there's a whole lot less
overhead & latency with this type of solution than with the one
proposed (and I guess implemented ;) on the web page.

Feel free to disagree, though. ;)  If you think I'm off my rocker
by suggesting such a thing, see the SHRIMP project 
(http://www.cs.princeton.edu/~rda/shrimp.html) where such a
thing has already been implemented.

-Bob