Beowulf in a Box

Kragen kragen@pobox.com
Mon, 28 Sep 1998 13:25:25 -0400


(I've forwarded this, and the post it's a reply to, to the sa-beowulf
list because Doug Eadline is a commercial Beowulf builder.  His
experience may be very useful in letting us know what to expect.)

On Mon, 28 Sep 1998, Douglas Eadline wrote:
> So what happened? Well (In my opinion) INMOS could not keep
> up with the other CPUs of the time 

This could conceivably happen with the StrongARM.  Intel is developing
it, so I'm sure they *can* keep up if they want to; they've said they
want to, but we all know how fast such things can change.

If I understand correctly, the ARM family is actually specced by Acorn,
and several companies actually produce ARMs -- is that right?  Acorn
seems to be refocusing on consumer electronics, so this might possibly
be something to worry about.

Intel is announcing the next generation of ARM stuff next month; people
say it will be 500MHz and have hardware floating-point.

> Failed promises were the first problem of the Transputer,
> but I think its ultimate demise (other than as an embedded CPU)
> came from the lack of acceptance of "niche hardware" by the
> mainstream.  Sure, the embedded system guys love this kind of
> stuff, but I believe it is a tough sell to get someone
> to use niche hardware for the following reasons:
> 
> 1) single source hardware (possible overnight obsolescence)
> 2) support comes from a single vendor and is limited (manpower
>    pool is very limited)
> 3) because of the single source nature an organization must
>    make a large investment (of time and money) - this is the biggest
>    problem.   

Well, I think the StrongARM is more like the Pentium than like the
Transputer with regard to these characteristics.

1) The vendor is Intel, but there are ARM chips from other vendors.
The companies that have licensed the ARM core are Intel, NEC, Philips,
National Semiconductor, Rockwell, Samsung, TI, IBM, Lucent, and
20 others, according to <URL:http://www.arm.com/CoInfo/CoBackground/>.
As far as I can tell, nearly all of these companies actually make ARM
processors, although most of them are embedded into other products, and
many of them are rather slow -- for example, Cirrus Logic makes
something called CL-PS7110, which is 15MIPS and 66mW.  More to the
point, though, any high-speed, low-power-consumption CPU that can talk
to 100MHz SDRAM and a PCI bus could easily be substituted, as soon as
NetBSD or Linux was ported to it.

I think that, even independent of the market for this little
supercomputer-on-a-PCI-card, high-speed ARM CPUs will be available for
quite a while -- they're just used in too many things.

2) Well, see above.  Support for the ARM is pretty broad, and both
Intel and ARM Ltd. are developing high-speed ARMs.

3) What's a large investment?  You'd probably have to port your
software to NetBSD, which would probably not be a large investment if
it currently runs on more than one Unix (other than Linux and Solaris,
which are so similar that most apps run with only trivial changes).
Testing your ported software could probably take a few weeks.  The
hardware will probably cost less than getting another PC.

This is quite likely to be a smaller investment than buying another
high-end machine.

(Of course, if you have to rewrite all your code to work in
fixed-point, you *would* have a large investment.)
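
To make concrete what such a rewrite involves, here is a minimal sketch
of fixed-point arithmetic in Python.  The 16.16 format and the helper
names are illustrative assumptions of mine, not taken from any real
application or from the board's software.

```python
# Minimal 16.16 fixed-point sketch (illustrative assumption: 16 fractional bits).
# Values are stored as plain integers scaled by 2**16.

FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # the integer that represents 1.0


def to_fixed(x):
    """Convert a float to its nearest 16.16 fixed-point integer."""
    return int(round(x * ONE))


def to_float(f):
    """Convert a 16.16 fixed-point integer back to a float."""
    return f / ONE


def fx_mul(a, b):
    """Multiply two fixed-point values.

    The raw product carries 32 fractional bits, so shift 16 back out.
    Python integers never overflow; on a 32-bit CPU the intermediate
    product would need a 64-bit register.
    """
    return (a * b) >> FRAC_BITS


a = to_fixed(1.5)
b = to_fixed(2.25)
print(to_float(fx_mul(a, b)))
```

Addition and subtraction work on the scaled integers unchanged; it is
multiplication, division, and range management that have to be touched
throughout the code, which is where the investment goes.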

> There is a large amount of comfort knowing that you are not
> relying on a single person, company, or product to run your machine.  

Agreed.

> I believe that one of the
> reasons Beowulf/Clusters are very popular is that they
> are "plug and play" replacements (from a software
> standpoint) for much more expensive machines and therefore, the cost to
> adopt clusters is small.  If you can deliver "plug and play"
> to a market segment, then $/MIPS is a good sell.   

I think this is a plug-and-play proposition; it doesn't require
rewriting applications in funny languages, modifying them to support
new messaging styles, etc.  (Unless they currently run on a shared-memory
multiprocessor machine.)  The worst case is that you can't do
floating-point at any kind of reasonable speed, which locks out a big
market segment.

> To say Intel is behind ARM helps a bit, but not much. Intel
> killed their own children (i860/960) and closed their
> supercomputer shop (except for custom machines).  

True.

> Like it or not, the "PC" is a known concept, people are more
> comfortable with things they know, than with better things
> they do not know.

Certainly true.  Linux is running into the same problem.

> Finally, my guess as to the amount of work/cost involved to 
> bring this idea to market is rather large.

Well, it's not so much a matter of bringing it to market.  Simon Thorpe
needed some better hardware for his neural-network stuff, and so he got
in touch with the folks at Causality/Chaltech.  So they're designing
and building the boards for him.  They wanted to sell a few more boards
than Simon needed, in order to bring the cost down to something
reasonable.  They've found enough customers for that.

It may be that some of the folks involved want to invest some work in
getting this to a bigger market.

The amount of time should be a few months; if everything goes well (no
redesigns, etc), the hardware will be available in November.

> It sounds as though
> there is still some hardware/software  to be developed.

Yes.  Fortunately, the bulk of the software has been developed -- MPI,
PVM, NetBSD, Linux, etc. -- and nearly all of the hardware (by the good
folks at Digital); what remains is to put it together, write drivers,
and fabricate the PC boards.

> Performance is uncertain, lots of assumptions.

True enough.  I don't know how fast the PCI bus will turn out to be,
but I don't see how it could be terribly slow.  Even with 100
processors sharing it, each one's share would be about as fast as
switched 10Mbps Ethernet.  With smaller clusters (I say "clusters"
because they're conceptually separate machines; they just happen to
share a power supply and talk to each other over PCI) I think that
performance should be excellent.
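
The arithmetic behind that estimate can be sketched as follows.  The
figures are assumptions, not measurements: a plain 32-bit, 33MHz PCI
bus running at its theoretical peak, shared evenly, with no arbitration
or protocol overhead.  Real throughput will be lower.

```python
# Rough per-CPU share of one shared PCI bus.
# Assumptions (not measured): 32-bit, 33 MHz PCI at theoretical peak,
# bandwidth split evenly, no arbitration or protocol overhead.

PCI_WIDTH_BITS = 32
PCI_CLOCK_HZ = 33_000_000


def per_cpu_mbps(n_cpus, width_bits=PCI_WIDTH_BITS, clock_hz=PCI_CLOCK_HZ):
    """Return each CPU's even share of peak bus bandwidth, in Mbps."""
    peak_bps = width_bits * clock_hz  # about 1056 Mbps for 32-bit/33 MHz
    return peak_bps / n_cpus / 1_000_000


for n in (4, 8, 24, 100):
    print(f"{n:3d} CPUs: {per_cpu_mbps(n):7.1f} Mbps each")
```

Under these assumptions 100 CPUs get a little over 10 Mbps apiece,
which is where the comparison with switched 10Mbps Ethernet comes from.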

Simon says he's tested his neural-network code on StrongARMs already,
so the CPU performance is not in doubt (at least for his application).
I'm sure that other people could do the same.

> Is there a business plan?

I don't know.  Perhaps the companies actually producing these things
have one.

> By the time all this gets worked out, a lot may change.

True.  It looks like the ARM 10 processor will be released, which
should be twice as fast and provide some floating-point hardware to
boot.  It should be quick to design a new daughtercard for it.

The timeframe is so short, though, that this uncertainty is greatly
reduced.

> BTW, there are many applications that do not require FP.  
> We have some tools that can take hundreds of CPUs and
> make them do amazing things.  Most of these are database/datamining
> applications.

What fraction of the Beowulf market do you think needs FP?  5%, 10%,
20%, 50%, 80%, 90%, 95%?  I imagine not that many people have big
enough databases to mine.  Am I dead wrong?

> The one thing I have found, however, is that
> a clean simple design is best for using lots of CPUs efficiently.
> i.e. the time it takes for any CPU to talk to any other CPU
> is about the same for all CPUs.

That won't be the case here, certainly.  If you have three of these
cards in a bus, then each CPU will have two or three other CPUs on the
same bus with it, three or four CPUs on the same card but on the other
bus, and the rest of the CPUs will have to be communicated with through
the main system bus.

The bandwidth is high enough that this may not be a concern for a lot
of applications, but that's a guess on my part.
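
The three tiers of distance can be sketched like this.  The layout is
an assumption chosen for illustration (three cards, two local buses per
card, four CPUs per bus), and the names are hypothetical; the real
board may differ.

```python
from collections import Counter, namedtuple

# Hypothetical addressing: which card a CPU sits on, and which of that
# card's local buses it is attached to.
Cpu = namedtuple("Cpu", ["card", "bus"])


def tier(a, b):
    """Classify the path between two CPUs, nearest tier first."""
    if a.card == b.card and a.bus == b.bus:
        return "same bus"
    if a.card == b.card:
        return "same card, other bus"
    return "main system bus"


# Assumed layout: 3 cards x 2 buses per card x 4 CPUs per bus = 24 CPUs.
cpus = [Cpu(card, bus) for card in range(3) for bus in range(2) for _ in range(4)]

me = cpus[0]
counts = Counter(tier(me, other) for other in cpus[1:])
for name, n in counts.items():
    print(f"{name}: {n} peers")
```

With this layout any one CPU has 3 peers on its own bus, 4 on the other
bus of its card, and 16 reachable only over the main system bus, so
most pairs are in the slowest tier.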

> Well there are some things to consider.  I tried to give
> some objective experience that may help you "fine tune" your strategy. 
> It is not my goal to say "this will not work (because I have no
> idea if it will or not)", but rather, "push your idea a little".

I really appreciate your experience!  I think it indicates that this is
going to be a big success.

About 500 people have visited the web page since I announced it on
Saturday.  Presumably a lot more will visit if it gets mentioned in LWN
and cola.

Kragen

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The sages do not believe that making no mistakes is a blessing. They believe, 
rather, that the great virtue of man lies in his ability to correct his 
mistakes and continually make a new man of himself.  -- Wang Yang-Ming