about fast interconnects and SCI in particular

Florent Calvayrac fcalvay@aviion.univ-lemans.fr
Mon, 14 Jun 1999 11:48:52 -0400


I have been involved since October 1998 in the definition, fund raising
and purchase of a cluster for computational physics,
and we are about to make a final decision on the nature of the cluster
(processors and communications hardware).

I asked the following question on comp.parallel last year
and got various interesting answers, but I am still undecided:

-----------------------------------------
Given a fixed total budget (around $100,000), is it better to spend nearly
all of it on ultrafast communications hardware (say Myrinet or SCI)
and then buy 16 CPUs, or to buy only a Fast Ethernet switch and 32 or 64
(with SMP) faster processors?

Since several users will be using the system, the communication needs
cannot be estimated accurately.

------------------------------------------
I include a summary of the most informative answers at the end of this
posting.
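
The trade-off in the question above can be put into a rough model. What
follows is a minimal sketch, not a measurement. It assumes that every CPU
has the same speed, that the compute work of a step divides evenly over the
CPUs, and that each CPU pays n_messages * (latency + bytes / bandwidth) for
communication per step. The cluster names, the latencies and the bandwidth
of the "fast" network are hypothetical placeholders; only the 12.5 MB/s
peak rate of Fast Ethernet (100 Mbit/s) is a hard number.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    n_cpus: int
    latency_s: float       # assumed per-message latency, in seconds
    bandwidth_bps: float   # assumed per-link bandwidth, in bytes per second


def step_time(cluster, work_s, n_messages, message_bytes):
    """Estimated wall time of one step: evenly divided compute plus messaging."""
    compute = work_s / cluster.n_cpus
    per_message = cluster.latency_s + message_bytes / cluster.bandwidth_bps
    return compute + n_messages * per_message


# Hypothetical configurations for the same budget (all figures assumed).
FAST_NET_16 = Cluster("16 CPUs, fast interconnect", 16, 10e-6, 100e6)
ETHERNET_64 = Cluster("64 CPUs, Fast Ethernet", 64, 100e-6, 12.5e6)

if __name__ == "__main__":
    workloads = {
        "coarse-grained": (64.0, 10, 1_000_000),  # work_s, messages, bytes
        "fine-grained": (1.0, 1000, 1_000),
    }
    for label, (work_s, n_msg, size) in workloads.items():
        for cluster in (FAST_NET_16, ETHERNET_64):
            t = step_time(cluster, work_s, n_msg, size)
            print(f"{label:15s} {cluster.name:28s} {t:.4f} s")
```

With these assumed figures the coarse-grained job runs faster on the 64
Ethernet CPUs and the fine-grained job runs faster on the 16 CPUs with the
fast network, which is why the unknown communication needs of the users
matter so much.
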


Having pondered the question at length, I now wonder whether Myrinet and
Dolphin/Scali do not run counter to the Beowulf/Extreme Linux approach. For
Scali in particular:

-Linux is supported only thanks to the Paderborn PhD students;
most of the source code is subject to nondisclosure, and definitely not GPL

-premium prices are charged for a technology which is admittedly a standard
but is supported by only one hardware maker; I do not like the idea
of depending on a single company which can then charge arbitrary prices
for upgrades, or go under as so many do in this business.

-buying such hardware goes against the argument I used to get funding
for our cluster, that is, the use of consumer electronics hardware
selling at a low price thanks to the economies of scale made possible by mass
production (I even dream of the legendary cluster of Nintendos), and of Linux
to save the purchase of a system such as Solaris.

-hardware (cache coherence, which is the main argument for SCI,
is not implemented) and software are not mature yet: my recent posting
confirms that the first versions of ScaMPI could not compete against
TCP/IP mpich over Fast Ethernet, at least for my codes.

-a correct choice of mainboards is crucial, because of
incompatibilities and errors in PCI implementations,
which is not the case for ordinary Fast Ethernet cards


So, since we have access anyway to the T3E, SP2 and Origin 2000s
of the national and regional supercomputing center, our cluster
will mostly be used to develop and test MPI codes, and for production
runs of jobs that take a long time to complete but have reasonable
requirements in terms of memory, communications and number
of processors, the most demanding tasks being kept on the "true"
supercomputers. We are about to decide on a cluster of 64 Pentium IIIs
with Fast Ethernet. We are still hesitating about dual-CPU boards,
and about the newest Alpha processors: is the gain in FP worth the
extra money?

Any opinions/comments? (No harm intended.) What is the situation
for Myrinet or Gigabit Ethernet, which appear to be similarly priced?


Thanks in advance


************************************************
I received the following very informative answers from some expert
readers of comp.parallel :

************************************************
Justin Cormack wrote:

>
> Don't buy the slower processors. Myrinet is unlikely to make a proportionate
> performance improvement in line with costs. A lot of problems are dominated
> by FP performance. Switched fast ethernet gives good performance, as at this
> sort of scale you can get fully non-blocking switches with only one hop.
>
> Of course it depends on your codes, but without a proven case for cost
> effectiveness I would always go for fast ethernet.
>
> justin

DMP wrote:

> The question you ask is a complicated question, and one for which there may
> not be a really good answer.  Be that as it may, here are some views on the
> subject that I have and/or have heard from others.
>
> 1)  Many approaches to parallelizing programs can only be used (or at least
>     work best by an order of magnitude or more) on cache coherent shared
>     memory SMP's.  Examples of this are Loop Level Parallelism (e.g. OPENMP)
>     based on compiler directives and HPF (yes I know it is supposed to be
>     portable to distributed memory environments, but there is the question
>     of performance).
>
> 2)  It is easier to tune code on an SMP, since one can separate out the
>     serial optimizations from the parallel optimizations.
>
> 3)  Some codes do not make efficient use of large numbers of processors.
>     Therefore, they can only be efficiently run on distributed memory
>     systems if the nodes have a lot of memory per processor.  This makes the
>     nodes expensive to buy, unless all of the users have the same requirement.
>     On an SMP, each job will get the resources it needs (so long as you don't
>     oversubscribe the system).  This can make it more cost effective when one
>     has a variety of users to support (by the way, our current favorite
>     SMP systems are from SGI and SUN, avoid HP).
>
> 4)  Some codes (e.g. chemistry) require all of the data to be replicated
>     on each node when running on a distributed memory system.  When running
>     these codes on an SMP, one can have the same requirement, however when
>     using inherently Shared Memory codes such as Gaussian, this requirement
>     disappears.  Therefore, the proper combination of an SMP and the choice
>     of a chemistry code can substantially reduce one's hardware investment.
>     Note however, that many of the chemistry codes don't scale well past
>     something like 8-16 processors.
>
> 5)  Shared memory systems should in general also provide excellent support
>     for message passing jobs that use large numbers of small messages and
>     are therefore highly sensitive to the latency.
>
> 6)  Unfortunately, most SMP's have a higher memory latency and a lower per
>     processor memory bandwidth than do distributed memory systems.  As a
>     result, trying to run canned routines that were optimized for distributed
>     memory systems that had a faster memory system, but lacked a large cache,
>     is frequently a recipe for poor performance!  However, if you don't mind
>     tuning the code, this is an obstacle that can in general be overcome.
>
> 7)  Trying to cluster together SMP's doesn't in general work very well.  Most
>     programs don't know that they are running on a cluster of SMP's.  As a
>     result, they will in general be doing far too much communication between
>     the SMP's and run out of communication bandwidth.  I have seen non
>     disclosure documents from at least two vendors which discussed this
>     problem.  I have also seen one shop that claims to be able to live with
>     this problem by using a combination of message passing between boxes and
>     loop level parallelism within the boxes (this can substantially reduce
>     the bandwidth requirements between boxes, but requires significant
>     code modifications).
>
> 8)  In terms of raw hardware, clusters of PC's (or possibly low end
>     workstations) with limited networking will be the cheapest.  However,
>     in general it will also give very poor performance.  Therefore, one
>     needs to know if your goal is purely educational (in which case
>     performance may be less of an issue) or is to do real research.  In the
>     latter case, you need a really well balanced system.  I strongly prefer
>     a well designed SMP.  However, if you want to go the route of a
>     distributed memory system, then clustering may be acceptable (it works
>     pretty well for the IBM SP).  I know of one person who finds that
>     Fast Ethernet works well for 4 PC's running Linux.  I strongly doubt that
>     it will still be working well for most problems by the time you get to
>     16 PC's.  If your total budget is small, you might start out buying some
>     high end PC's and use Fast Ethernet as a low cost throw away network.
>     As you grow to a larger sized system, you can then go back and upgrade
>     the network using Myrinet, SCALI, or something like that.  However, I
>     strongly suggest that you look at a 4 processor Origin 200 and see how
>     much performance you can get from it.  Our experience is that by buying
>     third party memory boards (Kingston I believe), we could get an extremely
>     well configured system for under $100,000.  With educational discounts
>     and dropping prices, you might be able to do even better.  In many cases,
>     these systems will outperform larger networks of PC's.
>
>                           Hope this helps, bon chance,
>
>                           Daniel M. Pressel
>
>                           Computer Scientist
>
>                           U.S. Army Research Laboratory
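
Point 7 above (message passing between boxes, loop level parallelism
within them) can be illustrated by simply counting messages. This is a
minimal sketch under stated assumptions: an all-to-all exchange where every
rank sends one message to every other rank, and a hybrid scheme with exactly
one message-passing rank per SMP box. The function name and the 8 x 4
configuration are hypothetical. It counts messages only; the total number of
bytes crossing between boxes depends on the algorithm and is not modelled.

```python
def all_to_all_messages(n_boxes, cpus_per_box, hybrid):
    """Return (total, inter_box) message counts for one all-to-all exchange.

    hybrid=False: one message-passing rank per CPU (flat message passing).
    hybrid=True:  one message-passing rank per box; the CPUs inside a box
                  cooperate through shared memory and send no messages.
    """
    if n_boxes < 1 or cpus_per_box < 1:
        raise ValueError("n_boxes and cpus_per_box must be at least 1")
    if hybrid:
        ranks = n_boxes
        total = ranks * (ranks - 1)
        return total, total          # every message crosses between boxes
    ranks = n_boxes * cpus_per_box
    total = ranks * (ranks - 1)
    intra_box = n_boxes * cpus_per_box * (cpus_per_box - 1)
    return total, total - intra_box


if __name__ == "__main__":
    for hybrid in (False, True):
        total, inter = all_to_all_messages(8, 4, hybrid)
        label = "hybrid" if hybrid else "flat"
        print(f"{label:6s} total={total:4d} inter-box={inter:4d}")
```

For 8 boxes of 4 CPUs the flat scheme sends 896 messages between boxes and
the hybrid scheme 56, so the per-message latency cost on the network falls
by a factor of 16, at the price of the code changes mentioned above.
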

Dale Talcott wrote:

>
> I would go for faster processors and more memory.  First, because our
> experience with an Intel Paragon and an IBM SP2 suggests that network
> speed is not that important: programs that do a lot of communication
> will be slow, no matter how fast your network.  On the other hand, slow
> processors slow everything down, even well-parallelized programs.
>
> Second, it is very difficult to take advantage of a parallel system.
> So, while your users work on partitioning their programs better, they
> can still use the processors for productive, serial work.
>
> If your experience is similar to ours, you will find that most users are
> unable to parallelize their programs well.  Those users will benefit from
> the fastest serial system they can find, with the largest amount of
> memory.
>
> Some programs partition more easily.  Those programs will make up the
> bulk of your parallel work.  But those same programs are the ones that
> don't put a huge demand on your network.
>
> --
> Dale Talcott, Purdue University Computing Center,

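The remark that poorly parallelized programs benefit most from fast serial
processors can be made concrete with Amdahl's law, which is not named in the
reply but expresses the same idea. This is a minimal sketch: the function
names, the relative CPU speeds (1.5 versus 1.0) and the CPU counts are
hypothetical, and communication cost is ignored entirely.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Speedup over one CPU when only parallel_fraction of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


def relative_throughput(parallel_fraction, n_cpus, cpu_speed):
    """Job speed relative to one CPU of speed 1.0 (communication ignored)."""
    return cpu_speed * amdahl_speedup(parallel_fraction, n_cpus)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        few_fast = relative_throughput(p, 16, 1.5)   # assumed: 16 faster CPUs
        many_slow = relative_throughput(p, 64, 1.0)  # assumed: 64 slower CPUs
        print(f"parallel fraction {p:.2f}: "
              f"16 fast = {few_fast:.2f}, 64 slow = {many_slow:.2f}")
```

Under these assumptions a code that is only half parallel runs faster on the
16 fast CPUs, while a code that is 99% parallel runs faster on the 64 slower
ones.
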

"Michael A. K. Gross" wrote:

>
> As a user of a relatively large Linux cluster (64 dual Pentiums) in
> your price range, I have some not-very-good news for you.
>
> Without some idea of what applications are going to run on the system,
> you can't make the decision reasonably.
>
> If your applications have only nearest-neighbor communications and a
> lot of computation per message, they can be made quite efficient with
> a commodity network like Fast Ethernet.  Our system (theHive, at
> Goddard Space Flight Center) has a highly efficient implementation of
> the piecewise parabolic mesh method for compressible gas dynamics,
> which falls in this category, running on it.
>
> On the other hand, all-to-all communications can be a serious problem,
> and I've implemented a particle-mesh N-body code, which requires
> long-range force calculations crunched using a fast Fourier transform.
> The Fourier transform is a serious bottleneck because it requires
> all-to-all communications. That isn't to say it can't be done---a
> multigrid solver would very likely have less serious problems, but
> that's a major change to the code.
>
> By all means be VERY careful to link all the CPUs through a SINGLE
> switch. We recently upgraded our machine from four 16-port switches
> (with a fifth linking them together) to one 64-port switch, and the
> improvement for the FFT has been tremendous.  It's still far below
> supercomputer performance, but it is now feasible to use all 128 CPUs
> on a 3D FFT, whereas it wasn't before.
>
> To CYA, I'd poll the users, after telling them all this.  There isn't
> a solution that will satisfy everyone.
>
> Mike Gross
>
> --
>
************************************************************
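
The benefit of a single switch for all-to-all traffic, as reported in the
last answer, can be estimated with a little arithmetic. This is a rough
sketch, assuming that every node sends at full link rate, spread uniformly
over all other nodes, that all links have the same speed, and that each leaf
switch has one uplink to the central switch. The number and speed of the
uplinks are not stated in the answer, so they are assumptions here, and the
function name is hypothetical.

```python
def uplink_oversubscription(n_nodes, nodes_per_switch, uplinks_per_switch=1):
    """Off-switch all-to-all demand of one leaf switch divided by its uplink
    capacity, both in units of one link's bandwidth.

    A value above 1 means the uplinks are the bottleneck; 0 means no traffic
    leaves the switch at all (every node is on the same switch).
    """
    if n_nodes < 2 or n_nodes % nodes_per_switch != 0:
        raise ValueError("n_nodes must be >= 2 and a multiple of nodes_per_switch")
    if uplinks_per_switch < 1:
        raise ValueError("uplinks_per_switch must be at least 1")
    # Fraction of each node's traffic addressed to nodes on other switches.
    off_switch_fraction = (n_nodes - nodes_per_switch) / (n_nodes - 1)
    demand = nodes_per_switch * off_switch_fraction
    return demand / uplinks_per_switch


if __name__ == "__main__":
    print("four leaf switches:", uplink_oversubscription(64, 16))
    print("one 64-port switch:", uplink_oversubscription(64, 64))
```

With 64 nodes on four leaf switches and one uplink each, the uplink would be
asked to carry about 12 times its capacity during an all-to-all exchange;
with a single non-blocking 64-port switch that bottleneck disappears.
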

--
Florent Calvayrac                          | Tel : 02 43 83 32 72
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UPRESA-CNRS 6087                           |
Universite du Maine-Faculte des Sciences   |
72085 Le Mans Cedex 9