very high bandwidth, low latency manner?
Hakon.Bugge at scali.com
Tue Apr 16 03:24:37 PDT 2002
I am sorry to hear that you were unable to achieve the expected performance
on the mentioned SCI-based systems. You raise a couple of issues, which I
would like to address:
Performance transparency is always a goal. Nevertheless, an implementation
will sometimes have a performance bug. The two organizations owning the
mentioned systems both have support agreements with Scali. I have checked
the support requests, but cannot find any request in which your incidents
were reported. We find this strange if you were truly aiming at good
performance. We are happy to look into your application and report our
findings back to this newsgroup.
2) Startup time.
You attribute the poor scalability to high startup time and the mapping of
memory. This is an interesting hypothesis, and it can easily be verified by
using a switch when you start the program and measuring the difference
between the elapsed time of the application and the time it uses after
MPI_Init() has been called. However, the startup time measured on 64 nodes,
two processors per node, where all processes have set up mappings to all
other processes, is nn second. If this contributes to poor scalability, your
application has a very short runtime.
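The arithmetic behind this check can be sketched as follows. This is only an illustration: the function name startup_share and the sample timings are hypothetical, and the two inputs are assumed to be wall-clock seconds obtained as described above (total elapsed time, and time spent after MPI_Init() has returned).

```python
def startup_share(total_elapsed_s, post_init_s):
    """Return (startup_seconds, startup_fraction) from two wall-clock timings.

    total_elapsed_s: elapsed time of the whole application run.
    post_init_s:     time the application uses after MPI_Init() has returned.
    Both values are assumed to come from the same run.
    """
    if total_elapsed_s <= 0:
        raise ValueError("total elapsed time must be positive")
    if not 0 <= post_init_s <= total_elapsed_s:
        raise ValueError("post-init time must lie between 0 and the total")
    startup_s = total_elapsed_s - post_init_s
    return startup_s, startup_s / total_elapsed_s


# Hypothetical timings, for illustration only (not measured values).
startup_s, fraction = startup_share(total_elapsed_s=125.0, post_init_s=100.0)
print(f"startup: {startup_s:.1f} s, {fraction:.0%} of elapsed time")
```

If the fraction stays small as the node count grows, startup is unlikely to explain the poor scalability.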
3) SCI ring structure
You state that in a multi-user, multi-process environment it is hard to
get deterministic performance numbers. Indeed, that is true; true sharing
of resources implies it. Whether the resource is a file server, a memory
controller, or a network component, you will probably always be subject to
performance differences. Also, lack of page coloring will contribute to
different execution times, even for a sequential program. You further
indicate that performance numbers reported, e.g., by the Pallas PMB
benchmark can only be used for applying for more VC. I disagree for two
reasons. First, you imply that venture capitalists are naive (and to some
extent stupid). That is not my impression; rather the opposite. Secondly,
such numbers are a good way to verify or refute your hypothesis that the
SCI ring structure is sensitive to traffic generated by other applications.
PMB's *multi* option is designed to investigate exactly the problem you
mention: run, e.g., MPI_Alltoall() on N/2 of the machine, then measure how
performance is affected when the other N/2 of the machine is also running
MPI_Alltoall(). This is the reason we are interested in performance numbers
to compare with those of SCI-based systems. It strikes me as strange that no
Pallas PMB benchmark results have ever been published for a reasonably
sized system based on alternative interconnect technologies. To quote Lord
Kelvin: "If you haven't measured it, you don't know what you're talking
about".
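The comparison suggested above boils down to a ratio of two timings. The sketch below assumes you already have the MPI_Alltoall() time on one half of the machine, once with the other half idle and once with it also running Alltoall. The function names, the 10% tolerance, and the sample numbers are hypothetical, and nothing here invokes PMB itself.

```python
def interference_slowdown(t_alone_us, t_shared_us):
    """Ratio of Alltoall time with the other half busy to the time with it idle.

    A value of 1.0 means the traffic on the other half had no measurable impact.
    """
    if t_alone_us <= 0 or t_shared_us <= 0:
        raise ValueError("timings must be positive")
    return t_shared_us / t_alone_us


def is_sensitive(t_alone_us, t_shared_us, tolerance=0.10):
    """True if the slowdown exceeds the tolerance (10% is an assumed threshold)."""
    return interference_slowdown(t_alone_us, t_shared_us) > 1.0 + tolerance


# Hypothetical timings in microseconds, for illustration only.
ratio = interference_slowdown(t_alone_us=200.0, t_shared_us=300.0)
print(f"slowdown: {ratio:.2f}x, sensitive: {is_sensitive(200.0, 300.0)}")
```

Repeating the measurement several times and comparing the spread against the ratio helps separate real interference from run-to-run noise.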
As a bottom line, I would like initiatives to compare cluster interconnect
performance to be appreciated, rather than scrutinized and characterized as
"only usable to apply for more VC".
At 11:40 AM 4/15/02 +0200, Markus Fischer wrote:
>Steffen Persvold wrote:
> > Now we have price comparisons for the interconnects (SCI,Myrinet and
> > Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
> > ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
> > node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII
> > HE-SL based cluster).
>I would like to get/see some numbers.
>I have run tests with SCI for a non linear diffusion algorithm on a 96 node
>cluster with 32/33 interface. I thought that the poor
>scalability was due to the older interface, so I switched to
>a SCI system with 32 nodes and 64/66 interface.
>Still, the speedup values were behaving like a dog with more than 8 nodes.
>Especially, the startup time will reach minutes which is probably due to
>the exporting and mapping of memory.
>Yes, the MPI library used was Scampi. Thus, I think the
>(marketing) numbers you provide
>below are not relevant except for applying for more VC.
>Even worse, we noticed, that the SCI ring structure has an impact on the
>communication pattern/performance of other applications.
>This means we only got the same execution time if other nodes were
>idle or did not have communication intensive applications.
>How will you determine the performance of the algorithm you just invented
>in such a case ?
>We then used a 512 node cluster with Myrinet2000. The algorithm scaled
>very fine up to 512 nodes.
> > Regards,
> > --
> > Steffen Persvold | Scalable Linux Systems | Try out the world's best
> > mailto:sp at scali.com | http://www.scali.com | performing MPI
> > Tel: (+47) 2262 8950 | Olaf Helsets vei 6 | - ScaMPI 1.13.8 -
> > Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY | >320MBytes/s and <4uS
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
Håkon Bugge; VP Product Development; Scali AS;
mailto:hob at scali.no; http://www.scali.com; fax: +47 22 62 89 51;
Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514;
Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway;
Mail Addr: Scali AS, Postboks 150, Oppsal, N-0619 Oslo, Norway;