[Beowulf] Parallel application performance tests

Wed Nov 29 14:14:46 PST 2006

Doug

I think the difference is mainly in the collective communications, which are
improving and are about at LAM level in v1.2. Overall I don't see a big
difference between LAM and OpenMPI.

Where can I find the driver settings for the Intel NICs to get the latency
down? I would like to try benchmarking the Intel NICs with reduced-latency
TCP.

That being said, I think the very good scaling I see with GAMMA is due to
its simple and highly efficient flow control. You have tested GAMMA so you
will have observed how regular it is; this becomes important with large
numbers of nodes. I think this is the source of the good scaling.

These are scalable applications but not of the "Embarrassingly Parallel"
variety. Referring to my post of the DLPOLY benchmark: with 32 cpus and TCP
it takes 82 secs, and with 64 cpus it takes 84 secs. So NOT an EP
application. On the other hand MPI/GAMMA takes 56 and 34 secs respectively.
This is better scaling than our HPC center's IB cluster which was 50 and 42
secs.
There are a number of reasons why the IB cluster is not performing as well
as it should; but this is not so atypical of supercomputer center
environments, which impose a number of constraints that can affect
performance. I discuss my interpretation of some of these issues on my
website.
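
To make the 32 to 64 cpu comparison concrete, here is a small sketch that turns the DLPOLY times quoted above into a speedup on doubling the cpu count (ideal is 2.0) and the matching efficiency. The function names are just illustrative; the times are the ones given in this post.

```python
# Wall-clock times in seconds for DLPOLY benchmark 1, as quoted in the post.
times = {
    "OpenMPI/TCP": {32: 82, 64: 84},
    "MPI/GAMMA": {32: 56, 64: 34},
    "OpenMPI/Infiniband": {32: 50, 64: 42},
}


def doubling_speedup(t_small, t_large):
    """Speedup obtained when the cpu count is doubled (ideal value is 2.0)."""
    return t_small / t_large


def doubling_efficiency(t_small, t_large):
    """Fraction of the ideal 2x speedup that was actually achieved."""
    return doubling_speedup(t_small, t_large) / 2.0


for name, t in times.items():
    s = doubling_speedup(t[32], t[64])
    e = doubling_efficiency(t[32], t[64])
    print(f"{name:20s} 32->64 cpus: speedup {s:.2f}, efficiency {e:.0%}")
```

With these numbers TCP gets slightly slower on going to 64 cpus (speedup just under 1), GAMMA gains roughly 1.65x, and the IB cluster roughly 1.19x.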

I was skimming over your website today. I also found that HPL is not faster
with GAMMA than with TCP. An HPCC developer told me HPL overlaps calculation
and communication; GAMMA polls so it always utilizes 100% of the cpu and
cannot take advantage of this. I totally agree with your comment about
supercomputer utilization. A huge amount of money is being spent to make
these things scale to enormous numbers of processors, just for a couple of
HPL runs. Then it gets Balkanized down to groups of 16-64 cpus most of the
time.

BTW we get 385 Gflops on HPL from a rack of 48 dual-core P4's: cost per rack
was $51K including switches and misc hardware (with good discounts from
vendors). So $132 per Gflop.
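
The arithmetic behind that figure, as a quick sketch. The per-core number assumes 48 nodes x 2 cores = 96 cores, which is my reading of "48 dual-core P4's"; the variable names are only for illustration.

```python
# Figures quoted above: HPL result and total cost for one rack.
hpl_gflops = 385.0
rack_cost_usd = 51_000.0

# Assumption: 48 nodes with one dual-core cpu each, i.e. 96 cores in total.
nodes = 48
cores_per_node = 2


def cost_per_gflop(cost_usd, gflops):
    """Dollars spent per sustained HPL Gflop."""
    return cost_usd / gflops


def gflops_per_core(gflops, n_nodes, n_cores_per_node):
    """Sustained HPL Gflops delivered by each core."""
    return gflops / (n_nodes * n_cores_per_node)


print(f"cost per Gflop:  ${cost_per_gflop(rack_cost_usd, hpl_gflops):.0f}")
print(f"Gflops per core: {gflops_per_core(hpl_gflops, nodes, cores_per_node):.2f}")
```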

Tony

-----Original Message-----
From: Douglas Eadline [mailto:deadline at clustermonkey.net] 
Sent: Wednesday, November 29, 2006 9:30 AM
To: Tony Ladd
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Parallel application performance tests

Tony,

Interesting work to say the least. A few comments. The TCP implementation of
OpenMPI is known to be sub-optimal (i.e. it can perform poorly in some
situations). Indeed, using LAM over TCP usually provides much better
numbers.

I have found that the single socket Pentium D
(now called Xeon 3000 series) provides great performance.
The big caches help quite a bit, plus it is a single
socket (more sockets means more memory contention).

That said, I believe for the right applications GigE
can be very cost effective. The TCP latency for
the Intel NICs is actually quite good (~28 us)
when the driver options are set properly and GAMMA
takes it to the next level.
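
For anyone who wants to check a latency figure like this themselves, a minimal TCP ping-pong sketch follows. It is not the benchmark used for the numbers in this thread. As written it runs both ends on one host over loopback, so it only exercises the TCP stack; to measure a NIC you would run the echo side on a second node. All function names are made up for the example.

```python
import socket
import threading
import time


def _recv_exact(sock, size):
    """Read exactly `size` bytes (a single recv may return fewer)."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def _echo_server(listener, n_messages, size):
    """Accept one connection and echo back n_messages of `size` bytes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n_messages):
            conn.sendall(_recv_exact(conn, size))


def pingpong_latency_us(host="127.0.0.1", n_messages=1000, size=1):
    """Return the mean one-way latency in microseconds (half the round trip)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(
        target=_echo_server, args=(listener, n_messages, size)
    )
    server.start()

    payload = b"x" * size
    with socket.create_connection((host, port)) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(n_messages):
            client.sendall(payload)
            _recv_exact(client, size)
        elapsed = time.perf_counter() - start

    server.join()
    listener.close()
    return elapsed / n_messages / 2 * 1e6


if __name__ == "__main__":
    print(f"mean one-way latency: {pingpong_latency_us():.1f} us")
```

Python adds its own overhead per message, so treat the result as an upper bound rather than a driver-level number.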

I have not had time to read your report in its entirety,
but I noticed your question about how GigE+GAMMA can do as
well as Infiniband. Well, if the application does not
need the extra throughput then there will be no improvement, in the same way
that the EP test in the NAS parallel suite is about the same for every
interconnect (EP stands for Embarrassingly Parallel). IS (Integer Sort), on the
other hand, is very sensitive to latency.
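
A rough way to see why some codes care about latency and others about throughput is the usual first-order model, time = latency + size / bandwidth. The sketch below uses the ~28 us GigE TCP latency mentioned above and the 125 MB/s line rate of Gigabit Ethernet. The second interconnect (5 us, 1000 MB/s) is a purely hypothetical stand-in for a faster network, not a measurement of any real hardware.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """First-order model: time = latency + size / bandwidth.

    With MB = 10**6 bytes, 1 MB/s is exactly 1 byte per microsecond,
    so bytes / (MB/s) is already in microseconds.
    """
    return latency_us + message_bytes / bandwidth_mb_per_s


# (latency in us, bandwidth in MB/s)
interconnects = {
    "GigE TCP": (28.0, 125.0),            # latency from this thread, GigE line rate
    "hypothetical fast": (5.0, 1000.0),   # assumed values, for illustration only
}

for size in (64, 1_000_000):
    for name, (lat, bw) in interconnects.items():
        t = transfer_time_us(size, lat, bw)
        print(f"{size:>9d} bytes over {name:18s}: {t:9.1f} us")
```

In this model small messages cost almost exactly the latency and large ones almost exactly size / bandwidth, so a code that sends few or no messages sees neither term, which is consistent with EP looking the same on every interconnect.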

Now, with multi-socket/multi-core becoming the norm,
better throughput will become more important. I'll have
some tests posted before too long to show the difference
on dual-socket quad-core systems.

Finally, OpenMPI+GAMMA would be really nice. The good news
is OpenMPI is very modular.

Keep up the good work.

  --
  Doug

> I have recently completed a number of performance tests on a Beowulf 
> cluster, using up to 48 dual-core P4D nodes, connected by an Extreme 
> Networks Gigabit edge switch. The tests consist of single and 
> multi-node application benchmarks, including DLPOLY, GROMACS, and 
> VASP, as well as specific tests of network cards and switches. I used 
> TCP sockets with OpenMPI v1.2 and MPI/GAMMA over Gigabit ethernet. 
> MPI/GAMMA leads to significantly better scaling than OpenMPI/TCP in 
> both network tests and in application benchmarks. The overall 
> performance of the MPI/GAMMA cluster on a per cpu basis was found to 
> be comparable to a dual-core Opteron cluster with an Infiniband 
> interconnect. The DLPOLY benchmark showed similar scaling
> to that reported for an IBM p690. The performance using TCP was typically
> a factor of 2 less in these same tests. Here are a couple of examples from
> the DLPOLY benchmark 1 (27,000 NaCl ions):
>
> (run times in secs)
>
> CPUS   OpenMPI/TCP (P4D)   MPI/GAMMA (P4D)   OpenMPI/Infiniband (Opteron 275)
>
>    1         1255               1276                  1095
>    2          614                635                   773
>    4          337                328                   411
>    8          184                173                   158
>   16          125                 95                    84
>   32           82                 56                    50
>   64           84                 34                    42
>
> A detailed write up can be found at: 
> http://ladd.che.ufl.edu/research/beoclus/beoclus.htm
>
>
>
> Tony Ladd
> Chemical Engineering
> University of Florida
>
> -------------------------------
> Tony Ladd
> Chemical Engineering
> University of Florida
> PO Box 116005
> Gainesville, FL 32611-6005
>
> Tel: 352-392-6509
> FAX: 352-392-9513
> Email: tladd at che.ufl.edu
> Web: http://ladd.che.ufl.edu
>
>

--
Doug