beowulf performance with MPI

Sat Jun 24 14:21:48 PDT 2000

Tony Skjellum writes:
>You may find our free MPI - MPI/Pro for TCP+SMP for Linux - interesting.
>
>Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
>MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
>+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com

I should mention that I downloaded this software and I found that it
worked great.  I was getting crappy scaling with my software of interest
(AMBER6, see http://www.amber.ucsf.edu).  My cluster is 6 dual P-III 600MHz
w/ 256MB RAM, one of which is the master and 5 of which are compute servers.
Interconnect is simply 100BT (eepro, 2.2.16) with a 100BT 8-port switch
connecting them.  The switch was only $150, nothing impressive. 
AMBER6 was compiled against either LAM or MPICH, using the latest version of each.

Using all 10 CPUs of the system, AMBER was only going 4 times faster than
on 1 CPU.  I was all but convinced that I needed to upgrade
the interconnect to Giganet or Myrinet, at very high relative cost.
However, I downloaded the MPI/Pro for TCP+SMP for Linux, and gave it a try
with AMBER.  The scaling is remarkably better!  In particular, here are the
numbers for the performance:

SIMULATION SYSTEM: DHFR in water, 23558 atoms
SIMULATION PARAMETERS: PME, 62.2x62.2x62.2 box, 1000 steps
COMPUTER SYSTEM: 6 dual Pentium-III 600MHz (100MHz bus) running Red Hat
6.2 connected by 100BT switch.  Total cost $15,000 at time of purchase,
early 2000.  Each machine approx $2460 + one 27GB hard drive ($250) + one
100BT switch ($149)

AMBER COMPILATION: g77 (egcs-1.1.2), flags:   -O3  -m486 -malign-double
-ffast-math -fno-strength-reduce

CPUs    Time (sec)      Speedup over 1 g77 CPU
1       5539            1.00

8       1429            3.79            (mpich)
10      1358            3.99            (mpich)

8        794            6.97            (mpipro)
10       692            8.00            (mpipro)

"Time" is wallclock time spent actually calculating the simulation,
not any setup or I/O time. 
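
For anyone who wants to redo the arithmetic, here is a minimal sketch that
recomputes speedup (1-CPU time divided by parallel time) and parallel
efficiency (speedup divided by CPU count) from the times listed above.  The
function names are my own, made up for illustration.  Note that recomputing
from the listed times gives about 3.88 and 4.08 for the two mpich rows,
slightly above the 3.79 and 3.99 in the table; the post does not say which
baseline those were taken against, so treat the table column as the
original figures.

```python
# Wallclock times (seconds) copied from the table above.
BASELINE_SEC = 5539  # 1 CPU, g77 build

RUNS = [
    # (MPI implementation, CPUs, time in seconds)
    ("mpich", 8, 1429),
    ("mpich", 10, 1358),
    ("mpipro", 8, 794),
    ("mpipro", 10, 692),
]


def speedup(baseline_sec, parallel_sec):
    """Speedup = single-CPU time / parallel time."""
    return baseline_sec / parallel_sec


def efficiency(baseline_sec, parallel_sec, cpus):
    """Parallel efficiency = speedup / number of CPUs (1.0 is ideal)."""
    return speedup(baseline_sec, parallel_sec) / cpus


if __name__ == "__main__":
    for impl, cpus, sec in RUNS:
        s = speedup(BASELINE_SEC, sec)
        e = efficiency(BASELINE_SEC, sec, cpus)
        print(f"{impl:7s} {cpus:3d} CPUs  speedup {s:5.2f}  efficiency {e:6.1%}")
```

On 10 CPUs the mpipro run comes out at roughly 80% efficiency, against
roughly 40% for mpich.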

I compared the profiles of the two simulations and it appears that
much of the time savings came from a significantly faster MPI_ALLGATHERV,
which AMBER uses to distribute the new particle positions and
velocities at the end of each timestep.  The allgather occurs in a serialized
section of the code, and therefore scaling is highly dependent on how fast
the MPI implementation performs it.
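
To put a rough number on how much a serialized section costs, one option is
the Karp-Flatt metric, which backs an effective serial fraction out of a
measured speedup S on p CPUs: e = (1/S - 1/p) / (1 - 1/p).  The sketch below
is only an estimate under that model, not something taken from the AMBER
profile: the "serial" fraction it reports lumps communication time in with
code that is truly serial, and the function names are made up for
illustration.

```python
def serial_fraction(speedup, cpus):
    """Karp-Flatt metric: effective serial fraction implied by a speedup."""
    if cpus < 2:
        raise ValueError("need at least 2 CPUs")
    return (1.0 / speedup - 1.0 / cpus) / (1.0 - 1.0 / cpus)


def amdahl_speedup(serial, cpus):
    """Amdahl's law: speedup predicted for a fixed serial fraction."""
    return 1.0 / (serial + (1.0 - serial) / cpus)


if __name__ == "__main__":
    # Speedups on 10 CPUs as listed in the results table.
    for impl, s in (("mpich", 3.99), ("mpipro", 8.00)):
        e = serial_fraction(s, 10)
        print(f"{impl:7s} effective serial fraction {e:.3f}")
```

With the 10-CPU speedups from the table this gives an effective serial
fraction of about 0.17 for mpich and about 0.03 for mpipro, which fits the
picture of a serialized allgather dominating the mpich runs.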

I have spoken with the MPI/Pro people to find out a little bit more.  Actually
there is no specific SMP optimization as yet; in fact, communication within
a node goes through the localhost network code.  However, the design is
multi-threaded and doesn't poll the way MPICH does.  I also suspect more effort
has gone into optimizing some of the MPI routines which are implemented
in MPICH with fairly naive code.
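
To illustrate the general difference between polling and blocking (this is
a toy, not code from MPICH or MPI/Pro, and all names in it are
hypothetical): a polling receiver keeps checking for a message and ties up
a CPU while it waits, whereas a blocking receiver sleeps until the message
arrives.  Here a thread stands in for the network delivering a message
after a short delay.

```python
import queue
import threading
import time


def producer(q, delay_sec):
    """Stand-in for a message arriving from the network after a delay."""
    time.sleep(delay_sec)
    q.put("message")


def receive_by_polling(q):
    """Busy-wait: check repeatedly for a message, burning CPU meanwhile."""
    checks = 0
    while True:
        checks += 1
        try:
            return q.get_nowait(), checks
        except queue.Empty:
            continue


def receive_by_blocking(q):
    """Block: sleep inside q.get() until the message arrives."""
    return q.get(), 1


def run(receiver, delay_sec=0.05):
    """Start the producer, receive one message, report how many checks it took."""
    q = queue.Queue()
    t = threading.Thread(target=producer, args=(q, delay_sec))
    t.start()
    msg, checks = receiver(q)
    t.join()
    return msg, checks


if __name__ == "__main__":
    print("polling :", run(receive_by_polling))
    print("blocking:", run(receive_by_blocking))
```

The polling receiver typically makes a large number of checks before the
message shows up; on a dual-CPU node those are cycles that the other
process could otherwise have used for computation.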

Overall the program was quite easy to work with. After downloading
the RPM and installing it on the master, I just created a file called
"/etc/machines" listing all the client nodes by their hostnames,
then recompiled my app with the MPI/Pro-provided "mpicc" and "mpif77"
scripts, and ran the app with the provided "mpirun" script.  The syntax
is very similar to MPICH, and it integrates straightforwardly with
our queueing system, PBS, through the use of the PBS_NODEFILE environment
variable.
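
As a rough sketch of the PBS hookup: PBS_NODEFILE names a file with one
hostname per line for each allocated processor, so a job script can count
the lines and hand the file to mpirun.  The "-np" and "-machinefile" flags
below are MPICH-style; that MPI/Pro's mpirun takes exactly the same flags
is an assumption on my part, and "./my_app" and the function names are
placeholders.

```python
import os


def read_nodefile(path):
    """Return hostnames from a PBS node file (one entry per line)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def build_mpirun_command(program, nodefile=None):
    """Build an MPICH-style mpirun argument list from a PBS node file.

    Assumption: the mpirun in use accepts -np and -machinefile.
    """
    nodefile = nodefile or os.environ["PBS_NODEFILE"]
    hosts = read_nodefile(nodefile)
    return ["mpirun", "-np", str(len(hosts)), "-machinefile", nodefile, program]


if __name__ == "__main__":
    if "PBS_NODEFILE" in os.environ:
        # Inside a PBS job: print the command rather than running it.
        print(" ".join(build_mpirun_command("./my_app")))
    else:
        print("PBS_NODEFILE is not set; run this from inside a PBS job")
```

The resulting list could be passed to subprocess.run() from a job script;
the sketch only prints it so it stays harmless outside a queue.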

Dave