Scyld: bad scaling

Thu Sep 27 13:49:46 PDT 2001

On Wed, 26 Sep 2001, Ivan Rossi wrote:

> Recently I rebuilt our tiny 10-CPU cluster using Scyld. Before that I had been
> using RedHat 6.2 + LAM MPI. And I like it; it is easier to maintain.
> Unfortunately, after the rebuild, I found a marked performance degradation
> with respect to the former installation.  In particular I found
> disappointingly bad scaling for the application we use most, the MD program
> Gromacs 2.0.
>
> Now performance scales almost exactly as the square root of the number of nodes,
> that is, it takes four CPUs to double performance and nine CPUs to triple it.
> 
> Since no hardware has been changed, in my opinion it must be either the
> pre-compiled Scyld kernel, bpsh or Scyld MPICH. So I hope that some fine
> tuning of them will solve the problem.
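
To put the reported numbers in perspective, here is a small sketch of what
square-root scaling implies for parallel efficiency. The function names are
made up for illustration, and the model (speedup = sqrt(N)) is only the
approximation described in the quoted message, not a measurement.

```python
import math


def sqrt_speedup(n_cpus):
    """Speedup under the reported model: performance grows as sqrt(N)."""
    return math.sqrt(n_cpus)


def efficiency(n_cpus):
    """Parallel efficiency = speedup / number of CPUs (1.0 is ideal)."""
    return sqrt_speedup(n_cpus) / n_cpus


if __name__ == "__main__":
    # 4 and 9 are the cases quoted above; 10 is the full cluster.
    for n in (1, 4, 9, 10):
        print(f"{n:2d} CPUs: speedup {sqrt_speedup(n):.2f}x, "
              f"efficiency {efficiency(n):.0%}")
```

Under this model the full 10-CPU cluster would deliver only about 3.2 times
the single-CPU performance, i.e. roughly a third of ideal efficiency.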

There isn't an inherent problem with Scyld and scaling.
  (Obviously we wouldn't have released a product with a specific problem.)

Some things you should check first:
   Verify that you are not seeing network errors
     check /proc/net/dev for non-zero error counts
   Verify that you are using the SMP kernel
     CPU1 should show some activity with beostat.
   Verify that jobs are being placed on all nodes
     beostat again.
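
The first check above can be scripted. The following is a minimal sketch,
assuming the usual Linux /proc/net/dev layout: two header lines, then one line
per interface with eight receive and eight transmit counters. The column names
and the helper name parse_net_dev are assumptions for illustration; compare
them with the header lines on your own kernel before trusting the result.

```python
import os

# Assumed column order of /proc/net/dev (check the header on your kernel).
RX_NAMES = ["bytes", "packets", "errs", "drop", "fifo", "frame",
            "compressed", "multicast"]
TX_NAMES = ["bytes", "packets", "errs", "drop", "fifo", "colls",
            "carrier", "compressed"]
# Counters treated as "error counts" in this sketch.
WATCHED = {"errs", "drop", "fifo", "frame", "colls", "carrier"}


def parse_net_dev(text):
    """Return {interface: {counter: value}} for non-zero error counters."""
    result = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)     # also handles "eth0:1234"
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue                        # unexpected layout, skip it
        counters = {}
        for prefix, names, values in (("rx", RX_NAMES, fields[:8]),
                                      ("tx", TX_NAMES, fields[8:16])):
            for counter, value in zip(names, values):
                if counter in WATCHED and value != 0:
                    counters[f"{prefix}_{counter}"] = value
        if counters:
            result[name.strip()] = counters
    return result


if __name__ == "__main__" and os.path.exists("/proc/net/dev"):
    with open("/proc/net/dev") as f:
        problems = parse_net_dev(f.read())
    if not problems:
        print("no non-zero error counters")
    for iface, counters in sorted(problems.items()):
        print(iface, counters)
```

This only reads the counters on the machine where it runs, so the file has to
be checked on every node, not just the master. A few collisions on a
half-duplex link are normal; steadily growing errs, frame or carrier counts
are the ones to worry about.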

For reference, the Scyld releases up through "-8" use MPICH as the
base.  We modified the process initiation code to work with the Scyld system
(it's now much faster to start jobs), but not the run-time code,
e.g. the send/receive calls.

It's very easy to use LAM on Scyld; however, that's beyond the scope of
our commercial support.

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993