[Warewulf] Re: [Beowulf] hpl size problems

Mark Hahn hahn at physics.mcmaster.ca
Mon Sep 26 12:52:31 PDT 2005


> Check me please if this is correct, as I am not familiar with HPL: The
> HPL benchmark depends on all the nodes progressing in lockstep, and if
> any one node takes longer, then all the others must wait until the
> slow node catches up, right?  (That's called a barrier.)  And those
> barriers occur frequently, at relatively short time intervals, right?

I'm speaking through my hat, but believe that HPL's synchronization
is fairly coarse.  I'd love to see an MPI profile of it though!

> causes disastrous performance.  On the 8192 processor ASCI Q, they saw
> a FACTOR OF TWO performance loss due to those effects...

I seem to recall they mentioned (perhaps at HP-CAST) an even higher lossage
on a nuke code from the black side of the lab.  I'm skeptical whether this
would explain the current example - wasn't the cluster relatively small
(128?).  If I remember the paper, the slowdown is exponential in the
number of nodes but linear in the probability of jitter...
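
To put rough numbers on the scaling question, here is a back-of-envelope
sketch.  It is not the model from the ASCI Q paper: it simply assumes each
node is independently delayed with probability p_node in one synchronization
interval, and that a single delayed node stalls everyone by a fixed delay.
The function names and the value 1e-4 are made up for illustration.

```python
def p_any_delayed(n_nodes: int, p_node: float) -> float:
    """Probability that at least one of n_nodes is delayed in one interval.

    Assumes delays on different nodes are independent (an assumption,
    not something taken from the paper).
    """
    return 1.0 - (1.0 - p_node) ** n_nodes


def expected_slowdown(n_nodes: int, p_node: float,
                      interval: float, delay: float) -> float:
    """Expected time per interval relative to the noise-free case.

    Assumes any delayed node holds up the whole job by `delay`
    (same time units as `interval`).
    """
    stalled = p_any_delayed(n_nodes, p_node)
    return (interval + stalled * delay) / interval


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-node, per-interval delay probability
    for n in (128, 1024, 8192):
        print(n, p_any_delayed(n, p), expected_slowdown(n, p, 1.0, 1.0))
```

Under these assumptions the chance of a stall is close to n_nodes * p_node
while that product is small, and saturates toward 1 as the node count grows,
which is one way to see why a 128-node cluster and an 8192-processor machine
could behave very differently.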



