[Warewulf] Re: [Beowulf] hpl size problems

Ashley Pittman ashley at quadrics.com
Tue Sep 27 03:58:40 PDT 2005


On Mon, 2005-09-26 at 15:43 -0400, Andrew Piskorski wrote:

> It's amusing that Mark Hahn is already participating in this thread,
> because his post to the Beowulf list gave a link explaining a detailed
> real-world example of that effect very nicely:
> 
>   http://www.beowulf.org/archive/2005-July/013215.html
>   http://www.sc-conference.org/sc2003/paperpdfs/pap301.pdf
> 
> Basically, daemons cause interrupts which are not synchronized across
> nodes, which causes lots of variation in barrier latency across the
> nodes - AKA, jitter.  And with barrier-heavy code, lots of jitter
> causes disastrous performance.  On the 8192 processor ASCI Q, they saw
> a FACTOR OF TWO performance loss due to those effects...

Well remembered; that is indeed a very good paper, and one that everybody
should read.

If I remember correctly, though, the effect didn't really kick in until
~700 nodes, and it is only reproducible when the barrier time is
significantly shorter than the scheduler's tick interval.  I doubt it's
relevant in this case.
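
For anyone who wants to see the effect directly, a micro-benchmark of
roughly the following shape will show it up.  This is just a sketch of my
own, not the code from the paper, and the iteration count and output
format are arbitrary: each rank times a tight loop of MPI_Barrier calls,
and the spread between the minimum and maximum per-call times is the
jitter in question.  It only dominates when the barrier itself completes
in much less than a scheduler tick (10ms at HZ=100, 1ms at HZ=1000).

/* barrier_jitter.c - rough sketch, not the benchmark from the paper.
 * Each rank times repeated MPI_Barrier calls; the min/mean/max spread
 * is the "jitter" that matters once the barrier completes much faster
 * than a scheduler tick. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    int rank, i;
    double t, min = 1e9, max = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* warm the collective up before timing it */
    for (i = 0; i < 100; i++)
        MPI_Barrier(MPI_COMM_WORLD);

    for (i = 0; i < ITERS; i++) {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime() - t0;
        if (t < min) min = t;
        if (t > max) max = t;
        sum += t;
    }

    if (rank == 0)
        printf("barrier: min %.1f us  mean %.1f us  max %.1f us\n",
               min * 1e6, (sum / ITERS) * 1e6, max * 1e6);

    MPI_Finalize();
    return 0;
}

Build it with mpicc and launch it across however many nodes you have; on
a quiet, well-tuned cluster the maximum should stay within a small factor
of the mean.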

There is a wonderful tool written by LANL specifically for measuring
this kind of background "jitter" on nodes.  It's called 'whatelse' and is
a perl script that samples node state before and after <something> and
reports on the difference; <something> can be either an application run
or a sample time.  It lets you see precisely how many CPU cycles are
free for the application to use.
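
The basic idea is simple enough to sketch if you can't get hold of the
script.  What follows is a rough stand-in of my own, not the LANL tool
itself, and it assumes the standard Linux /proc/stat and /proc/vmstat
fields: read the cpu jiffy counters and the pgfault counter, sleep for
the sample interval, read them again, and report the deltas.

/* jitter_sample.c - rough stand-in for the idea behind 'whatelse',
 * not the LANL script.  Samples /proc/stat and /proc/vmstat before
 * and after a sleep and reports idle CPU time and page faults. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct cpu { unsigned long long user, nice, sys, idle; };

static void read_cpu(struct cpu *c)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f || fscanf(f, "cpu %llu %llu %llu %llu",
                     &c->user, &c->nice, &c->sys, &c->idle) != 4) {
        perror("/proc/stat");
        exit(1);
    }
    fclose(f);
}

static unsigned long long read_pgfaults(void)
{
    char key[64];
    unsigned long long val, faults = 0;
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("/proc/vmstat"); exit(1); }
    while (fscanf(f, "%63s %llu", key, &val) == 2)
        if (strcmp(key, "pgfault") == 0)
            faults = val;
    fclose(f);
    return faults;
}

int main(int argc, char **argv)
{
    int secs = (argc > 1) ? atoi(argv[1]) : 60;
    struct cpu a, b;
    unsigned long long pf0, pf1, busy, total;

    read_cpu(&a);
    pf0 = read_pgfaults();
    sleep(secs);
    read_cpu(&b);
    pf1 = read_pgfaults();

    busy  = (b.user - a.user) + (b.nice - a.nice) + (b.sys - a.sys);
    total = busy + (b.idle - a.idle);

    printf("idle %.3f%%  busy jiffies %llu  page faults %llu\n",
           100.0 * (double)(b.idle - a.idle) / (double)total,
           busy, pf1 - pf0);
    return 0;
}

Run it for a minute on a compute node that is supposed to be idle;
anything much below 100% idle, or a large page fault count, is background
activity stealing cycles from the application.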

Running whatelse on one of the (not particularly tuned) systems here, I
see 99.983% IDLE CPU time over a minute, with two processes using
JIFFIES and four page faults.  My desktop did worse, with 70% idle
whilst I was writing this mail.

Ashley,


