[Beowulf] 1.2 us IB latency?

Wed Apr 25 08:52:57 PDT 2007

On Wed, 25 Apr 2007, Ashley Pittman wrote:

> I'm not sure I follow, surely using PIO over DMA is a lose-lose
> scenario?  As you say conventional wisdom is offload should win in this
> situation...
> 
> Mind you we do something completely different for larger messages which
> rules out the use of PIO entirely.

Sorry -- I meant that, if the goal is to spend the least amount of
time in MPI, there's no obvious answer for where to put the protocol
breaks or which send mechanism to use.

> I'm sure I've seen a benchmark like this before, something that measured
> the latency of messages and then sees how much "work" can be done before
> latency increases, in effect measuring the CPU overhead of a send.
> Quadrics tends to look good when these figures are presented as absolute
> numbers and bad when presented as % of latency by virtue of having lower
> latency to start with.  I was recently asked to improve the percentage
> figure and the best I could come up with was to put a sleep(1) on the
> critical path.  I'm not sure if it is or not but if it is the GASNet
> benchmark I'm thinking of could you change the way it reports results
> please?

Funny that you mention percentage of time, because the same argument
applies to using PIO sends for a "largish" message.  In relative
terms of CPU availability it doesn't look that good, but in terms of
absolute time spent in the send it's not all that bad.
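
To make the arithmetic concrete, here is a minimal sketch in Python.
The latency and overhead values are made up for illustration (they are
not measurements of any interconnect), and overhead_report is a
hypothetical helper, not part of any benchmark.

```python
def overhead_report(latency_us, overhead_us):
    """Return (absolute CPU overhead in us, overhead as a % of latency)."""
    return overhead_us, 100.0 * overhead_us / latency_us

# Hypothetical values, chosen only to show the effect.
low_latency = overhead_report(latency_us=2.0, overhead_us=1.0)
high_latency = overhead_report(latency_us=10.0, overhead_us=2.0)
# Padding the critical path by one second inflates latency, not overhead.
padded = overhead_report(latency_us=2.0 + 1_000_000.0, overhead_us=1.0)

for name, (abs_us, pct) in [("low", low_latency),
                            ("high", high_latency),
                            ("padded", padded)]:
    print(f"{name:>6}: {abs_us:.1f} us overhead, {pct:.4f}% of latency")
```

The low-latency case spends half as much absolute time in the send
but looks worse as a percentage, and the padded case looks best of all.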

GASNet should compile out-of-the-box -- in fact, I think Dan has it
to the point where you can just do './configure && gmake run-tests'
and it will compile, link and submit tests with prun (and run
testqueue).  None of the GASNet tests have relative figures in them,
so you must be thinking of another test or suite of tests.

If absolute numbers in microbenchmarks require scrutiny of the
methodology used, I think that's even more true of relative numbers.
If relative metrics are shown, absolute metrics *must* be shown too,
since relative numbers can effectively hide elapsed time and the true
cost of an operation.  For example: "this optimization caused a 30%
speedup on NAS CG/FT/EP on 256 processors" (while the graphs/results
fail to mention that the smallest problem size, class A, is being
used).  You'd think this is obvious, but you can find these types of
omissions in published material.
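
One way to avoid that omission is to never print a relative figure
without the elapsed times behind it.  A rough sketch in Python, where
speedup_report is a hypothetical helper, the run times are invented,
and "30%" is taken to mean 30% less elapsed time:

```python
def speedup_report(baseline_s, optimized_s):
    """Format a result with absolute times next to the relative gain."""
    saved_s = baseline_s - optimized_s
    pct = 100.0 * saved_s / baseline_s
    return (f"{baseline_s:g} s -> {optimized_s:g} s "
            f"(saved {saved_s:g} s, {pct:.0f}% less time)")

# Invented run times: the same 30% on a tiny and on a large problem.
small = speedup_report(baseline_s=0.5, optimized_s=0.35)
large = speedup_report(baseline_s=500.0, optimized_s=350.0)
print(small)
print(large)
```

Both lines report the same percentage, but only the absolute times
show that one run saved a fraction of a second and the other saved
minutes.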

    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)