[Beowulf] MPI2007 out - strange pop2 results?
kevin.ball at qlogic.com
Fri Jul 20 11:27:23 PDT 2007
Thank you for the personal attack that came, apparently without even
reading the email I sent. Brian asked why the publicly available,
independently run MPI2007 results from HP were worse on a particular
benchmark than the Cambridge cluster MPI2007 results. I talked about three
contributing factors to that. If you have other reasons you want to put
forward, please do so based on data, rather than engaging in a blatant
ad hominem attack.
If you want to engage in a marketing war, there are venues in which
to do it, but I think on the Beowulf mailing list, data and coherent
thought are probably more appropriate.
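
As a data-based check, the scaling figures quoted further down in this thread can be reproduced with a short script. This is only a minimal sketch: the scores are copied from the quoted messages below, while the function and variable names are illustrative assumptions, and "efficiency" here simply means speedup relative to the smallest listed core count divided by the increase in core count.

```python
# 121.pop2 scores copied from the quoted messages in this thread.
# The dictionary and function names are illustrative, not from any tool.
hp = {16: 1.94, 32: 2.36, 64: 2.02, 128: 2.14, 256: 3.62}
cambridge = {32: 4.29, 64: 7.37, 128: 11.5, 256: 15.4}


def scaling_factor(first_score, last_score):
    """Ratio of the last score to the first, e.g. 15.0 / 3.71."""
    return last_score / first_score


def relative_efficiency(scores):
    """Speedup relative to the smallest core count, divided by the
    increase in core count (1.0 would be perfect linear scaling)."""
    base_cores = min(scores)
    base_score = scores[base_cores]
    return {
        cores: (score / base_score) / (cores / base_cores)
        for cores, score in sorted(scores.items())
    }


if __name__ == "__main__":
    # The two end-to-end scaling factors quoted for Emerald and Cambridge.
    print(f"Emerald   3.71 -> 15.0: {scaling_factor(3.71, 15.0):.2f}X")
    print(f"Cambridge 4.29 -> 21.0: {scaling_factor(4.29, 21.0):.2f}X")
    # Per-core-count efficiency for the HP and Cambridge submissions.
    for name, scores in (("HP", hp), ("Cambridge", cambridge)):
        for cores, eff in relative_efficiency(scores).items():
            print(f"{name:9s} {cores:4d} cores: efficiency {eff:.2f}")
```

Note that the two systems start from different core counts (16 for HP, 32 for Cambridge), so the efficiency columns are each relative to their own baseline rather than directly comparable point by point.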
On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:
> Dear Kevin,
> You continue to set world records in providing misleading information.
> You had previously compared Mellanox based products on dual single-core
> machines to the "InfiniPath" adapter on dual dual-core machines and
> claimed that with InfiniPath there are more Gflops.... This latest release
> follows the same lines...
> Unlike QLogic InfiniPath adapters, Mellanox provides different InfiniBand
> HCA silicon and adapters. There are 4 different silicon chips, each with
> different size, different power, different price and different
> performance. There is the PCI-X device (InfiniHost), the single-port
> device that was designed for best price/performance (InfiniHost III Lx),
> the dual-port device that was designed for best performance (InfiniHost
> III Ex) and the new ConnectX device that was designed to extend the
> performance capabilities of the dual-port device. Each device provides
> different price and performance points (did I say different?).
> The SPEC results that you are using for Mellanox are of the single-port
> device. And even that device (whose list price is probably half of
> your InfiniPath) had better results with 8 server nodes than yours....
> Your comparison of InfiniPath to the Mellanox single-port device should
> have been on price/performance and not on performance. Now, if you want
> to really compare performance to performance, why don't you use the
> dual-port device, or even better, ConnectX? Well... I will do it for you.
> Every time I have compared my performance adapters to yours, your
> adapters did not even come close...
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
> On Behalf Of Kevin Ball
> Sent: Thursday, July 19, 2007 11:52 AM
> To: Brian Dobbins
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?
> Hi Brian,
> The benchmark 121.pop2 is based on a code that was already important
> to QLogic customers before the SPEC MPI2007 suite was released (POP,
> Parallel Ocean Program), and we have done a fair amount of analysis
> trying to understand its performance characteristics. There are three
> things that stand out in performance analysis on pop2.
> The first point is that it is a very demanding code on the compiler.
> There has been a fair amount of work on pop2 by the PathScale compiler
> team, and the fact that the Cambridge submission used the PathScale
> compiler while the HP submission used the Intel compiler accounts for
> some (the serial portion) of the advantage at small core counts, though
> scalability should not be affected by this.
> The second point is that pop2 is fairly demanding of IO. Another
> example to look at for this is in comparing the AMD Emerald Cluster
> results to the Cambridge results; the Emerald cluster is using NFS over
> GigE from a single server/disk, while Cambridge has a much more
> optimized IO subsystem. While on some results Emerald scales better,
> for pop2 it scales only from 3.71 to 15.0 (4.04X) while Cambridge scales
> from 4.29 to 21.0 (4.90X). The HP system appears to be using NFS over
> DDR IB from a single server with a RAID; thus it should fall somewhere
> between Emerald and Cambridge in this regard.
> The first two points account for some of the difference, but by no
> means all. The final one is probably the most crucial. The code pop2
> uses a communication pattern consisting of many small/medium sized
> (between 512 bytes and 4 KB) point-to-point messages punctuated by
> periodic tiny (8-byte) allreduces. The QLogic InfiniPath architecture
> performs far better in this regime than the Mellanox InfiniHost
> architecture. This is consistent with what we have seen in other application
> benchmarking; even SDR InfiniBand based on the QLogic InfiniPath
> architecture performs in general as well as DDR InfiniBand based on the
> Mellanox InfiniHost architecture, and in some cases better.
> Full disclosure: I work for QLogic on the InfiniPath product line.
> On Wed, 2007-07-18 at 18:50, Brian Dobbins wrote:
> > Hi guys,
> > Greg, thanks for the link! It will no doubt take me a little while
> > to parse all the MPI2007 info (even though there are only a few
> > submitted results at the moment!), but one of the first things I
> > noticed was that performance of pop2 on the HP blade system was beyond
> > atrocious... any thoughts on why this is the case? I can't see any
> > logical reason for the scaling they have, which (being the first thing
> > I noticed) makes me somewhat hesitant to put much stock into the
> > results at the moment. Perhaps this system is just a statistical blip
> > on the radar which will fade into noise when additional results are
> > posted, but until that time, it'd be nice to know why the results are
> > the way they are.
> > To spell it out a bit, the reference platform is at 1 (ok, 0.994) on
> > 16 cores, but then the HP blade system at 16 cores is at 1.94. Not
> > bad there. However, moving up we have:
> > 32 cores - 2.36
> > 64 cores - 2.02
> > 128 cores - 2.14
> > 256 cores - 3.62
> > So not only does it hover at 2.x for a while, but then going from
> > 128 -> 256 it gets a decent relative improvement. Weird.
> > On the other hand, the Cambridge system (with the same processors
> > and a roughly similar interconnect, it seems) has the following scaling
> > from 32->256 cores:
> > 32 cores - 4.29
> > 64 cores - 7.37
> > 128 cores - 11.5
> > 256 cores - 15.4
> > ... So, I'm mildly confused as to the first results. Granted,
> > different compilers are being used, and presumably there are other
> > differences, too, but I can't see how -any- of them could result in
> > the scores the HP system got. Any thoughts? Anyone from HP (or
> > QLogic) care to comment? I'm not terribly knowledgeable about the MPI
> > 2007 suite yet, unfortunately, so maybe I'm just overlooking
> > something.
> > Cheers,
> > - Brian
> > ______________________________________________________________________
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org To change your subscription
> > (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf