[Beowulf] Re: Beowulf Digest, Vol 37, Issue 58
Hakon.Bugge at scali.com
Mon Mar 26 14:34:09 PDT 2007
Hi again Christian,
At 16:59 26.03.2007, Christian Bell wrote:
>I'm unsure if I would call a submission
>significant when it compares results between
>configurations not compared at scale (by
>appearance large versus small switch, with a much
>heavier shared-memory component at small process
>counts). For example, in your submitted
>configurations, the interconnect communication
>(inter-node) is never involved more than shared
>memory (intra-node) and when the interconnect
>does become dominant at 32 procs, that's when InfiniPath is faster.
Not sure how you count this. In my "world", all
processes communicate with more remote processes
than local ones in all cases except for the
single-node runs. I.e., in a two-node case with 2
or 4 processes per node, a process has 1 or 3
other local processes and 2 or 4 remote
processes. Excluding the single-node cases, we
have six runs (2x2, 4x2, 8x2, 2x4, 4x4, 8x4),
and RDMA is faster than message passing in 5 of them.
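The local-vs-remote peer counts above follow from simple arithmetic. A hypothetical sketch (the function name and code are mine, not from the post), assuming the NxP run labels mean N nodes with P processes per node:

```python
# For N nodes with P processes per node, each process has
# P - 1 local (shared-memory) peers and (N - 1) * P remote
# (interconnect) peers.

def peer_counts(nodes, procs_per_node):
    """Return (local_peers, remote_peers) as seen by one process."""
    local = procs_per_node - 1
    remote = (nodes - 1) * procs_per_node
    return local, remote

# The six multi-node runs cited: nodes x processes-per-node.
for nodes, ppn in [(2, 2), (4, 2), (8, 2), (2, 4), (4, 4), (8, 4)]:
    local, remote = peer_counts(nodes, ppn)
    print(f"{nodes}x{ppn}: {local} local, {remote} remote peers")
```

For the two-node cases this gives 1 local/2 remote peers (2 ppn) and 3 local/4 remote peers (4 ppn), matching the counts stated above; remote peers dominate in every multi-node run.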
As to the 32-core case, I run equally fast as
InfiniPath on this one, but with a product that
is not yet released. Hence I haven't published it.
And based on this I did not call it significant
findings, but merely an indication that RDMA is
faster (up to 16 cores) than, or as fast as,
message passing for _this_ application and dataset.
>On the flip side, you're right that these
>results show the importance of an MPI
>implementation (at least for shared memory),
>which also means your product is well positioned
>for the next generation of node configurations
>in this regard. However, because of the node
>configurations and because this is really one
>benchmark, I can't take these results as
>indicative of general interconnect
>performance. Oh, and because you're forcing me
>to compare results on this table, I now see what
>Patrick at Myricom was saying -- the largest
>config you show that stresses the interconnect
>(8x2x2) takes 596s walltime on a similar
>Mellanox DDR and 452s walltime on InfiniPath SDR
>(yes, the pipe is "100%" smaller but the performance is 25% better).
Just to avoid any confusion, the 596s number is
_not_ with Scali MPI Connect (SMC), but with a
competing MPI implementation. SMC achieves 551s
using SDR. I must admit your InfiniPath number is
new to me, as topcrunch reports 482s for this configuration with InfiniPath.
>We have performance engineers who gather this
>type of data and who've seen these trends on
>other benchmarks, and they'll be happy to
>correct any misconceptions, I'm certain.
>Now I feel like I'm sticking my tongue out like
>a shameless vendor and yet my original
>discussion is not really about beating the
>InfiniPath drum, which your reply insinuates.
Well, my intent was to draw the wulfers' attention
to some published facts containing
apples-to-apples comparisons, in an interesting
discussion of RDMA vs. message passing. Given the
significant (yes, I mean it) difference in
latency and message rates, I was indeed
surprised. My question still is: if there existed
an RDMA API with characteristics similar to those
of the best message-passing APIs, how would a good MPI implementation perform?