[Beowulf] Re: Beowulf Digest, Vol 37, Issue 58

Thu Apr 12 02:37:26 PDT 2007

Hi Christian,

Sorry for this very delayed answer.

At 03:16 27.03.2007, Christian Bell wrote:
>I can't type, 482 was indeed a typo.  But still, I wouldn't look at
>the absolute numbers "as is" since the single-node base case has
>different performance.  Since 1x2x1 is our only common base case and
>since Scali is faster at 4212 versus 4863, the IB interconnect you're
>testing should be achieving 416s instead of 550s to produce strong
>scaling similar in line with the 8x2x2 InfiniPath time to solution
>(at 482s).

Well, you do know Amdahl vs. Gustafson, right? The dataset is fixed,
and the elapsed time includes initialization, writing of animation
files, and more. Hence, slower per-node performance would _scale_
better.
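
To make the scaling argument concrete, here is a minimal sketch. It
assumes the elapsed time splits into a part that does not parallelize
(initialization, file output) and a compute part that divides evenly
across cores. The function names and the timings are hypothetical and
are not taken from the benchmark runs.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup for a fixed problem size (Amdahl)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction, cores):
    """Scaled speedup when the problem grows with the core count (Gustafson)."""
    return serial_fraction + (1.0 - serial_fraction) * cores


def fixed_size_speedup(fixed_s, compute_s, cores):
    """Speedup when fixed_s seconds do not parallelize and compute_s do."""
    return (fixed_s + compute_s) / (fixed_s + compute_s / cores)


if __name__ == "__main__":
    # Hypothetical timings: same fixed cost, two different per-node speeds.
    fast = fixed_size_speedup(fixed_s=100.0, compute_s=4000.0, cores=16)
    slow = fixed_size_speedup(fixed_s=100.0, compute_s=5000.0, cores=16)
    print(f"faster nodes: {fast:.2f}x, slower nodes: {slow:.2f}x")
```

With the fixed cost held constant, the configuration with the larger
compute time shows the higher speedup, even though its time to
solution is longer.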

For this application field, crashworthiness testing, most users keep
the number of cores constant throughout the duration of a project
(12-18 months). This is due to numerical stability and the
verification thereof. Hence, the interesting point is not how far and
fast you could run, but the cost of a system capable of running the
application instances at 60-80% parallel efficiency.
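
For reference, parallel efficiency relative to a base run can be
computed as in the sketch below. It takes the elapsed times quoted
above at face value and assumes that 1x2x1 means 2 cores and 8x2x2
means 32 cores. That reading of the notation, and the pairing of the
times into two series, are assumptions.

```python
def parallel_efficiency(t_base, t_scaled, cores_base, cores_scaled):
    """Measured speedup divided by the ideal speedup for the added cores."""
    speedup = t_base / t_scaled
    ideal = cores_scaled / cores_base
    return speedup / ideal


if __name__ == "__main__":
    # Assumed pairing of the quoted times (seconds): base run, 32-core run.
    runs = {
        "4863 -> 482": (4863.0, 482.0),
        "4212 -> 550": (4212.0, 550.0),
    }
    for label, (t_base, t_scaled) in runs.items():
        eff = parallel_efficiency(t_base, t_scaled, cores_base=2, cores_scaled=32)
        print(f"{label}: {eff:.1%}")
```

Under those assumptions the two series come out at roughly 63% and
48% efficiency relative to their own 2-core base cases.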

As to the RDMA vs. MP based interconnect semantics, the problem I am
facing is that the RDMA interconnect I am using is more or less
collapsing at 32 cores. Using alltoall with a 1k packet size, it
actually performs worse than GbE. Sigh! (And please, do not turn this
into vendor harassment, as I am pretty sure this has to do with
implementation and not architecture.) So, what I have shown is that an
RDMA interconnect performs faster than a message passing interconnect
which has roughly 3x lower latency and 20x (?) higher message rate, up
to a scaling point where the RDMA _implementation_ collapses. And this
_despite_ the fact that the RDMA based MPI has to perform the MPI
message matching.

>With equal metrics/performance and phrased in this manner, it seems
>that RDMA still has to implement the semantics that message-passing
>already provides, which suggests in this case that the RDMA interface
>is at a loss.  Maybe I'm missing something to your question...

I doubt you're missing anything ;-) But let me stress that as the
number of cores per node scales, an HCA with message passing semantics
and message matching in the HCA will have a constant message matching
rate. For an RDMA based MPI which uses the cores for message matching,
the message matching rate would be almost proportional to the number
of cores...
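
A toy model of that argument, as a sketch only: the rates and the
per-core scaling factor below are hypothetical placeholders, not
measurements of any HCA or MPI implementation.

```python
def hca_matching_rate(cores, hca_rate):
    """Matching done in the HCA: the aggregate rate does not grow with cores."""
    return hca_rate


def host_matching_rate(cores, per_core_rate, scaling=1.0):
    """Matching done on the host cores: aggregate rate grows with the core count.

    scaling < 1.0 models the "almost" in "almost proportional".
    """
    return cores * per_core_rate * scaling


def crossover_cores(hca_rate, per_core_rate, scaling=1.0):
    """Smallest core count at which host matching reaches the HCA's rate."""
    cores = 1
    while host_matching_rate(cores, per_core_rate, scaling) < hca_rate:
        cores += 1
    return cores


if __name__ == "__main__":
    # Hypothetical rates in messages per second.
    hca = 10.0e6
    per_core = 2.0e6
    for cores in (1, 2, 4, 8, 16):
        print(cores, hca_matching_rate(cores, hca),
              host_matching_rate(cores, per_core, scaling=0.9))
    print("crossover at", crossover_cores(hca, per_core, scaling=0.9), "cores")
```

In this model the host-side rate overtakes any fixed HCA rate once the
node has enough cores; where that happens depends entirely on the two
rates, which are assumptions here.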

Håkon

--
Håkon Bugge
CTO
dir. +47 22 62 89 72
mob. +47 92 48 45 14
fax. +47 22 62 89 51
Hakon.Bugge at scali.com
Skype: hakon_bugge

Scali - http://www.scali.com
Scaling the Linux Datacenter