[Beowulf] RDMA NICs and future beowulfs

Vincent Diepeveen diep at xs4all.nl
Tue Apr 26 03:31:13 PDT 2005


Obviously, programming for a Cell is less difficult than programming for a
network. It takes just a few people to write a few libraries that run fast
on Cell; the rest can then simply use those libraries for their matrix
calculations. It will be guys like me, without much floating-point
software, who will be trying to get branchy integer software to run on it :)

I'm sure it's easy to find a few people to build matrix libraries for Cell
processors.
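To make that concrete, here is a minimal sketch of what "just using the
library" would look like: one ordinary CBLAS call, with all the
Cell-specific tuning hidden behind it. That a Cell-tuned BLAS sits behind
the standard interface is of course the assumption here.

    /* Minimal sketch of "just use the library": a plain CBLAS call.
     * cblas.h is the standard C BLAS interface; a Cell-tuned
     * implementation behind it is assumed, not something that exists. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>

    int main(void)
    {
        const int n = 1024;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *B = malloc((size_t)n * n * sizeof *B);
        double *C = malloc((size_t)n * n * sizeof *C);
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C = 1.0*A*B + 0.0*C; all the Cell-specific work (SPEs, DMA,
         * double buffering) would live behind this one call */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        printf("C[0] = %f\n", C[0]);   /* expect n * 1.0 * 2.0 = 2048.0 */
        free(A); free(B); free(C);
        return 0;
    }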

I don't want to start a big war over whether the first cheap Cell released
gets 256 Gflop or more, or less; nor do I want to start a war over whether
IBM/Sony will release the fastest Cell processor.

My point is that if you jump from the 4.8 Gflop that today's Opteron
delivers, or the 5.2 Gflop that a 1.3 GHz Itanium 2 delivers, suddenly to
the order of 256 Gflop per processor, then it's obvious that AMD and Intel
are light-years behind in floating-point performance with their classical
processors, and that they will have to build their own Cell-style
processors to keep up.

Whether it's 350 Gflop, 500 Gflop, or 200 Gflop per processor is not
really relevant to the discussion here.

The interconnects used by Cell are just as cheap as those for a dual
Opteron; the memory is perhaps a tad expensive (Rambus XDR, if I read the
David Wang article correctly). He discusses whether it can have much RAM.
Well, for major matrix calculations you are basically streaming, so a lot
of RAM helps only if you calculate on just a few processors.
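To illustrate the streaming point: a tiled kernel only ever has one small
block "hot" at a time, so what it needs is fast local memory (the Cell
SPE local store is 256 KB) rather than lots of DRAM. A rough sketch in
plain C; the tile and matrix sizes are assumptions for illustration, not
Cell specs.

    /* Rough sketch of streaming through a big array in tiles.  At any
     * moment only one TILE x TILE block of doubles is "hot":
     * 128*128*8 = 128 KB, picked here to fit a 256 KB local store. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N    4096
    #define TILE 128

    int main(void)
    {
        double *a = malloc((size_t)N * N * sizeof *a);
        for (size_t i = 0; i < (size_t)N * N; i++)
            a[i] = 1.0;

        double sum = 0.0;
        for (size_t bi = 0; bi < N; bi += TILE)
            for (size_t bj = 0; bj < N; bj += TILE)
                /* on Cell this block would be DMA'd into an SPE local
                 * store while the previous one is processed (double
                 * buffering); the total DRAM size never matters */
                for (size_t i = bi; i < bi + TILE; i++)
                    for (size_t j = bj; j < bj + TILE; j++)
                        sum += a[i * N + j];

        printf("sum = %.0f\n", sum);   /* expect N*N = 16777216 */
        free(a);
        return 0;
    }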

The total number of Itanium 2 processors made last year was around
100,000. Perhaps they'll sell another 100,000 this year, or even reach
their break-even point of 200,000 processors sold.

You obviously cannot build your own cheap cluster from such processors.

Cell has guaranteed sales of roughly 70,000,000 processors, as a version
of it ends up in the Sony PlayStation 3.

The main point, therefore, is that one can obtain a Cell processor
machine for a reasonable price, be it dual or not.

Those cheap machines can easily be clustered into a Beowulf.

The majority of all supercomputer system time goes to matrix calculations
in physics applications; in fact, more than 50% goes just to physics
(statistics I derived from a European supercomputing report).

As this number is valid for Europe, it is probably valid worldwide.
Obviously, MORE than 50% of system time goes to floating-point matrix
calculations, as we haven't yet counted nuclear-explosion simulations,
which IMHO eat far more system time than most on this list might realize.
In everything that has to do with atoms and electrons, the vast majority
of the applications are just busy with embarrassingly parallel
floating-point calculations.

Those are dead easy to do on a Cell-type processor architecture.
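To show what "embarrassingly parallel" means here: each node computes its
own slice of the problem independently, and only one final reduction
touches the network. A minimal MPI skeleton; the per-element work below
is just a placeholder.

    /* Embarrassingly parallel skeleton in MPI: each rank works on its
     * own chunk; the only communication is one reduction at the end. */
    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long total = 100000000L;       /* total work items      */
        long chunk = total / size;
        long begin = (long)rank * chunk;
        long end   = (rank == size - 1) ? total : begin + chunk;

        double local = 0.0;
        for (long i = begin; i < end; i++)   /* independent flops     */
            local += sin((double)i) * sin((double)i);

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("result = %f\n", global);
        MPI_Finalize();
        return 0;
    }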

In addition, the processor is cheap.

Obviously, clever governments that currently run giant supercomputers
costing several million will conclude they can buy a few cheap Cell
processor machines that do more work than their entire current system.

To give an example: the Dutch government has a 2.2 Tflop machine with 416
Itanium 2 1.3 GHz processors. I won't dispute that it is 2.2 Tflop, just
as I won't dispute that a Cell processor delivers > 0.25 Tflop on its own.

Just 10 Cell processors are faster than the theoretical peak that this
416 processor machine delivers (10 x 256 Gflop = 2.56 Tflop versus 2.2
Tflop), for far under 1/1000 of the price that this machine cost.

The only question left, therefore, is HOW to build a Beowulf network for
the huge amount of data that one Cell node spits out.
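One way to get a handle on that question is to measure what a candidate
interconnect actually sustains on large messages. A classic MPI
ping-pong bandwidth test works for this; nothing below is Cell-specific,
and the message size and repetition count are arbitrary choices.

    /* Classic two-node ping-pong: ranks 0 and 1 bounce a large buffer
     * back and forth and report the sustained bandwidth. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps  = 100;
        const int bytes = 8 << 20;              /* 8 MB per message */
        int rank;
        char *buf = malloc(bytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)        /* two messages per rep cross the wire */
            printf("%.1f MB/s\n",
                   2.0 * reps * bytes / (t1 - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }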

At 08:26 PM 4/25/2005 -0400, Joe Landman wrote:
>
>
>Vincent Diepeveen wrote:
>> At 06:02 PM 4/25/2005 -0400, Mark Hahn wrote:
>> 
>>>>Would anyone on this list have pointers to
>>>>which network cards on market support 
>>>>RDMA (Remote Direct Memory Access)?
>>>
>>>ammasso seems to have real products.  afaict, you link with their 
>>>RDMA-enabled MPI library and get O(15) microsecond latencies.
>>>to me, it's hard to see why this would be worth writing home about...
>
>FWIW, we installed one of the first Ammasso based clusters.  Still 
>working on building some stuff for it.  Mostly for StarCD.
>
>I won't comment on positioning other than to say that they occupy an 
>interesting price performance niche.  Infiniband is dropping rapidly in 
>price, and is getting more attractive over time.  As recently as 6 months 
>ago, it added about $2k/node ($1kUS/HCA + $1kUS/port) to the cost of a 
>cluster.  More recently the cost per port appears to be quickly moving 
>towards $400 US, and the HCAs are dropping so that you now add only 
>$1kUS to the price per node for your cluster.
>
>If you select the right switch, which you need anyway for your command 
>and control net, you can get good port-port latencies.
>
>I think the next reasonable question to answer is fundamentally what is 
>the cost benefit analysis?  If you are performance bound and have an 
>infinite budget, you need to look at the highest performance fabrics. 
>Currently the Ammasso is not that.  If you need to optimize performance 
>versus cost constraints, and your code gets some boost from the lower 
>latencies vs ethernet, the question is whether or not the value of that 
>performance is enough to justify the added cost of the cards.
>
>
>>>>Would anyone have hands on experience 
>>>>with performance, usability, and cost aspects
>>>>of this new RDMA technology?
>>>
>>>they work, 
>
>Agreed.  I would like to see a LAM implementation in addition to the 
>MPICH.  The installation is actually not that hard, and I have a simple 
>perl script to auto-generate an rnic_cfg file from your host IP 
>according to some simple rules.
>
>> but it's very unclear where their natural niche is.
>
>Not sure I agree with this.  If there were no value in TCP offload, then 
>why would Intel announce (recently) that they want to include this 
>technology in their future chipsets?  Basically the argument that I make 
>here is that I think there is a natural place for them, but it is on the 
>motherboards.  Much in the same way you have a Graphics Processing 
>Offload Engine in desktop systems, though in the case of motherboards, 
>they found value in supplying the high performance interface rather than 
>the offload engines.
>
>I personally think that the offload engine concept is a very good one. 
>I like this model.  With PCI-e (and possibly HTX), I think it has some 
>very interesting possibilities.
>
>>>if you want high bandwidth, you don't want gigabit.
>
>Agreed.  Out of sheer curiosity, what codes are more bandwidth bound 
>than latency bound over the high performance fabrics?  Most of the codes 
>we play with are latency bound.
>
>I wrote a message passing example for my class with a humorous name to 
>illustrate passing vectors (or matrices).  I could easily turn this into 
>a bandwidth test by using huge vectors.  But I am not sure that most 
>folks are doing that in their codes.
>
>>>if you want low latency, you don't want gigabit,
>
>agreed ...
>
>>> even RDMA-gigabit.
>
>I think a CBA is worth doing (and we may do this).  If it gives a 5% 
>boost for a 2% increase in cluster cost, is that worth it?  If it gives 
>a 30% boost for a 2% increase in cost, is that worth it?  What 
>fundamentally is the right cutoff (rhetorical question, cutoff varies 
>based upon needs, funds, application,...)
>
>[...]
>
>> In reality the bandwidth/latency hunger gets even bigger in the future
>> when the cell type processors arrive. Correct me if I'm wrong, it really
>> needs a branch prediction table for my branch intensive integer code, but
>> even then such a processor is kicking butt. I mean 8 processing helper
>> units (SPEs) on 1 cpu plus a main PowerPC processor. 
>> 
>> For floating point that's like 250 Gflop or so practically at their disposal.
>
>The cell will not automagically give you 250 Gflop.  It will not be easy 
>to program.
>
>> 
>> That *really* will make the networks the weakest chain.
>
>I haven't looked at the design in detail, but it looks like you are 
>going to need a multistage resource scheduler to handle streaming data 
>into the cell.  Think of it as a super-multi-core NPU that has a more 
>general instruction set.
>
>The Itanium has been out for a while and compilers for it are still 
>maturing.  VLIW^H^H^H^HEPIC is hard.  Anyone remember the Multiflow Trace? 
>  I would not expect to see a gcc for the cell (and now watch IBM make 
>me eat my words).  I would expect that programming it is going to be a 
>challenge.
>
>[...]
>
>> So obviously the cell processor is kind of a step back for such software,
>> but even then a single 4.0 GHz cell is probably like an 8 processor
>> 2.8 GHz Xeon MP machine. 
>
>Again, this is going to be difficult to program for in all likelihood 
>(and if there are IBMers out there with this who know I am wrong, please 
>let me know, or even better, let me at it :) ).  Good compilers are 
>hard.  Very good compilers are rare.  Suboptimal compilers are the norm.
>
>
>-- 
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web  : http://www.scalableinformatics.com
>phone: +1 734 786 8423
>fax  : +1 734 786 8452
>cell : +1 734 612 4615
>
>
>


