[Beowulf] building Infiniband 4x cluster questions
diep at xs4all.nl
Mon Nov 7 18:33:46 PST 2011
On Nov 8, 2011, at 2:46 AM, Gilad Shainer wrote:
>> I just test things and go for the fastest. But if we do theoretic
>> math, SHMEM
>> is difficult to beat of course.
>> Google for measurements with shmem, not many out there.
> SHMEM within the node or between nodes?
SHMEM is the programming library that cray had and that quadrics had.
So basically your program doesn't need silly message-catching mpi calls
everywhere. You only define at program start whether an array is
by elan4 and which nodes it gets updated to etc.
So no need to check for MPI overflows in the complex code for
starting / stopping cpu's.
You can easily reuse code there to start remote nodes and cpu's.
So while the majority of the latency goes to RDMA reads and/or
remote elan memory, the tough start/stop code for the cpu's, complicated yet
negligible in overhead, is a bit easier to program with the SHMEM library.
The caches on the quadrics cards have shmem, so you don't access the
RAM at all;
it's already in the cards. I didn't check whether those features got
added to mpi somehow.
So you just need to read the card; it's not going to go through pci-x
at all at the remote node.
Yet of course all this is not so relevant to explain here, as
quadrics is long gone,
and i'm just searching for a cheapo solution :)
So you lose only 2x the pci-x latency, versus 4x pci-e latency in
In case of an RDMA read i doubt the latency of DDR infiniband is faster
that 0.7 you mentioned, if it is microseconds, sounds like a bit
for pci-x. From the 1.3 us that the MPI-one-way pingpong is at QM500,
if we multiply it by
2 it's 2.6 us. From that 2.6 us, according to your math it's already
2.8 us cost to pci-x,
then , which has a cost of 2x pci-x, receiving elan has a cost of 130
ns, switch say 300 ns including cables
for a 128 port router, 100 ns from the sending elan. That's 530 ns,
and that times 2 is 1060 ns. There's
really little left for the pci-x, as 2.6 - 1.06 = 1.54 us left for 4
1.54 / 4 = 0.385 us for pci-x.
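
The budget above can be redone as a small sketch. It uses only the numbers quoted in this mail (1.3 us one-way MPI pingpong on the QM500, 130 ns receiving elan, 300 ns switch including cables, 100 ns sending elan, 4 pci-x crossings per round trip, and the claimed 0.7 us per pci-x crossing). The function and variable names are made up for illustration. With these inputs the subtraction 2.6 - 1.06 comes to 1.54 us, so roughly 0.385 us per pci-x crossing.

```python
def pcix_cost_per_crossing_us(one_way_us, recv_elan_ns, switch_ns,
                              send_elan_ns, crossings=4):
    """Time left per pci-x crossing after subtracting the fabric cost.

    All inputs are the figures quoted in the mail; nothing is measured here.
    """
    round_trip_us = 2 * one_way_us                       # 2 * 1.3 = 2.6 us
    fabric_one_way_ns = recv_elan_ns + switch_ns + send_elan_ns
    fabric_round_trip_us = 2 * fabric_one_way_ns / 1000.0
    left_for_pcix_us = round_trip_us - fabric_round_trip_us
    return left_for_pcix_us / crossings


def claimed_pcix_total_us(per_crossing_us=0.7, crossings=4):
    """Total pci-x cost if every crossing really cost the claimed 0.7 us."""
    return per_crossing_us * crossings


if __name__ == "__main__":
    per_crossing = pcix_cost_per_crossing_us(1.3, 130, 300, 100)
    print(f"left per pci-x crossing: {per_crossing:.3f} us")
    # The claimed figure alone would already exceed the 2.6 us round trip.
    print(f"claimed pci-x total: {claimed_pcix_total_us():.1f} us vs 2.6 us round trip")
```

The point of the comparison: four crossings at 0.7 us would cost more than the whole measured round trip, so the per-crossing cost has to be well under that.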
I used the Los Alamos National Laboratory example numbers here for
In the end it is about price, not user friendliness of programming :)
>> Fact that so few standardized/rewrote their floating point
>> software to gpu's,
>> is already saying enough about all the legacy codes in HPC world :)
>> When some years ago i had a working 2 cluster node here with
>> QM500- A , it
>> had at 32 bits , 33Mhz pci long sleeve slots a blocked read
>> latency of under 3
>> us is what i saw on my screen. Sure i had no switch in between it.
>> connection between the 2 elan4's.
>> I'm not sure what pci-x adds to it when clocked at 133Mhz, but it
>> won't be a
>> big diff with pci-e.
> There is a big difference between PCIX and PCIe. PCIe is half the
> latency - from 0.7 to 0.3 more or less.
Well i'm not so sure the difference is that huge. All those
measurements in the past were on old Xeon P4 machines,
and i've never really seen a good comparison there.
Furthermore fabrics like Dolphin at the time, with a 66 MHz, 64-bit
PCI card, already got like 1.36 us one-way pingpong latencies,
not exactly a lot slower than qlogic's DDR infiniband of a claimed
>> PCI-e probably only has a bigger bandwidth isn't it?
> Also bandwidth ...:-)
That's a non-discussion here. I need latency :)
If i really needed big bandwidth for transport i'd of course use a
boat - 90% of all cargo here
gets transported over the rivers and hand-dug canals; especially river
>> Beating such hardware 2nd hand is difficult. $30 on ebay and i can
>> install 4
>> rails or so.
>> Didn't find the cables yet though...
>> So i don't see how to outdo that with old infiniband cards which are
>> $130 and upwards for the connectx, say $150 soon, which would
>> allow only
>> single rail
>> or maybe at best 2 rails. So far didn't hear anyone yet who has
>> more than
>> single rail IB.
>> Is it possible to install 2 rails with IB?
> Yes, you can do dual rails
>> So if i use your number in pessimistic manner, which means that
>> there is
>> some overhead of pci-x, then the connectx type IB, can do 1
>> million blocked
>> reads per second theoretic with 2 rails. Which is $300 or so,
>> cables not
> Are you referring to RDMA reads?
As i use all cpu cores 100%, i simply cannot catch mpi messages, let
So anything that has the card's processor do the job of digging in the
RAM, rather than bugging
one of the very busy cores, is a very welcome form of communication.
99.9% of all communication to remote nodes is 32 byte RDMA writes and
128-256 byte reads.
I can set myself whether it's 128, 192 or 256.
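
To put rough numbers on why bandwidth is a non-issue for this kind of traffic, a small sketch. It takes the 1 million blocked reads per second mentioned earlier in this thread as an assumed rate (a theoretical figure, not a measurement) and multiplies it by the read sizes above. The names are made up for illustration, and headers and the 32 byte writes are ignored.

```python
def read_payload_mb_per_s(reads_per_second, read_bytes):
    """Payload moved by blocked reads, in MB/s (1 MB = 10**6 bytes)."""
    return reads_per_second * read_bytes / 1e6


if __name__ == "__main__":
    assumed_rate = 1_000_000  # assumption: the theoretical figure from the thread
    for size in (128, 192, 256):
        mb = read_payload_mb_per_s(assumed_rate, size)
        print(f"{size} byte reads: {mb:.0f} MB/s payload")
```

Even with 256 byte reads the payload stays at a few hundred MB/s, below the roughly 1 GB/s data rate of a single SDR 4x infiniband link, so the limit that matters is latency per read.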
Probably i'll make it 128. The number of reads is a few percent more
That other 0.1% is the very complex parallel algorithm that
basically parallelizes a sequential algorithm.
That algorithm is roughly 150 pages of A4 full of insights and
proofs of why it works correctly :)