[Beowulf] building Infiniband 4x cluster questions
diep at xs4all.nl
Mon Nov 7 10:51:28 PST 2011
Algorithmically, game tree search is basically a century ahead of
many other sciences,
as the brilliant minds have been busy with it. For the brilliant guys
it was possible to make CASH with it.
In math there are still many challenges where you can design 1 kick butt
algorithm, but you won't get rich with it.
As a result, from Alan Turing up to the latest Einstein, they all have
put their focus upon game tree search.
I'm in fact moving towards robotics now, building a robot. Not the
robot itself, as I suck at building robots,
but the software part, which so far hasn't really been developed very well
for robots. Still an unexplored area, for
civil use that is.
But as for the chess programs now, they combine a bunch of algorithms
and every single one of them
profits big time (exponentially) from caching. Access to that cache is of
course random. So the cluster we're looking at
has a number of nodes you can probably count on one hand, yet I intend
to put 4 network cards (4 rails for
insiders here) into each single machine. Machine is a big word; it
will be a stand-alone mainboard of course
to save costs.
So the price of each network card is fairly important.
As it seems now, the old Quadrics QM500-B network cards that you can
pick up for $30 each or so on eBay
are the most promising.
At home I have a fully working QM400 setup which has 2.1 us latency for
one-way ping-pong. So I'd guess a blocked
read has a latency not much above that.
I can choose myself whether I want to do reads of 128 bytes or 256
bytes. No big deal in fact. The data is scattered
through the RAM, so each read is a random read from the RAM.
With 4 nodes that would of course mean 25% odds it's a local RAM read
(so no network read at all),
and 75% odds it's somewhere in the gigabytes of RAM of a remote node.
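To put rough numbers on that split, here is a small Python sketch. It assumes probes are spread uniformly over the nodes, reuses the 2.1 us QM400 figure from above as the remote read latency, and plugs in 0.1 us for a local random RAM read. That last value is purely an illustrative assumption, not a measurement.

```python
def local_probability(nodes: int) -> float:
    """Chance that a uniformly random probe lands in this node's own RAM."""
    return 1.0 / nodes


def expected_probe_latency_us(nodes: int, local_us: float, remote_us: float) -> float:
    """Average latency of one probe, weighting local and remote reads."""
    p_local = local_probability(nodes)
    return p_local * local_us + (1.0 - p_local) * remote_us


if __name__ == "__main__":
    # 2.1 us is the QM400 one-way ping-pong figure quoted above;
    # 0.1 us is an assumed local RAM latency, used only for illustration.
    print(local_probability(4))
    print(expected_probe_latency_us(4, local_us=0.1, remote_us=2.1))
```

With more nodes the local share shrinks, so the average probe cost moves closer and closer to the pure network latency.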
As it seems now, 4x InfiniBand has a blocked read latency that's too
slow, and I don't know for which sockets 4x
works, as in all the test reports I read, 4x InfiniBand was only run on old
socket 604. So I'm not sure it works for socket 1366,
let alone socket 1155; those have a different memory architecture, so
it's never sure whether a much older network card
that works with DMA will work on them.
Also I hear nothing about putting several cards in 1 machine. I want
at least 4 rails of course from those old crap cards.
You'll argue that for 4x InfiniBand this is not very cost effective,
as the price of 4 cards and 4 cables is already going to
be nearly 400 dollars.
That's also what I noticed. But if I put 2x QM500-B in, for example,
a P6T Professional, that's going to be cheaper than $200 including
the cables, and it will be able to deliver, I'd guess, over a
million blocked reads per second.
By already doing 8 probes, which is 192-256 bytes currently, I have already
'bandwidth optimized' the algorithm. Back in the days
when Leiserson at MIT ran Cilkchess and other engines on the
Origin 3800 there and on some Sun supercomputers, they requested
in a slow manner a single probe of, what will it have been, a byte or
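As a rough illustration of what "8 probes in one read" can look like, here is a hedged Python sketch. The entry layout (8-byte key followed by payload), the 32-byte entry size (256 bytes / 8) and the names `locate` and `probe_bucket` are all hypothetical choices for this example, not the layout of any actual engine.

```python
PROBES = 8  # entries fetched together in one blocked read


def locate(key: int, nodes: int, buckets_per_node: int) -> tuple[int, int]:
    """Map a 64-bit hash key to (owning node, bucket index on that node)."""
    return key % nodes, (key // nodes) % buckets_per_node


def probe_bucket(bucket: bytes, key: int, entry_bytes: int = 32):
    """Scan all entries of one fetched bucket for a matching key.

    Hypothetical layout: the first 8 bytes of each entry hold the key
    (little endian), the remaining bytes hold the payload.
    """
    for slot in range(PROBES):
        entry = bucket[slot * entry_bytes:(slot + 1) * entry_bytes]
        if len(entry) == entry_bytes and int.from_bytes(entry[:8], "little") == key:
            return entry[8:]
    return None
```

The point of the design is that one network round trip brings back all 8 candidate entries, so the latency is paid once per lookup rather than once per probe.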
So far it seems that four 4x InfiniBand cards can deliver me only 400k
blocked reads a second, which is in fact a workable number
(the amount I need depends largely upon how fast the node is)
for a single-socket machine.
Yet I'm not aware whether InfiniBand allows multiple rails. Does it?
The QM400 cards I have here can, I'd guess, deliver around 1.2 million
blocked reads a second with 4 rails, which already
allows a lot faster nodes.
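The read rates quoted here can be turned back into a per-read latency with some simple arithmetic. The Python sketch below assumes a deliberately naive model: each rail serves exactly one outstanding blocked read at a time and the rails scale linearly. Real cards may pipeline requests or share a bus, so treat the results only as a sanity check.

```python
def reads_per_second(latency_us: float, rails: int) -> float:
    """Blocked reads per second if every rail does one read at a time."""
    return rails * 1_000_000 / latency_us


def implied_latency_us(total_reads_per_second: float, rails: int) -> float:
    """Per-read latency implied by an aggregate read rate over all rails."""
    return rails * 1_000_000 / total_reads_per_second


if __name__ == "__main__":
    # 400k reads/s over four 4x infiniband cards (figure from the text)
    print(implied_latency_us(400_000, rails=4))
    # 1.2 million reads/s over four QM400 rails (the guess from the text)
    print(implied_latency_us(1_200_000, rails=4))
    # Upper bound if a blocked read cost exactly the 2.1 us ping-pong time
    print(reads_per_second(2.1, rails=4))
```

Under this model the 400k figure works out to about 10 us per blocked read and the 1.2 million guess to a bit over 3 us, which fits a blocked read being somewhat slower than the 2.1 us one-way ping-pong.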
The ideal kick butt machine so far is a simple Supermicro mainboard
with 4x PCI-X and 4 sockets.
Now it'll depend upon which CPUs I can get cheapest whether that's
Intel or AMD.
If the 8 core socket 1366 CPUs are going to be cheap @ 22 nm, then
of course with some watercooling, say clocked to 4.5 GHz, those are
going to be kick butt nodes. Those mainboards allow "only" 2 rails,
which definitely means that the QM400 cards, not to mention 4x
infiniband, are underperformers.
Up to 24 nodes, InfiniBand has cheap switches.
But it seems only the newer InfiniBand cards have a latency that's
sufficient, and all of them are far over $500, so that's far outside
Even then they still can't beat a single QM500-B card.
It has been said often enough that the top500 sports hall hardly needs
bandwidth, let alone latency. I saw exactly that: a cluster in the same
sports hall with simple built-in gigabit that isn't even DMA was only 2x
slower than the same machines equipped with InfiniBand.
Now some will cry here that gigabit CAN have reasonable one-way
ping-pong latencies, not to mention the $5k Solarflare 10 gigabit cards,
yet in all sanity we must be honest that the built-in gigabits, for
practical performance reasons, are more like 500 microseconds latency
when you have all cores busy. In fact even the realtime Linux kernel will
take a central lock for every UDP packet you send or receive. Ugly ugly.
That's no comparison with the latencies of the HPC cards of course;
whether you use MPI or SHMEM doesn't really matter there. That
As a result it seems there was never much of a push to having great
That might change now with GPUs kicking butt, though those need of
course massive bandwidth, not latency.
For my tiny cluster latency is what matters. Usually 'one-way
ping-pong' is a good representation of the speed of blocked reads,
as SHMEM allows blocked reads there that are way faster than 2 times the
cost of an MPI one-way ping-pong.
Quadrics is dead and gone. Old junk. My cluster will also be old junk
probably, with the exception maybe of the CPUs. Yet if I don't find
sponsorship for the CPUs,
of course I'm on a big budget there as well.
On Nov 7, 2011, at 12:35 PM, Eugen Leitl wrote:
> On Mon, Nov 07, 2011 at 11:10:50AM +0000, John Hearns wrote:
>> I cannot answer all of your questions.
>> I have a couple of answers:
>> Regarding MPI, you will be looking for OpenMPI
>> You will need a subnet manager running somewhere on the fabric.
>> These can either run on the switch or on a host.
>> If you are buying this equipment from eBay I would imagine you will be
>> running the Open Fabrics subnet manager
>> on a host on your cluster, rather than on a switch.
>> I might be wrong - depends if the switch has a SM license.
> Assuming ebay-sourced equipment, what price tag
> are we roughly looking at, per node, assuming small
> (8-16 nodes) cluster sizes?
> Eugen* Leitl http://leitl.org
> ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
> 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE