[Beowulf] building Infiniband 4x cluster questions

Mon Nov 7 10:51:28 PST 2011

Hi Eugen,

Algorithmically, game tree search is basically a century ahead of many
other sciences, because the brilliant minds have been busy with it. For
the brilliant guys it was possible to make CASH with it.

In math there are still many challenges in designing 1 kick-butt
algorithm, but you won't get rich with it.

As a result, from Alan Turing up to the latest Einstein, they all have
put their focus on game tree search.
I'm in fact moving towards robotics now, building a robot. Not the robot
itself, as I suck at building robots, but the software part, which so
far hasn't really been developed very well for robots. It's still an
unexplored area, for civil use that is.

But as for the chess programs now, they combine a bunch of algorithms
and every single one of them profits big time (exponentially) from
caching. The accesses to that cache are of course random. So the cluster
we're looking at has a number of nodes you can probably count on one
hand, yet I intend to put 4 network cards (4 rails, for insiders here)
into each single machine. Machine is a big word; it will be a standalone
mainboard of course, to save costs.

So the price of each network card is fairly important.

As it seems now, the old Quadrics QM500-B network cards, which you can
pick up for $30 each or so on eBay, are the most promising.

At home I have a fully working QM400 setup with a one-way ping-pong
latency of 2.1 us. So I'd guess a blocked read has a latency not much
above that.

I can choose myself whether I want to do reads of 128 bytes or 256
bytes. No big deal in fact. The data is scattered through the RAM, so
each read is a random read from the RAM.

With 4 nodes that would of course mean 25% odds that it's a local RAM
read (no network read at all then), and 75% odds that it's somewhere in
the gigabytes of RAM of a remote machine.
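
The odds above can be sketched in a few lines of Python. This is only an
illustration, assuming the table is spread evenly over the nodes and
that a hash key is mapped to its owning node with a plain modulo. The
names owner_node, local_fraction and simulate_local_fraction are made up
for this sketch and are not taken from any chess program.

```python
import random


def owner_node(key: int, num_nodes: int) -> int:
    """Assumed placement rule: the node whose RAM holds this hash key."""
    return key % num_nodes


def local_fraction(num_nodes: int) -> float:
    """Expected share of reads served from local RAM for uniform keys."""
    return 1.0 / num_nodes


def simulate_local_fraction(num_nodes: int, my_node: int,
                            reads: int, seed: int = 1) -> float:
    """Draw random 64-bit keys and count how many land on my_node."""
    rng = random.Random(seed)
    local = sum(
        1 for _ in range(reads)
        if owner_node(rng.getrandbits(64), num_nodes) == my_node
    )
    return local / reads


if __name__ == "__main__":
    nodes = 4
    print("expected local share:", local_fraction(nodes))
    print("simulated local share:",
          simulate_local_fraction(nodes, my_node=0, reads=100_000))
```

With more nodes the local share shrinks as 1/N, so nearly every read
becomes a network read and the blocked read latency dominates.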

As it seems now, 4x InfiniBand has a blocked read latency that's too
slow, and I don't know which sockets 4x works with, as in all the test
reports I read, 4x InfiniBand was only running on old socket 604. So I'm
not sure it works on socket 1366, let alone socket 1155; those have a
different memory architecture, so it's never certain that a much older
network card that works through DMA will work there.

Also I hear nothing about putting several cards in 1 machine. I want at
least 4 rails of course from those old crap cards.
You'll argue that for 4x InfiniBand this is not very cost effective, as
the price of 4 cards and 4 cables is already going to be nearly 400
dollars.

That's also what I noticed. But if I put 2x QM500-B in, for example, a
P6T Professional, that's going to be cheaper than $200 including the
cables, and I'd guess it will be able to deliver over a million blocked
reads per second.

By already doing 8 probes, which is currently 192-256 bytes, I already
'bandwidth optimized' the algorithm. Back in the days when Leiserson at
MIT ran Cilkchess and other engines on the Origin 3800 there and on some
Sun supercomputers, they requested, in a slow manner, a single probe of,
what will it have been, some 8-12 bytes.
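
A minimal sketch of what getting 8 probes out of a single read could
look like. The entry layout here is purely hypothetical (four 64-bit
words, so 32 bytes per entry and 256 bytes per bucket, the upper end of
the range above), and the names pack_bucket and probe_bucket are made up
for illustration. The point is only that one read returns the whole
bucket and the 8 probes are then scanned locally.

```python
import struct

# Hypothetical entry: 64-bit key, 64-bit data word, two spare words.
ENTRY_FORMAT = "<QQQQ"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)   # 32 bytes
PROBES_PER_READ = 8
BUCKET_SIZE = ENTRY_SIZE * PROBES_PER_READ   # 256 bytes
EMPTY_KEY = 0                                # assumed marker for a free slot


def pack_bucket(entries):
    """Pack up to 8 (key, data) pairs into one zero-padded bucket."""
    if len(entries) > PROBES_PER_READ:
        raise ValueError("a bucket holds at most 8 entries")
    padded = list(entries) + [(EMPTY_KEY, 0)] * (PROBES_PER_READ - len(entries))
    return b"".join(struct.pack(ENTRY_FORMAT, key, data, 0, 0)
                    for key, data in padded)


def probe_bucket(bucket: bytes, key: int):
    """Scan all 8 entries of a bucket that one read already fetched."""
    if key == EMPTY_KEY:
        return None
    for offset in range(0, BUCKET_SIZE, ENTRY_SIZE):
        entry_key, data, _, _ = struct.unpack_from(ENTRY_FORMAT, bucket, offset)
        if entry_key == key:
            return data
    return None


if __name__ == "__main__":
    bucket = pack_bucket([(0xABC, 7), (0xDEF, 9)])
    print(len(bucket), probe_bucket(bucket, 0xDEF), probe_bucket(bucket, 0x123))
```

Compared with requesting a single 8-12 byte probe per network round
trip, this pays the latency once for 8 candidates.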

So far it seems that 4 InfiniBand 4x cards can deliver me only 400k
blocked reads a second, which is in fact a workable number for a
single-socket machine (the amount I need depends largely on how fast the
node is).

Yet I'm not aware whether InfiniBand allows multiple rails. Does it?

The QM400 cards I have here can, I'd guess, deliver around 1.2 million
blocked reads a second with 4 rails, which already allows a lot faster
nodes.
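
To put the guessed read rates side by side, here is a small
back-of-the-envelope sketch. It assumes each rail has exactly one
blocked read in flight at a time, so the time per read is simply the
inverse of the per-rail rate. That assumption and the function names are
introduced here for illustration; the rates themselves are the guesses
quoted in this mail, not measurements.

```python
def per_rail_rate(reads_per_second: float, rails: int) -> float:
    """Blocked reads per second that a single rail has to sustain."""
    return reads_per_second / rails


def implied_read_time_us(reads_per_second: float, rails: int) -> float:
    """Microseconds per blocked read, assuming one read in flight per rail."""
    return 1e6 / per_rail_rate(reads_per_second, rails)


def payload_mb_per_second(reads_per_second: float, bytes_per_read: int) -> float:
    """Payload bandwidth in MB/s for a given read rate and read size."""
    return reads_per_second * bytes_per_read / 1e6


# (total blocked reads per second, number of rails), as guessed in the mail
ESTIMATES = {
    "4x InfiniBand, 4 cards": (400_000, 4),
    "QM400, 4 rails": (1_200_000, 4),
    "QM500-B, 2 cards": (1_000_000, 2),  # "over a million": a lower bound
}

if __name__ == "__main__":
    for name, (rate, rails) in ESTIMATES.items():
        print(f"{name}: {implied_read_time_us(rate, rails):.1f} us per read, "
              f"{payload_mb_per_second(rate, 256):.0f} MB/s at 256 bytes")
```

Under that assumption the 400k figure corresponds to 10 us per read per
card, and the 1.2 million figure to about 3.3 us per read per rail. Even
at 256 bytes per read the payload stays in the low hundreds of MB/s, so
the cards are limited by latency here, not by bandwidth.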

The ideal kick-butt machine so far is a simple Supermicro mainboard with
4x PCI-X and 4 sockets.
Now it'll depend on which CPUs I can get cheapest whether that's Intel
or AMD.

If the 8-core socket 1366 CPUs are going to be cheap @ 22 nm, then of
course, with some watercooling and clocked to say 4.5 GHz, those are
going to be kick-butt nodes. Those mainboards allow "only" 2 rails,
which definitely means that the QM400 cards, not to mention 4x
InfiniBand, are underperformers.

Up to 24 nodes, InfiniBand has cheap switches.

But it seems only the newer InfiniBand cards have a latency that's
sufficient, and all of them are far over $500, so that's far outside of
the budget.
Even then they still can't beat a single QM500-B card.

It has been said more than often enough that the Top500 sports hall
hardly needs bandwidth, let alone latency. I saw exactly that: a cluster
in the same Top500 sports hall with simple built-in gigabit that isn't
even DMA was only 2x slower than the same machines equipped with
InfiniBand.

Now some will cry here that gigabit CAN have reasonable one-way
ping-pongs, not to mention the $5k Solarflare 10 gigabit Ethernet cards,
yet in all sanity we must be honest that the built-in gigabits, in terms
of practical performance, are more like 500 microseconds latency if you
have all cores busy. In fact even the realtime Linux kernel will take a
central lock for every UDP packet you send or receive. Ugly ugly.

That's no comparison with the latencies of the HPC cards of course;
whether you use MPI or SHMEM doesn't really matter there. That
difference is so huge.

As a result it seems there was never much of a push toward having great
network cards.

That might change now with GPUs kicking butt, though those of course
need massive bandwidth, not latency.

For my tiny cluster latency is what matters. Usually 'one-way ping-pong'
is a good representation of the speed of blocked reads, Quadrics
excepted, as SHMEM allows way faster blocked reads there than 2 times
the cost of an MPI one-way ping-pong.

Quadrics is dead and gone. Old junk. My cluster will probably also be
old junk, with the exception maybe of the CPUs. Yet if I don't find
sponsorship for the CPUs, of course I'm on a budget there as well.

On Nov 7, 2011, at 12:35 PM, Eugen Leitl wrote:

> On Mon, Nov 07, 2011 at 11:10:50AM +0000, John Hearns wrote:
>> Vincent,
>> I cannot answer all of your questions.
>> I have a couple of answers:
>>
>> Regarding MPI, you will be looking for OpenMPI
>>
>> You will need a subnet manager running somewhere on the fabric.
>> These can either run on the switch or on a host.
>> If you are buying this equipment from eBay I would imagine you will be
>> running the Open Fabrics subnet manager
>> on a host on your cluster, rather than on a switch.
>> I might be wrong - depends if the switch has a SM license.
>
> Assuming ebay-sourced equipment, what price tag
> are we roughly looking at, per node, assuming small
> (8-16 nodes) cluster sizes?
>
> -- 
> Eugen* Leitl http://leitl.org
> ______________________________________________________________
> ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
> 8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE