[Beowulf] Home beowulf - NIC latencies

Vincent Diepeveen diep at xs4all.nl
Fri Feb 4 12:39:47 PST 2005


At 11:38 4-2-2005 -0800, Bill Broadley wrote:
>> 
>> One way pingpong with 64 bytes will do great.
>> 
>
>A very similar number I build a circularly linked list and read a value,
>add 1 to it, and send it to the next host, with a GigE network:
>
>compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local
>size=   10, 131072 hops, 8 nodes in  5.30 sec ( 40.4 us/hop)    966 KB/sec
>
>Oh, you said 64 (I'm sending INTs, so 16):
>size=   16, 131072 hops, 8 nodes in  5.35 sec ( 40.8 us/hop)   1531 KB/sec

I'm amazed you got it down to 40.8 us. Did you test on an idle network?

How fast is it when the CPUs are 100% busy doing integer work?

>> CPU's are 100% busy and after i know how many times a second the network
>> can handle in theory requests i will do more probes per second to the
>> hashtable. The more probes i can do the better for the game tree search.
>
>With a gigE network that sounds like 40us or so.  With Myrinet or IB
>it's in the 4-6us range.  If you bought dual opterons with the special

The Quadrics and Dolphin homepages both claim 12+ us for Myrinet.

For example :
  http://www.dolphinics.com/pdf/datasheet/Dolphin_socket_4p.pdf

>hypertransport slot you could get it down to 1.5us or so.  SGI
>altix machines can get that down again to around 1.0us.  Of course
>speed isn't cheap.

The Altix 3000 has worse latency than the Origin 3800, if I interpret the
results correctly. The Altix 3000 shows a 3-4 us one-way ping-pong at 64
processors, which the Origin 3800 only reaches at 512 processors.

For the 64-processor numbers, see the extensive benchmarking done by prof.
Aad van der Steen for Dutch government organisations. His results are at
www.sara.nl in PDF format; look for his presentation of 1 July 2003.

When I ran my latency tests (using shared memory) on a limited number of
CPUs, the Origin 3800 really had far lower latency than the Altix 3000.

A problem with the Altix 3000 design is that scheduling is of course very
hard because of the complex routing: each brick is connected to 2 routers,
which each connect to other parts of the machine.

This causes immense scheduling problems when, as is normally the case, some
150 users are on the machine simultaneously; those problems don't show up
when you benchmark an entirely empty machine as the only user.

>> The few one way pingpong times i can find online from gigabit cards are not
>> exactly promising, to say it very politely. Something in the order of 50 us
>> one way pingpong time i don't even consider worth taking a look at at the
>> picture.
>> 
>> Each years cpu's get faster. For small networks 10 us really is the upper
>> limit.
>
>Okay, so dolphin, myrinet, or IB.

Do you have URLs where IB is buyable without needing to buy an entire system?

>> Let's not discuss parallel chess algorithm too much in depth. 100 different
>> algorithms/enhancements get combined with each other. They are not the
>> biggest latency problem. The latency problem is caused by the hashtable.
>> Hashtable is a big cache. The bigger the better. It avoids researching the
>> same tree again.
>
>Okay, so my question is, which would be better:
>* 8 4GB caches that you could query 80 million times a second?

This one by far.

Actually, for the top of the search such big caches are not needed. Locally
I may allocate 200-400MB per CPU for cache, but a shared cache can easily be
as small as 4MB per CPU, no problem. I could get it down to even less than
that if needed.

99% of all nodes (chess positions) that get searched are near the leaves. So
if I raise the variable controlling the depth at which remote CPUs may also
be queried from 0 to 2, then 99% of all nodes are already no longer looked
up remotely.

>* 1 64GB cache that you could query 200,000 times a second?
 
>> In games like chess and every search terrain (even simulated flight) you
>> can get back to the same spot by different means causing a transposition.
>> Like suppose you start the game with 1.e4,e5 2.d4 that leads to the same
>> position like 1.d4,e5 2.e4. So if we have searched already 1.e4,e5 2.d4
>> that position P we store into a large cache. Other cpu's first want to know
>> whether we already searched that position. 
>
>Right.  But if you can calculate a few Billion operations per second
>sometimes it is faster to recalculate then wait 10-20us for an answer.

Searching 1 ply deeper is exponential. In a 460-CPU search (Origin 3800),
moving the variable from 1 (the default, so the leaves already were not
stored/looked up remotely) to 10 lost me 7 ply of search depth.

That's about a factor 3^7 = 2187.

To answer the question: YES, in such a case 1 fast PC processor would
hands-down outsearch a 512-processor supercomputer.

Supercomputers are of course notorious here. It takes a year or so to
deliver them, and the processor chosen at purchase time already wasn't the
fastest, so by the time they finally work well for users the processors are
at least 2 times slower than PC processors (for integer work).

Clusters are far superior in that respect.

>> Those hashtable positions get created quite quickly. Deep Blue created them
>> at a 100 million positions a second and simply didn't store the vast majority
>> in hashtable (would be hard as it was in hardware). That's one of the
>> reasons why it searched only 10-12 ply, already in 1999 that was no longer
>> spectacular when 4 processor pc's showed up at world champs. 
>
>Indeed, better algorithms can allow a 4 cpu to compete with a 2000.

The Sheikh (one of the princes of the United Arab Emirates, see
www.hydrachess.com) told me over MSN that he plans to build a 1024-processor
chess computer. He has bad advisors, IMHO. He's using Myrinet and a bad
parallel search (a speedup of less than the square root of the total number
of CPUs). Objectivity and desert sand are a bad combination.

>> At a PC with a shared hashtable nowadays i get 10-12 ply (ply = half move,
>> full move is when both sides make a move) in a few seconds, searching a
>> 100000 positions per second a cpu.
>> 
>> So before we start searching every node (=position) we quickly want to find
>> out whether other cpu's already searched it.
>
>So that operation will cost around 80us with GigE, and 10-16us with IB
>or Myri.

80 us is what I read elsewhere too for GigE, yes.

Is it so hard to make a card with lower latency for a few dollars?

I mean, if I buy a CPU for 135 euro I can get myself a 1.4GHz Opteron or
something. If I buy one for 1000 euro I get, say, a 2.4GHz Opteron.

That's less than a factor 2 faster.

If you buy a network card for 135 euro, you get 80 us. When you buy a
high-end network card, it's a factor 10 faster from the user's viewpoint.

That's quite a lot!

>> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
>> search at 460 processors). Simply because the machine has 512GB ram.
>
>The origin 3800 has a very healthy interconnect, shared memory lookups
>are in the few 100 ns range, and MPI with the newest libraries are
>in the 1-2us range.

If the interconnects (hubs) of the Origin are that good, then they must be
using really slow routers.

A shared-memory lookup (of 8 bytes) averages 5.8 us at 460 processors on the
Origin 3800 with no one else on the system; one-way ping-pong is 3-4 us.

That machine is equipped with so-called 35ns routers.

A lookup in local memory takes 280 ns, by the way, on both the Itanium 2 and
the Origin. Of course everything is randomized, so it's complete TLB
thrashing.

>> So in short you take everything you can get.
>
>Of course.
>
>> The search works with internal iterative deepending which means we first
>> search 1 ply, then 2 ply, then 3 ply and so on.
>> 
>> The time it takes to get to the next iteration i hereby define as the
>> branching factor (Knuth has a different definition as he just took into
>> account 1 algorithm, the 'todays' definition looks more appropriate).
>> 
>> In order to search 1 ply deeper obvious it's important to maintain a good
>> branching factor. I'm very bad in writing out mathematical proofs, but it's
>> obvious that the more memory we use, the more we can reduce the number of
>> legal moves in this position P as next few ply it might be in hashtable,
>> which trivially makes the time needed to search 1 ply deeper shorter.
>> 
>> Storing closer to the root (position where we started searching) is of
>> course more important than near the leafs of the search tree.
>> 
>> When for example not storing in hashtable last 10 ply near the leafs in an
>> overnight experiment the search depth dropped at 460 processors from 20 ply
>> to 13 ply.
>> 
>> Of course each processor of supercomputers is deadslow for game tree search
>> (it's branchy 100% integer work completely knocking down the caches), so
>> compared to pc's you already start at a disadvantage of a factor 16 or so
>> very quickly, before you start searching (in case of TERAS i had to fight
>> with outdated 500Mhz MIPS processors against opterons and high clocked quad
>> Xeons), so upgrading my own networkcards is more clever. 
>
>Interesting.  Of course the Origin 3800 is quite dated, not that the
>Itanium is an opteron killer, but it is much more competitive, and has
>much larger caches.

A 1.3GHz Itanium, after 24 hours of PGO and after I figured out all kinds of
compiler options to stop it taking shortcuts by default, is the same speed
as a 1.3GHz Opteron for DIEP. I understand why governments buy them: they
look good on paper and have no real weak spots. But those Itaniums are a
horror to program for.

L3 cache size is not important for DIEP. See the extensive benchmarking of
my program at the various hardware sites, for example by Johan de Gelas, or:

  Aceshardware : http://www.aceshardware.com/read.jsp?id=60000259
  Sudhian : http://www.sudhian.com/showdocs.cfm?aid=635&pid=2403
  Soon it will also be tested at www.anandtech.com!

>> Yet getting yourself a network even between a few nodes as quick as those
>> supercomputers is not so easy...
>
>Quadrics and Pathscale's infinipath have networks available that are in the
>same ballpark as the SGI origin.  Even dolphin although I'm not very
>familiar with them.

I am very impressed by the Quadrics and Dolphin cards, and probably will be
by Infinipath too when I check them out, which I will.

I'm actually not so impressed by Myrinet yet, but if cluster builders can
earn a couple of hundred dollars more on each node, I'm sure they'll use it.

>> Additional your own beowulf network you can first decently test at before
>> playing at a tournament, and without good testing at the machine you play
>> at in tournaments you have a hard 0% chance that it plays well. 
>> 
>> The only thing in software that matters is testing.
>
>Indeed, good luck, thanks for the overview.  I'm planning on a cluster
>with a very fast (sub 2.5us network), but I won't have it for a few months.
>
>I had some infiniband hardware on loan, but I had to return it.
>
>-- 
>Bill Broadley
>Computational Science and Engineering
>UC Davis
>
>



More information about the Beowulf mailing list