[Beowulf] Three notes from ISC 2006
diep at xs4all.nl
Wed Jun 28 14:48:30 PDT 2006
> On Wed, 28 Jun 2006, Patrick Geoffray wrote:
>> High message rate is good, but the question is how much is enough? At 3
>> million packets per second, that's 0.3 us per message, all of which is
>> used by the communication library. Can you name real-world applications
>> that need to send messages every 0.3 us in a sustained way? I can't;
>> only benchmarks do that. At 1 million packets per second, that's one
>> message per microsecond. When does the host actually compute something?
>> Did you measure the effective messaging rates of some applications?
>> From your flawed white papers, you compared your own results against
>> numbers picked from the web, using older interconnects with unknown
>> software versions. Comparing with Myrinet D cards, for example, you have
>> 4 times the link bandwidth and half the latency (actually, more like 1/5
>> of the latency, because I suspect most/all Myrinet results were using GM
>> and not MX), but you say that it's the messaging rate that drives the
>> performance??? I would suspect the latency in most cases, but you
>> certainly can't say unless you look at it.
In search, it gives an exponential speedup when you can avoid redoing
calculations that other nodes have already done.
So in order to do that, preferably at *every node* you do a lookup in the hashtable.
You can of course simply spread the hashtable over all nodes: entries 0..n-1
at node P0, entries n..2n-1 at node P1, etc.
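A minimal sketch of that block distribution, assuming every node holds the same number of entries n: a global index (for example a position key reduced modulo the total table size) determines the owning node and the offset inside that node's slice. The function names are hypothetical, not code from the post.

```python
from typing import Tuple

# Sketch of a hashtable block-distributed over several nodes, assuming every
# node holds the same number of entries n (entries_per_node).
# All names here are hypothetical illustrations.


def owner_of(index: int, entries_per_node: int) -> Tuple[int, int]:
    """Return (node, local_offset) for a global hashtable index.

    Node P0 holds entries 0..n-1, node P1 holds n..2n-1, and so on.
    """
    if index < 0 or entries_per_node <= 0:
        raise ValueError("index must be >= 0 and entries_per_node > 0")
    return divmod(index, entries_per_node)


def global_index(key: int, num_nodes: int, entries_per_node: int) -> int:
    """Map a position key onto the whole table of num_nodes * n entries."""
    return key % (num_nodes * entries_per_node)


if __name__ == "__main__":
    n = 1 << 20  # hypothetical: about one million entries per node
    for idx in (0, n - 1, n, 2 * n + 5):
        print(idx, owner_of(idx, n))
```

If the owning node is the local one, the probe is a plain memory read; otherwise it becomes a blocking read over the network, which is where the packet rate discussed in this thread comes in.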
Nowadays each entry of that hashtable is 64 bytes long. So I try to do as many
remote lookups as the network allows without it getting really slow
to fetch a packet from a remote node and read it.
For example, for a 16-node cluster of dual Woodcrest 5160 machines, I would like
the 16-node switch to deliver a total packet rate of
1 million * 16 = 16 million blocking reads per second.
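The arithmetic behind that figure, as a small sketch. Taking the 64-byte entry as the size of each read is an assumption, and packet/protocol headers are ignored, so the bandwidth number counts payload only.

```python
# Back-of-the-envelope figures: 16 nodes, each issuing about 1 million
# blocking reads per second, assuming 64 bytes of payload per read.
# Headers are ignored (assumption), so bandwidth is payload only.

NODES = 16
READS_PER_NODE_PER_S = 1_000_000
ENTRY_BYTES = 64


def aggregate_reads_per_s(nodes: int, reads_per_node: int) -> int:
    """Total blocking reads per second the switch has to carry."""
    return nodes * reads_per_node


def mean_interval_us(reads_per_node: int) -> float:
    """Average time between reads issued by one node, in microseconds."""
    return 1e6 / reads_per_node


def payload_bytes_per_s(nodes: int, reads_per_node: int, entry_bytes: int) -> int:
    """Aggregate payload traffic for the reads, ignoring headers."""
    return nodes * reads_per_node * entry_bytes


if __name__ == "__main__":
    print(aggregate_reads_per_s(NODES, READS_PER_NODE_PER_S))  # 16000000
    print(mean_interval_us(READS_PER_NODE_PER_S))  # 1.0
    print(payload_bytes_per_s(NODES, READS_PER_NODE_PER_S, ENTRY_BYTES))  # 1024000000
```

So each node would issue on average one blocking read per microsecond, with about 1 GB/s of payload crossing the switch in total: 16 million blocking reads per second.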
If the network can't deliver that, then of course I simply hash a bit less
further down the tree (near the leaves). Not hashing in the leaves means a direct loss
of 20-40% in time, and it reduced the number of
blocking reads by nearly a factor of 3 to 4. Additionally,
the loss hurts more on supercomputers/superclusters, because it then
takes longer to put all nodes to work.
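Why skipping the probes near the leaves cuts the read count so much can be seen in a simplified model: in a uniform tree with branching factor b, most nodes sit at the last ply, so dropping the probe there divides the number of probes by roughly b. The uniform tree and the hypothetical min_remote_depth threshold are assumptions for illustration; this is not how the factor 3 to 4 above was measured.

```python
# Illustrative count of hashtable probes in a *uniform* search tree (every
# node has the same number of children). Real game-tree searches are far
# from uniform; this is a simplifying assumption. Nodes whose remaining
# depth is below min_remote_depth (hypothetical name) skip the remote probe.


def should_probe_remote(depth_left: int, min_remote_depth: int) -> bool:
    """Probe the distributed hashtable only far enough from the leaves."""
    return depth_left >= min_remote_depth


def count_probes(branching: int, depth: int, min_remote_depth: int) -> int:
    """Number of remote probes in a uniform tree searched to `depth` plies."""
    total = 0
    nodes_at_ply = 1
    for ply in range(depth + 1):
        if should_probe_remote(depth - ply, min_remote_depth):
            total += nodes_at_ply
        nodes_at_ply *= branching
    return total


if __name__ == "__main__":
    full = count_probes(branching=4, depth=10, min_remote_depth=0)
    no_leaves = count_probes(branching=4, depth=10, min_remote_depth=1)
    print(full, no_leaves, full / no_leaves)
```

With b = 4 and depth 10, the model gives 1398101 probes when hashing everywhere versus 349525 when skipping the leaves, a ratio of about 4.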
So the knife cuts on 2 sides:
a) you can put nodes to work sooner and more efficiently
b) a direct performance penalty when a node doesn't know whether other
nodes have already calculated the position it wants to start in.
The reason for this work is that in all kinds of search in
artificial intelligence (or guided flight), transpositions occur:
first you visit state A and then state B to get to state C.
However, if you first visit B and then A, you ALSO get to state C.
That's called a "transposition", and it is the fundamental reason why it's important
that all nodes can share information.
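One common way to make transpositions land on the same hashtable entry is Zobrist hashing; the post doesn't say which hash it uses, so treat this as an illustrative assumption. Each (square, piece) pair gets a random 64-bit number and a position's key is the XOR of its pairs' numbers. Because XOR is commutative and associative, playing move A then B gives the same key as B then A. The squares, piece numbers and moves below are hypothetical.

```python
import random

# Zobrist hashing, used here only as an illustration of transposition
# detection. Every (square, piece) feature gets a random 64-bit number;
# a position's key is the XOR of the numbers of all its features.

NUM_SQUARES = 64
NUM_PIECE_KINDS = 12
_rng = random.Random(2006)  # fixed seed so the keys are reproducible
ZOBRIST = [[_rng.getrandbits(64) for _ in range(NUM_PIECE_KINDS)]
           for _ in range(NUM_SQUARES)]


def position_key(placements):
    """Key of a position given as an iterable of (square, piece) pairs."""
    key = 0
    for square, piece in placements:
        key ^= ZOBRIST[square][piece]
    return key


def apply_move(key, piece, from_sq, to_sq):
    """Incremental update for a simple move; captures and side to move ignored."""
    return key ^ ZOBRIST[from_sq][piece] ^ ZOBRIST[to_sq][piece]


if __name__ == "__main__":
    start = position_key([(12, 0), (52, 6)])  # hypothetical state S
    # Move A: piece 0 from square 12 to 28. Move B: piece 6 from 52 to 36.
    a_then_b = apply_move(apply_move(start, 0, 12, 28), 6, 52, 36)
    b_then_a = apply_move(apply_move(start, 6, 52, 36), 0, 12, 28)
    state_c = position_key([(28, 0), (36, 6)])
    print(a_then_b == b_then_a == state_c)  # True
```

The resulting key can then be reduced modulo the global table size to pick the entry, and with it the node that owns that entry.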