[Beowulf] Home beowulf - NIC latencies

Fri Feb 4 09:31:01 PST 2005

On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote:
> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> >> Good morning!
> >> 
> >> I intend to run my chess program on a Beowulf to be built here (starting
> >> with 2 dual-K7 machines), so I had better get some good advice on which
> >> network to buy. The only thing that matters is how fast each node can
> >> read 64 bytes at random from the RAM of some remote CPU. All nodes do that
> >> simultaneously.
> >
> >Is there any way to do this less often with a larger transfer?
> >If you wrote a small benchmark that did only that (send 64 bytes randomly
> >from a large array in memory) and made it easy to download, build, run,
> >and report results, I suspect some people would run it.
> 
> One way pingpong with 64 bytes will do great.

Pingpong is not really the same: adding a random element can slow down
comms, and it sounds like you ideally want a one-sided operation.
Perhaps you should look at tabletoy (Cray SHMEM) or gups (MPI) as a
benchmark.
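
As a rough illustration of the access pattern such a benchmark would
exercise, here is a minimal single-process Python sketch. It only times
random 64-byte reads from a local buffer: it is not tabletoy or gups, it
involves no network, and the names (`random_reads`, `ENTRY_BYTES`) are made
up for this example. A real test would replace the local slice with a
one-sided remote read.

```python
import random
import time

ENTRY_BYTES = 64  # transfer size from the original question


def random_reads(table: bytearray, n_reads: int, seed: int = 0) -> int:
    """Read n_reads random 64-byte entries from table.

    Returns the sum of the first byte of each entry read, so the reads
    have an observable result.
    """
    rng = random.Random(seed)
    n_entries = len(table) // ENTRY_BYTES
    checksum = 0
    for _ in range(n_reads):
        slot = rng.randrange(n_entries)
        entry = table[slot * ENTRY_BYTES:(slot + 1) * ENTRY_BYTES]
        checksum += entry[0]
    return checksum


if __name__ == "__main__":
    # 16 MB local buffer standing in for the RAM of a remote node (assumption).
    table = bytearray(16 * 1024 * 1024)
    n = 100_000
    start = time.perf_counter()
    random_reads(table, n)
    elapsed = time.perf_counter() - start
    print(f"{elapsed / n * 1e6:.2f} us per local read (interpreter overhead included)")
```

The number printed here mostly reflects Python overhead, so it is only
useful as a template for the loop structure, not as a latency figure.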

> The CPUs are 100% busy, and once I know how many requests per second the
> network can handle in theory, I will do more probes per second to the
> hashtable. The more probes I can do, the better for the game tree search.

Are you overlapping comms and compute or doing blocking reads?  If you
are overlapping then the issue rate for reads is more important than the
raw latency.

> >> Quadrics/Dolphin seem a bit out of my price range. Myrinet is about 684
> >> euro per card according to an AltaVista search, and I wonder how to get
> >> more than 2 nodes to work without a switch. Perhaps there are low-cost
> >> switches with reasonably low latency?
> >Do you know that gigabit is too high latency?
> 
> The few one-way pingpong times I can find online for gigabit cards are not
> exactly promising, to put it very politely. A one-way pingpong time on the
> order of 50 us I don't even consider worth a look.
> 
> Each year CPUs get faster. For small networks 10 us really is the upper
> limit.

10us is easily achievable; I've just measured a read time of a little
over 3us and an issue rate of 1.33us.
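
To make the latency versus issue rate distinction concrete, here is a small
worked calculation in Python. It assumes the read time is exactly 3.0us (the
measurement was "a little over" that) and reads "issue rate of 1.33us" as one
read issued every 1.33us; the function name is made up for this sketch.

```python
def reads_per_second(latency_us: float, issue_interval_us: float,
                     overlapped: bool) -> float:
    """Per-node read rate under a simple model.

    Blocking reads wait for each reply, so the rate is bound by latency.
    Overlapped reads keep several requests in flight, so the rate is
    bound by how often a new request can be issued.
    """
    per_read_us = issue_interval_us if overlapped else latency_us
    return 1e6 / per_read_us


blocking = reads_per_second(3.0, 1.33, overlapped=False)
pipelined = reads_per_second(3.0, 1.33, overlapped=True)
print(f"blocking:  {blocking:,.0f} reads/s")
print(f"pipelined: {pipelined:,.0f} reads/s")
```

Under these assumptions a node doing blocking reads gets roughly 333,000
probes per second, while one that overlaps comms and compute could approach
752,000.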

> So before we start searching each node (= position), we quickly want to
> find out whether other CPUs have already searched it.
> 
> On the Origin 3800 with 512 processors I used a 115 GB hashtable (I started
> the search on 460 processors), simply because the machine has 512 GB of RAM.
> 
> So in short you take everything you can get.

So is this a parallel algorithm or simply a big "memory farm" you are
after?  You don't hear much of clusters being used for the latter, but in
some cases it's an eminently sensible thing to do.
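
If the hashtable is treated as a memory farm spread evenly over the nodes,
each probe has to be mapped to an owning node and a byte offset before the
64-byte read can be issued. The Python sketch below shows one possible
mapping. The scheme (modulo on the key) and the names `locate` and
`ENTRY_BYTES` are assumptions for illustration, not how the chess program
described in this thread actually lays out its table.

```python
from typing import Tuple

ENTRY_BYTES = 64  # size of one hashtable entry, per the original question


def locate(key: int, n_nodes: int, entries_per_node: int) -> Tuple[int, int]:
    """Map a position hash key to (owning node, byte offset on that node).

    Hypothetical layout: the low part of the key picks the node, the
    remaining part picks the slot within that node's share of the table.
    """
    node = key % n_nodes
    slot = (key // n_nodes) % entries_per_node
    return node, slot * ENTRY_BYTES


# Example with 2 nodes and 4 entries per node (tiny numbers for clarity).
node, offset = locate(5, n_nodes=2, entries_per_node=4)
print(node, offset)
```

With a mapping like this, a probe becomes a single 64-byte read at `offset`
on `node`, which is the operation whose latency and issue rate matter.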

Ashley,