[Beowulf] Good 1 Gbit switches - which ones?

Mon May 24 17:12:20 PDT 2004

On Fri, May 21, 2004 at 12:49:47PM -0700, Konstantin Kudin wrote:
>  Hi there,
> 
>  Can anyone offer an insight with respect to 1 Gbit switches for a
> Beowulf cluster? There are all these reports that a lot of inexpensive
> switches on the market tend to choke under heavy internal traffic. Can
> anyone suggest an affordable switch with good internal bandwidth, which
> was tested under heavy load, and actually worked well?

I've written a small benchmark that passes messages of a configurable
number of MPI_INTs between a variable number of pairs of nodes.

With a 32-node dual-Opteron cluster and a Nortel Baystack 470 48-port
switch:

# of                     Nodes
MPI_INTs     Hops        (in pairs)   Wallclock  Latency        Bandwidth
==============================================================================
size=     1, 131072 hops,  8 nodes in  7.04 sec ( 53.7 us/hop)     73 KB/sec
size=     1, 131072 hops, 16 nodes in  7.46 sec ( 56.9 us/hop)     69 KB/sec
size=     1, 131072 hops, 24 nodes in  7.51 sec ( 57.3 us/hop)     68 KB/sec
size=     1, 131072 hops, 32 nodes in  8.44 sec ( 64.4 us/hop)     61 KB/sec
(latency up 20% or so going from 8 to 32 nodes)

size=    10, 131072 hops,  8 nodes in  7.15 sec ( 54.5 us/hop)    716 KB/sec
size=    10, 131072 hops, 16 nodes in  7.39 sec ( 56.4 us/hop)    693 KB/sec
size=    10, 131072 hops, 24 nodes in  7.59 sec ( 57.9 us/hop)    674 KB/sec
size=    10, 131072 hops, 32 nodes in  8.06 sec ( 61.5 us/hop)    635 KB/sec
(latency up 13% or so going from 8 to 32 nodes)

size=  1000, 16384 hops,  8 nodes in  1.93 sec (117.8 us/hop)  33163 KB/sec
size=  1000, 16384 hops, 16 nodes in  1.96 sec (119.6 us/hop)  32652 KB/sec
size=  1000, 16384 hops, 24 nodes in  1.98 sec (120.6 us/hop)  32400 KB/sec
size=  1000, 16384 hops, 32 nodes in  2.20 sec (134.1 us/hop)  29129 KB/sec
(latency up 14% or so going from 8 to 32 nodes)

size= 10000, 16384 hops,  8 nodes in  9.71 sec (592.5 us/hop)  65930 KB/sec
size= 10000, 16384 hops, 16 nodes in  9.92 sec (605.2 us/hop)  64543 KB/sec
size= 10000, 16384 hops, 24 nodes in 10.13 sec (618.4 us/hop)  63164 KB/sec
size= 10000, 16384 hops, 32 nodes in 17.47 sec (1066.4 us/hop)  36629 KB/sec
(latency up 80% or so going from 8 to 32 nodes; bandwidth down about 44%)

size=100000, 16384 hops,  8 nodes in 100.00 sec (6103.5 us/hop)  64000 KB/sec
size=100000, 16384 hops, 16 nodes in 104.72 sec (6391.3 us/hop)  61118 KB/sec
size=100000, 16384 hops, 24 nodes in 103.68 sec (6328.0 us/hop)  61730 KB/sec
size=100000, 16384 hops, 32 nodes in 134.14 sec (8187.3 us/hop)  47711 KB/sec
(latency up 34% or so going from 8 to 32 nodes; bandwidth down about 25%)
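
For anyone who wants to check the bandwidth column, here is a minimal sketch
of how it appears to follow from the latency column. It assumes a 4-byte
MPI_INT and that "KB" means 1024 bytes; neither is stated in the output, but
both are consistent with the numbers in the table. The function and constant
names are only illustrative and are not taken from the benchmark source.

```python
# Assumptions (not stated in the benchmark output):
#   - an MPI_INT is 4 bytes
#   - "KB" means 1024 bytes
BYTES_PER_MPI_INT = 4
BYTES_PER_KB = 1024


def bandwidth_kb_per_sec(num_ints, latency_us_per_hop):
    """Bandwidth implied by a message of num_ints MPI_INTs per hop."""
    bytes_per_hop = num_ints * BYTES_PER_MPI_INT
    seconds_per_hop = latency_us_per_hop * 1e-6
    return bytes_per_hop / seconds_per_hop / BYTES_PER_KB


# (size, us/hop, reported KB/sec) copied from the 8- and 32-node rows.
ROWS = [
    (1, 53.7, 73),
    (1, 64.4, 61),
    (10000, 592.5, 65930),
    (10000, 1066.4, 36629),
    (100000, 6103.5, 64000),
    (100000, 8187.3, 47711),
]

if __name__ == "__main__":
    for size, latency_us, reported in ROWS:
        computed = bandwidth_kb_per_sec(size, latency_us)
        print(f"size={size:6d}  computed={computed:9.1f} KB/sec  "
              f"reported={reported} KB/sec")
```

The computed values should agree with the reported ones to within the
rounding of the us/hop figures.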

In every case I see a substantial drop-off by the time I keep 32 ports
busy, and I suspect the drop-off at 48 ports would be even worse.
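
To put a number on the drop-off, a small sketch like the following compares
the 8-node and 32-node rows for each message size. The figures are copied
from the table; the helper name percent_change is only illustrative. Note
that the latency increase and the bandwidth decrease are different
percentages for the same pair of rows, so it matters which one is quoted.

```python
def percent_change(baseline, value):
    """Signed change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# size -> ((us/hop, KB/sec) at 8 nodes, (us/hop, KB/sec) at 32 nodes)
RESULTS = {
    1: ((53.7, 73), (64.4, 61)),
    10: ((54.5, 716), (61.5, 635)),
    1000: ((117.8, 33163), (134.1, 29129)),
    10000: ((592.5, 65930), (1066.4, 36629)),
    100000: ((6103.5, 64000), (8187.3, 47711)),
}

if __name__ == "__main__":
    for size, ((lat8, bw8), (lat32, bw32)) in RESULTS.items():
        print(f"size={size:6d}  "
              f"latency {percent_change(lat8, lat32):+6.1f}%  "
              f"bandwidth {percent_change(bw8, bw32):+6.1f}%")
```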

Does this seem like a reasonable way to benchmark switches?  Does anyone
have suggested improvements or better tools?  If people think this would
be valuable, I could clean up the source and provide a central location
for storing benchmark results.

-- 
Bill Broadley
Computational Science and Engineering
UC Davis