[Beowulf] Home beowulf - NIC latencies

Patrick Geoffray patrick at myri.com
Mon Feb 7 00:11:58 PST 2005


Vincent,

Vincent Diepeveen wrote:
> Thanks for your kind answer Patrick,
> 
> Obviously I mentioned that number because I read it elsewhere.

I know, I have seen worse.

> Note that so far I haven't found any desperate vendor. Quadrics certainly
> doesn't look desperate to me; they aren't even selling old cards anymore,
> though they must still have thousands of them lying around from returned,
> upgraded networks. Second-hand high-end cards seem to be very rare.

Tip: desperate companies are usually young and spend a lot of VC money 
on marketing. Quadrics does not fit, I am afraid, they have been around 
too long :-) Furthermore, selling old hardware is not very cost 
effective for a vendor: it creates compatibility troubles with newer 
machines, requires supporting old hardware in new drivers and new 
middleware, taps into inventory reserved for replacement parts, etc.

> First of all, I'm interested in how quickly I can get 4-64 bytes from
> remote memory. So not from some kind of network card cache (Myrinet
> doesn't have some megabytes on chip, just a few tens of kilobytes); the
> memory therefore has to come from the remote node's main memory, at a
> random address. No streaming at all happens. The extra 400 ns that the
> TLB adds is definitely not the problem, I guess.

Myrinet NICs have 2 MB of SRAM as standard, used for firmware code, 
data, and buffers.

What you want to do is basically a Get. In practice, the origin of the 
Get will send a small packet with a virtual address or an RDMA handle 
and an offset, the NIC on the target side converts it into a physical 
address, fetches the data by DMA, and sends it back to the origin side.
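
At the MPI level, the same mechanism is exposed through the MPI-2
one-sided calls. A rough sketch (the window setup and the 64-byte
size here are illustrative, not Myrinet-specific):

    /* Minimal sketch of a remote Get with MPI-2 one-sided calls. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char local[64], exposed[64];
        int rank;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes 64 bytes of its memory in a window;
         * this registration is where the RDMA handle and the
         * virtual-to-physical translation come from. */
        MPI_Win_create(exposed, 64, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)
            /* Fetch 64 bytes from offset 0 of rank 1's window; the
             * target NIC translates the offset and DMAs the data. */
            MPI_Get(local, 64, MPI_BYTE, 1, 0, 64, MPI_BYTE, win);
        MPI_Win_fence(0, win);   /* the Get completes here */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }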

> For all the latencies I see quoted at hardware sites, it is very hard
> for me to figure out whether that's a latency supported only on paper,
> or a practical latency I can count on as a programmer, with all the
> software layers' overhead, when each CPU is running a program at 100%.

No, it's not likely to fit your usage. Vendors quote MPI latency on a 
ping-pong benchmark. That's pretty much the cost of sending/receiving 
an MPI message from user space to user space. Often, this is also with 
only 2 nodes, optimal conditions, and everybody holding their breath.

You want an RMA Get. The latency of a Get is larger than that of an MPI 
send. For 64 bytes, it is basically the MPI latency for 0 bytes (for 
the Get request) + the latency for 64 bytes (for the reply). Assuming 
that you don't Get all over the host memory, the virtual-to-physical 
translation will be hot in the target NIC, so the translation cost will 
be very small. You want less than 3 us per Get of 64 bytes? I don't 
know if even Quadrics can do it. The good news is that you can pipeline 
Gets very well. So it may cost more than 3 us for one Get, but you may 
complete a Get every 0.5 us if you post a bunch of them.
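
To illustrate that pipelining, a sketch reusing a window 'win', a
buffer 'local' and a rank 'target' set up as above, and assuming
'local' and the window are large enough for 128 x 64 bytes:

    /* Post a batch of independent Gets back to back: their ~3 us
     * round-trips overlap on the wire, so the per-Get cost
     * approaches the ~0.5 us gap instead of the full latency. */
    for (int i = 0; i < 128; i++)
        MPI_Get(local + i * 64, 64, MPI_BYTE, target,
                (MPI_Aint)i * 64, 64, MPI_BYTE, win);
    MPI_Win_fence(0, win);   /* all 128 Gets complete here */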

> Secondly (as I'm not a cluster expert, I don't know how to avoid this),
> it's of course a big LOSS in sequential speed if my program must check
> every few instructions whether there is some MPI message to handle.

If you want perfect overlap and you are ready to go as low-level as 
possible, one-sided communications are for you (no host CPU involved on 
the target side). All low-level communication interfaces support 
one-sided communications (not yet released for MX on Myrinet, but GM 
has it).

> However, if I show up with 2 PCs and 2 network cards, then it sure
> matters when I lose a lot of speed.
> 
> Obviously for embarrassingly parallel software this is no issue, but
> usually for embarrassingly parallel software all you need is Gigabit
> Ethernet.

If you can overlap and know how to, latency is irrelevant. It's hard to 
do in complex irregular codes, but you can usually manage it if you can 
use one-sided communications. Don't put your communications in the 
critical path: post them early and post many of them concurrently, and 
pipelining will hide the latency from the critical path.

That's why desperate vendors use pipelined ping-pong to get better curves.
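
To make the "post them early" advice concrete, a sketch with MPI
passive-target synchronization; compute_on() is a hypothetical
placeholder for the work that hides the latency:

    /* Keep the Get out of the critical path: post it early,
     * compute on data you already have, synchronize late. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(next, 64, MPI_BYTE, target, 0, 64, MPI_BYTE, win);
    compute_on(current);           /* overlap: useful work here */
    MPI_Win_unlock(target, win);   /* the Get is guaranteed done */
    compute_on(next);              /* now safe to use the data */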

> There are so many MPI applications which are not exactly embarrassingly
> parallel where you see that a decent programmer on a single CPU would be
> doing that 20 times faster. Or to quote someone who has been doing such

Most of the time, you go parallel to go bigger, not faster. If the 
problem size fits in one node, don't use a cluster; use a 
multi-processor node. You will get more bang for your buck.

> So it is very interesting for us all, and me especially, to understand
> how *fast* you can get that memory under full load of all the logical
> CPUs.

With one-sided communications, it makes little difference whether the 
CPUs on the target side are loaded or not.

> Third, each PC has 2 cheapo K7 processors, which are a lot slower than
> Opterons.

The I/O bus matters more for the communication part. I don't know of 
cheapo K7 machines with a decent PCI bus. However, for 64 bytes, even a 
cheesy PCI bus will not slow things down that much.

> Dolphin says on their homepage that they can deliver 'bytes' in 3.3 us
> on MPX mainboards, and somewhere they claim a paper latency of 1.x us.

How long can you hold your breath?

> What is the achieved read speed to remote memory that Myrinet gets at
> 64 bits/66 MHz in software, i.e. 4-64 bytes ready to use for applications?

I have no idea; I am not even sure that I have a 64-bit/66 MHz machine 
around to measure it. With GM, I would say at least 10 us. Certainly more.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


