[Beowulf] Three notes from ISC 2006
patrick at myri.com
Wed Jun 28 12:04:34 PDT 2006
Joachim Worringen wrote:
> An offer for "getting a secret white paper on request" is marketing, you
> are right. But at least the SPEC number was technical content - and we
> don't want to analyse every posting sentence-by-sentence, do we?
The SPEC stuff was actually fine. I didn't register it in my brain
because I don't care about compiler stuff, but you are right. Actually,
the white paper was borderline but acceptable; the
I-know-something-but-I-can't-tell-you was the problem, and I trust that
Greg will be aware of the sensibilities.
> Let me summarize what I consider the key issues:
> - explicit MPI_Irecv/MPI_Send/MPI_Wait, or similar patterns implicitly
> in MPI_Reduce/MPI_Alltoall/MPI_Allreduce with small messages (a few
> doubles, or a few kB) are the dominant communication pattern in many MPI
> applications. There are quite some (but not as many as one could wish)
> studies that show this.
> - This means it's generally a good thing if the "ping" latency (duration
> of MPI_Send in number of CPU cycles) is as low as possible.
There are two metrics here. The latency is the time it takes for the
message to be received by the other side. The duration of MPI_Send is
actually the send overhead, typically the time it takes to copy the data
from the application buffer to an internal buffer. It makes no sense to
do zero-copy for small messages, unless you patch the kernel to not have
to deal with memory registration (Quadrics can do that) or unless you
use a lightweight kernel that has no virtual memory (Cray and Blue Gene).
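
To make the difference between the two metrics concrete, here is a toy
model in Python. It is only a sketch: the function names and every
constant (copy bandwidth, fixed cost, wire time, receive cost) are
made-up placeholders, not measurements of any interconnect. The only
point is that the send overhead is one component of the latency, not
the latency itself.

```python
def send_overhead_us(nbytes, copy_bw_bytes_per_us=1000.0, fixed_us=0.5):
    """Time the sending CPU spends inside the send call (hypothetical model).

    Modeled as a fixed cost plus the time to copy the payload from the
    application buffer into an internal buffer.
    """
    return fixed_us + nbytes / copy_bw_bytes_per_us


def latency_us(nbytes, wire_us=2.0, recv_us=0.5, **overhead_kwargs):
    """Time until the message is received by the other side (hypothetical model).

    Send overhead plus assumed wire time and receive-side cost.
    """
    return send_overhead_us(nbytes, **overhead_kwargs) + wire_us + recv_us


if __name__ == "__main__":
    for size in (8, 64, 1024):
        print(size, send_overhead_us(size), latency_us(size))
```

With these placeholder values the sender is free again well before the
message has arrived, which is why the duration of MPI_Send alone says
little about latency.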
> - At this message size, CPU utilization or overlapping computing and
> communication is not relevant, as (zero-copy) RDMA does not pay off
> until the message gets at least some (typically >32, or more) kB in
> size, due to the implied pinning and rendez-vous overhead. Also,
> MPI_Send has no opportunity for overlap, and having a progress thread on
> the receive CPU steal cycles from the application doesn't really help,
Absolutely correct. Overlap is irrelevant for small messages. Progress
can be a problem with extreme cases though: if you have a lot of
incoming small messages but the application does not consume them or
call MPI, then you will have a flow control problem and progression is
useful. But this is a pathological case.
> - In these cases, all(?) interconnects do some sort of memcpy() within
> MPI_Send to get rid of the data. The differences are
> * How long does it take to prepare things for the memcpy()? This is
> Greg's message rate.
I don't think it's the time it takes to prepare things. For very small
messages, Greg and I both do PIO, i.e. we copy directly to the NIC. I do
that up to 128 bytes, because PIO writes stall your processor on a slow
I/O bus. Greg does that for all message sizes on the send side, from what
I hear through the grapevine. From 128 bytes to 32 KB, I do (pipelined
memcpy + DMA), and then zero-copy above 32 KB. I could do PIO writes up
to 4 KB, for example, and this is exactly what I will do for MX-10G
because PCI-Express will not stall the processor as much as PCI-X does.
That's a tradeoff between PIO writes and memcpy/DMA, and the parameters
are different on HyperTransport or PCI-X/PCI-Express.
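
The size thresholds above can be written down as a small selection
function. This is a minimal sketch in Python, not MX code: the names
(SendPath, choose_send_path) are hypothetical, and treating 128 bytes
and 32 KB as inclusive upper bounds is my reading of "up to 128 bytes"
and "zero-copy above 32 KB".

```python
from enum import Enum

# Thresholds quoted in the post; inclusive upper bounds are an assumption.
PIO_MAX_BYTES = 128
COPY_DMA_MAX_BYTES = 32 * 1024


class SendPath(Enum):
    PIO = "PIO write directly to the NIC"
    COPY_DMA = "pipelined memcpy + DMA"
    ZERO_COPY = "zero-copy"


def choose_send_path(nbytes, pio_max=PIO_MAX_BYTES, copy_max=COPY_DMA_MAX_BYTES):
    """Pick the send-side data path for a message of nbytes bytes."""
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    if nbytes <= pio_max:
        return SendPath.PIO
    if nbytes <= copy_max:
        return SendPath.COPY_DMA
    return SendPath.ZERO_COPY


if __name__ == "__main__":
    for size in (8, 128, 129, 32 * 1024, 32 * 1024 + 1):
        print(size, choose_send_path(size).name)
```

Moving the PIO limit to 4 KB, as planned for MX-10G, is then just
choose_send_path(n, pio_max=4096). The tradeoff lives entirely in the
two thresholds, which is why they differ from one I/O bus to another.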
> But I don't think that Greg's "Real Application Performance" white paper
> is infamous. It states where the data comes from, you have to trust him
> for his own numbers, and it does not directly link the differences in
> the application performance to the messaging rate. Of course, it does
> not offer a scientific analysis, and you can not compare it to papers
> like the ones from Leonid Oliker. But I don't think it's unfair, and
> surely stimulates the competition for better technical solutions or
> better white papers.
White papers are evil by definition. They show what you want to show,
and there is no peer review so you can say what you want.
It's not fair to use old hardware/software or to use third-party results
that you know nothing about. If you want to do a comparison, get your hands
on your competitors' products and do the testing yourself. We bought a
Quadrics cluster a long time ago to do just that :-) You can also ask
friends to get access to clusters. The web is the last place I would
look to find reliable information.