[Beowulf] Re: Re: Home beowulf - NIC latencies

Fri Feb 11 12:14:11 PST 2005

On Fri Feb 11, 2005 11:49:48... Joachim Worringen wrote
> Greg Lindahl wrote:
> >On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote:
> >>If I have a fantastic device that uses infinitely small time (latency) 
> >>and moves huge amounts of data (bandwidth) but in doing so it takes 80% 
> >>of a CPU, we do not have a useful solution..
> >
> >If large cpu usage is a problem, it will show up nicely in real
> >application benchmarks.
> 
> True. I always wonder what the low-CPU-usage-advocates want the MPI 
> process to do while i.e. an MPI_Send() is executed. For small messages 
> (which are critical for many applications), it's somewhat like 
> requesting that a local memory-write has to show low CPU usage.

For blocking operations with short messages, low CPU usage shouldn't be the
main concern: the process is stalled inside the call anyway, so measuring
latency relative to CPU usage doesn't make much sense.

> 
> Of course, I can think of scenarios in which data transfers w/o CPU 
> usage do promise advantages, and I have implemented and evaluated such 
> techniques myself. But in the end (for the application), it always 
> boiled down to latency and bandwidth as most applications don't honor 
> "true" asynchronous communication.

Yep.  We have several micro-benchmarks that measure the overlap potential of
the network, but I've never seen anything that measures the overlap potential
of an application.  It would be interesting to see that number for real
applications.
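
As a rough illustration of what "overlap potential of an application" could
mean, here is a minimal sketch in Python.  It is only a model, not an existing
benchmark: the function names, the per-iteration timings, and the idea of
splitting compute time into a part that does not depend on the data in flight
are all assumptions made for the example.

```python
def overlap_potential(comm_time, independent_compute_time):
    """Fraction of communication time the application could hide.

    comm_time: time one iteration spends communicating (hypothetical input).
    independent_compute_time: compute time that does not need the data
    currently in flight (hypothetical input).
    """
    if comm_time <= 0:
        return 0.0
    return min(comm_time, independent_compute_time) / comm_time


def iteration_time(comm_time, compute_time, independent_compute_time,
                   network_overlap=1.0):
    """Modelled time per iteration.

    network_overlap is the fraction of hideable communication the network
    actually overlaps (1.0 = perfect overlap, 0.0 = none).
    """
    hidden = min(comm_time, independent_compute_time) * network_overlap
    return compute_time + comm_time - hidden


if __name__ == "__main__":
    # Example: 10 units of communication, 30 of compute, 5 of which are
    # independent of the message being transferred.
    print(overlap_potential(10, 5))
    print(iteration_time(10, 30, 5))
    print(iteration_time(10, 30, 5, network_overlap=0.0))
```

In this model a network with perfect overlap still only helps as much as the
application's independent compute time allows, which is the point of asking
for an application-side number.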

> 
> The latest unsuccessful case of uncoupling computation and MPI 
> communication I read about was BG/L when using the second CPU as a 
> message processor. Maybe Myrinet MX will behave differently by making 
> the MPI itself more concurrent on hardware level (is this a correct 
> description, Patrick?) - but it will need matching applications, too.
> 

BG/L is unique in many ways.  For example, using the second processor for
communications doesn't actually help with progress -- the application still has
to make MPI library calls to make progress on outstanding posted operations.
So, even if the application was coded to take advantage of overlap, it
probably wouldn't gain much by using the second processor.
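
The difference between overlap with and without independent progress can be
sketched with a toy tick-based model in Python.  This is not a description of
how BG/L or any MPI library is implemented: the function name, the tick
counts, and the assumption that the application makes no library calls until
it waits at the end of its compute phase are all made up for the example.

```python
def total_ticks(compute_ticks, transfer_ticks, independent_progress):
    """Ticks until both the computation and a posted transfer are done.

    independent_progress=True: the transfer advances during compute.
    independent_progress=False: the transfer only advances once the
    application calls into the library (modelled here as the final wait).
    """
    remaining = transfer_ticks
    ticks = 0

    # Compute phase: the application does one tick of work per iteration
    # and makes no library calls.
    for _ in range(compute_ticks):
        ticks += 1
        if independent_progress and remaining > 0:
            remaining -= 1

    # Wait phase: the application is now inside the library, so the
    # transfer makes progress in either model.
    while remaining > 0:
        ticks += 1
        remaining -= 1

    return ticks


if __name__ == "__main__":
    print(total_ticks(100, 40, independent_progress=False))
    print(total_ticks(100, 40, independent_progress=True))
```

Without independent progress the two phases simply add up, so an application
written to overlap gains nothing in this model; with independent progress the
shorter phase is hidden behind the longer one.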

MX should be able to provide overlap and progress, like Quadrics and a few
other technologies do.

-Ron