Network Characteristics and Applications

Fri Jan 4 14:01:53 PST 2002

Yea! beowulf.org is back!

> The Question... Which specific parallel applications/algorithms/problem 
> classes benefit significantly from bandwidth increases, decreased network 
> latency or a combination of both?

Here are some gross generalizations that might help:

1) With most algorithms, the less data per cpu, the more bandwidth and
latency count. So runs with lots of cpus, or smaller datasets, are
harder.
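
To make the "less data per cpu" point concrete, here is a rough sketch of
the usual surface-to-volume argument. It assumes a grid split evenly into
hypercubic blocks with a one-site-deep halo exchanged on every face; the
function names and the 128^3 grid are made up for illustration and are
not taken from any particular code.

```python
def comm_to_compute_ratio(side, dims=3):
    """Halo sites exchanged per site updated, for a hypercubic block.

    A block with `side` sites along each of `dims` dimensions holds
    side**dims sites and has 2*dims faces of side**(dims-1) sites each.
    """
    volume = side ** dims
    surface = 2 * dims * side ** (dims - 1)
    return surface / volume


def block_side(global_side, ncpus, dims=3):
    """Side of each cpu's block, assuming a perfectly even split."""
    return global_side / ncpus ** (1.0 / dims)


# Hypothetical 128^3 grid: more cpus means smaller blocks, so each
# site updated costs relatively more communication.
for ncpus in (8, 64, 512):
    side = block_side(128, ncpus)
    print(ncpus, round(side, 2), round(comm_to_compute_ratio(side), 3))
```

The ratio works out to 2*dims/side, so halving the block side doubles
the communication per unit of computation.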

Example: Climate modeling generally involves running a relatively
coarse grid for a large number of timesteps. It's hard to get a good
speedup unless you have a really great machine, and so there was some
brouhaha recently about how the US needed to buy (Japanese) vector
machines for this problem. (However, I don't think this is the case;
the climate people simply need to use best practices with MPI.)

Example: QCD, quantum chromodynamics. QCD computes on a 4-dimensional
grid. Sometimes people want to compute large grids, sometimes
small. Less data on a node means relatively more communications and
lower required latencies. Steve Gottlieb has a theoretical slide
demonstrating this:

http://physics.indiana.edu/~sg/utah/performance_model.html

If you want to build a QCD machine that sustains 10 TFlop/s over a
wide range of grid sizes, this is a hard problem. For example, if I
have a 200 MF/s sustained processor, I can get to a local grid size of
4^4 using Myrinet and 12^4 using fast ethernet. 12^4 is such a large
local grid that it isn't very useful for fast computations.
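
A toy version of this kind of estimate is sketched below. It is not the
model from the slide above: it simply assumes that computation and
communication do not overlap, that each of the 8 faces of a 4-D block
costs one message, and it asks for the smallest local grid at which
communication takes no longer than computation. Only the 200 MF/s figure
comes from the text; the flops per site, bytes per face site, and both
network parameter sets are placeholders, not measurements of Myrinet or
fast ethernet.

```python
def step_times(side, flops_per_site, flop_rate,
               bytes_per_face_site, latency, bandwidth, dims=4):
    """Return (compute_seconds, comm_seconds) for one sweep of a local grid."""
    compute = flops_per_site * side ** dims / flop_rate
    face_bytes = bytes_per_face_site * side ** (dims - 1)
    comm = 2 * dims * (latency + face_bytes / bandwidth)
    return compute, comm


def smallest_balanced_side(max_side=64, **params):
    """Smallest even local side where comm time <= compute time, else None."""
    for side in range(2, max_side + 1, 2):
        compute, comm = step_times(side, **params)
        if comm <= compute:
            return side
    return None


# Placeholder workload: 200 MF/s sustained (from the text), the rest assumed.
workload = dict(flops_per_site=1000.0, flop_rate=200e6,
                bytes_per_face_site=100.0)
# Placeholder networks: latency in seconds, bandwidth in bytes/second.
low_latency_net = dict(latency=10e-6, bandwidth=100e6)
high_latency_net = dict(latency=100e-6, bandwidth=10e6)

print(smallest_balanced_side(**workload, **low_latency_net))
print(smallest_balanced_side(**workload, **high_latency_net))
```

With any parameters of this shape, the slower network pushes the usable
local grid larger, which is the effect described above.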

Example: Weather forecasting. Similar to climate, but there are
multiple kinds of forecasts: regional, national, global, each with
more data. The regional forecast is *hardest* to speed up because it
has the least data. You can get a speedup of say 8x today with fast
ethernet before you hit a wall. But if you're doing global forecasts,
you can get a much bigger speedup. The 10x number comes from an
experiment that the Utah people did for their upcoming Olympic
forecasts. Separately, while doing the FSL bid, I computed that an
extra 100 usec of latency
wouldn't hurt their 40km national forecast at all, and the average
bandwidth needed was 1/3 gigabit/sec, at 40-odd cpus.
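
The arithmetic behind that sort of budget can be sketched as follows.
This is only an illustration of the bookkeeping, assuming every message
pays the extra latency serially (a pessimistic case); the message count,
step time, and byte totals are hypothetical and are not the FSL numbers.

```python
def added_latency_fraction(messages_per_step, extra_latency_s, step_time_s):
    """Fraction by which a timestep grows if every message waits longer."""
    return messages_per_step * extra_latency_s / step_time_s


def average_bandwidth_gbit(bytes_moved, wall_seconds):
    """Average bandwidth in gigabits per second over a run."""
    return bytes_moved * 8 / wall_seconds / 1e9


# Hypothetical: 10 messages per step, 100 usec extra each, 1 s per step.
print(added_latency_fraction(10, 100e-6, 1.0))
# Hypothetical: 125 MB moved over 3 s of wall time.
print(average_bandwidth_gbit(125e6, 3.0))
```

If the first number is tiny, the application is not latency-bound at
that size, and the second number says how much pipe it needs on average.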

2) With other algorithms, the range of data sizes people want to use
is in a fairly linear area of performance on some hardware. One
example of this is CHARMM on the Cray T3E, which has a great
interconnect (and a slow processor) by today's standards.

I actually built a little tool using the MPI profiling interface which
does some gross computations of compute/comm ratios. I'd like to turn
it into a tool usable by the community; would anyone like to volunteer
to help? With such a tool you could take existing MPI codes and find
out how they behave in practice.
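
For anyone unfamiliar with the idea: the MPI profiling interface lets a
library define its own MPI_Send (and so on) and forward to PMPI_Send, so
time spent in communication can be accumulated without touching the
application. The sketch below is not the tool described above; it only
mimics the bookkeeping in Python, with a made-up CommTimer wrapper and a
FakeComm stand-in instead of a real MPI library.

```python
import time


class CommTimer:
    """Wrap a communicator-like object and time every method call on it."""

    def __init__(self, comm, clock=time.perf_counter):
        self._comm = comm
        self._clock = clock
        self.comm_seconds = 0.0
        self.calls = {}
        self._start = clock()

    def __getattr__(self, name):
        # Only reached for names not set in __init__, i.e. the wrapped calls.
        target = getattr(self._comm, name)
        if not callable(target):
            return target

        def timed(*args, **kwargs):
            t0 = self._clock()
            try:
                return target(*args, **kwargs)
            finally:
                self.comm_seconds += self._clock() - t0
                self.calls[name] = self.calls.get(name, 0) + 1

        return timed

    def report(self):
        """Everything outside the wrapped calls is counted as compute."""
        total = self._clock() - self._start
        compute = total - self.comm_seconds
        ratio = compute / self.comm_seconds if self.comm_seconds > 0 else float("inf")
        return {"total": total, "comm": self.comm_seconds,
                "compute": compute, "compute_to_comm": ratio}


class FakeComm:
    """Hypothetical stand-in for a communicator; send just sleeps."""

    def send(self, payload):
        time.sleep(0.01)


comm = CommTimer(FakeComm())
sum(i * i for i in range(100000))  # stand-in for real computation
comm.send(b"halo")
print(comm.report())
```

A real version would do the same accounting per MPI call and per rank,
and print the compute/comm ratio at MPI_Finalize.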

greg