[Beowulf] Performance characterising a HPC application

Wed Mar 21 03:41:07 PDT 2007

On Mar 21, 2007, at 12:05 AM, Mark Hahn wrote:

> if the net is a bandwidth bottleneck, then you'd see lots of back-to-back
> packets, adding up to near wire-speed.  if latency is the issue, you'll see
> relatively long delays between request and response (in NFS, for instance).
> my real point is simply that tcpdump allows you to see the unadorned truth
> about what's going on.  obviously, tcpdump will let you see the rate and
> scale of your flows, and between which nodes...
>
>>> anything which doesn't speed up going from gigabit to
>>> IB/10G/quadrics is what I would call embarrassingly parallel...
>>
>> True - I guess I'm trying to do some cost/benefit analysis so the  
>> magnitude of the improvement is important to me .. but maybe  
>> measuring it on a test cluster is the only way to be sure of this  
>> one.
>
> well, maybe.  it's a big jump from 1x Gb to IB or 10GE - I wish it
> were easier to advocate Myri 2G as an intermediate step, since I
> actually don't see a lot of apps showing signs of dissatisfaction
> with ~250 MB/s interconnect - and IB/10GE don't have much
> advantage, if any, in latency.

Mark,

I have not benchmarked any applications that need more than 250 MB/s
during computation, although I know someone at ORNL who could get
close to 125 MB/s on the X1e (which doesn't use a Myricom fabric).
Where 10G comes into play is data movement. You can get ~700-900 MB/s
with IB SDR, ~1,200 MB/s with Myri-10G (Ethernet or MX), and ~1,400
MB/s using IB DDR with Lustre, for example.

There is little difference in small-message latency between
Myrinet-2000 E and F cards and Myri-10G. Once messages grow beyond
1 KB, the extra bandwidth helps. As always, profile your code.
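
To make the small-message point concrete, here is a rough sketch of the
usual first-order model, time = latency + size / bandwidth. The bandwidth
values are the ~250 MB/s and ~1,200 MB/s figures quoted above; the 3 us
latency is purely an assumed placeholder (set equal for both fabrics), not
a measurement, and the names are made up for illustration.

```python
# First-order transfer-time model: time = latency + size / bandwidth.
# LATENCY_S is an ASSUMED placeholder, identical for both fabrics;
# it is not a measured number for any real card.
LATENCY_S = 3e-6
MB = 1e6  # bytes per MB, decimal

# Approximate bandwidths taken from the figures quoted in this thread.
BANDWIDTH = {
    "2G": 250 * MB,    # ~250 MB/s interconnect
    "10G": 1200 * MB,  # ~1,200 MB/s
}


def transfer_time(size_bytes, bandwidth, latency=LATENCY_S):
    """Estimated seconds to move one message of size_bytes."""
    return latency + size_bytes / bandwidth


def speedup(size_bytes):
    """Ratio of the slower fabric's time to the faster fabric's time."""
    slow = transfer_time(size_bytes, BANDWIDTH["2G"])
    fast = transfer_time(size_bytes, BANDWIDTH["10G"])
    return slow / fast


if __name__ == "__main__":
    for size in (64, 1024, 65536, 1048576):
        print(f"{size:>8} B  estimated speedup {speedup(size):.2f}x")
```

With equal latency, the model gives almost no gain for tiny messages and
approaches the raw bandwidth ratio for large ones. Real fabrics have
protocol switch points and other effects this ignores.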

<product plug>
If you are using MX, our debug library can optionally report the
count of messages in each size class at the completion of the run.
Used alongside a profiler like pMPI, it can help you determine
whether your app is more latency- or bandwidth-sensitive.
</product plug>
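
If you are not on MX, something similar can be approximated from any list
of message sizes you have logged yourself. The sketch below is not the MX
debug library's output or API; the function names, the power-of-two size
classes, and the 1 KB cutoff (borrowed from the rule of thumb above) are
all assumptions for illustration.

```python
from collections import Counter


def size_class(nbytes):
    """Smallest power of two >= nbytes (hypothetical size-class scheme)."""
    bound = 1
    while bound < nbytes:
        bound *= 2
    return bound


def summarize(sizes, threshold=1024):
    """Return (histogram, fraction of small messages, fraction of bytes
    carried by large messages). threshold is an assumed 1 KB cutoff."""
    if not sizes:
        return Counter(), 0.0, 0.0
    hist = Counter(size_class(n) for n in sizes)
    small_count = sum(1 for n in sizes if n <= threshold)
    total_bytes = sum(sizes)
    large_bytes = sum(n for n in sizes if n > threshold)
    large_fraction = large_bytes / total_bytes if total_bytes else 0.0
    return hist, small_count / len(sizes), large_fraction


if __name__ == "__main__":
    # Made-up sample sizes in bytes, only to show the call.
    sample = [8, 8, 1000, 4096]
    hist, small_frac, large_byte_frac = summarize(sample)
    for bound in sorted(hist):
        print(f"<= {bound:>8} B: {hist[bound]} messages")
    print(f"small messages: {small_frac:.0%}, "
          f"bytes in large messages: {large_byte_frac:.0%}")
```

Many small messages with few bytes in large ones hints at latency
sensitivity; most bytes in large messages hints at bandwidth sensitivity.
It is only a hint, since overlap and synchronization are not captured.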

Scott