[Beowulf] Latency dependant software

Thu May 5 11:46:21 PDT 2005

At 06:03 PM 5/3/2005 +0200, Philippe Blaise wrote:
>Robert G. Brown wrote:
>
>>....
>>
>>Still, the marketplace speaks for itself.  It doesn't argue, and isn't
>>legendary, it just is.
>>....
>>  
>>
>

Allo,

>But, does the hpc marketplace have a direction ?

I had missed you are from France, sorry for that.

Like me you have a language problem of course, and you may expect a long
reply from RGB, who takes grammatical constructions as used in Europe as a
sign of poor English rather than of respect; but your questions and
analysis are very relevant.

I feel HPC is moving in 2 different directions. The 2 categories I
distinguish:
  A) embarrassingly parallel machines with the only focus upon dollar cost
per gflop, which require a network that just delivers enough bandwidth from
one node to another (as we can see from the Earth Simulator, one way ping
pong latency is allowed to be horrible).
  B) "one way ping pong latency" dependent supercomputers/clusters with the
focus upon processors being fast for branchy codes and algorithms that
aren't easy to parallelize. Usually integer or database type applications
with huge amounts of code being executed.
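
A rough way to see why category A tolerates bad latency and category B
does not is the common "latency plus size over bandwidth" cost model. The
sketch below is only an illustration: the function name, the two workloads
and the network figures are hypothetical and are not measurements of any
machine mentioned in this thread.

```python
def transfer_time_us(n_messages, bytes_per_message, latency_us, bandwidth_mb_s):
    """Total time in microseconds under a simple latency + size/bandwidth model."""
    # 1 MB/s is 1 byte per microsecond, so bytes / (MB/s) is already in us.
    per_message_us = latency_us + bytes_per_message / bandwidth_mb_s
    return n_messages * per_message_us


# Hypothetical workloads, both moving 8 MB in total.
BULK = dict(n_messages=8, bytes_per_message=1_000_000)    # category A style
SMALL = dict(n_messages=1_000_000, bytes_per_message=8)   # category B style

# Hypothetical networks: same bandwidth, different one way latency.
BANDWIDTH_MB_S = 1000.0
SLOW_LATENCY_US = 5.0
FAST_LATENCY_US = 1.0


def latency_penalty(workload):
    """How many times slower the workload runs on the high latency network."""
    slow = transfer_time_us(latency_us=SLOW_LATENCY_US,
                            bandwidth_mb_s=BANDWIDTH_MB_S, **workload)
    fast = transfer_time_us(latency_us=FAST_LATENCY_US,
                            bandwidth_mb_s=BANDWIDTH_MB_S, **workload)
    return slow / fast


print(f"bulk transfers:  {latency_penalty(BULK):.2f}x slower")
print(f"small transfers: {latency_penalty(SMALL):.2f}x slower")
```

Under these assumed numbers the bulk workload barely notices the worse
latency, while the workload made of 8 byte messages slows down by nearly
the full factor of 5.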

In the past it was possible to make 1 chip that performed well overall and
was fast for both kinds of code. I am slowly starting to doubt this for the
future.

Most interesting for category A in the coming years are CELL type
processors.

For category B the race is open. Usually what matters for category B is how
low the latency is to 'remote' memory.

Personally I see little future, in terms of SOFTWARE performance, for
beowulfs that combine integer type processors good at branchy code with bad
latency networks. Usually software is either very well vectorizable and
embarrassingly parallel, or it requires a lot of branchy code and can be
sped up enormously by using gigabytes, preferably terabytes, of RAM.

>Few years ago, some people had a "fantastic vision" to replace the 
>vector machines market :
>use big clusters of SMPs with the help of the new paradigm of hybrid 
>mpi/openmp programming.
>Then the main vendors (usa), except Cray, were very happy to sell giant 
>clusters of smp machines.
>
>Nevertheless, the japanese guys built the "earth simulator" ; which is 
>still the most powerful machine in the world
>(don't trust this stupid top500 list).
>
>Then Cray came back ... with vector machines...
>
>Don't underestimate the power of vector machines.
>Yes Fujitsu or NEC vector machines are still very efficient, even with 
>non contiguous memory access (!!).
>
>One year ago, the only cpus that sometimes were able to equal vectorial 
>cpus were alpha (ev7) and itanium2 with
>big caches and / or fast memory access. Remember that alpha is dead. 
>Have a look to the itanium2 market shares.
>The marketplace is not a good argument at all.

Completely dead, or sunk in 1912. 

The big 361mm^2+ ship is not superior in integer workloads (opteron is way
faster) and it's not superior in floating point (several other chips
qualify for that).

A few years ago when I cheered loudly for itanium2 I had underestimated the
huge price a system would cost and that it would stay so low clocked for
such a long period of time.

Montecito might for a short period of time win the real world benchmarks at
integer workloads. Of course at a big price.

Perhaps intel might consider selling Montecito not 1000 chips at a time,
but 4 chips at a time. That makes the platform more attractive for 'quad
type users'. So Montecito will probably be fast again for category B,
keeping the itanium2 ship afloat for another hour.

>Vectorization and parallelization are compatible
>Hybrid mpi/openmp programming  is a harder task than mpi/vector programming.
>If you have enough money and if your program is vectorizable, buy a 
>vector machine of course.

See category A. Matter of a few cells with a decent high quality network.

>Cluster of SMPs ? they will remain an efficient and low cost solution, 
>(and quite easy to be sold
>by a mass vendor).
>And thanks to cluster of SMPs with the help of linux, the HPC market is 
>now "democratic".
>
>Of course, it would be nice to have a true vector unit on a P4 or Opteron.
>But the problem will be the memory access again.

I reacted to that before: with a relatively simple trick the opteron could
be made faster in floating point. I doubt however they'll do it, as it
would reduce yields, and any change in the execution cores seems to eat a
lot of time from engineering teams.

In another quoted message :
> Today everyone should be happy to see that some companies like Cray
> are trying to do better than HP or IBM in the linux cluster area, event 
> it if it's not
> the Cray of yesterday as you said.
> I think that this is partially due to the "NEC earth simulator effect", 
> and that linux clusters
> performances and functionalities are not fantastic for parallel jobs 
> compare to a good old T3E.

It is a good idea to look at a table produced by Ken Thompson a few years
ago already. The reason why the T3E looks so good is, apart from its
quadrics type network, also that i/o and network speed have not kept up
with processor speed, and connections are made more cheaply now.

Take for example the altix3000 series compared with the origin3800. Same
type of problem. 1 node in the origin3800 is 4 cpu's. 1 node in the
altix3000 is a dual.

2 routers/switches connect to each dual, together forming a brick of 4 cpu's.

So basically the routing system has become more complex, but the one way
ping pong latency of the altix3000 series is not better than that of the
origin3800 (see the presentation by prof Aad v/d Steen for SARA on 1 July
2003, www.sara.nl).

This while the 500 MHz origin3800 chip is 4 to 5 times slower than
itanium2 for my software.

To give a performance similar to that of the T3E, the network latency
would need to be 5 times lower.

Getting 8 bytes randomly from the memory of other cpu's on a 512 processor
origin3800 took 5.8 us. On a 512 processor altix3000 that will not be much
better, seeing that the one way pingpong times at 64 cpu's are already
3-4 us (similar to origin3800).

What would be needed is a random read latency of 1 us, or a one way
pingpong latency of 600-800 ns, to give the same performance for type B
applications.
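
The target figures can be reproduced by dividing the measured latencies by
the cpu speedup. That this is how the numbers were derived is an
assumption, and the helper name below is made up for the illustration; the
inputs (5.8 us random read, 3-4 us one way pingpong, a factor of 5) are
the ones quoted in this message.

```python
def required_latency_us(measured_latency_us, cpu_speedup):
    """Latency needed to keep the old cpu/network balance on a faster cpu."""
    return measured_latency_us / cpu_speedup


# Upper end of the "4 to 5 times" speedup quoted for itanium2 vs origin3800.
CPU_SPEEDUP = 5.0

random_read_us = required_latency_us(5.8, CPU_SPEEDUP)    # 512 cpu random read
pingpong_low_us = required_latency_us(3.0, CPU_SPEEDUP)   # one way pingpong, low
pingpong_high_us = required_latency_us(4.0, CPU_SPEEDUP)  # one way pingpong, high

print(f"random read target:      {random_read_us:.2f} us")
print(f"one way pingpong target: {pingpong_low_us * 1000:.0f}-"
      f"{pingpong_high_us * 1000:.0f} ns")
```

This gives roughly 1.16 us for the random read and 600-800 ns for the one
way pingpong, in line with the figures above.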

So the 'good performance of T3E' is not really true. It's the cpu's that
have become a lot faster, while the networks have not kept up with them in
one-way pingpong latency.

>Bye,
>
>  Phil.
>