[Beowulf] scalability

Gus Correa gus at ldeo.columbia.edu
Thu Dec 10 10:43:07 PST 2009


Hi Amjad

amjad ali wrote:
> Hi Gus,
> Thanks for your nice reply, as usual.
> 
> I ran my code on a single-socket Xeon node having two cores; it ran
> nearly linearly, at 97+% efficiency.
> 
> Then I ran my code on a single-socket Xeon node having four cores (Xeon
> 3220, which is really not a good quad core); I got an efficiency of
> around 85%.
> 
> But running 4 processes on four single-socket nodes (1 process on each
> node), I got an efficiency of around 62%.
> 
> 

This is about the same number I got for an atmospheric model
on a dual-socket dual-core Xeon computer.
Somehow the memory path/bus on these systems is not very efficient,
and it saturates when more than two processes do
intensive work concurrently.
A similar computer configuration with dual-socket dual-core
AMD Opterons performed significantly better on the same atmospheric
code (efficiency close to 90%).

I was told that some people used to run only two processes on
dual-socket dual-core Xeon nodes, leaving the other two cores idle.
Although it is an apparent waste, the argument was that it paid
off in terms of overall efficiency.

Have you tried to run your programs this way on your cluster?
Say, with one process only per node and N nodes,
then with two processes per node and N/2 nodes,
then with four processes per node and N/4 nodes.
This may tell you what is optimal for the hardware you have.
With Open MPI you can use the "mpiexec" flags
"-bynode" and "-byslot" to control this placement.
"man mpiexec" is your friend!  :)

> Yes, CFD codes are memory bandwidth bound usually.
>

Indeed, and so are most of our atmosphere/ocean/climate codes,
which have a lot of CFD, but also radiative processes, mixing,
thermodynamics, etc.
However, most of our models use fixed grids, whereas I suppose
some of your codes may use adaptive meshes.
I guess you are doing aerodynamics, right?

> Thank you very much.
> 

My pleasure.


I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


> 
> On Wed, Dec 9, 2009 at 9:11 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> 
>     Hi Amjad
> 
>     There is relatively inexpensive Infiniband SDR:
>     http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65
>     http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12
>     http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0
>     Not the latest and greatest, but faster than Gigabit Ethernet.
>     A better Gigabit Ethernet switch may help also,
>     but I wonder if the impact will be as big as expected.
> 
>     However, are you sure the scalability problems you see are
>     due to a poor network connection?
>     Could it be perhaps related to the code itself,
>     or maybe to the processors' memory bandwidth?
> 
>     You could test whether it is the network by running the program
>     inside one node (say on 4 cores) and then across 4 nodes with
>     one core in use on each node, or other combinations
>     (2 cores on each of 2 nodes).
> 
>     You could get an indication of the processors' scalability
>     by timing program runs inside a single node using 1, 2, 3, and 4 cores.
> 
>     My experience with dual-socket dual-core Xeons vs.
>     dual-socket dual-core Opterons,
>     with the type of code we run here (ocean, atmosphere, and climate
>     models, which are not totally far from your CFD), is that Opterons
>     scale close to linearly, but Xeons nearly get stuck in terms of scaling
>     when more than 2 processes (3 or 4) run on a single node.
> 
>     My two cents.
>     Gus Correa
>     ---------------------------------------------------------------------
>     Gustavo Correa
>     Lamont-Doherty Earth Observatory - Columbia University
>     Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
> 
> 
>     amjad ali wrote:
> 
>         Hi all,
> 
>         I have, with my group, a small cluster of about 16 nodes (each
>         one with a single-socket Xeon 3085 or 3110), and I face a problem
>         of poor scalability. Its network is quite an ordinary GigE switch
>         (perhaps a D-Link DGS-1024D 24-port 10/100/1000), store-and-forward,
>         priced at about $250 only.
>         ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf
> 
>         How should I work on that for better scalability?
> 
>         What would be better affordable options for fast switches?
>         (Myrinet and Infiniband are quite costly.)
> 
>         When buying a switch, what should we look for? What latency?
> 
> 
>         Thank you very much.
> 



