[Beowulf] Re: building a new cluster

Jeffrey B. Layton laytonjb at charter.net
Wed Sep 1 16:56:49 PDT 2004


SC Huang wrote:

> Thanks for your suggestions. I am summarizing the questions and answers
> here.
>  
> 1. cluster usage
>  
> The cluster will be used solely for running numerical simulation
> (number-crunching) codes written in Fortran 90 and MPI, which involve
> finite difference calculations and fast Fourier transforms for solving
> the 3D Navier-Stokes equations. It calls mpi_alltoall a lot (for the FFTs) as
> well as other mpi_send/recv, so communication is intensive. The
> problem is unsteady and 3D, so computation is also heavy. A typical
> run can take 1-2 weeks using 8-16 nodes (depending on the problem size).
>  


Doing frequency domain modeling of CFD results? Getting frequency domain
models of time domain simulations?
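
For anyone else following along, the reason mpi_alltoall dominates in codes
like this is the transpose step of the distributed FFT. A rough Fortran 90
sketch (the names, sizes, and decomposition are made up, not taken from the
actual code):

      ! Transpose step of a slab-decomposed 3D FFT -- sketch only.
      ! 'chunk' is the number of values each rank sends to every other rank.
      subroutine transpose_slabs(sendbuf, recvbuf, chunk, comm)
        implicit none
        include 'mpif.h'
        integer, intent(in) :: chunk, comm
        complex(kind=kind(0.0d0)), intent(in)  :: sendbuf(*)
        complex(kind=kind(0.0d0)), intent(out) :: recvbuf(*)
        integer :: ierr

        ! Every rank exchanges an equal-sized chunk with every other rank,
        ! so each transpose pushes most of the 3D field across the switch.
        call mpi_alltoall(sendbuf, chunk, MPI_DOUBLE_COMPLEX, &
                          recvbuf, chunk, MPI_DOUBLE_COMPLEX, comm, ierr)
      end subroutine transpose_slabs

Since that call hits every rank at once, the switch matters at least as much
as per-node peak speed for this kind of run.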

> We are currently doing OK with a "hybrid" 25-node (Compaq Alpha & Dell Xeon
> 2.4 GHz) cluster running over a 3Com 100 Mbps (Ethernet) switch with the
> LAM/MPI library.
> I will post some benchmarks later.
>  
> 2. Many people recommended Opteron (or at least encouraged a test run
> on Opteron) because it seems to be more cost-effective. I picked Xeon
> for the following reasons:
>  
> (1) the free Intel Fortran 90 compiler, which is also used on other
> individual workstations in our lab and some supercomputers that we
> have access to (we are kind of trying to stay away from the hassle of
> switching between compilers when writing new codes)


You can use the Intel compiler on the Opteron and it will run great! In fact,
the SPEC benchmarks for the Opteron were run with the Intel compiler :)
The next version of the compiler will support the EM64T chips and should
build 64-bit code on Opterons just fine.

However, if you do need to buy a compiler (I recommend doing so, just to get
64-bit code), they're not that expensive, especially with educational
pricing.
 

> (2) We have a few users sharing the cluster, so we have to get
> "enough" nodes


Yes, but if the nodes are faster the jobs go through faster, so you get the
same amount of work done, or more, with fewer of them.
 

> (3) Xeon seems to be more common, so it's easier to get consulting
> help or support


I disagree.
 

> BTW, what are the common Fortran 90 compilers that people use on
> Opteron? Any comparisons with other compilers?
>  
>  
> 3. My MPI code periodically writes out data files to local disk, so I
> do need a hard disk on every node. Diskless sounds good (cost,
> maintenance, etc.), but the data size seems too big to transfer to
> the head node (well, technically it can be done, but I would rather
> just use local scratch disk).


Don't discount diskless. It's a wonderful idea. You can use PVFS or even
Lustre to write out results. Just fill a certain percentage of the nodes with
disks for PVFS or Lustre and use those nodes for double duty (computation
and storage). Opterons could handle this easily.

However, to take full advantage of PVFS you might have to rewrite your
code a bit, but it's not that difficult.
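
Typically that rewrite just means doing the periodic dumps through MPI-IO
(ROMIO can talk to PVFS directly), so all the ranks write into one shared
file instead of per-node scratch files. A minimal sketch, assuming a simple
1D block decomposition in rank order (the names and sizes are mine, not from
the actual code):

      ! Collective write of a block-distributed array to one shared file.
      subroutine dump_field(u, nlocal, step, comm)
        implicit none
        include 'mpif.h'
        integer, intent(in) :: nlocal, step, comm
        double precision, intent(in) :: u(nlocal)
        integer :: fh, rank, ierr
        integer :: status(MPI_STATUS_SIZE)
        integer(kind=MPI_OFFSET_KIND) :: disp
        character(len=32) :: fname

        call mpi_comm_rank(comm, rank, ierr)
        write(fname, '(a,i6.6,a)') 'field_', step, '.dat'

        call mpi_file_open(comm, trim(fname), &
                           MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                           MPI_INFO_NULL, fh, ierr)

        ! Each rank writes its block at its own offset (8 bytes per value).
        disp = int(rank, MPI_OFFSET_KIND) * int(nlocal, MPI_OFFSET_KIND) * 8
        call mpi_file_write_at_all(fh, disp, u, nlocal, &
                                   MPI_DOUBLE_PRECISION, status, ierr)

        call mpi_file_close(fh, ierr)
      end subroutine dump_field

With something like that, the "data too big to ship to the head node" problem
goes away - the dumps go straight to whichever nodes are doing PVFS duty.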

If you don't go diskless, at least think about a ramdisk-based OS where the
local drives are used purely for storage. There are many good reasons for
doing this.

> 4. Managed or unmanaged?
>  
> People have already recommended some switches that I will not repeat here.
> However, I am still not clear on the difference between "managed" and
> "unmanaged" switches. Some vendors told me that I need a managed one, while
> others said the opposite. Will need to study more...


Unmanaged means you can't (and don't need to) log into the switch to
configure it, because there is nothing to configure. It just works (well, the
unmanaged switches from real vendors just work). With a managed switch you
have to log in and configure it. You, or ideally the vendor, need to
configure it for HPC (some vendors never bother to do this - sigh....).

You might also want to consider an alternative network. CFD codes can
respond very well to hypercube or torus networks. I know a vendor who can
do this for you very easily.
 

> 5. I only have wall-clock timings of my code on various platforms. I
> don't know how sensitive it is to cache size. I guess the bigger the
> cache, the better, because the code is operating on large arrays all the
> time.
>  
> I will post more summary here if I find out more information about 
> these issues. Thanks.
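

Before you lock in the switch, it might be worth knowing how much of that
1-2 week run is actually spent in the alltoall versus the number crunching.
A crude way to find out (assuming mpif.h is already included; nsteps and the
routine names are placeholders for your own loop and kernels):

      ! Crude split of wall-clock time into compute vs. communication.
      double precision :: t0, t_comp, t_comm
      integer :: step, rank, ierr

      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      t_comp = 0.0d0
      t_comm = 0.0d0

      do step = 1, nsteps
         t0 = mpi_wtime()
         call compute_rhs()                ! finite differences, local FFTs
         t_comp = t_comp + (mpi_wtime() - t0)

         t0 = mpi_wtime()
         call fft_transpose()              ! the mpi_alltoall-heavy part
         t_comm = t_comm + (mpi_wtime() - t0)
      end do

      if (rank == 0) print *, 'compute (s):', t_comp, '  comm (s):', t_comm

If the communication fraction is large, that argues for spending more of the
$100K on the interconnect; if it's small, spend it on more or faster nodes.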


Keep posting questions and ideas. Lots of good people here who can help.

Jeff

>  
> SCH
>
>
> SC Huang <schuang21 at yahoo.com> wrote:
>
>     Hi,
>      
>     I am about to order a new cluster using a $100K grant for running
>     our in-house MPI codes. I am trying to have at least 36-40 (or
>     more, if possible) nodes. The individual node configuration is:
>      
>     dual Xeon 2.8 GHz
>     512 KB L2 cache, 1 MB L3 cache, 533 MHz FSB
>     2GB DDR RAM
>     gigabit NIC
>     80 GB IDE hard disk
>      
>     The network will be based on a Gigabit Ethernet switch. Most vendors I
>     talked to use the HP ProCurve 2148 or 4148.
>      
>     Can anyone comment on the configuration (and the switch) above?
>     Any other comments (e.g., recommended vendors) are also welcome.
>      
>     Thanks!!!



