[Beowulf] Re: building a new cluster

Joe Landman landman at scalableinformatics.com
Wed Sep 1 18:46:25 PDT 2004


SC Huang wrote:

> Thanks, Jeff.
>  
> 1. It is not frequency-domain modeling. The FFTs, together with a 
> multigrid method, are used in solving a Poisson equation (for each 
> time step) in the solution procedure.


Nice.  FFTs have a communication pattern that emphasizes low-latency 
interconnects (or huge caches, if you can get the transforms to fit in 
there).  Since you are doing 3D (and I presume a 3D FFT and inverse 
FFT), the likelihood of them fitting in cache is small unless you are 
looking at small problems...

You will want to look at the Intel MKL and the AMD ACML.  They will help 
out with good implementations of some of these routines.
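
To make the communication pattern concrete, here is a rough Fortran 90
sketch (not your code; names are made up and the packing of blocks into
per-destination order is omitted) of the transpose step in a
slab-decomposed 3D FFT.  It is this all-to-all exchange, done for every
Poisson solve, that makes latency matter:

  ! sketch: global transpose in a slab-decomposed 3D FFT
  ! before the call each of np ranks owns nz/np xy-planes; after it,
  ! each owns nx/np yz-pencils, so the remaining 1D z-transforms
  ! (handled by MKL/ACML or your own routines) are purely local
  subroutine transpose_slabs(a, b, nx, ny, nz, np, comm)
    implicit none
    include 'mpif.h'
    integer, intent(in) :: nx, ny, nz, np, comm
    complex(kind=8), intent(in)  :: a(nx*ny*nz/np)   ! send blocks, packed per rank
    complex(kind=8), intent(out) :: b(nx*ny*nz/np)   ! received blocks
    integer :: blk, ierr

    ! every rank exchanges one equal-sized block with every other rank
    blk = (nx*ny*nz)/(np*np)
    call MPI_ALLTOALL(a, blk, MPI_DOUBLE_COMPLEX, &
                      b, blk, MPI_DOUBLE_COMPLEX, comm, ierr)
    ! local unpacking of b into pencil order would go here
  end subroutine transpose_slabs

Note that the per-rank message size shrinks like 1/np^2 while the number
of messages grows with np, so the exchange becomes latency-bound as you
add nodes.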

>  
> 2. I didn't know that the Intel Fortran compiler can run on the Opteron; 
> it was not mentioned on the Intel website. Thank you for pointing this out!!



I would be ... surprised ... if the Intel website indicated code 
generation for the Opteron :).  You should know that the Portland Group 
compiler does emit code for Xeons as well as Opterons (via the -tp ... 
switch).  I believe that the PathScale EKO compiler emits code for EM64T 
as well.  Both are available from their respective vendors as demo 
versions.


>  
> 3. Could you let me know the vendor who can do the hypercube or torus 
> networks? I would be very interested in that. Is it expensive?
>  
> 4. Thanks for the suggestions on diskless nodes and other file systems. I 
> will discuss that with my group members.


This is an interesting way to go if your I/O can be non-local, or if you 
just need local scratch space.  It makes building clusters quite fast 
and easy.

>  
> Overall, many people are suggesting an Opteron system. I just learned 
> that one of the vendors is providing us an Opteron (with PGI Fortran 
> compiler) cluster for test runs. I'd better rush to do that first. I 
> will post new results after the tests.


Make sure that the PGI version is 5.2.2 (their current download).  Make 
sure they install the ACML.  Also grab the PathScale and Intel compilers.  
Make sure that when you build your code, you build for the 64-bit 
architecture (which requires that the loaner machine is running in 64-bit 
mode).  The additional SSE2 and general-purpose registers available to 
the optimizer, and the simplified (non-segmented) addressing model, 
generally make your codes run faster.  Yes, you could use it as a fast 
Xeon in 32-bit mode.  If you do, in the cases I have been playing with, 
you give up lots of performance because your codes do not have access 
to all the nice features on the chip.
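
As a rough example (the exact flags and the ACML path are guesses -- 
check what the vendor actually installed), a 64-bit PGI build might look 
something like:

  # 64-bit Opteron target, SSE2 vectorization, linked against ACML
  pgf90 -tp k8-64 -fastsse -o solver solver.f90 \
        -L/opt/acml/pgi64/lib -lacml

Dropping the target back to -tp k8-32 (or -tp p7) gives you the 32-bit 
"fast Xeon" build for comparison.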

>  
> Also, I have heard the name "channel bonding" here and there. Is that 
> some kind of network connection method for clusters (using standard 
> switches to achieve a faster data transfer rate)? Can someone briefly 
> talk about it, or point me to some website where I can read about it? I 
> did some Google searching, but the materials are too technical for 
> me. :( Is it useful for a cluster of about 30-40 nodes?


There are pluses and minuses to channel bonding.  Search back through 
the archives for details.  I am not sure whether the issues with latency 
and out-of-order packet send/receive have been addressed in the 2.6 
kernel.
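
The short version: the Linux bonding driver ties two (or more) gigabit 
ports on each node into one logical interface and stripes traffic across 
them, so you need an extra NIC and an extra switch port per node, but no 
exotic hardware.  A bare-bones setup (addresses and interface names made 
up; the details differ a bit between 2.4 and 2.6) looks roughly like:

  # /etc/modules.conf (2.4 kernels) or /etc/modprobe.conf (2.6)
  alias bond0 bonding
  options bonding mode=0 miimon=100    # mode 0 = round-robin striping

  # bring the bond up and enslave two gigabit ports
  ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
  ifenslave bond0 eth0 eth1

The round-robin striping is where the extra bandwidth comes from, and it 
is also where the out-of-order packets come from, since consecutive 
packets of one stream leave through different ports.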


Joe

>  
> Thanks!
>  
>  
>
>
> */"Jeffrey B. Layton" <laytonjb at charter.net>/* wrote:
>
>     SC Huang wrote:
>
>     > Thanks for your suggestions. I am summarizing the questions and
>     answers
>     > here.
>     >
>     > 1. cluster usage
>     >
>     > The cluster will be used solely for running numerical simulation
>     > (number-crunching) codes written in Fortran 90 and MPI, which
>     > involve finite difference calculations and fast Fourier transforms
>     > for solving the 3D Navier-Stokes equations. It calls mpi_alltoall
>     > a lot (for the FFTs) as well as other mpi_send/recv, so
>     > communication is intensive. The problem is unsteady and 3D, so
>     > computation is also heavy. A typical run can take 1-2 weeks using
>     > 8-16 nodes (depending on the problem size).
>     >
>
>
>     Doing frequency domain modeling of CFD results? Getting frequency
>     domain
>     models of time domain simulations?
>
>     > We have been OK with a "hybrid" 25-node (Compaq Alpha & Dell Xeon
>     > 2.4GHz) cluster running right now using a 3Com 100 Mbps
>     > (Ethernet) switch and the LAM/MPI library.
>     > I will post some benchmarks later.
>     >
>     > 2. Many people recommended Opteron (or at least encouraged a test
>     > run on Opteron) because it seems to be more cost-effective. I picked
>     Xeon
>     > because of the following reasons:
>     >
>     > (1) free Intel FORTRAN 90 compiler, which is also used for other
>     > individual workstations in our lab and some supercomputers that we
>     > have access to (we are kind of trying to stay away from the
>     hassle of
>     > switching between compilers when writing new codes)
>
>
>     You can use the Intel compiler on the Opteron and it will run
>     great! In fact
>     the SPEC benchmarks for the Opteron were run with the Intel
>     compiler :)
>     The next version of the compiler will work for EM64T chips and should
>     build 64-bit code on Opterons just fine.
>
>     However, if you need to buy a compiler (I recommend doing so just
>     to get
>     64-bit code), they're not that expensive - especially with educational
>     pricing.
>
>
>     > (2) We have a few users to share the cluster, so we have to get
>     > "enough" nodes
>
>
>     Yes, but if the nodes are faster, the jobs go through faster and
>     you get the same amount of work done, or more.
>
>
>     > (3) Xeon seems to be more common, so it's easier to get
>     consulting
>     > or support
>
>
>     I disagree.
>
>
>     > BTW, what are the common fortran 90 compilers that people use on
>     > Opteron? Any comparison to other compilers?
>     >
>     >
>     > 3. My MPI code periodically writes out data files to local disk,
>     so I
>     > do need a hard disk on every node. Diskless sounds good (cost,
>     > maintenance, etc.), but the data size seems too big to be
>     transferred to
>     > the head node (well technically it can be done, but I would rather
>     > just use local scratch disk).
>
>
>     Don't discount diskless. It's a wonderful idea. You can use PVFS or
>     even Lustre to write out results. Just fill a certain percentage of
>     the nodes with disks for PVFS or Lustre and use those nodes for
>     double duty (computation and storage). Opterons could handle this
>     easily.
>
>     However, to take full advantage of PVFS you might have to rewrite
>     your code a bit, but it's not that difficult.
>
>     If you don't go diskless, at least think about a ramdisk-based OS,
>     where the local drives are all storage. There are many good reasons
>     for doing this.
>
>     > 4. Managed or unmanaged?
>     >
>     > People already recommended some switches that I will not repeat
>     here.
>     > However, I am still not clear about "managed" and "unmanaged"
>     > switches. Some vendors told me that I need a managed one, while
>     > others said the opposite. Will need to study more...
>
>
>     Unmanaged means you can't/won't/shouldn't log into the switch to
>     configure it, because you can't configure it. It just works (well,
>     the unmanaged switches from real vendors just work). A managed
>     switch you do need to configure by logging into it. You, or ideally
>     the vendor, need to configure it for HPC (some vendors never bother
>     to do this - sigh....).
>
>     You might also want to consider an alternative network. CFD codes can
>     respond
>     very well to Hypercubes or Torus networks. I know a vendor who can do
>     this for
>     you very easily.
>
>
>     > 5. I only have wall-clocking timing of my code on various
>     platforms. I
>     > don't know how sensitive it is to cache size. I guess the bigger
>     > cache, the better, because the code is operating on large arrays
>     all the
>     > time.
>     >
>     > I will post more summary here if I find out more information about
>     > these issues. Thanks.
>
>
>     Keep posting questions and ideas. Lots of good people here who can
>     help.
>
>     Jeff
>
>     >
>     > SCH
>     >
>     >
>     > SC Huang wrote:
>     >
>     > Hi,
>     >
>     > I am about to order a new cluster using a $100K grant for running
>     > our in-house MPI codes. I am trying to have at least 36-40 (or
>     > more, if possible) nodes. The individual node configuration is:
>     >
>     > dual Xeon 2.8 GHz
>     > 512 KB L2 cache, 1 MB L3 cache, 533 MHz FSB
>     > 2GB DDR RAM
>     > gigabit NIC
>     > 80 GB IDE hard disk
>     >
>     > The network will be based on a gigabit switch. Most vendors I
>     > talked to use HP Procurve 2148 or 4148.
>     >
>     > Can anyone comment on the configuration (and the switch) above?
>     > Any other comments (e.g. recommended vendor, etc.) are also welcome.
>     >
>     > Thanks!!!
>     >
>     >
>     >
>     >
>     >
>
>  
>


-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 612 4615



