[Beowulf] Re: building a new cluster

SC Huang schuang21 at yahoo.com
Wed Sep 1 22:38:15 PDT 2004


--- Joe Landman <landman at scalableinformatics.com> wrote:

> 
> Nice.  FFT's have a communication pattern that emphasises low latency
> interconnections (or huge caches if you can get them to fit in
> there).  Since you are doing 3D (and I presume 3D FFT and inverse
> FFT) the likelihood of them fitting in cache is small, unless you are
> looking at small problems...


I need to do 1D FFTs in some cases and 2D FFTs in others. The data
size (per MPI task) is several megabytes (double precision), plus some
working variables, so I doubt it can fit in cache completely.
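(As a rough illustration, not my actual problem size: even a single
512 x 512 slab of double-precision data is 512 * 512 * 8 bytes = 2 MB,
which already exceeds the 512 KB L2 / 1 MB L3 caches of the Xeons we
have been discussing, so at best only small blocks of an array stay
cache-resident at any time.)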


>
> You will want to look at the Intel MKL and the AMD ACML.  They will
> help out with good implementations of some of these routines.
>


I try to avoid vendor-specific libraries because I have to port the
code to various platforms from time to time, and portability is very
important to me. Currently I am using a package from netlib.org. I
plan to migrate to FFTW (but haven't had time to do so), as Tim Mattox
has suggested.
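For reference, here is roughly what the call sequence looks like
through FFTW 3's legacy Fortran interface (just a sketch adapted from
the FFTW documentation, not code from my solver; the transform size
and input data below are placeholders):

    ! Minimal 1D complex-to-complex transform via FFTW 3's legacy
    ! Fortran interface.  The plan is created once and reused;
    ! planning is expensive, execution is cheap.
    program fftw_sketch
      implicit none
      include 'fftw3.f'            ! FFTW_FORWARD, FFTW_ESTIMATE, ...
      integer, parameter :: n = 1024
      integer*8 :: plan            ! FFTW plans are opaque handles
      double complex :: in(n), out(n)

      in = (1.0d0, 0.0d0)          ! placeholder input data

      call dfftw_plan_dft_1d(plan, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
      call dfftw_execute(plan)
      call dfftw_destroy_plan(plan)
    end program fftw_sketch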

 
> > 2. I didn't know that the Intel FORTRAN compiler can run on
> > Opteron; it was not mentioned on the Intel website. Thank you for
> > pointing this out!!
>
> I would be ... surprised ... if the Intel website indicated code
> generation for the Opteron :).  You should know that the Portland
> Group compiler does emit code for Xeons as well as Opterons (the
> -tp ... switch).  I believe that the PathScale EKO emits code for
> EM64T as well.  Both are available from the respective suppliers for
> demo.


Yes, I am testing the Portland Group f90 compiler now. I will post my
findings later. :-)


>
> > 3. Could you let me know the vendor who can do the Hypercubes or
> > Torus networks? I will be very interested in that. Is it expensive?
> >
> > 4. Thanks for the suggestions on the diskless or other file
> > systems. I will discuss that with my group members.
>
> This is an interesting way to go if your IO can be non-local, or if
> you just need local scratch space.  It makes building clusters quite
> fast/easy.


My IO is entirely local. However, I sometimes need to run the same
code on clusters operated by others, and I want to avoid the trouble
of changing the code back and forth.


> > Overall, many people are suggesting an Opteron system. I just
> > learned that one of the vendors is providing us an Opteron (with PG
> > fortran compiler) cluster for test runs. I had better rush to do
> > that first. I will post new results after the tests.
>
> Make sure that the PG version is 5.2.2 (their current download).
> Make sure they install the ACML.  Also grab the PathScale and the
> Intel compilers.  Make sure that when you build your code, you build
> for the 64-bit architecture (this requires that the loaner machine is
> running in 64-bit mode).  The additional SSE2 and GP registers
> available to the optimizer and the simplified (non-segmented) address
> model generally make your codes run faster.  Yes, you could use it as
> a fast Xeon in 32-bit mode.  If you do, in the cases I have been
> playing with, you are giving up lots of performance because your
> codes do not have access to all the nice features on the chip.


Thank you for the hints. I will pay attention to these things.
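For my own notes, my understanding is that a 64-bit PGI build with
ACML would look something like the line below (the -tp k8-64 and
-fastsse flags are as I understand them from the pgf90 documentation,
and the ACML library path is only a placeholder for wherever it ends
up installed on the loaner machine):

    pgf90 -tp k8-64 -fastsse -o solver solver.f90 \
          -L/opt/acml/pgi64/lib -lacml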


>
> > Also, I have heard the name "channel bonding" here and there. Is
> > that some kind of network connection method for clusters (to use
> > standard switches to achieve faster data transfer rates)? Can
> > someone briefly talk about it, or point me to some website where I
> > can read about it? I did some Google searching, but the materials
> > are too technical for me. :( Is it useful for a cluster of about
> > 30-40 nodes?
>
> There are plusses and minuses to channel bonding.  Search back
> through the archives for details.  I am not sure if the issues with
> latency and out-of-order packet send/receive have been addressed in
> the 2.6 kernel.


Thank you! I will look into these.
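From the reading I have done so far, my rough understanding is that
Linux channel bonding is configured along these lines (only a sketch
pieced together from the kernel bonding documentation, not something I
have tried myself; the interface names, addresses, and mode are
assumptions that will depend on the distribution and kernel):

    # /etc/modules.conf (2.4-era syntax; 2.6 kernels use modprobe.conf)
    # mode=0 is round-robin across the slave links
    alias bond0 bonding
    options bond0 mode=0 miimon=100

    # bring up the bonded interface, then enslave two GigE ports to it
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1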


> 
> 
> Joe
> 
> >  
> > Thanks!
> >  
> >  
> >
> >
> > */"Jeffrey B. Layton" <laytonjb at charter.net>/* wrote:
> >
> >     SC Huang wrote:
> >
> >     > Thanks for your suggestions. I am summarizing the questions
> >     > and answers here.
> >     >
> >     > 1. cluster usage
> >     >
> >     > The cluster will be used solely for running numerical
> >     > simulation (number-crunching) codes written in Fortran 90 and
> >     > MPI, which involve finite difference calculations and fast
> >     > Fourier transforms for solving the 3D Navier-Stokes
> >     > equations. The code calls mpi_alltoall a lot (for FFTs) as
> >     > well as other mpi_send/recv, so communication is intensive.
> >     > The problem is unsteady and 3D, so computation is also
> >     > heavy. A typical run can take 1-2 weeks using 8-16 nodes
> >     > (depending on the problem size).
> >     >
> >
> >
> >     Doing frequency domain modeling of CFD results? Getting
> >     frequency domain models of time domain simulations?
> >
> >     > We have been OK with a "hybrid" 25-node (COMPAQ Alpha & Dell
> >     > Xeon 2.4GHz) cluster running right now using a 3Com 100 Mbps
> >     > (ethernet) switch and a LAM/MPI library.
> >     > I will post some benchmarks later.
> >     >
> >     > 2. Many people recommended Opteron (or at least encouraged a
> >     > test run on Opteron) because it seems to be more cost
> >     > effective. I picked Xeon for the following reasons:
> >     >
> >     > (1) the free Intel FORTRAN 90 compiler, which is also used on
> >     > other individual workstations in our lab and some
> >     > supercomputers that we have access to (we are kind of trying
> >     > to stay away from the hassle of switching between compilers
> >     > when writing new codes)
> >
> >     You can use the Intel compiler on the Opteron and it will run
> >     great! In fact, the SPEC benchmarks for the Opteron were run
> >     with the Intel compiler :)  The next version of the compiler
> >     will work for EM64T chips and should build 64-bit code on
> >     Opterons just fine.
> >
> >     However, if you need to buy a compiler (I recommend doing so
> >     just to get 64-bit code), they're not that expensive -
> >     especially with educational pricing.
> >
> >
> >     > (2) We have a few users sharing the cluster, so we have to
> >     > get "enough" nodes
> >
> >     Yes, but if the nodes are faster, the jobs go through faster
> >     and you get the same amount of work done, or more.
> >
> >     > (3) Xeon seems to be more common, so it's easier to get
> >     > consulting or support
> >
> >
> >     I disagree.
> >
> >
> >     > BTW, what are the common Fortran 90 compilers that people use
> >     > on Opteron? Any comparison to other compilers?
> >     >
> >     > 3. My MPI code periodically writes out data files to local
> >     > disk, so I do need a hard disk on every node. Diskless sounds
> >     > good (cost, maintenance, etc.), but the data size seems too
> >     > big to be transferred to the head node (well, technically it
> >     > can be done, but I would rather just use local scratch disk).
> >
> >     Don't discount diskless. It's a wonderful idea. You can use
> >     PVFS or even Lustre to write out results. Just fill a certain
> >     percentage of the nodes with disks for PVFS or Lustre and use
> >     those nodes for double duty (computation and storage).
> >     Opterons could handle this easily.
> >
> >     However, to take full advantage of PVFS you might have to
> >     rewrite your code a bit, but it's not that difficult.
> >
> >     If you don't go diskless, at least think about a RAM-disk based
> >     OS, with the local drives used purely for storage. There are
> >     many good reasons for doing this.
> >
> >     > 4. Managed or unmanaged?
> >     >
> >     > People already recommended some switches, which I will not
> >     > repeat here. However, I am still not clear about "managed"
> >     > and "unmanaged" switches. Some vendors told me that I need a
> >     > managed one, while others said the opposite. I will need to
> >     > study more...
> >
> >     Unmanaged means you can't/won't/shouldn't log into the switch
> >     to configure it, because you can't configure it. It just works
> >     (well, the unmanaged switches from real vendors just work).
> >     With a managed switch, you need to configure the switch by
> >     logging into it. You, or ideally the vendor, need to configure
> >     it for HPC (some vendors never bother to do this - sigh...).
> >
> >     You might also want to consider an alternative network. CFD
> >     codes can respond very well to Hypercubes or Torus networks. I
> >     know a vendor who can do this for you very easily.
> >
> >
> >     > 5. I only have wall-clock timings of my code on various
> >     > platforms. I don't know how sensitive it is to cache size. I
> >     > guess the bigger the cache, the better, because the code
> >     > operates on large arrays all the time.
> >     >
> >     > I will post more of a summary here if I find out more
> >     > information about these issues. Thanks.
> >
> >     Keep posting questions and ideas. Lots of good people here who
> >     can help.
> >
> >     Jeff
> >
> >     > SCH
> >     >
> >     > SC Huang wrote:
> >     >
> >     > Hi,
> >     >
> >     > I am about to order a new cluster using a $100K grant for
> >     > running our in-house MPI codes. I am trying to get at least
> >     > 36-40 (or more, if possible) nodes. The individual node
> >     > configuration is:
> >     >
> >     > dual Xeon 2.8 GHz
> >     > 512K L2 cache, 1MB L3 cache, 533 MHz FSB
> >     > 2GB DDR RAM
> >     > gigabit NIC
> >     > 80 GB IDE hard disk
> >     >
> >     > The network will be based on a gigabit switch. Most vendors I
> >     > talked to use the HP ProCurve 2148 or 4148.
> >     >
> >     > Can anyone comment on the configuration (and the switch)
> >     > above? Any other comments (e.g. recommended vendor, etc.) are
> >     > also welcome.
> >     >
> >     > Thanks!!!
> >     >
> 
> 
> -- 
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 612 4615
> 
> 



		


