[Beowulf] Building new cluster - estimate

Mark Kosmowski mark.kosmowski at gmail.com
Thu Aug 7 04:36:40 PDT 2008


>
> Message: 7
> Date: Wed, 06 Aug 2008 22:01:17 -0400
> From: Joe Landman <landman at scalableinformatics.com>
> Subject: Re: [Beowulf] Building new cluster - estimate
> To: kyron at neuralbs.com
> Cc: Bogdan Costescu <Bogdan.Costescu at iwr.uni-heidelberg.de>,    Beowulf
>        List <Beowulf at beowulf.org>, Chris Samuel <csamuel at vpac.org>
> Message-ID: <489A576D.4000005 at scalableinformatics.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Eric Thibodeau wrote:
>
> >> Advantage of modules is you can upgrade them without upgrading the
> >> kernel.  Go ahead, build in that e1000 driver.  I dare yah... :(
> > Ok...I didn't put enough emphasis on "main" stuff....as in, _all_ you
> > need to get the system booted, which essentially means HDD chipset
> > drivers; the rest I do build as modules (NIC, video and such).
> >>
> >> More to the point it does give some good flexibility for end users
> >> with a need to keep the core "separate" from the drivers for maintenance.
> >>
> >> Initrd is subtle and quick to anger.  One must use burnt offerings to
> >> placate the spirits of initrd.
> > LOL!
>
>  ... now I don't mean hardware burnt offerings ... smoke rising from
> your motherboard may not placate the spirits of initrd, and it definitely
> may impede further operations ...

You beat me to it - I was going to ask whether initrd preferred power
supplies or motherboards. ;)

>
> >>
> >> Well, it would be a heck of a lot nicer if the tools were a little
> >> more forgiving ... Oh you don't have this driver in your initrd ... ok
> >> ... PANIC (mwahahahaha)
> > Pahahahahah... Case in point, I am building a CD-only cluster system
> > (based on Gentoo) and I am currently _NOT_ using initrd because all that
> > really needs to be built in is NFSroot support and all the NICs I care to
> > put in. Obviously this is a deprecated approach but it's proven to be the
> > most effective and easiest to maintain in my case.
>
> We build an integrated NFSroot and e1000 and a few other things for a
> customer.  Fixed hardware for their cluster.  From bare-metal-off to
> operational infiniband compute node in ~45-60 seconds (I say 45, but a
> few things took a little longer to start, like SGE).
>
> >>>
> >>> <rant>
> >>> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but
> >>> that's work in progress...you could still build the cluster using
> >>> Gentoo...if you're performance savvy...and want things like OpenMP
> >>> capable compiler
> >>
> >> I have been hearing claims like this for a long time.  I have not seen
> >> any real tests that back these claims up.  Do you have any?
> > I'm actually working on such benchmarks. Did you know that compiling
> > with the default ICC optimization will cause your bridge to crumble due
> > to floating point assumptions?...
> >
> > Ok, so my computations have diverged horribly, mostly because I am
> > computing 47(vector size)*5000(K-Means clusters)*6,787,955(learning
> > dataset)*5(iterations to convergence) for a total of 7,975,847,125,000
> > floating-point operations (about 8 trillion) as part of an iterative
> > learning process, so the error adds up. So performance is very sensitive
> > to what your intended goal is too ;)
>
> Hmmm.... sounds like a fun computation.  Error definitely adds up.
> Renormalization is your friend (well, some times, assuming a linear system).
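
A small sketch of the "error adds up" point, in Python rather than the
code Eric is actually running (which is not shown in this thread). The
list of one million copies of 0.1 is a made-up stand-in for a long
reduction, and math.fsum from the standard library is used only as a
correctly rounded reference. The first lines just re-check the operation
count quoted above.

```python
import math

# Operation count from the figures quoted above.
vector_size = 47
kmeans_clusters = 5000
learning_samples = 6_787_955
iterations = 5
total_ops = vector_size * kmeans_clusters * learning_samples * iterations
print(f"{total_ops:,}")  # 7,975,847,125,000

# Hypothetical stand-in for a long reduction: one million copies of 0.1.
values = [0.1] * 1_000_000

# Naive left-to-right accumulation: every addition rounds, and the
# rounding errors pile up.
naive = 0.0
for v in values:
    naive += v

# math.fsum tracks the lost low-order bits and returns a correctly
# rounded sum of the same inputs.
reference = math.fsum(values)

print(naive, reference, abs(naive - reference))
```

The drift here is tiny in absolute terms, but this is only a million
additions of well-behaved values; how it scales to trillions of
operations depends entirely on the algorithm and the data.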
>
> >>   Most of the arguments I have heard are "oh but its compiled with
> >> -O3" or whatever. Any decent HPC code person will tell you that that
> >> is most definitely not a guaranteed way to a faster system ...
> > Hey...as I stated above, one would have to be quite silly to claim -O3
> > as the all well and all good optimization solution. At least you can
> > rest assured your solutions will add up correctly with GCC. To get a
>
> Well, sometimes.  You still need to be careful with it.
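
For what it's worth, the usual reason an optimization that reorders
arithmetic can change answers is that floating-point addition is not
associative. A tiny Python illustration follows. It does not involve ICC
or GCC flags at all, only IEEE 754 double arithmetic, and the values are
arbitrary ones picked to make the effect visible.

```python
# Same three numbers, two groupings, two different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)

# A more drastic case: a small term absorbed by a large one.
big = 1e16
absorbed = (big + 1.0) - big  # the 1.0 is rounded away before the subtraction
kept = (big - big) + 1.0      # the large terms cancel first, so 1.0 survives
print(absorbed, kept)  # 0.0 1.0
```

A compiler that is allowed to regroup sums like these is free to pick
either answer, which is why results can differ between optimization
settings even when the source is unchanged.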
>
> This said, I am not sure icc/pgi/... are uniformly better than gcc.  I
> did an admittedly tiny study of this (http://scalability.org/?p=470) some
> time ago.  What I found was that gcc really held its own.  It did a very
> good job on a very simple test case.
>
> Then again, the fortran version was simply faster than the C version,
> but that can be explained ... by ... er ... ah ... something.

I have heard that in many codes the choice of math library (FFTW and
ATLAS, for example) makes a far, far greater difference in compiled
application speed than the choice of compiler.  Can anyone comment on
this?

>
> > "faster" system, you really have to look at your app, use strace, ltrace
> > and gprof, then you can play with that. What I _am_ saying though is
> > that Gentoo _does_ empower the administrator by giving him the ability
> > to customize the OS if a bottleneck is to be identified.
>
> Yup.  There is nothing like a profile of an app running the code, to see
> where it is spending its time to decide between code shifts and
> algorithmic shifts.
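
As a rough sketch of the "profile first" idea, here is the same workflow
using Python's standard cProfile and pstats modules. For compiled code
the analogous step would be gprof on a binary built with -pg. The
functions slow_sum_of_squares, fast_part and workload are hypothetical
placeholders, not anything from the codes discussed in this thread.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately loop-heavy placeholder for a hot spot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_part(n):
    """Cheap placeholder that should barely show up in the profile."""
    return sum(range(n))


def workload():
    slow_sum_of_squares(200_000)
    fast_part(200_000)


# Collect timing data for one run of the workload.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Write the report, sorted by cumulative time, into a string and print it.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats()
print(report.getvalue())
```

The report lists each function with its call count and time, which is
the information needed to decide whether to tune the code or change the
algorithm.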
>
> >>
> >>> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish
> >>
> >> Er... We often use several different compilers in several different
> >> trees.  Several gccs, pgi, icc, eieio ... you name it.  All are
> >> integrated.
> > Are you currently able to run GCC-4.3.x versions on your current setup,
>
> Currently running 4.2.3-2ubuntu7 on my laptop.  Another machine
> (development box) has something like 4 different gccs on it.  I haven't
> tried 4.3.x yet ... had planned to, but work gets in the way.

Speaking of gcc 4.3 and math libraries, has anyone had issues with
ACML 4.1.0?  I got some undefined reference errors when I tried it.  I'm
not a programmer, and I got my code to compile using a different set of
math libraries, so I haven't wrestled with this much.  The undefined
references are given below, and more details can be found in my post at
the OpenSUSE forum:
http://forums.opensuse.org/programming-scripting/390838-opensuse-11-0-acml-4-1-0-a.html

dgemv.F:(.text+0x3fd): undefined reference to `_gfortran_allocate64'
dgemv.F:(.text+0x451): undefined reference to `_gfortran_internal_free'
dgemv.F:(.text+0x516): undefined reference to `_gfortran_deallocate'

>
> > I'm actually eager to know. I'm still living under the ASSumption of
> > binary distributions not coping too well with multi-library
> > environments. Case in point, one of my colleagues _really_ wanted
>
> No, our systems (Ubuntu, SuSE, Centos) seem to have no real problems
> apart from the occasional broken hard wired /usr/lib with the wrong ABI
> in a configure/make file.  Usually easy to fix.
>
> > firefox 3 on his ubuntu system. The installer trickled down to having to
> > uninstall glibc...and he forced it to YES (and this is just a browser,
> > not something that is used to _make_ code and would be tied to glibc)
>
> Hmmm... I have firefox 3 on this system (64 bit) and I run icecat for 32
> bit access (java and other things).  No glibc changes (apart from
> security patches).  He must have done something horribly wrong.  We have
> multiple mixed ABI ubuntu/centos/suse systems, and haven't had issues.

I am starting to really appreciate the OpenSUSE community
repositories.  Lots of stuff is prebuilt and known to work.  YaST
makes it easy to see what the dependencies are as well.  I'm currently
running OpenSUSE 11.0 (but with KDE 3.5.9 - the 4.x is a bit too
bleeding edge for me).

>
> >>
> >>> afterthought of an RPM that pulls in a new glibc that breaks the install
> >>
> >> Er ... not the slightest clue as to what you are talking about.  I
> >> haven't seen gcc, icc, pgi, ... touch our glibc.
> >>
> >> Maybe I am missing the fun.  Which ICC version is this?  Which gcc is
> >> this, which glibc is this?
> >>
> > Sorry about that, I might have been misleading. GCC is generally the one
> > most sensitive to glibc, not the other ones, although the latest ICC
> > (10.1.x series) does claim compatibility with the GNU environment so it
> > might get a little more dependency there.
>
> We have installed the 10.1.015 on customer machines from Centos 5.2
> through SuSE 10.x through Ubuntu with nary a problem.  Very different
> glibc's.  No issues with code generation.
>
> Binary distributions aren't evil.  They do work, quite well in most cases.
>
>
> >
> > Cheers!
> >
> > Eric
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>        http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
>
Mark Kosmowski


