[Beowulf] Re: vectors vs. loops

Art Edwards edwardsa at afrl.kirtland.af.mil
Thu Apr 28 09:07:30 PDT 2005

I'll just tell you that for a well-written, local-basis DFT code, that's
not the case (that the limit is the two-electron integrals and the Fockian).
It surely is not for SEQQUEST, and I'd bet it's not for SIESTA, as it is
based on similar methods.

Art Edwards

On Thu, Apr 28, 2005 at 06:46:32PM +0400, Mikhail Kuzminsky wrote:
> In message from Joe Landman <landman at scalableinformatics.com> (Wed, 27 
> Apr 2005 15:51:49 -0400):
> >Hi Art:
> >
> >  Any particular codes you have in mind?  I used to play around with 
> >lots of DFT (LDA) codes.  Back then, large systems were 256 x 256, 
> >with periodic BC's.
> Most (practically all) DFT codes are not limited by the eigenvalue
> problem. The limiting stage is the computation of the 2-electron
> integrals and the Fockian.
> Yours
> Mikhail Kuzminsky
> Zelinsky Institute of Organic Chemistry
> Moscow  
> >We used a number of eigensolvers, and eventually 
> >settled on LAPACK's zheev.  Modeling supercells much larger than 
> >64 atoms with 4 electronic basis states was a challenge using that 
> >code.
> >
> >  Do you have a particular model system in mind as well?  A nice 
> >model (or similar) might work out nicely.  I would like to include 
> >some electronic structure codes in our (evolving) BBS system.
> >
> >Joe
> >
> >Art Edwards wrote:
> >>This subject is pretty important to us. We run codes where the
> >>bottleneck is eigensolving for matrices with a few thousand elements.
> >>Parallel eigen solvers are not impressive at this scale. In the dark
> >>past, I did a benchmark on a Cray Y-MP using a vector eigen solve and
> >>got over 100x speedup. What I don't know is how this would compare to
> >>current compilers and CPU's. However the vector pipes are not very
> >>deep on any of the current processors except, possibly the PPC. So, I
> >>would like to see benchmarks of electronic structure codes that are
> >>bound by eigensolvers on a "true vector" machine.
> >>
> >>Art Edwards
> >>
> >>On Wed, Apr 27, 2005 at 01:15:42PM -0400, Robert G. Brown wrote:
> >>
> >>>On Wed, 27 Apr 2005, Ben Mayer wrote:
> >>>
> >>>
> >>>>>However, most code doesn't vectorize too well (even, as you say, with
> >>>>>directives), so people would end up getting 25 MFLOPs out of 300
> >>>>>MFLOPs possible -- faster than a desktop, sure, but using a
> >>>>>multimillion dollar machine to get a factor of MAYBE 10 in speedup
> >>>>>compared to (at the time) $5-10K machines.
> >>>>
> >>>>What the people who run these centers have told me is that a
> >>>>supercomputer is worth the cost if you can get a speedup of 30x over
> >>>>serial. What do others think of this?
> >>>
> >>>I personally think that there is no global answer to this question.
> >>>There is only cost-benefit analysis.  It is trivially simple to
> >>>reduce this assertion (by the people who run the centers, who are not
> >>>exactly unbiased here:-) to absurdity for many, many cases.  In either
> >>>direction -- for some it might be worth it for a factor of 2 in
> >>>speedup, for others it might NEVER be worth it at ANY speedup.
> >>>
> >>>For example, nearly all common and commercial software isn't worth it
> >>>at any cost.  If your word processor ran 30x faster, could you tell?
> >>>Would you care?  Would it be "worth" the considerable expense of
> >>>rewriting it for a supercomputer architecture to get a speedup that
> >>>you could never notice (presuming that one could actually speed it
> >>>up)?  Sure it's an obvious exception, but the problem with global
> >>>answers is they brook no exceptions even when there are obvious ones.
> >>>If you don't like the word processor, pick a suitable rendered
> >>>computer game (zero productive value, but all sorts of speedup
> >>>opportunities).  Pick any software with no particular VALUE in the
> >>>return or with a low OPPORTUNITY COST of the runtime required to run
> >>>it.
> >>>
> >>>A large number of HPC computations are in the latter category.  If I
> >>>want to run a simple simulation that takes eight hours on a serial
> >>>machine and that I plan to run a single time, is it worth it for me to
> >>>spend a month recoding it to run in parallel in five minutes?
> >>>Obviously not.  If you argue that I should include the porting time in
> >>>the computation of "speedup" then I'd argue that if I have a program
> >>>that takes two years to run without porting and that takes six months
> >>>to port into a form that runs on a supercomputer in six months more,
> >>>well, a year of MY life is worth it, depending on the actual COST of
> >>>the "supercomputer" time compared to the serial computer time.  Even
> >>>in raw dollars, my salary for the extra year is nontrivial compared to
> >>>the cost of purchasing and installing a brand-new cluster just to
> >>>speed up the computation by a measly factor of two or four, depending
> >>>on how you count.
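
The two cases in the paragraph above can be put into numbers. This is only
a rough sketch: the function name effective_speedup and the 30-day month
are assumptions made for illustration, not anything taken from the thread.

```python
def effective_speedup(serial_time, port_time, ported_run_time):
    """Speedup when porting effort counts as part of the time to solution."""
    return serial_time / (port_time + ported_run_time)


# Case 1: eight-hour serial job, one month of recoding (assumed to be
# 30 days), five-minute parallel run. All times in hours.
HOURS_PER_MONTH = 30 * 24
one_off = effective_speedup(8.0, HOURS_PER_MONTH, 5.0 / 60.0)

# Case 2: two-year serial job, six months to port, six more months to run.
# All times in months.
counting_port = effective_speedup(24.0, 6.0, 6.0)
ignoring_port = effective_speedup(24.0, 0.0, 6.0)

print(f"one-off job, porting counted:  {one_off:.3f}x")
print(f"long job, porting counted:     {counting_port:.1f}x")
print(f"long job, porting not counted: {ignoring_port:.1f}x")
```

The one-off job comes out far below 1x once the porting month is counted,
while the long job lands on the "two or four, depending on how you count"
figures.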
> >>>
> >>>So pay no attention to your supercomputer people's pronouncement.
> >>>That number (or any other) is pulled out of, uh, their nether regions
> >>>and is unjustifiable.  Instead, do the cost-benefit analysis, problem
> >>>by problem, using the best possible estimates you can come up with for
> >>>the actual costs and benefits.
> >>>
> >>>That very few people EVER actually DO this does not mean that it isn't
> >>>the way it should be done;-)
> >>>
> >>>
> >>>>:) I needed to do some CHARMM runs this summer. The X1 did not like
> >>>>it much (neither did I, but when the code is making references to
> >>>>punch cards and you are trying to run it on a vector super, I think
> >>>>most would feel that way), I ended up running it in parallel by a
> >>>>similar method as yours. Worked great!
> >>>
> >>>The easy way into cluster (or nowadays, "grid") computing, for sure.
> >>>If your task is or can be run embarrassingly parallel, well, parallel
> >>>scaling doesn't generally get much better than a straight line of
> >>>slope one barring the VERY few problems that exhibit superlinear
> >>>scaling for some regime....;-)
> >>>
> >>>
> >>>>>If it IS a vector (or nontrivial parallel, or both) task, then the
> >>>>>problem almost by definition will EITHER require extensive "computer
> >>>>>science" level study -- work done with Ian Foster's book, Almasi and
> >>>>>Gottlieb for parallel and I don't know what for vector as it isn't my
> >>>>>area of need or expertise and Amazon isn't terribly helpful (most
> >>>>>books on vector processing deal with obsolete systems or are out of
> >>>>>print, it seems).
> >>>>
> >>>>So what we should really be trying to do is matching code to the
> >>>>machine. One of the problems that I have run into is that unless one
> >>>>is at a large center there are only one or two machines that provide
> >>>>computing power. Where I am from we have an X1 and a T3E. Not a very
> >>>>good choice between the two. There should be a cluster coming up
> >>>>soon, which will give us the options that we need, i.e., vector or
> >>>>cluster.
> >>>
> >>>No, what you SHOULD be doing is matching YOUR code to the cluster you
> >>>design and build just for that code.  With any luck, the cluster
> >>>design will be a generic and inexpensive one that can be reused
> >>>(possibly with minor reconfigurations) for a wide range of parallel
> >>>problems.  If your problem DOES trivially parallelize, nearly any
> >>>grid/cluster of OTS computers capable of holding it in memory on
> >>>(even) sneakernet will give you linear speedup.
> >>>Given Cluster World's Really Cheap Cluster as an example, you could
> >>>conceivably end up with a cluster design containing nodes that cost
> >>>between $250 and $1000 each, including switches and network and
> >>>shelving and everything, that can yield linear speedup on your code.
> >>>Then you do your cost-benefit analysis, trade off your time, the value
> >>>of the computation, the value of owning your own hardware and being
> >>>able to run on it 24x7 without competition, the value of being able to
> >>>redirect your hardware into other tasks when your main task is idle,
> >>>any additional costs (power and AC, maybe some systems administration,
> >>>maintenance).  This will usually tell you fairly accurately both
> >>>whether you should build your own local cluster vs run on a single
> >>>desktop workstation vs run on a supercomputer at some center and will
> >>>even tell you how many nodes you can/should buy and in what
> >>>configuration to get the greatest net benefit.
> >>>
> >>>Note that this process is still correct for people who have code that
> >>>WON'T run efficiently on really cheap node or network hardware; they
> >>>just have to work harder.  Either way, the most important work is
> >>>prototyping and benchmarking.  Know your hardware (possibilities) and
> >>>know your application.  Match up the two, paying attention to how much
> >>>everything costs and using real world numbers everywhere you can.
> >>>AVOID vendor provided numbers, and look upon published benchmark
> >>>numbers for specific micro or macro benchmarks with deep suspicion
> >>>unless you really understand the benchmark and trust the source.  For
> >>>example, you can trust anything >>I<< tell you, of course...;-)
> >>>
> >>>
> >>>>The manual for the X1 provides some information and examples. Are the
> >>>>Apple G{3,4,5} the only processors that have real vector units? I have
> >>>>not really looked at SSE(2), but remember that they were not full
> >>>>precision.
> >>>
> >>>What's a "real vector unit"?  On chip?  Off chip?  Add-on board?
> >>>Integrated with the memory and general purpose CPU (and hence
> >>>bottlenecked) how?
> >>>
> >>>Nearly all CPUs have some degree of vectorization and parallelization
> >>>available on chip these days; they just tend to hide a lot of it from
> >>>you.  Compilers work hard to get that benefit out for you in general
> >>>purpose code, where you don't need to worry about whether or not the
> >>>unit is "real", only about how long it takes the system to do a stream
> >>>triad on a vector 10 MB long.  Code portability is a "benefit" vs code
> >>>specialization is a "cost" when you work out the cost-benefit of
> >>>making things run on a "real vector unit".  I'd worry more about the
> >>>times returned by e.g. stream with nothing fancy done to tune it than
> >>>how "real" the underlying vector architecture is.
> >>>
> >>>Also, if your problem DOES trivially parallelize, remember that you
> >>>have to compare the costs and benefits of complete solutions, in
> >>>place.  You really have to benchmark the computation, fully optimized
> >>>for the architecture, on each possible architecture (including systems
> >>>with "just" SSE but perhaps with 64 bit memory architectures and ATLAS
> >>>for linear algebra that end up still being competitive) and then
> >>>compare the COST of those systems to see which one ends up being
> >>>cheaper.  Remember that bleeding edge systems often charge you a
> >>>factor of two or more in cost for a stinkin' 20% more performance, so
> >>>that you're better off buying two cheap systems rather than one really
> >>>expensive one IF your problem will scale linearly with number of
> >>>nodes.
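
The "two cheap systems vs one expensive one" arithmetic can be sketched as
follows. The function name and the normalized units (one cheap node costs
1.0 and runs at speed 1.0) are assumptions for illustration, and the sketch
only holds if the problem really does scale linearly with node count.

```python
def throughput_for_budget(budget, node_cost, node_speed):
    """Aggregate throughput a fixed budget buys, assuming linear scaling."""
    nodes = int(budget // node_cost)  # whole nodes only
    return nodes * node_speed


BUDGET = 2.0  # enough for two cheap nodes or one bleeding-edge node

# Cheap node: cost 1.0, speed 1.0.
cheap_total = throughput_for_budget(BUDGET, node_cost=1.0, node_speed=1.0)

# Bleeding-edge node: twice the cost for 20% more performance.
fast_total = throughput_for_budget(BUDGET, node_cost=2.0, node_speed=1.2)

print(f"two cheap nodes:        {cheap_total:.1f}")
print(f"one bleeding-edge node: {fast_total:.1f}")
```

Under these assumptions the same money buys 2.0 units of throughput from
the cheap nodes against 1.2 from the expensive one. If the code does not
scale linearly, the comparison has to be redone with measured numbers.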
> >>>
> >>>I personally really like the Opteron, and would commend it to people
> >>>looking for a very good general purpose floating point engine.  I
> >>>would mistrust vendor benchmarks that claim extreme speedups on vector
> >>>operations for any big code running out of memory unless the MEMORY is
> >>>somehow really special.  A Ferrari runs as fast as a Geo on a crowded
> >>>city street.
> >>>
> >>>As always, your best benchmark is your own application, in all its
> >>>dirty and possibly inefficiently coded state.  The vendor specs may
> >>>show 30 GFLOPS (for just the right code running out of L1 cache or out
> >>>of on-chip registers) but when you hook that chip up to main memory
> >>>with a 40 ns latency and some fixed bandwidth, it may slow right down
> >>>to bandwidth limited rates indistinguishable from those of a much
> >>>slower chip.
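
One way to see the "bandwidth limited" point is a simple roofline-style
bound: attainable rate = min(peak, bandwidth x flops per byte). This is a
swapped-in back-of-the-envelope model, not a measurement. The 30 GFLOPS
peak is the figure from the paragraph above; the 6 GB/s memory bandwidth
and the 3 GFLOPS "slower chip" are hypothetical numbers chosen only for
illustration, and latency and cache effects are ignored.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style upper bound on the sustained floating point rate."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Triad a[i] = b[i] + s * c[i] in double precision:
# 2 flops per element, 24 bytes moved (load b, load c, store a).
TRIAD_FLOPS_PER_BYTE = 2.0 / 24.0

BANDWIDTH_GB_S = 6.0  # hypothetical main-memory bandwidth

fast_chip = attainable_gflops(30.0, BANDWIDTH_GB_S, TRIAD_FLOPS_PER_BYTE)
slow_chip = attainable_gflops(3.0, BANDWIDTH_GB_S, TRIAD_FLOPS_PER_BYTE)

print(f"30 GFLOPS peak chip: {fast_chip:.2f} GFLOPS on the triad")
print(f" 3 GFLOPS peak chip: {slow_chip:.2f} GFLOPS on the triad")
```

With these assumed numbers both chips are capped at the same rate on an
out-of-memory triad, which is why a measured stream result says more than
the peak figure.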
> >>>
> >>>
> >>>>>For me, I just revel in the Computer Age.  A decade ago, people were
> >>>>>predicting all sorts of problems breaking the GHz barrier.  Today
> >>>>>CPUs are routinely clocked at 3+ GHz, reaching for 4 and beyond.
> >>>>>A decade
> >>>>
> >>>>I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
> >>>>10/100 + intel 1000 Pro for $540 shipped. I was amazed.
> >>>
> >>>The Opterons tend to go for about twice that per CPU, but they are
> >>>FAST, especially for their actual clock.  The AMD-64's can be picked
> >>>up for about the same and they too are fast.  I haven't really done a
> >>>complete benchmark run on the one I own so far, but they look
> >>>intermediate between Opteron and everything else, at a much lower
> >>>price.
> >>>
> >>>  rgb
> >>>
> >>>-- 
> >>>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> >>>Duke University Dept. of Physics, Box 90305
> >>>Durham, N.C. 27708-0305
> >>>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> >>>
> >>>
> >>>_______________________________________________
> >>>Beowulf mailing list, Beowulf at beowulf.org
> >>>To change your subscription (digest mode or unsubscribe) visit 
> >>>http://www.beowulf.org/mailman/listinfo/beowulf
> >>
> >>
> >
> >-- 
> >Joseph Landman, Ph.D
> >Founder and CEO
> >Scalable Informatics LLC,
> >email: landman at scalableinformatics.com
> >web  : http://www.scalableinformatics.com
> >phone: +1 734 786 8423
> >fax  : +1 734 786 8452
> >cell : +1 734 612 4615
> >

Art Edwards
Senior Research Physicist
Air Force Research Laboratory
Electronics Foundations Branch
KAFB, New Mexico

(505) 853-6042 (v)
(505) 846-2290 (f)
