[Beowulf] Re: vectors vs. loops

Wed Apr 27 06:47:29 PDT 2005

On Tue, 26 Apr 2005, Ben Mayer wrote:

> The point of the study was teaching them parallel programming with MPI
> and UPC to see what the (dis)liked/(mis)understood to gain a better
> understanding of how we can help them learn. An observation was that
> they wrote their code in such a way that it was difficult (or
> impossible) for the compiler to vectorize the code (even with
> directives).

This isn't terribly surprising, at least to me.  Back in the old days,
when the North Carolina supercomputer center was still around and had a
Cray YMP as its centerpiece, a whole lot of the stuff run on it wasn't
particularly vectorizable.  People would apply for a "supercomputer
account" and expect their ported code to run at 100's of MFLOPS
(standard at the time on the desktop was maybe 1-4 MFLOPS depending on
how expensive your Unix desktop or departmental compute server was).

However, most code doesn't vectorize too well (even, as you say, with
directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
possible -- faster than a desktop, sure, but using a multimillion dollar
machine to get a factor of MAYBE 10 in speedup compared to (at the time)
$5-10K machines.  In the meantime, I'm sure that there were people who
had code that DID vectorize well pulling their hair because of all those
100 hour accounts that basically wasted 90% of the resource.

My own code doesn't vectorize too well because it isn't heavy on linear
algebra and the loops are over 3 dimensional lattices where
nearest-neighbor sums CANNOT be local in a single memory stream and
where a cluster methodology that makes the code's convergence SCALE
faster requires that sites be selected for cluster inclusion that are
effectively random memory accesses (the slowest possible,
cache-defeating form of memory access).  Rob Peter, pay Paul.

However, it PARALLELIZES just gangbusters fine, since accumulating
statistics on independent runs is embarrassingly parallel and produces
nearly perfect linear speedup.  That (in the early 90's) was when I
realized that a cluster of 10-20 $8000 Sun workstations (all but one or
two paid for by other people or the University) could give me twice the
performance of a Cray on my code, not on a measley 100 hour "research"
account but on a 24x7 basis where every week that went by I accumulated
more than twice as many hours.  And the workstations were all in use by
humans at the same time as desktops!  My jobs just ran quietly in the
background and didn't noticeably slow the sunview or SGI GUI.  Finally,
the code required no special structure or directives -- it was nice,
portable, serial code.

The moral of this particular story is to NOT try to force code onto a
vector environment unless it is, really, a vector task.  Indeed, don't
force code into a PARALLEL environment (e.g. into PVM or MPI) unless it
is a NON-TRIVIAL parallel task (I spent a lot of time rewriting my code
as master-slave stuff in PVM, only to finally realize that EP tasks are
more easily managed by just running the damn jobs independently via e.g.
a script and accumulating results with other scripts, because writing
ROBUST PVM (or MPI) code -- code that can survive a casual reboot or
interruption of any particular node -- is Not Easy.

If it IS a vector (or nontrivial parallel, or both) task, then the
problem almost by definition will EITHER require extensive "computer
science" level study -- work done with Ian Foster's book, Amalsi and
Gottlieb for parallel and I don't know what for vector as it isn't my
area of need or expertise and Amazon isn't terribly helpful (most books
on vector processing deal with obsolete systems or are out of print, it
seems).  

However, the observation that in many cases using an appropriate library
in the vector case is the way to go is dead on the money.  One important
example is linear algebra, which is something that generally DOES
vectorize well (almost by definition:-).  ATLAS clearly demonstrates
that linear algebra subroutines can vary in speed by a factor of 3 or
more just by switching algorithms and blocking/stride in a way that is
optimal for the memory/CPU sub-architectures in use.  Ordinary humans
(that is, intro students in any CPS class) aren't going to see this kind
of speedup from writing mundane loops that implement standard algorithms
in straightforward ways.  The problem is obviously complicated still
further by the existence of "vector" co-processors that can do certain
operations on certain memory blocksizes very quickly, which a whole
DISTINCT set of memory latencies and bandwidths to contend with,
complicated still further by SSE instructions and on-CPU specialty
processing units.  

Clearly there is no "solution" that fits all of the possible cases;
equally clearly ANY ATTEMPT TO OPTIMIZE for a particular architecture
instance comes AT THE EXPENSE OF code portability, code readability,
code instrumentation with numerous #ifdefs and pragmas.  Library calls
for "separable" components that might vectorize at least provide an API
layer that can leave the code itself portable and require that all
optimization energy go into the library (or not).

> >From the experiences of helping instruct two parallel classes, and my
> own experiences learning (and observing), it is a very rare case that
> a scientific code (or most any large code) will be be automatically
> vectorized all of the way through, unless one has a good understanding
> of computer architecture AND significant experience producing code
> that the compiler can optimize for a given arch.

Absolutely, dead on the money.  Especially since there is no "COTS"
vector (co)processor, so that these days most vectorization involves the
limited amount available within SSE on OTS cpus or one-of-a-kind vector
units that are neither standard nor code-portable.

It is an interesting question as to what would happen if e.g. Intel or
AMD went the co-processor route once again, and developed a dedicated
function, really really fast vector/float coprocessor for their 64 bit
lines of CPUs and standardized a machine language interface for the
architecture so that compilers could "automatically" use it.  I'd guess
that the problems such an approach would face would be profound, as
memory bandwidth is likely to be a rate limiting obstacle that might
make the result no faster than memory-bound linear algebra code is
already on OTS CPUs.  As Mark has somewhat wryly observed already.  It
is quite probable that they HAVE thought about doing this and rejected
it, both because of a relatively limited market for the result and
because it may NOT speed things up that much compared to what they can
do with a general purpose CPU with efficient floating point pipelines
and state-of-the-art memory architecture.

For me, I just revel in the Computer Age.  A decade ago, people were
predicting all sorts of problems breaking the GHz barrier.  Today CPUs
are routinely clocked at 3+ GHz, reaching for 4 and beyond.  A decade
ago, 10 MFLOPS was still a respectable speed and only available on
systems that cost order of ten thousand dollars.  Today a GFLOP is a
mundane speed and available on systems that cost order of one thousand
dollars.  A decade ago a "supercomputer" provided a few hundred MFLOPS,
costs millions of dollars for the hardware, hundreds of thousands of
dollars in annual maintenance, hundreds of thousands more each year for
housing, cooling, and a staff of professional acolytes that had to be
propitiated in order to make anything like full use of the resource
(with the exception of the relatively few of us who had discovered
cluster computing by design or accident).  Any sort of computer CAPABLE
of more than the most mundane performance was considered a >>munition<<
by Our Government and export-restricted lest Bad People design nuclear
weapons with them.  

Today, supercomputers routinely get 100's of GFLOPS on up, at a cost of
less than $1000/GFLOP.  Anybody can build them.  Amazingly, in spite of
the fact that my aged and obsolete laptop would have been a munition a
decade ago and that just about every country on the planet has at least
ONE cluster capable of eating any ten munitions-quality supercomputers a
decade ago, Bad People apparently are no closer or farther from nuclear
devices than they ever were (as their primary obstacle is obtaining
weapons-grade nuclear material, not coming up with or implementing an
"adequately" functional design as the latter is really pretty trivial in
terms of modern technology and a bomb doesn't have to be "perfect" to be
frighteningly usable).

Ahh, it's good to be alive.

    rgb

> 
> CVL is a library that has to be called. This means that one has to
> explicitly make a function call to the lib for your code to have
> vectorized segments. Function calls greatly diminish a compilers
> ability to optimize. Cray's compilers (and recently gcc) will
> automatically vectorize code for you, if it can. Cray's will even make
> it run in parallel, across processors, to some extent. The
> optimization can be greatly helped by knowing how to help the
> compiler. Learning how to guide the compiler takes a good deal of time
> to learn.  Unless Sony/IBM releases an awesome compiler and some great
> vectorized sub-systems (a graphics render, physics engine, etc) the
> Cell's (PS3) speed will be hindered greatly by code not vectorizing.
> This could be a great opportunity to make a large sum of money.
> 
> NESL does not appear to have momentum. Started in 1991 and the last
> paper is in 1997. They talk about running on Paragon and CM5 (I
> believe that UMN had serial #1), not about Cray X1, which is what we
> were using.
> 
> scandal's web site makes it look like it is a research language only.
> 
> MPI is standard, UPC has compilers from many vendors (including Cray).
> MPI is installed everywhere and runs everywhere, UPC can run on
> distributed memory and SSI systems, and is currently being explored
> for strengths and weaknesses.
> 
> Heck Matlab might be a good choice. Everything is a matrix AND it runs
> in parallel... now if only our large machine with matlab would be
> stable we could give some more assignments. ;)
> 
> Ben
> 
> On 4/26/05, Andrew Piskorski <atp at piskorski.com> wrote:
> > On Tue, Apr 26, 2005 at 10:01:25AM -0500, Ben Mayer wrote:
> > 
> > > I actually just did a small study of how well students in a parallel
> > > computing class write parallel codes on X1 with MPI and UPC. One of
> > > the things that stood out is that they tended to do odd things in
> > > their loops that inhibit code from vectorizing.
> > 
> > So, why were these students writing loops in the first place?  If the
> > goal is to generate vectorized code, wouldn't it make more sense to
> > use a language or library which directly supports vector commands?
> > 
> > E.g., although they're used for serial not parallel programming, S and
> > R are vector oriented in a pleasantly convenient way.  There do exist
> > languages and libraries specifically intended for vector programming,
> > like CVL or NESL, right, so, are they not useful?
> > 
> >  http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/scandal/public/papers/cvl.html
> >  http://www-2.cs.cmu.edu/~scandal/
> > 
> > --
> > Andrew Piskorski <atp at piskorski.com>
> > http://www.piskorski.com/
> >
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu