[Beowulf] FY;) GROMACS on the Raspberry Pi

Fri Sep 21 07:07:35 PDT 2012

On Thu, 20 Sep 2012, Prentice Bisbal wrote:

> I have a good one: generate a mandelbrot fractal. It's interesting
> because you can see it move through iterations faster as you add more
> processors to it. Of course, this means you need to ssh into the head
> node from a system with X-windows, and be able to run parallel jobs
> interactively. I remember seeing a demo of the first Linux cluster I saw
> IRL using this, and it was a homework assignment in my parallel
> programming class years ago.

The only real problem with this (which I used for years as a demo
myself, although rendering a 3D image was also a very good one) is that
modern CPUs are so very, very fast that even without parallelizing, you
can rubberband chunks of the Mandelbrot set and have them refresh in "no
time".  So it gets harder to see the speedup, or care.  Of course now
you don't even have to build a cluster to see linear speedup -- or
rarely, superlinear speedup -- because nearly all current generation
systems are multicore.  They ARE clusters.

One of the simplest ways to demonstrate parallel speedup is the least
exciting and most useful -- running M INDEPENDENT jobs on N cores on as
many machines as needed to obtain N cores.  I just did this on a new
system upstairs to get a feel for the scaling capabilities of the i7
processors -- it's an i7-3770 at 3.4 GHz with 16 GB of memory.  The task
I ran is a neural network builder that is very resource intensive -- it
begins with Monte Carlo plus conjugate gradient to build a population of
candidate networks, then runs an extensive genetic optimization
algorithm with various bells and whistles to transform the entire
population into a "pretty good" network, then finishes with a full
conjugate gradient optimization of that network.  Almost every step
involves running the network(s) (lots of arithmetic) against a training
set of data (lots of data) over and over and over, while generating many
random numbers along the way for this and for that.  If one starts the
task with a fixed (common) random number seed and scales its parameters
so that it takes maybe half an hour to complete on a single core, it
becomes a "decent" benchmark -- especially for me, since my purpose is
to do the preliminary work for designing a cluster intended to train
neural networks and do other sorts of Bayesian pattern recognition on a
commercial scale with very large datasets.
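
A minimal sketch of such a timing harness, in Python, might look like
the following.  It is an illustration only, not the code used for the
measurements described here: the names time_concurrent_jobs, sweep and
STAND_IN_JOB are invented, and the stand-in job is a trivial arithmetic
loop rather than the neural network builder.

```python
import subprocess
import sys
import time

# Hypothetical stand-in job: a short arithmetic loop run in a fresh
# Python interpreter.  Substitute your own fixed-seed benchmark command.
STAND_IN_JOB = [sys.executable, "-c", "sum(i * i for i in range(200000))"]


def time_concurrent_jobs(cmd, m):
    """Start m independent copies of cmd together and return the
    wall-clock seconds until the last one exits."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(m)]
    codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(code != 0 for code in codes):
        raise RuntimeError(f"a job failed, exit codes: {codes}")
    return elapsed


def sweep(cmd, max_jobs):
    """Time 1, 2, ..., max_jobs simultaneous copies of cmd.
    Returns a dict mapping job count to wall-clock seconds."""
    return {m: time_concurrent_jobs(cmd, m) for m in range(1, max_jobs + 1)}
```

Calling sweep(STAND_IN_JOB, 10) would time 1 through 10 simultaneous
copies.  The stand-in is far too short to give meaningful numbers; with
a real job of half an hour or so, process startup cost is negligible.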

One can then simply plot the amount of work one gets done against the
number of jobs being run on the processor.  This involves three simple
steps.  First, determine the time it takes one job, two jobs, ... M jobs
to complete, out to M large enough that the gain in work turns over and
starts to go DOWN as you increase it further.  Second, determine the
"core efficiency" for each M, which is the time required for just one
job running on one core divided by the time required with M jobs
running.  Finally, determine the work as the product of the core
efficiency and the number of jobs running, and plot all three curves
for grins.
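
The quantities described above can be computed in a few lines of
Python.  This is a sketch under the stated definitions only; the
function names are invented, and any numbers fed to it in an example
are made up, not measurements from the i7-3770.

```python
def scaling_curves(times):
    """times maps job count M to the wall-clock time for M simultaneous
    jobs to finish; it must include M = 1.
    Returns a list of (M, time, core_efficiency, work) tuples, where
    core_efficiency = time for 1 job / time for M jobs and
    work = M * core_efficiency."""
    t1 = times[1]
    rows = []
    for m in sorted(times):
        efficiency = t1 / times[m]
        rows.append((m, times[m], efficiency, m * efficiency))
    return rows


def peak_work(rows):
    """Return the row with the largest work value."""
    return max(rows, key=lambda row: row[3])
```

With invented times of 100, 100, 125 and 400 seconds for 1, 2, 4 and 8
jobs, the work column comes out as 1, 2, 3.2 and 2, so peak_work picks
the row for 4 jobs.  The three columns after M are the three curves to
plot.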

I have to say that I was surprised -- and impressed -- by the
performance of the i7-3770 when I did this.  I expected it to be nearly
flat from 1-3 jobs, maybe all the way to 4 (number of cores) and then to
follow the usual brutal curve south, where running 5, 6, 7, 8
simultaneous jobs on 4 cores yields a processor efficiency somewhat LESS
than 4/5, 4/6, 4/7, 4/8 so that the peak in work done occurs at 4 jobs.
Rather amazingly, the i7 peak occurred at >>8<< jobs running on 4 cores.
It still was not taking twice as long to complete a job as it took for a
single job on a single core when there were 8 jobs running on 4 cores!

Or 9 jobs on 4 cores.  Or 10 jobs on 4 cores.  But the peak in the work
accomplished (the product, remember) did occur at 8-9 (flat) and was
dropping off by 10.  In the end, the 4 core i7 was completing work
almost as fast as 6 independent i7's running just one instance of the
job would have, when running 8 or 9 jobs on the 4 cores.  Obviously it
was exploiting some sort of parallelism in the data or execution, but I
wasn't doing anything particularly special to help it out.
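
To put rough numbers on that comparison: the naive expectation caps the
core efficiency with M jobs on 4 cores at 4/M, so the work never exceeds
4.  Work "almost as fast as 6" single-job machines at 8 jobs instead
implies an efficiency a little under 6/8 = 0.75, that is, each job
taking a bit more than 1/0.75 = 1.33 times the single-job time.  The
Python sketch below just does that arithmetic.  The value 6.0 is a round
number standing in for "almost 6", not a measured figure, and the
function names are invented.

```python
def naive_efficiency(jobs, cores):
    """Upper bound on core efficiency if each core runs one job at a
    time at full speed and extra jobs simply share the cores."""
    return min(1.0, cores / jobs)


def implied_efficiency(work, jobs):
    """Core efficiency implied by a given amount of work, using
    work = jobs * core efficiency."""
    return work / jobs


cores = 4
for jobs in (4, 5, 6, 7, 8):
    eff = naive_efficiency(jobs, cores)
    print(f"{jobs} jobs: naive efficiency <= {eff:.3f}, work <= {jobs * eff:.2f}")

# 6.0 is an assumed round number standing in for "almost 6".
eff_8 = implied_efficiency(6.0, 8)
print(f"8 jobs at work 6.0: efficiency {eff_8:.2f}, slowdown {1 / eff_8:.2f}x")
```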

I attach the figure (not sure if the beowulf list will pass posts with
attachments these days, but it's not very large and worth a glance).  I
would very much recommend STARTING anyone learning to cluster with a
series of embarrassingly parallel tasks such as this one, scaled out to
where there is a fair amount of work being done (but it still finishes
in not too long a time) to help demonstrate what the cluster (be it many
systems/many cores or many cores single system) can do in terms of
parallel work, where of course YMMV depending on where and how any given
independent task is bottlenecked.  Monte Carlo computations are good for
computation bound tasks, statistics computations (perhaps running R) are
good for data bound tasks, and of course there are various benchmarking
tool suites that can usually be run embarrassingly parallel to give you
a more fractionated breakdown of parallel performance -- all WITHOUT the
burden of IPCs.

Then one can, and should, start to consider and explore the effect of
IPCs, which "should" be to strictly reduce the core efficiency, where
the more coarse-grained the computation is, the closer it gets to the
embarrassingly parallel result -- povray and mandelbrot generators,
which rely on perfect task partitioning in a master-slave algorithm,
usually get very nearly linear single task speedup (and CAN exhibit
superlinear speedup if splitting the task means keeping the whole thing
in cache versus having to go to main memory or worse, back to disk, or
if your processor stack involves black magic and core functions in the
fifth dimension like the i7 apparently does :-)).  And eventually, you hit
fine grained tasks wherein the design and speed of the network are
performance/rate limiting, where the "cost" of IPCs exceeds the cost of
computation and the processors are typically blocked waiting for
communications to complete.
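
A toy model makes the trend concrete.  Assume (this is an illustration,
not something measured in this post) that each step of a parallel task
costs t_compute seconds of arithmetic per core followed by t_comm
seconds blocked on communication, with no overlap between the two.  The
core efficiency is then t_compute / (t_compute + t_comm): it approaches
1 for coarse-grained work and drops below 1/2 once communication costs
more than computation.  The function name and the per-step costs in this
Python sketch are hypothetical.

```python
def ipc_efficiency(t_compute, t_comm):
    """Core efficiency under the toy model: the fraction of each step
    spent computing rather than blocked on communication."""
    return t_compute / (t_compute + t_comm)


# Assumed, purely illustrative per-step costs in seconds.
for label, t_compute, t_comm in [
    ("coarse grained", 1.0, 0.001),
    ("medium grained", 1.0, 0.1),
    ("fine grained", 0.01, 0.1),
]:
    eff = ipc_efficiency(t_compute, t_comm)
    print(f"{label}: efficiency {eff:.3f}")
```

Real codes can overlap communication with computation, and the cost of
communication usually depends on the number of nodes and the network,
so this is only the simplest limiting picture.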

That's the series of things that I'd suggest for a "short course" on
parallel computing.  The ideal is perfectly (embarrassingly) parallel
tasks, completing work at some optimum rate given the architecture.
This ideal performance degrades (but can be optimized by design choices
both within the task and of the cluster) as one moves to tasks that are
not fully independent, that have to communicate between partitioned
subtasks in order to complete.  Some tasks cannot be effectively
parallelized.  Others don't need to be parallelized, just run in
parallel.  The hard part is everything in between, but it helps to
understand the limit points first.  IIRC Ian Foster's (free) online book
on parallel programming is a decent place to learn the theory and even
some of the application, although there are other good resources as
well.

BTW, my next plan is to finish building my household server with a
Sabertooth motherboard holding an 8 core AMD processor, also at 3.4 GHz,
and run it head to head against the i7 (as they end up being about the
same cost in reasonable packaging).  I'm very curious as to whether or
not the AMD can keep up with the scaling of the i7 even with twice as
many cores -- I wouldn't be surprised to learn that it couldn't, as I
EXPECTED to be memory bound even on the i7 (but perhaps its large cache
defeated this on identical tasks).  I'd love to hear if the list has
experience/wisdom concerning the probable performance bottlenecks of
these two processor families, as I've been out of the game for a few
years and am just now coming back into it.

      rgb

>
> Google "parallel fractal generator", and you should find a bunch of hits.
>
> --
> Prentice
>
> On 09/19/2012 04:57 PM, Lux, Jim (337C) wrote:
>> Bringing up an excellent question for "learning to cluster" activities..
>> What would be a good sample program to try.  There was (is?) a MPI version of PoVRAY, as I recall.   It was nice because it's showy and you can easily see if you're getting a speedup.
>> Computing pi isn't very dramatic, especially since most people don't have a feel for how fast it should run.
>>
>> Some sort of n-body code, perhaps?
>>
>> Something that does pattern matching?
>>
>> There's a lot of MPI enabled finite element codes, but a lot don't have a flashy output.
>>
>> And you'd like something that actually makes use of internode communication in a meaningful way (because you could play with reconfiguring it, by plugging and unplugging cables), so embarrassingly parallel isn't as impressive.  (e.g. rendering frames of an animation.. so what if you do it 10 times faster with 10 computers)
>>
>> Jim Lux
>>
>> -----Original Message-----
>> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Bogdan Costescu
>> Sent: Wednesday, September 19, 2012 3:33 AM
>> To: Daniel Kidger
>> Cc: Beowulf at beowulf.org
>> Subject: Re: [Beowulf] FY;) GROMACS on the Raspberry Pi
>>
>> On Tue, Sep 18, 2012 at 10:10 AM, Daniel Kidger<daniel.kidger at gmail.com>  wrote:
>>> I touched on the Gromacs port to ClearSpeed when I worked there - I
>>> then went on to write the port of AMBER to CS plus I have a pair of
>>> RPis that I tinker with.
>> I'm not quite sure what the interest is... GROMACS is quite famous for having non-bonded kernels written in assembler and using features of the modern CPUs, but this is limited to some<snip>
>>
>> and will have a larger power consumption; plus with so many components, the risk of one or more breaking and reducing the overall compute power is quite high. So is it worth it ?
>> (as a scientist I look at it from the perspective of getting useful results from calculations; as a learning experience, it's surely useful, but then running any software using MPI would be)
>>
>> Cheers,
>> Bogdan
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

-------------- next part --------------
A non-text attachment was scrubbed...
Name: scaling_work.eps
Type: application/postscript
Size: 21072 bytes
Desc: Work Scaling on the i7-3770
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20120921/c6028d85/attachment.eps>