[Beowulf] Benchmark between Dell Poweredge 1950 And 1435

Fri Mar 9 05:58:43 PST 2007

On Thu, 8 Mar 2007, Bill Broadley wrote:

>
> As Robert Brown (and others) so eloquently said.  Nothing is better than your 
> actual application with your actual input files in an actual production run.

<blush>

...
> So all the above is just so much handwaving, any of dozens of factors
> could double or halve performance on your application, get out a stop
> watch and run it.  I suspect any number of vendors or even fellow beowulf
> list folks would either run your application code or allow you to run it.

It's worth a small editorial insertion here that I "like" hypertransport
for a variety of reasons -- perhaps because it is a packet-based
internal network that makes your computer's CPU-memory architecture
surprisingly like a cluster in miniature on the inside.  And just like any
compute cluster, you need to tune your design choices towards your
application.

What Bill didn't quite say (nor did I in my previous reply) is that
whenever you consider multiprocessor solutions, from time immemorial,
the issue of memory bus bandwidth and contention thereupon has ALWAYS
been one of the primary node choice issues.  From the oldest of
uniprocessor days (e.g. the original NASA beowulfs and related early
clusters) there have been certain applications -- often ones with a
stream-like memory access pattern, doing e.g. linear algebra and
multiplying/adding lots of large matrices and vectors and so on -- that
are nearly purely memory bound.  That is, if you run them on systems
with differing CPU clock they only WEAKLY scale with that clock, nothing
like twice the speed for twice the clock.
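
To put a rough number on "only WEAKLY scale", here is a toy model, not a
measurement.  It assumes run time is simply compute time plus memory
transfer time with no overlap, and the peak rate (2 flops per cycle) and
memory bandwidth (5 GB/s) are made-up illustrative values, as are the
function names.

```python
def run_time(flops, bytes_moved, clock_hz, flops_per_cycle=2.0, mem_bw=5.0e9):
    """Toy model: compute time plus memory time, with no overlap.

    flops_per_cycle and mem_bw are hypothetical figures, not data
    for any real CPU or memory system.
    """
    compute = flops / (clock_hz * flops_per_cycle)
    memory = bytes_moved / mem_bw
    return compute + memory


def clock_speedup(flops, bytes_moved, clock_hz, factor=2.0):
    """Speedup obtained by multiplying the CPU clock by `factor`."""
    slow = run_time(flops, bytes_moved, clock_hz)
    fast = run_time(flops, bytes_moved, clock_hz * factor)
    return slow / fast


n = 10_000_000

# Stream-like triad a[i] = b[i] + s * c[i]: 2 flops per 24 bytes
# (three doubles per element, ignoring write-allocate traffic).
stream_like = clock_speedup(flops=2 * n, bytes_moved=24 * n, clock_hz=2.0e9)

# Same memory traffic, but 1000x more arithmetic per element.
cpu_bound = clock_speedup(flops=2000 * n, bytes_moved=24 * n, clock_hz=2.0e9)

print(f"stream-like speedup from 2x clock: {stream_like:.2f}")
print(f"cpu-bound   speedup from 2x clock: {cpu_bound:.2f}")
```

Under these assumptions the stream-like case barely moves when the
clock doubles, while the arithmetic-heavy case gets nearly the full
factor of two.  A real system overlaps compute and memory traffic to
some degree, so treat this as a sketch of the trend only.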

There are other applications that scale with CPU clock only, no matter
how many CPUs are sharing the pathway to memory.  These applications
either fit into L2 or have a large compute-to-communicate ratio, period,
where the scaling issues are the same as for a cluster, but now
"communicate" means "communicate with memory".

Then there are applications that span any possible range in between.
They can do some linear algebra for a while (memory bound) and then
settle down and solve a set of ODEs out of cache using the vector of
results (cpu bound) using a bunch of trigonometric function calls (cpu
bound but rates that vary strongly with architecture) to get a vector of
results that is transformed again with linear algebra (memory bound) and
stored away for the next cycle.  If an application like this runs on a
single core, single cpu, single thread system, it obtains some
baseline "optimal" performance for this particular architecture.

If you run the same application on a dual CPU system (single core CPUs)
then it may or may not experience delays because of resource contention.
In the older days, many dual CPU systems used the same memory bus and if
CPU 0 was using it during the linear algebra phase, CPU 1 had to wait if
IT tried to enter ITS linear algebra phase.  Sometimes this waiting
would gradually and naturally push the threads to phases where there was
minimal contention, sometimes not, and obviously what it DID was a
solution to a complex internal discrete-delay-differential process that
could experience e.g. chaos in the contention cycle, a.k.a. "thrashing".
Sometimes for only a certain size of run.

Around six or seven years ago (IIRC) dual CPU manufacturers finally
became clueful and dual CPU systems that were not overtly bottlenecked
on the memory bus (or were a lot less bottlenecked, so you could see a
slowdown only for two streams running at the same time -- most real
world apps had enough of a code mix that they didn't slow noticeably)
started to appear, and life became good.  Hypertransport was a really
lovely solution that effectively gave each CPU its own memory network,
and hence PERFECT scaling, unless CPU0 had to talk to CPU1's memory in
which case you had a performance hit.  But this was relatively easy to
work around by just not writing and scaling apps to do it.  Basically
dual CPU systems were architected like two single CPU systems in a box
with an ultrafast network interconnecting the semi-independent memories
and CPUs and peripherals and were nearly ALWAYS the most cost effective
packaging for clusters as they generally came with dual network
interfaces as well.

The advent of dual cores complicated EVERYthing all over again. In most
ways a dual core resembles an old fashioned dual CPU system -- two CPUs
sharing a memory channel that is deliberately oversubscribed so that IF
the two CPUs are flat out hammering memory, one or the other will
frequently have to wait in line (degrading performance).  So now we have
to go BACK to analyzing apps for contention on the oversubscribed
channel, which occurs in a >>complex<< circumstance when the CPU vs
memory bound fractions of the code, relative to cache size, relative to
clock, exceed certain thresholds that depend not only on your code mix
per application but the SIZE of the run.  You could test your
application with -size=1000 and see no scaling problem, but crank
-size=1000000 and two threads running on a dual core and sharing a
memory bottleneck might suddenly take 130% of the single thread time to
complete.
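
The threshold behavior can be sketched with a deliberately crude model.
It is NOT a predictor of real timings; it only shows how a run can go
from "no scaling problem" to 130% once the working set stops fitting in
cache.  Everything here is an assumption for illustration: the step
function for memory-bound fraction, the 1 MiB cache, and the 0.65
fraction (chosen only so that two threads reproduce the 130% figure).

```python
def memory_fraction(working_set_bytes, cache_bytes, streaming_fraction=0.65):
    """Fraction of single-thread run time spent on main-memory traffic.

    Hypothetical step function: zero while the working set fits in
    cache, `streaming_fraction` once it does not.
    """
    if working_set_bytes <= cache_bytes:
        return 0.0
    return streaming_fraction


def contended_time(threads, mem_fraction):
    """Run time relative to one thread running alone.

    The shared channel must serve threads * mem_fraction units of
    traffic, and no thread can finish in less than 1 unit, so the
    run time is bounded below by the larger of the two.
    """
    return max(1.0, threads * mem_fraction)


CACHE = 1 << 20          # assumed 1 MiB of shared cache
DOUBLE = 8               # bytes per double

small = contended_time(2, memory_fraction(1_000 * DOUBLE, CACHE))
large = contended_time(2, memory_fraction(1_000_000 * DOUBLE, CACHE))

print(f"-size=1000:    {small:.0%} of single-thread time")
print(f"-size=1000000: {large:.0%} of single-thread time")
```

Real code mixes do not switch on a step function, and the two threads
can drift in and out of phase with each other, which is why the actual
number has to come from a stopwatch and not from a formula like this.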

Dual dual cores complicate things even further, per architecture.  Intel
and AMD have very different solutions to the memory bus problem, with
Intel's (IIRC) continuing with a more traditional "bus" vs AMD's "packet
network".  I don't have any idea how the two compare when running four
apps with a variety of colliding or partially colliding memory access patterns and
of sizes that force single threads to use more than 1/4 or 1/2 of the
available memory.  I'm not sure that anybody does -- maybe the compiler
folks, or somebody with a powerful compulsion to benchmark code mixes.
And it just isn't possible to >>compute<< or >>estimate<< this, the only
way to figure it out is to measure it.

The final moral of the story is:  DO NOT assume that dual dual core
systems are the "right" (cost-benefit optimal) node architecture.  It
might be single processor units, or more likely (since dual processor
single cores are nowadays close to two single processors in a box) dual
single cores.  And we haven't even STARTED on actual parallel code,
where those node processes have to communicate with others on the
network, where you have to consider several other bottlenecks.  There
one has to add to the mix (for example) how many processors share a
network interface, and how well a multiprocessor, multicore system can
handle multiple network interfaces.  New bottlenecks emerge, new
nonlinearities take their toll, as one tries to figure out how the
network interfaces integrate with the kernel, the memory subsystem, the
application, the virtual machine library.

I would not be at all surprised to find that dual single cores are
overwhelmingly optimal for certain tasks, relative to dual duals.  I
would expect dual duals to be optimal for CPU bound coarse grained to
embarrassingly parallel tasks with a standard gigabit interface shared
across two or more cores; adjust granularity down as one invests more in
the network, until you hit a nonlinear neck where the dual cores overrun
the network, period.  At that point, fewer processors per network per bus
will likely lead to better performance scaling (and maybe better COST
scaling, which is what really matters).

> For a wide mix of applications in the past I've leaned towards AMD because
> my real world testing showed AMD usually won.  The gap has closed 
> significantly in the last year (it used to be so embarrassing).  Today
> I'd call it mostly a wash.  Things are shaping up to be pretty interesting,
> AMD has the opportunity to take a commanding lead with their next generation
> chip which rumors claim will be shipping this summer.

AMD won twice over -- they were faster AND cheaper.  Even now with
performance coming in more evenly relative to clock (at least for
certain task mixes and scales) I'm not sure that AMD isn't solidly ahead
in price/performance as Intel likes to charge that premium for the Intel
name (and they get it, too, from the HA corporate server crowd where
folks feel a bit uncertain about whether or not those apps will run
without bugs on AMD -- Intel isn't above a certain amount of delicate
FUD and plays the Microsoft Game with their compiler etc. to the extent
that they can).

> The bad news is that while AMD's next generation promises dramatically better
> work done per cycle, the memory system doesn't look like it's going to
> get much (if any) more memory bandwidth.

Memory has for years now been the rate limiting factor for a large block
of large scale computational tasks.  All sorts of games are played to
hide its intrinsically poor performance from the user -- large caches,
prefetch and so on.  These systems are tuned to particular KINDS of
memory access pattern and speed memory intensive applications up to the
extent that they match those patterns.  It is really instructive and
amusing to see what happens when those patterns are deliberately defeated,
though (as they might well be in certain classes of application).  In my
"benchmaster" program -- which I'm not advertising at this point as I
haven't worked on it for a year or so, although it does work AFAIK --
there is a memory test that basically does a completely random access
pattern -- it literally fills a user-selectable block of memory with
shuffled addresses into that block and then follows the shuffled list of
addresses through the block, visiting every site in random order.

Boy does THAT slow memory down.  Suddenly you see the REAL speed of
memory relative to CPU when you eliminate all the parallelized branch
predictive prefetchy kind of stuff, when L2 no longer helps, when every
access is UNlikely to be in cache or in the process of being fetched to
cache instead of LIKELY to be in cache.
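
A minimal sketch of that kind of shuffled-address walk follows.  This is
not the benchmaster source; the function names are made up for the
example, and the use of Sattolo's algorithm (which guarantees that the
shuffled links form one single cycle, so the walk really does visit
every slot) is my own choice for the sketch.  It is in Python for
brevity, where interpreter overhead hides most of the cache effect; a
real measurement would want the same construction in C.

```python
import random
import time
from array import array


def build_chain(n, seed=0):
    """Return links such that following them from slot 0 visits all n slots.

    Sattolo's algorithm: like a Fisher-Yates shuffle, but j is always
    strictly less than i, which yields a single cycle of length n.
    """
    rng = random.Random(seed)
    nxt = array("l", range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def sequential_chain(n):
    """Control case: slot i links to slot i + 1 (wrapping around)."""
    return array("l", [(i + 1) % n for i in range(n)])


def chase(nxt, steps):
    """Follow the links for `steps` hops; each load depends on the last."""
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i


def time_chase(nxt):
    """Time one full pass over the chain; returns (seconds, final slot)."""
    start = time.perf_counter()
    end_slot = chase(nxt, len(nxt))
    return time.perf_counter() - start, end_slot


if __name__ == "__main__":
    n = 200_000
    t_seq, _ = time_chase(sequential_chain(n))
    t_rnd, _ = time_chase(build_chain(n))
    print(f"sequential: {t_seq / n * 1e9:.1f} ns/hop")
    print(f"shuffled:   {t_rnd / n * 1e9:.1f} ns/hop")
```

Because each hop depends on the result of the previous one, the
hardware cannot start the next fetch early, which is what defeats the
prefetching.  Make the block much larger than L2 to see the effect.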

This sort of "antistream" test is very interesting because, of course,
there are applications out there that have this sort of access pattern,
e.g. many simulation programs, problems with a high and fragmented
dimensionality.  Not all code is streaming local vector access.  When a
memory intensive program with one of the antipatterns of streaming
access tries to run on an architecture heavily optimized for streaming
access, NOBODY knows how things will perform because the vendors would
rather pierce their own ears with a hole punch than tell you how slow
random access memory is when it is accessed randomly.

So don't find this all out the hard way (by buying any particular
architecture without studying your own code and the competing
architectures and running judicious tests and microbenchmarks to study
the relative bottleneck performance in relevant performance dimensions).
It is easy in this game to make $50K mistakes (and up) when you're
disposing of hundreds of thousands of dollars, and that's real money.
It can even be much more than that over the lifetime of a cluster --
a 10% slower solution than you COULD have gotten for your money over a 3
year expected lifetime of a cluster is a third of a year wasted -- all
the human time and salary, all the value of the work that could have
been done, all the additional infrastructure costs -- $50K is a LOW
estimate for the salaries alone for many projects.  Cost benefit
analysis rules.  Do it well and prosper.  Do it poorly and one can
wither and die.
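
The arithmetic behind that estimate, spelled out.  The 3 year lifetime
and 10% figure are the ones used above; the annual salary figure is a
hypothetical input picked only to show how quickly one reaches $50K.

```python
def wasted_years(lifetime_years, slowdown_fraction):
    """Cluster time lost to a solution slower by `slowdown_fraction`."""
    return lifetime_years * slowdown_fraction


def wasted_cost(lifetime_years, slowdown_fraction, annual_cost):
    """Cost of the lost time at a given annual cost (salaries etc.)."""
    return wasted_years(lifetime_years, slowdown_fraction) * annual_cost


lost = wasted_years(3.0, 0.10)

# annual_cost is an assumed figure for a small project's salaries.
cost = wasted_cost(3.0, 0.10, annual_cost=170_000.0)

print(f"time lost: {lost:.2f} years")
print(f"salary cost of that time: ${cost:,.0f}")
```

Here 10% of 3 years is 0.3 years, roughly a third of a year, and at the
assumed salary total that alone comes to about $51K before counting
infrastructure or the value of the work not done.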

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu