[Beowulf] Benchmark between Dell Poweredge 1950 And 1435

Mike Davis jmdavis1 at vcu.edu
Fri Mar 9 06:35:33 PST 2007


As usual, excellent information RGB.

The only thing that I might add, which you alluded to, is that even given 
the same processor and memory, an application may or may not be faster 
on one machine as opposed to another. A quality motherboard can offer 
increased performance.

Then there's the housekeeping matter of keeping it cool. Can the fans 
move enough air to keep a rack full of X running efficiently? If you 
have to fill racks only halfway or create 6 to 10 ft space buffers, you 
make things more difficult from a practical standpoint when scaling to 
larger numbers of racks. Failing to consider these matters can lead to 
an excess of failed nodes down the road.


Mike Davis

Robert G. Brown wrote:

> On Thu, 8 Mar 2007, Bill Broadley wrote:
>
>>
>> As Robert Brown (and others) so eloquently said.  Nothing is better 
>> than your actual application with your actual input files in an 
>> actual production run.
>
>
> <blush>
>
> ...
>
>> So all the above is just so much handwaving: any of dozens of factors
>> could double or halve performance on your application, so get out a
>> stopwatch and run it.  I suspect any number of vendors or even fellow
>> beowulf list folks would either run your application code or allow you
>> to run it.
>
>
> It's worth a small editorial insertion here that I "like" hypertransport
> for a variety of reasons -- perhaps because it is a packet-based
> internal network that makes your computer's CPU-memory architecture
> surprisingly like a cluster in miniature on the inside.  And just like any
> compute cluster, you need to tune your design choices towards your
> application.
>
> What Bill didn't quite say (nor did I in my previous reply) is that
> whenever you consider a multiprocessor solution -- from time immemorial
> on -- the issue of memory bus bandwidth and contention thereupon has ALWAYS
> been one of the primary node choice issues.  From the oldest of
> uniprocessor days (e.g. the original NASA beowulfs and related early
> clusters) there have been certain applications -- often ones with a
> stream-like memory access pattern, doing e.g. linear algebra and
> multiplying/adding lots of large matrices and vectors and so on -- that
> are nearly purely memory bound.  That is, if you run them on systems
> with differing CPU clocks, they scale only WEAKLY with that clock --
> nothing like twice the speed for twice the clock.
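
To make that kind of kernel concrete, here is a rough sketch (nobody's
actual benchmark code, and the array sizes are arbitrary) of a "triad"
style loop over arrays far larger than cache -- about as purely memory
bound as it gets:

    /* Triad: a[i] = b[i] + s*c[i].  Two loads and one store per
       iteration, almost no arithmetic -- runtime tracks memory
       bandwidth, not CPU clock. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (16 * 1024 * 1024)   /* 16M doubles per array, ~128 MB each */

    static void triad(double *a, const double *b, const double *c, double s)
    {
        for (long i = 0; i < N; i++)
            a[i] = b[i] + s * c[i];
    }

    int main(void)
    {
        double *a = malloc((size_t)N * sizeof *a);
        double *b = malloc((size_t)N * sizeof *b);
        double *c = malloc((size_t)N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
        triad(a, b, c, 3.0);          /* wrap this call with a timer */
        printf("a[0] = %f\n", a[0]);  /* keep the work from being optimized away */
        free(a); free(b); free(c);
        return 0;
    }

Run something like that on two boxes that differ only in CPU clock and,
as RGB says, the times barely move.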
>
> There are other applications that scale with CPU clock only, no matter
> how many CPUs are sharing the pathway to memory.  These applications
> either fit into L2 or have a large compute-to-communicate ratio; the
> scaling issues are the same as for a cluster, except that now
> "communicate" means "communicate with memory".
>
> Then there are applications that span any possible range in between.
> They can do some linear algebra for a while (memory bound) and then
> settle down and solve a set of ODEs out of cache using the vector of
> results (cpu bound) using a bunch of trigonometric function calls (cpu
> bound, but at rates that vary strongly with architecture) to get a vector of
> results that is transformed again with linear algebra (memory bound) and
> stored away for the next cycle.  If an application like this runs on a
> single core, single cpu, single thread system, it obtains some
> baseline "optimal" performance for this particular architecture.
>
> If you run the same application on a dual CPU system (single core CPUs)
> then it may or may not experience delays because of resource contention.
> In the older days, many dual CPU systems used the same memory bus and if
> CPU 0 was using it during the linear algebra phase, CPU 1 had to wait if
> IT tried to enter ITS linear algebra phase.  Sometimes this waiting
> would gradually and naturally push the threads to phases where there was
> minimal contention, sometimes not; what it actually DID was the
> solution of a complex internal discrete-delay-differential process that
> could exhibit e.g. chaos in the contention cycle, a.k.a. "thrashing" --
> sometimes only for a certain size of run.
>
> Around six or seven years ago (IIRC) dual CPU manufacturers finally
> became clueful and dual CPU systems that were not overtly bottlenecked
> on the memory bus (or were a lot less bottlenecked, so you could see a
> slowdown only for two streams running at the same time -- most real
> world apps had enough of a code mix that they didn't slow down noticeably)
> started to appear, and life became good.  Hypertransport was a really
> lovely solution that effectively gave each CPU its own memory network,
> and hence PERFECT scaling, unless CPU0 had to talk to CPU1's memory in
> which case you had a performance hit.  But this was relatively easy to
> work around by just not writing and scaling apps to do it.  Basically
> dual CPU systems were architected like two single CPU systems in a box
> with an ultrafast network interconnecting the semi-independent memories
> and CPUs and peripherals and were nearly ALWAYS the most cost effective
> packaging for clusters as they generally came with dual network
> interfaces as well.
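
On Linux, one way to keep a thread and its data on the same
hypertransport node -- so that memory traffic never crosses to the other
CPU's memory controller -- is libnuma.  A rough sketch of the idea
(assuming libnuma is installed on the node; link with -lnuma, error
handling mostly omitted):

    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this kernel\n");
            return 1;
        }

        int node = 0;
        numa_run_on_node(node);              /* run on node 0's CPUs only */

        size_t bytes = 64UL * 1024 * 1024;
        double *a = numa_alloc_onnode(bytes, node);  /* node 0's local memory */
        if (!a) return 1;

        for (size_t i = 0; i < bytes / sizeof *a; i++)
            a[i] = 0.0;                      /* every access stays node-local */

        numa_free(a, bytes);
        return 0;
    }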
>
> The advent of dual cores complicated EVERYthing all over again. In most
> ways a dual core resembles an old fashioned dual CPU system -- two CPUs
> sharing a memory channel that is deliberately oversubscribed so that IF
> the two CPUs are flat out hammering memory, one or the other will
> frequently have to wait in line (degrading performance).  So now we have
> to go BACK to analyzing apps for contention on the oversubscribed
> channel, which occurs in a >>complex<< circumstance when the CPU vs
> memory bound fractions of the code, relative to cache size, relative to
> clock, exceed certain thresholds that depend not only on your code mix
> per application but also on the SIZE of the run.  You could test your
> application with -size=1000 and see no scaling problem, but crank
> -size=1000000 and two threads running on a dual core and sharing a
> memory bottleneck might suddenly take 130% of the single thread time to
> complete.
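
The only way to find that threshold on a given box is to time it.  A
minimal sketch, using OpenMP, of the sort of two-core contention test
being described (the kernel and sizes are just placeholders for your
real code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (16L * 1024 * 1024)   /* doubles per thread, well past cache */

    /* A deliberately memory-hungry sweep; stands in for your real kernel. */
    static double sweep(double *a)
    {
        double sum = 0.0;
        for (long i = 0; i < N; i++) {
            a[i] = a[i] * 1.0001 + 1.0;
            sum += a[i];
        }
        return sum;
    }

    int main(void)
    {
        for (int nthreads = 1; nthreads <= 2; nthreads++) {
            double t0 = omp_get_wtime();
            #pragma omp parallel num_threads(nthreads)
            {
                double *a = calloc(N, sizeof *a);    /* private block per thread */
                if (a) {
                    volatile double sink = sweep(a); /* defeat dead-code elimination */
                    (void)sink;
                    free(a);
                }
            }
            printf("%d thread(s): %.2f s\n", nthreads, omp_get_wtime() - t0);
        }
        return 0;
    }

Compile with something like gcc -O2 -fopenmp.  If the two-thread run
takes much longer than the one-thread run, the cores are fighting over
the shared memory channel at that problem size.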
>
> Dual dual cores complicate things even further, per architecture.  Intel
> and AMD have very different solutions to the memory bus problem, with
> Intel's (IIRC) continuing with a more traditional "bus" vs AMD's "packet
> network".  I don't have any idea how the two compare when running four
> apps with a variety of colliding or partially colliding memory access
> patterns, and of sizes that force single threads to use more than 1/4 or
> 1/2 of the
> available memory.  I'm not sure that anybody does -- maybe the compiler
> folks, or somebody with a powerful compulsion to benchmark code mixes.
> And it just isn't possible to >>compute<< or >>estimate<< this, the only
> way to figure it out is to measure it.
>
> The final moral of the story is:  DO NOT assume that dual dual core
> systems are the "right" (cost-benefit optimal) node architecture.  It
> might be single processor units, or more likely (since dual processor
> single cores are nowadays close to two single processors in a box) dual
> single cores.  And we haven't even STARTED on actual parallel code,
> where those node processes have to communicate with others on the
> network, where you have to consider several other bottlenecks.  There
> one has to add to the mix (for example) how many processors share a
> network interface, and how well a multiprocessor, multicore system can
> handle multiple network interfaces.  New bottlenecks emerge, new
> nonlinearities take their toll, as one tries to figure out how the
> network interfaces integrate with the kernel, the memory subsystem, the
> application, the virtual machine library.
>
> I would not be at all surprised to find that dual single cores are
> overwhelmingly optimal for certain tasks, relative to dual duals.  I
> would expect dual duals to be optimal for CPU bound coarse grained to
> embarrassingly parallel tasks with a standard gigabit interface shared
> across two or more cores, adjust granularity down as one invests more in
> the network, until you hit a nonlinear bottleneck where the dual cores
> overrun the network, period.  At that point, fewer processors per network
> per bus
> will likely lead to better performance scaling (and maybe better COST
> scaling, which is what really matters).
>
>> For a wide mix of applications in the past I've leaned towards AMD 
>> because
>> my real world testing showed AMD usually won.  The gap has closed 
>> significantly in the last year (it used to be so embarrassing).  Today
>> I'd call it mostly a wash.  Things are shaping up to be pretty 
>> interesting;
>> AMD has the opportunity to take a commanding lead with their next 
>> generation
>> chip which rumors claim will be shipping this summer.
>
>
> AMD won twice over -- they were faster AND cheaper.  Even now with
> performance coming in more evenly relative to clock (at least for
> certain task mixes and scales) I'm not sure that AMD isn't solidly ahead
> in price/performance as Intel likes to charge that premium for the Intel
> name (and they get it, too, from the HA corporate server crowd where
> folks feel a bit uncertain about whether or not those apps will run
> without bugs on AMD -- Intel isn't above a certain amount of delicate
> FUD and playing the Microsoft Game with their compiler, etc., to the extent
> that they can).
>
>> The bad news is that while AMD's next generation promises 
>> dramatically better
>> work done per cycle, the memory system doesn't look like it's going to
>> get much (if any) more memory bandwidth.
>
>
> Memory has for years now been the rate limiting factor for a large block
> of large scale computational tasks.  All sorts of games are played to
> hide its intrinsically poor performance from the user -- large caches,
> prefetch and so on.  These systems are tuned to particular KINDS of
> memory access pattern and speed up memory intensive applications to the
> extent that they match those patterns.  It is really instructive and
> amusing to see what happens when those patterns are deliberately defeated,
> though (as they might well be in certain classes of application).  In my
> "benchmaster" program -- which I'm not advertising at this point as I
> haven't worked on it for a year or so, although it does work AFAIK --
> there is a memory test that basically does a completely random access
> pattern -- it literally fills a user-selectable block of memory with
> shuffled addresses into that block and then follows the shuffled list of
> addresses through the block, visiting every site in random order.
>
> Boy does THAT slow memory down.  Suddenly you see the REAL speed of
> memory relative to CPU when you eliminate all the parallelized branch
> predictive prefetchy kind of stuff, when L2 no longer helps, when every
> access is UNlikely to be in cache or in the process of being fetched to
> cache instead of LIKELY to be in cache.
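
The idea is easy enough to sketch for anyone who wants to see it on
their own hardware.  This is only an illustration of the access pattern
described above, not the actual benchmaster code:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 64UL * 1024 * 1024 / sizeof(size_t);   /* 64 MB block */
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;

        /* Shuffle the block into one big random cycle over 0..n-1
           (Sattolo's algorithm; a crude RNG is fine for a sketch). */
        for (size_t i = 0; i < n; i++) next[i] = i;
        srand(12345);
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;     /* j < i */
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        /* Chase the chain: every load depends on the previous one, so
           prefetch and cache get almost no traction. */
        size_t p = 0;
        for (size_t k = 0; k < n; k++)
            p = next[p];                       /* wrap this loop with a timer */

        printf("final index %zu\n", p);        /* defeat dead-code elimination */
        free(next);
        return 0;
    }

Time the chase loop and compare it with a plain sequential sweep over
the same block; the gap is the "REAL speed of memory" once prefetch and
cache stop helping.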
>
> This sort of "antistream" test is very interesting because, of course,
> there are applications out there that have this sort of access pattern,
> e.g. many simulation programs, problems with a high and fragmented
> dimensionality.  Not all code is streaming local vector access.  When a
> memory intensive program with one of the antipatterns of streaming
> access tries to run on an architecture heavily optimized for streaming
> access, NOBODY knows how things will perform because the vendors would
> rather pierce their own ears with a hole punch than tell you how slow
> random access memory is when it is accessed randomly.
>
> So don't find this all out the hard way (by buying any particular
> architecture without studying your own code and the competing
> architectures and running judicious tests and microbenchmarks to study
> the relative bottleneck performance in relevant performance dimensions).
> It is easy in this game to make $50K mistakes (and up) when you're
> disposing of hundreds of thousands of dollars, and that's real money.
> It can even be much more than that over the lifetime of a cluster --
> a 10% slower solution than you COULD have gotten for your money over a 3
> year expected lifetime of a cluster is a third of a year wasted -- all
> the human time and salary, all the value of the work that could have
> been done, all the additional infrastructure costs -- $50K is a LOW
> estimate for the salaries alone for many projects.  Cost benefit
> analysis rules.  Do it well and prosper.  Do it poorly and one can
> wither and die.
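
Working the arithmetic: 10% of a 3 year lifetime is about 0.3 years of
lost output, so with (say) $170K/year of project salaries -- an
illustrative figure only -- that is roughly $50K in salary alone, before
counting the science that didn't get done.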
>
>   rgb
>
>>
>>
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>



