[Beowulf] Benchmark between Dell Poweredge 1950 And 1435
Mike Davis jmdavis1 at vcu.edu
Fri Mar 9 06:35:33 PST 2007
As usual, excellent information, RGB. The only thing that I might add,
which you alluded to, is that even given the same processor and memory,
an application may or may not be faster on one machine as opposed to
another. A quality motherboard can offer you increased performance.

Then there's the housekeeping matter of keeping it cool. Can the fans
move enough air to keep a rack full of X running efficiently? If you
have to fill racks only half way or create 6 to 10 ft space buffers, you
make things more difficult from a practical standpoint when scaling to
larger numbers of racks. Failing to consider these matters can lead to
excess failed nodes down the road.

Mike Davis

Robert G. Brown wrote:

> On Thu, 8 Mar 2007, Bill Broadley wrote:
>
>> As Robert Brown (and others) so eloquently said. Nothing is better
>> than your actual application with your actual input files in an
>> actual production run.
>
> <blush>
>
> ...
>
>> So all the above is just so much handwaving, any of dozens of factors
>> could double or halve performance on your application, get out a stop
>> watch and run it. I suspect any number of vendors or even fellow
>> beowulf list folks would either run your application code or allow
>> you to run it.
>
> It's worth a small editorial insertion here that I "like" hypertransport
> for a variety of reasons -- perhaps because it is a packet-based
> internal network that makes your computer's CPU-memory architecture
> surprisingly like a cluster in miniature on the inside. And just like
> any compute cluster, you need to tune your design choices towards your
> application.
>
> What Bill didn't quite say (nor me in my previous reply) is that
> whenever you consider multiprocessor solutions, from time immemorial,
> the issue of memory bus bandwidth and contention thereupon has ALWAYS
> been one of the primary node choice issues. From the oldest of
> uniprocessor days (e.g.
> the original NASA beowulfs and related early clusters) there have been
> certain applications -- often ones with a stream-like memory access
> pattern, doing e.g. linear algebra and multiplying/adding lots of large
> matrices and vectors and so on -- that are nearly purely memory bound.
> That is, if you run them on systems with differing CPU clock they only
> WEAKLY scale with that clock, nothing like twice the speed for twice
> the clock.
>
> There are other applications that scale with CPU clock only, no matter
> how many CPUs are sharing the pathway to memory. These applications
> either fit into L2 or have a large compute-to-communicate ratio period
> where the scaling issues are the same as for a cluster, but now
> "communicate" means "communicate with memory".
>
> Then there are applications that span any possible range in between.
> They can do some linear algebra for a while (memory bound) and then
> settle down and solve a set of ODEs out of cache using the vector of
> results (cpu bound) using a bunch of trigonometric function calls (cpu
> bound but rates that vary strongly with architecture) to get a vector of
> results that is transformed again with linear algebra (memory bound) and
> stored away for the next cycle. If an application like this runs on a
> single core, single cpu, single thread system, it obtains some
> baseline "optimal" performance for this particular architecture.
>
> If you run the same application on a dual CPU system (single core CPUs)
> then it may or may not experience delays because of resource contention.
> In the older days, many dual CPU systems used the same memory bus and if
> CPU 0 was using it during the linear algebra phase, CPU 1 had to wait if
> IT tried to enter ITS linear algebra phase.
> Sometimes this waiting would gradually and naturally push the threads
> to phases where there was minimal contention, sometimes not, and
> obviously what it DID was a solution to a complex internal
> discrete-delay-differential process that could experience e.g. chaos in
> the contention cycle, a.k.a. "thrashing". Sometimes for only a certain
> size of run.
>
> Around six or seven years ago (IIRC) dual CPU manufacturers finally
> became clueful and dual CPU systems that were not overtly bottlenecked
> on the memory bus (or were a lot less bottlenecked, so you could see a
> slowdown only for two streams running at the same time -- most real
> world apps had enough of a code mix that they didn't slow noticeably)
> started to appear, and life became good. Hypertransport was a really
> lovely solution that effectively gave each CPU its own memory network,
> and hence PERFECT scaling, unless CPU0 had to talk to CPU1's memory in
> which case you had a performance hit. But this was relatively easy to
> work around by just not writing and scaling apps to do it. Basically
> dual CPU systems were architected like two single CPU systems in a box
> with an ultrafast network interconnecting the semi-independent memories
> and CPUs and peripherals and were nearly ALWAYS the most cost effective
> packaging for clusters as they generally came with dual network
> interfaces as well.
>
> The advent of dual cores complicated EVERYthing all over again. In most
> ways a dual core resembles an old fashioned dual CPU system -- two CPUs
> sharing a memory channel that is deliberately oversubscribed so that IF
> the two CPUs are flat out hammering memory, one or the other will
> frequently have to wait in line (degrading performance).
> So now we have to go BACK to analyzing apps for contention on the
> oversubscribed channel, which occurs in a >>complex<< circumstance when
> the CPU vs memory bound fractions of the code, relative to cache size,
> relative to clock, exceed certain thresholds that depend not only on
> your code mix per application but the SIZE of the run. You could test
> your application with -size=1000 and see no scaling problem, but crank
> -size=1000000 and two threads running on a dual core and sharing a
> memory bottleneck might suddenly take 130% of the single thread time to
> complete.
>
> Dual dual cores complicate things even further, per architecture. Intel
> and AMD have very different solutions to the memory bus problem, with
> Intel's (IIRC) continuing with a more traditional "bus" vs AMD's "packet
> network". I don't have any idea how the two compare when running four
> apps with a variety of colliding or partially colliding memory access
> patterns and of sizes that force single threads to use more than 1/4 or
> 1/2 of the available memory. I'm not sure that anybody does -- maybe
> the compiler folks, or somebody with a powerful compulsion to benchmark
> code mixes. And it just isn't possible to >>compute<< or >>estimate<<
> this, the only way to figure it out is to measure it.
>
> The final moral of the story is: DO NOT assume that dual dual core
> systems are the "right" (cost-benefit optimal) node architecture. It
> might be single processor units, or more likely (since dual processor
> single cores are nowadays close to two single processors in a box) dual
> single cores. And we haven't even STARTED on actual parallel code,
> where those node processes have to communicate with others on the
> network, where you have to consider several other bottlenecks. There
> one has to add to the mix (for example) how many processors share a
> network interface, and how well a multiprocessor, multicore system can
> handle multiple network interfaces.
> New bottlenecks emerge, new nonlinearities take their toll, as one
> tries to figure out how the network interfaces integrate with the
> kernel, the memory subsystem, the application, the virtual machine
> library.
>
> I would not be at all surprised to find that dual single cores are
> overwhelmingly optimal for certain tasks, relative to dual duals. I
> would expect dual duals to be optimal for CPU bound coarse grained to
> embarrassingly parallel tasks with a standard gigabit interface shared
> across two or more cores, adjust granularity down as one invests more in
> the network, until you hit a nonlinear neck where the dual cores overrun
> the network, period. At that point, fewer processors per network per
> bus will likely lead to better performance scaling (and maybe better
> COST scaling, which is what really matters).
>
>> For a wide mix of applications in the past I've leaned towards AMD
>> because my real world testing showed AMD usually won. The gap has
>> closed significantly in the last year (it used to be so embarrassing).
>> Today I'd call it mostly a wash. Things are shaping up to be pretty
>> interesting, AMD has the opportunity to take a commanding lead with
>> their next generation chip which rumors claim will be shipping this
>> summer.
>
> AMD won twice over -- they were faster AND cheaper. Even now with
> performance coming in more evenly relative to clock (at least for
> certain task mixes and scales) I'm not sure that AMD isn't solidly ahead
> in price/performance as Intel likes to charge that premium for the Intel
> name (and they get it, too, from the HA corporate server crowd where
> folks feel a bit uncertain about whether or not those apps will run
> without bugs on AMD -- Intel isn't above a certain amount of delicate
> FUD and plays the Microsoft Game with their compiler etc to the extent
> that they can).
>
>> The bad news is that while AMD's next generation promises
>> dramatically better work done per cycle, the memory system doesn't
>> look like it's going to get much (if any) more memory bandwidth.
>
> Memory has for years now been the rate limiting factor for a large block
> of large scale computational tasks. All sorts of games are played to
> hide its intrinsically poor performance from the user -- large caches,
> prefetch and so on. These systems are tuned to particular KINDS of
> memory access pattern and speed memory intensive applications up to the
> extent that they match those patterns. It is really instructive and
> amusing to see what happens when those patterns are deliberately
> defeated, though (as they might well be in certain classes of
> application). In my "benchmaster" program -- which I'm not advertising
> at this point as I haven't worked on it for a year or so, although it
> does work AFAIK -- there is a memory test that basically does a
> completely random access pattern -- it literally fills a
> user-selectable block of memory with shuffled addresses into that block
> and then follows the shuffled list of addresses through the block,
> visiting every site in random order.
>
> Boy does THAT slow memory down. Suddenly you see the REAL speed of
> memory relative to CPU when you eliminate all the parallelized branch
> predictive prefetchy kind of stuff, when L2 no longer helps, when every
> access is UNlikely to be in cache or in the process of being fetched to
> cache instead of LIKELY to be in cache.
>
> This sort of "antistream" test is very interesting because, of course,
> there are applications out there that have this sort of access pattern,
> e.g. many simulation programs, problems with a high and fragmented
> dimensionality. Not all code is streaming local vector access.
> When a memory intensive program with one of the antipatterns of
> streaming access tries to run on an architecture heavily optimized for
> streaming access, NOBODY knows how things will perform because the
> vendors would rather pierce their own ears with a hole punch than tell
> you how slow random access memory is when it is accessed randomly.
>
> So don't find this all out the hard way (by buying any particular
> architecture without studying your own code and the competing
> architectures and running judicious tests and microbenchmarks to study
> the relative bottleneck performance in relevant performance dimensions).
> It is easy in this game to make $50K mistakes (and up) when you're
> disposing of hundreds of thousands of dollars, and that's real money.
> It can even be much more than that over the lifetime of a cluster --
> a 10% slower solution than you COULD have gotten for your money over a 3
> year expected lifetime of a cluster is a third of a year wasted -- all
> the human time and salary, all the value of the work that could have
> been done, all the additional infrastructure costs -- $50K is a LOW
> estimate for the salaries alone for many projects. Cost benefit
> analysis rules. Do it well and prosper. Do it poorly and one can
> wither and die.
>
> rgb
>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
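
Editorial sketch of the "antistream" test RGB describes above. This is a
minimal illustration in Python, not benchmaster's actual code (which does
not appear in this thread). It fills a block with a shuffled chain of
indices into itself and then follows the chain, so every access depends
on the previous one and lands somewhere unpredictable. The function
names, the seed, the block size, and the choice of Sattolo's algorithm
(used here so the chain is guaranteed to visit every slot exactly once)
are all assumptions made for the example. Python's interpreter overhead
compresses the gap between the two walks; a compiled version would
likely show it more starkly.

```python
import random
import time
from array import array


def sequential_chain(n):
    """Chain where slot i points to slot i+1 (wrapping): a stream-like walk."""
    return array("q", [(i + 1) % n for i in range(n)])


def shuffled_chain(n, seed=0):
    """Chain forming one random cycle through all n slots (Sattolo's algorithm)."""
    rng = random.Random(seed)
    chain = array("q", range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, which is what forces a single cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` hops from slot 0; return (end slot, seconds)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]
    return idx, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical block size: 2**21 eight-byte slots, about 16 MB.
    # Raise it until the block is well beyond your last-level cache.
    n = 1 << 21
    for name, build in (("sequential", sequential_chain),
                        ("shuffled", shuffled_chain)):
        chain = build(n)
        end, secs = chase(chain, n)
        print(f"{name:10s} {secs / n * 1e9:8.1f} ns per access (ended at slot {end})")
```

Run it at several block sizes, from well inside cache to well outside
it, and compare the per-access times of the two walks. In keeping with
the thread's advice, treat the numbers as a measurement of your own
hardware only, not as a prediction for any real application.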
