[Beowulf] Benchmark between Dell Poweredge 1950 And 1435
Robert G. Brown rgb at phy.duke.edu
Fri Mar 9 05:58:43 PST 2007
On Thu, 8 Mar 2007, Bill Broadley wrote:

> As Robert Brown (and others) so eloquently said. Nothing is better than your
> actual application with your actual input files in an actual production run.

<blush>

...

> So all the above is just so much handwaving, any of dozens of factors
> could double or halve performance on your application, get out a stop
> watch and run it. I suspect any number of vendors or even fellow beowulf
> list folks would either run your application code or allow you to run it.

It's worth a small editorial insertion here that I "like" hypertransport for a variety of reasons -- perhaps because it is a packet-based internal network that makes your computer's CPU-memory architecture surprisingly like a cluster in miniature on the inside. And just like any compute cluster, you need to tune your design choices towards your application.

What Bill didn't quite say (nor did I in my previous reply) is that for as long as multiprocessor solutions have existed, memory bus bandwidth and contention for it have ALWAYS been among the primary node choice issues.

Even in the oldest uniprocessor days (e.g. the original NASA beowulfs and related early clusters) there were certain applications -- often ones with a stream-like memory access pattern, doing e.g. linear algebra and multiplying/adding lots of large matrices and vectors and so on -- that are nearly purely memory bound. That is, if you run them on systems with differing CPU clock they scale only WEAKLY with that clock, nothing like twice the speed for twice the clock.

There are other applications that scale with CPU clock only, no matter how many CPUs are sharing the pathway to memory. These applications either fit into L2 or have a large compute-to-communicate ratio, period; the scaling issues are then the same as for a cluster, but now "communicate" means "communicate with memory".

Then there are applications that span any possible range in between.
They can do some linear algebra for a while (memory bound), then settle down and solve a set of ODEs out of cache using the vector of results (cpu bound), using a bunch of trigonometric function calls (cpu bound, but at rates that vary strongly with architecture), to get a vector of results that is transformed again with linear algebra (memory bound) and stored away for the next cycle.

If an application like this runs on a single core, single cpu, single thread system, it obtains some baseline "optimal" performance for this particular architecture. If you run the same application on a dual CPU system (single core CPUs) then it may or may not experience delays because of resource contention. In the older days, many dual CPU systems shared a single memory bus, and if CPU 0 was using it during the linear algebra phase, CPU 1 had to wait if IT tried to enter ITS linear algebra phase. Sometimes this waiting would gradually and naturally push the threads into phases where there was minimal contention, sometimes not. What it actually DID was the solution to a complex internal discrete-delay-differential process that could exhibit e.g. chaos in the contention cycle, a.k.a. "thrashing" -- sometimes only for a certain size of run.

Around six or seven years ago (IIRC) dual CPU manufacturers finally became clueful, and dual CPU systems that were not overtly bottlenecked on the memory bus started to appear (or were a lot less bottlenecked, so that you would see a slowdown only with two streams running at the same time -- most real world apps had enough of a code mix that they didn't slow noticeably), and life became good. Hypertransport was a really lovely solution that effectively gave each CPU its own memory network, and hence PERFECT scaling, unless CPU0 had to talk to CPU1's memory, in which case you took a performance hit. But this was relatively easy to work around by just not writing and scaling apps to do it.
Basically, dual CPU systems were architected like two single CPU systems in a box, with an ultrafast network interconnecting the semi-independent memories, CPUs and peripherals. They were nearly ALWAYS the most cost effective packaging for clusters, as they generally came with dual network interfaces as well.

The advent of dual cores complicated EVERYthing all over again. In most ways a dual core resembles an old fashioned dual CPU system -- two CPUs sharing a memory channel that is deliberately oversubscribed, so that IF the two CPUs are flat out hammering memory, one or the other will frequently have to wait in line (degrading performance). So now we have to go BACK to analyzing apps for contention on the oversubscribed channel. That contention occurs in a >>complex<< circumstance: when the CPU bound vs memory bound fractions of the code, relative to cache size and relative to clock, exceed certain thresholds that depend not only on your code mix per application but on the SIZE of the run. You could test your application with -size=1000 and see no scaling problem, but crank it up to -size=1000000 and two threads running on a dual core and sharing a memory bottleneck might suddenly take 130% of the single thread time to complete.

Dual dual cores complicate things even further, per architecture. Intel and AMD have very different solutions to the memory bus problem, with Intel (IIRC) continuing with a more traditional "bus" vs AMD's "packet network". I don't have any idea how the two compare when running four apps with a variety of colliding or partially colliding memory access patterns, at sizes that force single threads to use more than 1/4 or 1/2 of the available memory. I'm not sure that anybody does -- maybe the compiler folks, or somebody with a powerful compulsion to benchmark code mixes. And it just isn't possible to >>compute<< or >>estimate<< this; the only way to figure it out is to measure it.
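
Since this can only be measured, one rough way to go about it is to time a single copy of the job, then two copies started at once, and compare wall-clock times. The following is a minimal sketch in Python, not part of any existing benchmark suite. The names `time_copies` and `slowdown_percent` are made up here, the workload in the `__main__` section is only a stand-in, and pinning copies to particular cores is deliberately left out.

```python
import subprocess
import sys
import time


def time_copies(cmd, copies):
    """Start `copies` simultaneous instances of cmd; return the wall-clock
    seconds until every one of them has exited."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError(f"command failed: {cmd!r}")
    return time.perf_counter() - start


def slowdown_percent(t_single, t_concurrent):
    """Concurrent completion time as a percentage of the single-copy time."""
    return 100.0 * t_concurrent / t_single


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration). Substitute your
    # real application and input, e.g. ["./myapp", "-size=1000000"], where
    # "./myapp" is a hypothetical name.
    cmd = [sys.executable, "-c", "sum(range(2_000_000))"]
    t1 = time_copies(cmd, 1)
    t2 = time_copies(cmd, 2)
    print(f"1 copy: {t1:.2f}s  2 copies: {t2:.2f}s  "
          f"-> {slowdown_percent(t1, t2):.0f}% of single-copy time")
```

A figure near 100% suggests the two copies are not fighting over the memory channel at that problem size. Repeat the measurement over a range of sizes, since the threshold depends on the size of the run.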
The final moral of the story is: DO NOT assume that dual dual core systems are the "right" (cost-benefit optimal) node architecture. It might be single processor units, or more likely (since dual processor single cores are nowadays close to two single processors in a box) dual single cores.

And we haven't even STARTED on actual parallel code, where those node processes have to communicate with others on the network and you have to consider several other bottlenecks. There one has to add to the mix (for example) how many processors share a network interface, and how well a multiprocessor, multicore system can handle multiple network interfaces. New bottlenecks emerge, new nonlinearities take their toll, as one tries to figure out how the network interfaces integrate with the kernel, the memory subsystem, the application, the virtual machine library.

I would not be at all surprised to find that dual single cores are overwhelmingly optimal for certain tasks, relative to dual duals. I would expect dual duals to be optimal for CPU bound, coarse grained to embarrassingly parallel tasks with a standard gigabit interface shared across two or more cores. Adjust granularity down as one invests more in the network, until you hit a nonlinear neck where the dual cores overrun the network, period. At that point, fewer processors per network per bus will likely lead to better performance scaling (and maybe better COST scaling, which is what really matters).

> For a wide mix of applications in the past I've leaned towards AMD because
> my real world testing showed AMD usually won. The gap has closed
> significantly in the last year (it used to be so embarrassing). Today
> I'd call it mostly a wash. Things are shaping up to be pretty interesting,
> AMD has the opportunity to take a commanding lead with their next generation
> chip which rumors claim will be shipping this summer.

AMD won twice over -- they were faster AND cheaper.
Even now, with performance coming in more evenly relative to clock (at least for certain task mixes and scales), I'm not sure that AMD isn't solidly ahead in price/performance, as Intel likes to charge that premium for the Intel name. (And they get it, too, from the HA corporate server crowd, where folks feel a bit uncertain about whether or not those apps will run without bugs on AMD -- Intel isn't above a certain amount of delicate FUD, and plays the Microsoft Game with their compiler etc to the extent that they can.)

> The bad news is that while AMD's next generation promises dramatically better
> work done per cycle, the memory system doesn't look like it's going to
> get much (if any) more memory bandwidth.

Memory has for years now been the rate limiting factor for a large block of large scale computational tasks. All sorts of games are played to hide its intrinsically poor performance from the user -- large caches, prefetch and so on. These systems are tuned to particular KINDS of memory access pattern, and speed memory intensive applications up to the extent that they match those patterns.

It is really instructive and amusing to see what happens when those patterns are deliberately defeated, though (as they might well be in certain classes of application). In my "benchmaster" program -- which I'm not advertising at this point as I haven't worked on it for a year or so, although it does work AFAIK -- there is a memory test that uses a completely random access pattern. It literally fills a user-selectable block of memory with shuffled addresses into that block and then follows the shuffled list of addresses through the block, visiting every site in random order. Boy does THAT slow memory down.
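
The shuffled-address walk just described can be sketched roughly as follows. This is NOT benchmaster's code, only an illustration of the idea in Python, with made-up names (`build_chain`, `walk`, `ns_per_access`). One choice here is an assumption rather than something stated above: the chain is built with Sattolo's algorithm so that the shuffle forms a single cycle and the walk is guaranteed to visit every site exactly once. Interpreter overhead in Python hides much of the effect; a C version chasing raw pointers would show it far more starkly.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list where chain[i] is the site to visit after site i.

    Sattolo's algorithm makes the permutation one cycle of length n, so a
    walk starting anywhere touches every site once before returning.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def walk(chain, start=0):
    """Follow the chain for len(chain) steps and return the site reached."""
    site = start
    for _ in range(len(chain)):
        site = chain[site]
    return site


def ns_per_access(chain):
    """Average time per step of one full walk, in nanoseconds."""
    start = time.perf_counter()
    walk(chain)
    return (time.perf_counter() - start) / len(chain) * 1e9


if __name__ == "__main__":
    n = 1 << 20  # number of sites in the block; user-selectable
    in_order = list(range(1, n)) + [0]  # same walk length, sequential order
    print(f"in order: {ns_per_access(in_order):.1f} ns/access")
    print(f"shuffled: {ns_per_access(build_chain(n)):.1f} ns/access")
```

Because every step depends on the value just loaded, the hardware cannot prefetch ahead of the walk, which is the point of the test.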
Suddenly you see the REAL speed of memory relative to CPU when you eliminate all the parallelized, branch predictive, prefetchy kind of stuff, when L2 no longer helps, when every access is UNlikely to be in cache or in the process of being fetched to cache instead of LIKELY to be in cache.

This sort of "antistream" test is very interesting because, of course, there are applications out there that have this sort of access pattern, e.g. many simulation programs and problems with a high and fragmented dimensionality. Not all code is streaming local vector access. When a memory intensive program with one of the antipatterns of streaming access tries to run on an architecture heavily optimized for streaming access, NOBODY knows how things will perform, because the vendors would rather pierce their own ears with a hole punch than tell you how slow random access memory is when it is accessed randomly.

So don't find this all out the hard way, by buying a particular architecture without studying your own code and the competing architectures, and without running judicious tests and microbenchmarks to study the relative bottleneck performance in the relevant performance dimensions. It is easy in this game to make $50K mistakes (and up) when you're disposing of hundreds of thousands of dollars, and that's real money. It can even be much more than that over the lifetime of a cluster -- a solution 10% slower than you COULD have gotten for your money, over a 3 year expected cluster lifetime, is roughly a third of a year wasted -- all the human time and salary, all the value of the work that could have been done, all the additional infrastructure costs. $50K is a LOW estimate for the salaries alone for many projects.

Cost benefit analysis rules. Do it well and prosper. Do it poorly and one can wither and die.
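
The lifetime arithmetic above is easy to redo with your own numbers. A small sketch follows, under two loudly stated assumptions: "10% slower" is read as 10% of the cluster's useful output being lost, and the annual payroll figure is purely hypothetical, chosen only to show the calculation.

```python
def lost_years(slowdown, lifetime_years):
    """Cluster-years of work forgone when the chosen nodes deliver
    `slowdown` (a fraction, e.g. 0.10) less than the best alternative."""
    return slowdown * lifetime_years


def lost_cost(slowdown, lifetime_years, annual_cost):
    """Money spent during the forgone time, at `annual_cost` per year."""
    return lost_years(slowdown, lifetime_years) * annual_cost


if __name__ == "__main__":
    # 10% slower over a 3 year lifetime, as in the text.
    print(f"{lost_years(0.10, 3):.2f} years lost")
    # Hypothetical payroll of $200,000/year for the people using the cluster.
    print(f"${lost_cost(0.10, 3, 200_000):,.0f} in salaries")
```

With these inputs the loss is 0.3 years, and the assumed payroll puts the salary cost alone above the $50K figure.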
   rgb

> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
