[Beowulf] About Cluster Performance...
Robert G. Brown rgb at phy.duke.edu
Thu May 20 11:16:58 PDT 2004
On Thu, 20 May 2004, Mathias Brito wrote:

> hello folks,
>
> Well, i would like to know how to organize the nodes
> to attain the best performance of my cluster, for now

Attain the best performance doing what?

The best (and cheapest) "general purpose" design is to put your 16 nodes on a fast ethernet switch as you are doing. It works fine for moderately coarse grained tasks and is inexpensive to upgrade to gigabit ethernet now or later for improved bandwidth if need be.

However, decisions concerning network and topology tend to be driven by the application (set) you desire to run. If you are doing Monte Carlo (as I am) in an embarrassingly parallel configuration then don't change. Who cares how "fast" you run some benchmark if that benchmark has nothing to do with the structure of YOUR code?

If you are doing something medium to fine grained parallel, where there are quite a few network communications that have to take place for your computation to advance a step, then consider a faster network. If the communications tend to be sending relatively few BIG messages between the nodes, gigabit ethernet is a reasonable (and still cheap) choice. If they tend to be sending lots of itty bitty messages between the nodes all the time, then you will need to look at the really expensive, racehorse networks. Myrinet and SCI are both expensive (quite possibly more expensive than the nodes themselves, per node) but offer latency on the order of microseconds, where ethernet latencies tend to be in the tens to hundreds of microseconds. They also offer quite high peak bandwidths.

I honestly don't know a lot about firewire -- perhaps somebody on the list could summarize experiences with a firewire connected network.

> i'm using a star topology to my 16 nodes cluster, i
> heard something about using a 4x4 grid instead 1x16,
> is it better? why?
> And i also would like to know a way
> to calculate(predict) the performance of a cluster,
> for example, i have a 16 nodes(using fast ethernet)

It is "in principle" possible to predict parallel performance, but it isn't easy, and having a good shot at getting it right requires extensive study of your hardware resources, your operating system and libraries, and your application. By "study" I mean both learning all you can about it (and about parallel task scaling in general) from real textbooks, manuals, and informational resources and making real world measurements.

For example, you'll need to know things like (and this is by no means a complete list):

a) Raw CPU clock, and how fast the processor does all the things it can do. The structure of the processor -- how and if it is pipelined, number and kind of registers -- it all matters.

b) Size and structure and latencies associated with CPU registers, L1 and L2 (and L3, if any) cache.

c) Size and structure and (clock and) latency of main memory. Structure includes a lot -- datapath widths, how the memory is interfaced to e.g. peripherals and one or more CPUs.

d) The bus(es) on the motherboard, and how they are interfaced with the CPU and memory and each other. Again, things like clock, datapath width and so forth are important, but there is more.

e) The devices attached to the bus(es). Obviously and especially the network hardware, but also disk hardware and even humble slow peripheral hardware can matter, as its mere existence CAN (and in the past has) significantly alter the performance of things you care about.

a') Operating system. Kernels are different -- schedulers, interrupt mechanisms, locking mechanisms, device drivers -- all of it goes into how well and stably a system will perform overall.

b') Compilers. Just changing compilers can make a huge difference in the speed of your code. Some compilers support SSE2 instructions, and some code speeds up if it uses them. Others (in both cases) don't.

c') Libraries.
You don't really write your whole program (any program) -- you write PART of it, and rely on the kindness and skill of strangers for the rest of it, in the form of the many libraries you link to. Whether it is core libraries such as libc and libm or esoteric/advanced libraries like libgsl, libxml2, libpvm -- if your code uses these libraries your performance will be affected by their quality (which can vary with all of the above).

a") Your application. The ATLAS project is a perfect demonstration of how naive implementation of even the most mundane tasks (such as vector and matrix operations) can fail to achieve the speed available to them by factors of two and three. ATLAS (Automatically Tuned Linear Algebra Software) provides linear algebra libraries that are custom built for particular systems (a-e above) and operating system environments (a'-c' above) and does things like change algorithms altogether and alter the blocking of the code according to the size and MEASURED speed of the CPU and its various layers of cache. There are whole books devoted to parallel programming and algorithms, and performance on any given task depends on the algorithms chosen.

b") Your application. Yes, it bears repeating. You cannot predict "performance" of a piece of hardware with a given operating system and compiler and set of libraries. You can only predict performance of an application. So you have to fully understand it, in particular where and how it is bottlenecked.

c") Your application yet again. The more you understand about it and how it uses the hardware resources available to it, the better your chances of both predicting its performance and optimizing that performance on any given box. Profiling tools, debuggers, and code instrumentation are all your friends.

Most people I know (even the ones that know enough that they COULD make a creditable stab at all of the points above) tend to NOT try to predict performance per se -- they prefer to measure it.
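
To make the prediction problem concrete, here is a minimal sketch of the crudest possible model, assuming that each step of a job splits its compute time evenly across the nodes and then pays for some number of messages, each costing a fixed latency plus size divided by bandwidth. The function names, the workload numbers, and the network parameters are all illustrative assumptions (the latencies are only order-of-magnitude guesses consistent with the ranges mentioned above); the model ignores every hardware, OS, compiler, and library effect in the list, so treat its output as a rough sanity check and nothing more.

```python
# Toy first-order model of one parallel step:
#   time = compute / nodes + messages * (latency + bytes / bandwidth)
# Every number below is an illustrative assumption, not a measurement.


def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to deliver one message: fixed latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def step_time(compute_s, nodes, msgs_per_step, msg_bytes, latency_s, bandwidth):
    """Estimated wall-clock time of one step when run on `nodes` nodes."""
    comm = msgs_per_step * message_time(msg_bytes, latency_s, bandwidth)
    return compute_s / nodes + comm


def speedup(compute_s, nodes, msgs_per_step, msg_bytes, latency_s, bandwidth):
    """Estimated speedup relative to the same step run serially with no messages."""
    return compute_s / step_time(
        compute_s, nodes, msgs_per_step, msg_bytes, latency_s, bandwidth
    )


# Hypothetical (latency in seconds, bandwidth in bytes/second) per network.
NETWORKS = {
    "fast ethernet": (100e-6, 100e6 / 8),
    "gigabit ethernet": (50e-6, 1e9 / 8),
    "low-latency interconnect": (5e-6, 2e9 / 8),
}

# Hypothetical workloads: (serial compute seconds per step, messages per step,
# bytes per message).
WORKLOADS = {
    "coarse grained, few big messages": (10.0, 2, 10_000_000),
    "fine grained, many tiny messages": (0.01, 100, 64),
}

if __name__ == "__main__":
    for wname, (compute_s, msgs, size) in WORKLOADS.items():
        print(wname)
        for nname, (latency, bandwidth) in NETWORKS.items():
            s = speedup(compute_s, 16, msgs, size, latency, bandwidth)
            print(f"  {nname:26s} estimated speedup on 16 nodes: {s:5.2f}")
```

With zero messages per step the model returns a speedup equal to the node count, which is the embarrassingly parallel case; as the message count rises and the message size shrinks, the latency term starts to dominate.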
There are a very few microbenchmarks that can be useful to people who are bottlenecked in one of the "standard" ways -- at the network layer, or memory layer, or raw CPU layer, for example. These people care about netpipe numbers, they care about stream results, they care about lmbench or cpu_rate numbers. Most applications are a complicated mix of integer and floating point code where "improving" any one thing may not affect the overall performance at all!

For years I ran my Monte Carlo code and discovered that within a processor family it scaled almost perfectly with CPU clock and nothing else. Fast expensive memory or slow blodgy memory, didn't matter. Big CPU L2 cache or little cache, didn't matter. Fast FSB or slow FSB, fast network or slow network -- all about the same, but double CPU clock and it finishes twice as fast. Not all jobs are like that, and scaling BETWEEN processor families (e.g. AMD and Intel) can be completely different even for this job.

That's why everybody will conclude any answer about speed, benchmarks, performance with -- YMMV (Your Mileage May Vary).

Don't trust benchmarks, don't trust what people tell you about system performance. Don't compute it if you can avoid it. Measure it. It's the only sure way, and even MEASURING it you have work to do with e.g. compiler and library selection, and algorithm optimization.

   rgb

> cluster and i`m not sure about the maximum performance
> i should get, to compare with the HPL benchmark
> results.
>
> Thanks
> Mathias Brito
>
> =====
> Mathias Brito
> Universidade Estadual de Santa Cruz - UESC
> Departamento de Ciências Exatas e Tecnológicas
> Computer Science student

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
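
As an illustration of the "measure it" advice in the message above, here is a minimal timing sketch, assuming a single-process job whose run time you want to see scale with problem size. The `monte_carlo_pi` function is only a hypothetical stand-in workload, and `best_time` is a helper name invented for this example; in practice you would time your own application, built with the compiler and libraries you actually intend to use.

```python
# Minimal wall-clock timing harness. The workload is a hypothetical stand-in;
# substitute the real application (or its inner loop) you care about.
import random
import time


def monte_carlo_pi(samples, seed=0):
    """Stand-in workload: estimate pi by sampling points in the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples


def best_time(func, *args, repeats=5):
    """Run func(*args) `repeats` times and return the fastest time in seconds.

    Taking the minimum reduces noise from other activity on the machine.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        t = best_time(monte_carlo_pi, n)
        print(f"{n:>9} samples: {t:.4f} s  ({n / t:,.0f} samples/s)")
```

Running the same script on two candidate nodes gives a direct comparison for this one workload only, which is the point: the number says nothing about a job with a different mix of CPU, memory, and network demands.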
