[Beowulf] About Cluster Performance...
Robert G. Brown
rgb at phy.duke.edu
Thu May 20 11:16:58 PDT 2004
On Thu, 20 May 2004, Mathias Brito wrote:
> hello folks,
> Well, i would like to know how to organize the nodes
> to obtain the best performance of my cluster, for now
Attain the best performance doing what?
The best (and cheapest) "general purpose" design is to put your 16 nodes
on a fast ethernet switch as you are doing. It works fine for
moderately coarse grained tasks and is inexpensive to upgrade to gigabit
ethernet now or later for improved bandwidth if need be.
However, decisions concerning network and topology tend to be driven by
the application (set) you desire to run. If you are doing Monte Carlo
(as I am) in an embarrassingly parallel configuration then don't change.
Who cares how "fast" you run some benchmark if that benchmark has
nothing to do with the structure of YOUR code? If you are doing
something medium to fine grained parallel, where there are quite a few
network communications that have to take place for your computation to
advance a step, then consider a faster network.
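To make the granularity argument concrete, here is a rough sketch of how a fixed communication cost per step eats into speedup. It is a toy model under stated assumptions, not a prediction for any real code: the work is assumed to divide perfectly across nodes, and the communication cost per step (comm_s) is assumed not to grow with node count. The function names and the numbers are mine, for illustration only.

```python
def step_time(compute_s, n_nodes, comm_s):
    """Time for one step: compute split evenly across nodes, plus a fixed
    communication cost whenever more than one node is involved.
    Assumes perfect load balance and a node-count-independent comm cost."""
    if n_nodes == 1:
        return compute_s
    return compute_s / n_nodes + comm_s


def speedup(compute_s, n_nodes, comm_s):
    """Speedup relative to running the same step on a single node."""
    return compute_s / step_time(compute_s, n_nodes, comm_s)


if __name__ == "__main__":
    # Hypothetical steps: 1 s of compute with varying communication cost.
    for comm_s in (0.0, 0.001, 0.1, 1.0):
        print(f"comm {comm_s:6.3f} s/step -> speedup on 16 nodes: "
              f"{speedup(1.0, 16, comm_s):5.2f}")
```

With no communication the 16 nodes give the full factor of 16; once the communication per step is comparable to the compute per step, the cluster is no faster than one node. That is the whole case for buying a faster network only when the code is fine grained.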
If the communications tend to be sending relatively few BIG messages
between the nodes, gigabit ethernet is a reasonable (and still cheap)
choice. If they tend to be sending lots of itty bitty messages between
the nodes all the time, then you will need to look at the really
expensive, racehorse networks.
Myrinet and SCI are both expensive (quite possibly more expensive than
the nodes themselves, per node) but offer latency on the order of
microseconds, where ethernet latencies tend to be in the tens to
hundreds of microseconds. They also offer quite high peak bandwidths.
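The big-message versus itty-bitty-message distinction can be sketched with the usual first-order model: time per message is roughly latency plus size divided by bandwidth. The latency figures below are picked from the ranges quoted above; the bandwidth for the low-latency entry is an assumed round number, not a measured Myrinet or SCI figure, and the function names are hypothetical.

```python
# (latency in seconds, bandwidth in bytes/second) -- illustrative values only.
NETWORKS = {
    "fast ethernet": (100e-6, 100e6 / 8),     # 100 Mbit/s
    "gigabit ethernet": (50e-6, 1e9 / 8),     # 1 Gbit/s
    "low-latency network": (5e-6, 2e9 / 8),   # assumed 2 Gbit/s
}


def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one message: fixed latency plus wire time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def total_time(total_bytes, n_messages, latency_s, bandwidth_bytes_per_s):
    """Cost of moving total_bytes split into n_messages equal messages."""
    per_message = total_bytes / n_messages
    return n_messages * transfer_time(per_message, latency_s,
                                      bandwidth_bytes_per_s)


if __name__ == "__main__":
    one_mb = 1e6
    for name, (lat, bw) in NETWORKS.items():
        big = total_time(one_mb, 1, lat, bw)         # one 1 MB message
        small = total_time(one_mb, 10_000, lat, bw)  # 10,000 messages of 100 B
        print(f"{name:20s} 1 big: {big * 1e3:8.2f} ms   "
              f"10000 small: {small * 1e3:8.2f} ms")
```

In this model one big message is dominated by bandwidth, so gigabit helps a lot; ten thousand small ones are dominated by latency, so only the low-latency network helps much. Real networks have more going on (switch contention, protocol overhead), so measure with something like netpipe before spending money.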
I honestly don't know a lot about firewire -- perhaps somebody on the
list could summarize experiences with a firewire connected network.
> i'm using a star topology for my 16 node cluster, i
> heard something about using a 4x4 grid instead of 1x16,
> is it better? why? And i also would like to know a way
> to calculate(predict) the performance of a cluster,
> for example, i have a 16 nodes(using fast ethernet)
It is "in principle" possible to predict parallel performance, but it
isn't easy and to have a good shot at getting it right requires
extensive study of your hardware resources, your operating system and
libraries, and your application. By "study" I mean both learning all
you can about it (and about parallel task scaling in general) from real
textbooks, manuals, and informational resources and making real world
measurements.
For example, you'll need to know things like (and this is by no means a
complete list):
a) Raw CPU clock, and how fast the processor does all the things it
can do. The structure of the processor -- how and if it is pipelined,
number and kind of registers -- it all matters.
b) Size and structure and latencies associated with CPU registers, L1
and L2 (and L3, if any) cache.
c) Size and structure and (clock and) latency of main memory.
Structure includes a lot -- datapath widths, how the memory is
interfaced to e.g. peripherals and one or more CPUs.
d) The bus(es) on the motherboard, and how they are interfaced with
the CPU and memory and each other. Again, things like clock, datapath
width and so forth are important, but there is more.
e) The devices attached to the bus(es). Obviously and especially the
network hardware, but also disk hardware and even humble slow peripheral
hardware can matter, as its mere existence CAN (and has in the past)
significantly alter the performance of things you care about.
a') Operating system. Kernels are different -- schedulers, interrupt
mechanisms, locking mechanisms, device drivers -- all of it goes into
how well and stably a system will perform overall.
b') Compilers. Just changing compilers can make a huge difference in
the speed of your code. Some compilers support SSE2 instructions, and
some code speeds up if it uses them. Others (in both cases) don't.
c') Libraries. You don't really write your whole program (any
program) -- you write PART of it, and rely on the kindness and skill of
strangers for the rest of it, in the form of the many libraries you link
to. Whether it is core libraries such as libc and libm or
esoteric/advanced libraries like libgsl, libxml2, libpvm -- if your code
uses these libraries your performance will be affected by their quality
(which can vary with all of the above).
a") Your application. The ATLAS project is a perfect demonstration of
how naive implementation of even the most mundane tasks (such as vector
and matrix operations) can fail to achieve the speed available to them
by factors of two and three. ATLAS (Automatically Tuned Linear Algebra
Software) provides linear algebra libraries that are custom built for
particular systems (a-e above) and operating system environments (a'-c'
above) and does things like change algorithms altogether and alter the
blocking of the code according to the size and MEASURED speed of the CPU
and its various layers of cache. There are whole books devoted to
parallel programming and algorithms, and performance on any given task
depends on the algorithms chosen.
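For anyone who hasn't seen what "blocking" means, here is a minimal sketch of a naive matrix multiply next to a blocked (tiled) one. This is emphatically not ATLAS's code, and in pure Python it will not show the cache benefit; it only illustrates the loop restructuring. The block size is a hypothetical tunable of the kind a tuned library would choose from measurements.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile so that each tile of a, b and c
    is reused while it would still be resident in cache (in a compiled
    language). 'block' is the tile edge length."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(matmul_naive(a, b))
    print(matmul_blocked(a, b, block=2))
```

Both functions do the same arithmetic; only the order in which memory is touched changes. Which block size wins depends on the cache sizes and speeds of the actual machine, which is why it has to be measured rather than guessed.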
b") Your application. Yes, it bears repeating. You cannot predict
"performance" of a piece of hardware with a given operating system and
compiler and set of libraries. You can only predict performance of an
application. So you have to fully understand it, in particular where
and how it is bottlenecked.
c") Your application yet again. The more you understand about it and
how it uses the hardware resources available to it, the better your
chances of both predicting its performance and optimizing that
performance on any given box. Profiling tools, debuggers, and code
instrumentation are all your friends.
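As a small illustration of the profiling point, here is a sketch using Python's standard cProfile and pstats modules on a toy Monte Carlo estimate of pi. The toy workload and the helper name profile_call are mine and stand in for whatever your real application is; for compiled code you would reach for the equivalent tools (gprof and friends) instead.

```python
import cProfile
import io
import pstats
import random


def monte_carlo_pi(n_samples, seed=0):
    """Toy stand-in for a real application: estimate pi by sampling points
    in the unit square and counting those inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, text report) with
    the 'top' entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    estimate, report = profile_call(monte_carlo_pi, 200_000)
    print("pi is roughly", estimate)
    print(report)
```

The report shows where the time actually goes, which is the only sound basis for deciding whether CPU, memory, or network is worth spending money on.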
Most people I know (even the ones that know enough that they COULD make
a creditable stab at all of the points above) tend to NOT try to predict
performance per se -- they prefer to measure it. There are a very few
microbenchmarks that can be useful to people who are bottlenecked in one
of the "standard" ways -- at the network layer, or memory layer, or raw
CPU layer, for example. These people care about netpipe numbers, they
care about stream results, they care about lmbench or cpu_rate numbers.
Most applications are a complicated mix of integer and floating point
code where "improving" any one thing may not affect the overall
performance at all!
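Measuring does not have to be elaborate. A minimal sketch of a timing harness, assuming your job (or a representative slice of it) can be wrapped in a callable; the names time_runs and summarize are hypothetical, and the workload in the demo is a placeholder for your own code.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() 'repeats' times and return the wall-clock time of each
    run in seconds. Repeating exposes run-to-run variation."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Min, median and max of the measured times. The minimum is the least
    disturbed run; the spread hints at interference from other activity."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload; substitute a representative piece of your job.
    def workload():
        return sum(i * i for i in range(200_000))

    print(summarize(time_runs(workload, repeats=5)))
```

Run the same harness on each candidate configuration (compiler, library, node type, network) and compare the numbers for YOUR code, not for somebody else's benchmark.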
For years I ran my Monte Carlo code and discovered that within a
processor family it scaled almost perfectly with CPU clock and nothing
else. Fast expensive memory or slow blodgy memory, didn't matter. Big
CPU L2 cache or little cache, didn't matter. Fast FSB or slow FSB, fast
network or slow network -- all about the same, but double CPU clock and
it finishes twice as fast.
Not all jobs are like that, and scaling BETWEEN processor families (e.g.
AMD and Intel) can be completely different even for this job. That's
why everybody will conclude any answer about speed, benchmarks,
performance with ---
YMMV (Your Mileage May Vary). Don't trust benchmarks, don't trust what
people tell you about system performance. Don't compute it if you can
avoid it. Measure it. It's the only sure way, and even MEASURING it
you have work to do with e.g. compiler and library selection, and
> cluster and i'm not sure about the maximum performance
> i should get, to compare with the HPL benchmark
> Mathias Brito
> Universidade Estadual de Santa Cruz - UESC
> Departamento de Ciências Exatas e Tecnológicas
> Estudante do Curso de Ciência da Computação
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu