j.sassmannshausen at ucl.ac.uk
Sun Nov 4 10:06:01 PST 2012
I agree with Vincent regarding ECC; I think it is really mandatory for a
cluster which does number crunching.
However, the best cluster does not help if the deployed code does not have a
test suite to verify the installation. Believe me, that is not an exception: I
know a number of chemistry codes which are used in practice and for which
there is no test suite, or the test suite is broken and it actually says on
the code's webpage: don't bother using the test suite, it is broken and we
know it.
So you need both: good hardware _and_ good software with a test suite to
generate meaningful results. If one of the requirements is not met, we might
as well throw dice, which is cheaper ;-)
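To make the test suite point concrete, here is a minimal sketch in Python of
what an installation check could look like: run the installed code on small
inputs and compare the output against stored reference values within a
tolerance. The names REFERENCE, compute_results and verify are hypothetical,
and the two quantities are simple stand-ins, not taken from any real
chemistry code.

```python
import math

# Hypothetical reference values; a real package would ship these with its
# test suite, produced by a trusted build.
REFERENCE = {
    "dot_product": 32.0,  # (1, 2, 3) . (4, 5, 6)
    "norm": 5.0,          # length of the vector (3, 4)
}


def compute_results():
    """Stand-in for running the freshly installed code on small test inputs."""
    a, b = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
    return {
        "dot_product": sum(x * y for x, y in zip(a, b)),
        "norm": math.hypot(3.0, 4.0),
    }


def verify(results, reference, rel_tol=1e-9):
    """Return the names of all quantities that are missing or out of tolerance."""
    failures = []
    for name, expected in reference.items():
        got = results.get(name)
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures


if __name__ == "__main__":
    failed = verify(compute_results(), REFERENCE)
    print("installation OK" if not failed else f"FAILED: {failed}")
```

The tolerance matters: different compilers and maths libraries legitimately
change the last few digits, so an exact comparison would fail on healthy
installations.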
All the best from a wet London
On Sunday 04 November 2012 Vincent Diepeveen wrote:
> On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:
> > On 11/3/12 6:55 PM, "Robin Whittle" <rw at firstpr.com.au> wrote:
> >> <snip>
> >> For serious work, the cluster and its software need to survive power
> >> outages, failure of individual servers and memory errors, so ECC memory
> >> is a good investment . . . which typically requires more expensive
> >> motherboards and CPUs.
> > Actually, I don't know that I would agree with you about ECC, etc. ECC
> > memory is an attempt to create "perfect memory". As you scale up, the
> > assumption of "perfect computation" becomes less realistic, so that means
> > your application (or the infrastructure on which the application sits) has
> > to explicitly address failures, because at sufficiently large scale, they
> > are inevitable. Once you've dealt with that, then whether ECC is needed
> > or not (or better power supplies, or cooling fans, or lunar gravity phase
> > compensation, or whatever) is part of your computational design and
> > budget: it might be cheaper (using whatever metric) to overprovision and
> > allow errors than to buy fewer better widgets.
> I don't know whether 'outages' are a big issue for all clusters - here
> in Western Europe we hardly have power failures, so I could understand it
> if a company with a cluster doesn't invest in battery packs, as the
> company won't be able to run anyway if there isn't power.
> More interesting is the ECC discussion.
> ECC is simply a requirement IMHO, not a 'luxury thing' as some
> hardware engineers see it.
> I know some memory engineers disagree here - for example one of them
> mentioned to me that "putting ECC onto a GPU is nonsense as it is a lot
> of effort and GDDR5 already has a built-in CRC", or something like that
> (if I remember the quote correctly).
> But they do not administer servers themselves.
> They also don't understand the accuracy, or rather the LACK of accuracy,
> with which some people who calculate on big iron check their results. If
> you calculate on a cluster and get a result after some months, the reality
> is simply that 99% of researchers aren't as good as the Einstein-league
> researchers, and 90% simply suck too much by any standard, in the sense
> that they wouldn't see an obvious problem generated by a bitflip here or
> there. They would just
> happily invent a new theory, as we already have seen too much in
> By simply putting in ECC you avoid, in some percentage of cases, this
> 'interpreting the results correctly' problem.
> Furthermore there are too many calculations where a single bitflip
> could be catastrophic, and calculating for a few months on hundreds of
> cores without ECC is then asking for trouble.
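How much damage one flipped bit does depends entirely on which bit it is.
The following Python sketch flips a single bit in the IEEE 754
double-precision representation of a number; flip_bit is a hypothetical
helper written for this illustration, and it only simulates the effect in
software. It says nothing about how often real memory errors occur.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 64-bit IEEE 754 encoding flipped.

    bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    for bit in (0, 52, 62, 63):
        print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

Starting from 1.0, flipping bit 0 gives 1.0000000000000002, which would
probably go unnoticed, flipping bit 52 halves the value, flipping bit 63
changes the sign, and flipping bit 62 turns it into infinity.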
> As a last argument I want to note that in many sciences we simply see
> that the post-Second World War standard of using alpha = 0.05,
> or an error of at most 5% (2 x standard deviation), simply isn't
> accurate enough anymore for today's generation of scientists.
> They need more accuracy.
> So whatever the historic debates on what is or isn't enough, reducing
> errors by means of ECC is really important.
> Now that said - if someone shows up with a different form of checking
> that's just as accurate or even better - that would be acceptable as
> well - yet most discussions with the hardware engineers typically go
> like: "why go to all this effort to get rid of a few errors, when if my
> Windows laptop crashes I just reboot it".
> Such discussions really should be discussions of the past - society is
> moving on - one needs far higher accuracy and reliability now - simply
> because the CPUs do more calculations and the memory therefore has to
> serve more bytes per second.
> All in all, ECC is a requirement for huge clusters and, from my
> viewpoint, also for relatively tiny clusters.
> >> I understand that the most serious limitation of this approach is the
> >> bandwidth and latency (how long it takes for a message to get to the
> >> destination server) of 1Gbps Ethernet. The most obvious alternatives
> >> are using multiple 1Gbps Ethernet connections per server (but this is
> >> complex and only marginally improves bandwidth, while doing little or
> >> nothing for latency) or upgrading to Infiniband. As far as I know,
> >> Infiniband is exotic and expensive compared to the mass market
> >> motherboards etc. from which a Beowulf cluster can be made. In other
> >> words, I think Infiniband is required to make a cluster work really
> >> well, but it does not (yet) meet the original Beowulf goal of being
> >> inexpensive and commonly available.
> > Perhaps a distinction should be made between "original Beowulf" and
> > "cluster computer"? As you say, the original idea (espoused in the book,
> > etc.) is a cluster built from cheap commodity parts. That would mean
> > "commodity packaging", "commodity interconnects", etc. which for the most
> > part meant tower cases and ethernet. However, cheap custom sheet metal is
> > now available (back when Beowulfs were first being built, rooms full of
> > servers were still a fairly new and novel thing, and you paid a
> > significant premium for rack mount chassis, especially as consumer
> > pressure forced the traditional tower case prices down).
> >> I think this model of HPC cluster computing remains fundamentally true,
> >> but there are two important developments in recent years which either
> >> alter the way a cluster would be built or used or which may make the
> >> best solution to a computing problem no longer a cluster. These
> >> developments are large numbers of CPU cores per server, and the use of
> >> GPUs to do massive amounts of computing, in a single inexpensive
> >> graphics card - more crunching than was possible in massive clusters a
> >> decade earlier.
> > Yes. But in some ways, utilizing them has the same sort of software
> > problem as using multiple nodes in the first place (EP aside). And the
> > architecture of the interconnects is heterogeneous compared to the fairly
> > uniform interconnect of a generalized cluster fabric. One can raise the
> > same issues with cache, by the way.
> >> The ideal computing system would have a single CPU core which could run
> >> at arbitrarily high frequencies, with low latency, high bandwidth,
> >> access to an arbitrarily large amount of RAM, with matching links to
> >> hard disks or other non-volatile storage systems, with a good Ethernet
> >> link to the rest of the world.
> >> While CPU clock frequencies and computing effort per clock frequency
> >> have been growing slowly for the last 10 years or so, there has been a
> >> continuing increase in the number of CPU cores per CPU device (typically
> >> a single chip, but sometimes multiple chips in a device which is plugged
> >> into the motherboard) and in the number of CPU devices which can be
> >> plugged into a motherboard.
> > That's because CPU clock is limited by physics. "work per clock cycle" is
> > also limited by physics to a certain extent (because today's processors
> > are mostly synchronous, so you have a propagation delay time from one side
> > of the processor to the other) except for things like array processors
> > (SIMD) but I'd say that's just multiple processors that happen to be doing
> > the same thing, rather than a single processor doing more.
> > The real force driving multiple cores is the incredible expense of getting
> > on and off chip. Moving a bit across the chip is easy, compared to off
> > chip: you have to change the voltage levels, have enough current to drive
> > a trace, propagate down that trace, receive the signal at the other end,
> > shift voltages again.
> >> Most mass market motherboards are for a single CPU device, but there are
> >> a few two and four CPU motherboards for Intel and AMD CPUs.
> >> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores
> >> per CPU device. I think the 4 core i7 CPUs or their ECC-compatible Xeon
> >> equivalents are marginally faster than those with 6 or 8 cores.
> >> In all cases, as far as I know, combining multiple CPU cores and/or
> >> multiple CPU devices results in a single computer system, with a single
> >> operating system and a single body of memory, with multiple CPU cores
> >> all running around in this shared memory.
> > Yes.. That's a fairly simple model and easy to program for.
> >> I have no clear idea how each CPU core knows what the other cores have
> >> written to the RAM they are using, since each core is reading and
> >> writing via its own cache of the memory contents. This raises the
> >> question of inter-CPU-core communications, within a single CPU chip,
> >> between chips in a multi-chip CPU module, and between multiple CPU
> >> modules on the one motherboard.
> > Generally handled by the OS kernel. In a multitasking OS, the scheduler
> > just assigns the next free CPU to the next task. Whether you restore the
> > context from processor A to processor A or to processor B doesn't make
> > much difference. Obviously, there are cache issues (since that's part of
> > context). This kind of thing is why multiprocessor kernels are
> > non-trivial.
> >> I understand that MPI works identically from the programmer's
> >> perspective between CPU-cores on a shared memory computer as between
> >> CPU-cores on separate servers. However, the performance (low latency
> >> and high bandwidth) of these communications within a single shared
> >> memory system is vastly higher than between any separate servers, which
> >> would rely on Infiniband or Ethernet.
> > Yes. This is a problem with a simple interconnect model.. It doesn't
> > necessarily reflect that the cost of the interconnect is different
> > depending on how far and how fast you're going. That said, there is a
> > fair amount of research into this. Hypercube processors had limited
> > interconnects between nodes (only nearest neighbor) and there are
> > toroidal fabrics (2D interconnects) as well.
> >> So even if you have, or are going to write, MPI-based software which can
> >> run on a cluster, there may be an argument for not building a cluster as
> >> such, but for building a single motherboard system with as many as 64
> >> CPU cores.
> > Sure.. If your problem is of a size that it can be solved by a single box,
> > then that's usually the way to go. (It applies in areas outside of
> > computing.. Better to have one big transmitter tube than lots of little
> > ones). But it doesn't scale. The instant the problem gets too big, then
> > you're stuck. The advantage of clusters is that they are scalable. Your
> > problem gets 2x bigger, in theory, you add another N nodes and you're
> > ready to go (Amdahl's law can bite you though).
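Amdahl's law can be put into numbers with a few lines of Python. In this
sketch, parallel_fraction is the share of the single-node runtime that can
be spread over nodes; the 0.95 used in the example run is an assumed value
for illustration, not a measurement of any real code.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Speedup over one node when only parallel_fraction of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    # Assumed example: 95% of the runtime parallelises perfectly.
    for nodes in (1, 2, 8, 64, 1024):
        print(f"{nodes:5d} nodes: speedup {amdahl_speedup(0.95, nodes):6.2f}")
```

With 5% of the work serial, the speedup can never exceed 1 / 0.05 = 20,
however many nodes are added.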
> > There's even been a lot of discussion over the years on this list about
> > the optimum size cluster to build for a big task, given that computers are
> > getting cheaper/more powerful. If you've got 2 years worth of computing,
> > do you buy a computer today that can finish the job in 2 years, or do you
> > do nothing for a year and buy a computer that is twice as fast in a year?
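The buy-now-or-wait question is simple arithmetic. The sketch below assumes
that the later machine costs the same and is faster by a fixed factor;
finish_time is a hypothetical helper, and the numbers are the ones from the
example above.

```python
def finish_time(work_years: float, wait_years: float, speedup: float) -> float:
    """Years from today until the job is done.

    work_years: runtime of the job on a machine bought today
    wait_years: how long you wait before buying
    speedup:    how much faster the later machine is (an assumption)
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return wait_years + work_years / speedup


if __name__ == "__main__":
    print("buy today:      ", finish_time(2.0, 0.0, 1.0), "years")
    print("wait one year:  ", finish_time(2.0, 1.0, 2.0), "years")
```

For a 2 year job and a doubling after one year, both choices finish after
2 years, so this is exactly the break-even point. Under the same assumptions
a longer job favours waiting and a shorter one favours buying today.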
> >> I think the major new big academic cluster projects focus on getting as
> >> many CPU cores as possible into a single server, while minimising power
> >> consumption per unit of compute power, and then hooking as many as
> >> possible of these servers together with Infiniband.
> > That might be an aspect of trying to make a general purpose computing
> > resource within a specified budget.
> >> Here is a somewhat rambling discussion of my own thoughts regarding
> >> clusters and multi-core machines, for my own purposes. My interests in
> >> high performance computing involve music synthesis and physics
> >> simulation.
> >> There is an existing, single-threaded (written in C, can't be made
> >> multithreaded in any reasonable manner) music synthesis program called
> >> Csound. I want to use this now, but as a language for synthesis, I
> >> think it is extremely clunky. So I plan to write my own program - one
> >> day . . . When I do, it will be written in C++ and multithreaded, so
> >> it will run nicely on multiple CPU-cores in a single machine. Writing
> >> and debugging a multithreaded program is more complex than doing so for
> >> a single-threaded program, but I think it will be practical and a lot
> >> easier than writing and debugging an MPI based program running either on
> >> multiple servers or on multiple CPU-cores on a single server.
> > Maybe, maybe not. How is your interthread communication architecture
> > structured? Once you bite the bullet and go with a message passing model,
> > it's a lot more scalable, because you're not doing stuff like "shared
> > memory".
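As a rough illustration of the message passing style inside one program,
here is a sketch in Python in which the workers share no data structures and
talk only through queues. This is not MPI and not the planned C++ design;
worker and sum_of_squares are hypothetical names, and Python threads are
used purely to show the pattern, not for performance.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Square each number received; stop when the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            break
        outbox.put(item * item)


def sum_of_squares(values, n_workers: int = 4) -> int:
    """Distribute values to workers as messages and collect the replies."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()

    sent = 0
    for v in values:
        inbox.put(v)
        sent += 1
    for _ in threads:
        inbox.put(None)  # one stop message per worker

    total = sum(outbox.get() for _ in range(sent))
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(sum_of_squares(range(10)))
```

Because the workers only see messages, the same structure carries over when
the queues are replaced by a transport between processes or machines, which
is the scalability argument made above.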
> >> I want to do some simulation of electromagnetic wave propagation using
> >> an existing and widely used MPI-based (C++, open source) program called
> >> Meep. This can run as a single thread, if there is enough RAM, or the
> >> problem can be split up to run over multiple threads using MPI
> >> communication between the threads. If this is done on a single server,
> >> then the MPI communication is done really quickly, via shared memory,
> >> which is vastly faster than using Ethernet or Infiniband to other
> >> servers. However, this places a limit on the number of CPU-cores and
> >> the total memory. When simulating three dimensional models, the RAM and
> >> CPU demands can easily become extreme. Meep was written to split the
> >> problem into multiple zones, and to work efficiently with MPI.
> > As you note, this is the advantage of setting up a message passing
> > architecture from the beginning.. It works regardless of the scale/method
> > of message passing. There *are* differences in performance.
> >> Ten or 15 years ago, the only way to get more compute power was to build
> >> a cluster and therefore to write the software to use MPI. This was
> >> because CPU-devices had a single core (Intel Pentium 3 and 4) and
> >> because it was rare to find motherboards which handled multiple such
> >> chips.
> > Yes
> >> The next step would be to get a 4 socket motherboard from Tyan or
> >> SuperMicro for $800 or so and populate it with 8, 12 or (if money
> >> permits) 16 core CPUs and a bunch of ECC RAM.
> >> My forthcoming music synthesis program would run fine with 8 or 16GB of
> >> RAM. So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron
> >> machines would do the trick nicely.
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
University College London
Department of Chemistry
email: j.sassmannshausen at ucl.ac.uk
Please avoid sending me Word or PowerPoint attachments.