[Beowulf] CPU Startup Combines CPU+DRAM--And A Whole Bunch Of Crazy

Rayson Ho raysonlogin at gmail.com
Mon Jan 23 12:50:09 PST 2012


On Mon, Jan 23, 2012 at 11:35 AM, Lux, Jim (337C)
<james.p.lux at jpl.nasa.gov> wrote:
> The "processors in a sea of memory" model has been around for a while
> (and, in fact, there were a lot of designs in the 80s, at the board if not
> the chip level: transputers, early hypercubes, etc.)  So this is
> revisiting the architecture at a smaller level of integration.

I remember 12-15 years ago I was reading quite a few papers published
by the Berkeley Intelligent RAM (IRAM) Project:

http://iram.cs.berkeley.edu/

So 15 years later someone suddenly thinks that it is a good idea to
ship IRAM systems to real customers?? :-D

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


> One thing about power consumption.. Those memory cells consume so little
> power because most of them are not being accessed.  They're essentially
> "floating" capacitors. So the power consumption of the same transistor in
> a CPU (where the duty factor is 100%) is going to be higher than the power
> consumption in a memory cell (where the duty factor is 0.001% or
> something).
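>
> A crude back-of-the-envelope sketch of that duty-factor argument (the
> capacitance, voltage, and activity numbers below are made up purely for
> illustration; only the ratio of duty factors matters):
>
>     # Dynamic switching power: P = alpha * C * V^2 * f, where alpha is the
>     # activity (duty) factor, C the switched capacitance, V the supply
>     # voltage, and f the clock frequency.
>     def dynamic_power(alpha, c_farads, v_volts, f_hz):
>         return alpha * c_farads * v_volts ** 2 * f_hz
>
>     C = 1e-15   # 1 fF per node (illustrative)
>     V = 1.0     # 1 V supply (illustrative)
>     f = 500e6   # 500 MHz clock
>
>     p_cpu  = dynamic_power(1.0,  C, V, f)   # logic node toggling every cycle
>     p_dram = dynamic_power(1e-5, C, V, f)   # cell touched ~0.001% of cycles
>     print(p_cpu / p_dram)                   # ~100,000x, from duty factor alone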
>
> And, as always, the challenge is in the software to effectively use the
> distributed computing architecture.  When you think about it, we've had
> almost a century to figure out how to program single instruction stream
> computers of one sort or another, and it was easy, because we are single
> stream (SISD) ourselves.  We can create a simulation of multiple threads
> by timesharing in some sense (in either the human or machine models).
>
> And we have lots of experience with EP type, or even scatter/gather type
> processes (tilling land, building pyramids, assembly lines) so that model
> of software/hardware architecture can be argued to be a natural outgrowth
> of what humans already do, and have been figuring out how to do for
> millennia.  (Did Imhotep use some form of project planning tools?  You bet
> he did.)
>
> However, true parallelism (MIMD) is harder to conceptualize.  Vector and
> matrix math is one area, but I'd argue that it's just the same as EP
> tasks, just at a finer grain. Systolic arrays, vector pipelines, and FFT boxes
> from Floating Point Systems are all basically ways to use the underlying
> structure of the task in an easy way (how long until there's a hardware
> implementation of the new faster-than-FFT algorithm published last week?).
> And in all those cases, you have to explicitly make use of the special
> capabilities.  That is, in general, the compiler doesn't recognize it
> (although modern parallelizing compilers ARE really smart, so they
> probably do find most of the cases).
>
> I don't know that we have good conceptual tools to take a complex task and
> break it effectively into multiple disparate component tasks that can
> effectively run in parallel.  It's a hard task for something
> straightforward (e.g., designing a big system or building a spacecraft),
> and I don't know that any of the outputs of current project planning
> techniques (which are entirely manual) can be said to produce
> "generalized" optimum outputs.  They produce *an* output for dividing the
> complex task up (or else the project can't be done), but I don't know that
> the output is provably optimum or even workable (an awful lot of projects
> over-run, and not just because of bad estimates for time/cost).
>
> So the problem facing would-be users of new computing architectures (be
> they TOMI, HyperCube, ConnectionMachine, or Beowulf) is like that facing a
> project planner given a big project, and a brand new crew of workers who
> speak a different language, with skill sets totally different than the
> planner is used to.
>
> This is what the computer user is facing:  There's no compiler or problem
> description technique that will automatically generate a "work plan" to
> use that new architecture. It's all manual, and it's hard, and you're up
> against a brute force "why not just hook 500 people up to that rock and
> drag it" approach.  The people who figure out the new way will certainly
> benefit society, but there's going to be a lot of false starts along the
> way.  And, I'm not particularly sanguine about the process being automated
> (at least in the sense of automatic parallelizing compilers that recognize
> loops and repetitive stuff).  I think that for the next few years
> (decades?) using new architectures is going to rely on skilled humans to
> figure out how to use them, on an ad hoc, unique-to-each-application basis.
>
>
> [Back in the 80s, I had a loaner "sugarcube" 4 node Intel hypercube
> sitting on my desk for a while.  I wanted to figure out something to do
> with it that was non-trivial, and not the examples given in the docs (which
> focused on stuff like LISP and Prolog).  I started, as I'm sure many
> people do, by taking a multithreaded application I had, and distributing
> the threads to processors.  You pretty quickly realize, though, that it's
> tough to evenly distribute the loads among processors, and you wind up
> with processor 1 waiting for something that processor 2 is doing, which in
> turn is waiting for something that processor 3 is doing, and so forth.  In
> a "shared processor" this isn't a big deal, and is transparent: the
> processor is always working, and aside from deadlocks, there's no
> particular reason why you need to balance load among threads.
>
> For what it's worth, the task I was doing was comparable to taking
> execution of a Matlab/simulink model and distributing it across multiple
> processors.  You had signals flowing among blocks, etc.  These things are
> computationally intensive (especially if you have loops in the design, so
> you need an iterative solution of some sort) so the idea of putting
> multiple processors to work is attractive.   But the "work" in each block
> in the diagram isn't known a priori and might vary during the course of
> the simulation, so it's not like you can come up with some sort of
> automatic partitioning algorithm.
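>
> A toy sketch of that partitioning problem (hypothetical block costs, not the
> original code): statically handing each processor an equal share of blocks
> versus letting idle processors pull the next block from a shared queue, when
> the cost of a block is only discovered as the simulation runs.
>
>     import random
>
>     # Each "block" in the diagram has a cost we only learn by executing it.
>     random.seed(1)
>     blocks = [random.uniform(0.1, 10.0) for _ in range(64)]
>     nprocs = 4
>
>     # Static partition: every processor gets the same *count* of blocks.
>     static = [sum(blocks[i::nprocs]) for i in range(nprocs)]
>
>     # Dynamic: each block goes to whichever processor is least loaded so far,
>     # approximating processors pulling work from a shared queue as they idle.
>     dynamic = [0.0] * nprocs
>     for cost in blocks:
>         idlest = dynamic.index(min(dynamic))
>         dynamic[idlest] += cost
>
>     # The finish time is set by the busiest processor in either scheme.
>     print("static  makespan:", max(static))
>     print("dynamic makespan:", max(dynamic))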
>
>
> On 1/23/12 7:38 AM, "Prentice Bisbal" <prentice at ias.edu> wrote:
>
>>If you read this PDF from Venray Technologies, which is linked to in the
>>article, you see where the 'Whole Bunch of Crazy' part comes from. After
>>reading it, Venray lost a lot of credibility in my book.
>>
>>https://www.venraytechnology.com/economics_of_cpu_in_DRAM2.pdf
>>
>>--
>>Prentice
>>
>>
>>On 01/23/2012 08:45 AM, Eugen Leitl wrote:
>>> (Old idea, makes sense, will they be able to pull it off?)
>>>
>>>
>>>http://hothardware.com/News/CPU-Startup-Combines-CPUDRAMAnd-A-Whole-Bunch-Of-Crazy/
>>>
>>> CPU Startup Combines CPU+DRAM--And A Whole Bunch Of Crazy
>>>
>>> Sunday, January 22, 2012 - by Joel Hruska
>>>
>>> The CPU design firm Venray Technology announced a new product design
>>>this
>>> week that it claims can deliver enormous performance benefits by
>>>combining
>>> CPU and DRAM on to a single piece of silicon. We spent some time
>>>earlier this
>>> fall discussing the new TOMI (Thread Optimized Multiprocessor) with
>>>company
>>> CTO Russell Fish, but while the idea is interesting, its presentation is
>>> marred by crazy conceptualizing and deeply suspect analytics.
>>>
>>> The Multicore Problem:
>>>
>>> There are three factors, or walls, that limit the scaling of modern
>>> microprocessors. First, there's the memory wall, defined as the gap
>>>between
>>> the CPU and DRAM clock speed. Second, there's the ILP (Instruction Level
>>> Parallelism) wall, which refers to the difficulty of decoding enough
>>> instructions per clock cycle to keep a core completely busy. Finally,
>>>there's
>>> the power wall--the faster a CPU is and the more cores it has, the more
>>>power
>>> it consumes.
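>>>
>>> As a rough illustration of the memory wall (the clock and latency numbers
>>> here are assumptions chosen only to make the arithmetic concrete):
>>>
>>>     cpu_clock_hz   = 3.0e9    # assume a 3 GHz core
>>>     dram_latency_s = 60e-9    # assume ~60 ns to reach main memory
>>>
>>>     # Cycles the core sits idle on each access that misses all caches.
>>>     stall_cycles = cpu_clock_hz * dram_latency_s
>>>     print(stall_cycles)       # ~180 wasted cycles per uncached access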
>>>
>>> Attempting to compensate for one wall often risks running afoul of the
>>>other
>>> two. Adding more cache to decrease the impact of the CPU/DRAM speed
>>> discrepancy adds die complexity and draws more power, as does raising
>>>CPU
>>> clock speed. Combined, the three walls are a set of fundamental
>>> constraints--improving architectural efficiency and moving to a smaller
>>> process technology may make the room a bit bigger, but they don't
>>>remove the
>>> walls themselves.
>>>
>>> TOMI attempts to redefine the problem by building a very different type
>>>of
>>> microprocessor. The TOMI Borealis is built using the same transistor
>>> structures as conventional DRAM; the chip trades clock speed and
>>>performance
>>> for ultra-low leakage. Its design is, by necessity, extremely
>>>simple. Not
>>> counting the cache, TOMI is a 22,000 transistor design, as compared to
>>>30,000
>>> transistors for the original ARM2. The company's early prototypes,
>>>built on
>>> legacy DRAM technology, ran at 500MHz on a 110nm process.
>>>
>>> Instead of surrounding a CPU core with a substantial amount of L2 and L3
>>> cache, Venray inserted a CPU core directly into a DRAM design. A TOMI
>>> Borealis chip connects eight TOMI cores to a 1Gbit DRAM, with a total of 16
>>> ICs per 2GB DIMM. This works out to a total of 128 processor cores per
>>>DIMM.
>>> Because they're built using ultra-low-leakage processes and are so
>>>small,
>>> such cores cost very little to build and consume vanishingly small
>>>amounts of
>>> power (Venray claims power consumption is as low as 23mW per core at
>>>500MHz).
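>>>
>>> Putting those numbers together (a quick sanity check; the per-DIMM figure
>>> below counts only the claimed core power, not the DRAM arrays themselves):
>>>
>>>     cores_per_ic   = 8
>>>     ics_per_dimm   = 16       # 16 x 1Gbit = 2GB
>>>     watts_per_core = 0.023    # the claimed 23mW per core at 500MHz
>>>
>>>     cores_per_dimm = cores_per_ic * ics_per_dimm
>>>     print(cores_per_dimm)                     # 128 cores per DIMM
>>>     print(cores_per_dimm * watts_per_core)    # ~2.9W of core power per DIMM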
>>>
>>> It's an interesting idea.
>>>
>>> The Bad:
>>>
>>> When your CPU has fewer transistors than an architecture that debuted in
>>> 1986, there's a good chance that you left a few things out--like an FPU,
>>>branch
>>> prediction, pipelining, or any form of speculative execution. Venray
>>>may have
>>> created a chip with power consumption an order of magnitude lower than
>>> anything ARM builds and more memory bandwidth than Intel's highest-end
>>>Xeons,
>>> but it's an ultra-specialized, ultra-lightweight core that trades 25
>>>years of
>>> flexibility and performance for scads of memory bandwidth.
>>>
>>>
>>> The last few years have seen a dramatic surge in the number of
>>>low-power,
>>> many-core architectures being floated as the potential future of
>>>computing,
>>> but Venray's approach relies on the manufacturing expertise of
>>>companies who
>>> have no experience in building microprocessors and don't normally serve
>>>as
>>> foundries. This imposes fundamental restrictions on the CPU's ability to
>>> scale; DRAM is manufactured using a three-layer mask rather than the
>>>10-12
>>> layers Intel and AMD use for their CPUs. Venray already acknowledges
>>>that
>>> these conditions imposed substantial limitations on the original TOMI
>>>design.
>>>
>>> Of course, there's still a chance that the TOMI uarch could be
>>>effective in
>>> certain bandwidth-hungry scenarios--but that's where the Venray Crazy
>>>Train
>>> goes flying off the track.
>>>
>>> The Disingenuous and Crazy:
>>>
>>> Let's start here. In a graph like this, you expect the two bars to
>>>represent
>>> the same systems being compared across three different characteristics.
>>> That's not the case. When we spoke to Russell Fish in late November, he
>>> pointed us to this publicly available document and claimed that the
>>>results
>>> came from a customer with 384 2.1GHz Xeons. There's no such thing as an
>>>S5620
>>> Xeon and even if we grant that he meant the E5620 CPU, that's a 2.4GHz
>>>chip.
>>>
>>> The "Power consumption" graphs show Oracle's maximum power consumption
>>>for a
>>> system with 10x Xeon E7-8870s, 168 dedicated SQL processors, 5.3TB
>>>(yes, TB)
>>> of Flash and 15x 10,000 RPM hard drives. It's not only a worst-case
>>>figure,
>>> it's a figure utterly unrelated to the workload shown in the Performance
>>> comparison. Furthermore, given that each Xeon E7-8870 has a 130W TDP,
>>>ten of
>>> them only come out to 1.3kW--Oracle's 17.7kW figure means that the
>>> overwhelming majority of the cabinet's power consumption is driven by
>>> components other than its CPUs.
>>>
>>> From here, things rapidly get worse. Fish makes his points about power
>>>walls
>>> by referring to unverified claims that prototype 90nm Tejas chips drew
>>>150W
>>> at 2.8GHz back in 2004. That's like arguing that Ford can't build a
>>>decent
>>> car because the Edsel sucked.
>>>
>>> After reading about the technology, you might think Venray was planning
>>>to
>>> market a small chip to high-end HPC niche markets... and you'd be
>>>wrong. The
>>> company expects the following to occur as a result of this revolutionary
>>> architecture (organized by least-to-most creepy):
>>>
>>>     Computer speech will be so common that devices will talk to other
>>>devices
>>> in the presence of their users.
>>>
>>>     Your cell phone camera will recognize the face of anyone it sees
>>>and scan
>>> the computer cloud for background red flags as well as six degrees of
>>> separation
>>>
>>>     Common commands will be reduced to short verbal cues like clicking
>>>your
>>> tongue or sucking your lips
>>>
>>>     Your personal history will be displayed for one and all to
>>>see...women
>>> will create search engines to find eligible, prosperous men. Men will
>>>create
>>> search engines to qualify women. Criminals will find their jobs much
>>>more
>>> difficult because their history will be immediately known to anyone who
>>> encounters them.
>>>
>>>     TOMI Technology will be built on flash memories creating the
>>>elemental
>>> unit of a learning machine... the machines will be able to self
>>>organize,
>>> build robust communicating structures, and collaborate to perform tasks.
>>>
>>>     A disposable diaper company will give away TOMI enabled teddy bears
>>>that
>>> teach reading and arithmetic. It will be able to identify specific
>>> children... and from time to time remind Mom to buy a product. The bear
>>>will
>>> also diagnose a raspy throat, a cough, or runny nose.
>>>
>>> Conclusion:
>>>
>>> Fish has spent decades in the microprocessor industry--together with Chuck
>>> H. Moore, he invented the first CPU to use a clock multiplier--but his
>>> vision of the future is crazy enough to scare mad dogs and Englishmen.
>>>
>>> His idea for a CPU architecture is interesting, even underneath the
>>> obfuscation and false representation, but too practically limited to
>>>ever
>>> take off. Google, an enthusiastic and dedicated proponent of
>>> energy-efficient, multi-core research, said it best in a paper titled "Brawny
>>>cores
>>> still beat wimpy cores, most of the time."
>>>
>>>  "Once a chip's single-core performance lags by more than a factor of two
>>> or so behind the higher end of current-generation commodity processors,
>>> making a business case for switching to the wimpy system becomes increasingly
>>> difficult... So go forth and multiply your cores, but do it in moderation, or
>>> the sea of wimpy cores will stick to your programmers' boots like clay.""


