[Beowulf] CPU Startup Combines CPU+DRAM--And A Whole Bunch Of Crazy

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Mon Jan 23 08:35:56 PST 2012


The CPU reminds me of the old bipolar AMD Am2901 bit-slice CPU chip sets...
RISC before it was called RISC.

The white paper sort of harps on the fact that one cannot accurately
predict the future (hey, I was a 10th grader at NCC in 1975, and saw the
Altair at the MITS display in their trailer and KNEW that I wanted one,
but I also wanted lots of other things there, which didn't pan out).
Then, having established that you can make predictions with impunity and
nobody can prove you wrong, they go on with a couple pages of ideas.
(establishing priority for patenting.. Eh?  Like the story Feynman tells
about getting a patent on nuclear powered airplanes)

The concept isn't particularly new (see, e.g. Transputers), but that's
true of most architectural things. I think what happens is that as
manufacturing or other limits/bumps in the road are hit, it forces a
review. There's always the argument that building a bigger, faster version
of what we had before is easier (support for legacy codes, etc.), but at
some point the balance shifts, and it's no longer easier to just build
bigger and faster.

Vector processors
Pipelines
Cluster computers
Etc.

The "processors in a sea of memory" model has been around for a while
(and, in fact, there were a lot of designs in the 80s, at the board if not
the chip level: transputers, early hypercubes, etc.)  So this is
revisiting the architecture at a smaller level of integration.

One thing about power consumption.. Those memory cells consume so little
power because most of them  are not being accessed.  They're essentially
"floating" capacitors. So the power consumption of the same transistor in
a CPU (where the duty factor is 100%) is going to be higher than the power
consumption in a memory cell (where the duty factor is 0.001% or
something).
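A rough way to put numbers on the duty-factor argument is the usual CMOS
switching-power approximation, P = a * C * V^2 * f, where a is the fraction
of cycles in which the node actually toggles. The sketch below is only
illustrative: the capacitance and supply voltage are assumed round numbers,
not measurements of any real part, and leakage and DRAM refresh are ignored.
Only the 500MHz clock and the two duty factors come from the discussion.

```python
def dynamic_power_watts(activity, capacitance_f, vdd_volts, freq_hz):
    """Textbook CMOS switching-power approximation: P = a * C * V^2 * f."""
    return activity * capacitance_f * vdd_volts ** 2 * freq_hz

# Assumed, illustrative values (not taken from any datasheet).
C_NODE = 1e-15   # 1 fF of switched capacitance per node
VDD = 1.2        # supply voltage in volts
F_CLK = 500e6    # 500 MHz, the clock quoted for the TOMI prototypes

# Same transistor, two very different duty factors.
cpu_node = dynamic_power_watts(1.0, C_NODE, VDD, F_CLK)    # 100% duty
mem_cell = dynamic_power_watts(1e-5, C_NODE, VDD, F_CLK)   # 0.001% duty

ratio = cpu_node / mem_cell
print(f"CPU node: {cpu_node:.3e} W")
print(f"Mem cell: {mem_cell:.3e} W")
print(f"Ratio:    {ratio:.0f}x")
```

Under this model the ratio is just the inverse of the duty factor (about
100,000 here), whatever values are picked for C and V.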

And, as always, the challenge is in the software to effectively use the
distributed computing architecture.  When you think about it, we've had
almost a century to figure out how to program single instruction stream
computers of one sort or another, and it was easy, because we are single
stream (SISD) ourselves.  We can create a simulation of multiple threads
by timesharing in some sense (in either the human or machine models).

And we have lots of experience with EP type, or even scatter/gather type
processes (tilling land, building pyramids, assembly lines) so that model
of software/hardware architecture can be argued to be a natural outgrowth
of what humans already do, and have been figuring out how to do for
millennia.  (did Imhotep use some form of project planning tools?  You bet
he did)

However, true parallelism (MIMD) is harder to conceptualize.  Vector and
matrix math is one area, but I'd argue that it's just the same as EP
tasks, just at a finer grain. Systolic arrays, vector pipelines, FFT boxes
from FloatingPointSystems, are all basically ways to use the underlying
structure of the task, in an easy way (how long til there's a hardware
implementation of the new faster-than-FFT algorithm published last week?)
And in all those cases, you have to explicitly make use of the special
capabilities.  That is, in general, the compiler doesn't recognize it
(although, modern parallelizing compilers ARE really smart.. So they
probably do find most of the cases)

I don't know that we have good conceptual tools to take a complex task and
break it effectively into multiple disparate component tasks that can
effectively run in parallel.  It's a hard task for something
straightforward (e.g. Designing a big system or building a spacecraft),
and I don't know that any of the outputs of current project planning
techniques (which are entirely manual) can be said to produce
"generalized" optimum outputs.  They produce *an* output for dividing the
complex task up (or else the project can't be done), but I don't know that
the output is provably optimum or even workable (an awful lot of projects
over-run, and not just because of bad estimates for time/cost).

So the problem facing would-be users of new computing architectures (be
they TOMI, HyperCube, ConnectionMachine, or Beowulf) is like that facing a
project planner given a big project, and a brand new crew of workers who
speak a different language, with skill sets totally different from what the
planner is used to.

This is what the computer user is facing:  There's no compiler or problem
description technique that will automatically generate a "work plan" to
use that new architecture. It's all manual, and it's hard, and you're up
against a brute force "why not just hook 500 people up to that rock and
drag it" approach.  The people who figure out the new way will certainly
benefit society, but there's going to be a lot of false starts along the
way.  And, I'm not particularly sanguine about the process being automated
(at least in the sense of automatic parallelizing compilers that recognize
loops and repetitive stuff).  I think that for the next few years
(decades?) using new architectures is going to rely on skilled humans to
figure out how to use them, on an ad hoc, unique to each application, basis.


[Back in the 80s, I had a loaner "sugarcube" 4 node Intel hypercube
sitting on my desk for a while.  I wanted to figure out something to do
with it that was non-trivial, and not just the examples given in the docs (which
focused on stuff like LISP and Prolog).  I started, as I'm sure many
people do, by taking a multithreaded application I had, and distributing
the threads to processors.  You pretty quickly realize, though, that it's
tough to evenly distribute the loads among processors, and you wind up
with processor 1 waiting for something that processor 2 is doing, which in
turn is waiting for something that processor 3 is doing, and so forth.  In
a "shared processor" this isn't a big deal, and is transparent: the
processor is always working, and aside from deadlocks, there's no
particular reason why you need to balance load among threads.
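The load-balance point can be sketched with a toy model. Assume (this is a
simplification, not a description of the actual hypercube code) that each
thread does some amount of work per step and that all threads must exchange
results before the next step starts. With one thread per processor, the step
takes as long as the slowest thread; on a single shared processor, the step
takes the sum of the work but the processor never idles. The load figures
are made up.

```python
def step_time_distributed(loads):
    """One thread per processor: every node waits for the slowest one."""
    return max(loads)

def step_time_shared(loads):
    """All threads timeshare one processor: work adds up, nothing idles."""
    return sum(loads)

# Hypothetical per-step work for four threads, in arbitrary time units.
loads = [1.0, 2.0, 4.0, 9.0]

t_shared = step_time_shared(loads)
t_dist = step_time_distributed(loads)

# Speedup from going to four processors, and the fraction of the
# four-processor machine that is doing useful work.
speedup = t_shared / t_dist
utilization = t_shared / (len(loads) * t_dist)

print(f"shared: {t_shared}, distributed: {t_dist}")
print(f"speedup: {speedup:.2f}x on {len(loads)} processors")
print(f"utilization: {utilization:.0%}")
```

With these made-up loads, four processors deliver well under a 2x speedup,
and more than half the machine is waiting at any given time.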

For what it's worth, the task I was doing was comparable to taking
execution of a Matlab/simulink model and distributing it across multiple
processors.  You had signals flowing among blocks, etc.  These things are
computationally intensive (especially if you have loops in the design, so
you need an iterative solution of some sort) so the idea of putting
multiple processors to work is attractive.  But the "work" in each block
in the diagram isn't known a priori and might vary during the course of
the simulation, so it's not like you can come up with some sort of
automatic partitioning algorithm.]
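To illustrate why a partition computed up front goes stale, here is a small
sketch. It uses a standard greedy heuristic (longest-processing-time-first)
to assign blocks to processors from estimated costs, then re-evaluates the
same assignment after one block's cost changes. The block names and costs
are hypothetical, and this is not the scheme used on the hypercube; it is
just one plausible static partitioner.

```python
import heapq

def lpt_partition(costs, n_procs):
    """Greedy LPT: give the next-largest block to the least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for block in sorted(costs, key=costs.get, reverse=True):
        load, proc = heapq.heappop(heap)
        assignment[block] = proc
        heapq.heappush(heap, (load + costs[block], proc))
    return assignment

def makespan(assignment, costs, n_procs):
    """Time for one pass over the diagram: the load on the busiest processor."""
    loads = [0.0] * n_procs
    for block, proc in assignment.items():
        loads[proc] += costs[block]
    return max(loads)

N_PROCS = 2

# Hypothetical cost estimates made before the simulation starts.
estimated = {"filter": 4.0, "mixer": 3.0, "agc": 2.0, "detector": 1.0}
plan = lpt_partition(estimated, N_PROCS)

# Later in the run, the iterative "agc" block converges slowly.
actual = dict(estimated, agc=8.0)

planned = makespan(plan, estimated, N_PROCS)
stale = makespan(plan, actual, N_PROCS)
replanned = makespan(lpt_partition(actual, N_PROCS), actual, N_PROCS)

print(f"planned: {planned}, stale plan: {stale}, re-planned: {replanned}")
```

The plan that looked perfectly balanced on the estimates ends up noticeably
worse than a fresh partition once the costs shift, and re-partitioning
mid-run has its own cost (moving state between nodes) that this sketch
ignores.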


On 1/23/12 7:38 AM, "Prentice Bisbal" <prentice at ias.edu> wrote:

>If you read this PDF from Venray Technologies, which is linked to in the
>article, you see where the "Whole Bunch of Crazy" part comes from. After
>reading it, Venray lost a lot of credibility in my book.
>
>https://www.venraytechnology.com/economics_of_cpu_in_DRAM2.pdf
>
>--
>Prentice
>
>
>On 01/23/2012 08:45 AM, Eugen Leitl wrote:
>> (Old idea, makes sense, will they be able to pull it off?)
>>
>>
>> http://hothardware.com/News/CPU-Startup-Combines-CPUDRAMAnd-A-Whole-Bunch-Of-Crazy/
>>
>> CPU Startup Combines CPU+DRAM--And A Whole Bunch Of Crazy
>>
>> Sunday, January 22, 2012 - by Joel Hruska
>>
>> The CPU design firm Venray Technology announced a new product design this
>> week that it claims can deliver enormous performance benefits by combining
>> CPU and DRAM on to a single piece of silicon. We spent some time earlier
>> this fall discussing the new TOMI (Thread Optimized Multiprocessor) with
>> company CTO Russell Fish, but while the idea is interesting, its
>> presentation is marred by crazy conceptualizing and deeply suspect
>> analytics.
>>
>> The Multicore Problem:
>>
>> There are three limiting factors, or walls, that limit the scaling of
>> modern microprocessors. First, there's the memory wall, defined as the gap
>> between the CPU and DRAM clock speed. Second, there's the ILP (Instruction
>> Level Parallelism) wall, which refers to the difficulty of decoding enough
>> instructions per clock cycle to keep a core completely busy. Finally,
>> there's the power wall--the faster a CPU is and the more cores it has, the
>> more power it consumes.
>>
>> Attempting to compensate for one wall often risks running afoul of the
>> other two. Adding more cache to decrease the impact of the CPU/DRAM speed
>> discrepancy adds die complexity and draws more power, as does raising CPU
>> clock speed. Combined, the three walls are a set of fundamental
>> constraints--improving architectural efficiency and moving to a smaller
>> process technology may make the room a bit bigger, but they don't remove
>> the walls themselves.
>>
>> TOMI attempts to redefine the problem by building a very different type of
>> microprocessor. The TOMI Borealis is built using the same transistor
>> structures as conventional DRAM; the chip trades clock speed and
>> performance for ultra-low leakage. Its design is, by necessity, extremely
>> simple. Not counting the cache, TOMI is a 22,000 transistor design, as
>> compared to 30,000 transistors for the original ARM2. The company's early
>> prototypes, built on legacy DRAM technology, ran at 500MHz on a 110nm
>> process.
>>
>> Instead of surrounding a CPU core with a substantial amount of L2 and L3
>> cache, Venray inserted a CPU core directly into a DRAM design. A TOMI
>> Borealis core connects eight TOMI cores to a 1Gbit DRAM with a total of 16
>> ICs per 2GB DIMM. This works out to a total of 128 processor cores per
>> DIMM. Because they're built using ultra-low-leakage processes and are so
>> small, such cores cost very little to build and consume vanishingly small
>> amounts of power (Venray claims power consumption is as low as 23mW per
>> core at 500MHz).
>>
>> It's an interesting idea.
>>
>> The Bad:
>>
>> When your CPU has fewer transistors than an architecture that debuted in
>> 1986, there's a good chance that you left a few things out--like an FPU,
>> branch prediction, pipelining, or any form of speculative execution.
>> Venray may have created a chip with power consumption an order of
>> magnitude lower than anything ARM builds and more memory bandwidth than
>> Intel's highest-end Xeons, but it's an ultra-specialized,
>> ultra-lightweight core that trades 25 years of flexibility and performance
>> for scads of memory bandwidth.
>>
>>
>> The last few years have seen a dramatic surge in the number of low-power,
>> many-core architectures being floated as the potential future of
>> computing, but Venray's approach relies on the manufacturing expertise of
>> companies who have no experience in building microprocessors and don't
>> normally serve as foundries. This imposes fundamental restrictions on the
>> CPU's ability to scale; DRAM is manufactured using a three layer mask
>> rather than the 10-12 layers Intel and AMD use for their CPUs. Venray
>> already acknowledges that these conditions imposed substantial limitations
>> on the original TOMI design.
>>
>> Of course, there's still a chance that the TOMI uarch could be effective
>> in certain bandwidth-hungry scenarios--but that's where the Venray Crazy
>> Train goes flying off the track.
>>
>> The Disingenuous and Crazy
>>
>> Let's start here. In a graph like this, you expect the two bars to
>> represent the same systems being compared across three different
>> characteristics. That's not the case. When we spoke to Russell Fish in
>> late November, he pointed us to this publicly available document and
>> claimed that the results came from a customer with 384 2.1GHz Xeons.
>> There's no such thing as an S5620 Xeon and even if we grant that he meant
>> the E5620 CPU, that's a 2.4GHz chip.
>>
>> The "Power consumption" graphs show Oracle's maximum power consumption for
>> a system with 10x Xeon E7-8870s, 168 dedicated SQL processors, 5.3TB (yes,
>> TB) of Flash and 15x 10,000 RPM hard drives. It's not only a worst-case
>> figure, it's a figure utterly unrelated to the workload shown in the
>> Performance comparison. Furthermore, given that each Xeon E7-8870 has a
>> 130W TDP, ten of them only come out to 1.3kW--Oracle's 17.7kW figure means
>> that the overwhelming majority of the cabinet's power consumption is
>> driven by components other than its CPUs.
>>
>> From here, things rapidly get worse. Fish makes his points about power
>> walls by referring to unverified claims that prototype 90nm Tejas chips
>> drew 150W at 2.8GHz back in 2004. That's like arguing that Ford can't
>> build a decent car because the Edsel sucked.
>>
>> After reading about the technology, you might think Venray was planning to
>> market a small chip to high-end HPC niche markets... and you'd be wrong.
>> The company expects the following to occur as a result of this
>> revolutionary architecture (organized by least-to-most creepy):
>>
>>     Computer speech will be so common that devices will talk to other
>> devices in the presence of their users.
>>
>>     Your cell phone camera will recognize the face of anyone it sees and
>> scan the computer cloud for background red flags as well as six degrees of
>> separation
>>
>>     Common commands will be reduced to short verbal cues like clicking your
>> tongue or sucking your lips
>>
>>     Your personal history will be displayed for one and all to see...women
>> will create search engines to find eligible, prosperous men. Men will
>> create search engines to qualify women. Criminals will find their jobs
>> much more difficult because their history will be immediately known to
>> anyone who encounters them.
>>
>>     TOMI Technology will be built on flash memories creating the elemental
>> unit of a learning machine... the machines will be able to self organize,
>> build robust communicating structures, and collaborate to perform tasks.
>>
>>     A disposable diaper company will give away TOMI enabled teddy bears
>> that teach reading and arithmetic. It will be able to identify specific
>> children... and from time to time remind Mom to buy a product. The bear
>> will also diagnose a raspy throat, a cough, or runny nose.
>>
>> Conclusion:
>>
>> Fish has spent decades in the microprocessor industry--he invented the
>> first CPU to use a clock multiplier in conjunction with Chuck H. Moore--but
>> his vision of the future is crazy enough to scare mad dogs and Englishmen.
>>
>> His idea for a CPU architecture is interesting, even underneath the
>> obfuscation and false representation, but too practically limited to ever
>> take off. Google, an enthusiastic and dedicated proponent of energy
>> efficient, multi-core research said it best in a paper titled "Brawny
>> cores still beat wimpy cores, most of the time."
>>
>>  "Once a chip's single-core performance lags by more than a factor of two
>> or so behind the higher end of current-generation commodity processors,
>> making a business case for switching to the wimpy system becomes
>> increasingly difficult... So go forth and multiply your cores, but do it
>> in moderation, or the sea of wimpy cores will stick to your programmers'
>> boots like clay."
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>>http://www.beowulf.org/mailman/listinfo/beowulf
>>