[Beowulf] Gen - 1 Clusters

Sat May 7 22:42:54 PDT 2005

> > I think this is slightly mistaken: DC is not a generational shift, but rather
> > mostly a packaging change.  DC chips have twice the area of SC, and therefore
> 
> What makes you conclude this?

simple: DC chips do not introduce any new feature.  this is most clear-cut
with Intel's DC chips, since they're nearly indistinguishable from a 
dual-socket, SC system.  as such, I wouldn't call them a different
generation, just a different packaging.  heck, rumor has it that some future
Intel DC chip will actually have two separate dies in the same package
(shades of the P6, or even IBM's MCM systems...)

> And if it is true, what is the
> underlying mechanism for most of the major chip manufacturers moving
> to dual core at similar times, fashion?

my point was simply that the shift is in packaging, not in CPU design,
since the cores are nearly identical to current cores.  certainly no 
reason for users/programmers to care.  it alters the memory balance,
but almost every minor rev of a chip does as well...
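
to put a number on "memory balance": a rough sketch, assuming both cores
share one memory interface whose bandwidth doesn't change.  the figures
(6.4 GB/s, 4.4 Gflops per core) and the function name are made up for
illustration, not measurements of any real chip.

```python
def bytes_per_flop(mem_bw_gb_s, cores, gflops_per_core):
    """Memory balance: memory bandwidth divided by total peak flops."""
    return mem_bw_gb_s / (cores * gflops_per_core)

# hypothetical numbers, chosen only for illustration
MEM_BW_GB_S = 6.4        # assumed bandwidth of the shared memory interface
GFLOPS_PER_CORE = 4.4    # assumed peak per core, same core in both parts

sc_balance = bytes_per_flop(MEM_BW_GB_S, 1, GFLOPS_PER_CORE)
dc_balance = bytes_per_flop(MEM_BW_GB_S, 2, GFLOPS_PER_CORE)

print(f"SC: {sc_balance:.2f} bytes/flop")
print(f"DC: {dc_balance:.2f} bytes/flop")
```

under these assumptions the DC part has half the bytes/flop of the SC part,
which is the same kind of shift you get from a clock bump on an unchanged
memory interface.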

> I ask seriously.  I actually know little about it, but my assumption
> is that the chip designers are having difficulty taking an existing
> chip using N transistors and figuring out how to instead use N*2
> transistors to double the performance by extracting more
> instruction level parallelism out of serial code.  And that those
> designers are therefore doing the obvious thing - giving up - and are
> instead using those N*2 transistors to build two CPUs each with the
> same old N transistors.
> 
> Is this model incorrect?

your model is roughly right, though it's a bit more subtle than that.  creating a faster SC
in the same chip area would involve:
	- increasing cache size (but for most apps, 1M is probably already
	showing diminishing returns.)
	- increasing the number of functional units (but this implies
	increasing other aspects of the chip, such as register-file and 
	cache ports, and the width of decode/dispatch/retire units.)

the attraction of DC is not only that you don't have to bother designing 
and validating another core, but also that in some cases, it's the right 
thing to do (in preference to a faster SC).  it's also fairly significant that you 
can more easily power-down a second core, versus powering down some fraction
of the functional units on a single core.  presumably a faster SC would also 
have problematic hotspots - choke points like the ones mentioned above.

it's not hard to imagine lots of interesting ways to use more transistors
than current chips.  HyperThreading, for instance, has been rather mediocre
mainly because HT-enabled chips *don't* have additional resources - that is,
it's just a timeslicing of the same resources that in some cases improves 
throughput.  once people digest multicore, applications will make more use 
of threads, and that will make it profitable for chip designers to really
do multicore "right".  for instance, you could have multiple fetch/decode/retire
units sharing a larger pool of ALU/FPU resources.  this would let a single 
thread "burst" to a large number of, say, FP operations, more than if the 
same number of FPUs were statically partitioned among discrete cores.
it changes the chip into more of a dataflow network, probably based on 
a crossbar.  that would be a real generational shift.
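
a toy model of the pooled-vs-partitioned point, counting FP ops issued in a
single cycle.  everything here is an assumption for illustration: the
function names, the per-cycle demand list, and the idea that a pooled design
can hand any free FPU to any thread with no arbitration cost.  it is not a
model of any real or announced chip.

```python
def issued_partitioned(demand, fpus_per_core):
    """Each thread can use only its own core's FPUs."""
    return sum(min(d, fpus_per_core) for d in demand)

def issued_pooled(demand, fpus_per_core):
    """All front ends draw from one shared pool of the same total size."""
    total_fpus = fpus_per_core * len(demand)
    return min(sum(demand), total_fpus)

# hypothetical cycle: thread 0 wants 4 FP ops, thread 1 wants none
demand = [4, 0]
partitioned = issued_partitioned(demand, fpus_per_core=2)
pooled = issued_pooled(demand, fpus_per_core=2)

print("partitioned:", partitioned)
print("pooled:     ", pooled)
```

with the same four FPUs in total, the partitioned layout issues 2 ops in that
cycle and the pooled one issues 4.  when both threads are equally busy the
two layouts come out the same, so the gain is only for bursty demand.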

another way to look at this is that DC is the first step in herding
programmers (and OS and compiler writers) towards more parallelism.
the literature on parallelism shows that many codes have lots of 
parallel opportunity, but actually implementing it at the chip level 
is quite challenging (how do you know which loops or conditionals to 
spin off speculatively or in parallel?)  going multicore means that the 
programmer+compiler decide how many threads and where...
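
a minimal sketch of what "the programmer decides how many threads and where"
looks like in code.  the workload (sum of squares), the function names and
the strided split are all hypothetical choices.  note that in CPython the
global interpreter lock keeps this particular loop from running faster on
two cores; the point is only where the decomposition decision lives.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work done by one thread: sum of squares over its share of the data."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_threads):
    # the programmer, not the hardware, picks both the thread count and
    # the split: here a strided partition into n_threads chunks
    chunks = [data[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(partial_sum, chunks))

data = list(range(1000))
result = parallel_sum_of_squares(data, n_threads=2)
print(result)
```

nothing in the chip has to guess which loop to spin off: the split into
chunks is stated explicitly in the source.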

ultimately, it boils down to chip designers trying to keep the silicon busy.
that's the main point of hyperthreading (switch context on cache misses,
etc).  it's also the main point (with a radically different set of
assumptions) of the Itanium's EPIC/VLIW architecture.  (it2 assumes that the 
compiler can prefetch data and pre-evaluate conditionals far enough ahead
that there will be no pipe stalls.  obviously, this works well for certain
codes, but certainly does not for others.  hence, it2's SPECfp performance 
is strongly dominated by the 2-3 components which are entirely in-cache.)
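
the "keep the silicon busy" argument can be sketched with a back-of-envelope
utilization model, assuming the switch-on-miss picture described above: each
thread computes for some cycles, then stalls on a miss, and another thread
runs during the stall at zero switch cost.  the function name and the cycle
counts are invented for illustration, and real HT hardware is more
complicated than this.

```python
def utilization(n_threads, compute_cycles, stall_cycles):
    """Fraction of cycles the core does useful work in the toy model."""
    one_thread = compute_cycles / (compute_cycles + stall_cycles)
    # extra threads fill stall time until the core is saturated
    return min(1.0, n_threads * one_thread)

# hypothetical thread: 10 cycles of work, then a 30-cycle miss
single = utilization(1, 10, 30)
dual = utilization(2, 10, 30)

print(f"1 thread:  {single:.0%} busy")
print(f"2 threads: {dual:.0%} busy")
```

in this model a second thread doubles utilization only while stalls
dominate; once compute time fills the stall window, extra threads add
nothing, since no new execution resources were added.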

regards, mark hahn.