[Beowulf] Re: Opteron 275 performance

Sat Jul 30 11:47:41 PDT 2005

> Be very careful.  Hyper TRANSPORT is what AMD (and the HT consortium)
> push as a replacement to the traditional bus -- it amounts to putting the
> CPUs, memory, and all peripherals on a very high bandwidth low latency
> network, IIRC.  It enables a sort of SMP design very similar to a
> "cluster" inside a single box, whether the processors are single or
> dual core, and is the way AMD is going quite heavily in their future
> designs.  It is very useful and relevant to SMP, multicore, and single
> core designs and important to HPC.

I find this slightly confusing.  HTrans is indeed a fast interconnect,
but the salient features are that it's purely point-to-point and that 
AMD's cache-coherency is implemented using it.  Intel chips depend 
on shared front-side buses, which means that the CPUs inherently contend 
for memory bandwidth.  that's why quad-xeons usually sucked, for instance.
a shared snoopy bus could potentially be faster for inter-processor 
traffic (a hot lock, for instance), but I've never seen numbers supporting
this - opterons seem to always win for latency.

hypertransport is very elegant, and I hope AMD+someone manage to add some
form of directory-based coherence soon, so it scales beyond 2-4 nodes.
(the problem is that AMD's current design requires broadcasts of some 
coherency traffic, which starts to look ugly even at 4 nodes, and is 
noticeable at 8.  going multi-core is an interesting twist to this, since 
onchip, coherency appears to be snoopy, so the HT-broadcast is only 
required for inter-socket coherence.  maybe 4sock, 4core machines will 
cover enough of the market to avoid adding a directory-based scheme...)
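
to put rough numbers on the broadcast problem, here's a toy count in python.
it assumes every miss probes every other node under broadcast, and that a
hypothetical directory needs one home-node lookup plus one probe per actual
sharer.  the function names and the sharers=1 default are mine, not anything
AMD has described, and this counts messages only, not latency.

```python
# toy model of coherence probes per cache miss: an illustration, not a measurement

def broadcast_probes(nodes):
    """Probes for one miss when the requester must ask every other node."""
    return nodes - 1

def total_broadcast_probes(nodes):
    """Probes in flight if every node takes one miss at the same time."""
    return nodes * broadcast_probes(nodes)

def directory_probes(nodes, sharers=1):
    """Assumed directory scheme: one home-node lookup plus one probe per sharer."""
    return 1 + min(sharers, nodes - 1)

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, broadcast_probes(n), total_broadcast_probes(n), directory_probes(n))
```

the point is only the shape: the broadcast total grows as n*(n-1), while the
assumed directory cost per miss stays flat once n exceeds the sharer count.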

> Hyper THREADING is Intel's solution to what amounts to an overlong
> instruction pipeline on the CPU itself.  In single-threaded code a long

hmm, not really.  Intel's pipeline length is indeed very long,
but hyperthreading is all about *switching*, not pipelining.

the real problem with hyperthreading is that it works best with bad code:
code that normally spends most of its time stalled (cache/tlb misses, etc).
on code that makes effective use of the hardware, it's only a slowdown.
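
a back-of-envelope model of why this happens, in python.  the assumptions
are mine: `busy` is the fraction of time a lone thread keeps the core
working, two HT threads can at best fill the core (min(1, 2*busy)), and
sharing the caches costs a flat `penalty` that I've picked arbitrarily at
10%.  none of this is measured on real hardware.

```python
# toy throughput model for two threads sharing one core (illustrative only)

def ht_speedup(busy, penalty=0.10):
    """Throughput of two HT threads relative to one thread running alone.

    busy    -- fraction of time a lone thread keeps the core working (0 < busy <= 1)
    penalty -- assumed fractional loss from sharing caches and other resources
    """
    if not 0.0 < busy <= 1.0:
        raise ValueError("busy must be in (0, 1]")
    # two threads cannot use more than the whole core
    combined = min(1.0, 2.0 * busy) * (1.0 - penalty)
    return combined / busy

if __name__ == "__main__":
    for busy in (0.3, 0.5, 0.8, 0.95):
        print(busy, round(ht_speedup(busy), 2))
```

with these made-up numbers, stall-heavy code (low `busy`) comes out well
ahead, and code that already keeps the core busy comes out below 1.0.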

> <editorial comment>In my personal opinion Hthreading is pretty much
> useless to "most" HPC applications and seems to be mostly irrelevant to
> SMP design as well.  If anything, I'd expect hyperthreading to really
> complicate SMP kernel design, as one adds a whole layer of complexity to
> the already complex processor affinity question for independent threads
> and multiple memory locality and ITS possible processor affinities.

not really.  linux uses a fairly straightforward "domain" design that 
lets the scheduler intelligently decide when to migrate a process to 
another virtual core, real core, or socket.

in HT mode, the package actually appears to boot two CPUs, so there's 
not that much the kernel has to do to support it.  the rest is affinity,
and handled sanely by the sched-domain design.  for instance, there's 
zero cost to "migrating" a thread between HT siblings.  but there is some 
cost (cache-wise) to migrating to a different core.  an idle HT sibling 
is not bad, but an idle core should definitely pull a proc that's sharing
a core with a busy HT sibling.  etc.
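
the rules above can be sketched in a few lines of python.  this is a toy,
not the kernel's sched-domain code: the LogicalCpu tuple, the level names
and both functions are hypothetical, and real balancing also looks at load,
not just topology.

```python
from collections import Counter, namedtuple

# hypothetical description of one logical CPU as the scheduler sees it
LogicalCpu = namedtuple("LogicalCpu", "socket core thread")

def migration_level(src, dst):
    """Classify a migration by the widest boundary it crosses."""
    if src == dst:
        return "none"
    if src.socket != dst.socket:
        return "socket"
    if src.core != dst.core:
        return "core"      # loses the warm per-core cache
    return "sibling"       # same core, so caches are shared

def pull_candidate(idle_cpu, busy_cpus):
    """Return a busy logical CPU whose task idle_cpu should pull, or None."""
    busy_per_core = Counter((c.socket, c.core) for c in busy_cpus)
    if busy_per_core[(idle_cpu.socket, idle_cpu.core)] > 0:
        return None        # just an idle HT sibling: leave things alone
    for c in busy_cpus:
        if busy_per_core[(c.socket, c.core)] > 1:
            return c       # two tasks share a core while a whole core is idle
    return None
```

so with both siblings of core 0 busy and core 1 idle, pull_candidate()
hands one of the core-0 tasks to core 1; with only one sibling busy it
returns None.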

> It really isn't at all clear that HThreading needs to live (as my kids
> would put it:-).  AMD just uses shorter instruction pipelines (less to
> flush and fill) and seems to outperform Hthreaded Intel at constant
> clock, at any rate.  At MOST it yields a 30% or so speedup that is
> relevant to somebody doing work with lots of independent things going on
> on a single CPU -- typically (unsurprisingly) a desktop user watching a
> movie or the like -- decoding video and audio at the same time, or
> servers handling multiple service threads.  In others, it yields a
> DECREASE of say 10% due to aforementioned cache-thrashing.</editorial
> comment>

HT is a very limited implementation of a general class of multi-threaded
processors.  IBM and others have done better SMT.  and the appeal is clear:
most programs do not manage to keep all of a CPU's functional units busy,
so why not share the pool of FU's among multiple threads?  I'm expecting 
someone to eventually replicate the proc-specific parts (registers and 
L1 cache), but share all the FU's in a package.  sort of merged multi-core.
on the other hand, the chip area devoted to actual compute elements (ALU/FPU)
is dwarfed by caches...

regards, mark hahn.