[Beowulf] Nvidia K1 Denver

Tue Aug 12 00:05:02 PDT 2014

I was surprised to find that the Nvidia K1 Denver is a significant
departure from the ARM Cortex-A53 and Cortex-A57 cores.

Summary at:

http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/

Details at (if you are willing to share your email address):
http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/

Highlights:
* The Nvidia 64-bit Denver core is in-order!
* dynamic code optimizer uses 128MB of RAM to optimize frequently
  used code segments
* Larger 128KB L1 I cache to handle microcode expansion
* L1 I cache can deliver a 32 byte parcel to the scheduler every cycle
* Optimizer is not visible to OS or hypervisor
* 7 way issue
* 13 cycle mispredict.
* expected launch at 2.5 GHz
* Slightly better than Haswell Celeron 2955U at SpecINT 2k
* Slightly worse than Haswell Celeron 2955U at SpecFP 2k
* new lower power state, CC4, that maintains cache and CPU state
  information; it looks to draw around 5 mW judging from the
  graph.
* special optimization lookup table, 1k entries, that jumps to the
  already optimized code.
* The 128MB optimization cache does not contain any pre-canned
  optimizations for benchmarks.  *chuckle*
* Pin compatible with existing 32 bit K1 chips.
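
To make the lookup table bullet concrete, here is a minimal sketch of how
a 1k-entry table could map an ARM code address to already optimized code.
The direct-mapped layout, the index function, and all names
(OptimizationLookupTable, install, lookup) are my assumptions for
illustration; the material above does not describe how the real table is
organized.

```python
# Hypothetical sketch, not Nvidia's design: a direct-mapped table that maps
# an ARM code address to the address of an already optimized routine.
TABLE_ENTRIES = 1024  # "1k entries" from the highlights


class OptimizationLookupTable:
    def __init__(self, entries=TABLE_ENTRIES):
        self.entries = entries
        # Each slot holds (arm_pc, optimized_pc), or None when empty.
        self.slots = [None] * entries

    def _index(self, arm_pc):
        # AArch64 instructions are 4 bytes wide, so the low 2 bits carry
        # no information; drop them before indexing (assumed scheme).
        return (arm_pc >> 2) % self.entries

    def install(self, arm_pc, optimized_pc):
        # A new entry simply evicts whatever shared the slot.
        self.slots[self._index(arm_pc)] = (arm_pc, optimized_pc)

    def lookup(self, arm_pc):
        # Return the optimized entry point on a hit, None on a miss.
        slot = self.slots[self._index(arm_pc)]
        if slot is not None and slot[0] == arm_pc:
            return slot[1]
        return None


table = OptimizationLookupTable()
table.install(0x4000, 0x80000000)
hit = table.lookup(0x4000)    # optimized code exists: jump there
miss = table.lookup(0x4004)   # no entry: run the original ARM code
```

On a miss the core would fall back to executing the original ARM
instructions, which is why a small table is enough: it only has to cover
the hot code.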

The dynamic code optimization:
* collects branch results (taken, not taken, strongly taken, and
  strongly not taken).
* performs register renaming
* claimed comparable performance to OoO hardware implementations
* claimed power efficiency of in order implementations
* can reorder loads/stores
* removes redundant code
* hoists redundant computations
* unrolls loops
* claims larger instruction reorder window than hardware implementations
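
The four branch states in the first bullet match the classic 2-bit
saturating counter, so here is a small sketch of that scheme. This is the
textbook technique, swapped in for illustration; nothing above says Denver
collects its branch profile this way, and the names (TwoBitCounter, update,
predict, label) are hypothetical.

```python
# Textbook 2-bit saturating counter, used here only as an illustration of
# the four states named in the list above. Not a claim about Denver.
STATES = ["strongly not taken", "not taken", "taken", "strongly taken"]


class TwoBitCounter:
    def __init__(self, state=1):
        # Start in the weak "not taken" state (an arbitrary choice).
        self.state = state

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

    def predict(self):
        # States 2 and 3 predict taken; states 0 and 1 predict not taken.
        return self.state >= 2

    def label(self):
        return STATES[self.state]


counter = TwoBitCounter()
for outcome in [True, True, True, False]:
    counter.update(outcome)
# Three taken outcomes saturate the counter; a single not-taken outcome
# only weakens it, so the prediction is still "taken".
```

An optimizer that sees a branch sitting in one of the "strongly" states
has a reasonable basis for laying out the hot path as straight-line code.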

Seems like the optimizer has a pretty tough job, considering that
compilers have already attempted similar optimizations with access to
source code and relatively unlimited CPU/RAM resources compared to a
battery-operated tablet/phone/widget.

For reference, Celeron 2955U = 2 cores @ 1.4GHz, 2MB cache, 15 watt TDP,
Haswell core, 25.6GB/sec mem bandwidth.
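
A few of the numbers in this post can be sanity-checked with simple
arithmetic. The sketch below assumes the Denver core actually runs at the
expected 2.5 GHz launch clock, and assumes the Celeron's 25.6GB/sec comes
from two 64-bit DDR3-1600 channels (my assumption; the memory
configuration is not stated above). The function names are made up for
this example.

```python
# Back-of-the-envelope checks; the clock and memory configuration are
# assumptions, as noted in the text above.

def fetch_bandwidth_gb_s(bytes_per_cycle, clock_ghz):
    # Bytes per cycle times cycles per nanosecond gives GB/s.
    return bytes_per_cycle * clock_ghz


def mispredict_penalty_ns(cycles, clock_ghz):
    # Cycles divided by cycles per nanosecond gives nanoseconds.
    return cycles / clock_ghz


def dram_bandwidth_gb_s(channels, bus_bits, mega_transfers_per_s):
    # Each transfer moves bus_bits / 8 bytes on every channel.
    return channels * (bus_bits / 8) * mega_transfers_per_s / 1000


# 32 byte parcel per cycle from the L1 I cache at the expected 2.5 GHz
l1i_peak = fetch_bandwidth_gb_s(32, 2.5)

# 13 cycle mispredict at the expected 2.5 GHz
penalty = mispredict_penalty_ns(13, 2.5)

# Assumed dual-channel, 64-bit, DDR3-1600 on the Celeron 2955U
celeron_mem = dram_bandwidth_gb_s(2, 64, 1600)

print(f"L1 I peak fetch: {l1i_peak:.1f} GB/s")
print(f"Mispredict penalty: {penalty:.1f} ns")
print(f"Celeron memory bandwidth: {celeron_mem:.1f} GB/s")
```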