Details on 21264

Bill Broadley bill@math.ucdavis.edu
Fri, 25 Sep 1998 02:42:39 -0400


Here are some notes I put together a while back.  264s are showing up
all over (conferences), Microway has made an announcement about
a dual CPU config, and various other vendors are making various noises.

Maybe the wait is almost over.  Anyway, the goods:


References:
http://www.digital.com/info/semiconductor/a264up1/index.html
http://www.digital.com/semiconductor/alpha/papers/microrep/digital2.htm

21264
	15.3 million transistors
	L1 dcache > 8 GB/sec (128 bit wide, 3 cycle load latency)
	L2 cache > 4 GB/sec (128 bit wide, 12 cycle load latency)
	System interface: 2 GB/sec sustained (64 bit wide, 80 cycle load latency)
	16 64 byte main memory ref's (8 read misses, 8 write backs)

	64k L1 2-way associative any combo (rr,rw,ww), 2 cycle latency
		(actually running at double the clock)
	Duplicated 128 entry TLB (fully associative)
	Up to 8 cache misses in the miss queue, waiting for external cache.
	Displaced cache lines go to an 8 entry victim buffer.

	L2 possibilities: (128 bits wide)
		133 MHz burstram                            2.1 GB/sec  Cheap
		250 MHz latewrite                           4.0 GB/sec
		333 MHz dual data (167 MHz on both edges)   5.3 GB/sec  Expensive
	Up to 2.6 GB/sec to main memory, 1.6 GB/sec observed in stream, 64 bits wide,
		up to 333 MHz.
	4 ops a cycle sustainable (6 peak)
	Up to 80 instructions in flight at one time, as well as 16 loads and 16
		stores.
	Onchip L2 cache interface, onchip bus interface.
	Primary and secondary caches are non-blocking.
	Tsunami D-chip demultiplexes the 64 bit bus (up to 333 MHz) into a 256
		bit wide SDRAM subsystem that can run at 100 MHz.
	Low cost configurations likely to be 128 bits wide, expensive
		ones 512 bits wide.
	Very parallel/out of order.  (2 instructions per cycle on spec95)
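
The bandwidth figures above are just bus width times transfer rate, so
here is a rough Python sketch to check them.  The function name and the
1 GB = 10^9 bytes convention are my assumptions, not anything from the
Digital docs, and these are peak numbers (one transfer per cycle), not
sustained.

```python
def peak_bandwidth_gb_per_sec(bus_width_bits, transfer_rate_mhz):
    """Peak bandwidth in GB/sec, assuming one transfer per cycle and 1 GB = 10**9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mhz * 1e6 / 1e9


# L2 options from the notes, all on the 128 bit wide cache interface.
l2_options = {
    "133 MHz burstram": 133,
    "250 MHz latewrite": 250,
    "333 MHz dual data": 333,
}

for name, mhz in l2_options.items():
    print(f"{name}: {peak_bandwidth_gb_per_sec(128, mhz):.1f} GB/sec")

# 64 bit system bus at its top speed, and the 256 bit SDRAM side of Tsunami.
print(f"64 bit bus at 333 MHz: {peak_bandwidth_gb_per_sec(64, 333):.2f} GB/sec")
print(f"256 bit SDRAM at 100 MHz: {peak_bandwidth_gb_per_sec(256, 100):.1f} GB/sec")
```

The 64 bit bus works out to about 2.66 GB/sec, which matches the "up to
2.6 GB/sec" figure if you truncate rather than round.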

Floating point unit.
	72 fp registers (32 + renames)
	dual issue (15 instruction queue)
	1 Mul and 1 FP/ADD/DIV/Sqrt on the fp side 
	peak 4 flops (2 memory references, 1 mult, and 1 other fp)
	1st fp unit 64 bit add, 4 cycle latency, pipelined,  
		64 bit IEEE divide 16 cycle latency,
		64 bit sqrt 33 cycle latency.
	2nd fp unit 64 bit multiply
	Pipelined, with 4 cycle latency for add, mult, and most other ops.
	Double precision divide takes 16 cycles and sqrt takes 33; neither
		prevents the other fp unit from operating.

Integer unit
	80 int registers (32 + renames)
	Quad issue (20 instruction queue)
	64k L1 2-way associative (2 cycles to access)
	Integer register file is duplicated, each copy having 4 read ports and 6 write
		ports.
	2 Addr Alu's, and 2 Int Units (1 int unit handles the video instructions)

Branch unit:
	mispredicted branch at least 7 cycles (including 2 cycles for i cache)
	average mispredicted branch 11 cycles.
	2 level branch predictor table:
		local predictor indexed by program counter with a single level table
		global predictor indexed by global history of all branches.
	3rd table observes the history of both predictors and chooses the better
		for each situation.
	Claimed 7-10 mispredictions per 1000 instructions in spec95.
	35 kbits of storage for the branch history (2% of die) + 48 kbits stored
		in the icache (per cache line).
	
	
Fetch unit:
	Each cache line has a "next-line" predictor and a set predictor.
	2 load/stores per cycle (any combination)
	32 entry load/store reorder buffer
	Cache prefetch (read, modify intent, non-cached, evict next)
	Taken branches have a zero-cycle delay.
	Feeds 4 instructions per cycle to decode unit
	Return address stack for 32 levels of subroutine calls.
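
For anyone unfamiliar with return address stacks, here is a small Python
sketch of the idea with the 32 level depth from the notes.  The circular
overwrite-the-oldest behavior on overflow and the class and method names
are my assumptions; I don't know how the real hardware handles deeper
call chains.

```python
class ReturnAddressStack:
    def __init__(self, depth=32):
        self.depth = depth
        self.entries = [None] * depth
        self.top = 0    # index of the next free slot
        self.count = 0  # number of valid entries, capped at depth

    def push(self, return_address):
        """On a subroutine call: remember where the matching return should go."""
        self.entries[self.top] = return_address
        self.top = (self.top + 1) % self.depth  # wrap, overwriting the oldest entry
        self.count = min(self.count + 1, self.depth)

    def pop(self):
        """On a return: predict the target, or None if nothing valid is left."""
        if self.count == 0:
            return None
        self.top = (self.top - 1) % self.depth
        self.count -= 1
        return self.entries[self.top]


ras = ReturnAddressStack()
ras.push(0x1004)          # outer call
ras.push(0x2008)          # nested call
print(hex(ras.pop()))     # predicted target for the inner return
print(hex(ras.pop()))     # predicted target for the outer return
```

In this sketch, call chains deeper than 32 lose the oldest return
addresses, so the outermost returns would simply get no prediction.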
	
Samsung available chips: (from April)
KP21264-2.0X Est. 28 SPECint95/42 SPECfp95 
KP21264-2.5X Est. 33 SPECint95/50 SPECfp95 
KP21264-3.0X Est. 38 SPECint95/58 SPECfp95 
KP21264-3.5X Est. 43 SPECint95/64 SPECfp95 

I think that's 400-700 MHz...