Chapter 5
Large and Fast: Exploiting Memory Hierarchy

Memory Technology
- Static RAM (SRAM)
  - 0.5ns – 2.5ns, $2000 – $5000 per GB
- Dynamic RAM (DRAM)
  - 50ns – 70ns, $20 – $75 per GB
- Magnetic disk
  - 5ms – 20ms, $0.20 – $2 per GB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

Principle of Locality
- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data

Taking Advantage of Locality
- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

Memory Hierarchy Levels [Figure 5.2]
- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
    - Hit ratio: hits/accesses
- If accessed data is absent
  - Miss: block copied from lower level
    - Time taken: miss penalty
    - Miss ratio: misses/accesses = 1 – hit ratio
  - Then accessed data supplied from upper level

Cache Memory [Figure 5.4]
- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses X1, …, Xn–1, Xn
  - How do we know if the data is present?
  - Where do we look?

Direct Mapped Cache [Figure 5.5]
- Location determined by address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)
  - #Blocks is a power of 2
  - Use low-order address bits

Tags and Valid Bits
- How do we know which particular block is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0
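The two slides above describe everything a direct-mapped lookup needs: low-order address bits select an entry, and the stored tag plus a valid bit confirm the match. A minimal C sketch of that lookup (the `cache_line` struct, array size, and field widths are illustrative assumptions, not the hardware's actual layout):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8          /* must be a power of 2 */

struct cache_line {
    bool     valid;           /* 1 = entry holds a block */
    uint32_t tag;             /* high-order address bits */
    uint32_t data;            /* one word per block here */
};

static struct cache_line cache[NUM_BLOCKS];   /* valid bits start at 0 */

/* Direct-mapped lookup: index = (block address) mod (#blocks),
   which for a power-of-2 cache is just the low-order bits. */
bool lookup(uint32_t block_addr, uint32_t *word_out)
{
    uint32_t index = block_addr % NUM_BLOCKS;     /* low-order bits  */
    uint32_t tag   = block_addr / NUM_BLOCKS;     /* high-order bits */

    if (cache[index].valid && cache[index].tag == tag) {
        *word_out = cache[index].data;            /* hit */
        return true;
    }
    return false;   /* miss: block must be fetched from the next level */
}
```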
Cache Example
- 8 blocks, 1 word/block, direct mapped
- Initial state

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  26         11 010       Miss      010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Hit       110
  26         11 010       Hit       010

  (cache state unchanged)

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  11   Mem[11010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  18         10 010       Miss      010

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]   (replaces Mem[11010])
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Address Subdivision [Figure 5.7]

Example: Larger Block Size
- 64 blocks, 16 bytes/block
- To what block number does address 1200 map?
  - Block address = ⌊1200/16⌋ = 75
  - Block number = 75 modulo 64 = 11

  Address bits:
  31 … 10        9 … 4           3 … 0
  Tag (22 bits)  Index (6 bits)  Offset (4 bits)
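A quick C check of the worked example above, computing the block address and cache block number for byte address 1200 with 16-byte blocks and 64 cache blocks (the constant names are illustrative):

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 16   /* 16 bytes per block     */
#define NUM_BLOCKS  64   /* 64 blocks in the cache */

int main(void)
{
    uint32_t addr = 1200;

    uint32_t block_addr = addr / BLOCK_BYTES;      /* floor(1200/16) = 75 */
    uint32_t block_num  = block_addr % NUM_BLOCKS; /* 75 mod 64 = 11      */
    uint32_t offset     = addr % BLOCK_BYTES;      /* byte within block   */

    printf("block address = %u, cache block = %u, offset = %u\n",
           block_addr, block_num, offset);         /* prints 75, 11, 0 */
    return 0;
}
```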
Block Size Considerations
- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-size cache
  - Larger blocks ⇒ fewer of them
    - More competition ⇒ increased miss rate
  - Larger blocks ⇒ pollution
- Larger miss penalty
  - Can override benefit of reduced miss rate
  - Early restart and critical-word-first can help

Cache Misses
- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

Write-Through
- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1 × 100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

Write-Back
- Alternative: on data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

Write Allocation
- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block

Example: Intrinsity FastMATH
- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

Example: Intrinsity FastMATH [Figure 5.9]

Main Memory Supporting Caches
- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles ≈ 0.25 B/cycle

Increasing Memory Bandwidth [Figure 5.11]
- 4-word wide memory
  - Miss penalty = 1 + 15 + 1 = 17 bus cycles
  - Bandwidth = 16 bytes / 17 cycles ≈ 0.94 B/cycle
- 4-bank interleaved memory
  - Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  - Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
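The three memory organizations above differ only in how many DRAM accesses and data transfers can overlap. A small C sketch reproducing the miss-penalty and bandwidth arithmetic from the two slides (the parameter names are illustrative):

```c
#include <stdio.h>

/* Print miss penalty (bus cycles) and bandwidth for one organization. */
static void report(const char *org, int penalty, int block_bytes)
{
    printf("%-20s miss penalty = %2d cycles, bandwidth = %.2f B/cycle\n",
           org, penalty, (double)block_bytes / penalty);
}

int main(void)
{
    int words = 4, block_bytes = 16;   /* 4-word (16-byte) block      */
    int addr = 1, dram = 15, xfer = 1; /* cycles per address transfer,
                                          DRAM access, data transfer  */

    /* Serial accesses and transfers: 1 + 4*15 + 4*1 = 65 cycles */
    report("1-word-wide DRAM",  addr + words * dram + words * xfer, block_bytes);
    /* One wide access and transfer:  1 + 15 + 1 = 17 cycles     */
    report("4-word-wide memory", addr + dram + xfer, block_bytes);
    /* Overlapped accesses, serial transfers: 1 + 15 + 4 = 20    */
    report("4-bank interleaved", addr + dram + words * xfer, block_bytes);
    return 0;
}
```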
Advanced DRAM Organization
- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs

DRAM Generations

  Year  Capacity  $/GB
  1980  64Kbit    $1,500,000
  1983  256Kbit   $500,000
  1985  1Mbit     $200,000
  1989  4Mbit     $50,000
  1992  16Mbit    $15,000
  1996  64Mbit    $10,000
  1998  128Mbit   $4,000
  2000  256Mbit   $1,000
  2004  512Mbit   $250
  2007  1Gbit     $50

Measuring Cache Performance
- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:
  - Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
  - = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Loads and stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - Ideal CPU is 5.44/2 = 2.72 times faster

Average Access Time
- Hit time is also important for performance
- Average memory access time (AMAT)
  - AMAT = Hit time + Miss rate × Miss penalty
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - AMAT = 1 + 0.05 × 20 = 2ns
    - 2 cycles per memory access

Performance Summary
- When CPU performance increases
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

Associative Caches
- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines which set
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)

Associative Cache Example [Figure 5.13]

Spectrum of Associativity
- For a cache with 8 entries [Figure 5.14]

Associativity Example
- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8
- Direct mapped

  Block addr  Index  Hit/miss  Index 0  Index 1  Index 2  Index 3
  0           0      miss      Mem[0]
  8           0      miss      Mem[8]
  0           0      miss      Mem[0]
  6           2      miss      Mem[0]            Mem[6]
  8           0      miss      Mem[8]            Mem[6]

Associativity Example
- 2-way set associative

  Block addr  Index  Hit/miss  Set 0            Set 1
  0           0      miss      Mem[0]
  8           0      miss      Mem[0]  Mem[8]
  0           0      hit       Mem[0]  Mem[8]
  6           0      miss      Mem[0]  Mem[6]
  8           0      miss      Mem[8]  Mem[6]

- Fully associative

  Block addr  Hit/miss  Cache content after access
  0           miss      Mem[0]
  8           miss      Mem[0]  Mem[8]
  0           hit       Mem[0]  Mem[8]
  6           miss      Mem[0]  Mem[8]  Mem[6]
  8           hit       Mem[0]  Mem[8]  Mem[6]

How Much Associativity?
- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%
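To make the associativity example concrete, here is a hedged C sketch of a 4-block cache with configurable associativity and LRU replacement, replaying the access sequence 0, 8, 0, 6, 8 from the slides above (the data structures and helper names are illustrative, not from the text):

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4

struct entry { bool valid; int block; int last_used; };
static struct entry cache[NUM_BLOCKS];
static int now;   /* logical clock used to track recency for LRU */

/* Access one block; the cache is split into NUM_BLOCKS/assoc sets. */
static bool access_block(int block, int assoc)
{
    int sets = NUM_BLOCKS / assoc;
    struct entry *ways = &cache[(block % sets) * assoc];

    now++;
    for (int i = 0; i < assoc; i++)              /* search the set */
        if (ways[i].valid && ways[i].block == block) {
            ways[i].last_used = now;             /* hit: refresh recency */
            return true;
        }

    int victim = 0;                              /* miss: invalid or LRU way */
    for (int i = 0; i < assoc; i++) {
        if (!ways[i].valid) { victim = i; break; }
        if (ways[i].last_used < ways[victim].last_used) victim = i;
    }
    ways[victim] = (struct entry){ true, block, now };
    return false;
}

int main(void)
{
    int seq[] = { 0, 8, 0, 6, 8 };
    int assocs[] = { 1, 2, 4 };   /* direct mapped, 2-way, fully associative */

    for (int a = 0; a < 3; a++) {
        for (int i = 0; i < NUM_BLOCKS; i++) cache[i].valid = false;
        printf("%d-way:", assocs[a]);
        for (int i = 0; i < 5; i++)
            printf(" %s", access_block(seq[i], assocs[a]) ? "hit" : "miss");
        printf("\n");   /* matches the slides: 0, 1, and 2 hits */
    }
    return 0;
}
```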
Set Associative Cache Organization [Figure 5.17]

Replacement Policy
- Direct mapped: no choice
- Set associative
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity

Multilevel Caches
- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

Multilevel Cache Example
- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns / 0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns / 0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9/3.4 = 2.6

Multilevel Cache Considerations
- Primary cache
  - Focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache usually smaller than a single-level cache would be
  - L-1 block size smaller than L-2 block size

Interactions with Advanced CPUs
- Out-of-order CPUs can execute instructions during a cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyze
  - Use system simulation

Interactions with Software [Figure 5.18]
- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access

Virtual Memory
- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault

Address Translation
- Fixed-size pages (e.g., 4KB) [Figures 5.19, 5.20]

Page Fault Penalty
- On page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize page fault rate
  - Fully associative placement
  - Smart replacement algorithms
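As the address-translation slide above shows, a virtual address with fixed-size 4KB pages splits into a virtual page number and a page offset; translation replaces the page number and keeps the offset. A minimal sketch, assuming a flat page-table array (`page_table`, `PTE_VALID`, and the struct layout are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                   /* 4KB pages: 12 offset bits */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

#define PTE_VALID  0x1u                 /* page present in memory */

/* Hypothetical flat page table: one entry per virtual page, holding
   a physical page number plus status bits (referenced, dirty, ...). */
struct pte { uint32_t ppn; uint32_t flags; };
extern struct pte page_table[];         /* indexed by virtual page number */

/* Translate; returns false to signal a page fault (the OS would fetch
   the page from swap space, update the PTE, and retry). */
bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number      */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* unchanged by translation */

    struct pte e = page_table[vpn];
    if (!(e.flags & PTE_VALID))
        return false;                           /* page fault */

    *paddr = (e.ppn << PAGE_SHIFT) | offset;
    return true;
}
```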
Page Tables
- Stores placement information
  - Array of page table entries, indexed by virtual page number
  - Page table register in CPU points to page table in physical memory
- If page is present in memory
  - PTE stores the physical page number
  - Plus other status bits (referenced, dirty, …)
- If page is not present
  - PTE can refer to location in swap space on disk

Translation Using a Page Table [Figure 5.21]

Mapping Pages to Storage [Figure 5.22]

Replacement and Writes
- To reduce page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in PTE set to 1 on access to page
  - Periodically cleared to 0 by OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Block at once, not individual locations
  - Write-through is impractical
  - Use write-back
  - Dirty bit in PTE set when page is written

Fast Translation Using a TLB
- Address translation would appear to require extra memory references
  - One to access the PTE
  - Then the actual memory access
- But access to page tables has good locality
  - So use a fast cache of PTEs within the CPU
  - Called a Translation Look-aside Buffer (TLB)
  - Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
  - Misses could be handled by hardware or software

Fast Translation Using a TLB [Figure 5.23]

TLB Misses
- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction

TLB Miss Handler
- TLB miss indicates
  - Page present, but PTE not in TLB
  - Page not present
- Must recognize TLB miss before destination register overwritten
  - Raise exception
- Handler copies PTE from memory to TLB
  - Then restarts instruction
  - If page not present, page fault will occur
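The TLB slides above describe a small cache of PTEs consulted before the page table. A hedged sketch of that fast path, reusing the hypothetical `translate()` page-table walk from the earlier sketch as the miss handler (the TLB size, fully associative search, and round-robin refill are illustrative choices):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16                  /* typical TLBs hold 16-512 PTEs */

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned next_victim;            /* simple round-robin replacement */

bool translate(uint32_t vaddr, uint32_t *paddr);  /* page-table walk (earlier sketch) */

/* Translate with a TLB: a hit avoids the extra page-table memory reference. */
bool translate_with_tlb(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> 12;
    uint32_t offset = vaddr & 0xFFF;

    for (int i = 0; i < TLB_ENTRIES; i++)          /* fully associative search */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << 12) | offset;  /* TLB hit */
            return true;
        }

    /* TLB miss: walk the page table; a false return is a page fault
       that the OS must service before the instruction is retried. */
    if (!translate(vaddr, paddr))
        return false;

    tlb[next_victim] = (struct tlb_entry){ true, vpn, *paddr >> 12 };
    next_victim = (next_victim + 1) % TLB_ENTRIES; /* refill the TLB */
    return true;
}
```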
Page Fault Handler
- Use faulting virtual address to find PTE
- Locate page on disk
- Choose page to replace
  - If dirty, write to disk first
- Read page into memory and update page table
- Make process runnable again
  - Restart from faulting instruction

TLB and Cache Interaction [Figure 5.24]
- If cache tag uses physical address
  - Need to translate before cache lookup
- Alternative: use virtual address tag
  - Complications due to aliasing
    - Different virtual addresses for shared physical address

Memory Protection
- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)

The Memory Hierarchy: The BIG Picture
- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

Block Placement
- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time

Finding a Block
- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate

  Associativity          Location method                         Tag comparisons
  Direct mapped          Index                                   1
  n-way set associative  Set index, then search entries in set   n
  Fully associative      Search all entries                      #entries
                         Full lookup table                       0

Replacement
- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Random
    - Close to LRU, easier to implement
- Virtual memory
  - LRU approximation with hardware support

Write Policy
- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
  - (A write-back code sketch follows the trade-offs table below)
- Virtual memory
  - Only write-back is feasible, given disk write latency

Sources of Misses
- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size

Cache Design Trade-offs

  Design change           Effect on miss rate         Negative performance effect
  Increase cache size     Decrease capacity misses    May increase access time
  Increase associativity  Decrease conflict misses    May increase access time
  Increase block size     Decrease compulsory misses  Increases miss penalty. For very
                                                      large block size, may increase
                                                      miss rate due to pollution.
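A minimal sketch of the write-back policy from the Write Policy slide above: writes set a dirty bit, and the lower level is updated only when a dirty block is evicted (the structures and the memory-interface helpers are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS      1024
#define WORDS_PER_BLOCK 4

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data[WORDS_PER_BLOCK];
};
static struct line cache[NUM_BLOCKS];

/* Hypothetical lower-level (memory) interface. */
void write_block_to_memory(uint32_t tag, uint32_t index, const uint32_t *data);
void read_block_from_memory(uint32_t tag, uint32_t index, uint32_t *data);

/* Write hit: update the cache only and mark the block dirty. */
void write_word(uint32_t index, uint32_t word, uint32_t value)
{
    cache[index].data[word] = value;
    cache[index].dirty = true;          /* memory is now stale */
}

/* Replacement: the lower level is updated only for dirty victims. */
void replace_block(uint32_t index, uint32_t new_tag)
{
    struct line *l = &cache[index];
    if (l->valid && l->dirty)
        write_block_to_memory(l->tag, index, l->data);  /* write-back  */
    read_block_from_memory(new_tag, index, l->data);    /* fetch block */
    l->tag   = new_tag;
    l->valid = true;
    l->dirty = false;
}
```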
Virtual Machines
- Host computer emulates guest operating system and machine resources
  - Improved isolation of multiple guests
  - Avoids security and reliability problems
  - Aids sharing of resources
- Virtualization has some performance impact
  - Feasible with modern high-performance computers
- Examples
  - IBM VM/370 (1970s technology!)
  - VMware
  - Microsoft Virtual PC

Virtual Machine Monitor
- Maps virtual resources to physical resources
  - Memory, I/O devices, CPUs
- Guest code runs on native machine in user mode
  - Traps to VMM on privileged instructions and access to protected resources
- Guest OS may be different from host OS
- VMM handles real I/O devices
  - Emulates generic virtual I/O devices for guest

Example: Timer Virtualization
- In native machine, on timer interrupt
  - OS suspends current process, handles interrupt, selects and resumes next process
- With Virtual Machine Monitor
  - VMM suspends current VM, handles interrupt, selects and resumes next VM
- If a VM requires timer interrupts
  - VMM emulates a virtual timer
  - Emulates interrupt for VM when physical timer interrupt occurs

Instruction Set Support
- User and System modes
- Privileged instructions only available in system mode
  - Trap to system if executed in user mode
- All physical resources only accessible using privileged instructions
  - Including page tables, interrupt controls, I/O registers
- Renaissance of virtualization support
  - Current ISAs (e.g., x86) adapting

Cache Control
- Example cache characteristics
  - Direct-mapped, write-back, write allocate
  - Block size: 4 words (16 bytes)
  - Cache size: 16KB (1024 blocks)
  - 32-bit byte addresses
  - Valid bit and dirty bit per block
  - Blocking cache
    - CPU waits until access is complete

  Address bits:
  31 … 14        13 … 4           3 … 0
  Tag (18 bits)  Index (10 bits)  Offset (4 bits)

Interface Signals

  Signal      CPU ↔ Cache  Cache ↔ Memory
  Read/Write  1 bit        1 bit
  Valid       1 bit        1 bit
  Address     32 bits      32 bits
  Write Data  32 bits      128 bits
  Read Data   32 bits      128 bits
  Ready       1 bit        1 bit

  (memory side: multiple cycles per access)

Finite State Machines [Figure 5.33]
- Use an FSM to sequence control steps
- Set of states, transition on each clock edge
  - State values are binary encoded
  - Current state stored in a register
  - Next state = fn(current state, current inputs)
- Control output signals = fo(current state)

Cache Controller FSM [Figure 5.34]
- Could partition into separate states to reduce clock cycle time
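The cache-controller FSM in the figure has four states in the text: Idle, Compare Tag, Allocate, and Write-Back. A hedged C sketch of the next-state function in that style (the state names follow the figure; the input flags and exact transition conditions are simplified assumptions):

```c
#include <stdbool.h>

/* States of a simple blocking write-back cache controller. */
enum state { IDLE, COMPARE_TAG, ALLOCATE, WRITE_BACK };

struct inputs {
    bool cpu_request;   /* valid CPU read/write           */
    bool hit;           /* valid bit set and tag match    */
    bool victim_dirty;  /* block being replaced is dirty  */
    bool mem_ready;     /* memory finished current access */
};

/* Next state = fn(current state, current inputs); the control
   outputs would similarly be derived from the current state. */
enum state next_state(enum state s, struct inputs in)
{
    switch (s) {
    case IDLE:
        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)            return IDLE;       /* hit: access completes */
        return in.victim_dirty ? WRITE_BACK       /* miss, dirty victim    */
                               : ALLOCATE;        /* miss, clean victim    */
    case WRITE_BACK:                              /* old block to memory   */
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                                /* fetch new block       */
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;   /* unreachable */
}
```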
Cache Coherence Problem
- Suppose two CPU cores share a physical address space
  - Write-through caches

  Time step  Event                CPU A's cache  CPU B's cache  Memory
  0                                                             0
  1          CPU A reads X        0                             0
  2          CPU B reads X        0              0              0
  3          CPU A writes 1 to X  1              0              1

Coherence Defined
- Informally: reads return most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes) ⇒ read returns written value
  - P1 writes X; P2 reads X (sufficiently later) ⇒ read returns written value
    - cf. CPU B reading X after step 3 in the example
  - P1 writes X, P2 writes X ⇒ all processors see the writes in the same order
    - End up with the same final value for X

Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth for shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory

Invalidating Snooping Protocols
- Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
    - Owning cache supplies updated value

  CPU activity         Bus activity      CPU A's cache  CPU B's cache  Memory
                                                                       0
  CPU A reads X        Cache miss for X  0                             0
  CPU B reads X        Cache miss for X  0              0              0
  CPU A writes 1 to X  Invalidate for X  1                             0
  CPU B reads X        Cache miss for X  1              1              1

Memory Consistency
- When are writes seen by other processors?
  - "Seen" means a read returns the written value
  - Can't be instantaneous
- Assumptions
  - A write completes only when all processors have seen it
  - A processor does not reorder writes with other accesses
- Consequence
  - P writes X then writes Y ⇒ all processors that see new Y also see new X
  - Processors can reorder reads, but not writes

Multilevel On-Chip Caches [Figure 5.37]
- Intel Nehalem 4-core processor
- Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache

2-Level TLB Organization

                     Intel Nehalem                      AMD Opteron X4
  Virtual addr       48 bits                            48 bits
  Physical addr      44 bits                            48 bits
  Page size          4KB, 2/4MB                         4KB, 2/4MB
  L1 TLB (per core)  L1 I-TLB: 128 entries for small    L1 I-TLB: 48 entries
                     pages, 7 per thread (2×) for       L1 D-TLB: 48 entries
                     large pages                        Both fully associative,
                     L1 D-TLB: 64 entries for small     LRU replacement
                     pages, 32 for large pages
                     Both 4-way, LRU replacement
  L2 TLB (per core)  Single L2 TLB: 512 entries,        L2 I-TLB: 512 entries
                     4-way, LRU replacement             L2 D-TLB: 512 entries
                                                        Both 4-way, round-robin LRU
  TLB misses         Handled in hardware                Handled in hardware

3-Level Cache Organization

                     Intel Nehalem                      AMD Opteron X4
  L1 caches          L1 I-cache: 32KB, 64-byte blocks,  L1 I-cache: 32KB, 64-byte blocks,
  (per core)         4-way, approx LRU replacement,     2-way, LRU replacement,
                     hit time n/a                       hit time 3 cycles
                     L1 D-cache: 32KB, 64-byte blocks,  L1 D-cache: 32KB, 64-byte blocks,
                     8-way, approx LRU replacement,     2-way, LRU replacement,
                     write-back/allocate,               write-back/allocate,
                     hit time n/a                       hit time 9 cycles
  L2 unified cache   256KB, 64-byte blocks, 8-way,      512KB, 64-byte blocks, 16-way,
  (per core)         approx LRU replacement,            approx LRU replacement,
                     write-back/allocate,               write-back/allocate,
                     hit time n/a                       hit time n/a
  L3 unified cache   8MB, 64-byte blocks, 16-way,       2MB, 64-byte blocks, 32-way,
  (shared)           replacement n/a,                   replace block shared by fewest
                     write-back/allocate,               cores, write-back/allocate,
                     hit time n/a                       hit time 32 cycles

  n/a: data not available

Miss Penalty Reduction
- Return requested word first
  - Then back-fill rest of block
- Non-blocking miss processing
  - Hit under miss: allow hits to proceed
  - Miss under miss: allow multiple outstanding misses
- Hardware prefetch: instructions and data
- Opteron X4: bank-interleaved L1 D-cache
  - Two concurrent accesses per cycle

Pitfalls
- Byte vs. word addressing
  - Example: 32-byte direct-mapped cache, 4-byte blocks
    - Byte 36 maps to block 1
    - Word 36 maps to block 4
- Ignoring memory system effects when writing or generating code
  - Example: iterating over rows vs. columns of arrays
  - Large strides result in poor locality (see the sketch below)
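A short C illustration of the second pitfall: C stores 2-D arrays in row-major order, so the row-by-row loop walks memory sequentially (stride 1, good spatial locality), while the column-by-column loop jumps a whole row per access. The array size here is illustrative; on a real machine the row-major version is typically several times faster once the array exceeds the cache.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive accesses touch consecutive
   memory locations, so each fetched cache block is fully used. */
double sum_by_rows(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* stride of 1 element */
    return sum;
}

/* Column-major traversal of the same data: each access jumps
   N*sizeof(double) bytes, so nearly every access misses once
   the array is larger than the cache. */
double sum_by_cols(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];            /* stride of N elements */
    return sum;
}

int main(void)
{
    /* Same result, very different locality. */
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}
```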
Pitfalls
- In a multiprocessor with shared L2 or L3 cache
  - Less associativity than cores results in conflict misses
  - More cores ⇒ need to increase associativity
- Using AMAT to evaluate performance of out-of-order processors
  - Ignores effect of non-blocked accesses
  - Instead, evaluate performance by simulation

Pitfalls
- Extending address range using segments
  - e.g., Intel 80286
  - But a segment is not always big enough
  - Makes address arithmetic complicated
- Implementing a VMM on an ISA not designed for virtualization
  - e.g., non-privileged instructions accessing hardware resources
  - Either extend the ISA, or require guest OS not to use problematic instructions

Concluding Remarks
- Fast memories are small, large memories are slow
  - We really want fast, large memories :(
  - Caching gives this illusion :)
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
- Memory system design is critical for multiprocessors