Chapter 5
Large and Fast: Exploiting Memory Hierarchy

Memory Technology
- Static RAM (SRAM)
  - 0.5ns – 2.5ns, $2000 – $5000 per GB
- Dynamic RAM (DRAM)
  - 50ns – 70ns, $20 – $75 per GB
- Magnetic disk
  - 5ms – 20ms, $0.20 – $2 per GB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

Principle of Locality
- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data

Taking Advantage of Locality
- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

Memory Hierarchy Levels [Figure 5.2]
- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
    - Hit ratio: hits/accesses
- If accessed data is absent
  - Miss: block copied from lower level
    - Time taken: miss penalty
    - Miss ratio: misses/accesses = 1 – hit ratio
  - Then accessed data supplied from upper level

Cache Memory [Figure 5.4]
- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses X1, …, Xn–1, Xn
  - How do we know if the data is present?
  - Where do we look?

Direct Mapped Cache [Figure 5.5]
- Location determined by address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)
  - #Blocks is a power of 2
  - Use low-order address bits

Tags and Valid Bits
- How do we know which particular block is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0
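The two slides above describe everything a direct-mapped lookup needs: low-order address bits select an entry, and the stored tag plus a valid bit confirm the match. A minimal C sketch of that lookup (the `cache_line` struct, array size, and field widths are illustrative assumptions, not the hardware's actual layout):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8          /* must be a power of 2 */

struct cache_line {
    bool     valid;           /* 1 = entry holds a block */
    uint32_t tag;             /* high-order address bits */
    uint32_t data;            /* one word per block here */
};

static struct cache_line cache[NUM_BLOCKS];   /* valid bits start at 0 */

/* Direct-mapped lookup: index = (block address) mod (#blocks),
   which for a power-of-2 cache is just the low-order bits. */
bool lookup(uint32_t block_addr, uint32_t *word_out)
{
    uint32_t index = block_addr % NUM_BLOCKS;     /* low-order bits  */
    uint32_t tag   = block_addr / NUM_BLOCKS;     /* high-order bits */

    if (cache[index].valid && cache[index].tag == tag) {
        *word_out = cache[index].data;            /* hit */
        return true;
    }
    return false;   /* miss: block must be fetched from the next level */
}
```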
Cache Example
- 8 blocks, 1 word/block, direct mapped
- Initial state

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  26         11 010       Miss      010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Hit       110
  26         11 010       Hit       010

  (cache state unchanged)

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  11   Mem[11010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  18         10 010       Miss      010

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]   (replaces Mem[11010])
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Address Subdivision [Figure 5.7]

Example: Larger Block Size
- 64 blocks, 16 bytes/block
- To what block number does address 1200 map?
  - Block address = ⌊1200/16⌋ = 75
  - Block number = 75 modulo 64 = 11

  Address bits:
  31 … 10        9 … 4           3 … 0
  Tag (22 bits)  Index (6 bits)  Offset (4 bits)
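A quick C check of the worked example above, computing the block address and cache block number for byte address 1200 with 16-byte blocks and 64 cache blocks (the constant names are illustrative):

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 16   /* 16 bytes per block     */
#define NUM_BLOCKS  64   /* 64 blocks in the cache */

int main(void)
{
    uint32_t addr = 1200;

    uint32_t block_addr = addr / BLOCK_BYTES;      /* floor(1200/16) = 75 */
    uint32_t block_num  = block_addr % NUM_BLOCKS; /* 75 mod 64 = 11      */
    uint32_t offset     = addr % BLOCK_BYTES;      /* byte within block   */

    printf("block address = %u, cache block = %u, offset = %u\n",
           block_addr, block_num, offset);         /* prints 75, 11, 0 */
    return 0;
}
```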
Block Size Considerations
- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-size cache
  - Larger blocks ⇒ fewer of them
    - More competition ⇒ increased miss rate
  - Larger blocks ⇒ pollution
- Larger miss penalty
  - Can override benefit of reduced miss rate
  - Early restart and critical-word-first can help

Cache Misses
- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

Write-Through
- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1 × 100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

Write-Back
- Alternative: on data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

Write Allocation
- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block

Example: Intrinsity FastMATH
- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

Example: Intrinsity FastMATH [Figure 5.9]

Main Memory Supporting Caches
- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles ≈ 0.25 B/cycle

Increasing Memory Bandwidth [Figure 5.11]
- 4-word wide memory
  - Miss penalty = 1 + 15 + 1 = 17 bus cycles
  - Bandwidth = 16 bytes / 17 cycles ≈ 0.94 B/cycle
- 4-bank interleaved memory
  - Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  - Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
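The three memory organizations above differ only in how many DRAM accesses and data transfers can overlap. A small C sketch reproducing the miss-penalty and bandwidth arithmetic from the two slides (the parameter names are illustrative):

```c
#include <stdio.h>

/* Print miss penalty (bus cycles) and bandwidth for one organization. */
static void report(const char *org, int penalty, int block_bytes)
{
    printf("%-20s miss penalty = %2d cycles, bandwidth = %.2f B/cycle\n",
           org, penalty, (double)block_bytes / penalty);
}

int main(void)
{
    int words = 4, block_bytes = 16;   /* 4-word (16-byte) block      */
    int addr = 1, dram = 15, xfer = 1; /* cycles per address transfer,
                                          DRAM access, data transfer  */

    /* Serial accesses and transfers: 1 + 4*15 + 4*1 = 65 cycles */
    report("1-word-wide DRAM",  addr + words * dram + words * xfer, block_bytes);
    /* One wide access and transfer:  1 + 15 + 1 = 17 cycles     */
    report("4-word-wide memory", addr + dram + xfer, block_bytes);
    /* Overlapped accesses, serial transfers: 1 + 15 + 4 = 20    */
    report("4-bank interleaved", addr + dram + words * xfer, block_bytes);
    return 0;
}
```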
Advanced DRAM Organization
- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs

DRAM Generations

  Year  Capacity  $/GB
  1980  64Kbit    $1,500,000
  1983  256Kbit   $500,000
  1985  1Mbit     $200,000
  1989  4Mbit     $50,000
  1992  16Mbit    $15,000
  1996  64Mbit    $10,000
  1998  128Mbit   $4,000
  2000  256Mbit   $1,000
  2004  512Mbit   $250
  2007  1Gbit     $50

Measuring Cache Performance
- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:
  - Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
  - = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Loads and stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - Ideal CPU is 5.44/2 = 2.72 times faster

Average Access Time
- Hit time is also important for performance
- Average memory access time (AMAT)
  - AMAT = Hit time + Miss rate × Miss penalty
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - AMAT = 1 + 0.05 × 20 = 2ns
    - 2 cycles per memory access

Performance Summary
- When CPU performance increases
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

Associative Caches
- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines which set
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)

Associative Cache Example [Figure 5.13]

Spectrum of Associativity
- For a cache with 8 entries [Figure 5.14]

Associativity Example
- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8
- Direct mapped

  Block addr  Index  Hit/miss  Index 0  Index 1  Index 2  Index 3
  0           0      miss      Mem[0]
  8           0      miss      Mem[8]
  0           0      miss      Mem[0]
  6           2      miss      Mem[0]            Mem[6]
  8           0      miss      Mem[8]            Mem[6]

Associativity Example
- 2-way set associative

  Block addr  Index  Hit/miss  Set 0            Set 1
  0           0      miss      Mem[0]
  8           0      miss      Mem[0]  Mem[8]
  0           0      hit       Mem[0]  Mem[8]
  6           0      miss      Mem[0]  Mem[6]
  8           0      miss      Mem[8]  Mem[6]

- Fully associative

  Block addr  Hit/miss  Cache content after access
  0           miss      Mem[0]
  8           miss      Mem[0]  Mem[8]
  0           hit       Mem[0]  Mem[8]
  6           miss      Mem[0]  Mem[8]  Mem[6]
  8           hit       Mem[0]  Mem[8]  Mem[6]

How Much Associativity?
- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%
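To make the associativity example concrete, here is a hedged C sketch of a 4-block cache with configurable associativity and LRU replacement, replaying the access sequence 0, 8, 0, 6, 8 from the slides above (the data structures and helper names are illustrative, not from the text):

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4

struct entry { bool valid; int block; int last_used; };
static struct entry cache[NUM_BLOCKS];
static int now;   /* logical clock used to track recency for LRU */

/* Access one block; the cache is split into NUM_BLOCKS/assoc sets. */
static bool access_block(int block, int assoc)
{
    int sets = NUM_BLOCKS / assoc;
    struct entry *ways = &cache[(block % sets) * assoc];

    now++;
    for (int i = 0; i < assoc; i++)              /* search the set */
        if (ways[i].valid && ways[i].block == block) {
            ways[i].last_used = now;             /* hit: refresh recency */
            return true;
        }

    int victim = 0;                              /* miss: invalid or LRU way */
    for (int i = 0; i < assoc; i++) {
        if (!ways[i].valid) { victim = i; break; }
        if (ways[i].last_used < ways[victim].last_used) victim = i;
    }
    ways[victim] = (struct entry){ true, block, now };
    return false;
}

int main(void)
{
    int seq[] = { 0, 8, 0, 6, 8 };
    int assocs[] = { 1, 2, 4 };   /* direct mapped, 2-way, fully associative */

    for (int a = 0; a < 3; a++) {
        for (int i = 0; i < NUM_BLOCKS; i++) cache[i].valid = false;
        printf("%d-way:", assocs[a]);
        for (int i = 0; i < 5; i++)
            printf(" %s", access_block(seq[i], assocs[a]) ? "hit" : "miss");
        printf("\n");   /* matches the slides: 0, 1, and 2 hits */
    }
    return 0;
}
```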
Set Associative Cache Organization [Figure 5.17]

Replacement Policy
- Direct mapped: no choice
- Set associative
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity

Multilevel Caches
- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

Multilevel Cache Example
- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns / 0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns / 0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9/3.4 = 2.6

Multilevel Cache Considerations
- Primary cache
  - Focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache usually smaller than a single-level cache would be
  - L-1 block size smaller than L-2 block size

Interactions with Advanced CPUs
- Out-of-order CPUs can execute instructions during a cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyze
  - Use system simulation

Interactions with Software [Figure 5.18]
- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access

Virtual Memory
- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault

Address Translation
- Fixed-size pages (e.g., 4KB) [Figures 5.19, 5.20]

Page Fault Penalty
- On page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize page fault rate
  - Fully associative placement
  - Smart replacement algorithms
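As the address-translation slide above shows, a virtual address with fixed-size 4KB pages splits into a virtual page number and a page offset; translation replaces the page number and keeps the offset. A minimal sketch, assuming a flat page-table array (`page_table`, `PTE_VALID`, and the struct layout are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                   /* 4KB pages: 12 offset bits */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

#define PTE_VALID  0x1u                 /* page present in memory */

/* Hypothetical flat page table: one entry per virtual page, holding
   a physical page number plus status bits (referenced, dirty, ...). */
struct pte { uint32_t ppn; uint32_t flags; };
extern struct pte page_table[];         /* indexed by virtual page number */

/* Translate; returns false to signal a page fault (the OS would fetch
   the page from swap space, update the PTE, and retry). */
bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number      */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* unchanged by translation */

    struct pte e = page_table[vpn];
    if (!(e.flags & PTE_VALID))
        return false;                           /* page fault */

    *paddr = (e.ppn << PAGE_SHIFT) | offset;
    return true;
}
```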
Page Tables
- Stores placement information
  - Array of page table entries, indexed by virtual page number
  - Page table register in CPU points to page table in physical memory
- If page is present in memory
  - PTE stores the physical page number
  - Plus other status bits (referenced, dirty, …)
- If page is not present
  - PTE can refer to location in swap space on disk

Translation Using a Page Table [Figure 5.21]

Mapping Pages to Storage [Figure 5.22]

Replacement and Writes
- To reduce page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in PTE set to 1 on access to page
  - Periodically cleared to 0 by OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Block at once, not individual locations
  - Write-through is impractical
  - Use write-back
  - Dirty bit in PTE set when page is written

Fast Translation Using a TLB
- Address translation would appear to require extra memory references
  - One to access the PTE
  - Then the actual memory access
- But access to page tables has good locality
  - So use a fast cache of PTEs within the CPU
  - Called a Translation Look-aside Buffer (TLB)
  - Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
  - Misses could be handled by hardware or software

Fast Translation Using a TLB [Figure 5.23]

TLB Misses
- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction

TLB Miss Handler
- TLB miss indicates
  - Page present, but PTE not in TLB
  - Page not present
- Must recognize TLB miss before destination register overwritten
  - Raise exception
- Handler copies PTE from memory to TLB
  - Then restarts instruction
  - If page not present, page fault will occur
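The TLB slides above describe a small cache of PTEs consulted before the page table. A hedged sketch of that fast path, reusing the hypothetical `translate()` page-table walk from the earlier sketch as the miss handler (the TLB size, fully associative search, and round-robin refill are illustrative choices):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16                  /* typical TLBs hold 16-512 PTEs */

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned next_victim;            /* simple round-robin replacement */

bool translate(uint32_t vaddr, uint32_t *paddr);  /* page-table walk (earlier sketch) */

/* Translate with a TLB: a hit avoids the extra page-table memory reference. */
bool translate_with_tlb(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> 12;
    uint32_t offset = vaddr & 0xFFF;

    for (int i = 0; i < TLB_ENTRIES; i++)          /* fully associative search */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << 12) | offset;  /* TLB hit */
            return true;
        }

    /* TLB miss: walk the page table; a false return is a page fault
       that the OS must service before the instruction is retried. */
    if (!translate(vaddr, paddr))
        return false;

    tlb[next_victim] = (struct tlb_entry){ true, vpn, *paddr >> 12 };
    next_victim = (next_victim + 1) % TLB_ENTRIES; /* refill the TLB */
    return true;
}
```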
Page Fault Handler
- Use faulting virtual address to find PTE
- Locate page on disk
- Choose page to replace
  - If dirty, write to disk first
- Read page into memory and update page table
- Make process runnable again
  - Restart from faulting instruction

TLB and Cache Interaction [Figure 5.24]
- If cache tag uses physical address
  - Need to translate before cache lookup
- Alternative: use virtual address tag
  - Complications due to aliasing
    - Different virtual addresses for shared physical address

Memory Protection
- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)

The Memory Hierarchy: The BIG Picture
- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

Block Placement
- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time

Finding a Block
- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate

  Associativity          Location method                         Tag comparisons
  Direct mapped          Index                                   1
  n-way set associative  Set index, then search entries in set   n
  Fully associative      Search all entries                      #entries
                         Full lookup table                       0

Replacement
- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Random
    - Close to LRU, easier to implement
- Virtual memory
  - LRU approximation with hardware support

Write Policy
- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
  - (A write-back code sketch follows the trade-offs table below)
- Virtual memory
  - Only write-back is feasible, given disk write latency

Sources of Misses
- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size

Cache Design Trade-offs

  Design change           Effect on miss rate         Negative performance effect
  Increase cache size     Decrease capacity misses    May increase access time
  Increase associativity  Decrease conflict misses    May increase access time
  Increase block size     Decrease compulsory misses  Increases miss penalty. For very
                                                      large block size, may increase
                                                      miss rate due to pollution.
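A minimal sketch of the write-back policy from the Write Policy slide above: writes set a dirty bit, and the lower level is updated only when a dirty block is evicted (the structures and the memory-interface helpers are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS      1024
#define WORDS_PER_BLOCK 4

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data[WORDS_PER_BLOCK];
};
static struct line cache[NUM_BLOCKS];

/* Hypothetical lower-level (memory) interface. */
void write_block_to_memory(uint32_t tag, uint32_t index, const uint32_t *data);
void read_block_from_memory(uint32_t tag, uint32_t index, uint32_t *data);

/* Write hit: update the cache only and mark the block dirty. */
void write_word(uint32_t index, uint32_t word, uint32_t value)
{
    cache[index].data[word] = value;
    cache[index].dirty = true;          /* memory is now stale */
}

/* Replacement: the lower level is updated only for dirty victims. */
void replace_block(uint32_t index, uint32_t new_tag)
{
    struct line *l = &cache[index];
    if (l->valid && l->dirty)
        write_block_to_memory(l->tag, index, l->data);  /* write-back  */
    read_block_from_memory(new_tag, index, l->data);    /* fetch block */
    l->tag   = new_tag;
    l->valid = true;
    l->dirty = false;
}
```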
Virtual Machines
- Host computer emulates guest operating system and machine resources
  - Improved isolation of multiple guests
  - Avoids security and reliability problems
  - Aids sharing of resources
- Virtualization has some performance impact
  - Feasible with modern high-performance computers
- Examples
  - IBM VM/370 (1970s technology!)
  - VMware
  - Microsoft Virtual PC

Virtual Machine Monitor
- Maps virtual resources to physical resources
  - Memory, I/O devices, CPUs
- Guest code runs on native machine in user mode
  - Traps to VMM on privileged instructions and access to protected resources
- Guest OS may be different from host OS
- VMM handles real I/O devices
  - Emulates generic virtual I/O devices for guest

Example: Timer Virtualization
- In native machine, on timer interrupt
  - OS suspends current process, handles interrupt, selects and resumes next process
- With Virtual Machine Monitor
  - VMM suspends current VM, handles interrupt, selects and resumes next VM
- If a VM requires timer interrupts
  - VMM emulates a virtual timer
  - Emulates interrupt for VM when physical timer interrupt occurs

Instruction Set Support
- User and System modes
- Privileged instructions only available in system mode
  - Trap to system if executed in user mode
- All physical resources only accessible using privileged instructions
  - Including page tables, interrupt controls, I/O registers
- Renaissance of virtualization support
  - Current ISAs (e.g., x86) adapting

Cache Control
- Example cache characteristics
  - Direct-mapped, write-back, write allocate
  - Block size: 4 words (16 bytes)
  - Cache size: 16KB (1024 blocks)
  - 32-bit byte addresses
  - Valid bit and dirty bit per block
  - Blocking cache
    - CPU waits until access is complete

  Address bits:
  31 … 14        13 … 4           3 … 0
  Tag (18 bits)  Index (10 bits)  Offset (4 bits)

Interface Signals

  Signal      CPU ↔ Cache  Cache ↔ Memory
  Read/Write  1 bit        1 bit
  Valid       1 bit        1 bit
  Address     32 bits      32 bits
  Write Data  32 bits      128 bits
  Read Data   32 bits      128 bits
  Ready       1 bit        1 bit

  (memory side: multiple cycles per access)

Finite State Machines [Figure 5.33]
- Use an FSM to sequence control steps
- Set of states, transition on each clock edge
  - State values are binary encoded
  - Current state stored in a register
  - Next state = fn(current state, current inputs)
- Control output signals = fo(current state)

Cache Controller FSM [Figure 5.34]
- Could partition into separate states to reduce clock cycle time
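The cache-controller FSM in the figure has four states in the text: Idle, Compare Tag, Allocate, and Write-Back. A hedged C sketch of the next-state function in that style (the state names follow the figure; the input flags and exact transition conditions are simplified assumptions):

```c
#include <stdbool.h>

/* States of a simple blocking write-back cache controller. */
enum state { IDLE, COMPARE_TAG, ALLOCATE, WRITE_BACK };

struct inputs {
    bool cpu_request;   /* valid CPU read/write           */
    bool hit;           /* valid bit set and tag match    */
    bool victim_dirty;  /* block being replaced is dirty  */
    bool mem_ready;     /* memory finished current access */
};

/* Next state = fn(current state, current inputs); the control
   outputs would similarly be derived from the current state. */
enum state next_state(enum state s, struct inputs in)
{
    switch (s) {
    case IDLE:
        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)            return IDLE;       /* hit: access completes */
        return in.victim_dirty ? WRITE_BACK       /* miss, dirty victim    */
                               : ALLOCATE;        /* miss, clean victim    */
    case WRITE_BACK:                              /* old block to memory   */
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                                /* fetch new block       */
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;   /* unreachable */
}
```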
Cache Coherence Problem
- Suppose two CPU cores share a physical address space
  - Write-through caches

  Time step  Event                CPU A's cache  CPU B's cache  Memory
  0                                                             0
  1          CPU A reads X        0                             0
  2          CPU B reads X        0              0              0
  3          CPU A writes 1 to X  1              0              1

Coherence Defined
- Informally: reads return most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes) ⇒ read returns written value
  - P1 writes X; P2 reads X (sufficiently later) ⇒ read returns written value
    - cf. CPU B reading X after step 3 in the example
  - P1 writes X, P2 writes X ⇒ all processors see the writes in the same order
    - End up with the same final value for X

Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth for shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory

Invalidating Snooping Protocols
- Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
    - Owning cache supplies updated value

  CPU activity         Bus activity      CPU A's cache  CPU B's cache  Memory
                                                                       0
  CPU A reads X        Cache miss for X  0                             0
  CPU B reads X        Cache miss for X  0              0              0
  CPU A writes 1 to X  Invalidate for X  1                             0
  CPU B reads X        Cache miss for X  1              1              1

Memory Consistency
- When are writes seen by other processors?
  - "Seen" means a read returns the written value
  - Can't be instantaneous
- Assumptions
  - A write completes only when all processors have seen it
  - A processor does not reorder writes with other accesses
- Consequence
  - P writes X then writes Y ⇒ all processors that see new Y also see new X
  - Processors can reorder reads, but not writes

Multilevel On-Chip Caches [Figure 5.37]
- Intel Nehalem 4-core processor
- Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache

2-Level TLB Organization

                     Intel Nehalem                      AMD Opteron X4
  Virtual addr       48 bits                            48 bits
  Physical addr      44 bits                            48 bits
  Page size          4KB, 2/4MB                         4KB, 2/4MB
  L1 TLB (per core)  L1 I-TLB: 128 entries for small    L1 I-TLB: 48 entries
                     pages, 7 per thread (2×) for       L1 D-TLB: 48 entries
                     large pages                        Both fully associative,
                     L1 D-TLB: 64 entries for small     LRU replacement
                     pages, 32 for large pages
                     Both 4-way, LRU replacement
  L2 TLB (per core)  Single L2 TLB: 512 entries,        L2 I-TLB: 512 entries
                     4-way, LRU replacement             L2 D-TLB: 512 entries
                                                        Both 4-way, round-robin LRU
  TLB misses         Handled in hardware                Handled in hardware

3-Level Cache Organization

                     Intel Nehalem                      AMD Opteron X4
  L1 caches          L1 I-cache: 32KB, 64-byte blocks,  L1 I-cache: 32KB, 64-byte blocks,
  (per core)         4-way, approx LRU replacement,     2-way, LRU replacement,
                     hit time n/a                       hit time 3 cycles
                     L1 D-cache: 32KB, 64-byte blocks,  L1 D-cache: 32KB, 64-byte blocks,
                     8-way, approx LRU replacement,     2-way, LRU replacement,
                     write-back/allocate,               write-back/allocate,
                     hit time n/a                       hit time 9 cycles
  L2 unified cache   256KB, 64-byte blocks, 8-way,      512KB, 64-byte blocks, 16-way,
  (per core)         approx LRU replacement,            approx LRU replacement,
                     write-back/allocate,               write-back/allocate,
                     hit time n/a                       hit time n/a
  L3 unified cache   8MB, 64-byte blocks, 16-way,       2MB, 64-byte blocks, 32-way,
  (shared)           replacement n/a,                   replace block shared by fewest
                     write-back/allocate,               cores, write-back/allocate,
                     hit time n/a                       hit time 32 cycles

  n/a: data not available

Miss Penalty Reduction
- Return requested word first
  - Then back-fill rest of block
- Non-blocking miss processing
  - Hit under miss: allow hits to proceed
  - Miss under miss: allow multiple outstanding misses
- Hardware prefetch: instructions and data
- Opteron X4: bank-interleaved L1 D-cache
  - Two concurrent accesses per cycle

Pitfalls
- Byte vs. word addressing
  - Example: 32-byte direct-mapped cache, 4-byte blocks
    - Byte 36 maps to block 1
    - Word 36 maps to block 4
- Ignoring memory system effects when writing or generating code
  - Example: iterating over rows vs. columns of arrays
  - Large strides result in poor locality (see the sketch below)
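A short C illustration of the second pitfall: C stores 2-D arrays in row-major order, so the row-by-row loop walks memory sequentially (stride 1, good spatial locality), while the column-by-column loop jumps a whole row per access. The array size here is illustrative; on a real machine the row-major version is typically several times faster once the array exceeds the cache.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive accesses touch consecutive
   memory locations, so each fetched cache block is fully used. */
double sum_by_rows(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* stride of 1 element */
    return sum;
}

/* Column-major traversal of the same data: each access jumps
   N*sizeof(double) bytes, so nearly every access misses once
   the array is larger than the cache. */
double sum_by_cols(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];            /* stride of N elements */
    return sum;
}

int main(void)
{
    /* Same result, very different locality. */
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}
```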
Pitfalls
- In a multiprocessor with shared L2 or L3 cache
  - Less associativity than cores results in conflict misses
  - More cores ⇒ need to increase associativity
- Using AMAT to evaluate performance of out-of-order processors
  - Ignores effect of non-blocked accesses
  - Instead, evaluate performance by simulation

Pitfalls
- Extending address range using segments
  - e.g., Intel 80286
  - But a segment is not always big enough
  - Makes address arithmetic complicated
- Implementing a VMM on an ISA not designed for virtualization
  - e.g., non-privileged instructions accessing hardware resources
  - Either extend the ISA, or require guest OS not to use problematic instructions

Concluding Remarks
- Fast memories are small, large memories are slow
  - We really want fast, large memories :(
  - Caching gives this illusion :)
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
- Memory system design is critical for multiprocessors