Chapter 4: The Processor

Introduction
- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and cycle time
    - Determined by CPU hardware
- We will examine two MIPS implementations
  - A simplified version
  - A more realistic pipelined version
- Simple subset, shows most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j

Instruction Execution
- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch target address
  - Access data memory for load/store
  - PC ← target address or PC + 4

CPU Overview
[Figure f04-01]

Multiplexers
[Figure f04-01]
- Can't just join wires together
  - Use multiplexers

Control
[Figure f04-02]

Logic Design Basics
- Information encoded in binary
  - Low voltage = 0, high voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational elements
  - Operate on data
  - Output is a function of input
- State (sequential) elements
  - Store information

Combinational Elements
- AND gate
  - Y = A & B
- Multiplexer
  - Y = S ? I1 : I0
- Adder
  - Y = A + B
- Arithmetic/Logic Unit
  - Y = F(A, B)

Sequential Elements
- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes from 0 to 1

Sequential Elements
- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later

Clocking Methodology
[Figures f04-03, f04-04]
- Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state element
  - Longest delay determines clock period

Building a Datapath
- Datapath
  - Elements that process data and addresses in the CPU
    - Registers, ALUs, muxes, memories, ...
- We will build a MIPS datapath incrementally
  - Refining the overview design

Instruction Fetch
[Figure f04-06]
- PC is a 32-bit register
- Adder increments the PC by 4 for the next instruction

R-Format Instructions
[Figure f04-07]
- Read two register operands
- Perform arithmetic/logical operation
- Write register result

Load/Store Instructions
[Figure f04-08]
- Read register operands
- Calculate address using 16-bit offset
  - Use ALU, but sign-extend offset
- Load: read memory and update register
- Store: write register value to memory

Branch Instructions
- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch

Branch Instructions
[Figure f04-09]
- Shift left 2: just re-routes wires
- Sign-extend: sign-bit wire replicated

Composing the Elements
- First-cut datapath does an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions

R-Type/Load/Store Datapath
[Figure f04-10]

Full Datapath
[Figure f04-11]

ALU Control
- ALU used for
  - Load/Store: F = add
  - Branch: F = subtract
  - R-type: F depends on funct field

    ALU control   Function
    0000          AND
    0001          OR
    0010          add
    0110          subtract
    0111          set-on-less-than
    1100          NOR

ALU Control
- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

    opcode   ALUOp   Operation          funct    ALU function       ALU control
    lw       00      load word          XXXXXX   add                0010
    sw       00      store word         XXXXXX   add                0010
    beq      01      branch equal       XXXXXX   subtract           0110
    R-type   10      add                100000   add                0010
                     subtract           100010   subtract           0110
                     AND                100100   AND                0000
                     OR                 100101   OR                 0001
                     set-on-less-than   101010   set-on-less-than   0111

The Main Control Unit
- Control signals derived from instruction

    R-type       0 [31:26]          rs [25:21]   rt [20:16]   rd [15:11]   shamt [10:6]   funct [5:0]
    Load/Store   35 or 43 [31:26]   rs [25:21]   rt [20:16]   address [15:0]
    Branch       4 [31:26]          rs [25:21]   rt [20:16]   address [15:0]

- Bits 31:26: opcode
- rs: always read
- rt: read, except for load
- rd (R-type) or rt (load): write for R-type and load
- address: sign-extend and add

Datapath With Control
[Figure f04-17]

R-Type Instruction
[Figure f04-19]

Load Instruction
[Figure f04-20]

Branch-on-Equal Instruction
[Figure f04-21]

Implementing Jumps

    Jump   2 [31:26]   address [25:0]

- Jump uses word address
- Update PC with concatenation of
  - Top 4 bits of old PC
  - 26-bit jump address
  - 00
- Need an extra control signal decoded from opcode

Datapath With Jumps Added
[Figure f04-24]

Performance Issues
- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining

Pipelining Analogy
[Figure f04-25]
- Pipelined laundry: overlapping execution
  - Parallelism improves performance
- Four loads:
  - Speedup = 8/3.5 = 2.3
- Non-stop:
  - Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages

MIPS Pipeline
- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

    Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
    lw         200ps         100ps           200ps    200ps           100ps            800ps
    sw         200ps         100ps           200ps    200ps                            700ps
    R-format   200ps         100ps           200ps                    100ps            600ps
    beq        200ps         100ps           200ps                                     500ps

Pipeline Performance
[Figure f04-27]
- Single-cycle (Tc = 800ps)
- Pipelined (Tc = 200ps)

Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

Pipelining and ISA Design
- MIPS ISA designed for pipelining
  - All instructions are 32 bits
    - Easier to fetch and decode in one cycle
    - cf. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3rd stage, access memory in 4th stage
  - Alignment of memory operands
    - Memory access takes only one cycle

Hazards
- Situations that prevent starting the next instruction in the next cycle
- Structural hazard
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

Structural Hazards
- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

Data Hazards
[Figure data-hazard-bubble-no-forwarding]
- An instruction depends on completion of data access by a previous instruction

    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Forwarding (aka Bypassing)
[Figure f04-29]
- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath

Load-Use Data Hazard
[Figure f04-30]
- Can't always avoid stalls by forwarding
  - If value not computed when needed
  - Can't forward backward in time!
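The forwarding and load-use rules above can be sketched as a small cycle counter. This is a simplified model, not the hardware: it assumes full forwarding, a five-stage pipeline, and that the only stall is the one-cycle bubble when an instruction reads the result of the lw immediately before it. The names `Instr` and `count_cycles` are illustrative, and the lw/sub pair is a made-up example in the style of the add/sub pair on the Data Hazards slide.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic, e.g. "lw" or "add"
    dest: Optional[str]      # register written, or None (e.g. sw, beq)
    srcs: Tuple[str, ...]    # registers read


def count_cycles(program: Sequence[Instr], stages: int = 5) -> int:
    """Cycles to run `program` on an idealized pipeline with full forwarding.

    Assumption: the only stall modelled is the one-cycle load-use bubble,
    i.e. an lw immediately followed by an instruction that reads its result.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.srcs:
            stalls += 1
    # The first instruction needs `stages` cycles; each later one adds one.
    return stages + (len(program) - 1) + stalls if program else 0


# ALU result forwarded from EX to EX: no stall.
alu_pair = [
    Instr("add", "$s0", ("$t0", "$t1")),
    Instr("sub", "$t2", ("$s0", "$t3")),
]

# Hypothetical load-use pair: the loaded value is not ready in time.
load_pair = [
    Instr("lw", "$s0", ("$t1",)),
    Instr("sub", "$t2", ("$s0", "$t3")),
]

print(count_cycles(alu_pair))   # 6
print(count_cycles(load_pair))  # 7
```

The one extra cycle for the load pair is the bubble: the value only exists after MEM, so it cannot reach the EX stage of the very next instruction.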
Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

  Original order (13 cycles):

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

  Reordered (11 cycles):

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

Stall on Branch
[Figure f04-31]
- Wait until branch outcome determined before fetching next instruction

Branch Prediction
- Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Figure f04-32]
- Top: prediction correct
- Bottom: prediction incorrect

More-Realistic Branch Prediction
- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

Pipeline Summary (The BIG Picture)
- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structural, data, control
- Instruction set design affects complexity of pipeline implementation

MIPS Pipelined Datapath
[Figure f04-33]
- Right-to-left flow (labelled MEM and WB in the figure) leads to hazards

Pipeline Registers
[Figure f04-35]
- Need registers between stages
  - To hold information produced in previous cycle

Pipeline Operation
- Cycle-by-cycle flow of instructions through the pipelined datapath
  - "Single-clock-cycle" pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - cf. "multi-clock-cycle" diagram
    - Graph of operation over time
- We'll look at "single-clock-cycle" diagrams for load & store

IF for Load, Store, ...
[Figure f04-36, IF]

ID for Load, Store, ...
[Figure f04-36, ID]

EX for Load
[Figure f04-37]

MEM for Load
[Figure f04-38, MEM]

WB for Load
[Figure f04-38, WB]
- Wrong register number

Corrected Datapath for Load
[Figure f04-41]

EX for Store
[Figure f04-39]

MEM for Store
[Figure f04-40, MEM]

WB for Store
[Figure f04-40, WB]

Multi-Cycle Pipeline Diagram
[Figure f04-43]
- Form showing resource usage

Multi-Cycle Pipeline Diagram
[Figure f04-44]
- Traditional form

Single-Cycle Pipeline Diagram
[Figure f04-45]
- State of pipeline in a given cycle

Pipelined Control (Simplified)
[Figure f04-46]

Pipelined Control
[Figure f04-50]
- Control signals derived from instruction
  - As in single-cycle implementation

Pipelined Control
[Figure f04-51]

Data Hazards in ALU Instructions
- Consider this sequence:

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

- We can resolve hazards with forwarding
  - How do we detect when to forward?

Dependencies & Forwarding
[Figure f04-52]

Detecting the Need to Forward
- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  - 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  - 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
    (1a, 1b: forward from EX/MEM pipeline register)
  - 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  - 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
    (2a, 2b: forward from MEM/WB pipeline register)

Detecting the Need to Forward
- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths
[Figure f04-54, bottom]

Forwarding Conditions
- EX hazard

    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10

- MEM hazard

    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Double Data Hazard
- Consider the sequence:

    add $1, $1, $2
    add $1, $1, $3
    add $1, $1, $4

- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only forward if EX hazard condition isn't true

Revised Forwarding Condition
- MEM hazard

    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Datapath with Forwarding
[Figure f04-56]

Load-Use Data Hazard
[Figure f04-58]
- Need to stall for one cycle

Load-Use Hazard Detection
- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when

    ID/EX.MemRead and
      ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
       (ID/EX.RegisterRt = IF/ID.RegisterRt))

- If detected, stall and insert bubble

How to Stall the Pipeline
- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
    - Can subsequently forward to EX stage

Stall/Bubble in the Pipeline
[Figure f04-59]
- Stall inserted here

Stall/Bubble in the Pipeline
[Figure f04-59, what really happens]
- Or, more accurately...

Datapath with Hazard Detection
[Figure f04-60]

Stalls and Performance (The BIG Picture)
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

Branch Hazards
[Figure f04-61]
- If branch outcome determined in MEM
  - Flush these instructions (set control values to 0)
  - Figure callout: PC

Reducing Branch Delay
- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken

    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
        ...
    72: lw  $4, 50($7)

Example: Branch Taken
[Figure f04-62, top]

Example: Branch Taken
[Figure f04-62, bottom]

Data Hazards for Branches
- If a comparison register is a destination of 2nd or 3rd preceding ALU instruction

    add $1, $2, $3         IF  ID  EX  MEM WB
    add $4, $5, $6             IF  ID  EX  MEM WB
    ...                            IF  ID  EX  MEM WB
    beq $1, $4, target                 IF  ID  EX  MEM WB

- Can resolve using forwarding

Data Hazards for Branches
- If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

    lw  $1, addr           IF  ID  EX  MEM WB
    add $4, $5, $6             IF  ID  EX  MEM WB
    beq $1, $4, target             IF  ID  ID  EX  MEM WB    (beq stalled for one cycle)

Data Hazards for Branches
- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

    lw  $1, addr           IF  ID  EX  MEM WB
    beq $1, $0, target         IF  ID  ID  ID  EX  MEM WB    (beq stalled for two cycles)

Dynamic Branch Prediction
- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
- Inner loop branches mispredicted twice!

    outer: ...
           ...
    inner: ...
           ...
           beq ..., ..., inner
           ...
           beq ..., ..., outer

  - Mispredict as taken on last iteration of inner loop
  - Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor
[Figure f04-63]
- Only change prediction on two successive mispredictions

Calculating the Branch Target
- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately

Exceptions and Interrupts
- "Unexpected" events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, ...
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

Handling Exceptions
- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We'll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 0180

An Alternate Mechanism
- Vectored interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - ...: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

Handler Actions
- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - Use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, ...

Exceptions in a Pipeline
- Another form of control hazard
- Consider overflow on add in EX stage

    add $1, $2, $1

  - Prevent $1 from being clobbered
  - Complete previous instructions
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

Pipeline with Exceptions
[Figure f04-66]

Exception Properties
- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

Exception Example
- Exception on add in

    40  sub $11, $2, $4
    44  and $12, $2, $5
    48  or  $13, $2, $6
    4C  add $1, $2, $1
    50  slt $15, $6, $7
    54  lw  $16, 50($7)
    ...

- Handler

    80000180  sw $25, 1000($0)
    80000184  sw $26, 1004($0)
    ...

Exception Example
[Figure f04-67, top]

Exception Example
[Figure f04-67, bottom]

Multiple Exceptions
- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - "Precise" exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!
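The exception steps above (pick the earliest instruction, save PC + 4 in EPC, record the cause, flush, jump to the handler) can be sketched in a few lines. This is a behavioral sketch under stated assumptions, not the pipeline hardware: the 1-bit Cause encoding and the handler address 8000 0180 come from the slides, while `PendingException`, `Cp0` and `take_exception` are hypothetical names, and the undefined-opcode exception in ID is an invented second event used only to show the "earliest instruction" rule.

```python
from dataclasses import dataclass
from typing import List, Tuple

HANDLER_ADDRESS = 0x8000_0180      # single handler entry point from the slides
CAUSE_UNDEFINED_OPCODE = 0         # 1-bit Cause encoding assumed in the slides
CAUSE_OVERFLOW = 1
STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # a later stage holds an earlier instruction


@dataclass
class PendingException:
    pc: int        # address of the instruction that raised the exception
    stage: str     # pipeline stage the instruction is in
    cause: int


@dataclass
class Cp0:
    epc: int = 0
    cause: int = 0


def take_exception(pending: List[PendingException], cp0: Cp0) -> Tuple[int, List[str]]:
    """Handle the exception from the earliest instruction (precise exceptions).

    Returns the next PC and the stages whose instructions are flushed.
    """
    # The instruction furthest along the pipeline is earliest in program order.
    earliest = max(pending, key=lambda e: STAGES.index(e.stage))
    cp0.epc = earliest.pc + 4      # PC + 4 is saved; the handler must adjust
    cp0.cause = earliest.cause
    # Flush the offending instruction and everything behind it.
    flushed = STAGES[: STAGES.index(earliest.stage) + 1]
    return HANDLER_ADDRESS, flushed


cp0 = Cp0()
next_pc, flushed = take_exception(
    [
        PendingException(pc=0x50, stage="ID", cause=CAUSE_UNDEFINED_OPCODE),  # hypothetical
        PendingException(pc=0x4C, stage="EX", cause=CAUSE_OVERFLOW),          # add $1, $2, $1
    ],
    cp0,
)
print(hex(next_pc), hex(cp0.epc), cp0.cause, flushed)
```

The handler recovers the address of the offending add as EPC - 4 = 4C; the later exception in ID is discarded because its instruction is flushed.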
Imprecise Exceptions
- Just stop pipeline and save state
  - Including exception cause(s)
- Let the handler work out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    - May require "manual" completion
- Simplifies hardware, but more complex handler software
- Not feasible for complex multiple-issue out-of-order pipelines

Instruction-Level Parallelism (ILP)
- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage ⇒ shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages ⇒ multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice

Multiple Issue
- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into "issue slots"
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

Speculation
- "Guess" what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

Compiler/Hardware Speculation
- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include "fix-up" instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

Speculation and Exceptions
- What if exception occurs on a speculatively executed instruction?
  - e.g., speculative load before null-pointer check
- Static speculation
  - Can add ISA support for deferring exceptions
- Dynamic speculation
  - Can buffer exceptions until instruction completion (which may not occur)

Static Multiple Issue
- Compiler groups instructions into "issue packets"
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue
- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

    Address   Instruction type   Pipeline stages
    n         ALU/branch         IF  ID  EX  MEM WB
    n + 4     Load/store         IF  ID  EX  MEM WB
    n + 8     ALU/branch             IF  ID  EX  MEM WB
    n + 12    Load/store             IF  ID  EX  MEM WB
    n + 16    ALU/branch                 IF  ID  EX  MEM WB
    n + 20    Load/store                 IF  ID  EX  MEM WB

MIPS with Static Dual Issue
[Figure f04-69]

Hazards in the Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can't use ALU result in load/store in same packet

    add $t0, $s0, $s1
    lw  $s2, 0($t0)

  - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required
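The two dual-issue rules above (no dependency inside a packet, and a load result not usable by the next packet) can be written as a small schedule checker. This is a rough sketch of what a static scheduler has to verify, not a real compiler pass: `Instr`, `Packet`, `NOP` and `check_schedule` are hypothetical names, and only register dependencies through lw and ALU results are modelled.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


NOP = Instr("nop", None, ())


class Packet(NamedTuple):
    alu: Instr   # ALU/branch slot
    mem: Instr   # load/store slot


def check_schedule(packets: Sequence[Packet]) -> List[str]:
    """Return a description of every hazard the schedule leaves unresolved."""
    problems = []
    for i, pkt in enumerate(packets):
        # Rule 1: no dependencies within a packet (both slots issue together).
        for producer, consumer in ((pkt.alu, pkt.mem), (pkt.mem, pkt.alu)):
            if producer.dest is not None and producer.dest in consumer.srcs:
                problems.append(
                    f"packet {i}: {consumer.op} reads {producer.dest} "
                    f"written by {producer.op} in the same packet"
                )
        # Rule 2: one-cycle load-use latency now covers both slots of the next packet.
        if pkt.mem.op == "lw" and i + 1 < len(packets):
            for consumer in packets[i + 1]:
                if pkt.mem.dest in consumer.srcs:
                    problems.append(
                        f"packet {i + 1}: {consumer.op} reads {pkt.mem.dest} "
                        f"one cycle after the lw that loads it"
                    )
    return problems


add = Instr("add", "$t0", ("$s0", "$s1"))
lw = Instr("lw", "$s2", ("$t0",))

same_packet = [Packet(add, lw)]                    # illegal: lw needs the add result
split = [Packet(add, NOP), Packet(NOP, lw)]        # legal: effectively a stall

print(check_schedule(same_packet))
print(check_schedule(split))   # []
```

Splitting the pair costs one issue slot in each packet, which is the "effectively a stall" noted above.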
Scheduling Example
- Schedule this for dual-issue MIPS

    Loop: lw   $t0, 0($s1)        # $t0 = array element
          addu $t0, $t0, $s2      # add scalar in $s2
          sw   $t0, 0($s1)        # store result
          addi $s1, $s1, -4       # decrement pointer
          bne  $s1, $zero, Loop   # branch if $s1 != 0

           ALU/branch              Load/store         Cycle
    Loop:  nop                     lw  $t0, 0($s1)    1
           addi $s1, $s1, -4       nop                2
           addu $t0, $t0, $s2      nop                3
           bne  $s1, $zero, Loop   sw  $t0, 4($s1)    4

- IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling
- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called "register renaming"
  - Avoid loop-carried "anti-dependencies"
    - Store followed by a load of the same register
    - Aka "name dependence"
      - Reuse of a register name

Loop Unrolling Example

           ALU/branch              Load/store          Cycle
    Loop:  addi $s1, $s1, -16      lw  $t0, 0($s1)     1
           nop                     lw  $t1, 12($s1)    2
           addu $t0, $t0, $s2      lw  $t2, 8($s1)     3
           addu $t1, $t1, $s2      lw  $t3, 4($s1)     4
           addu $t2, $t2, $s2      sw  $t0, 16($s1)    5
           addu $t3, $t3, $s2      sw  $t1, 12($s1)    6
           nop                     sw  $t2, 8($s1)     7
           bne  $s1, $zero, Loop   sw  $t3, 4($s1)     8

- IPC = 14/8 = 1.75
  - Closer to 2, but at cost of registers and code size

Dynamic Multiple Issue
- "Superscalar" processors
- CPU decides whether to issue 0, 1, 2, ... instructions each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls
  - But commit result to registers in order
- Example

    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20

  - Can start sub while addu is waiting for lw

Dynamically Scheduled CPU
[Figure f04-72]
- Preserves dependencies
- Hold pending operands
- Results also sent to any waiting reservation stations
- Reorder buffer for register writes
- Can supply operands for issued instructions

Register Renaming
- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - No longer required in the register; can be overwritten
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit
    - Register update may not be required

Speculation
- Predict branch and continue issuing
  - Don't commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don't commit load until speculation cleared

Why Do Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work? (The BIG Picture)
- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well

Power Efficiency
- Complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores may be better

    Microprocessor   Year   Clock Rate   Pipeline Stages   Issue Width   Out-of-order/Speculation   Cores   Power
    i486             1989   25MHz        5                 1             No                         1       5W
    Pentium          1993   66MHz        5                 2             No                         1       10W
    Pentium Pro      1997   200MHz       10                3             Yes                        1       29W
    P4 Willamette    2001   2000MHz      22                3             Yes                        1       75W
    P4 Prescott      2004   3600MHz      31                3             Yes                        1       103W
    Core             2006   2930MHz      14                4             Yes                        2       75W
    UltraSparc III   2003   1950MHz      14                4             No                         1       90W
    UltraSparc T1    2005   1200MHz      6                 1             No                         8       70W

The Opteron X4 Microarchitecture
[Figure f04-74]
- 72 physical registers

The Opteron X4 Pipeline Flow
[Figure f04-75]
- For integer operations
  - FP is 5 stages longer
  - Up to 106 RISC-ops in progress
- Bottlenecks
  - Complex instructions with long dependencies
  - Branch mispredictions
  - Memory access delays

Fallacies
- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven't we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to take account of technology trends
    - e.g., predicated instructions

Pitfalls
- Poor ISA design can make pipelining harder
  - e.g., complex instruction sets (VAX, IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - e.g., complex addressing modes
    - Register update side effects, memory indirection
  - e.g., delayed branches
    - Advanced pipelines have long delay slots

Concluding Remarks
- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling (ILP)
  - Dependencies limit achievable parallelism
  - Complexity leads to the power wall
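As a closing check on the throughput-versus-latency point, the stage times from the Pipeline Performance slide (100ps for register read or write, 200ps for the other stages) can be plugged into a short calculation. This is an idealized sketch that ignores hazards and stalls; the function names are illustrative.

```python
# Stage delays in picoseconds, taken from the Pipeline Performance slide.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(n: int) -> int:
    """Total time for n instructions when the clock fits the slowest instruction (lw)."""
    clock = sum(STAGE_PS.values())          # 800 ps
    return n * clock


def pipelined_time(n: int) -> int:
    """Total time for n instructions with no stalls; the clock fits the slowest stage."""
    clock = max(STAGE_PS.values())          # 200 ps
    return (n + len(STAGE_PS) - 1) * clock  # fill the pipeline, then one per cycle


def speedup(n: int) -> float:
    return single_cycle_time(n) / pipelined_time(n)


print(single_cycle_time(3), pipelined_time(3))   # 2400 1400
print(round(speedup(1_000_000), 3))              # 4.0

# Latency of one instruction: 800 ps single-cycle vs 5 * 200 = 1000 ps pipelined.
print(single_cycle_time(1), pipelined_time(1))   # 800 1000
```

The speedup approaches 800/200 = 4 rather than the stage count of 5 because the stages are not balanced, and the latency of a single instruction rises from 800ps to 1000ps even as throughput improves.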