MKP-logo-white-transparent Title 4th-edition
Chapter 4
The Processor

Introduction
- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and cycle time
    - Determined by CPU hardware
- We will examine two MIPS implementations
  - A simplified version
  - A more realistic pipelined version
- Simple subset, shows most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j

Instruction Execution
- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch target address
  - Access data memory for load/store
  - PC ← target address or PC + 4

CPU Overview
[Figure 4.1]

Multiplexers
[Figure 4.1]
- Can’t just join wires together
  - Use multiplexers

Control
[Figure 4.2]

Logic Design Basics
- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational element
  - Operate on data
  - Output is a function of input
- State (sequential) elements
  - Store information

Combinational Elements
- AND-gate
  - Y = A & B
- Multiplexer
  - Y = S ? I1 : I0
- Adder
  - Y = A + B
- Arithmetic/Logic Unit
  - Y = F(A, B)
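
The four elements above can be modelled in software as pure functions of their inputs. This is only a sketch: the 32-bit word width, the function names, and the operation names accepted by `alu` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath width


def and_gate(a: int, b: int) -> int:
    """Y = A & B for single-bit inputs."""
    return a & b


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0."""
    return i1 if s else i0


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapped to 32 bits."""
    return (a + b) & MASK32


def alu(f: str, a: int, b: int) -> int:
    """Y = F(A, B); the operation names are hypothetical labels."""
    ops = {
        "and": lambda x, y: x & y,
        "or": lambda x, y: x | y,
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
    }
    return ops[f](a, b) & MASK32
```

Each function has no internal state, which is what makes the element combinational: the same inputs always give the same output.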

Sequential Elements
- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes from 0 to 1
[Figure: register symbol with inputs D and Clk and output Q; timing diagram of Clk, D, Q]

Sequential Elements
- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later
[Figure: register symbol with inputs D, Clk, Write and output Q; timing diagram of Clk, Write, D, Q]
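
A small behavioural model may help make the edge-triggered update rule concrete. The class name and the `tick` method are hypothetical; this is a sketch of the behaviour described above, not a hardware description.

```python
class Register:
    """Edge-triggered register with a write-control input (illustrative model)."""

    def __init__(self, value: int = 0) -> None:
        self.q = value        # stored value, visible on output Q
        self._prev_clk = 0    # last clock level seen, used to detect a rising edge

    def tick(self, clk: int, d: int, write: int = 1) -> int:
        """Present Clk, D and Write; Q changes only on a 0 -> 1 clock edge with Write = 1."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Leaving `write` at its default of 1 gives the plain register from the previous slide.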

Clocking Methodology
- Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state element
  - Longest delay determines clock period
[Figures 4.3 and 4.4]

Building a Datapath
- Datapath
  - Elements that process data and addresses in the CPU
    - Registers, ALUs, muxes, memories, …
- We will build a MIPS datapath incrementally
  - Refining the overview design

Instruction Fetch
[Figure 4.6; callouts: "32-bit register" (the PC), "Increment by 4 for next instruction"]

R-Format Instructions
[Figure 4.7]
- Read two register operands
- Perform arithmetic/logical operation
- Write register result

Load/Store Instructions
[Figure 4.8]
- Read register operands
- Calculate address using 16-bit offset
  - Use ALU, but sign-extend offset
- Load: Read memory and update register
- Store: Write register value to memory

Branch Instructions
- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch
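
The target-address steps above can be sketched as follows. The function names are hypothetical, and the result is wrapped to 32 bits as an assumption about the address width.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    word_offset = sign_extend16(imm16)   # displacement counted in words
    byte_offset = word_offset << 2       # shift left 2: words to bytes
    return (pc + 4 + byte_offset) & 0xFFFFFFFF
```

For instance, a `beq` at address 40 with displacement 7 gives 40 + 4 + 28 = 72, and a displacement of 0xFFFF (that is, -1) branches back to the `beq` itself.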

Branch Instructions
[Figure 4.9; callouts: the shift left 2 just re-routes wires; sign extension replicates the sign-bit wire]

Composing the Elements
- First-cut datapath does an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions

R-Type/Load/Store Datapath
[Figure 4.10]

Full Datapath
[Figure 4.11]

ALU Control
- ALU used for
  - Load/Store: F = add
  - Branch: F = subtract
  - R-type: F depends on funct field

| ALU control | Function         |
|-------------|------------------|
| 0000        | AND              |
| 0001        | OR               |
| 0010        | add              |
| 0110        | subtract         |
| 0111        | set-on-less-than |
| 1100        | NOR              |

ALU Control
- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

| opcode | ALUOp | Operation        | funct  | ALU function     | ALU control |
|--------|-------|------------------|--------|------------------|-------------|
| lw     | 00    | load word        | XXXXXX | add              | 0010        |
| sw     | 00    | store word       | XXXXXX | add              | 0010        |
| beq    | 01    | branch equal     | XXXXXX | subtract         | 0110        |
| R-type | 10    | add              | 100000 | add              | 0010        |
| R-type | 10    | subtract         | 100010 | subtract         | 0110        |
| R-type | 10    | AND              | 100100 | AND              | 0000        |
| R-type | 10    | OR               | 100101 | OR               | 0001        |
| R-type | 10    | set-on-less-than | 101010 | set-on-less-than | 0111        |
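
The ALUOp/funct table can be read as a small decoding function. The sketch below encodes exactly the rows listed above; the function and dictionary names are hypothetical, and rejecting unlisted inputs with an exception is a choice made for illustration.

```python
# funct field -> 4-bit ALU control, for R-type instructions (ALUOp = 10)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:   # lw / sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:   # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:   # R-type: operation selected by funct
        return FUNCT_TO_CONTROL[funct]
    raise ValueError(f"ALUOp {alu_op:02b} is not listed in the table")
```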

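The Main Control Unit
- Control signals derived from instruction

R-type
| 0     | rs    | rt    | rd    | shamt | funct |
|-------|-------|-------|-------|-------|-------|
| 31:26 | 25:21 | 20:16 | 15:11 | 10:6  | 5:0   |

Load/Store
| 35 or 43 | rs    | rt    | address |
|----------|-------|-------|---------|
| 31:26    | 25:21 | 20:16 | 15:0    |

Branch
| 4     | rs    | rt    | address |
|-------|-------|-------|---------|
| 31:26 | 25:21 | 20:16 | 15:0    |

Field annotations
- Bits 31:26: opcode
- rs: always read
- rt: read, except for load
- rd (R-type) or rt (load): write for R-type and load
- address: sign-extend and add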
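
Because the fields sit at fixed bit positions, they can be extracted with shifts and masks before the opcode is even known. A minimal sketch (the function name and the dictionary layout are assumptions):

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields shown above."""
    return {
        "opcode": (instr >> 26) & 0x3F,   # bits 31:26
        "rs": (instr >> 21) & 0x1F,       # bits 25:21
        "rt": (instr >> 16) & 0x1F,       # bits 20:16
        "rd": (instr >> 11) & 0x1F,       # bits 15:11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,     # bits 10:6  (R-type only)
        "funct": instr & 0x3F,            # bits 5:0   (R-type only)
        "address": instr & 0xFFFF,        # bits 15:0  (load/store/branch)
    }
```

For example, the word 0x012A4020 (`add $t0, $t1, $t2`) decodes to opcode 0, rs 9, rt 10, rd 8, funct 0x20, and 0x8D280004 (`lw $t0, 4($t1)`) decodes to opcode 35, rs 9, rt 8, address 4.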

Datapath With Control
[Figure 4.17]

R-Type Instruction
[Figure 4.19]

Load Instruction
[Figure 4.20]

Branch-on-Equal Instruction
[Figure 4.21]

Implementing Jumps

Jump
| 2     | address |
|-------|---------|
| 31:26 | 25:0    |

- Jump uses word address
- Update PC with concatenation of
  - Top 4 bits of old PC
  - 26-bit jump address
  - 00
- Need an extra control signal decoded from opcode
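
The concatenation can be sketched with masks and shifts. One assumption to flag: the "old PC" is taken here to be PC + 4 (the already-incremented value), and the function name is hypothetical.

```python
def jump_target(pc: int, addr26: int) -> int:
    """New PC = top 4 bits of (PC + 4) : 26-bit jump address : 00."""
    upper = (pc + 4) & 0xF0000000          # top 4 bits of the incremented PC
    lower = (addr26 & 0x03FFFFFF) << 2     # word address to byte address (appends 00)
    return upper | lower
```

With `pc = 0x10000008` and a jump address field of 3, the result is 0x1000000C.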

Datapath With Jumps Added
[Figure 4.24]

Performance Issues
- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining

Pipelining Analogy
[Figure 4.25]
- Pipelined laundry: overlapping execution
  - Parallelism improves performance
- Four loads:
  - \[ \text{Speedup} = \frac{8}{3.5} \approx 2.3 \]
- Non-stop:
  - \[ \text{Speedup} = \frac{2n}{0.5n + 1.5} \approx 4 = \text{number of stages} \]
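
The two speedup figures follow from one formula. The sketch below assumes four laundry stages of half an hour each, which is what the totals 8 and 3.5 imply; the function and parameter names are hypothetical.

```python
def laundry_speedup(n_loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of pipelined over sequential laundry for n_loads loads."""
    sequential = n_loads * stages * stage_hours        # 2n hours
    pipelined = (stages + n_loads - 1) * stage_hours   # 0.5n + 1.5 hours
    return sequential / pipelined
```

`laundry_speedup(4)` evaluates 8/3.5, and as `n_loads` grows the 1.5-hour fill time becomes negligible, so the ratio approaches the number of stages.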

MIPS Pipeline
- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100ps         | 200ps  | 200ps         | 100ps          | 800ps      |
| sw       | 200ps       | 100ps         | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100ps         | 200ps  |               | 100ps          | 600ps      |
| beq      | 200ps       | 100ps         | 200ps  |               |                | 500ps      |

Pipeline Performance
[Figure 4.27; panels: single-cycle (Tc = 800ps), pipelined (Tc = 200ps)]
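
The two clock periods can be derived from the stage times, and total execution time estimated for a run of instructions. This is a sketch assuming an ideal pipeline with no hazards; the names are hypothetical.

```python
# Stage times in picoseconds, as assumed on the previous slide
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(n_instr: int) -> int:
    """Every instruction takes one long cycle sized for the slowest instruction (lw)."""
    tc = sum(STAGE_PS.values())          # 800 ps
    return n_instr * tc


def pipelined_time(n_instr: int) -> int:
    """Clock is set by the slowest stage; the first instruction needs one cycle per stage."""
    tc = max(STAGE_PS.values())          # 200 ps
    return (len(STAGE_PS) + n_instr - 1) * tc
```

For three instructions this gives 2400 ps single-cycle against 1400 ps pipelined; for long runs the ratio tends to 800/200 = 4 rather than 5, because the stages are not balanced.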

Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - \[ \text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}} \]
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

Pipelining and ISA Design
- MIPS ISA designed for pipelining
  - All instructions are 32 bits
    - Easier to fetch and decode in one cycle
    - cf. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3rd stage, access memory in 4th stage
  - Alignment of memory operands
    - Memory access takes only one cycle

Hazards
- Situations that prevent starting the next instruction in the next cycle
- Structure hazards
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

Structure Hazards
- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

Data Hazards
- An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
[Figure: data hazard bubble without forwarding]

Forwarding (aka Bypassing)
[Figure 4.29]
- Use result when it is computed
  - Don’t wait for it to be stored in a register
  - Requires extra connections in the datapath

Load-Use Data Hazard
[Figure 4.30]
- Can’t always avoid stalls by forwarding
  - If value not computed when needed
  - Can’t forward backward in time!

Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
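
The 13- and 11-cycle figures can be reproduced by counting load-use stalls. The sketch below assumes a 5-stage pipeline with forwarding, so the only stall is one cycle when an instruction uses the result of the `lw` immediately before it; the tuple encoding of instructions and the function name are hypothetical.

```python
# Each instruction: (opcode, destination register or None, source registers)
ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Same instructions with the third load hoisted above the first add
REORDERED = [ORIGINAL[i] for i in (0, 1, 4, 2, 3, 5, 6)]


def count_cycles(program, stages: int = 5) -> int:
    """Cycles to run the program: pipeline fill + one per instruction + load-use stalls."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            stalls += 1
    return (stages - 1) + len(program) + stalls
```

`count_cycles(ORIGINAL)` finds two stalls (after the loads of `$t2` and `$t4`), while `count_cycles(REORDERED)` finds none.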

Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can’t always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

Stall on Branch
[Figure 4.31]
- Wait until branch outcome determined before fetching next instruction

Branch Prediction
- Longer pipelines can’t readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Figure 4.32; panels: prediction correct, prediction incorrect]

More-Realistic Branch Prediction
- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

Pipeline Summary
- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation
The BIG Picture

MIPS Pipelined Datapath
[Figure 4.33; callouts: WB, MEM]
- Right-to-left flow leads to hazards

Pipeline registers
[Figure 4.35]
- Need registers between stages
  - To hold information produced in previous cycle

Pipeline Operation
- Cycle-by-cycle flow of instructions through the pipelined datapath
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - cf. “multi-clock-cycle” diagram
    - Graph of operation over time
- We’ll look at “single-clock-cycle” diagrams for load & store

IF for Load, Store, …
[Figure 4.36, IF stage]

ID for Load, Store, …
[Figure 4.36, ID stage]

EX for Load
[Figure 4.37]

MEM for Load
[Figure 4.38, MEM stage]

WB for Load
[Figure 4.38, WB stage; callout: wrong register number]

Corrected Datapath for Load
[Figure 4.41]

EX for Store
[Figure 4.39]

MEM for Store
[Figure 4.40, MEM stage]

WB for Store
[Figure 4.40, WB stage]

Multi-Cycle Pipeline Diagram
[Figure 4.43]
- Form showing resource usage

Multi-Cycle Pipeline Diagram
- Traditional form
[Figure 4.44]

Single-Cycle Pipeline Diagram
[Figure 4.45]
- State of pipeline in a given cycle

Pipelined Control (Simplified)
[Figure 4.46]

Pipelined Control
[Figure 4.50]
- Control signals derived from instruction
  - As in single-cycle implementation

Pipelined Control
[Figure 4.51]

Data Hazards in ALU Instructions
- Consider this sequence:
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- We can resolve hazards with forwarding
  - How do we detect when to forward?

Dependencies & Forwarding
[Figure 4.52]

Detecting the Need to Forward
- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  - 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs (fwd from EX/MEM pipeline reg)
  - 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt (fwd from EX/MEM pipeline reg)
  - 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs (fwd from MEM/WB pipeline reg)
  - 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt (fwd from MEM/WB pipeline reg)

Detecting the Need to Forward
- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths
[Figure 4.54, bottom]

Forwarding Conditions
- EX hazard
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
- MEM hazard
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Double Data Hazard
- Consider the sequence:
    add $1, $1, $2
    add $1, $1, $3
    add $1, $1, $4
- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn’t true

Revised Forwarding Condition
- MEM hazard
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
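
The EX-hazard and revised MEM-hazard conditions can be folded into one function by checking the EX hazard first. This is a sketch: the dictionary keys, the function name, and the default value 00 (taken here to mean "use the register-file value") are assumptions.

```python
def forward_select(ex_mem: dict, mem_wb: dict, src_reg: int) -> int:
    """Return the 2-bit forwarding control for one ALU operand.

    ex_mem / mem_wb hold 'RegWrite' (bool) and 'Rd' (register number) from the
    corresponding pipeline register; src_reg is ID/EX.RegisterRs or RegisterRt.
    """
    ex_hazard = ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg
    if ex_hazard:
        return 0b10          # forward from EX/MEM (most recent result)
    mem_hazard = mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg
    if mem_hazard:
        return 0b01          # forward from MEM/WB, only when there is no EX hazard
    return 0b00              # assumed default: no forwarding
```

ForwardA is `forward_select(ex_mem, mem_wb, rs)` and ForwardB is `forward_select(ex_mem, mem_wb, rt)`. In the double-hazard sequence of `add $1, $1, ...` instructions, both pipeline registers hold Rd = 1, and the EX/MEM value wins.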

Datapath with Forwarding
[Figure 4.56]

Load-Use Data Hazard
[Figure 4.58; callout: need to stall for one cycle]

Load-Use Hazard Detection
- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
    ID/EX.MemRead and
      ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
       (ID/EX.RegisterRt = IF/ID.RegisterRt))
- If detected, stall and insert bubble
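
The detection condition translates directly into a Boolean function. The function and parameter names below are hypothetical stand-ins for the pipeline-register fields.

```python
def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load that is in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
```

For `lw $2, 20($1)` followed by `and $4, $2, $5`, the call `load_use_hazard(True, 2, 2, 5)` reports a hazard, so the pipeline stalls one cycle.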

How to Stall the Pipeline
- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
    - Can subsequently forward to EX stage

Stall/Bubble in the Pipeline
[Figure 4.59; callout: stall inserted here]

Stall/Bubble in the Pipeline
[Figure 4.59, redrawn to show what really happens]
- Or, more accurately…

Datapath with Hazard Detection
[Figure 4.60]

Stalls and Performance
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure
The BIG Picture

Branch Hazards
- If branch outcome determined in MEM
[Figure 4.61; callouts: PC; flush these instructions (set control values to 0)]

Reducing Branch Delay
- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken
    36:  sub  $10, $4, $8
    40:  beq  $1,  $3, 7
    44:  and  $12, $2, $5
    48:  or   $13, $2, $6
    52:  add  $14, $4, $2
    56:  slt  $15, $6, $7
         ...
    72:  lw   $4, 50($7)

Example: Branch Taken
[Figure 4.62, top]

Example: Branch Taken
[Figure 4.62, bottom]

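Data Hazards for Branches
- If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
  - Can resolve using forwarding

| Instruction        | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8  |
|--------------------|----|----|----|-----|-----|-----|-----|----|
| add $1, $2, $3     | IF | ID | EX | MEM | WB  |     |     |    |
| add $4, $5, $6     |    | IF | ID | EX  | MEM | WB  |     |    |
| …                  |    |    | IF | ID  | EX  | MEM | WB  |    |
| beq $1, $4, target |    |    |    | IF  | ID  | EX  | MEM | WB |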

Data Hazards for Branches
- If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

| Instruction        | 1  | 2  | 3  | 4            | 5   | 6  | 7   | 8  |
|--------------------|----|----|----|--------------|-----|----|-----|----|
| lw  $1, addr       | IF | ID | EX | MEM          | WB  |    |     |    |
| add $4, $5, $6     |    | IF | ID | EX           | MEM | WB |     |    |
| beq $1, $4, target |    |    | IF | ID (stalled) | ID  | EX | MEM | WB |

Data Hazards for Branches
- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

| Instruction        | 1  | 2  | 3            | 4            | 5  | 6  | 7   | 8  |
|--------------------|----|----|--------------|--------------|----|----|-----|----|
| lw  $1, addr       | IF | ID | EX           | MEM          | WB |    |     |    |
| beq $1, $0, target |    | IF | ID (stalled) | ID (stalled) | ID | EX | MEM | WB |

Dynamic Branch Prediction
- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
- Inner loop branches mispredicted twice!
    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer
  - Mispredict as taken on last iteration of inner loop
  - Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor
[Figure 4.63]
- Only change prediction on two successive mispredictions
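
A short simulation can compare the two schemes on an inner-loop branch. The 2-bit predictor is modelled here as a saturating counter, which is one common realization and an assumption on my part; the function names and the example outcome pattern are also illustrative.

```python
def mispredicts_1bit(outcomes, predict_taken: bool = True) -> int:
    """1-bit predictor: always predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken          # flip prediction on every misprediction
    return misses


def mispredicts_2bit(outcomes, counter: int = 3) -> int:
    """2-bit saturating counter (0..3): predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Inner-loop branch: taken three times, then falls through; outer loop runs 3 times.
INNER_LOOP = [True, True, True, False] * 3
```

Starting both predictors in a "taken" state, the 1-bit scheme misses on the loop exit and again on re-entry, while the 2-bit scheme misses only on the exit.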

Calculating the Branch Target
- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately

Exceptions and Interrupts
- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, …
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

Handling Exceptions
- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We’ll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 0180

An Alternate Mechanism
- Vectored Interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - …: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

Handler Actions
- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - Use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, …

Exceptions in a Pipeline
- Another form of control hazard
- Consider overflow on add in EX stage
    add $1, $2, $1
  - Prevent $1 from being clobbered
  - Complete previous instructions
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

Pipeline with Exceptions
[Figure 4.66]

Exception Properties
- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

Exception Example
- Exception on add in
    40  sub  $11, $2, $4
    44  and  $12, $2, $5
    48  or   $13, $2, $6
    4C  add  $1,  $2, $1
    50  slt  $15, $6, $7
    54  lw   $16, 50($7)
    …
- Handler
    80000180  sw   $25, 1000($0)
    80000184  sw   $26, 1004($0)
    …

Exception Example
[Figure 4.67, top]

Exception Example
[Figure 4.67, bottom]

Multiple Exceptions
- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - “Precise” exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!

Imprecise Exceptions
- Just stop pipeline and save state
  - Including exception cause(s)
- Let the handler work out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    - May require “manual” completion
- Simplifies hardware, but more complex handler software
- Not feasible for complex multiple-issue out-of-order pipelines

Instruction-Level Parallelism (ILP)
- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage ⇒ shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages ⇒ multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice
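
The peak figures in the example are simple products and reciprocals; a sketch with hypothetical function names:

```python
def peak_instructions_per_second(clock_hz: float, issue_width: int) -> float:
    """Peak rate if every issue slot is filled on every cycle."""
    return clock_hz * issue_width


def peak_cpi(issue_width: int) -> float:
    """Peak CPI is the reciprocal of peak IPC."""
    return 1 / issue_width
```

For a 4 GHz, 4-way machine this gives 16e9 instructions per second (16 BIPS) and a peak CPI of 0.25. These are upper bounds only, since dependencies leave some slots empty.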

Multiple Issue
- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into “issue slots”
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

Speculation
- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

Compiler/Hardware Speculation
- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include “fix-up” instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

Speculation and Exceptions
- What if exception occurs on a speculatively executed instruction?
  - e.g., speculative load before null-pointer check
- Static speculation
  - Can add ISA support for deferring exceptions
- Dynamic speculation
  - Can buffer exceptions until instruction completion (which may not occur)

Static Multiple Issue
- Compiler groups instructions into “issue packets”
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue
- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

Pipeline stages by clock cycle:

| Address | Instruction type | 1  | 2  | 3  | 4   | 5   | 6   | 7  |
|---------|------------------|----|----|----|-----|-----|-----|----|
| n       | ALU/branch       | IF | ID | EX | MEM | WB  |     |    |
| n + 4   | Load/store       | IF | ID | EX | MEM | WB  |     |    |
| n + 8   | ALU/branch       |    | IF | ID | EX  | MEM | WB  |    |
| n + 12  | Load/store       |    | IF | ID | EX  | MEM | WB  |    |
| n + 16  | ALU/branch       |    |    | IF | ID  | EX  | MEM | WB |
| n + 20  | Load/store       |    |    | IF | ID  | EX  | MEM | WB |

MIPS with Static Dual Issue
[Figure 4.69]

Hazards in the Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can’t use ALU result in load/store in same packet
      add $t0, $s0, $s1
      lw  $s2, 0($t0)
    - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required

Scheduling Example
- Schedule this for dual-issue MIPS
    Loop: lw   $t0, 0($s1)      # $t0=array element
          addu $t0, $t0, $s2    # add scalar in $s2
          sw   $t0, 0($s1)      # store result
          addi $s1, $s1, -4     # decrement pointer
          bne  $s1, $zero, Loop # branch $s1!=0

|       | ALU/branch            | Load/store       | cycle |
|-------|-----------------------|------------------|-------|
| Loop: | nop                   | lw   $t0, 0($s1) | 1     |
|       | addi $s1, $s1, -4     | nop              | 2     |
|       | addu $t0, $t0, $s2    | nop              | 3     |
|       | bne  $s1, $zero, Loop | sw   $t0, 4($s1) | 4     |

- IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling
- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called “register renaming”
  - Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - Aka “name dependence”
      - Reuse of a register name

Loop Unrolling Example

|       | ALU/branch            | Load/store        | cycle |
|-------|-----------------------|-------------------|-------|
| Loop: | addi $s1, $s1, -16    | lw   $t0, 0($s1)  | 1     |
|       | nop                   | lw   $t1, 12($s1) | 2     |
|       | addu $t0, $t0, $s2    | lw   $t2, 8($s1)  | 3     |
|       | addu $t1, $t1, $s2    | lw   $t3, 4($s1)  | 4     |
|       | addu $t2, $t2, $s2    | sw   $t0, 16($s1) | 5     |
|       | addu $t3, $t3, $s2    | sw   $t1, 12($s1) | 6     |
|       | nop                   | sw   $t2, 8($s1)  | 7     |
|       | bne  $s1, $zero, Loop | sw   $t3, 4($s1)  | 8     |

- IPC = 14/8 = 1.75
  - Closer to 2, but at cost of registers and code size
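
The IPC figure can be checked by counting the non-nop slots in the schedule. In the sketch below each packet is an (ALU/branch, load/store) pair; the list layout and the function name are assumptions made for illustration.

```python
NOP = "nop"

# One (ALU/branch slot, load/store slot) pair per cycle of the unrolled loop
UNROLLED = [
    ("addi $s1, $s1, -16", "lw $t0, 0($s1)"),
    (NOP, "lw $t1, 12($s1)"),
    ("addu $t0, $t0, $s2", "lw $t2, 8($s1)"),
    ("addu $t1, $t1, $s2", "lw $t3, 4($s1)"),
    ("addu $t2, $t2, $s2", "sw $t0, 16($s1)"),
    ("addu $t3, $t3, $s2", "sw $t1, 12($s1)"),
    (NOP, "sw $t2, 8($s1)"),
    ("bne $s1, $zero, Loop", "sw $t3, 4($s1)"),
]


def ipc(schedule) -> float:
    """Useful (non-nop) instructions issued per cycle."""
    useful = sum(1 for packet in schedule for slot in packet if slot != NOP)
    return useful / len(schedule)
```

Two of the sixteen slots are nops, leaving 14 useful instructions over 8 cycles.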

Dynamic Multiple Issue
- “Superscalar” processors
- CPU decides whether to issue 0, 1, 2, … instructions each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls
  - But commit result to registers in order
- Example
    lw    $t0, 20($s2)
    addu  $t1, $t0, $t2
    sub   $s4, $s4, $t3
    slti  $t5, $s4, 20
  - Can start sub while addu is waiting for lw

Dynamically Scheduled CPU
[Figure 4.72; callouts:]
- Preserves dependencies
- Hold pending operands
- Results also sent to any waiting reservation stations
- Reorder buffer for register writes
- Can supply operands for issued instructions

Register Renaming
- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - No longer required in the register; can be overwritten
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit
    - Register update may not be required

Speculation
- Predict branch and continue issuing
  - Don’t commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don’t commit load until speculation cleared

Why Do Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can’t always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work?
- Yes, but not as much as we’d like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well
The BIG Picture

Power Efficiency
- Complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores may be better

| Microprocessor | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-Order/Speculation | Cores | Power |
|----------------|------|------------|-----------------|-------------|--------------------------|-------|-------|
| i486           | 1989 | 25MHz      | 5               | 1           | No                       | 1     | 5W    |
| Pentium        | 1993 | 66MHz      | 5               | 2           | No                       | 1     | 10W   |
| Pentium Pro    | 1997 | 200MHz     | 10              | 3           | Yes                      | 1     | 29W   |
| P4 Willamette  | 2001 | 2000MHz    | 22              | 3           | Yes                      | 1     | 75W   |
| P4 Prescott    | 2004 | 3600MHz    | 31              | 3           | Yes                      | 1     | 103W  |
| Core           | 2006 | 2930MHz    | 14              | 4           | Yes                      | 2     | 75W   |
| UltraSparc III | 2003 | 1950MHz    | 14              | 4           | No                       | 1     | 90W   |
| UltraSparc T1  | 2005 | 1200MHz    | 6               | 1           | No                       | 8     | 70W   |

The Opteron X4 Microarchitecture
[Figure 4.74; callout: 72 physical registers]

The Opteron X4 Pipeline Flow
- For integer operations
[Figure 4.75]
  - FP is 5 stages longer
  - Up to 106 RISC-ops in progress
- Bottlenecks
  - Complex instructions with long dependencies
  - Branch mispredictions
  - Memory access delays

Fallacies
- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven’t we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to take account of technology trends
    - e.g., predicated instructions

Pitfalls
- Poor ISA design can make pipelining harder
  - e.g., complex instruction sets (VAX, IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - e.g., complex addressing modes
    - Register update side effects, memory indirection
  - e.g., delayed branches
    - Advanced pipelines have long delay slots

Concluding Remarks
- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling (ILP)
  - Dependencies limit achievable parallelism
  - Complexity leads to the power wall