Trang 2dce
Chapter 3
The Processor
Adapted from Computer Organization and
2008
The Five classic Components of a Computer
• Determined by ISA and compiler
– CPI and Cycle time
• Determined by CPU hardware
• We will examine two MIPS implementations
– A simplified version
– A more realistic pipelined version
• Simple subset, shows most aspects
– Memory reference: lw, sw
– Arithmetic/logical: add, sub, and, or, slt
– Control transfer: beq, j
Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
– Use ALU to calculate
• Arithmetic result
• Memory address for load/store
• Branch target address
– Access data memory for load/store
– PC ← target address or PC + 4
CPU Overview
Control
Logic Design Basics
• Information encoded in binary
– Low voltage = 0, High voltage = 1
– One wire per bit
– Multi-bit data encoded on multi-wire buses
• Combinational element
– Operate on data
– Output is a function of input
• State (sequential) elements
– Store information
Sequential Elements
• Register: stores data in a circuit
– Uses a clock signal to determine when to update the stored value
– Edge-triggered: update when Clk changes from 0 to 1
[Figure: register with D and Clk inputs and Q output]
Sequential Elements
• Register with write control
– Only updates on clock edge when write control input is 1
– Used when stored value is required later
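The behavior of a register with write control can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description: the class name `Register`, the method `tick`, and the argument names are assumptions made for this example.

```python
class Register:
    """Illustrative model of an edge-triggered register with write control."""

    def __init__(self, value=0):
        self.q = value        # stored value, visible on output Q
        self._prev_clk = 0    # previous clock level, used to detect edges

    def tick(self, clk, d, write=1):
        """Apply clock level `clk`, data input `d`, and write control `write`."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write == 1:
            self.q = d        # update only on a 0 -> 1 edge with write = 1
        self._prev_clk = clk
        return self.q


reg = Register()
reg.tick(clk=1, d=5, write=1)   # rising edge, write enabled: Q takes D
reg.tick(clk=0, d=9, write=1)   # falling edge: Q holds its value
reg.tick(clk=1, d=9, write=0)   # rising edge, write disabled: Q holds its value
```

The stored value changes only when both conditions hold at once: a 0 to 1 clock transition and a write control input of 1.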
Clocking Methodology
• Combinational logic transforms data during clock cycles
– Between clock edges
– Input from state elements, output to state element
– Longest delay determines clock period
Building a Datapath
• Datapath
– Elements that process data and addresses in the CPU
• Registers, ALUs, mux’s, memories, …
• We will build a MIPS datapath incrementally
– Refining the overview design
Review Instruction Formats
R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
Load/Store Instructions
• Read register operands
• Calculate address using 16-bit offset
– Use ALU, but sign-extend offset
• Load: Read memory and update register
• Store: Write register value to memory
– Use ALU, subtract and check Zero output
• Calculate target address
– Sign-extend displacement
– Shift left 2 places (word displacement)
– Add to PC + 4
• Already calculated by instruction fetch
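The target-address steps above (sign-extend, shift left 2, add to PC + 4) can be sketched as follows. The function names `sign_extend16` and `branch_target` are hypothetical, chosen for this example; 32-bit wraparound is modeled with a mask.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """Branch target = (PC + 4) + (sign-extended displacement << 2)."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF          # already computed by instruction fetch
    word_offset = sign_extend16(imm16) << 2    # word displacement to byte displacement
    return (pc_plus_4 + word_offset) & 0xFFFFFFFF


forward = branch_target(40, 7)          # beq at address 40 with displacement 7
backward = branch_target(40, 0xFFFF)    # displacement -1 branches back to the beq itself
```

For a branch at address 40 with displacement 7, this gives 44 + 28 = 72; a displacement of -1 (0xFFFF) gives 44 - 4 = 40.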
[Figure label: sign-bit wire replicated]
Composing the Elements
• First-cut data path does an instruction in one clock cycle
– Each datapath element can only do one function at a time
– Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
– R-type: F depends on funct field
ALU Control
• Assume 2-bit ALUOp derived from opcode
– Combinational logic derives ALU control

opcode   ALUOp   Operation          funct    ALU function       ALU control
lw       00      load word          XXXXXX   add                0010
sw       00      store word         XXXXXX   add                0010
beq      01      branch equal       XXXXXX   subtract           0110
R-type   10      add                100000   add                0010
                 subtract           100010   subtract           0110
                 AND                100100   AND                0000
                 set-on-less-than   101010   set-on-less-than   0111
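The table can be read as a small combinational function from (ALUOp, funct) to the 4-bit ALU control. The sketch below is one way to express that mapping; the names `FUNCT_TO_ALU` and `alu_control` are hypothetical. The OR entry (funct 100101, control 0001) does not appear in the table as shown and is added here as an assumption based on the standard MIPS encoding.

```python
# funct field -> ALU control, used only when ALUOp = 10 (R-type)
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR (assumption: not listed in the table above)
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op, funct=0):
    """Derive the 4-bit ALU control from 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:          # lw / sw: add to form the address
        return 0b0010
    if alu_op == 0b01:          # beq: subtract and check Zero
        return 0b0110
    if alu_op == 0b10:          # R-type: operation depends on funct
        return FUNCT_TO_ALU[funct]
    raise ValueError("unsupported ALUOp")


lw_control = alu_control(0b00)             # funct is a don't-care for lw/sw
slt_control = alu_control(0b10, 0b101010)  # R-type set-on-less-than
```

For ALUOp 00 and 01 the funct argument is ignored, which mirrors the XXXXXX (don't-care) entries in the table.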
The Main Control Unit
• Control signals derived from instruction
[Figure annotations: write for R-type and load; sign-extend and add]
Datapath With Control
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
• Update PC with concatenation of
– Top 4 bits of old PC
– 26-bit jump address
– 00
• Need an extra control signal decoded from
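The concatenation for jumps (4 + 26 + 2 = 32 bits) can be sketched with masks and shifts. The function name `jump_target` is hypothetical. The caller passes whichever PC value the datapath uses for the upper bits; in MIPS this is commonly PC + 4, which is stated here as an assumption since the slide only says "old PC".

```python
def jump_target(pc, addr26):
    """Concatenate: top 4 bits of pc | 26-bit jump address | 00."""
    upper4 = pc & 0xF0000000                   # top 4 bits of the PC value
    middle26 = (addr26 & 0x03FFFFFF) << 2      # 26-bit field followed by two 0 bits
    return upper4 | middle26


near = jump_target(0x10000008, 0x3)            # 0x1000000C
far = jump_target(0xF0000000, 0x03FFFFFF)      # 0xFFFFFFFC
```

Because the low two bits are always 00, the target is word-aligned, and the jump cannot leave the region selected by the upper 4 bits.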
Datapath With Jumps Added
Performance Issues
• Longest delay determines clock period
– Critical path: load instruction
– Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle
– Making the common case fast
• We will improve performance by pipelining
Pipelining Analogy
• Pipelined laundry: overlapping execution
– Parallelism improves performance
MIPS Pipeline
• Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Pipeline Performance
• Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle datapath
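The comparison can be worked through numerically. The sketch below assumes the ID stage is the 100ps register read, the WB stage is the 100ps register write, and the remaining stages take 200ps each; the function names are hypothetical. It also assumes the single-cycle clock must fit an instruction that uses all five stages (such as a load).

```python
# Assumed stage times in picoseconds (register read/write = 100, others = 200)
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(stage_ps):
    """Clock period when one instruction passes through every stage in one cycle."""
    return sum(stage_ps.values())


def pipelined_period(stage_ps):
    """Clock period when each stage gets its own cycle: the slowest stage wins."""
    return max(stage_ps.values())


def total_time(n_instructions, stage_ps, pipelined):
    """Time to finish n instructions, ignoring hazards and stalls."""
    if pipelined:
        cycles = len(stage_ps) + n_instructions - 1   # fill the pipe, then one per cycle
        return cycles * pipelined_period(stage_ps)
    return n_instructions * single_cycle_period(stage_ps)


t_single = total_time(3, STAGE_PS, pipelined=False)   # 3 * 800 = 2400 ps
t_pipe = total_time(3, STAGE_PS, pipelined=True)      # 7 * 200 = 1400 ps
```

Under these assumptions the time between instructions drops from 800ps to 200ps, a factor of 4 rather than 5, because the stages are not perfectly balanced.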
Pipeline Speedup
• If all stages are balanced
– i.e., all take the same time
– $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
• If not balanced, speedup is less
• Speedup due to increased throughput
– Latency (time for each instruction) does not decrease
Pipelining and ISA Design
• MIPS ISA designed for pipelining
– All instructions are 32-bits
• Easier to fetch and decode in one cycle
• cf. x86: 1- to 17-byte instructions
– Few and regular instruction formats
• Can decode and read registers in one step
– Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th stage
– Alignment of memory operands
• Memory access takes only one cycle
Structure Hazards
• Conflict for use of a resource
• In MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
– Or separate instruction/data caches
Forwarding (aka Bypassing)
• Use result when it is computed
– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath
Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
– If value not computed when needed
– Can’t forward backward in time!
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4   (stall: uses $t4 right after the load)
sw $t5, 16($t0)

sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
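A simple way to check whether a reordering helps is to count load-use pairs, where a load is immediately followed by an instruction that reads the loaded register. The sketch below assumes one stall cycle per such pair (a pipeline with forwarding). The tuple layout `(op, dest, sources)` and the function name are hypothetical, and the `reordered` list is an illustrative rearrangement of the instructions shown above, not necessarily the one on the slide.

```python
def count_load_use_stalls(program):
    """Count adjacent pairs where a lw result is read by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return stalls


# Each instruction: (opcode, destination register or None, registers read)
fragment = [
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # reads $t4 right after the load
    ("sw", None, ["$t5", "$t0"]),
]

# Illustrative reordering: hoist the load above the independent store
reordered = [
    ("lw", "$t4", ["$t0"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

stalls_before = count_load_use_stalls(fragment)
stalls_after = count_load_use_stalls(reordered)
```

Hoisting the load is legal here because the store writes 12($t0) while the load reads 8($t0), so the two memory accesses do not touch the same word.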
Control Hazards
• Branch determines flow of control
– Fetching next instruction depends on branch outcome
– Pipeline can’t always fetch correct instruction
• Still working on ID stage of branch
• In MIPS pipeline
– Need to compare registers and compute target early in the pipeline
– Add hardware to do it in ID stage
… branch outcome early
– Stall penalty becomes unacceptable
• Predict outcome of branch
– Only stall if prediction is wrong
• In MIPS pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay
More-Realistic Branch Prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
– Hardware measures actual branch behavior
• e.g., record recent history of each branch
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history
Pipeline Summary
The BIG Picture
• Pipelining improves performance by increasing instruction throughput
– Executes multiple instructions in parallel
– Each instruction has the same latency
• Subject to hazards
– Structure, data, control
• Instruction set design affects complexity of
Pipeline registers
• Need registers between stages
– To hold information produced in previous cycle
– “Single-clock-cycle” pipeline diagram
• Shows pipeline usage in a single cycle
• Highlight resources used
– cf. “multi-clock-cycle” diagram
• Graph of operation over time
• We’ll look at “single-clock-cycle” diagrams for load & store
IF for Load, Store, …
ID for Load, Store, …
EX for Load
MEM for Load
WB for Load
Wrong register
Corrected Datapath for Load
EX for Store
MEM for Store
WB for Store
Multi-Cycle Pipeline Diagram
• Form showing resource usage
Multi-Cycle Pipeline Diagram
• Traditional form
Single-Cycle Pipeline Diagram
• State of pipeline in a given cycle
Pipelined Control (Simplified)
Pipelined Control
• Control signals derived from instruction
– As in single-cycle implementation
Pipelined Control
• We can resolve hazards with forwarding
– How do we detect when to forward?
Dependencies & Forwarding
Detecting the Need to Forward
• Pass register numbers along pipeline
– e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
• ALU operand register numbers in EX stage are given by
– ID/EX.RegisterRs, ID/EX.RegisterRt
• Data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
(1a, 1b: Fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
(2a, 2b: Fwd from MEM/WB pipeline reg)
Detecting the Need to Forward
• But only if forwarding instruction will write to a register!
– EX/MEM.RegWrite, MEM/WB.RegWrite
• And only if Rd for that instruction is not $zero
– EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
Forwarding Paths
Forwarding Conditions
• EX hazard
– if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
– if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
• MEM hazard
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
Double Data Hazard
• Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
• Both hazards occur
– Want to use the most recent
• Revise MEM hazard condition
– Only fwd if EX hazard condition isn’t true
Revised Forwarding Condition
• MEM hazard
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
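The EX-hazard and revised MEM-hazard conditions can be combined into one small function. This is a sketch: the function and parameter names are hypothetical Python spellings of the pipeline-register fields, and the value 00 for "no forwarding" is an assumption, since the conditions above only define 10 and 01.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) as 2-bit selector values."""

    def select(src):
        ex_hazard = bool(ex_mem_reg_write) and ex_mem_rd != 0 and ex_mem_rd == src
        mem_hazard = bool(mem_wb_reg_write) and mem_wb_rd != 0 and mem_wb_rd == src
        if ex_hazard:
            return 0b10      # forward from EX/MEM (most recent result)
        if mem_hazard:
            return 0b01      # forward from MEM/WB only if there is no EX hazard
        return 0b00          # assumed encoding for "use the register file value"

    return select(id_ex_rs), select(id_ex_rt)


# Third instruction of: add $1,$1,$2 / add $1,$1,$3 / add $1,$1,$4
# Both earlier adds write $1, so both hazard conditions match on Rs.
forward_a, forward_b = forwarding_unit(
    id_ex_rs=1, id_ex_rt=4,
    ex_mem_reg_write=True, ex_mem_rd=1,
    mem_wb_reg_write=True, mem_wb_rd=1,
)
```

Checking the EX hazard first is what implements the "and not (EX hazard)" clause: when both pipeline registers hold a result for the same register, the newer value in EX/MEM is chosen.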
Datapath with Forwarding
Load-Use Data Hazard
Need to stall for one cycle
Load-Use Hazard Detection
• Check when using instruction is decoded
• If detected, stall and insert bubble
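The detection condition itself is not spelled out in the text above. A commonly used formulation, assumed here, is: the instruction in EX is a load (ID/EX.MemRead) and its destination Rt matches either source register of the instruction being decoded. The function name and the output signal names `pc_write`, `if_id_write`, and `zero_id_ex_controls` are hypothetical.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Detect a load-use hazard while the using instruction is in ID."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "stall": stall,
        "pc_write": not stall,           # hold the PC during a stall
        "if_id_write": not stall,        # hold the IF/ID register during a stall
        "zero_id_ex_controls": stall,    # insert a bubble (nop) into EX
    }


# lw $2, 20($1) is in EX; and $4, $2, $5 is being decoded
signals = hazard_detection(id_ex_mem_read=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
```

When `stall` is true, the using instruction is decoded again in the next cycle, by which time the load has reached MEM and its data can be forwarded.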
How to Stall the Pipeline
• Force control values in ID/EX register to 0
– EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
– Using instruction is decoded again
– Following instruction is fetched again
– 1-cycle stall allows MEM to read data for lw
• Can subsequently forward to EX stage
Stall/Bubble in the Pipeline
Datapath with Hazard Detection
Stalls and Performance
The BIG Picture
• Stalls reduce performance
– But are required to get correct results
• Compiler can arrange code to avoid hazards and stalls
– Requires knowledge of the pipeline structure
Reducing Branch Delay
• Move hardware to determine outcome to ID stage
– Target address adder
– Register comparator
• Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
Example: Branch Taken
Example: Branch Taken
Data Hazards for Branches
• If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
Data Hazards for Branches
• If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
– Need 1 stall cycle
Data Hazards for Branches
• If a comparison register is a destination of immediately preceding load instruction
– Need 2 stall cycles
Dynamic Branch Prediction
• In deeper and superscalar pipelines, branch penalty is more significant
• Use dynamic prediction
– Branch prediction buffer (aka branch history table)
– Indexed by recent branch instruction addresses
– Stores outcome (taken/not taken)
– To execute a branch
• Check table, expect the same outcome
• Start fetching from fall-through or target
• If wrong, flush pipeline and flip prediction
1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!
outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer
– Mispredict as taken on last iteration of inner loop
– Then mispredict as not taken on first iteration of inner loop next time around
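The two mispredictions per pass can be reproduced with a short simulation of a 1-bit predictor on the inner-loop branch. This is a sketch under stated assumptions: the function names are hypothetical, the branch is taken on every iteration except the last of each pass, and the predictor's initial state is a parameter.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of a 1-bit predictor over a sequence of outcomes."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
            prediction = taken    # 1-bit predictor: remember the last outcome
    return misses


def inner_branch_outcomes(inner_iters, outer_iters):
    """Inner-loop branch: taken (loop back) except on the last iteration of each pass."""
    one_pass = [True] * (inner_iters - 1) + [False]
    return one_pass * outer_iters


outcomes = inner_branch_outcomes(inner_iters=10, outer_iters=5)
misses = one_bit_mispredictions(outcomes)             # 2 per pass, 10 in total
misses_warm = one_bit_mispredictions(outcomes, True)  # first pass has only 1 miss
```

Each exit from the inner loop flips the stored bit to "not taken", which then causes a second miss on re-entry, so a branch that is taken 90% of the time is predicted correctly only 80% of the time.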
Calculating the Branch Target
• Even with predictor, still need to calculate the target address
– 1-cycle penalty for a taken branch
• Branch target buffer
– Cache of target addresses
– Indexed by PC when instruction fetched
• If hit and instruction is branch predicted taken, can fetch target immediately
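A branch target buffer can be modeled as a lookup from the fetch PC to a cached target plus a prediction bit. The sketch below uses a Python dict in place of a fixed-size hardware cache, which is a simplification; the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Illustrative BTB: maps a branch PC to (target address, predicted taken)."""

    def __init__(self):
        self._entries = {}

    def next_fetch(self, pc):
        """Address to fetch after `pc`: cached target on a predicted-taken hit, else pc + 4."""
        entry = self._entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]
        return pc + 4

    def update(self, pc, target, taken):
        """Record the resolved target and outcome of the branch at `pc`."""
        self._entries[pc] = (target, taken)


btb = BranchTargetBuffer()
first = btb.next_fetch(40)           # miss: fall through to 44
btb.update(40, target=72, taken=True)
second = btb.next_fetch(40)          # hit, predicted taken: fetch 72 immediately
```

On a hit with a taken prediction, the target is available in the fetch stage itself, which is what removes the 1-cycle penalty for computing the address.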
Exceptions and Interrupts
• “Unexpected” events requiring change in flow of control
– Different ISAs use the terms differently
• Exception
– Arises within the CPU
• e.g., undefined opcode, overflow, syscall, …
• Interrupt
– From an external I/O controller
• Dealing with them without sacrificing performance is hard
• Save PC of offending (or interrupted) instruction
– In MIPS: Exception Program Counter (EPC)
• Save indication of the problem
– In MIPS: Cause register
– We’ll assume 1-bit
• 0 for undefined opcode, 1 for overflow
• Jump to handler at 8000 0180
• Instructions either
– Deal with the interrupt, or
– Jump to real handler