Advanced Computer Architecture - Lecture 11: Computer hardware design. This lecture will cover the following: pipeline and instruction level parallelism; structural hazards; data hazards; control hazards; pipelining the R-type and load instruction; branch prediction; multiple streams;...
CS 704
Advanced Computer Architecture
Lecture 11
Computer Hardware Design
(Pipeline and Instruction Level Parallelism)
Prof Dr M Ashraf Chughtai
Key components of pipeline data path
Performance enhancement due to pipeline
Introduction to hazards in pipelined datapath
Structural Hazards
An attempt to use the same resource in two different ways at the same time, e.g.,
a single memory port accessed for both an instruction fetch and a data read in the same clock cycle is a structural hazard
Example: next slide
MAC/VU-Advanced Computer Architecture, Lecture 11 – Computer Hardware Design (5)
Single Memory is a Structural Hazard
Two memory read operations occur in the 4th cycle:
the LOAD instruction accesses memory to read data while the 4th instruction is fetched from the same memory
Single Memory is a Structural Hazard
Insert a stall (bubble) to avoid the memory conflict
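The single-memory conflict can be checked with a short script. This is a minimal sketch, assuming an ideal 5-stage pipeline that fetches one instruction per cycle with no stalls, and that only loads use the data side of memory; the function name `memory_conflicts` and the instruction labels are illustrative, not from the lecture.

```python
# Assumed 5-stage order; one instruction is fetched per cycle and nothing stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(program):
    """Return (cycle, load_index, fetch_index) for every cycle in which a
    load's MEM stage and another instruction's IF stage need the one memory port."""
    conflicts = []
    for i, op in enumerate(program):
        if op != "load":
            continue
        mem_cycle = (i + 1) + STAGES.index("MEM")  # instruction i is fetched in cycle i + 1
        fetch_index = mem_cycle - 1                # instruction fetched in that same cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

# Hypothetical program: a load followed by four non-memory instructions.
program = ["load", "alu", "alu", "alu", "alu"]
for cycle, load_i, fetch_i in memory_conflicts(program):
    print(f"cycle {cycle}: instruction {load_i + 1} (MEM) and "
          f"instruction {fetch_i + 1} (IF) both need memory")
```

For this program the only conflict is in cycle 4, between the load and the 4th instruction, which is the case the slide describes.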
A structural hazard also exists when
the single write port of the register file is accessed for two WB operations in the same clock cycle.
This situation does not arise when every instruction uses all 5 stages, but it may arise when 4-stage and 5-stage instructions are mixed in the pipeline
Explanation next...
Pipelining the Load Instruction
The five independent functional units in the pipeline datapath are: Inst Fetch, Dec/Reg Rd, ALU for Exec, Data Mem, and the Register File’s write port for the Wr stage
Here, the register file has separate read and write ports, so a register read and a register write are allowed at the same time
Each functional unit is used once per instruction
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr
The Four Stages of R-type
An R-type instruction does not access data memory,
so it takes only 4 clocks, i.e. 4 stages, to complete
          Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type    Ifetch   Reg/Dec  Exec     Wr
Pipelining the R-type and Load Instruction
We have a pipeline conflict, or structural hazard:
– Two instructions try to write to the register file at the same time!
– Only one write port
For example, an R-type instruction fetched one cycle after a load reaches its Wr stage in the same cycle as the load.
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type    Ifetch   Reg/Dec  Exec     Wr
R-type             Ifetch   Reg/Dec  Exec     Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
Each functional unit must be used at the same stage for all instructions:
– Load uses the Register File’s write port during its 5th stage
– R-type uses the Register File’s write port during its 4th stage
Solution 1: Insert “Bubble” into the Pipeline
Insert a “bubble” into the pipeline to prevent two writes in the same cycle
– The control logic can be complex.
– Lose instruction fetch and issue opportunity.
No instruction is started in Cycle 6!
[Pipeline diagram, Cycles 1 to 9: a mix of R-type and Load instructions, with a pipeline bubble inserted so that no two instructions reach Wr in the same cycle]
Solution 2: Delay R-type’s register write by one cycle:
– Now R-type instructions also use the Reg File’s write port at Stage 5
– For R-type, the Mem stage is a NO-OP stage: nothing is being done.
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type    Ifetch   Reg/Dec  Exec     Mem      Wr
R-type             Ifetch   Reg/Dec  Exec     Mem      Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                        Ifetch   Reg/Dec  Exec     Mem      Wr
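The write-port conflict, and the effect of padding R-type instructions to five stages, can be sketched as follows. This is an illustration only: it assumes one instruction is fetched per cycle with no stalls, and the instruction sequence (R-type, R-type, Load, R-type, R-type) and the names `write_port_conflicts`, `MIXED` and `PADDED` are assumptions made for the example.

```python
from collections import defaultdict

# Stage lists per instruction type. MIXED lets R-type finish in 4 stages;
# PADDED gives R-type a do-nothing Mem stage so every instruction writes in stage 5.
MIXED = {
    "rtype": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}
PADDED = {
    "rtype": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}

def write_port_conflicts(program, stages):
    """Map each cycle with more than one register write to the instructions involved.
    Instruction i (0-based) is assumed to be fetched in cycle i + 1."""
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        wr_cycle = (i + 1) + stages[kind].index("Wr")
        writers[wr_cycle].append(i)
    return {cycle: ins for cycle, ins in writers.items() if len(ins) > 1}

program = ["rtype", "rtype", "load", "rtype", "rtype"]  # assumed sequence
print("4/5-stage mix:", write_port_conflicts(program, MIXED))
print("uniform 5-stage:", write_port_conflicts(program, PADDED))
```

With the mixed stage lists, the load and the R-type instruction fetched right after it both write in cycle 7; with the padded stage lists each instruction writes in its own cycle, at the cost of one extra cycle of latency per R-type instruction.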
Eliminating Structural Hazards?
Structural hazards can be eliminated or
minimized by either using the stall operation
or adding multiple functional units
[Figure: program flow, two instructions each passing through IFetch, Dcd, Exec, Mem, WB]
Example: Dual-port vs Single-port
Machine A: Dual ported memory
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth / (1 + 0) x (clockunpipe / clockpipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clockunpipe / (clockunpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
Machine A is 1.33 times faster
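The arithmetic of this example can be reproduced in a few lines. This is a sketch under the slide's own numbers; the pipeline depth of 5 is an assumption that cancels out in the ratio, and `pipeline_speedup` is an illustrative name.

```python
def pipeline_speedup(depth, stall_cycles_per_instr, clock_ratio=1.0):
    """Speedup = depth / (1 + stall cycles per instruction) x (unpipelined cycle / pipelined cycle).
    clock_ratio expresses how much faster this machine's clock is than the reference."""
    return depth / (1.0 + stall_cycles_per_instr) * clock_ratio

depth = 5  # assumed depth; any value gives the same A/B ratio

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(depth, 0.0)
# Machine B: 40% of instructions are loads, each costing 1 stall cycle; clock is 1.05x faster.
speedup_b = pipeline_speedup(depth, 0.4 * 1, clock_ratio=1.05)

print(f"SpeedUpA = {speedup_a:.2f}")
print(f"SpeedUpB = {speedup_b:.2f}")
print(f"A / B    = {speedup_a / speedup_b:.2f}")
```

The ratio depends only on the load fraction and the clock advantage (1.4 / 1.05), not on the pipeline depth.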
Stall degrades the performance
Here is an example:
Suppose data reference instructions constitute 40% of the mix, and the processor with the structural hazard has a clock rate 1.05
times higher than the processor without the hazard
Average instruction time = CPI x clock cycle time
 = (1 + 0.4 x 1) x (ideal clock cycle time / 1.05)
 = (1.4 / 1.05) x ideal clock cycle time
 = 1.3 x ideal clock cycle time
The processor without the structural hazard is 1.3 times faster than the processor with the structural hazard
Additional Functional Units increase cost
The memory structural hazard is removed by
- using two cache memory units:
- Instruction memory
- Data memory
Two write ports in the register file allow a mix of 4-stage and 5-stage instructions in the pipeline
Data Hazards
An attempt to use an item before it is ready; e.g.,
one sock of a pair is in the dryer and one is in the
washer; you can’t fold until the sock in the
washer has gone through the dryer
An instruction depends on the result of a prior instruction that is still in the pipeline
Pipelining changes the relative timing of
instructions by overlapping their execution
This overlap introduces data and control
hazards
A data hazard occurs when the order of operand
reads and writes is changed vis-à-vis sequential access
to the operands, in the presence of a data
dependency between the instructions
Let us consider an example...
Example Data Hazard on R1
Data Hazard: dependencies backwards in time are hazards
[Pipeline diagram: Add, Sub, And, Or R8, R1, R9 and Xor R10, R1, R11 passing through the pipeline stages; the instructions after Add read R1]
The Add instruction provides its result to Sub after 3 cycles, to
And after 2 cycles, and to Or after 1 clock cycle
Data Hazard Solution #1 - Stall
Insert stall cycles after the next instruction's IF and decode, before its register read
[Pipeline diagram over time (clock cycles) showing the stall]
Data Hazard Solution - Forwarding
“Forward” the result from one stage to another:
from the EX/MEM pipeline register to the ALU stage of Sub, and
from the MEM/WB pipeline register to the ALU stage of And
[Pipeline diagram: the instruction sequence ending with or r8, r1, r9 and xor r10, r1, r11]
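The choice of forwarding source for one ALU operand can be sketched as a small selection function. This is a minimal sketch in the style of a textbook MIPS forwarding unit, not the lecture's own logic: the dictionary fields `reg_write` and `rd`, the return labels, and the treatment of register 0 as a constant zero are all assumptions made for illustration.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand comes from.

    ex_mem / mem_wb describe the instruction currently held in that pipeline
    register: 'reg_write' (does it write a register?) and 'rd' (destination number).
    The newest producer (EX/MEM) wins over the older one (MEM/WB).
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"

# Assumed sequence: Add writes r1; Sub (writes r4) and then And both read r1.
add_result = {"reg_write": True, "rd": 1}
sub_result = {"reg_write": True, "rd": 4}
empty = {"reg_write": False, "rd": 0}

# Sub in its ALU stage: Add is one stage ahead, in EX/MEM.
print(forward_select(1, ex_mem=add_result, mem_wb=empty))
# And in its ALU stage: Sub is in EX/MEM, Add has moved on to MEM/WB.
print(forward_select(1, ex_mem=sub_result, mem_wb=add_result))
```

Checking EX/MEM first matters when two in-flight instructions write the same register: the operand must come from the most recent producer.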
Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards
In this case, the hazard can't be solved with forwarding alone:
the instruction that depends on the load must be delayed/stalled
lw r1, 0(r2)
sub r4, r1, r3
[Pipeline diagram over time (clock cycles): lw and sub passing through IF, ID/RF and the later stages]
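The load-use case can be sketched as a pass that inserts one bubble whenever an instruction reads the register loaded by the instruction immediately before it. This is an illustration, assuming forwarding from MEM/WB is available so that a single bubble is enough; the tuple layout `(op, dest, sources)` and the name `insert_load_use_stalls` are made up for the example.

```python
NOP = ("nop", None, ())

def insert_load_use_stalls(program):
    """program: list of (op, dest, sources) tuples in issue order.
    Returns a new list with one bubble after each load whose result
    is read by the very next instruction."""
    out = []
    for op, dest, sources in program:
        previous = out[-1] if out else None
        if previous is not None and previous[0] == "lw" and previous[1] in sources:
            out.append(NOP)  # the loaded value is not ready for the next ALU stage
        out.append((op, dest, sources))
    return out

program = [
    ("lw", "r1", ("r2",)),        # lw  r1, 0(r2)
    ("sub", "r4", ("r1", "r3")),  # sub r4, r1, r3
]
for instr in insert_load_use_stalls(program):
    print(instr[0])
```

An instruction that does not read the loaded register is left where it is, which is why compilers try to place an independent instruction directly after a load.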
Branch Taken: the branch changes the PC from PC+4 to the new target
Branch Not Taken: the branch leaves the PC at
PC+4
The simple way to deal with a branch is to:
freeze the pipeline, holding any
instruction after the branch
instruction, and
flush the pipeline to delete the
instructions after the branch if,
once the condition is evaluated, the branch
is to be taken
Control Hazard: Example BEQ Taken
Explanation: Branch Hazard
Here, if the BEQ is taken, the PC must change to the target address,
but the next instruction (LOAD) is already fetched while the BEQ is in its ID
stage, i.e., before PC+4 is changed to the new target address
This gives rise to a Branch or Control Hazard
See example on next slide...
Dealing with Branches
Reducing the Number of Stalls
Extra hardware to evaluate the condition at the end of the ID
stage of the BEQ instruction
Reducing the Number of Stalls
Here, you can see that if we move the decision up to the end of the ID stage (2nd stage)
by adding hardware to compare the registers being read, the number of stalls reduces to 2 clock cycles per branch instruction
It can be further reduced to 1 in the case of BEQZ or BNEZ, if the register is tested for zero after Instruction Fetch
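The effect of the per-branch stall count on overall CPI can be estimated with the usual formula, effective CPI = ideal CPI + branch fraction x stall cycles per branch. The sketch below uses the stall counts from the slide (2 and 1); the 20% branch fraction is a hypothetical value chosen only for illustration, and `effective_cpi` is an illustrative name.

```python
def effective_cpi(branch_fraction, stall_cycles, ideal_cpi=1.0):
    """Average cycles per instruction when every branch costs stall_cycles extra cycles."""
    return ideal_cpi + branch_fraction * stall_cycles

BRANCH_FRACTION = 0.2  # hypothetical: 20% of executed instructions are branches

for stalls in (2, 1):
    cpi = effective_cpi(BRANCH_FRACTION, stalls)
    print(f"{stalls} stall cycle(s) per branch -> CPI = {cpi:.2f}")
```

Under this assumed branch fraction, each stall cycle removed from the branch path lowers the CPI by 0.2.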
Solution #2: Redo Fetch after Branch
We know that once a branch has been detected during the Instruction Decode / Register Read stage, the next instruction fetch cycle should essentially be a stall, if
we assume that the branch is taken
Next slide please...
Solution #2: Redo Fetch after Branch
However, the instruction fetched in this cycle
never performs useful work, and is ignored
Therefore, re-fetching the branch successor
instruction provides the correct instruction.
Indeed, the second fetch is not essential if the branch is not taken
Impact: 1 clock cycle per branch instruction if the branch is
untaken
Solution #3: Delayed Branch – S/W method
[Pipeline diagram: instruction order (Add, Beq, Misc) against time (clock cycles)]
Redefine branch behavior to take place after the next instruction, by placing another instruction (possibly a NO-OP) after the branch that is always executed
Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (about 50% of the time)
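How a compiler might fill the delay slot can be sketched for the simplest case: moving the instruction just before the branch into the slot when the branch does not depend on it, and falling back to a no-op otherwise. This is a simplified illustration, not a full scheduler; the tuple layout `(op, dest, sources)` and the name `fill_delay_slot` are assumptions, and only the instruction immediately preceding the branch is considered.

```python
NOP = ("nop", None, ())

def fill_delay_slot(block):
    """block: list of (op, dest, sources) tuples whose last entry is the branch.
    Returns the block with a delay slot after the branch, filled with the
    preceding instruction when the branch does not read that instruction's result."""
    *body, branch = block
    if body and body[-1][1] not in branch[2]:
        *rest, candidate = body
        return rest + [branch, candidate]  # candidate now executes in the slot
    return body + [branch, NOP]            # nothing safe to move: waste the slot

independent = [("add", "r1", ("r2", "r3")), ("beq", None, ("r4", "r5"))]
dependent = [("add", "r1", ("r2", "r3")), ("beq", None, ("r1", "r5"))]

print([i[0] for i in fill_delay_slot(independent)])
print([i[0] for i in fill_delay_slot(dependent)])
```

In the first block the add moves into the slot and the branch costs no extra cycle; in the second the branch reads r1, so the slot holds a no-op and one cycle is lost.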
The two possible predictions are:
- Predict branch not taken
- Predict branch taken
Branch Prediction Flowchart
1 Predict – Branch not taken
- This scheme is implemented by treating
every branch as not taken
- So the processor continues to fetch the instructions after the branch
as if the branch were a normal instruction
[Figure: sequence when branch is not taken]
1 Predict – Branch not taken (cont'd)
- When the decision has been made and the
branch is taken, the fetched instructions are turned into NO-OPs and fetch is restarted
at the target address
[Figure: sequence when branch is taken]
An alternative way is to treat every branch as Branch taken
As soon as the target address is computed, we assume that the branch is to be taken and start fetching and executing
at the target
In a five stage pipeline the target address and condition
evaluation are available at the same time, so this technique
2 Predict – Branch taken
Here, the branch is taken 1000 times, so the
prediction “Branch Taken” fails only about 1 time in 1000; hence
there is no stall for the 1000 taken executions
Further, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware choice
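The two static predictions can be compared on a loop-closing branch with a few lines of code. This is a sketch under stated assumptions: the branch is taken 1000 times and then falls through once on loop exit, a correct prediction costs no stall, and each misprediction costs a fixed penalty (1 cycle here). The name `stall_cycles` and the penalty value are illustrative.

```python
def stall_cycles(outcomes, predict_taken, penalty=1):
    """Total stall cycles for a fixed (static) prediction.
    outcomes: one bool per execution of the branch, True meaning taken."""
    mispredicted = sum(1 for taken in outcomes if taken != predict_taken)
    return mispredicted * penalty

# Assumed loop branch: taken 1000 times, then not taken once when the loop exits.
loop_branch = [True] * 1000 + [False]

print("predict taken:    ", stall_cycles(loop_branch, predict_taken=True))
print("predict not taken:", stall_cycles(loop_branch, predict_taken=False))
```

The zero-cost correct prediction is an idealization: it only holds if the branch target is known early enough, which a simple five-stage pipeline does not guarantee for predict-taken.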
Solution #5 Multiple Streams
Have two pipelines
Pre-fetch each branch into a separate pipeline
Use appropriate pipeline
Results
Leads to bus & register contention
Multiple branches lead to further pipelines
Solution# 5 Pre-fetch Branch Target
The target of the branch is pre-fetched in
addition to the instructions following the
branch
Keep the target until the branch is executed
Used by IBM 360/91
Types of hazards in pipelined datapath
Structural hazards occur when the same
resource is accessed by more than one
instruction at the same time,
e.g., one memory port or one register write port
They can be removed either by using multiple resources or by inserting stalls
Stalls degrade the pipeline performance
Summary – 4 ways to handle control hazard
1: Stall until branch direction is clear
2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– PC+4 already calculated, so use it to get next instruction
3: Predict Branch Taken
4: Delayed Branch
– Define branch to take place AFTER a following instruction
– 1 slot delay allows proper decision and branch target address in 5 stage pipeline
and ALLAH Hafiz