Advanced Computer Architecture - Lecture 11: Computer hardware design

This lecture covers the following: pipelining and instruction-level parallelism; structural hazards; data hazards; control hazards; pipelining the R-type and load instructions; branch prediction; multiple streams;...

CS 704

Advanced Computer Architecture

Lecture 11

Computer Hardware Design

(Pipeline and Instruction Level Parallelism)

Prof Dr M Ashraf Chughtai

Key components of pipeline data path

Performance enhancement due to pipelining

Introduction to hazards in the pipelined datapath

Structural Hazards

An attempt to use the same resource in two different ways at the same time

For example, accessing a single memory port for both an instruction fetch and a data read in the same clock cycle is a structural hazard

… Example: next slide

MAC/VU-Advanced

Computer Architecture Lecture 11 –Computer Hardware Design (5) 5

Single Memory is a Structural Hazard

Two memory read operations occur in the 4th cycle: the LOAD instruction accesses memory to read its data while the 4th instruction is being fetched from the same memory
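The fourth-cycle collision described above can be reproduced with a small cycle-counting sketch. It assumes one instruction is issued per cycle starting in cycle 1, that every instruction uses the memory port in its fetch stage, and that only loads and stores use it again in the MEM stage. The function name `memory_conflicts` and the instructions following the load are hypothetical, chosen only for illustration.

```python
# Stage order of the 5-stage pipeline discussed in the lecture.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(instructions):
    """Return {cycle: [(instruction index, stage), ...]} for every cycle in
    which a single-ported memory is requested more than once.

    Assumption: instruction i is fetched in cycle i + 1 and advances one
    stage per cycle with no stalls.
    """
    users = {}
    for i, op in enumerate(instructions):
        fetch_cycle = i + 1
        users.setdefault(fetch_cycle, []).append((i, "IF"))
        if op in ("lw", "sw"):  # only data references touch memory in MEM
            mem_cycle = fetch_cycle + STAGES.index("MEM")
            users.setdefault(mem_cycle, []).append((i, "MEM"))
    return {c: u for c, u in users.items() if len(u) > 1}

# A load followed by three hypothetical ALU instructions.
conflicts = memory_conflicts(["lw", "add", "sub", "and"])
print(conflicts)  # {4: [(0, 'MEM'), (3, 'IF')]}
```

The only conflicting cycle is cycle 4, where the load's data access meets the fetch of the 4th instruction.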

Single Memory is a Structural Hazard

Insert a stall (bubble) to avoid the memory structural hazard

A structural hazard also exists when the single write port of the register file is accessed for two WB operations in the same clock cycle

This situation does not arise when every instruction uses the 5-stage pipeline, but it may arise when 4-stage and 5-stage instructions are mixed in a multi-cycle pipeline

Explanation next …

Pipelining the Load Instruction

The five independent functional units in the pipeline datapath are: instruction fetch (Ifetch), decode/register read (Reg/Dec), the ALU for Exec, data memory (Mem), and the Register File's write port for the Wr stage

Here, the register file has separate read and write ports, so a register read and a register write are allowed at the same time

Each functional unit is used once per instruction

Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr

The Four Stages of R-type

An R-type instruction does not access data memory, so it takes only 4 clocks, i.e., 4 stages, to complete

          Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type    Ifetch   Reg/Dec  Exec     Wr

Pipelining the R-type and Load Instructions

We have a pipeline conflict, or structural hazard:

– Two instructions try to write to the register file at the same time!

– Only one write port

[Figure: pipeline diagram over clock Cycles 1 to 9, in which a Load (Ifetch, Reg/Dec, Exec, Mem, Wr) overlaps R-type instructions]
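The write-port conflict above can be checked mechanically. The following is a minimal sketch, assuming one instruction issues per cycle and that each instruction writes the register file in its `Wr` stage; the function name `write_port_conflicts` and the particular instruction mix are hypothetical.

```python
# Stage sequences as given in the lecture: a load takes five stages,
# an R-type instruction skips Mem and takes four.
STAGE_SEQ = {
    "lw":    ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "rtype": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
}

def write_port_conflicts(program, stage_seq=STAGE_SEQ):
    """Return {cycle: [instruction indices]} for cycles with more than one
    register-file write.

    Assumption: instruction i is fetched in cycle i + 1 and never stalls.
    """
    writers = {}
    for i, kind in enumerate(program):
        wr_cycle = (i + 1) + stage_seq[kind].index("Wr")
        writers.setdefault(wr_cycle, []).append(i)
    return {c: idx for c, idx in writers.items() if len(idx) > 1}

# Hypothetical mix: two R-types, a load, then two more R-types.
mixed = write_port_conflicts(["rtype", "rtype", "lw", "rtype", "rtype"])
print(mixed)  # {7: [2, 3]}
```

The load (index 2) and the R-type issued right after it (index 3) both reach `Wr` in cycle 7, which a single write port cannot serve.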

Each functional unit must be used at the same stage for all instructions:

– Load uses the Register File's write port during its 5th stage

– R-type uses the Register File's write port during its 4th stage

Solution 1: Insert “Bubble” into the Pipeline

Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle

– The control logic can be complex.

– Lose an instruction fetch and issue opportunity.

No instruction is started in Cycle 6!

[Figure: pipeline diagram showing a Load (Ifetch, Reg/Dec, Exec, Mem, Wr) and R-type instructions (Ifetch, Reg/Dec, Exec, Wr), with a pipeline bubble inserted so that no two instructions write in the same cycle]

Solution 2: Delay R-type’s register write by one cycle:

– Now R-type instructions also use the Reg File’s write port at Stage 5

– For R-type instructions the Mem stage is a NO-OP stage: nothing is done in it.

Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
R-type    Ifetch   Reg/Dec  Exec     Mem      Wr
Load               Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Mem      Wr
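A short sketch can confirm that padding the R-type to five stages removes the write-port collision. It assumes one instruction issues per cycle and that the register write happens in an instruction's last stage; the helper names `wr_cycles` and `has_conflict` are hypothetical.

```python
def wr_cycles(program, stages_per_kind):
    """Cycle in which each instruction writes the register file.

    Assumption: instruction i is fetched in cycle i + 1 and its register
    write happens in its last stage.
    """
    return [(i + 1) + stages_per_kind[kind] - 1 for i, kind in enumerate(program)]

def has_conflict(cycles):
    """True if two instructions write in the same cycle."""
    return len(set(cycles)) < len(cycles)

program = ["rtype", "lw", "rtype", "rtype"]

before = wr_cycles(program, {"rtype": 4, "lw": 5})  # R-type writes in stage 4
after = wr_cycles(program, {"rtype": 5, "lw": 5})   # R-type write delayed to stage 5

print(before, has_conflict(before))  # [4, 6, 6, 7] True
print(after, has_conflict(after))    # [5, 6, 7, 8] False
```

Once every instruction writes in stage 5, the write cycles are strictly increasing, so no two instructions can ever compete for the single write port.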

Eliminating Structural Hazards?

Structural hazards can be eliminated or minimized either by using the stall operation or by adding multiple functional units

[Figure: program flow through the IFetch, Dcd, Exec, Mem and WB stages]

Example: Dual-port vs Single-port

Machine A: Dual-ported memory

Machine B: Single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate

Ideal CPI = 1 for both

Loads are 40% of instructions executed

SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth

SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth

SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster
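The arithmetic of this example can be reproduced in a few lines. This is only a sketch of the speedup formula used above; the function name `pipeline_speedup` and the pipeline depth of 5 are assumptions (the depth cancels out in the final ratio).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall cycles per instruction) x clock ratio.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle),
    taken as 1.0 for Machine A as in the worked example.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio

DEPTH = 5  # hypothetical depth; any positive value gives the same ratio

speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)                        # dual-ported
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)  # single-ported

ratio = speedup_a / speedup_b
print(round(speedup_b / DEPTH, 2), round(ratio, 2))  # 0.75 1.33
```

The 5% faster clock of Machine B does not compensate for the one-cycle stall that 40% of its instructions incur.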

Stall degrades the performance

Here is an example:

Suppose data reference instructions constitute 40% of the mix, and the processor with the structural hazard has a clock rate 1.05 times higher than the processor without the hazard

Average instruction time = CPI x clock cycle time
= (1 + 0.4 x 1) x (ideal clock cycle time / 1.05)
= (1.4 / 1.05) x ideal clock cycle time
= 1.33 x ideal clock cycle time

The processor without the structural hazard is therefore about 1.33 times faster than the one with the structural hazard

Additional Functional Units increase cost

The memory structural hazard is removed by

- using two cache memory units:

- Instruction memory

- Data memory

Two write ports in the register file allow 4-stage and 5-stage instructions to be mixed in the pipeline

Data Hazards

An attempt to use an item before it is ready; e.g., one sock of a pair is in the dryer and the other is in the washer, so the pair can't be folded until the second sock has gone from the washer through the dryer

An instruction depends on the result of a prior instruction that is still in the pipeline

Pipelining changes the relative timing of instructions by overlapping their execution

This overlap introduces data and control hazards

A data hazard occurs when the order of operand reads and writes is changed vis-à-vis the sequential access to the operands, which gives rise to a data dependency problem

Let us consider an example …

Example Data Hazard on R1

Data Hazard: dependencies backwards in time are hazards

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) of a sequence of instructions that use R1, ending with:]

Or R8, R1, R9

Xor R10, R1, R11

The Add instruction provides its result to Sub after 3 cycles, to And after 2 cycles, and to Or after 1 clock cycle
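One way to read the 3, 2 and 1 cycle figures above is as the number of cycles by which each dependent instruction reads R1 too early. The sketch below reproduces those figures under stated assumptions: instruction i is fetched in cycle i + 1, reads its registers in ID (cycle i + 2), writes in WB (cycle i + 5), and a value written in WB can only be read in the following cycle. The function name `cycles_too_early` and this timing model are assumptions, not taken from the slide.

```python
def cycles_too_early(producer_index, consumer_index):
    """Cycles by which the consumer's register read precedes the moment the
    producer's result becomes readable (0 if there is no hazard).

    Assumed timing: fetch in cycle i + 1, register read in cycle i + 2,
    write-back in cycle i + 5, value readable from the cycle after WB.
    """
    readable_from = (producer_index + 5) + 1
    read_cycle = consumer_index + 2
    return max(0, readable_from - read_cycle)

# Add (index 0) produces R1; every later instruction reads R1.
program = ["add", "sub", "and", "or", "xor"]
gaps = {op: cycles_too_early(0, i) for i, op in enumerate(program) if i > 0}
print(gaps)  # {'sub': 3, 'and': 2, 'or': 1, 'xor': 0}
```

Under a different register-file assumption (write in the first half of a cycle, read in the second half) each figure would drop by one.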

Data Hazard Solution #1 - Stall

Stall cycles are inserted after the next instruction's IF and decode, before the register read

[Figure: pipeline diagram over time (clock cycles) showing the stall]

Data Hazard Solution - Forwarding

“Forward” the result from one stage to another:

from the EX/MEM pipeline register to the Sub ALU stage, and from the MEM/WB pipeline register to the And ALU stage

or r8, r1, r9

xor r10, r1, r11
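The choice of forwarding source above depends only on how far the consumer is behind the producer. The following sketch makes that rule explicit; it assumes ALU-type producers only (loads are a separate case), and the tuple encoding, the function name `plan_forwarding`, and the operands of the `add`, `sub` and `and` instructions are hypothetical.

```python
def plan_forwarding(program):
    """List (consumer op, register, pipeline register) for each source operand
    that has to be forwarded.

    Each instruction is (op, dest, src1, src2). A producer one instruction
    back is forwarded from EX/MEM, two back from MEM/WB. Older producers are
    assumed to be readable from the register file, so they are not listed.
    """
    plan = []
    for i, (op, _dest, *srcs) in enumerate(program):
        for src in srcs:
            for distance, source in ((1, "EX/MEM"), (2, "MEM/WB")):
                if i - distance >= 0 and program[i - distance][1] == src:
                    plan.append((op, src, source))
                    break  # the nearest producer holds the newest value
    return plan

program = [
    ("add", "r1", "r2", "r3"),    # operands assumed for illustration
    ("sub", "r4", "r1", "r3"),
    ("and", "r6", "r1", "r7"),    # operands assumed for illustration
    ("or",  "r8", "r1", "r9"),
    ("xor", "r10", "r1", "r11"),
]
plan = plan_forwarding(program)
print(plan)  # [('sub', 'r1', 'EX/MEM'), ('and', 'r1', 'MEM/WB')]
```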

Forwarding (or Bypassing): What about Loads?

Dependencies backwards in time are hazards

In this case we can't solve the hazard with forwarding: an instruction that depends on a load must be delayed (stalled)

lw r1, 0(r2)

sub r4, r1, r3

[Figure: pipeline diagram (IF, ID/RF, ...) of the lw and sub instructions]
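The load-use case above is easy to detect by comparing each load's destination with the sources of the instruction that follows it. This is a minimal sketch, assuming forwarding is otherwise available so that exactly one stall cycle is needed per load-use pair; the tuple encoding and the function name `load_use_stalls` are hypothetical.

```python
def load_use_stalls(program):
    """Count the stall cycles needed for load-use hazards.

    Each instruction is (op, dest, src, ...). Assumption: with forwarding,
    an instruction that uses a load's result immediately after the load
    needs one stall cycle, because the loaded value only exists after MEM.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2:]:
            stalls += 1
    return stalls

hazard = [("lw", "r1", "r2"), ("sub", "r4", "r1", "r3")]
# Hypothetical reordering: an independent instruction fills the gap.
no_hazard = [("lw", "r1", "r2"), ("add", "r5", "r6", "r7"), ("sub", "r4", "r1", "r3")]

print(load_use_stalls(hazard), load_use_stalls(no_hazard))  # 1 0
```

The second sequence shows why compilers try to schedule an independent instruction directly after a load.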

Branch Taken: the branch changes the PC from PC+4 to the new target address

Branch Not Taken: the branch leaves the PC at PC+4

The simple way to deal with a branch is to:

freeze the pipeline, holding any instruction after the branch instruction, and

flush the pipeline to delete the instructions after the branch if, once the condition is evaluated, the branch is to be taken

Control Hazard: Example BEQ Taken

Explanation: Branch Hazard

Here, if the BEQ is taken, PC+4 is changed to the new target address, but the next instruction (LOAD) is fetched while the BEQ is in its ID stage, i.e., before PC+4 is changed to the new target address

This gives rise to a branch, or control, hazard

See the example on the next slide …

Dealing with Branches

Reducing the Number of Stalls

Extra hardware evaluates the condition at the end of the ID stage of the BEQ instruction

Reducing the Number of Stalls (cont’d)

Here, you can see that if we move the decision up to the end of the ID stage (2nd stage) by adding hardware to compare the registers as they are read, the stall reduces to 2 clock cycles per branch instruction

It can be further reduced to 1 in the case of BEQZ or BNEZ if the zero register is tested after Instruction Fetch
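To see what these per-branch stall figures mean for overall performance, the stall can be folded into an effective CPI. The sketch below assumes an ideal CPI of 1 and a branch frequency of 20%; that frequency and the function name `effective_cpi` are hypothetical and are not taken from the lecture.

```python
def effective_cpi(branch_fraction, stall_cycles_per_branch, ideal_cpi=1.0):
    """Effective CPI = ideal CPI + branch frequency x stall cycles per branch."""
    return ideal_cpi + branch_fraction * stall_cycles_per_branch

BRANCH_FRACTION = 0.2  # hypothetical share of branches in the instruction mix

# Stall figures quoted above: 2 cycles per branch, reduced to 1.
cpi_by_stall = {s: effective_cpi(BRANCH_FRACTION, s) for s in (2, 1)}
for stall, cpi in cpi_by_stall.items():
    print(f"{stall} stall cycle(s) per branch -> CPI {cpi:.2f}")
```

With these assumed numbers, halving the per-branch stall lowers the CPI from 1.40 to 1.20.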

Solution #2 Redo Fetch after Branch

We know that once a branch has been detected during the instruction decode/register read stage, the next instruction fetch cycle is essentially a stall if we assume that the branch is taken

Next slide please …

Solution #2 Redo Fetch after Branch (cont’d)

However, the instruction fetched in this cycle never performs useful work, and is ignored

Therefore, re-fetching the branch successor instruction provides the correct instruction

Indeed, the second fetch is not essential if the branch is not taken

Impact: 1 clock cycle per branch instruction if the branch is untaken

Solution #3 Delayed Branch – S/W method

[Figure: pipeline diagram in instruction order: Add, Beq, Misc]

Redefine the branch behavior so that the branch takes effect after the next instruction, by placing another instruction (possibly a NO-OP), which is always executed, after the branch

Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (about 50% of the time)
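The impact statement above can be turned into an expected cost per branch. The sketch assumes that an unfilled slot holds a NO-OP and therefore wastes exactly one cycle; that assumption and the function name `delayed_branch_cost` are not from the slide.

```python
def delayed_branch_cost(fill_rate, slot_cycles=1):
    """Expected wasted cycles per branch with a delayed branch.

    Assumption: a filled slot costs nothing, while an unfilled slot holds a
    NO-OP and wastes `slot_cycles` cycles.
    """
    return (1.0 - fill_rate) * slot_cycles

cost = delayed_branch_cost(0.5)  # slot filled about 50% of the time
print(cost)  # 0.5
```

Under this reading, a 50% fill rate halves the one-cycle branch cost on average.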

The two possible predictions are:

- Predict Branch not-taken

- Predict branch taken

Branch Prediction Flowchart

1. Predict – Branch not taken

- This scheme is implemented by treating every branch as not taken

- So the processor continues to fetch as if the branch were a normal instruction

[Figure: sequence when the branch is not taken]

1. Predict – Branch not taken … Cont’d

- When the decision has been made and the branch is taken, the fetch operations already started are turned into NO-OPs and the fetch is restarted at the target address

[Figure: sequence when the branch is taken]

An alternative way is to treat every branch as taken

As soon as the target address is computed, we assume that the branch is to be taken and start fetching and executing at the target

In a five-stage pipeline the target address and the condition evaluation are available at the same time, so this technique offers no advantage there

2. Predict – Branch taken

Here, the branch is taken 1000 times, so the “Branch Taken” prediction fails only about 1 time in 1000; hence there is no stall for the 1000 taken branches

Further, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware's choice
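The loop case above can be counted directly. The sketch models a loop-closing branch that is taken 1000 times and then falls through once at loop exit; that outcome sequence and the function name `mispredictions` are assumptions made for illustration.

```python
def mispredictions(outcomes, predict_taken):
    """Number of branches whose actual outcome differs from a fixed prediction."""
    return sum(1 for taken in outcomes if taken != predict_taken)

# Assumed behavior of a loop-closing branch: taken 1000 times, then not taken
# once when the loop exits.
outcomes = [True] * 1000 + [False]

taken_misses = mispredictions(outcomes, predict_taken=True)
not_taken_misses = mispredictions(outcomes, predict_taken=False)
print(taken_misses, not_taken_misses)  # 1 1000
```

For this branch, a fixed "taken" prediction is wrong once while a fixed "not taken" prediction is wrong on every iteration, which is why matching the frequent path to the hardware's choice matters.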

Solution #5 Multiple Streams

Have two pipelines

Pre-fetch each branch path into a separate pipeline

Use the appropriate pipeline

Results

Leads to bus and register contention

Multiple branches lead to further pipelines being needed

Solution #6 Pre-fetch Branch Target

The target of the branch is pre-fetched in addition to the instructions following the branch

Keep the target until the branch is executed

Used by IBM 360/91

Types of hazards in the pipelined datapath

Structural hazards occur when the same resource is accessed by more than one instruction at the same time

Examples: one memory port or one register write port

They can be removed either by using multiple resources or by inserting a stall

A stall degrades the pipeline performance

Summary – 4 ways to handle control hazards

1: Stall until the branch direction is clear

2: Predict Branch Not Taken

Execute successor instructions in sequence

“Squash” instructions in the pipeline if the branch is actually taken

PC+4 is already calculated, so use it to get the next instruction

3: Predict Branch Taken

4: Delayed Branch

Define the branch to take place AFTER a following instruction

A 1-slot delay allows the proper decision and the branch target address to be determined in the 5-stage pipeline

and ALLAH Hafiz

Title	Computer Hardware Design
Instructor	Prof. Dr. M. Ashraf Chughtai
Institution	MAC/VU
Subject	Advanced Computer Architecture
Type	Lecture
