Vo Tan Phuong, 2013
Designing a Processor: Step-by-Step
Datapath Components and Clocking
Assembling an Adequate Datapath
Controlling the Execution of Instructions
The Main Controller and ALU Controller
Drawback of the single-cycle processor design
Recall, performance is determined by:
Instruction count
Clock cycles per instruction (CPI)
Clock cycle time
Processor design will affect
Clock cycles per instruction
Clock cycle time
Single cycle datapath and control design:
Advantage: One clock cycle per instruction
Disadvantage: long cycle time
Execution time = Instruction count × CPI × Clock cycle time
Analyze instruction set => datapath requirements
The meaning of each instruction is given by the register transfers
Datapath must include storage elements for ISA registers
Datapath must support each register transfer
Select datapath components and clocking methodology
Assemble datapath meeting the requirements
Analyze implementation of each instruction
Determine the setting of control signals for register transfer
Assemble the control logic
All instructions are 32-bit wide
Three instruction formats: R-type, I-type, and J-type
Op (6 bits): opcode of the instruction
Rs, Rt, Rd (5 bits each): source and destination register numbers
sa (5 bits): shift amount used by shift instructions
funct (6 bits): function field for R-type instructions
immediate (16 bits): immediate value or address offset
immediate (26 bits): target address of the jump instruction
R-type: Op(6) | Rs(5) | Rt(5) | Rd(5) | sa(5) | funct(6)
I-type: Op(6) | Rs(5) | Rt(5) | immediate(16)
J-type: Op(6) | immediate(26)
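The field boundaries of the three formats map directly onto shifts and masks. The Python sketch below is only an illustration: the function name `decode` and the dictionary keys are hypothetical, and it extracts every field regardless of format, leaving the caller to use the fields that apply to the opcode.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31..26
        "rs": (instr >> 21) & 0x1F,     # bits 25..21
        "rt": (instr >> 16) & 0x1F,     # bits 20..16
        "rd": (instr >> 11) & 0x1F,     # bits 15..11 (R-type)
        "sa": (instr >> 6) & 0x1F,      # bits 10..6  (R-type)
        "funct": instr & 0x3F,          # bits 5..0   (R-type)
        "imm16": instr & 0xFFFF,        # bits 15..0  (I-type)
        "imm26": instr & 0x3FFFFFF,     # bits 25..0  (J-type)
    }


# I-type example: addi with rt = 8, rs = 17, immediate = 100
addi_word = (0x08 << 26) | (17 << 21) | (8 << 16) | 100
addi_fields = decode(addi_word)

# R-type example: slt with rd = 8, rs = 9, rt = 10 (op = 0, funct = 0x2a)
slt_word = (9 << 21) | (10 << 16) | (8 << 11) | 0x2A
slt_fields = decode(slt_word)
```

In hardware no shifting is needed: each field is simply a fixed group of wires of the instruction bus.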
Only a subset of the MIPS instructions is considered
ALU instructions (R-type): add, sub, and, or, xor, slt
Immediate instructions (I-type): addi, slti, andi, ori, xori
Load and Store (I-type): lw, sw
Branch (I-type): beq, bne
Jump (J-type): j
This subset does not include all the integer instructions
But sufficient to illustrate design of datapath and control
Concepts used to implement the MIPS subset are used
to construct a broad spectrum of computers
Instruction          Meaning            Encoding
slt  rd, rs, rt      set on less than   op = 0    | rs | rt | rd | sa = 0 | funct = 0x2a
addi rt, rs, imm16   add immediate      op = 0x08 | rs | rt | imm16
slti rt, rs, imm16   slt immediate      op = 0x0a | rs | rt | imm16
andi rt, rs, imm16   and immediate      op = 0x0c | rs | rt | imm16
ori  rt, rs, imm16   or immediate       op = 0x0d | rs | rt | imm16
xori rt, rs, imm16   xor immediate      op = 0x0e | rs | rt | imm16
lw   rt, imm16(rs)   load word          op = 0x23 | rs | rt | imm16
sw   rt, imm16(rs)   store word         op = 0x2b | rs | rt | imm16
beq  rs, rt, imm16   branch if equal    op = 0x04 | rs | rt | imm16
bne  rs, rt, imm16   branch not equal   op = 0x05 | rs | rt | imm16
RTL is a description of data flow between registers
RTL gives a meaning to the instructions
All instructions are fetched from memory at address PC
Instruction   RTL Description
ORI           Reg(Rt) ← Reg(Rs) | zero_ext(Im16); PC ← PC + 4
LW            Reg(Rt) ← MEM[Reg(Rs) + sign_ext(Im16)]; PC ← PC + 4
SW            MEM[Reg(Rs) + sign_ext(Im16)] ← Reg(Rt); PC ← PC + 4
BEQ           if (Reg(Rs) == Reg(Rt)) PC ← PC + 4 + 4 × sign_ext(Im16) else PC ← PC + 4
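The register transfers for ORI, LW, SW, and BEQ can be read as an executable specification. The sketch below applies one transfer at a time to a small machine state. It is an illustration under stated assumptions: the names `step`, `sign_ext`, and `zero_ext` are hypothetical, the register file is a Python list of 32 integers, and memory is a dictionary from address to word.

```python
MASK32 = 0xFFFFFFFF


def sign_ext(imm16: int) -> int:
    """Interpret a 16-bit pattern as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def zero_ext(imm16: int) -> int:
    """Interpret a 16-bit pattern as an unsigned value."""
    return imm16 & 0xFFFF


def step(name, rs, rt, imm16, reg, mem, pc):
    """Apply the RTL of one instruction and return the next PC."""
    if name == "ori":
        reg[rt] = reg[rs] | zero_ext(imm16)
    elif name == "lw":
        reg[rt] = mem[(reg[rs] + sign_ext(imm16)) & MASK32]
    elif name == "sw":
        mem[(reg[rs] + sign_ext(imm16)) & MASK32] = reg[rt]
    elif name == "beq":
        if reg[rs] == reg[rt]:
            return (pc + 4 + 4 * sign_ext(imm16)) & MASK32
    else:
        raise ValueError(f"unsupported instruction: {name}")
    return (pc + 4) & MASK32


reg = [0] * 32
mem = {}
pc = 0x00400000
pc = step("ori", 0, 8, 0x1234, reg, mem, pc)  # Reg(8) <- 0 | 0x1234
pc = step("sw", 8, 8, 0xFFFC, reg, mem, pc)   # MEM[0x1234 - 4] <- Reg(8)
pc = step("lw", 8, 9, 0xFFFC, reg, mem, pc)   # Reg(9) <- MEM[0x1230]
pc = step("beq", 8, 9, 0xFFFF, reg, mem, pc)  # taken: PC + 4 + 4 * (-1)
```

The last branch uses an offset of -1, so it targets itself: PC + 4 - 4 leaves the PC unchanged.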
R-type
Fetch instruction: Instruction ← MEM[PC]
Fetch operands: data1 ← Reg(Rs), data2 ← Reg(Rt)
Execute operation: ALU_result ← func(data1, data2)
Write ALU result: Reg(Rd) ← ALU_result
Next PC address: PC ← PC + 4
I-type
Fetch instruction: Instruction ← MEM[PC]
Fetch operands: data1 ← Reg(Rs), data2 ← Extend(imm16)
Execute operation: ALU_result ← op(data1, data2)
Write ALU result: Reg(Rt) ← ALU_result
Next PC address: PC ← PC + 4
BEQ
Fetch instruction: Instruction ← MEM[PC]
Fetch operands: data1 ← Reg(Rs), data2 ← Reg(Rt)
Equality: zero ← subtract(data1, data2)
Branch: if (zero) PC ← PC + 4 + 4×sign_ext(imm16)
else PC ← PC + 4
LW
Fetch instruction: Instruction ← MEM[PC]
Fetch base register: base ← Reg(Rs)
Calculate address: address ← base + sign_extend(imm16)
Read memory: data ← MEM[address]
Write register Rt: Reg(Rt) ← data
Next PC address: PC ← PC + 4
SW
Fetch instruction: Instruction ← MEM[PC]
Fetch registers: base ← Reg(Rs), data ← Reg(Rt)
Calculate address: address ← base + sign_extend(imm16)
Write memory: MEM[address] ← data
Next PC address: PC ← PC + 4
J
Fetch instruction: Instruction ← MEM[PC]
Target PC address: target ← PC[31:28] || Imm26 || ‘00’ (|| denotes concatenation)
Jump: PC ← target
Memory
Registers
Read source register Rs
Read source register Rt
Write destination register Rt or Rd
Program counter PC register and Adder to increment PC
Sign and Zero extender for immediate constant
ALU for executing instructions
Datapath Components and Clocking
[Figure: datapath components, including the Instruction Memory and a memory block with Address, Data_in, and Data_out ports (32 bits each), MemRead and MemWrite controls, and a clk input]
Register
Similar to the D-type Flip-Flop
n-bit input and output
Write Enable (WE):
Enable / disable writing of register
Negated (0): Data_Out will not change
Asserted (1): Data_Out will become Data_In after clock edge
Edge triggered Clocking
Register output is modified at clock edge
[Figure: n-bit register with Data_In, Data_Out, Write Enable (WE), and Clock]
Register File consists of 32 × 32-bit registers
Two registers read and one written in a cycle
Registers are selected by:
RA selects register to be read on BusA
RB selects register to be read on BusB
RW selects the register to be written
Clock input
The clock input is used ONLY during write operation
During read, register file behaves as a combinational logic block
RA or RB valid => BusA or BusB valid after access time
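The split between combinational reads and clocked writes can be modeled in a few lines. This is a behavioral sketch, not the slide's circuit: the class `RegisterFile` and its method names are hypothetical, and it assumes the MIPS convention that register 0 always reads as zero, so writes to it are ignored.

```python
class RegisterFile:
    """32 x 32-bit register file: two read ports, one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra: int, rb: int):
        """Combinational read: BusA and BusB follow RA and RB (no clock)."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw: int, bus_w: int, reg_write: bool):
        """Write happens only at the clock edge, and only if RegWrite is set."""
        if reg_write and rw != 0:  # assumption: R0 is hardwired to zero
            self.regs[rw] = bus_w & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(rw=5, bus_w=42, reg_write=True)   # R5 <- 42
rf.clock_edge(rw=5, bus_w=7, reg_write=False)   # ignored: RegWrite = 0
rf.clock_edge(rw=0, bus_w=99, reg_write=True)   # ignored: R0 stays zero
bus_a, bus_b = rf.read(ra=5, rb=0)
```

Because `read` takes no clock, the same register can be read early in a cycle and overwritten at the edge that ends it.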
[Figure: Register File block with RA, RB, and RW select inputs]
R0 is not used: in MIPS, register 0 always reads as zero
Each of the other registers has its own write enable (WE)
Tri-state buffers allow multiple sources to drive a single bus
Two inputs: Data_in and Enable
One output: Data_out
If (Enable) Data_out = Data_in, else Data_out = high-impedance state (output is disconnected)
Tri-state buffers can be used to build multiplexors
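A small behavioral model shows how tri-state buffers combine into a multiplexor. This is a sketch under assumptions: `None` stands in for the high-impedance state, and the names `tri_state` and `mux2` are hypothetical.

```python
def tri_state(data_in: int, enable: bool):
    """Drive data_in onto the bus when enabled, else disconnect (None)."""
    return data_in if enable else None


def mux2(d0: int, d1: int, sel: int) -> int:
    """2-to-1 multiplexor built from two tri-state buffers on a shared bus."""
    drivers = [tri_state(d0, sel == 0), tri_state(d1, sel == 1)]
    active = [d for d in drivers if d is not None]
    # Exactly one buffer may drive the bus at any time.
    if len(active) != 1:
        raise RuntimeError("bus contention or floating bus")
    return active[0]
```

The select signal guarantees that the enables are mutually exclusive, which is what keeps two sources from driving the bus at once.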
ALU Selection
SLT: the ALU performs a SUB and checks the sign and overflow bits
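The sign bit of the difference alone is not enough for SLT, because the subtraction can overflow. A common formulation, assumed here rather than taken from the slide, is SLT = sign XOR overflow. The function name `alu_slt` is hypothetical, and operands are 32-bit two's-complement bit patterns.

```python
MASK32 = 0xFFFFFFFF


def alu_slt(a: int, b: int) -> int:
    """Return 1 if a < b as signed 32-bit values, else 0."""
    a &= MASK32
    b &= MASK32
    diff = (a - b) & MASK32          # the ALU performs a SUB
    sign = diff >> 31                # sign bit of the result
    sign_a, sign_b = a >> 31, b >> 31
    # Subtraction overflows when the operands have different signs
    # and the result's sign differs from the sign of a.
    overflow = int(sign_a != sign_b and sign != sign_a)
    return sign ^ overflow
```

For example, comparing the largest positive value with -1 yields a difference whose sign bit is 1, but the overflow flag corrects the result to 0.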
Instruction memory needs only provide read access
Because datapath does not write instructions
Behaves as combinational logic for read
Address selects Instruction after access time
Data Memory is used for load and store
The Clock synchronizes the write operation
Separate instruction and data memories
Later, we will replace them with caches
[Figure: Data Memory block with 32-bit Address and Data_in inputs and MemRead and MemWrite controls]
Clocks are needed in sequential logic to decide when a state element (register) should be updated
To ensure correctness, a clocking methodology defines when data can be written and read
Data must be valid and stable before arrival of clock edge
Edge-triggered clocking allows a register to be read and written during the same clock cycle
With edge-triggered clocking, the clock cycle must be long enough to accommodate the path from one register, through the combinational logic, to another register
Tclk-q: clock-to-output delay of a register
Ts: setup time during which the input to a register must be stable before arrival of clock edge
Th: hold time during which the input to a register must remain stable after arrival of clock edge
Hold time (Th) is normally satisfied since Tclk-q > Th
Clock skew arises because the clock signal uses different paths with slightly different delays to reach state elements
Clock skew is the difference in absolute time between
when two storage elements see a clock edge
With a clock skew, the clock cycle time is increased
Clock skew is reduced by balancing the clock delays
Tcycle ≥ Tclk-q + Tmax_combinational + Tsetup + Tskew
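The cycle-time bound is a plain sum, which the sketch below evaluates. The function names are hypothetical and the delay values are made-up numbers in picoseconds, chosen only to show the arithmetic.

```python
def min_cycle_time(t_clk_q, t_max_comb, t_setup, t_skew=0.0):
    """Smallest clock period that satisfies the setup constraint."""
    return t_clk_q + t_max_comb + t_setup + t_skew


def hold_time_met(t_clk_q, t_hold):
    """Hold is normally satisfied when the clock-to-output delay exceeds it."""
    return t_clk_q > t_hold


# Illustrative values only (ps); they are not taken from the slides.
without_skew = min_cycle_time(t_clk_q=30, t_max_comb=800, t_setup=20)
with_skew = min_cycle_time(t_clk_q=30, t_max_comb=800, t_setup=20, t_skew=10)
```

With these numbers, a 10 ps skew stretches the minimum period from 850 ps to 860 ps.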
Assembling an Adequate Datapath
We can now assemble the datapath from its components
For instruction fetching, we need …
Program Counter (PC) register
Instruction Memory
Adder for incrementing PC
The least significant 2 bits
of the PC are ‘00’ since
PC is a multiple of 4
Datapath does not handle branch or jump instructions
Improved datapath increments the upper 30 bits of PC by 1
[Figure: instruction fetch datapath with the PC register, Instruction Memory (Address in, Instruction out), and the next PC adder, clocked by clk]
Datapath for R-type instructions
R-type format: Op(6) | Rs(5) | Rt(5) | Rd(5) | sa(5) | funct(6)
Control signals: ALUCtrl, RegWrite
BusA & BusB provide data input to ALU
ALU result is connected to BusW
Datapath for I-type ALU instructions
I-type format: Op(6) | Rs(5) | Rt(5) | immediate(16)
Control signals: ALUCtrl, RegWrite
ALUCtrl is derived from the Op field
Rt selects the register to write, not Rd
Clock edge updates PC and Rt
Control signals
ALUCtrl is derived from either the Op or the funct field
RegDst selects the register destination as either Rt or Rd
A mux selects RW as either Rt or Rd
Another mux selects the 2nd ALU input as either the data on BusB or the extended immediate
For R-type ALU instructions, RegDst is ‘1’ to select Rd on RW, and ALUSrc is ‘0’ to select BusB as second ALU input
For I-type ALU instructions, RegDst is ‘0’ to select Rt on RW, and ALUSrc is ‘1’ to select the extended immediate as second ALU input
Two types of extensions
Zero-extension for unsigned constants
Sign-extension for signed constants
Control signal ExtOp indicates type of extension
Extender Implementation: wiring and one AND gate
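The "wiring and one AND gate" implementation can be mirrored in software: the lower 16 bits pass straight through, and every upper bit is the AND of ExtOp with the sign bit of the immediate. The function name `extend` is hypothetical; ExtOp = 1 is taken to mean sign extension, consistent with the load and store settings in this deck.

```python
def extend(imm16: int, ext_op: int) -> int:
    """Extend a 16-bit immediate to 32 bits (ext_op: 0 = zero, 1 = sign)."""
    imm16 &= 0xFFFF
    fill = ext_op & (imm16 >> 15)      # the single AND gate
    upper = 0xFFFF if fill else 0      # fill bit replicated 16 times (wiring)
    return (upper << 16) | imm16
```

For instance, `extend(0x8000, 1)` gives `0xFFFF8000`, while `extend(0x8000, 0)` gives `0x00008000`.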
Adding Data Memory to Datapath
A data memory is added for load and store instructions
Additional control signals: MemRead, MemWrite, MemtoReg
BusB is connected to Data_in of Data Memory for store instructions
A 3rd mux selects data on BusW as either ALU result or memory Data_out
Controlling the execution of a load (lw)
RegDst = ‘0’ selects Rt as destination register
RegWrite = ‘1’ to enable writing of register file
MemtoReg = ‘1’ places the data read from memory on BusW
ExtOp = 1 to sign-extend Immediate16 to 32 bits
Clock edge updates PC and Register Rt
Controlling the execution of a store (sw)
RegDst = ‘X’ because no register is written
RegWrite = ‘0’ to disable writing of register file
MemtoReg = ‘X’ because we don’t care what data is put on BusW
ExtOp = 1 to sign-extend Immediate16 to 32 bits
Clock edge updates PC and Data Memory
Additional Control Signals: J, Beq, Bne, PCSrc
Next PC logic computes jump or branch target instruction address
[Figure: datapath with the Next PC block (inputs zero, J, Beq, Bne; output PCSrc), Instruction Memory, and Data Memory, showing control signals ALUCtrl, RegWrite, ExtOp, RegDst, ALUSrc, MemWrite, and MemtoReg]
Imm16 is sign-extended to 30 bits
Jump target address: upper 4 bits of PC are concatenated with Imm26
PCSrc = J + (Beq · Zero) + (Bne · NOT Zero), where + is OR and · is AND
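The Next PC logic can be sketched on the upper 30 bits of the PC, since the low two bits are always ‘00’. This is an illustration under assumptions: the names `sign_ext_30` and `next_pc30` are hypothetical, control inputs are 0/1 integers, and PCSrc is taken to be J OR (Beq AND Zero) OR (Bne AND NOT Zero), which is the only reading under which bne branches when the operands differ.

```python
MASK30 = (1 << 30) - 1


def sign_ext_30(imm16: int) -> int:
    """Sign-extend a 16-bit immediate to a 30-bit pattern."""
    imm16 &= 0xFFFF
    return imm16 | 0x3FFF0000 if imm16 & 0x8000 else imm16


def next_pc30(pc30, imm16, imm26, j, beq, bne, zero):
    """Return the next value of the upper 30 bits of the PC."""
    pc_src = j | (beq & zero) | (bne & (zero ^ 1))
    incremented = (pc30 + 1) & MASK30            # PC + 4 in byte terms
    if not pc_src:
        return incremented
    if j:
        # Upper 4 bits of the 30-bit PC concatenated with Imm26.
        return (pc30 & 0x3C000000) | (imm26 & 0x3FFFFFF)
    return (incremented + sign_ext_30(imm16)) & MASK30
```

Working on 30 bits removes the multiply-by-4 from the branch target: the offset is added to the word address directly.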
Controlling the execution of a branch (beq, bne)
RegWrite, MemRead, and MemWrite are 0
Either Beq = 1 or Bne = 1, depending on opcode
ALUSrc = 0 to select value on BusB
ALUCtrl = SUB to generate Zero Flag
Next PC outputs branch target address
PCSrc = 1 if branch is taken
Clock edge updates PC register only
The Main Controller and ALU Controller
Main Control Input:
6-bit opcode field from instruction
Main Control Output:
Control signals for the datapath: RegDst, RegWrite, ExtOp, ALUSrc, Beq, Bne, J, MemRead, MemWrite, MemtoReg
ALU Control Input:
6-bit opcode field from instruction
6-bit function field from instruction
ALU Control Output:
ALUCtrl signal for the ALU
[Figure: single-cycle datapath with Main Control (input Op; outputs J, Beq, Bne, MemtoReg, MemRead, MemWrite, ExtOp) and ALU Control (inputs ALUop and func; output ALUCtrl)]
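Main Control Signals
Signal | Effect when ‘0’ | Effect when ‘1’
RegDst | Destination register = Rt | Destination register = Rd
RegWrite | None | Destination register is written with the data value on BusW
ExtOp | 16-bit immediate is zero-extended | 16-bit immediate is sign-extended
ALUSrc | Second ALU operand comes from the second register file output (BusB) | Second ALU operand comes from the extended 16-bit immediate
MemRead | None | Memory is read: Data_out ← Memory[address]
MemWrite | None | Memory is written: Memory[address] ← Data_in
MemtoReg | BusW = ALU result | BusW = Data_out from memory
Beq, Bne | PC ← PC + 4 | PC ← Branch target address, if branch is taken
J | PC ← PC + 4 | PC ← Jump target address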
X is a don’t care (can be 0 or 1), used to minimize logic
Op   RegDst  RegWrite  ExtOp   ALUSrc  Beq  Bne  J  MemRead  MemWrite  MemtoReg
ori  0 = Rt  1         0=zero  1=Imm   0    0    0  0        0         0
RegDst = R-type
RegWrite = NOT (sw + beq + bne + j)
ExtOp = NOT (andi + ori + xori)
ALUSrc = NOT (R-type + beq + bne)
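The logic equations can be checked by evaluating them for each opcode. The sketch below assumes that the parenthesized terms of RegWrite, ExtOp, and ALUSrc are complemented, which is the reading that agrees with the ori row of the truth table and with the load, store, and branch settings described earlier. The function name `main_control` is hypothetical; the opcode for j (0x02) comes from the standard MIPS encoding rather than from the instruction table in this deck; and the MemRead, MemWrite, and MemtoReg equations are inferred from the load and store descriptions. Don't-care outputs simply take whatever value the equation produces.

```python
R_TYPE, J, BEQ, BNE = 0x00, 0x02, 0x04, 0x05
ADDI, SLTI, ANDI, ORI, XORI = 0x08, 0x0A, 0x0C, 0x0D, 0x0E
LW, SW = 0x23, 0x2B


def main_control(op: int) -> dict:
    """Derive the main control signals from the 6-bit opcode."""
    r_type, j = op == R_TYPE, op == J
    beq, bne = op == BEQ, op == BNE
    lw, sw = op == LW, op == SW
    logical_imm = op in (ANDI, ORI, XORI)
    return {
        "RegDst": int(r_type),
        "RegWrite": int(not (sw or beq or bne or j)),
        "ExtOp": int(not logical_imm),
        "ALUSrc": int(not (r_type or beq or bne)),
        "Beq": int(beq),
        "Bne": int(bne),
        "J": int(j),
        "MemRead": int(lw),     # assumption: read memory only for loads
        "MemWrite": int(sw),    # assumption: write memory only for stores
        "MemtoReg": int(lw),    # assumption: BusW takes memory data for loads
    }


ori_signals = main_control(ORI)
```

Evaluating `main_control(ORI)` reproduces the ori row: RegDst = 0, RegWrite = 1, ExtOp = 0, ALUSrc = 1, and all remaining signals 0.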
ALU Control Truth Table
Inputs: Op (6 bits), funct (6 bits)
Output: ALUCtrl, with its 4-bit encoding
The 4-bit ALUCtrl is encoded according to the ALU implementation
Drawback of the single-cycle processor design
Long cycle time
All instructions take as much time as the slowest instruction
ALU instruction: Instruction Fetch, Decode and Reg Read, ALU, Reg Write
Load instruction: Instruction Fetch, Decode and Reg Read, Compute Address, Memory Read, Reg Write (longest delay)
Store instruction: Instruction Fetch, Decode and Reg Read, Compute Address, Memory Write
[Figure: timing diagram over one clock cycle: the PC changes from its old to its new value at the clock edge, and the data memory output changes from its old value to the data from DM after the data memory access time]
Long cycle time: long enough for the slowest instruction
The critical path includes, among other delays:
+ Data Memory Access Time
+ Delay through MemtoReg Mux
+ Setup Time for Register File Write + Clock Skew
Cycle time is longer than needed for other instructions
Therefore, single-cycle processor design is not used in practice
Multicycle implementation: break instruction execution into five steps
Instruction fetch
Instruction decode, register read, target address for jump/branch
Execution, memory address calculation, or branch outcome
Memory access or ALU instruction completion
Load instruction completion
One clock cycle per step (clock cycle is reduced)
First 2 steps are the same for all instructions
Cycles per instruction class: ALU & Store 4, Branch 3, Load 5, Jump 2
Assume the following operation times for components:
Instruction and data memories: 200 ps
ALU and adders: 180 ps
Decode and Register file access (read or write): 150 ps
Ignore the delays in PC, mux, extender, and wires
Which of the following would be faster and by how much?
Single-cycle implementation for all instructions
Multicycle implementation optimized for every class of instructions
Assume the following instruction mix:
40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps
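Instruction class | Instruction Memory | Register Read | ALU Operation | Data Memory | Register Write | Total
ALU    | 200 | 150 | 180 | -   | 150 | 680 ps
Load   | 200 | 150 | 180 | 200 | 150 | 880 ps
Store  | 200 | 150 | 180 | 200 | -   | 730 ps
Branch | 200 | 150 | 180 (compare and write PC) | - | - | 530 ps
Single-cycle clock cycle = 880 ps, determined by longest delay (load instruction)
Multicycle clock cycle = 200 ps, determined by the slowest step (memory access)
Average multicycle CPI = 0.4 × 4 + 0.2 × 5 + 0.1 × 4 + 0.2 × 3 + 0.1 × 2 = 3.8
Speedup of multicycle over single-cycle = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16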
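The comparison can be reproduced numerically from the stated component delays and instruction mix. In this sketch the variable names are hypothetical, and the cycle counts of 5 for loads and 2 for jumps are assumptions inferred from the five-step breakdown and from the average CPI of 3.8 used in the solution.

```python
# Component delays in picoseconds, as stated in the problem.
MEMORY, ALU, REG_FILE = 200, 180, 150

# Single-cycle: the clock must fit the slowest instruction (load).
load_path = MEMORY + REG_FILE + ALU + MEMORY + REG_FILE
single_cycle_time = load_path

# Multicycle: the clock must fit the slowest single step.
multi_cycle_time = max(MEMORY, ALU, REG_FILE)

mix = {"alu": 0.40, "load": 0.20, "store": 0.10, "branch": 0.20, "jump": 0.10}
# Assumption: load takes 5 cycles and jump takes 2.
cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 2}

average_cpi = sum(mix[k] * cycles[k] for k in mix)
multi_time_per_instr = average_cpi * multi_cycle_time
speedup = single_cycle_time / multi_time_per_instr
```

The multicycle design comes out about 1.16 times faster, because the average instruction needs 3.8 short cycles (760 ps) instead of one 880 ps cycle.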