Advanced Computer Architecture - Lecture 45: Putting it all together. This lecture will cover the following: introduction and quantitative principles; instruction set architecture; computer hardware design; instruction level parallelism – dynamic; instruction level parallelism – static; memory hierarchy system;...
Today’s Topics
Module 1: Introduction
Module 2: Instruction Set Architecture
Module 3: Computer hardware design
Module 4: Instruction Level Parallelism – Dynamic
Module 5: Instruction Level Parallelism – Static
Module 6: Memory Hierarchy system
Module 7: Multiprocessing
Module 8: I/O Systems
Module 9: Networks and Clusters
MAC/VU-Advanced
Computer Architecture Lecture 45 Putting It All together(2) 2
Module 1: Introduction and Quantitative Principles
We started this course by distinguishing computer organization
from computer architecture
Architecture refers to those attributes of a computer visible to the
programmer or compiler writer; e.g.,
instruction set, memory addressing techniques, I/O mechanisms, etc.
The architecture is the same for all members of a processor family,
whereas the organization of that architecture may differ between
members of the family
Module 1: Introduction …
Computer Development
We also introduced computer development from academic and commercial perspectives
Academically, modern computer development had its infancy in 1944-49,
when John von Neumann introduced the concept of the stored-program computer,
referred to as the Electronic Discrete Variable Automatic Computer (EDVAC)
Module 1: Introduction …
Computer Development
Commercially, the first machine was built by the
Eckert-Mauchly Computer Corporation in 1949
In 1971, Intel introduced the first cheap microprocessor, the 4004,
followed by the 80x86 series
In 1998, more than 350 million microprocessors with different instruction set architectures were
in use; this number had risen to more than a billion by 2006
Module 1: Introduction …
Computer Generations
Technological developments, from vacuum tubes to VLSI circuits,
dynamic memory and network technology, gave birth to four
different generations of computers
This course has viewed computer architecture from four perspectives:
Processor Design
Memory Hierarchy
Input/output and storage
Multiprocessor and Network interconnection
Module 1: Quantitative Principles
The key to quantitative analysis of the effectiveness of the entire
computing system is the performance of the computer hardware and software
In this respect, we discussed:
Price-performance design
CPU performance metrics
CPU benchmark suites
Module 1: Price-Performance Design
The issue of cost-performance is a complex one
At one extreme, the designer of a high-performance computer may give
little importance to cost in achieving the performance goal
At the other extreme, the designer of a low-cost computer may
sacrifice performance to some extent
Price-performance design lies between these extremes: the designer
balances cost, and hence price, versus performance
Module 1: CPU Benchmark Suites
To compare the performance of two machines, a user can simply compare the
execution time of the same workload running on both machines
In practice, users want to know how well a machine will perform on their
workload without running their own programs
This is accomplished by evaluating the machine using a set of benchmarks:
programs specifically chosen to measure performance
Module 1: CPU Benchmark Suites
Five levels of programs are used as benchmarks:
1 Real applications – real programs, such as scientific programs, used to
evaluate the performance of a machine
2 Modified applications – real applications with certain blocks modified to
focus on desired aspects of the application
3 Kernels – small key pieces extracted from real programs
4 Toy benchmarks – small codes normally used as
beginning programming assignments
5 Synthetic benchmarks – small sections of
artificially created programs
Module 1: Quantitative Principles of Performance Measurement
Quantitatively, the performance of a system can be enhanced by speeding up
a fraction of the system, based on the concept of the common case first
Amdahl’s Law is the basis for measuring the performance enhancement;
it defines the speedup due to an enhancement E that accelerates
a fraction F of the task as:
Module 1: Amdahl's Law
Speedup(E) = ExTime without Enhancement / ExTime with Enhancement
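A minimal Python sketch of Amdahl's Law, under the usual assumption that the enhanced fraction F of the task runs S times faster while the rest is unaffected. The symbol S and the name `amdahl_speedup` are illustrative assumptions, not taken from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the task runs S times faster.

    Speedup = 1 / ((1 - F) + F / S)
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Hypothetical numbers: 40% of the task is made 10 times faster.
    print(round(amdahl_speedup(0.4, 10.0), 4))
```

However large S becomes, the overall speedup stays below 1 / (1 - F), which is why the unenhanced fraction limits the gain.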
Module 2: Instruction Set Architecture
The three pillars of computer architecture are:
hardware,
instruction set,
software
Hardware runs the software, and the instruction set is the interface
between the hardware and the software
– In discussing the instruction set architecture, the focus of our discussion has been:
Module 2: Taxonomy of Instruction Set
The taxonomy of instruction sets was defined as:
– Stack Architecture
– Accumulator Architecture
– General Purpose Register Architecture
Register – memory
Register – Register (load/store)
Memory – Memory Architecture (Obsolete)
Module 2: Types of Operands and
Arithmetic, data transfer, control
and support operations
Module 2: Types of Operand Addressing Modes
Operand Addressing Modes
Immediate, register, direct (absolute) and indirect
Classification of Indirect Addressing
Register, indexed, relative (i.e., with displacement) and memory
Special Addressing Modes
Auto-increment, auto-decrement and scaled
Control Instruction Addressing Modes
Branch, jump and procedure call/return
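The addressing modes listed above differ in how the effective memory address is formed. The Python sketch below illustrates one common textbook interpretation of several of them; the function name, the mode names and the register and memory contents are hypothetical. Immediate and register modes are left out because they do not form a memory address.

```python
def effective_address(mode, regs, mem, reg=None, index=None, const=0, scale=1):
    """Return the effective memory address for a few operand addressing modes.

    regs maps register names to values; mem maps addresses to stored values.
    """
    if mode == "direct":             # absolute: the address is in the instruction
        return const
    if mode == "register_indirect":  # the address is held in a register
        return regs[reg]
    if mode == "displacement":       # register plus a constant offset
        return regs[reg] + const
    if mode == "indexed":            # sum of two registers
        return regs[reg] + regs[index]
    if mode == "memory_indirect":    # register points to a cell holding the address
        return mem[regs[reg]]
    if mode == "scaled":             # offset + base + index * element size
        return const + regs[reg] + regs[index] * scale
    raise ValueError(f"unsupported mode: {mode}")


if __name__ == "__main__":
    regs = {"R1": 100, "R2": 8}      # hypothetical register contents
    mem = {100: 500}                 # hypothetical memory contents
    print(effective_address("displacement", regs, mem, reg="R1", const=12))
    print(effective_address("scaled", regs, mem, reg="R1", index="R2", const=4, scale=4))
```

Auto-increment and auto-decrement would additionally update the register as a side effect, which this address-only sketch does not model.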
Module 3: Computer Hardware Design
Basic building blocks of a computer
Sub-systems of CPU: Datapath and Control
Processor design steps
Processor design parameters
Hardware design process
Timing signals
Uni-bus, 2-bus and 3-bus structures
3-bus-based single-cycle datapath
Sub-systems of Central Processing Unit
At a “higher level” a CPU can be viewed as consisting of two sub-systems
– Datapath:
the path that facilitates the transfer of information from
one part (register/memory/IO) to the other part of the system
– Control:
the hardware that generates signals to control the
sequence of steps and direct the flow of information
through the datapath
[Figure: CPU block diagram showing the Datapath and Control sub-systems]
Module 3: Datapath Implementations
The datapath is the arithmetic organ of von Neumann’s stored-program computer
Its implementation is based on the concepts of single-cycle,
multiple-cycle and pipelined architecture
Module 3: Datapath Implementation
It consists of registers, internal buses,
arithmetic units and shifters
Each register in the register file has:
- a load control line that enables data load to
Module 3: Single/Multiple Cycle Approach
In the single-cycle implementation, the cycle time is set to accommodate the longest
instruction, the load instruction.
In the multiple-cycle implementation, the cycle time is set to
accommodate the longest step, the memory read/write.
Consequently, the cycle time for the single-cycle implementation can be
five times longer than that of the multiple-cycle implementation.
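A small Python sketch of the cycle-time comparison above. The five-step breakdown, the per-class step lists and the equal 200 ps latencies are hypothetical values chosen only to reproduce the factor of five.

```python
# Hypothetical step latencies in picoseconds; equal steps are an assumption.
STEP_PS = {"IF": 200, "ID": 200, "EX": 200, "MEM": 200, "WB": 200}

# Steps used by each instruction class (an assumed five-step breakdown).
INSTRUCTION_STEPS = {
    "load": ["IF", "ID", "EX", "MEM", "WB"],
    "store": ["IF", "ID", "EX", "MEM"],
    "alu": ["IF", "ID", "EX", "WB"],
    "branch": ["IF", "ID", "EX"],
}


def single_cycle_time(step_ps, instruction_steps):
    """Clock period must fit the slowest whole instruction."""
    return max(sum(step_ps[s] for s in steps) for steps in instruction_steps.values())


def multi_cycle_time(step_ps):
    """Clock period must fit the slowest individual step."""
    return max(step_ps.values())


if __name__ == "__main__":
    single = single_cycle_time(STEP_PS, INSTRUCTION_STEPS)
    multi = multi_cycle_time(STEP_PS)
    print(single, multi, single // multi)
```

A multiple-cycle load still needs five of the shorter cycles, so the benefit comes from shorter instructions finishing in fewer cycles.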
Module 3: Pipelined Datapath
Pipelining is a fundamental concept in which an instruction is
completed in multiple steps using distinct resources
It utilizes the capabilities of the datapath by starting the next
instruction while working on the current one
The pipelined datapath may encounter three types of hazards:
Structural, Data and Control
Module 3: Pipeline Hazards
Structural hazards occur when the same resource is accessed by more
than one instruction at the same time; e.g.,
one memory port or one register write port
They can be removed either by using multiple resources or by inserting stalls
Stalls degrade the pipeline performance
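How much stalls degrade performance can be estimated with a standard textbook approximation: with an ideal CPI of 1 and perfectly balanced stages, speedup over the unpipelined machine is roughly the pipeline depth divided by (1 + stall cycles per instruction). The sketch below assumes exactly that model; the function name and the numbers are hypothetical.

```python
def pipeline_speedup(depth: int, stalls_per_instruction: float) -> float:
    """Approximate speedup of a pipeline over the unpipelined machine.

    Assumes an ideal CPI of 1 and perfectly balanced stages.
    """
    if depth < 1 or stalls_per_instruction < 0:
        raise ValueError("depth must be >= 1 and stalls must be non-negative")
    return depth / (1.0 + stalls_per_instruction)


if __name__ == "__main__":
    # Hypothetical 5-stage pipeline, with and without 0.25 stall cycles per instruction.
    print(pipeline_speedup(5, 0.0), pipeline_speedup(5, 0.25))
```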
Module 3: Pipeline Hazards
Data hazards occur when an attempt is made to read invalid data, i.e., an
operand that an earlier instruction has not yet written
Data hazards can be removed by using stall and forwarding techniques
Control hazards occur when an attempt is
made to branch prior to the evaluation of the condition
There are four ways to handle control hazards
Module 3: Four ways to handle control hazards
1: Stall until branch direction is clear
2: Predict Branch Not Taken
Execute successor instructions in sequence
3: Predict Branch Taken
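The cost of the first two schemes above can be compared with a simple CPI model: every branch pays the penalty when the pipeline always stalls, while under predict-not-taken only the taken branches pay it. This is a sketch under that assumption; the function name, the parameter names and the workload numbers are hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, penalty_cycles, penalized_fraction=1.0):
    """CPI including branch stall cycles.

    penalized_fraction is the share of branches that actually pay the penalty:
    1.0 when every branch stalls, the taken fraction under predict-not-taken.
    """
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, half of them taken, 2-cycle penalty.
    stall_always = effective_cpi(1.0, 0.2, 2)
    predict_not_taken = effective_cpi(1.0, 0.2, 2, penalized_fraction=0.5)
    print(f"stall: {stall_always:.2f}, predict not taken: {predict_not_taken:.2f}")
```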
Module 4: Instruction Level Parallelism
A simple pipeline facilitates in-order execution
To enhance the performance of the pipeline, however, we want to begin
execution as soon as the data operands are available, i.e., out-of-order execution
Out-of-order execution may introduce data
hazards of type WAR and WAW
Instruction Level Parallelism can be achieved
by hardware or software
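The hazard types mentioned above (WAR and WAW, alongside the RAW hazard of true dependences) can be told apart by comparing the registers two instructions read and write. The Python sketch below assumes a hypothetical encoding of an instruction as a (destination, sources) pair; the register names are illustrative.

```python
def classify_hazards(first, second):
    """Return the data hazards between two instructions in program order.

    Each instruction is a (dest, sources) pair: dest is a register name
    or None, sources is a tuple of register names.
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    hazards = set()
    if first_dest is not None and first_dest in second_srcs:
        hazards.add("RAW")  # second reads what first writes
    if second_dest is not None and second_dest in first_srcs:
        hazards.add("WAR")  # second overwrites what first still has to read
    if first_dest is not None and first_dest == second_dest:
        hazards.add("WAW")  # both write the same register
    return hazards


if __name__ == "__main__":
    div = ("F0", ("F2", "F4"))     # F0 <- F2 / F4
    add = ("F6", ("F0", "F8"))     # F6 <- F0 + F8
    sub = ("F8", ("F10", "F14"))   # F8 <- F10 - F14
    print(classify_hazards(div, add), classify_hazards(add, sub))
```

With in-order execution only RAW matters; WAR and WAW become real hazards once the second instruction is allowed to run ahead of the first.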
Module 4: Instruction Level Parallelism
In SW parallelism, the dependences defined by the program result in
hazards if the HW cannot resolve them
HW exploiting ILP works when dependences
cannot be determined at compile time
These hardware techniques to exploit ILP are referred to as dynamic scheduling techniques
Module 4: Instruction Level Parallelism
Dynamic scheduling is accomplished by
dividing the ID stage into two parts:
Issue the instruction in-order
Read operands out-of-order
Structural and data dependences are checked
at the ID stage
It facilitates out-of-order execution, which
results in out-of-order completion
Recap: ILP – Dynamic Scheduling
We discussed scoreboarding and
Tomasulo’s algorithm as the basic
concepts for dynamic scheduling in
integer and floating-point datapaths
The structures implementing these
concepts facilitate out-of-order execution
to minimize the effect of data dependences, thus avoiding data hazards without stalls
High performance without special compilers
Here, the control and buffers are distributed
with the Function Units (FU)
or pointers to reservation stations (RS); i.e., the registers are renamed
Unlike the scoreboard, Tomasulo can have multiple loads outstanding
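The effect of renaming can be illustrated with a deliberately simplified Python sketch. It is not the full Tomasulo algorithm (there are no reservation-station buffers, no common data bus and no timing); it only shows that giving every destination a fresh tag leaves the true (RAW) dependences in place while WAR and WAW name conflicts disappear. The tag names RS1, RS2, ... and the instruction encoding are hypothetical.

```python
from itertools import count


def rename(program):
    """Rename destinations to fresh tags so only true dependences remain.

    program is a list of (dest, sources) pairs using architectural names.
    """
    fresh = count(1)
    latest = {}    # architectural register -> tag of its newest producer
    renamed = []
    for dest, sources in program:
        # Sources read the newest producer's tag, or the original register.
        new_sources = tuple(latest.get(s, s) for s in sources)
        tag = f"RS{next(fresh)}"
        latest[dest] = tag
        renamed.append((tag, new_sources))
    return renamed


if __name__ == "__main__":
    program = [
        ("F0", ("F2", "F4")),    # DIV
        ("F6", ("F0", "F8")),    # ADD: RAW on F0
        ("F8", ("F10", "F14")),  # SUB: WAR on F8 with ADD
        ("F6", ("F10", "F8")),   # MUL: WAW on F6 with ADD, RAW on F8 from SUB
    ]
    for instruction in rename(program):
        print(instruction)
```

After renaming, the ADD still waits for the DIV result and the MUL for the SUB result, but the SUB and MUL no longer clash with the ADD over the names F8 and F6.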
We also discussed branch-prediction
techniques and different types of
branch predictors, used to reduce the number of
stalls due to control hazards
The concept of multiple instruction issue
was discussed in detail
This concept is used to reduce the CPI to less than one; thus, the performance of the
processor is enhanced
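As one example of a branch predictor, the Python sketch below models a 2-bit saturating counter, a commonly described scheme; the lecture text here does not say which predictors were covered, so the choice of scheme, the class name and the branch history are assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a branch history and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # Hypothetical loop branch: taken nine times, then falls through, run twice.
    loop = [True] * 9 + [False]
    print(count_mispredictions(loop * 2, TwoBitPredictor()))
```

Once warmed up, the counter mispredicts such a loop branch only at the loop exit, because a single not-taken outcome does not flip the prediction.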
We studied extensions to Tomasulo’s structure that include hardware-based
speculation
It allows the processor to speculate that a branch is
correctly predicted, and thus to execute out-of-order
but commit in-order, having confirmed that the speculation is correct and no
exceptions exist
Module 4: Instruction Level Parallelism
The major hardware-based techniques studied are summarized here:
Technique                                 Hazard-type stalls reduced
… bypass                                  Data hazard stalls
… and branch scheduling                   Control hazard stalls
… scheduling (scoreboarding)              … hazard stalls from …
… with renaming (Tomasulo’s approach)     Stalls from data hazards, from anti-dependences and from output dependences
Module 5: Static Approach for ILP
Processors that issue multiple instructions per cycle are rated as
high-performance processors
These processors exist in a variety of
flavors, such as:
Module 5: Static Approach for ILP
Superscalar processors exploit ILP
using static as well as dynamic scheduling approaches
VLIW processors, on the other hand, exploit ILP using static scheduling only
The major software scheduling techniques discussed to reduce data and control stalls are as follows:
Introduction to Static Scheduling in ILP
Technique                                 Hazard-type stalls reduced
Basic compiler …                          Data …
Module 6: Memory Hierarchy System
Here, we discussed how the gap between the speed of the processor and that of the storage devices (DRAM, SRAM and disk) is increasing with time
We studied that, in order to obtain high-speed storage at the cheapest cost per byte, different types of memory modules are organized in a
hierarchy, based on the:
Concept of caching and
Principle of locality
Module 6: Memory Hierarchy System
The principle of locality states that, to obtain the data or instructions of a program, the
processor accesses, at any instant of time,
a relatively small portion of the address space,
which can be held in the fastest memory closest to the processor
There are two different types of locality:
Temporal locality is locality in time: a recently accessed item is likely to be accessed again soon
Spatial locality is locality in space: items near a recently accessed address are likely to be accessed soon
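Both kinds of locality show up as a low miss rate in a cache. The Python sketch below simulates a small direct-mapped cache; the placement policy, the cache size (8 blocks of 4 words) and the access patterns are assumptions chosen for illustration.

```python
def miss_rate(addresses, num_blocks=8, block_words=4):
    """Miss rate of a direct-mapped cache for a sequence of word addresses."""
    cache = [None] * num_blocks        # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // block_words    # memory block holding this word
        line = block % num_blocks      # direct-mapped placement
        if cache[line] != block:
            misses += 1
            cache[line] = block
    return misses / len(addresses)


if __name__ == "__main__":
    sequential = list(range(32))         # spatial locality: neighbours share a block
    repeated = [0, 1, 2, 3] * 10         # temporal locality: same words reused
    strided = list(range(0, 320, 32))    # no locality: every access maps to line 0
    print(miss_rate(sequential), miss_rate(repeated), miss_rate(strided))
```

The sequential scan misses once per block, the repeated loop misses only on its first access, and the strided pattern misses every time.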
Module 6: Memory Hierarchy System
The concept of caching states that a small storage, the fastest and most expensive in the hierarchy, be used as a
staging area or temporary place to:
– store a frequently used subset of the data or
instructions from the relatively cheaper, larger and slower memory; and
– avoid having to go to the main memory
every time this information is needed
The performance of a cache is limited by different types of penalties
Module 6: Memory Hierarchy System
Then we talked about four options to improve cache performance
These options are used to reduce:
─ the miss penalty
─ the miss rate
─ the miss penalty or miss rate via
parallelism
─ the time to hit in the cache
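The options above correspond to the terms of the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty. The Python sketch below uses that formula with hypothetical cycle counts to show the effect of each kind of reduction.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical baseline: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
    print("baseline:", amat(1, 0.05, 100))
    print("lower miss penalty:", amat(1, 0.05, 50))
    print("lower miss rate:", amat(1, 0.02, 100))
    print("faster hit:", amat(0.5, 0.05, 100))
```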