Advanced Computer Architecture, Lecture 30: Memory Hierarchy Design. This lecture covers cache performance enhancement by reducing the miss rate: a recap of miss-penalty techniques, the classification of cache misses, larger blocks, larger caches, higher associativity, way prediction and pseudo-associativity, and compiler optimizations.
CS 704
Advanced Computer Architecture
Lecture 30
Memory Hierarchy Design
Cache Performance Enhancement
(Reducing Miss Rate)
Today's Topics
Recap: Reducing Miss Penalty
Classification of Cache Misses
Reducing Cache Miss Rate
Summary
MAC/VU-Advanced Computer Architecture, Lecture 30: Memory Hierarchy (6)
Recap: Improving Cache Performance
Cache performance can be improved by reducing:
─ The miss penalty
─ The miss rate
─ The miss penalty or miss rate via parallelism
─ The time to hit in the cache
Recap: Reducing Miss Penalty
– Multilevel Caches
– Critical Word First and Early Restart
– Priority to Read Misses over Writes
– Merging Write Buffers
– Victim Caches
Recap: Reducing Miss Penalty
'Multilevel caches': the more, the merrier. Adding second- and third-level caches reduces the penalty of a first-level miss.
Recap: Reducing Miss Penalty
'Critical Word First and Early Restart': intolerance. Reduces the miss penalty by fetching the requested word first and letting the CPU resume as soon as it arrives.
Recap: Reducing Miss Penalty
'Priority to read misses over writes': favoritism. A read miss is served before buffered writes, since the CPU is stalled waiting for the read.
Recap: Reducing Miss Penalty
'Merging write buffer': acquaintance. Writes to the same block are combined into one entry of the write buffer.
'Victim cache': salvage. A small fully associative buffer holds recently evicted blocks in case they are needed again.
Recap: Reducing Miss Penalty
These techniques reduce the miss penalty; multilevel caches also reduce the miss rate, since the second-level cache catches many first-level misses. Today we classify cache misses and study methods to reduce the miss rate.
Cache Misses - Classification
Cache misses fall into three categories (the "three C's"): compulsory misses (the first reference to a block), capacity misses (the cache cannot hold all the blocks needed), and conflict misses.
Cache Misses - Classification
Conflict Miss
Occurs when too many blocks map to the same set, so blocks evict one another even though the cache is not full.
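The mapping behind conflict misses can be sketched in a few lines. The cache geometry below is an illustrative assumption, not from the lecture:

```python
# Sketch: in a direct-mapped cache every block address maps to exactly one
# set, so addresses that share an index evict each other repeatedly.
BLOCK_SIZE = 32      # bytes per block (assumed)
NUM_SETS = 128       # direct-mapped: one block per set (assumed)

def cache_index(addr):
    # block number modulo number of sets selects the (only) candidate set
    return (addr // BLOCK_SIZE) % NUM_SETS

# Two addresses NUM_SETS * BLOCK_SIZE = 4096 bytes apart collide:
a, b = 0x0000, 0x1000
print(cache_index(a), cache_index(b))   # both map to set 0
```

Accesses alternating between `a` and `b` miss every time in this cache, even though only two blocks are in use.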
Reducing Miss Rate
1 Larger Block Size
2 Larger Caches
3 Higher Associativity
4 Way Prediction and Pseudo-associativity
5 Compiler Optimization
1: Larger Block Size
Larger blocks reduce the miss rate by exploiting spatial locality: each miss brings in more nearby data or instructions.
1: Larger Block Size
In a small cache, larger blocks may increase conflict and capacity misses, since fewer blocks fit in the cache; larger blocks also increase the miss penalty.
1: Larger Block Size: Example
Assumptions:
– Memory has 80 clock cycles of overhead
– Memory delivers 16 bytes every 2 clock cycles
– Hit time of 1 clock cycle
1: Larger Block Size: Solution
Average memory access time
= Hit time + Miss rate x Miss penalty
For the 4KB cache with 32-byte blocks, the miss rate = 7.24%
Miss penalty = 80 + (32/16) x 2 = 80 + 4 = 84 clocks
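The arithmetic can be checked with a short sketch. The 32-byte block size is inferred here from the 80 + 4 penalty; the other numbers come from the slide:

```python
# Worked check of the average-memory-access-time example above.
hit_time = 1            # clock cycles (from the slide)
miss_rate = 0.0724      # 7.24% for the 4KB cache (from the slide)
overhead = 80           # cycles before the first 16 bytes arrive
block_bytes = 32        # inferred from the stated penalty of 80 + 4

# 2 cycles per 16 bytes transferred, on top of the fixed overhead:
miss_penalty = overhead + (block_bytes // 16) * 2   # 80 + 4 = 84 cycles

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat:.2f} clock cycles")   # about 7.08
```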
1: Larger Block Size: Solution
See Table 5.18, p. 428, for the miss rates at each cache and block size.
1: Larger Block Size: Solution
The best block size depends on:
– The latency and bandwidth of the lower-level memory
– High latency and high bandwidth favor larger blocks; low latency and low bandwidth favor smaller ones
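This trade-off can be sketched numerically. The high-latency/high-bandwidth memory below reuses the earlier example's numbers (80-cycle overhead, 16 bytes per 2 cycles); the low-latency/low-bandwidth memory is an illustrative assumption:

```python
# Sketch: miss penalty = fixed overhead (latency) + transfer time (bandwidth).
def miss_penalty_cycles(block_bytes, overhead, cycles_per_16_bytes):
    return overhead + (block_bytes // 16) * cycles_per_16_bytes

for block in (16, 32, 64, 128, 256):
    # high latency, high bandwidth (as in the earlier example)
    high_lat = miss_penalty_cycles(block, overhead=80, cycles_per_16_bytes=2)
    # low latency, low bandwidth (assumed for contrast)
    low_lat = miss_penalty_cycles(block, overhead=20, cycles_per_16_bytes=8)
    print(f"{block:3d}-byte block: {high_lat} vs {low_lat} cycles")
```

With high latency and high bandwidth, doubling the block size barely increases the penalty (84 vs 80 cycles at 32 bytes), so larger blocks pay off; when transfer time dominates, large blocks become expensive.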
2: Larger Cache Size
Larger caches reduce capacity misses. In 2001, this was the popular approach for 2nd-level and 3rd-level caches.
Drawbacks:
– Longer hit time (access time)
– Higher cost
3: Higher Associativity
Higher associativity reduces conflict misses; the drawback is a longer hit time, since more tags must be compared and a block selected.
4: Way Prediction and Pseudo-associativity
Goal: combine the fast hit time of direct-mapped caches with the lower conflict-miss rate of set-associative caches.
4: Way Prediction and Pseudo-associativity
Way Prediction: the cache predicts which block (way) within the set the next access will hit.
Steps:
1. Extra bits are kept per set to predict the way to probe first.
4: Way Prediction and Pseudo-associativity
2. The multiplexer is set early to select the predicted block, so only a single tag comparison is needed in that clock cycle (the same idea applies to 2-way and 4-way prediction).
4: Way Prediction and Pseudo-associativity
On a misprediction, the other blocks of the set are checked for a match in subsequent clock cycles. In the Alpha 21264, a correct way prediction gives an instruction-cache latency of 1 clock cycle; a misprediction costs 3 clock cycles.
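The mechanism can be sketched for one 2-way set. The class structure, the fill policy, and the latencies-as-return-values are illustrative assumptions; only the 1-cycle/3-cycle figures follow the 21264 numbers above:

```python
# Minimal sketch of way prediction in one set of a 2-way cache: a predictor
# bit picks which way to probe first; a wrong guess costs extra cycles.
class WayPredictedSet:
    def __init__(self):
        self.tags = [None, None]   # the two ways of this set
        self.predicted = 0         # extra bit: which way to try first

    def access(self, tag):
        first = self.predicted
        if self.tags[first] == tag:
            return "hit", 1                # fast hit: predicted way matched
        other = 1 - first
        if self.tags[other] == tag:
            self.predicted = other         # retrain the predictor
            return "hit", 3                # slow hit: checked the other way
        self.tags[other] = tag             # simple fill on miss (not LRU)
        self.predicted = other
        return "miss", 3
```

A repeated access to the same block costs 1 cycle; alternating between two conflicting blocks keeps paying the 3-cycle mispredict latency.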
4: Way Prediction and Pseudo-associativity
Pseudo-associative (also called column-associative) caches: the access proceeds as in a direct-mapped cache; on a miss, a second cache entry, the "pseudo-set", is checked before going to the next memory level.
4: Way Prediction and Pseudo-associativity
Pseudo-associative caches
Performance: a hit in the pseudo-set is a "slower hit". If too many fast hits turn into slow hits, the advantage over a plain direct-mapped cache is lost.
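The probe order can be sketched in a couple of lines. Inverting the most-significant index bit is the common way to form the pseudo-set; the lecture does not specify the exact rehash function, and the cache size below is an assumption:

```python
# Sketch: pseudo-associative lookup probes the direct-mapped set first,
# then the "pseudo-set" found by inverting the most-significant index bit.
NUM_SETS = 8              # illustrative; must be a power of two
MSB = NUM_SETS >> 1       # mask for the most-significant index bit

def probe_order(index):
    # a hit in the second set is the "slower hit" described above
    return [index, index ^ MSB]

print(probe_order(2))   # [2, 6]
print(probe_order(6))   # [6, 2]
```

Note the symmetry: each set is the pseudo-set of exactly one other set, so two conflicting blocks can coexist where a direct-mapped cache would thrash.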
5: Compiler Optimization
Instruction-side code optimizations:
– Reordering procedures
– Using cache-line alignment
The cache-line alignment method decreases instruction-cache misses by placing each code block's entry point at the beginning of a cache block.
5: Compiler Optimization: Loop Interchange
• Consider a program whose nested loops access array data in non-sequential order: the outer loop runs over j (0..100) and the inner loop over i (0..5000), so consecutive accesses are far apart in memory. Interchanging the two loops makes the accesses sequential.
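The transformation can be sketched as follows, using a flat row-major array so the access stride is visible. The array shape follows the lecture's 5000 x 100 example; the function names are illustrative:

```python
# Loop interchange sketch: both functions double every element of a
# rows x cols row-major array; only the traversal order differs.
def scale_strided(x, rows, cols):
    """j outer, i inner: stride-`cols` accesses (poor spatial locality)."""
    y = list(x)
    for j in range(cols):
        for i in range(rows):
            y[i * cols + j] = 2 * y[i * cols + j]
    return y

def scale_sequential(x, rows, cols):
    """i outer, j inner: stride-1 accesses (good spatial locality)."""
    y = list(x)
    for i in range(rows):
        for j in range(cols):
            y[i * cols + j] = 2 * y[i * cols + j]
    return y
```

Both versions compute the same result; the interchanged version touches memory sequentially, so every word of a fetched cache block is used before the block is evicted.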
5: Compiler Optimization: Using Blocking
Arrays may be stored in 'row-major order' (row by row) or 'column-major order' (column by column). The iteration pattern of matrix multiplication touches both rows and columns, so loop interchange alone cannot make every access sequential.
Whereas, if the cache can hold one N x N matrix and one row of N elements, then the full matrix Z and one (the i-th) row of Y can stay in the cache.
5: Compiler Optimization: Using Blocking
B is chosen so that one B x B submatrix and one row of B elements fit in the cache. This keeps the y and z blocks resident in the cache. In the modified code, the two inner loops compute in steps of size B (the blocking factor) rather than over the full N x N size of arrays X and Z.
Summary
The way-prediction technique checks one section of the cache for a hit and then, on a miss, checks the rest of the cache. The final techniques, loop interchange and blocking, are software approaches to optimizing cache performance.
Next time we will talk about ways to enhance performance by having the processor…
Example: Average Memory Access Time vs. Miss Rate
Example: assume the clock cycle time (CCT) is 1.10 times the direct-mapped CCT for a 2-way cache, 1.12 times for 4-way, and 1.14 times for 8-way.
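The trade-off in this example can be sketched numerically. The CCT factors come from the slide; the miss rates and the 25-cycle miss penalty below are illustrative assumptions, not slide data:

```python
# Sketch of AMAT vs. associativity: higher associativity lowers the miss
# rate but stretches the clock cycle (and hence the hit time).
MISS_PENALTY = 25  # clock cycles (assumed)
CONFIGS = {        # associativity: (CCT vs. direct mapped, assumed miss rate)
    "1-way": (1.00, 0.070),
    "2-way": (1.10, 0.055),
    "4-way": (1.12, 0.049),
    "8-way": (1.14, 0.046),
}

def amat(cct, miss_rate, penalty=MISS_PENALTY):
    # AMAT = hit time + miss rate x miss penalty, hit time = one (scaled) cycle
    return cct + miss_rate * penalty

for name, (cct, mr) in CONFIGS.items():
    print(f"{name}: AMAT = {amat(cct, mr):.3f} clock cycles")
```

With these assumed numbers the higher associativities win; with a much smaller miss penalty, the slower clock of the associative caches can dominate and direct mapped comes out ahead.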
Trang 51Allah Hafiz