Advanced Computer Architecture, Lecture 30: Memory Hierarchy Design. This lecture covers cache performance enhancement by reducing the miss rate: a recap of miss-penalty techniques, the classification of cache misses, larger blocks, larger caches, higher associativity, way prediction and pseudo-associativity, and compiler optimizations.
CS 704
Advanced Computer Architecture
Lecture 30
Memory Hierarchy Design
Cache Performance Enhancement
(Reducing Miss Rate)
Today's Topics
Recap: Reducing Miss Penalty
Classification of Cache Misses
Reducing Cache Miss Rate
Summary
MAC/VU-Advanced Computer Architecture, Lecture 30: Memory Hierarchy (6)
Recap: Improving Cache Performance
Cache performance can be improved by reducing:
─ The miss penalty
─ The miss rate
─ The miss penalty or miss rate via parallelism
─ The time to hit in the cache
Recap: Reducing Miss Penalty
– Multilevel Caches
– Critical Word First and Early Restart
– Priority to Read Misses over Writes
– Merging Write Buffers
– Victim Caches
Recap: Reducing Miss Penalty
'Multilevel caches': the more, the merrier. Adding second- and third-level caches reduces the penalty of a first-level miss.
Recap: Reducing Miss Penalty
'Critical Word First and Early Restart': intolerance. Reduces the miss penalty by fetching the requested word first and letting the CPU resume as soon as it arrives.
Recap: Reducing Miss Penalty
'Priority to read misses over writes': favoritism. A read miss is served before buffered writes, since the CPU is stalled waiting for the read.
Recap: Reducing Miss Penalty
'Merging write buffer': acquaintance. Writes to the same block are combined into one entry of the write buffer.
'Victim cache': salvage. A small fully associative buffer holds recently evicted blocks in case they are needed again.
Recap: Reducing Miss Penalty
These techniques reduce the miss penalty; multilevel caches also reduce the miss rate, since the second-level cache catches many first-level misses. Today we classify cache misses and study methods to reduce the miss rate.
Cache Misses - Classification
Cache misses fall into three categories (the "three C's"): compulsory misses (the first reference to a block), capacity misses (the cache cannot hold all the blocks needed), and conflict misses.
Cache Misses - Classification
Conflict Miss
Occurs when too many blocks map to the same set, so blocks evict one another even though the cache is not full.
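The mapping behind conflict misses can be sketched in a few lines. The cache geometry below is an illustrative assumption, not from the lecture:

```python
# Sketch: in a direct-mapped cache every block address maps to exactly one
# set, so addresses that share an index evict each other repeatedly.
BLOCK_SIZE = 32      # bytes per block (assumed)
NUM_SETS = 128       # direct-mapped: one block per set (assumed)

def cache_index(addr):
    # block number modulo number of sets selects the (only) candidate set
    return (addr // BLOCK_SIZE) % NUM_SETS

# Two addresses NUM_SETS * BLOCK_SIZE = 4096 bytes apart collide:
a, b = 0x0000, 0x1000
print(cache_index(a), cache_index(b))   # both map to set 0
```

Accesses alternating between `a` and `b` miss every time in this cache, even though only two blocks are in use.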
Reducing Miss Rate
1 Larger Block Size
2 Larger Caches
3 Higher Associativity
4 Way Prediction and Pseudo-associativity
5 Compiler Optimization
1: Larger Block Size
Larger blocks reduce the miss rate by exploiting spatial locality: each miss brings in more nearby data or instructions.
1: Larger Block Size
In a small cache, larger blocks may increase conflict and capacity misses, since fewer blocks fit in the cache; larger blocks also increase the miss penalty.
1: Larger Block Size: Example
Assumptions:
– Memory has 80 clock cycles of overhead
– Memory delivers 16 bytes every 2 clock cycles
– Hit time of 1 clock cycle
1: Larger Block Size: Solution
Average memory access time
= Hit time + Miss rate x Miss penalty
For the 4KB cache with 32-byte blocks, the miss rate = 7.24%
Miss penalty = 80 + (32/16) x 2 = 80 + 4 = 84 clocks
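The arithmetic can be checked with a short sketch. The 32-byte block size is inferred here from the 80 + 4 penalty; the other numbers come from the slide:

```python
# Worked check of the average-memory-access-time example above.
hit_time = 1            # clock cycles (from the slide)
miss_rate = 0.0724      # 7.24% for the 4KB cache (from the slide)
overhead = 80           # cycles before the first 16 bytes arrive
block_bytes = 32        # inferred from the stated penalty of 80 + 4

# 2 cycles per 16 bytes transferred, on top of the fixed overhead:
miss_penalty = overhead + (block_bytes // 16) * 2   # 80 + 4 = 84 cycles

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat:.2f} clock cycles")   # about 7.08
```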
1: Larger Block Size: Solution
See Table 5.18, p. 428, for the miss rates at each cache and block size.
1: Larger Block Size: Solution
The best block size depends on:
– The latency and bandwidth of the lower-level memory
– High latency and high bandwidth favor larger blocks; low latency and low bandwidth favor smaller ones
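This trade-off can be sketched numerically. The high-latency/high-bandwidth memory below reuses the earlier example's numbers (80-cycle overhead, 16 bytes per 2 cycles); the low-latency/low-bandwidth memory is an illustrative assumption:

```python
# Sketch: miss penalty = fixed overhead (latency) + transfer time (bandwidth).
def miss_penalty_cycles(block_bytes, overhead, cycles_per_16_bytes):
    return overhead + (block_bytes // 16) * cycles_per_16_bytes

for block in (16, 32, 64, 128, 256):
    # high latency, high bandwidth (as in the earlier example)
    high_lat = miss_penalty_cycles(block, overhead=80, cycles_per_16_bytes=2)
    # low latency, low bandwidth (assumed for contrast)
    low_lat = miss_penalty_cycles(block, overhead=20, cycles_per_16_bytes=8)
    print(f"{block:3d}-byte block: {high_lat} vs {low_lat} cycles")
```

With high latency and high bandwidth, doubling the block size barely increases the penalty (84 vs 80 cycles at 32 bytes), so larger blocks pay off; when transfer time dominates, large blocks become expensive.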
2: Larger Cache Size
Larger caches reduce capacity misses. In 2001, this was the popular approach for 2nd-level and 3rd-level caches.
Drawbacks:
– Longer hit time (access time)
– Higher cost
3: Higher Associativity
Higher associativity reduces conflict misses; the drawback is a longer hit time, since more tags must be compared and a block selected.
4: Way Prediction and Pseudo-associativity
Goal: combine the fast hit time of direct-mapped caches with the lower conflict-miss rate of set-associative caches.
4: Way Prediction and Pseudo-associativity
Way Prediction: the cache predicts which block (way) within the set the next access will hit.
Steps:
1. Extra bits are kept per set to predict the way to probe first.
4: Way Prediction and Pseudo-associativity
2. The multiplexer is set early to select the predicted block, so only a single tag comparison is needed in that clock cycle (the same idea applies to 2-way and 4-way prediction).
4: Way Prediction and Pseudo-associativity
On a misprediction, the other blocks of the set are checked for a match in subsequent clock cycles. In the Alpha 21264, a correct way prediction gives an instruction-cache latency of 1 clock cycle; a misprediction costs 3 clock cycles.
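The mechanism can be sketched for one 2-way set. The class structure, the fill policy, and the latencies-as-return-values are illustrative assumptions; only the 1-cycle/3-cycle figures follow the 21264 numbers above:

```python
# Minimal sketch of way prediction in one set of a 2-way cache: a predictor
# bit picks which way to probe first; a wrong guess costs extra cycles.
class WayPredictedSet:
    def __init__(self):
        self.tags = [None, None]   # the two ways of this set
        self.predicted = 0         # extra bit: which way to try first

    def access(self, tag):
        first = self.predicted
        if self.tags[first] == tag:
            return "hit", 1                # fast hit: predicted way matched
        other = 1 - first
        if self.tags[other] == tag:
            self.predicted = other         # retrain the predictor
            return "hit", 3                # slow hit: checked the other way
        self.tags[other] = tag             # simple fill on miss (not LRU)
        self.predicted = other
        return "miss", 3
```

A repeated access to the same block costs 1 cycle; alternating between two conflicting blocks keeps paying the 3-cycle mispredict latency.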
4: Way Prediction and Pseudo-associativity
Pseudo-associative (also called column-associative) caches: the access proceeds as in a direct-mapped cache; on a miss, a second cache entry, the "pseudo-set", is checked before going to the next memory level.
4: Way Prediction and Pseudo-associativity
Pseudo-associative caches
Performance: a hit in the pseudo-set is a "slower hit". If too many fast hits turn into slow hits, the advantage over a plain direct-mapped cache is lost.
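The probe order can be sketched in a couple of lines. Inverting the most-significant index bit is the common way to form the pseudo-set; the lecture does not specify the exact rehash function, and the cache size below is an assumption:

```python
# Sketch: pseudo-associative lookup probes the direct-mapped set first,
# then the "pseudo-set" found by inverting the most-significant index bit.
NUM_SETS = 8              # illustrative; must be a power of two
MSB = NUM_SETS >> 1       # mask for the most-significant index bit

def probe_order(index):
    # a hit in the second set is the "slower hit" described above
    return [index, index ^ MSB]

print(probe_order(2))   # [2, 6]
print(probe_order(6))   # [6, 2]
```

Note the symmetry: each set is the pseudo-set of exactly one other set, so two conflicting blocks can coexist where a direct-mapped cache would thrash.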
5: Compiler Optimization
Instruction-side code optimizations:
– Reordering procedures
– Using cache-line alignment
The cache-line alignment method decreases instruction-cache misses by placing each code block's entry point at the beginning of a cache block.
5: Compiler Optimization: Loop Interchange
• Consider a program whose nested loops access array data in non-sequential order: the outer loop runs over j (0..100) and the inner loop over i (0..5000), so consecutive accesses are far apart in memory. Interchanging the two loops makes the accesses sequential.
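The transformation can be sketched as follows, using a flat row-major array so the access stride is visible. The array shape follows the lecture's 5000 x 100 example; the function names are illustrative:

```python
# Loop interchange sketch: both functions double every element of a
# rows x cols row-major array; only the traversal order differs.
def scale_strided(x, rows, cols):
    """j outer, i inner: stride-`cols` accesses (poor spatial locality)."""
    y = list(x)
    for j in range(cols):
        for i in range(rows):
            y[i * cols + j] = 2 * y[i * cols + j]
    return y

def scale_sequential(x, rows, cols):
    """i outer, j inner: stride-1 accesses (good spatial locality)."""
    y = list(x)
    for i in range(rows):
        for j in range(cols):
            y[i * cols + j] = 2 * y[i * cols + j]
    return y
```

Both versions compute the same result; the interchanged version touches memory sequentially, so every word of a fetched cache block is used before the block is evicted.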
5: Compiler Optimization: Using Blocking
Arrays may be stored in 'row-major order' (row by row) or 'column-major order' (column by column). The iteration pattern of matrix multiplication touches both rows and columns, so loop interchange alone cannot make every access sequential.
Whereas, if the cache can hold one N x N matrix and one row of N elements, then the full matrix Z and one (the i-th) row of Y can stay in the cache.
5: Compiler Optimization: Using Blocking
B is chosen so that one B x B submatrix and one row of B elements fit in the cache. This keeps the y and z blocks resident in the cache. In the modified code, the two inner loops compute in steps of size B (the blocking factor) rather than over the full N x N size of arrays X and Z.
Summary
The way-prediction technique checks one section of the cache for a hit and then, on a miss, checks the rest of the cache. The final techniques, loop interchange and blocking, are software approaches to optimizing cache performance.
Next time we will talk about ways to enhance performance by having the processor…
Example: Average Memory Access Time vs. Miss Rate
Example: assume the clock cycle time (CCT) is 1.10 times the direct-mapped CCT for a 2-way cache, 1.12 times for 4-way, and 1.14 times for 8-way.
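The trade-off in this example can be sketched numerically. The CCT factors come from the slide; the miss rates and the 25-cycle miss penalty below are illustrative assumptions, not slide data:

```python
# Sketch of AMAT vs. associativity: higher associativity lowers the miss
# rate but stretches the clock cycle (and hence the hit time).
MISS_PENALTY = 25  # clock cycles (assumed)
CONFIGS = {        # associativity: (CCT vs. direct mapped, assumed miss rate)
    "1-way": (1.00, 0.070),
    "2-way": (1.10, 0.055),
    "4-way": (1.12, 0.049),
    "8-way": (1.14, 0.046),
}

def amat(cct, miss_rate, penalty=MISS_PENALTY):
    # AMAT = hit time + miss rate x miss penalty, hit time = one (scaled) cycle
    return cct + miss_rate * penalty

for name, (cct, mr) in CONFIGS.items():
    print(f"{name}: AMAT = {amat(cct, mr):.3f} clock cycles")
```

With these assumed numbers the higher associativities win; with a much smaller miss penalty, the slower clock of the associative caches can dominate and direct mapped comes out ahead.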
Trang 51Allah Hafiz