Advanced Computer Architecture - Lecture 29: Memory Hierarchy Design. This lecture covers cache performance enhancement by reducing the cache miss penalty: the CPU execution time equation, a review of cache performance, and techniques for reducing the miss penalty.
CS 704
Advanced Computer Architecture
Lecture 29
Memory Hierarchy Design
Cache Performance Enhancement by:
Reducing Cache Miss Penalty
Prof Dr M Ashraf Chughtai
MAC/VU-Advanced
Recap: Memory Hierarchy Designer's Concerns
Block placement: Where can a block be placed in the upper level?
Block identification: How is a block found if it is in the upper level?
Block replacement: Which block should be replaced on a miss?
Write strategy: What happens on a write?
Recap: Write Buffer for Write Through
Cache write strategies:
– write back
– write through
– use of a write buffer
Recap: Write Buffer for Write Through
A level-2 cache is introduced between the level-1 cache and the DRAM main memory.
Write miss policies:
– Write Allocate, and
– No-Write Allocate
Recap: Write Miss Policies
Write Allocate:
– A block is allocated in the cache on a write miss, i.e., the block to be written is available in the cache
No-Write Allocate:
– The blocks stay out of the cache until the program tries to read them; i.e., the block is modified only in the lower-level memory
Impact of Caches on CPU Performance
CPU Execution Time equation:
CPU Time = (CPU execution clock cycles + Memory stall cycles) × Clock cycle time
Impact of Caches on CPU Performance: Example
CPU Time = (CPU execution clock cycles + Memory stall cycles) × Clock cycle time
CPU Time with cache (including cache misses)
Cache Performance (Review)
Number of misses, or miss rate
Cost per miss, or miss penalty
Memory stall clock cycles equal the sum of:
IC × Reads per inst. × Read miss rate × Read miss penalty; and
IC × Writes per inst. × Write miss rate × Write miss penalty
Cache Performance (Review)
Memory stall clock cycles =
Number of reads × read miss rate × read miss penalty +
Number of writes × write miss rate × write miss penalty
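The stall-cycle formula above can be sketched as a small Python helper. The function name and the workload figures below are hypothetical, chosen only to illustrate the arithmetic:

```python
def memory_stall_cycles(ic, reads_per_inst, read_miss_rate, read_penalty,
                        writes_per_inst, write_miss_rate, write_penalty):
    """Memory stall clock cycles = read-miss stalls + write-miss stalls."""
    read_stalls = ic * reads_per_inst * read_miss_rate * read_penalty
    write_stalls = ic * writes_per_inst * write_miss_rate * write_penalty
    return read_stalls + write_stalls

# Hypothetical workload: 1M instructions, 1.0 reads and 0.3 writes per
# instruction, 2% read / 4% write miss rate, 100-cycle miss penalty.
stalls = memory_stall_cycles(1_000_000, 1.0, 0.02, 100, 0.3, 0.04, 100)
```

Separating the read and write terms this way mirrors the IC-based form of the formula on the previous slide.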
Cache Performance (Review)
Note that the average memory access time is an indirect measure of CPU performance and is not a substitute for execution time.
However, this formula can help decide between split caches (i.e., separate instruction and data caches) and a unified cache.
E.g., to find out which of these two organizations has the lower miss rate, we can use this formula as follows:
Cache Performance: Example
Statement: Consider a 32KB unified cache with 43.3 misses per 1000 instructions, and instruction/data split caches of 16KB each, with 3.82 instruction-cache misses per 1000 instructions and 40.9 data-cache misses per 1000 instructions;
Cache Performance: Example
– a hit takes 1 clock cycle and the miss penalty is 100 clock cycles; and
– a load or store takes one extra clock cycle on the unified cache
Assuming write-through caches with a write buffer, and ignoring stalls due to the write buffer, find the average memory access time in each case.
Note: to solve this problem we first find the miss rates and then the average memory access times.
Cache Performance: Solution
1: Miss Rate = (Misses per 1000 instructions / 1000) / (Memory accesses per instruction)
Miss Rate (16KB inst) = (3.82/1000) / 1.0 = 0.0038
Miss Rate (16KB data) = (40.9/1000) / 0.36 = 0.114
As about 74% of the memory accesses are instruction fetches, the overall miss rate for the split caches = (74% × 0.0038) + (26% × 0.114) = 0.0324
Miss Rate (32KB unified) = (43.3/1000) / (1 + 0.36) = 0.0318
i.e., the unified cache has a slightly lower miss rate.
Cache Performance: Solution
2: Average Memory Access Time
= %inst × (Hit time + Inst miss rate × Miss penalty) + %data × (Hit time + Data miss rate × Miss penalty)
Average memory access time (split)
= 74% × (1 + 0.0038 × 100) + 26% × (1 + 0.114 × 100) = 4.24
Average memory access time (unified)
= 74% × (1 + 0.0318 × 100) + 26% × (1 + 1 + 0.0318 × 100) = 4.44
i.e., the split caches have a slightly better average access time and also avoid structural hazards.
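The example's arithmetic can be checked with a few lines of Python. The variable names are my own; the miss counts, access fractions, and penalties are the ones in the statement above:

```python
HIT_TIME, MISS_PENALTY = 1, 100    # clock cycles (from the statement)
INST_FRAC, DATA_FRAC = 0.74, 0.26  # instruction vs data share of accesses

mr_inst    = 3.82 / 1000 / 1.0     # 16KB I-cache miss rate  (~0.0038)
mr_data    = 40.9 / 1000 / 0.36    # 16KB D-cache miss rate  (~0.114)
mr_unified = 43.3 / 1000 / 1.36    # 32KB unified miss rate  (~0.0318)

amat_split = INST_FRAC * (HIT_TIME + mr_inst * MISS_PENALTY) \
           + DATA_FRAC * (HIT_TIME + mr_data * MISS_PENALTY)
# Unified cache: a load/store pays one extra cycle (structural hazard).
amat_unified = INST_FRAC * (HIT_TIME + mr_unified * MISS_PENALTY) \
             + DATA_FRAC * (HIT_TIME + 1 + mr_unified * MISS_PENALTY)
```

Rounded to two decimals this reproduces the 4.24 (split) versus 4.44 (unified) figures, confirming that the split organization wins on access time even though its overall miss rate is slightly higher.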
Improving Cache Performance
The average memory access time gives a framework for optimizing cache performance.
The average memory access time formula:
Average memory access time = Hit time + Miss rate × Miss penalty
Four General Options
1. Reduce the miss penalty
2. Reduce the miss rate
3. Reduce the miss penalty or miss rate via parallelism
4. Reduce the time to hit in the cache
Reducing Miss Penalty
1. Multilevel caches
2. Critical word first and early restart
3. Priority to read misses over write misses
4. Merging write buffers
5. Victim caches
1: Multilevel Caches (to reduce Miss Penalty)
This technique ignores the CPU and concentrates on the interface between the cache and main memory.
Multiple levels of caches trade off cache size (cache effectiveness) against cost (access time): a small, fast memory is used as the level-1 cache.
1: Multilevel Caches (Performance Analysis)
The average access time is:
Average memory access time = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
where Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
Therefore,
Average memory access time
= Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
1: Multilevel Caches (Performance Analysis)
Average stalls per instruction =
Misses per instruction_L1 × Hit Time_L2 + Misses per instruction_L2 × Miss Penalty_L2
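The two-level access-time formula can be captured in a one-line helper. This is a sketch; the function and parameter names are invented, and the figures plugged in below are illustrative:

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + LocalMissRate_L2 * MissPenalty_L2)."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2)

# Illustrative figures: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle L2 miss penalty.
amat = amat_two_level(1, 0.04, 10, 0.50, 100)   # 3.4 cycles
```

Note that the L2 term uses the *local* L2 miss rate, since only references that miss in L1 ever reach L2.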
1: Multilevel Caches (to reduce Miss Penalty)
Local miss rate: the number of misses in a cache divided by the total number of memory accesses to that cache.
Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the CPU.
For the L1 cache the global miss rate equals its local miss rate; for the L2 cache the global miss rate is Miss Rate_L1 × Miss Rate_L2.
1: Multilevel Caches (to reduce Miss Penalty)
Example: Find the local and global miss rates, the average memory access time, and the average memory stall cycles per instruction, given that for 1000 memory references there are 40 misses in the L1 cache and 20 in the L2 cache;
Assume:
– miss penalty for L2 cache-to-memory = 100 clock cycles
– hit time for the L2 cache is 10 clock cycles
– hit time for the L1 cache is 1 clock cycle
– memory references per instruction = 1.5
Solution:
Local miss rate for L2 = 20/40 = 50%
Global miss rate for L2 = 20/1000 = 2%
Then,
1: Multilevel Caches (to reduce Miss Penalty)
Average memory access time
= Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1     (1)
where Miss Penalty_L1
= Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2     (2)
Substituting (2) in (1) we get:
Average memory access time
= Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
= 1 + 4% × (10 + 50% × 100) = 3.4 cycles
1: Multilevel Caches (to reduce Miss Penalty)
Average memory stalls per instruction (i.e., miss penalty):
For memory references per instruction = 1.5,
Misses per instruction for L1 = 40 × 1.5 = 60 per 1000 instructions
Misses per instruction for L2 = 20 × 1.5 = 30 per 1000 instructions
Average memory stalls per instruction
= (60/1000) × 10 + (30/1000) × 100
= 0.6 + 3.0 = 3.6 clock cycles
i.e., the average miss penalty using multilevel caches is reduced by a factor of 100/3.6 ≈ 28 relative to a single-level cache.
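The worked example above can be reproduced directly in Python (variable names are mine; all numbers are from the example):

```python
# Per 1000 memory references: 40 misses in L1, 20 misses in L2.
local_miss_rate_l2  = 20 / 40     # 0.5  (L2 misses / accesses that reach L2)
global_miss_rate_l2 = 20 / 1000   # 0.02 (L2 misses / all memory references)

# AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + LocalMissRate_L2 * Penalty_L2)
amat = 1 + (40 / 1000) * (10 + local_miss_rate_l2 * 100)   # 3.4 cycles

# With 1.5 memory references per instruction:
misses_per_1000_inst_l1 = 40 * 1.5   # 60
misses_per_1000_inst_l2 = 20 * 1.5   # 30
stalls_per_inst = (misses_per_1000_inst_l1 / 1000) * 10 \
                + (misses_per_1000_inst_l2 / 1000) * 100   # 0.6 + 3.0 = 3.6
```

The stalls-per-instruction form is often the more useful one, because it plugs straight into the CPU execution time equation from earlier in the lecture.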
1: Multilevel Caches (to reduce Miss Penalty)
[Figure: miss rate versus cache size for multilevel caches; the unified cache is 2-way set associative with LRU replacement]
2: Critical Word First and Early Restart
Don't wait for the full block to be loaded before restarting the CPU.
Explanation: The CPU normally needs one word of a block at a time, so it doesn't have to wait for the full block to be loaded before the requested word is sent and the CPU is restarted.
2: Critical Word First and Early Restart … Cont'd
Early restart: Request the words in a block in normal order; as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: Request the missed word from memory first; the memory sends it to the CPU as soon as it arrives, and the CPU continues execution while the rest of the words in the block are filled in afterwards.
Example: Critical Word First and Early Restart
Consider a computer using 64-byte [8-word] cache blocks.
An L2 cache takes 11 clock cycles to deliver the first 8-byte (critical) word, and then 2 clock cycles per 8-byte word for the rest of the block (and the processor can make 2 issues per clock cycle).
Example: Critical Word First and Early Restart
Calculate the average miss penalty:
1. with critical word first (assuming no other access to the rest of the block)
2. without critical word first (assuming the following instructions read data sequentially, 8-byte words at a time, from the rest of the block; i.e., a full block load is required in this case)
Example: Solution
1: With critical word first:
Average miss penalty
= Miss penalty of the critical word + miss penalty of the remaining words of the block
= 11 × 1 + (8 − 1) × 2 = 11 + 14 = 25 clock cycles
2: Without critical word first (a full block load is required):
= [Miss penalty of the first word + miss penalty of the remaining words of the block] + clock cycles to issue the loads
= [11 × 1 + (8 − 1) × 2] + 8/2 = 25 + 4 = 29 clock cycles
(2 issues per cycle, so 4 cycles for 8 issues)
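The two miss-penalty calculations can be written out as follows (a sketch of the example's arithmetic; the constant names are my own):

```python
FIRST_WORD_CYCLES = 11   # L2 latency for the first (critical) 8-byte word
PER_WORD_CYCLES   = 2    # each further 8-byte word of the block
WORDS_PER_BLOCK   = 8    # 64-byte block
ISSUES_PER_CYCLE  = 2    # the processor makes 2 issues per clock

# 1: with critical word first -- the CPU restarts on the critical word
# while the remaining 7 words stream in behind it.
penalty_cwf = FIRST_WORD_CYCLES + (WORDS_PER_BLOCK - 1) * PER_WORD_CYCLES  # 25

# 2: without critical word first -- full block load, plus the cycles to
# issue the 8 sequential word loads at 2 issues per clock.
penalty_no_cwf = penalty_cwf + WORDS_PER_BLOCK // ISSUES_PER_CYCLE         # 29
```

The benefit here is modest because the follow-on accesses touch the rest of the block anyway; with no further accesses to the block, critical word first would let the CPU resume after just 11 cycles.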
2: Critical Word First and Early Restart … Cont'd
Merit: This technique doesn't require extra hardware.
Drawback: It is generally useful only with large blocks; programs exhibiting spatial locality may face a problem in accessing data or instructions from memory, as the next miss is likely to be to the remainder of the block, which is still being loaded.
3: Priority to Read Misses over Write Misses
This technique reduces the average miss penalty by exploiting the overlap between the CPU and the cache miss penalty.
We have already discussed how the write buffer reduces write stalls in write-through caches: the write buffer ensures that a write to memory does not stall the processor.
Furthermore, the write buffer may hold …
3: Priority to Read Misses over Write Misses
… the updated value of a location needed on a read miss, while the processor is blocked until the read returns. Thus, write buffers do complicate memory access and need consideration.
Let us consider an example program segment and discuss how the complexities of the write buffer can be handled by giving priority to read misses over write misses.
3: Priority to Read Misses over Write Misses
Consider the code:
SW R3, 512(R0)  ; M[512] <- R3   Cache index 0
LW R1, 1024(R0) ; R1 <- M[1024]  Cache index 0
LW R2, 512(R0)  ; R2 <- M[512]   Cache index 0
Assume the code runs on a direct-mapped, write-through cache that maps addresses 512 and 1024 to the same block, with a 4-word write buffer.
Will the value in R2 always equal the value in R3?
3: Priority to Read Misses over Write Misses
Discussion:
Note that this is a case of a RAW data hazard through memory.
Let us see how the cache accesses proceed:
The data in R3 is placed in the write buffer after the first instruction, i.e., after the store to address 512.
The following load instruction (i.e., the 2nd instruction) maps to the same cache index and is therefore a miss.
3: Priority to Read Misses over Write Misses
– The second load instruction (i.e., the 3rd instruction in the sequence) tries to put the value at location 512 into register R2; this also results in a miss.
– If the write buffer hasn't completed writing to location 512 in memory, the read of location 512 will put the old value into the cache and then into R2.
3: Priority to Read Misses over Write Misses
– Thus R3 would not equal R2. This happens because the read is served before the write has completed, i.e.,
Write-through caches with write buffers create RAW conflicts with main-memory reads on cache misses; and
if the processor simply waits for the write buffer to empty, it might increase the read miss penalty (by about 50% on the old MIPS 1000).
3: Priority to Read Misses over Write Misses
The simplest way out of this dilemma is to give priority to the read miss; i.e.,
– either wait for the write buffer to empty,
– or check the write buffer contents on a read miss; if there are no conflicts and the memory system is available, let the read proceed.
Note that by giving priority to read misses, the cost (penalty) of writes in a write-back cache can also be reduced.
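The "check the write buffer on a read miss" option can be sketched with a toy model. This is illustrative Python, not hardware; all names are invented, and the scenario mirrors the SW/LW example above:

```python
class WriteBuffer:
    """Toy write buffer: pending (address -> data) stores awaiting memory."""

    def __init__(self):
        self.pending = {}

    def add_store(self, addr, data):
        self.pending[addr] = data          # e.g. the SW to address 512

    def serve_read_miss(self, addr, memory):
        # Priority to the read miss: rather than draining the whole buffer,
        # check it for a conflicting store and forward the newest value.
        if addr in self.pending:
            return self.pending[addr]      # avoids the RAW hazard
        return memory[addr]                # no conflict: read memory directly

# The example's scenario: the SW to 512 still sits in the buffer
# when the LW from 512 misses in the cache.
memory = {512: 0, 1024: 5}
wb = WriteBuffer()
wb.add_store(512, 7)                       # R3's value, not yet in memory
value = wb.serve_read_miss(512, memory)    # forwards 7, the up-to-date value
```

Forwarding from the buffer gives the read its correct data without waiting for the buffered writes to drain, which is exactly the second option listed above.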
3: Priority to Read Misses over Write Misses
In write-back caches, a read miss may require replacing a dirty block, which is
– normally done by writing the dirty block to memory and then doing the read;
– a better alternative is to copy the dirty block to a write buffer, then do the read, and then do the write.
In this case the CPU stalls less, since it restarts as soon as the read is done; hence the miss penalty is reduced.
Even write-back caches use a simple buffer when a block is replaced.
In the normal mode of operation, the write buffer absorbs writes from the CPU and commits them to memory in the background.
4: Merging Write Buffer
However, the problem, particularly in write-through caches, is that a small write buffer may end up stalling the processor if it fills up, since the processor then has to wait until the writes are committed to memory.
This problem is resolved by merging cache-block entries in the write buffer, because:
4: Merging Write Buffer
– Multiword writes are usually faster than writes performed one word at a time.
– Writes usually modify one word in a block; thus, if the write buffer already contains some words from the given data block, we can merge the current modified word with the block parts already in the buffer.
4: Merging Write Buffer
That is, if the buffer contains other modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry; if so, the new data is combined with the existing entry. This is called write merging.
4: Merging Write Buffer
Note that the CPU continues to work while the write buffer prepares to write the word to memory.
This technique therefore reduces the number of stalls due to the write buffer being full; hence it reduces the miss penalty through an improvement in the efficiency of the write buffer.
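Write merging can be sketched with a toy buffer model. This is illustrative only; the class, block size, and capacity below are invented, and real hardware tracks per-word valid bits rather than Python dictionaries:

```python
class MergingWriteBuffer:
    """Toy merging write buffer: one entry per cache block, recording
    which words of the block hold modified data."""

    def __init__(self, block_words=4, capacity=4):
        self.block_words = block_words
        self.capacity = capacity
        self.entries = {}                  # block number -> {word offset: data}

    def write(self, word_addr, data):
        block, offset = divmod(word_addr, self.block_words)
        if block in self.entries:
            self.entries[block][offset] = data  # write merge: fold the new
            return True                         # word into the existing entry
        if len(self.entries) >= self.capacity:
            return False                        # buffer full: CPU must stall
        self.entries[block] = {offset: data}    # otherwise open a new entry
        return True

# Four sequential word writes to one block occupy a single entry
# instead of four, leaving room for writes to other blocks.
buf = MergingWriteBuffer(block_words=4, capacity=2)
for addr in range(4):
    buf.write(addr, addr * 10)
```

Without merging, those four writes would have filled a four-entry buffer by themselves; with merging they consume one entry, and the eventual drain is a single multiword write.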
4: Merging Write Buffer
[Figure: write-buffer entries with and without write merging]

5: Victim Caches: Reducing Miss Penalty
Another way to reduce the miss penalty is to remember what was discarded, as it may be needed again.
This method reduces the miss penalty since the discarded data has already been fetched, so it can be reused at small cost.
The victim cache contains only blocks discarded because of some earlier miss; these blocks are …
5: Victim Caches: Reducing Miss Penalty
… checked on another miss to see if they hold the desired data before going to the next lower-level memory.
If the desired data (or instruction) is found, the victim block and the cache block are swapped.
This recycling requires a small, fully associative cache between a cache and its refill path, called the victim cache, as shown in the following figure.
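The insert-on-eviction / probe-on-miss behavior of a victim cache can be sketched as a toy Python model (all names are invented; a real victim cache is a small fully associative hardware structure, typically of one to five blocks):

```python
from collections import OrderedDict

class VictimCache:
    """Toy fully associative victim cache holding recently evicted blocks."""

    def __init__(self, size=4):
        self.size = size
        self.blocks = OrderedDict()        # block address -> block data

    def insert(self, addr, data):
        # Called when the main cache evicts (discards) a block.
        self.blocks[addr] = data
        if len(self.blocks) > self.size:
            self.blocks.popitem(last=False)   # drop the oldest victim

    def probe(self, addr):
        # Called on a main-cache miss, before going to the next lower level.
        # A hit removes the block here: it is swapped back into the cache.
        return self.blocks.pop(addr, None)

vc = VictimCache(size=2)
vc.insert(0x100, "evicted block A")
vc.insert(0x200, "evicted block B")
hit = vc.probe(0x200)                      # found: full miss penalty avoided
```

A probe hit turns what would have been a full lower-level access into a swap between the cache and the victim cache, which is why even a tiny victim cache can remove a noticeable fraction of conflict misses.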
5: Victim Caches: Reducing Miss Penalty
[Figure: placement of the victim cache in the memory hierarchy, between the cache and its refill path]
Summary
The first approach, ‘multi level caches’ is: ‘the more the merrier – extra people are welcome to come along’