Advanced Computer Architecture - Lecture 29: Memory Hierarchy Design. This lecture covers cache performance enhancement by reducing the cache miss penalty: the CPU execution time equation, a review of cache performance, and techniques for reducing the miss penalty.
CS 704
Advanced Computer Architecture
Lecture 29
Memory Hierarchy Design
Cache Performance Enhancement by:
Reducing Cache Miss Penalty
Prof Dr M Ashraf Chughtai
MAC/VU-Advanced
Recap: Memory Hierarchy Designer's Concerns
Block placement: Where can a block be placed in the upper level?
Block identification: How is a block found if it is in the upper level?
Block replacement: Which block should be replaced on a miss?
Write strategy: What happens on a write?
Recap: Write Buffer for Write Through
Cache write strategies:
– write back
– write through
– use of a write buffer
Recap: Write Buffer for Write Through
A level-2 cache is introduced between the level-1 cache and the DRAM main memory.
Write miss policies:
– Write Allocate, and
– No-Write Allocate
Recap: Write Miss Policies
Write Allocate:
– A block is allocated in the cache on a write miss, i.e., the block to be written is available in the cache
No-Write Allocate:
– The blocks stay out of the cache until the program tries to read them; i.e., the block is modified only in the lower-level memory
Impact of Caches on CPU Performance
CPU Execution Time equation:
CPU Time = (CPU execution clock cycles + Memory stall cycles) × Clock cycle time
Impact of Caches on CPU Performance: Example
CPU Time = (CPU execution clock cycles + Memory stall cycles) × Clock cycle time
CPU Time with cache (including cache misses)
Cache Performance (Review)
Number of misses, or miss rate
Cost per miss, or miss penalty
Memory stall clock cycles equal the sum of:
IC × Reads per inst. × Read miss rate × Read miss penalty; and
IC × Writes per inst. × Write miss rate × Write miss penalty
Cache Performance (Review)
Memory stall clock cycles =
Number of reads × read miss rate × read miss penalty +
Number of writes × write miss rate × write miss penalty
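The stall-cycle formula above can be sketched as a small Python helper. The function name and the workload figures below are hypothetical, chosen only to illustrate the arithmetic:

```python
def memory_stall_cycles(ic, reads_per_inst, read_miss_rate, read_penalty,
                        writes_per_inst, write_miss_rate, write_penalty):
    """Memory stall clock cycles = read-miss stalls + write-miss stalls."""
    read_stalls = ic * reads_per_inst * read_miss_rate * read_penalty
    write_stalls = ic * writes_per_inst * write_miss_rate * write_penalty
    return read_stalls + write_stalls

# Hypothetical workload: 1M instructions, 1.0 reads and 0.3 writes per
# instruction, 2% read / 4% write miss rate, 100-cycle miss penalty.
stalls = memory_stall_cycles(1_000_000, 1.0, 0.02, 100, 0.3, 0.04, 100)
```

Separating the read and write terms this way mirrors the IC-based form of the formula on the previous slide.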
Cache Performance (Review)
Note that the average memory access time is an indirect measure of CPU performance and is not a substitute for execution time.
However, this formula can help decide between split caches (i.e., separate instruction and data caches) and a unified cache.
E.g., to find out which of these two organizations has the lower miss rate, we can use this formula as follows:
Cache Performance: Example
Statement: Consider a 32KB unified cache with 43.3 misses per 1000 instructions, and instruction/data split caches of 16KB each, with 3.82 instruction-cache misses per 1000 instructions and 40.9 data-cache misses per 1000 instructions;
Cache Performance: Example
– a hit takes 1 clock cycle and the miss penalty is 100 clock cycles; and
– a load or store takes one extra clock cycle on the unified cache
Assuming write-through caches with a write buffer, and ignoring stalls due to the write buffer, find the average memory access time in each case.
Note: to solve this problem we first find the miss rates and then the average memory access times.
Cache Performance: Solution
1: Miss Rate = (Misses per 1000 instructions / 1000) / (Memory accesses per instruction)
Miss Rate (16KB inst) = (3.82/1000) / 1.0 = 0.0038
Miss Rate (16KB data) = (40.9/1000) / 0.36 = 0.114
As about 74% of the memory accesses are instruction fetches, the overall miss rate for the split caches = (74% × 0.0038) + (26% × 0.114) = 0.0324
Miss Rate (32KB unified) = (43.3/1000) / (1 + 0.36) = 0.0318
i.e., the unified cache has a slightly lower miss rate.
Cache Performance: Solution
2: Average Memory Access Time
= %inst × (Hit time + Inst miss rate × Miss penalty) + %data × (Hit time + Data miss rate × Miss penalty)
Average memory access time (split)
= 74% × (1 + 0.0038 × 100) + 26% × (1 + 0.114 × 100) = 4.24
Average memory access time (unified)
= 74% × (1 + 0.0318 × 100) + 26% × (1 + 1 + 0.0318 × 100) = 4.44
i.e., the split caches have a slightly better average access time and also avoid structural hazards.
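The example's arithmetic can be checked with a few lines of Python. The variable names are my own; the miss counts, access fractions, and penalties are the ones in the statement above:

```python
HIT_TIME, MISS_PENALTY = 1, 100    # clock cycles (from the statement)
INST_FRAC, DATA_FRAC = 0.74, 0.26  # instruction vs data share of accesses

mr_inst    = 3.82 / 1000 / 1.0     # 16KB I-cache miss rate  (~0.0038)
mr_data    = 40.9 / 1000 / 0.36    # 16KB D-cache miss rate  (~0.114)
mr_unified = 43.3 / 1000 / 1.36    # 32KB unified miss rate  (~0.0318)

amat_split = INST_FRAC * (HIT_TIME + mr_inst * MISS_PENALTY) \
           + DATA_FRAC * (HIT_TIME + mr_data * MISS_PENALTY)
# Unified cache: a load/store pays one extra cycle (structural hazard).
amat_unified = INST_FRAC * (HIT_TIME + mr_unified * MISS_PENALTY) \
             + DATA_FRAC * (HIT_TIME + 1 + mr_unified * MISS_PENALTY)
```

Rounded to two decimals this reproduces the 4.24 (split) versus 4.44 (unified) figures, confirming that the split organization wins on access time even though its overall miss rate is slightly higher.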
Improving Cache Performance
The average memory access time gives a framework for optimizing cache performance.
The average memory access time formula:
Average memory access time = Hit time + Miss rate × Miss penalty
Four General Options
1. Reduce the miss penalty
2. Reduce the miss rate
3. Reduce the miss penalty or miss rate via parallelism
4. Reduce the time to hit in the cache
Reducing Miss Penalty
1. Multilevel caches
2. Critical word first and early restart
3. Priority to read misses over write misses
4. Merging write buffers
5. Victim caches
1: Multilevel Caches (to reduce Miss Penalty)
This technique ignores the CPU and concentrates on the interface between the cache and main memory.
Multiple levels of caches trade off cache size (cache effectiveness) against cost (access time): a small, fast memory is used as the level-1 cache.
1: Multilevel Caches (Performance Analysis)
The average access time is:
Average memory access time = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
where Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
Therefore,
Average memory access time
= Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
1: Multilevel Caches (Performance Analysis)
Average stalls per instruction =
Misses per instruction_L1 × Hit Time_L2 + Misses per instruction_L2 × Miss Penalty_L2
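The two-level access-time formula can be captured in a one-line helper. This is a sketch; the function and parameter names are invented, and the figures plugged in below are illustrative:

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + LocalMissRate_L2 * MissPenalty_L2)."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2)

# Illustrative figures: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle L2 miss penalty.
amat = amat_two_level(1, 0.04, 10, 0.50, 100)   # 3.4 cycles
```

Note that the L2 term uses the *local* L2 miss rate, since only references that miss in L1 ever reach L2.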
1: Multilevel Caches (to reduce Miss Penalty)
Local miss rate: the number of misses in a cache divided by the total number of memory accesses to that cache.
Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the CPU.
For the L1 cache the global miss rate equals its local miss rate; for the L2 cache the global miss rate is Miss Rate_L1 × Miss Rate_L2.
1: Multilevel Caches (to reduce Miss Penalty)
Example: Find the local and global miss rates, the average memory access time, and the average memory stall cycles per instruction, given that for 1000 memory references there are 40 misses in the L1 cache and 20 in the L2 cache;
Assume:
– miss penalty for L2 cache-to-memory = 100 clock cycles
– hit time for the L2 cache is 10 clock cycles
– hit time for the L1 cache is 1 clock cycle
– memory references per instruction = 1.5
Solution:
Local miss rate for L2 = 20/40 = 50%
Global miss rate for L2 = 20/1000 = 2%
Then,
1: Multilevel Caches (to reduce Miss Penalty)
Average memory access time
= Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1     (1)
where Miss Penalty_L1
= Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2     (2)
Substituting (2) in (1) we get:
Average memory access time
= Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
= 1 + 4% × (10 + 50% × 100) = 3.4 cycles
1: Multilevel Caches (to reduce Miss Penalty)
Average memory stalls per instruction (i.e., miss penalty):
For memory references per instruction = 1.5,
Misses per instruction for L1 = 40 × 1.5 = 60 per 1000 instructions
Misses per instruction for L2 = 20 × 1.5 = 30 per 1000 instructions
Average memory stalls per instruction
= (60/1000) × 10 + (30/1000) × 100
= 0.6 + 3.0 = 3.6 clock cycles
i.e., the average miss penalty using multilevel caches is reduced by a factor of 100/3.6 ≈ 28 relative to a single-level cache.
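The worked example above can be reproduced directly in Python (variable names are mine; all numbers are from the example):

```python
# Per 1000 memory references: 40 misses in L1, 20 misses in L2.
local_miss_rate_l2  = 20 / 40     # 0.5  (L2 misses / accesses that reach L2)
global_miss_rate_l2 = 20 / 1000   # 0.02 (L2 misses / all memory references)

# AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + LocalMissRate_L2 * Penalty_L2)
amat = 1 + (40 / 1000) * (10 + local_miss_rate_l2 * 100)   # 3.4 cycles

# With 1.5 memory references per instruction:
misses_per_1000_inst_l1 = 40 * 1.5   # 60
misses_per_1000_inst_l2 = 20 * 1.5   # 30
stalls_per_inst = (misses_per_1000_inst_l1 / 1000) * 10 \
                + (misses_per_1000_inst_l2 / 1000) * 100   # 0.6 + 3.0 = 3.6
```

The stalls-per-instruction form is often the more useful one, because it plugs straight into the CPU execution time equation from earlier in the lecture.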
1: Multilevel Caches (to reduce Miss Penalty)
[Figure: miss rate versus cache size for multilevel caches; the unified cache is 2-way set associative with LRU replacement]
2: Critical Word First and Early Restart
Don't wait for the full block to be loaded before restarting the CPU.
Explanation: The CPU normally needs one word of a block at a time, so it doesn't have to wait for the full block to be loaded before the requested word is sent and the CPU is restarted.
2: Critical Word First and Early Restart … Cont'd
Early restart: Request the words in a block in normal order; as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: Request the missed word from memory first; the memory sends it to the CPU as soon as it arrives, and the CPU continues execution while the rest of the words in the block are filled in afterwards.
Example: Critical Word First and Early Restart
Consider a computer using 64-byte [8-word] cache blocks.
An L2 cache takes 11 clock cycles to deliver the first 8-byte (critical) word, and then 2 clock cycles per 8-byte word for the rest of the block (and the processor can make 2 issues per clock cycle).
Example: Critical Word First and Early Restart
Calculate the average miss penalty:
1. with critical word first (assuming no other access to the rest of the block)
2. without critical word first (assuming the following instructions read data sequentially, 8-byte words at a time, from the rest of the block; i.e., a full block load is required in this case)
Example: Solution
1: With critical word first:
Average miss penalty
= Miss penalty of the critical word + miss penalty of the remaining words of the block
= 11 × 1 + (8 − 1) × 2 = 11 + 14 = 25 clock cycles
2: Without critical word first (a full block load is required):
= [Miss penalty of the first word + miss penalty of the remaining words of the block] + clock cycles to issue the loads
= [11 × 1 + (8 − 1) × 2] + 8/2 = 25 + 4 = 29 clock cycles
(2 issues per cycle, so 4 cycles for 8 issues)
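The two miss-penalty calculations can be written out as follows (a sketch of the example's arithmetic; the constant names are my own):

```python
FIRST_WORD_CYCLES = 11   # L2 latency for the first (critical) 8-byte word
PER_WORD_CYCLES   = 2    # each further 8-byte word of the block
WORDS_PER_BLOCK   = 8    # 64-byte block
ISSUES_PER_CYCLE  = 2    # the processor makes 2 issues per clock

# 1: with critical word first -- the CPU restarts on the critical word
# while the remaining 7 words stream in behind it.
penalty_cwf = FIRST_WORD_CYCLES + (WORDS_PER_BLOCK - 1) * PER_WORD_CYCLES  # 25

# 2: without critical word first -- full block load, plus the cycles to
# issue the 8 sequential word loads at 2 issues per clock.
penalty_no_cwf = penalty_cwf + WORDS_PER_BLOCK // ISSUES_PER_CYCLE         # 29
```

The benefit here is modest because the follow-on accesses touch the rest of the block anyway; with no further accesses to the block, critical word first would let the CPU resume after just 11 cycles.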
2: Critical Word First and Early Restart … Cont'd
Merit: This technique doesn't require extra hardware.
Drawback: It is generally useful only with large blocks; programs exhibiting spatial locality may face a problem in accessing data or instructions from memory, as the next miss is likely to be to the remainder of the block, which is still being loaded.
3: Priority to Read Misses over Write Misses
This technique reduces the average miss penalty by exploiting the overlap between the CPU and the cache miss penalty.
We have already discussed how the write buffer reduces write stalls in write-through caches: the write buffer ensures that a write to memory does not stall the processor.
Furthermore, the write buffer may hold …
3: Priority to Read Misses over Write Misses
… the updated value of a location needed on a read miss, while the processor is blocked until the read returns. Thus, write buffers do complicate memory access and need consideration.
Let us consider an example program segment and discuss how the complexities of the write buffer can be handled by giving priority to read misses over write misses.
3: Priority to Read Misses over Write Misses
Consider the code:
SW R3, 512(R0)  ; M[512] <- R3   Cache index 0
LW R1, 1024(R0) ; R1 <- M[1024]  Cache index 0
LW R2, 512(R0)  ; R2 <- M[512]   Cache index 0
Assume the code runs on a direct-mapped, write-through cache that maps addresses 512 and 1024 to the same block, with a 4-word write buffer.
Will the value in R2 always equal the value in R3?
3: Priority to Read Misses over Write Misses
Discussion:
Note that this is a case of a RAW data hazard through memory.
Let us see how the cache accesses proceed:
The data in R3 is placed in the write buffer after the first instruction, i.e., after the store to address 512.
The following load instruction (i.e., the 2nd instruction) maps to the same cache index and is therefore a miss.
3: Priority to Read Misses over Write Misses
– The second load instruction (i.e., the 3rd instruction in the sequence) tries to put the value at location 512 into register R2; this also results in a miss.
– If the write buffer hasn't completed writing to location 512 in memory, the read of location 512 will put the old value into the cache and then into R2.
3: Priority to Read Misses over Write Misses
– Thus R3 would not equal R2. This happens because the read is served before the write has completed, i.e.,
Write-through caches with write buffers create RAW conflicts with main-memory reads on cache misses; and
if the processor simply waits for the write buffer to empty, it might increase the read miss penalty (by about 50% on the old MIPS 1000).
3: Priority to Read Misses over Write Misses
The simplest way out of this dilemma is to give priority to the read miss; i.e.,
– either wait for the write buffer to empty,
– or check the write buffer contents on a read miss; if there are no conflicts and the memory system is available, let the read proceed.
Note that by giving priority to read misses, the cost (penalty) of writes in a write-back cache can also be reduced.
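The "check the write buffer on a read miss" option can be sketched with a toy model. This is illustrative Python, not hardware; all names are invented, and the scenario mirrors the SW/LW example above:

```python
class WriteBuffer:
    """Toy write buffer: pending (address -> data) stores awaiting memory."""

    def __init__(self):
        self.pending = {}

    def add_store(self, addr, data):
        self.pending[addr] = data          # e.g. the SW to address 512

    def serve_read_miss(self, addr, memory):
        # Priority to the read miss: rather than draining the whole buffer,
        # check it for a conflicting store and forward the newest value.
        if addr in self.pending:
            return self.pending[addr]      # avoids the RAW hazard
        return memory[addr]                # no conflict: read memory directly

# The example's scenario: the SW to 512 still sits in the buffer
# when the LW from 512 misses in the cache.
memory = {512: 0, 1024: 5}
wb = WriteBuffer()
wb.add_store(512, 7)                       # R3's value, not yet in memory
value = wb.serve_read_miss(512, memory)    # forwards 7, the up-to-date value
```

Forwarding from the buffer gives the read its correct data without waiting for the buffered writes to drain, which is exactly the second option listed above.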
3: Priority to Read Misses over Write Misses
In write-back caches, a read miss may require replacing a dirty block, which is
– normally done by writing the dirty block to memory and then doing the read;
– a better alternative is to copy the dirty block to a write buffer, then do the read, and then do the write.
In this case the CPU stalls less, since it restarts as soon as the read is done; hence the miss penalty is reduced.
Even write-back caches use a simple buffer when a block is replaced.
In the normal mode of operation, the write buffer absorbs writes from the CPU and commits them to memory in the background.
4: Merging Write Buffer
However, the problem, particularly in write-through caches, is that a small write buffer may end up stalling the processor if it fills up, since the processor then has to wait until the writes are committed to memory.
This problem is resolved by merging cache-block entries in the write buffer, because:
4: Merging Write Buffer
– Multiword writes are usually faster than writes performed one word at a time.
– Writes usually modify one word in a block; thus, if the write buffer already contains some words from the given data block, we can merge the current modified word with the block parts already in the buffer.
4: Merging Write Buffer
That is, if the buffer contains other modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry; if so, the new data is combined with the existing entry. This is called write merging.
4: Merging Write Buffer
Note that the CPU continues to work while the write buffer prepares to write the word to memory.
This technique therefore reduces the number of stalls due to the write buffer being full; hence it reduces the miss penalty through an improvement in the efficiency of the write buffer.
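Write merging can be sketched with a toy buffer model. This is illustrative only; the class, block size, and capacity below are invented, and real hardware tracks per-word valid bits rather than Python dictionaries:

```python
class MergingWriteBuffer:
    """Toy merging write buffer: one entry per cache block, recording
    which words of the block hold modified data."""

    def __init__(self, block_words=4, capacity=4):
        self.block_words = block_words
        self.capacity = capacity
        self.entries = {}                  # block number -> {word offset: data}

    def write(self, word_addr, data):
        block, offset = divmod(word_addr, self.block_words)
        if block in self.entries:
            self.entries[block][offset] = data  # write merge: fold the new
            return True                         # word into the existing entry
        if len(self.entries) >= self.capacity:
            return False                        # buffer full: CPU must stall
        self.entries[block] = {offset: data}    # otherwise open a new entry
        return True

# Four sequential word writes to one block occupy a single entry
# instead of four, leaving room for writes to other blocks.
buf = MergingWriteBuffer(block_words=4, capacity=2)
for addr in range(4):
    buf.write(addr, addr * 10)
```

Without merging, those four writes would have filled a four-entry buffer by themselves; with merging they consume one entry, and the eventual drain is a single multiword write.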
4: Merging Write Buffer
[Figure: write-buffer entries with and without write merging]

5: Victim Caches: Reducing Miss Penalty
Another way to reduce the miss penalty is to remember what was discarded, as it may be needed again.
This method reduces the miss penalty since the discarded data has already been fetched, so it can be reused at small cost.
The victim cache contains only blocks discarded because of some earlier miss; these blocks are …
5: Victim Caches: Reducing Miss Penalty
… checked on another miss to see if they hold the desired data before going to the next lower-level memory.
If the desired data (or instruction) is found, the victim block and the cache block are swapped.
This recycling requires a small, fully associative cache between a cache and its refill path, called the victim cache, as shown in the following figure.
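The insert-on-eviction / probe-on-miss behavior of a victim cache can be sketched as a toy Python model (all names are invented; a real victim cache is a small fully associative hardware structure, typically of one to five blocks):

```python
from collections import OrderedDict

class VictimCache:
    """Toy fully associative victim cache holding recently evicted blocks."""

    def __init__(self, size=4):
        self.size = size
        self.blocks = OrderedDict()        # block address -> block data

    def insert(self, addr, data):
        # Called when the main cache evicts (discards) a block.
        self.blocks[addr] = data
        if len(self.blocks) > self.size:
            self.blocks.popitem(last=False)   # drop the oldest victim

    def probe(self, addr):
        # Called on a main-cache miss, before going to the next lower level.
        # A hit removes the block here: it is swapped back into the cache.
        return self.blocks.pop(addr, None)

vc = VictimCache(size=2)
vc.insert(0x100, "evicted block A")
vc.insert(0x200, "evicted block B")
hit = vc.probe(0x200)                      # found: full miss penalty avoided
```

A probe hit turns what would have been a full lower-level access into a swap between the cache and the victim cache, which is why even a tiny victim cache can remove a noticeable fraction of conflict misses.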
5: Victim Caches: Reducing Miss Penalty
[Figure: placement of the victim cache in the memory hierarchy, between the cache and its refill path]
Summary
The first approach, ‘multi level caches’ is: ‘the more the merrier – extra people are welcome to come along’