Slide 1: CS 704 Advanced Computer Architecture
Lecture 27: Memory Hierarchy Design
(Cache Design Techniques)
Prof. Dr. M. Ashraf Chughtai
Slide 2: Today's Topics
Cache Performance Metrics
Cache Designs
Addressing Techniques
Summary
Slide 3: Recap: Memory Hierarchy Principles
The goal is to provide high-speed storage at the cheapest cost per byte.
Different types of memory modules are organized into a hierarchy, based on:
– the Concept of Caching
– the Principle of Locality
Slide 4: Recap: Concept of Caching
A small, fast, and expensive storage is used as a staging area or temporary place to:
– store a frequently-used subset of the data or instructions from the relatively cheaper, larger, and slower memory; and
– avoid having to go to the main memory every time this information is needed
Slide 5: Recap: Principle of Locality
To obtain the data or instructions of a program, the processor accesses a relatively small portion of the address space at any instant of time.
Slide 6: Recap: Types of Locality
There are two different types of locality:
– Temporal locality
– Spatial locality
Slide 7: Recap: Working of Memory Hierarchy
The memory hierarchy keeps the more recently accessed data items closer to the processor, because the chances are the processor will access them again soon.
Slide 8: Recap: Working of Memory Hierarchy Cont'd
• NOT ONLY do we move the item that has just been accessed closer to the processor, but we ALSO move the data items that are adjacent to it.
Slide 9: Recap: Cache Devices
A cache is a small SRAM that is made directly accessible to the processor.
The cache sits between the main memory and the CPU, as data and instruction caches, and may be located on the CPU chip or as a separate module.
Data transfer between the cache and the CPU, and between the cache and the main memory, is performed by the cache controller.
The cache and the main memory are organized in equal-sized blocks.
Slide 10: Recap: Cache/Main Memory Data Transfer
An address tag is associated with each cache block; it defines the relationship of the cache block with the higher-level memory (say, the main memory).
Data transfer between the CPU and the cache takes place as word transfers.
Data transfer between the cache and the main memory takes place as block transfers.
Slide 11: Recap: Cache Operation
The CPU requests the contents of a main memory location.
The controller checks the cache blocks for this data.
If present, i.e., a HIT, it gets the data or instruction from the cache - fast.
If not present, i.e., a MISS, it reads the required block from the main memory into the cache, then delivers it from the cache to the CPU.
Slide 12: Cache Memory Performance
Miss rate, miss penalty, and average access time are the major trade-offs of cache memory.
The hit rate is defined as the fraction of memory accesses that are found in the level-k memory (say, the cache):
Hit Rate = Hits / Total Memory Accesses
Therefore, Miss Rate = 1 - Hit Rate
Slide 13: Cache Memory Performance
The miss penalty is the number of cycles the CPU is stalled for a memory access; it is determined by the sum of:
(i) the cycles (time) to replace a block in the upper level, and
(ii) the cycles (time) to deliver the block to the processor
Average Access Time:
= Hit Time x Hit Rate + Miss Penalty x Miss Rate
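As a quick illustration, here is a minimal Python sketch of this formula (the function name and example figures are mine, not from the lecture):

```python
def average_access_time(hit_time, miss_penalty, miss_rate):
    """Average access time, as defined on the slide above:
    Hit Time x Hit Rate + Miss Penalty x Miss Rate."""
    hit_rate = 1.0 - miss_rate  # Miss Rate = 1 - Hit Rate
    return hit_time * hit_rate + miss_penalty * miss_rate

# e.g., a 1-cycle hit time, 25-cycle miss penalty, and 2% miss rate:
print(average_access_time(1, 25, 0.02))  # -> 1.48 cycles
```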
Slide 14: Cache Memory Performance
The performance of a CPU is the product of the clock cycle time and the sum of the CPU clock cycles and the memory stall cycles:
CPU Execution Time =
(CPU Clock Cycles + Memory Stall Cycles) x Clock Cycle Time
where
Memory Stall Cycles
= Number of Misses x Miss Penalty
= IC x (Misses / Instruction) x Miss Penalty
= IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
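A minimal Python sketch of this model (the parameter names are mine, not from the lecture):

```python
def cpu_execution_time(ic, cpi, accesses_per_instr,
                       miss_rate, miss_penalty, cycle_time):
    """CPU Execution Time = (CPU Clock Cycles + Memory Stall Cycles) x cycle time."""
    cpu_cycles = ic * cpi  # clock cycles assuming no memory stalls
    # Memory Stall Cycles = IC x (Memory Accesses / Instruction)
    #                          x Miss Rate x Miss Penalty
    stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty
    return (cpu_cycles + stall_cycles) * cycle_time

# e.g., 1M instructions, CPI 1.0, 1.5 accesses/instruction, 2% miss rate,
# 25-cycle miss penalty, 1 ns clock:
print(cpu_execution_time(1_000_000, 1.0, 1.5, 0.02, 25, 1e-9))  # -> 0.00175 s
```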
Slide 15: Memory Stall Cycles ... Cont'd
– The number of cycles for a memory read and for a memory write may be different.
– The miss penalty for a read may be different from that for a write.
– Memory Stall Clock Cycles =
Memory Read Stall Cycles + Memory Write Stall Cycles
Slide 16: Cache Performance Example
Assume a computer has CPI = 1.0 when all memory accesses are hits; the only data accesses are load/store accesses, and these are 50% of the total instructions.
If the miss rate is 2% and the miss penalty is 25 clock cycles, how much faster would the computer be if all accesses were hits?
Execution Time (all hits) = IC x 1.0 x Cycle Time
CPU Execution Time (with real cache) =
CPU Execution Time + Memory Stall Time
Slide 17: Cache Performance Example
Memory Stall Cycles
= IC x (instruction accesses + data accesses per instruction) x Miss Rate x Miss Penalty
= IC x (1 + 0.5) x 0.02 x 25
= IC x 0.75
CPU Execution Time (with cache)
= (IC x 1.0 + IC x 0.75) x Cycle Time
= 1.75 x IC x Cycle Time
The computer with no cache misses is therefore 1.75 times faster.
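The same arithmetic, checked in a few self-contained Python lines (the instruction count cancels out of the ratio):

```python
ic = 1.0                             # instruction count (cancels in the ratio)
cycles_all_hit = ic * 1.0            # CPI = 1.0, no memory stalls
stalls = ic * (1 + 0.5) * 0.02 * 25  # 1.5 accesses/instr x 2% miss x 25 cycles
cycles_real = cycles_all_hit + stalls
print(cycles_real / cycles_all_hit)  # -> 1.75, as derived above
```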
Slide 18: Block Size Tradeoff: Miss Rate
[Figure: miss rate vs. block size - a larger block size initially reduces the miss rate because it exploits spatial locality, but beyond a point fewer blocks fit in the cache, which compromises temporal locality and the miss rate rises again]
Trang 19Block Size Tradeoff: Miss Rate
• Miss rate probably will go to infinity It is true that
if an item is accessed, it is likely that it will be
accessed again soon.
Slide 20: Block Size Tradeoff: Miss Rate
This is called the ping-pong effect: the data acts like a ping-pong ball, bouncing in and out of the cache.
Miss rate is not the only cache performance metric; we also have to worry about the miss penalty.
Slide 21: Block Size Tradeoff: Miss Penalty
[Figure: miss penalty vs. block size - the miss penalty grows as the block size grows, since a larger block takes longer to transfer from the main memory]
Slide 22: Block Size Tradeoff: Average Access Time
The average access time is a better performance metric than the miss rate or the miss penalty alone.
Slide 23: Block Size Tradeoff: Average Access Time
[Figure: miss rate, miss penalty, and average access time vs. block size - the average access time falls at first, then rises once the increased miss penalty and miss rate dominate]
Slide 24: Block Size Tradeoff: Average Access Time
For very large blocks, not only is the miss penalty increasing; the miss rate is increasing as well.
Slide 25: How Do You Design a Cache?
The cache must support the basic memory operations:
– read: data <= Mem[Physical Address]
– write: Mem[Physical Address] <= Data
Slide 26: How Do You Design a Cache?
[Figure: cache block diagram - a Cache Controller drives the control points of the Cache DataPath; the inputs are Address, Data In, and an R/W Active signal; the outputs are Data Out, status signals, and a wait signal to the processor]
Slide 27: Categories of Cache Organization
A cache can be organized in three different ways, based on the block placement policy.
The block placement policy defines where a block from the main (physical) memory may be placed in the cache.
There exist three block placement policies, namely:
– Direct Mapped
– Fully Associative
– Set Associative
Slide 28: Direct Mapped Cache Organization
[Figure: direct-mapped cache organization - main memory addresses such as 339C H, with their data words (e.g., ABCDEFAB H), mapped to fixed cache lines together with their tags]
Slide 29: Direct Mapping Example
For example, a computer uses a 16 MB main memory and a 1 MB cache.
These memories are organized in 4-byte blocks.
That is, the main memory has 4M blocks, and the cache is organized as 256K blocks, each of 4 bytes.
Each block in the main memory, as well as in the cache, is addressed by a line number.
Now, in order to understand the placement policy, the main memory is logically divided into 16 sections, each of 256K blocks (lines).
Each section is represented by a tag number.
Slide 30: Direct Mapping Characteristics
Each block of the main memory maps to only one cache line.
E.g., the block at byte address 339C H within any of the 16 sections must be placed at line number 0CE7 H (= 339C H / 4) in the cache, with the corresponding tag.
This shows that the direct mapping is given by:
(Block Address) MOD (Number of Blocks in the Cache)
(00339C H / 4) MOD 40000 H = 0CE7 H
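The slide's arithmetic can be checked with a short Python sketch (the constants follow the example above; the helper name is mine):

```python
BLOCK_SIZE  = 4         # 4-byte blocks
CACHE_LINES = 0x40000   # 256K blocks in the 1 MB cache

def cache_line(byte_address):
    block_address = byte_address // BLOCK_SIZE
    # Direct mapping: (Block Address) MOD (Number of Blocks in the Cache)
    return block_address % CACHE_LINES

print(hex(cache_line(0x00339C)))  # -> 0xce7, as on the slide
```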
Slide 31: Direct Mapping Address Structure
24-bit address
2-bit word identifier (4-byte block)
22-bit block identifier - for the main memory
– 4-bit tag (= 22 - 18), identifying one of the 16 sections
– 18-bit slot, line, or index value for the 256K-line cache
No two blocks that map to the same line have the same tag field.
Check the contents of the cache by finding the line and checking the tag.
Slide 32: Direct Mapping Cache Organization
The least significant w bits identify a unique word within a block.
The next s bits specify one of the 2^s memory blocks; they are split into a cache line (index) field of r bits and a tag of s - r bits (the most significant bits).
Slide 33: Cache Design: Another Example
Let us consider another example with realistic numbers:
Assume we have a 1 KB direct-mapped cache with a block size equal to 32 bytes.
In other words, each block associated with a cache tag will have 32 bytes in it.
[Figure: 1 KB direct-mapped cache with 32-byte blocks - each row holds a valid bit, a cache tag, and bytes 0 through 31 of one block; the 32 rows together cover bytes 0 through 1023]
Slide 34: Address Translation - Direct Mapped Cache
Assume the level k+1 main memory is 4 GB, with a block size equal to 32 bytes, and the level k cache is 1 KB.
[Figure: 32-bit address split for the 1 KB direct-mapped cache - the Byte Select (bits 4-0, e.g., 0x00) picks one of the 32 bytes in a block; the Cache Index (bits 9-5) selects one of the 32 cache entries; the Cache Tag (bits 31-10, e.g., 0x01) is stored as part of the cache "state", along with the valid bit, and is compared against the tag bits of the address]
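A minimal sketch of this address split in Python, assuming the field boundaries shown in the figure (the function name is mine):

```python
def split_address(addr):
    """Split a 32-bit address for a 1 KB direct-mapped cache
    with 32-byte blocks."""
    byte_select = addr & 0x1F          # bits 4-0: byte within the block
    cache_index = (addr >> 5) & 0x1F   # bits 9-5: one of 32 cache entries
    cache_tag   = addr >> 10           # bits 31-10: compared with stored tag
    return cache_tag, cache_index, byte_select

print(split_address(0x00000420))  # -> (1, 1, 0): tag 0x01, index 0x01
```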
Slide 35: Direct Mapping Pros & Cons
Simple
Inexpensive
Fixed location for a given block
– If a program repeatedly accesses 2 blocks that map to the same line, the cache miss rate is very high
A valid bit is included with each entry to indicate whether the cache contents are valid.
Slide 37: Fully Associative Cache Organization
– Forget about the cache index
– Place any block of the main memory anywhere in the cache
Store all the upper bits of the address (except the byte select) as the cache tag associated with the cache block, and have one comparator for every entry.
Slide 38: Fully Associative Cache Organization
Slide 39: Associative Mapping Example
Slide 40: Associative Mapping: Address Structure
[Address format: Tag - 22 bits | Word - 2 bits]
A 22-bit tag is stored with each 32-bit block of data.
Compare the tag field of the address with each tag entry in the cache to check for a hit.
The least significant 2 bits of the address identify which byte is required from the 4-byte (32-bit) data block.
Slide 41: Fully Associative Cache Organization
[Figure: fully associative cache - the address provides only a cache tag and a byte select (e.g., 0x01); there is no index field, and the tag of every entry is compared in parallel against the address tag]
Slide 42: Fully Associative Cache Organization
The address is sent to all entries at once and compared in parallel; only the entry that matches sends its data to the output.
This is called an associative lookup.
It is hardware intensive; in practice, a fully associative cache is limited to 64 or fewer entries.
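In hardware, every entry has its own comparator working in parallel; a software model can only imitate this with a loop over the entries. A minimal sketch (the entry layout is illustrative, not from the lecture):

```python
def associative_lookup(cache, addr_tag):
    """cache: list of (valid, tag, data) entries; returns data on a hit."""
    for valid, tag, data in cache:       # models one comparator per entry
        if valid and tag == addr_tag:
            return data                  # HIT: the matching entry drives the output
    return None                          # MISS: no tag matched

cache = [(True, 0x58CE7, "block A"), (True, 0x01234, "block B")]
print(associative_lookup(cache, 0x01234))  # -> 'block B'
```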
Slide 43: Fully Associative Cache Organization
The conflict miss rate is zero for a fully associative cache.
Assume we have 64 entries here. The first 64 items we access can all fit in.
But when we try to bring in the 65th item, we will need to throw one of them out to make room for the new item.
This brings us to the type of cache miss called a capacity miss.
Slide 44: Set Associative Mapping Summary
Address length = (s + w) bits
Number of lines per set = k
Number of sets = v = 2^d
Size of tag = (s - d) bits
Slide 45: Set Associative Mapping
The cache is divided into a number of sets.
Each set contains a number of lines.
A given block maps to any line in a given set.
– e.g., block B can be in any line of set i
e.g., with 2 lines per set:
– 2-way associative mapping
– a given block can be in one of 2 lines, in only one set
Slide 46: Set Associative Cache
This organization allows a block to be placed in a restricted set of places in the cache, where a set is a group of blocks in the cache at each index value.
Here a block is first mapped onto a set (i.e., mapped to an index value), and then it can be placed anywhere within that set.
The set is usually chosen by bit selection, i.e., (Block Address) MOD (Number of Sets in the Cache).
If there are n blocks in a set, the cache placement is referred to as n-way set associative.
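A short Python sketch of bit selection for the set index (the constants are illustrative, not from the lecture):

```python
BLOCK_SIZE = 32   # bytes per block (illustrative)
NUM_SETS   = 16   # sets in the cache (illustrative)

def set_index(byte_address):
    block_address = byte_address // BLOCK_SIZE
    # Bit selection: (Block Address) MOD (Number of Sets in the Cache)
    return block_address % NUM_SETS   # the block may go in any line of this set

print(set_index(0x1F40))  # block 0xFA -> set 10
```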
Slide 47: Set Associative Mapping Address Structure
[Address format: Tag - (s - d) bits | Set - d bits | Word - w bits]
Use the set field to determine which cache set to look in, then compare the tag field to see if we have a hit.
Slide 48: Two-Way Set Associative Mapping Example
Slide 49: Two-Way Set Associative Cache
Let us consider this example of a 2-way set associative cache.
Here, two cache entries are possible for each index; i.e., two direct-mapped caches are working in parallel.
[Figure: two-way set associative cache - the cache index selects one set; each set holds two entries, one per way, and each entry consists of a valid bit, a cache tag, and the cache data (Cache Block 0)]
Slide 50: Working of Two-Way Set Associative Cache
Let us see how it works:
– the cache index selects a set from the cache; the two tags in the set are compared in parallel with the upper bits of the memory address
– if neither tag matches the incoming address tag, we have a cache miss
– otherwise, we have a cache hit, and we select the data on the side where the tag match occurs
This is simple enough. What are its disadvantages?
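A minimal model of this lookup in Python (the data structures are illustrative; real hardware does the two comparisons truly in parallel):

```python
def two_way_lookup(way0, way1, index, addr_tag):
    """way0, way1: lists of (valid, tag, data) entries, one per way."""
    selected_set = [way0[index], way1[index]]  # the index selects one set
    for valid, tag, data in selected_set:      # two comparators, in parallel
        if valid and tag == addr_tag:
            return data                        # HIT: the MUX selects this side
    return None                                # MISS: neither tag matched

way0 = [(True, 0x3A, "A0"), (False, 0x00, None)]
way1 = [(True, 0x7F, "B0"), (True, 0x11, "B1")]
print(two_way_lookup(way0, way1, 0, 0x7F))  # -> 'B0'
```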
Slide 51: Disadvantage of Set Associative Cache
First of all, an N-way set associative cache will need N comparators instead of just the one needed by a direct-mapped cache.
An N-way set associative cache will also be slower than a direct-mapped cache because of the extra multiplexer delay.
Finally, for an N-way set associative cache, the data will be available only AFTER the hit/miss signal becomes valid, because the hit/miss result is needed to control the data MUX.
Slide 52: Disadvantage of Set Associative Cache
For a direct-mapped cache, the cache block will be available BEFORE the hit/miss signal (the AND-gate output) is valid, because the data does not have to go through the comparator.
This can be an important consideration, because the processor can go ahead and use the data without knowing whether it is a hit or a miss.
Slide 53: Disadvantage of Set Associative Cache
If the processor assumes that the access is a hit, it will be ahead most of the time, as cache hit rates are in the upper-90% range; for the small fraction of the time that it is wrong, it just has to make sure that it can recover.
We cannot play this speculation game with an N-way set associative cache because, as we said earlier, the data will not be available until the hit/miss signal is valid.
Slide 54: Allah Hafiz and Assalam-u-Alaikum