Chapter 8
Memory Systems
Contents
1. Memory Hierarchy
2. Cache Memory
3. Main Memory
4. Virtual Memory
Memory Hierarchy
Registers
– In CPU
Internal or Main memory
– May include one or more levels of cache
Memory Hierarchy - Diagram
Access time
– Time between presenting the address and getting the valid data
Memory Cycle time
– Time the memory may require to "recover" before the next access
– Cycle time = access time + recovery time
Transfer Rate
– Rate at which data can be moved
Khoa KTMT – Thiều Xuân Khánh
Physical arrangement of bits into words
Not always obvious
– e.g. interleaved
The Bottom Line
So you want fast?
It is possible to build a computer which uses only static RAM (see later)
This would be very fast
This would need no cache
– How can you cache cache?
This would cost a very large amount
Locality of Reference
During the course of the execution of a program, memory references tend to cluster
e.g. loops
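The clustering described above can be made concrete with a small sketch (the array, its base address, and the 4-byte element size are invented for illustration):

```python
# Illustrative sketch (not from the slides): the addresses touched by a
# simple summation loop cluster into a small, contiguous region of memory.
BASE = 0x1000          # hypothetical start address of the array
data = [3, 1, 4, 1, 5, 9, 2, 6]

touched = []
total = 0
for i in range(len(data)):         # the loop body re-executes the same few
    total += data[i]               # instruction addresses (temporal locality)
    touched.append(BASE + 4 * i)   # data accesses are sequential (spatial locality)

print(total)                       # 31
print(touched[0], touched[-1])     # 4096 4124
```

Every data reference falls within a 32-byte window, which is why a cached block of nearby words is so likely to be reused.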
2. Cache Memory
Organization
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or module
Cache/Main Memory Structure
Cache operation – overview
CPU requests contents of memory location
Check cache for this data
If present, get from cache (fast)
If not present, read required block from main memory to cache
Then deliver from cache to CPU
Cache includes tags to identify which block of main memory
is in each cache slot
Cache Read Operation - Flowchart
Size does matter
Cost
– More cache is expensive
Speed
– More cache is faster (up to a point)
– Checking cache for data takes time
Typical Cache Organization
Comparison of Cache Sizes

Processor | Type                          | Year of Introduction | L1 cache    | L2 cache | L3 cache
IBM SP    | High-end server/supercomputer | 2000                 | 64 KB/32 KB | 8 MB     | —
Mapping Function
Cache of 64 KByte
Cache block of 4 bytes
– i.e. the cache is 16K (2^14) lines of 4 bytes
16 MBytes main memory
24-bit address
– (2^24 = 16M)
Direct Mapping
Each block of main memory maps to only one cache line
– i.e if a block is in cache, it must be in one specific place
Address is in two parts
Least Significant w bits identify unique word
Most Significant s bits specify one memory block
The MSBs are split into a cache line field r and a tag of s-r (most significant)
Direct Mapping Address Structure
24 bit address
2 bit word identifier (4 byte block)
22 bit block identifier
– 8 bit tag (=22-14)
– 14 bit slot or line
No two blocks in the same line have the same Tag field
Check contents of cache by finding line and checking Tag
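The 8/14/2-bit split described above can be sketched as follows (the sample address is an arbitrary 24-bit value chosen for illustration):

```python
# Sketch of the direct-mapped address split from the slide:
# tag (8 bits) | line (14 bits) | word (2 bits), for a 24-bit address.
TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split(addr):
    word = addr & ((1 << WORD_BITS) - 1)                 # byte within block
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)  # cache line to check
    tag = addr >> (WORD_BITS + LINE_BITS)                # compared against stored tag
    return tag, line, word

tag, line, word = split(0x16CA57)   # arbitrary example address
print(tag, line, word)              # 22 12949 3
```

A lookup therefore indexes line 12949 directly and compares the stored tag against 22; no search over other lines is needed.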
Direct Mapping Cache Line Table

Cache line | Main memory blocks held
0          | 0, m, 2m, 3m, …, 2^s − m
1          | 1, m+1, 2m+1, …, 2^s − m + 1
m−1        | m−1, 2m−1, 3m−1, …, 2^s − 1
Direct Mapping Cache Organization
Direct Mapping Example
Direct Mapping Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w)/2^w = 2^s
Number of lines in cache = m = 2^r
Size of tag = (s − r) bits
Direct Mapping pros & cons
Simple
Inexpensive
Fixed location for given block
– If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high
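The conflict problem above can be demonstrated with a tiny direct-mapped model (sizes are invented; the point is that two blocks an exact multiple of the cache size apart keep evicting each other):

```python
# Sketch of direct-mapped thrashing: two blocks whose addresses differ by
# exactly the cache size map to the same line, so alternating between them
# misses on every single access.
LINES, BLOCK = 4, 4                 # tiny illustrative cache
cache = {}                          # line -> tag currently held
misses = 0

def access(addr):
    global misses
    block_no = addr // BLOCK
    line, tag = block_no % LINES, block_no // LINES
    if cache.get(line) != tag:
        misses += 1                 # conflict miss: evict the other block
        cache[line] = tag

A, B = 0x00, LINES * BLOCK          # both map to line 0
for _ in range(4):
    access(A)
    access(B)
print(misses)                       # 8: every one of the 8 accesses missed
```

An associative organization would let both blocks coexist, which motivates the mappings that follow.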
Associative Mapping
A main memory block can load into any line of cache
Memory address is interpreted as tag and word
Tag uniquely identifies block of memory
Every line’s tag is examined for a match
Cache searching gets expensive
Fully Associative Cache Organization
Associative Mapping Example
Associative Mapping Address Structure
Tag: 22 bits | Word: 2 bits
22 bit tag stored with each 32 bit block of data
Compare tag field with tag entry in cache to check for hit
Least significant 2 bits of address identify which 8-bit word (byte) is required from the 32-bit data block
Associative Mapping Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w)/2^w = 2^s
Number of lines in cache = undetermined
Size of tag = s bits
Set Associative Mapping
Cache is divided into a number of sets
Each set contains a number of lines
A given block maps to any line in a given set
– e.g. Block B can be in any line of set i
e.g. 2 lines per set
– 2-way associative mapping
– A given block can be in one of 2 lines in only one set
Set Associative Mapping Example
13-bit set number
Block number in main memory is modulo 2^13
Addresses 000000, 008000, 010000, 018000, … map to the same set
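A sketch of the set-index computation, assuming the field widths of this example (2-bit word field, 13-bit set field): blocks that lie 2^15 bytes apart carry the same set field and therefore compete for the same set.

```python
# Sketch of the set-index computation for a 2-way cache with 8K (2^13) sets:
# word field = 2 bits, set field = 13 bits, remaining 9 bits of the 24-bit
# address form the tag.
WORD_BITS, SET_BITS = 2, 13

def set_index(addr):
    return (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)

# Addresses that differ by 2**15 bytes (2**13 blocks of 4 bytes) collide:
for addr in (0x000000, 0x008000, 0x010000, 0x018000):
    print(hex(addr), set_index(addr))   # all land in set 0
```

With 2 lines per set, two of these blocks can be cached simultaneously; a third forces a replacement within the set.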
Two Way Set Associative Cache Organization
Set Associative Mapping Address Structure
Use set field to determine cache set to look in
Compare tag field to see if we have a hit
Two Way Set Associative Mapping Example
Set Associative Mapping Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w)/2^w = 2^s
Number of lines in set = k
Number of sets = v = 2^d
Number of lines in cache = kv = k × 2^d
Size of tag = (s − d) bits
Replacement Algorithms (1)
Direct mapping
No choice
Each block only maps to one line
Replace that line
Replacement Algorithms (2) – Associative & Set Associative
Hardware implemented algorithm (speed)
Least Recently Used (LRU)
– e.g. in 2-way set associative: which of the 2 blocks is LRU?
First in first out (FIFO)
– replace block that has been in cache longest
Least frequently used
– replace block which has had fewest hits
Random
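As a sketch of how LRU might be tracked in hardware for one set of a 2-way set-associative cache (block numbers and data values here are invented):

```python
# Minimal sketch of LRU replacement for a single 2-way set: an OrderedDict
# keeps blocks in recency order, oldest first, so the LRU victim is simply
# the first entry.
from collections import OrderedDict

WAYS = 2
cache_set = OrderedDict()           # block_no -> data, least recent first

def access(block_no):
    if block_no in cache_set:
        cache_set.move_to_end(block_no)            # hit: mark most recently used
        return "hit"
    if len(cache_set) >= WAYS:
        cache_set.popitem(last=False)              # evict the LRU block
    cache_set[block_no] = f"data@{block_no}"       # invented payload
    return "miss"

print(access(1))   # miss
print(access(2))   # miss
print(access(1))   # hit: block 1 becomes most recent
print(access(3))   # miss: evicts block 2, the LRU
print(access(2))   # miss: block 2 was just evicted
```

Real hardware uses a single use-bit per pair for 2-way LRU; the ordering structure here just makes the policy explicit.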
Write Policy
Must not overwrite a cache block unless main memory is up to date
Multiple CPUs may have individual caches
I/O may address main memory directly
Write through
All writes go to main memory as well as cache
Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
Lots of traffic
Slows down writes
Remember bogus write through caches!
Write back
Updates initially made in cache only
Update bit for cache slot is set when update occurs
If block is to be replaced, write to main memory only if update bit is set
Other caches get out of sync
I/O must access main memory through cache
N.B. 15% of memory references are writes
Pentium 4 Cache
80386 – no on-chip cache
80486 – 8 KB, using 16-byte lines and a four-way set associative organization
Pentium (all versions) – two on-chip L1 caches
– Data & instructions
Pentium III – L3 cache added off-chip
Intel Cache Evolution

Problem → Solution (processor on which the feature first appears)

386: External memory slower than the system bus → Add external cache using faster memory technology.
486: Increased processor speed results in the external bus becoming a bottleneck for cache access → Move external cache on-chip, operating at the same speed as the processor.
486: Internal cache is rather small, due to limited space on chip → Add external L2 cache using faster technology than main memory.
Pentium: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache; the Prefetcher is stalled while the Execution Unit's data access takes place → Create separate data and instruction caches.
Pentium Pro: Increased processor speed results in the external bus becoming a bottleneck for L2 cache access → Create a separate back-side bus that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache.
Pentium II: Same L2 bottleneck → Move the L2 cache onto the processor chip.
Pentium III: Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small → Add external L3 cache.
Pentium 4: Same problem → Move the L3 cache on-chip.
Pentium 4 Block Diagram
Pentium 4 Core Processor
– Fetches instructions from L2 cache
– Decodes into micro-ops
– Stores micro-ops in L1 cache
– Schedules micro-ops based on data dependence and resources
– May speculatively execute
Pentium 4 Design Reasoning
Decodes instructions into RISC-like micro-ops before L1 cache
Micro-ops fixed length
– Superscalar pipelining and scheduling
Pentium instructions long & complex
Performance improved by separating decoding from scheduling & pipelining
– (More later – ch14)
Data cache is write back
– Can be configured to write through
L1 cache controlled by 2 bits in register
– CD = cache disable
– NW = not write through
– 2 instructions to invalidate (flush) cache and write back then invalidate
L2 and L3 8-way set-associative
– Line size 128 bytes
PowerPC Cache Organization
601 – single 32 KB cache, 8-way set associative
603 – 16 KB (2 × 8 KB), two-way set associative
PowerPC G5 Block Diagram