
Computer Architecture, Part V: Memory System Design




DOCUMENT INFORMATION

Basic information

Title: Memory System Design
Author: Behrooz Parhami
Institution: University of California, Santa Barbara
Subject: Computer Architecture
Document type: Presentation
Year: 2007
City: Santa Barbara
Pages: 72
File size: 1.14 MB


Contents


Page 1

Part V

Memory System Design

Page 2

About This Presentation

This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition: First
Released: July 2003
Revised: July 2004, July 2005, Mar 2006, Mar 2007

Page 3

V Memory System Design

Topics in This Part

Chapter 17 Main Memory Concepts

Chapter 18 Cache Memory Organization

Chapter 19 Mass Memory Concepts

Chapter 20 Virtual Memory and Paging

Design problem – We want a memory unit that:

• Can keep up with the CPU’s processing speed

• Has enough capacity for programs and data

• Is inexpensive, reliable, and energy-efficient

Page 4

17 Main Memory Concepts

Technologies & organizations for computer’s main memory

• SRAM (cache), DRAM (main), and flash (nonvolatile)

• Interleaving & pipelining to get around “memory wall”

Topics in This Chapter

17.1 Memory Structure and SRAM
17.2 DRAM and Refresh Cycles
17.3 Hitting the Memory Wall
17.4 Interleaved and Pipelined Memory
17.5 Nonvolatile Memory

17.6 The Need for a Memory Hierarchy

Page 5

17.1 Memory Structure and SRAM

Page 6

[Figure: SRAM chip organization, with Address and Data in lines and Data out for bytes 0 through 3.]

Page 7

SRAM with Bidirectional Data Bus

Fig 17.3 When data input and output of an SRAM chip

are shared or connected to a bidirectional data bus, output

must be disabled during write operations.

Page 8

17.2 DRAM and Refresh Cycles

DRAM vs SRAM Memory Cell Complexity

[Figure: DRAM cell (word line, bit line, pass transistor, capacitor) alongside an SRAM cell (word line, bit line, complemented bit line, Vcc).]

Fig 17.4 Single-transistor DRAM cell, which is considerably simpler than an SRAM cell, leads to dense, high-capacity DRAM memory chips.

Page 9

DRAM Refresh Cycles and Refresh Rate

Fig 17.5 Variations in the voltage across a DRAM cell capacitor after writing a 1 and subsequent refresh operations.

[Figure: cell voltage plotted over time between the levels representing a stored 1 and a stored 0.]

Page 10

Loss of Bandwidth to Refresh Cycles

Example 17.2

A 256 Mb DRAM chip is organized as a 32M × 8 memory externally and as a 16K × 16K array internally. Rows must be refreshed at least once every 50 ms to forestall data loss; refreshing a row takes 100 ns. What fraction of the total memory bandwidth is lost to refresh cycles?

[Figure: the chip's internal 16K × 16K array with a row buffer; Data in, Address, Data out, Output enable, and Chip select signals.]
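The worked solution is not included in this extract; as a rough check, a minimal sketch of the arithmetic, assuming each of the 16K rows needs its own 100 ns refresh cycle within every 50 ms window:

# Example 17.2 sketch: fraction of bandwidth lost to refresh
rows = 16 * 1024              # 16K internal rows
t_refresh_row = 100e-9        # 100 ns to refresh one row
refresh_period = 50e-3        # every row refreshed at least once per 50 ms

time_refreshing = rows * t_refresh_row         # ~1.64 ms of refresh per 50 ms window
fraction_lost = time_refreshing / refresh_period
print(f"{fraction_lost:.2%}")                  # ~3.28% of memory bandwidth

With these numbers, roughly 3% of the memory bandwidth goes to refresh.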

Page 11

[Figure: pinout of a memory chip in a 24-pin dual in-line package (DIP), showing address pins A0-A10, data pins, WE, and Vss.]

Page 12

[Figure: chart labels for workstations, servers, and supercomputers.]

Page 13

17.3 Hitting the Memory Wall

Fig 17.8 Memory density and capacity have grown along with the

CPU power and complexity, but memory speed has not kept pace

Page 14

Bridging the CPU-Memory Speed Gap

Idea: Retrieve more data from memory with each access

Fig 17.9 Two ways of using a wide-access memory to bridge the speed gap between the processor and memory

[Figure: two ways of using a wide-access memory: (a) buffer and multiplexer at the memory side, with a narrow bus to the processor; (b) buffer and multiplexer at the processor side, with a wide bus to the processor.]

Page 15

17.4 Pipelined and Interleaved Memory

[Figure: pipeline stages of a cache access: address translation, row decoding & read out, column decoding & selection, tag comparison & validation.]

Fig 17.10 Pipelined cache memory

Memory latency may involve other supporting operations besides the physical access itself

Virtual-to-physical address translation (Chap 20)

Tag comparison to determine cache hit/miss (Chap 18)

Page 16

[Figure: four-way interleaved memory, with separate modules for addresses that are 0, 1, 2, and 3 mod 4; Data in, Data out, and Return data paths.]
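To make the interleaving idea concrete, here is a minimal sketch (not from the slides) of how four-way low-order interleaving assigns an address to a memory bank:

# Four-way low-order interleaving: the two low-order address bits select the bank,
# so consecutive addresses fall in different banks and their accesses can overlap.
BANKS = 4

def bank_and_offset(addr: int) -> tuple[int, int]:
    return addr % BANKS, addr // BANKS

for addr in range(8):
    bank, offset = bank_and_offset(addr)
    print(f"address {addr:2d} -> bank {bank}, offset {offset}")

Consecutive addresses land in different banks, so accesses to sequential words can be overlapped in time.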

Page 17

17.5 Nonvolatile Memory

ROM

PROM

EPROM

Fig 17.12 Read-only memory organization, with the

fixed contents shown on the right

[Figure: ROM array with word lines and bit lines.]

Page 18

Flash Memory

Fig 17.13 EEPROM or Flash memory organization

Each memory cell is built of a floating-gate MOS transistor

[Figure: flash array with source lines, bit lines, and word lines; cross section of the floating-gate transistor showing control gate, floating gate, source, drain, n+ and n− regions, and p substrate.]

Page 19

17.6 The Need for a Memory Hierarchy

The widening speed gap between CPU and main memory

Processor operations take on the order of 1 ns

Memory access requires 10s or even 100s of ns

Memory bandwidth limits the instruction execution rate

Each instruction executed involves at least one memory access
Hence, a few to 100s of MIPS is the best that can be achieved

A fast buffer memory can help bridge the CPU-memory gap

The fastest memories are expensive and thus not very large

A second (third?) intermediate cache level is thus often used

Page 20

Typical Levels in a Hierarchical Memory

Fig 17.14 Names and key characteristics of levels in a memory hierarchy

Page 21

18 Cache Memory Organization

Processor speed is improving at a faster rate than memory’s

• Processor-memory speed gap has been widening

• Cache is to main as desk drawer is to file cabinet

Topics in This Chapter

18.1 The Need for a Cache
18.2 What Makes a Cache Work?
18.3 Direct-Mapped Cache
18.4 Set-Associative Cache
18.5 Cache and Main Memory
18.6 Improving Cache Performance

Page 22

18.1 The Need for a Cache

[Figure: processor datapath diagrams (ALU, register file, next-address logic, instruction cache, data cache, and associated control signals) for single-cycle and multicycle implementations; annotations include "500 MHz, CPI ≅ 1.1" and a truncated note beginning "All three of our".]

Page 23

Cache, Hit/Miss Rate, and Effective Access Time

One level of cache with hit rate h

Ceff = hCfast + (1 – h)(Cslow + Cfast) = Cfast + (1 – h)Cslow

[Figure: register file, cache (fast) memory, and main (slow) memory, with word transfers between the registers and the cache and line transfers between the cache and main memory.]

Data is in the cache

fraction h of the time

(say, hit rate of 98%)

Go to main 1 – h of the time

(say, cache miss rate of 2%)

Cache is transparent to user;

transfers occur automatically
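As a quick numeric illustration of the formula above (the access times below are assumed values, not from the slides; the 98% hit rate matches the example on this slide):

# Effective access time with one cache level: Ceff = Cfast + (1 - h) * Cslow
c_fast = 1.0     # cache access time in ns (assumed)
c_slow = 60.0    # extra time for a main-memory access on a miss, in ns (assumed)
h = 0.98         # hit rate

c_eff = c_fast + (1 - h) * c_slow
print(f"Ceff = {c_eff:.2f} ns")    # 2.20 ns: near cache speed despite the slow main memory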

Page 24

Multiple Cache Levels

Fig 18.1 Cache memories act as intermediaries between

the superfast processor and the much slower main memory

[Figure: CPU registers, level-1 cache, level-2 cache, and main memory; one of the two organizations shown is noted as "Cleaner and easier to analyze".]

Page 25

Performance of a Two-Level Cache System

Example 18.1

A system with L1 and L2 caches has a CPI of 1.2 with no cache misses. There are 1.1 memory accesses on average per instruction.

What is the effective CPI with cache misses factored in?

What are the effective hit rate and miss penalty overall if L1 and L2 caches are modeled as a single cache?

Level Local hit rate Miss penalty

L1 95 % 8 cycles

L2 80 % 60 cycles


Ceff = Cfast + (1 – h1)[Cmedium + (1 – h2)Cslow]

Because Cfast is included in the CPI of 1.2, we must account for the rest:
CPI = 1.2 + 1.1(1 – 0.95)[8 + (1 – 0.8)60] = 1.2 + 1.1 × 0.05 × 20 = 2.3
Overall: hit rate 99% (95% + 80% of 5%), miss penalty 60 cycles
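A minimal sketch of the same calculation, with the numbers taken from the example:

# Example 18.1: effective CPI with two cache levels
cpi_base = 1.2               # CPI assuming no cache misses
accesses_per_instr = 1.1
h1, c_medium = 0.95, 8       # L1 local hit rate, L1 miss penalty (cycles)
h2, c_slow = 0.80, 60        # L2 local hit rate, L2 miss penalty (cycles)

miss_cycles = (1 - h1) * (c_medium + (1 - h2) * c_slow)
cpi = cpi_base + accesses_per_instr * miss_cycles
print(round(cpi, 2))                         # 2.3

overall_hit = h1 + (1 - h1) * h2
print(f"{overall_hit:.0%}")                  # 99%, with a 60-cycle overall miss penalty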

Page 26

Cache Memory Design Parameters

Cache size (in bytes or words). A larger cache can hold more of the program's useful data but is more costly and likely to be slower.

Block or cache-line size (unit of data transfer between cache and main). With a larger cache line, more data is brought into the cache with each miss. This can improve the hit rate but may also bring in low-utility data.

Placement policy. Determining where an incoming cache line is stored. More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).

Replacement policy. Determining which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies: choosing a random or the least recently used block.

Write policy. Determining whether updates to cache words are immediately forwarded to main (write-through) or modified blocks are copied back to main only when they are replaced (write-back or copy-back).

Page 27

18.2 What Makes a Cache Work?

Fig 18.2 Assuming no conflict in address mapping, the cache will hold a small program loop in its entirety.

[Figure: a 9-instruction program loop in main memory mapping many-to-one onto cache memory; the cache line/block is the unit of transfer between main and cache memories; temporal and spatial locality are indicated.]

Page 28

Desktop, Drawer, and File Cabinet Analogy

Fig 18.3 Items on a desktop (register) or in a drawer (cache) are

more readily accessible than those in a file cabinet (main memory)

[Figure: access desktop (register file) in 2 s, access drawer (cache memory) in 5 s, access cabinet (main memory) in 30 s.]

Once the “working set” is in the drawer, very few trips to the file cabinet are needed.

Page 29

Temporal and Spatial Localities

[Figure: accessed memory addresses plotted against time, illustrating temporal and spatial locality.]

From Peter Denning's CACM paper, July 2005 (Vol 48, No 7, pp 19-24)

Page 30

Caching Benefits Related to Amdahl’s Law

Example 18.2

In the drawer & file cabinet analogy, assume a hit rate h in the drawer

Formulate the situation shown in Fig 18.3 in terms of Amdahl's law

Solution

Without the drawer, a document is accessed in 30 s. So, fetching 1000 documents, say, would take 30 000 s. The drawer causes a fraction h of the cases to be done 6 times as fast, with access time unchanged for the remaining 1 – h. Speedup is thus 1/(1 – h + h/6) = 6/(6 – 5h).

Improving the drawer access time can increase the speedup factor, but as long as the miss rate remains at 1 – h, the speedup can never exceed 1/(1 – h). Given h = 0.9, for instance, the speedup is 4, with the upper bound being 10 for an extremely short drawer access time.

Note: Some would place everything on their desktop, thinking that this yields even greater speedup. This strategy is not recommended!
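A minimal numeric check of the speedup formula, using the values from the example:

# Amdahl's-law view of the drawer (cache): a fraction h of accesses is 6 times
# as fast (5 s drawer vs. 30 s cabinet); the remaining 1 - h is unchanged.
def speedup(h: float, factor: float = 6.0) -> float:
    return 1.0 / ((1.0 - h) + h / factor)

print(round(speedup(0.9), 2))     # 4.0, as computed in the example
print(round(1 / (1 - 0.9), 2))    # 10.0, the bound for an arbitrarily fast drawer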

Page 31

Compulsory, Capacity, and Conflict Misses

Compulsory misses: With on-demand fetching, the first access to any item is a miss. Some "compulsory" misses can be avoided by prefetching.

Capacity misses: We have to oust some items to make room for others. This leads to misses that are not incurred with an infinitely large cache.

Conflict misses: Occasionally, there is free room, or space occupied by useless data, but the mapping/placement scheme forces us to displace useful items to bring in other items. This may lead to misses in the future (see the sketch at the end of this slide).

Given a fixed-size cache, dictated, e.g., by cost factors or availability of space on the processor chip, compulsory and capacity misses are pretty much fixed. Conflict misses, on the other hand, are influenced by the data mapping scheme, which is under our control.

We study two popular mapping schemes: direct and set-associative
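To make the conflict-miss category concrete, a small sketch (illustrative only, using the 8-line, 4-word-per-line direct-mapped cache of Fig 18.4) shows two addresses that share a line index repeatedly evicting each other even though the rest of the cache is empty:

# Direct-mapped cache model: 8 lines of 4 words; a word address maps to exactly one line.
LINES, WORDS_PER_LINE = 8, 4
tags = [None] * LINES        # tag stored in each line (None = invalid)
misses = 0

for addr in [0, 32, 0, 32, 0, 32]:           # word addresses 0 and 32 map to line 0
    line = (addr // WORDS_PER_LINE) % LINES
    tag = addr // (WORDS_PER_LINE * LINES)
    if tags[line] != tag:                    # the other block owns the line: conflict miss
        misses += 1
        tags[line] = tag

print(misses)   # 6 -> every access misses, although 7 of the 8 lines stay unused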

Page 32

18.3 Direct-Mapped Cache

Fig 18.4 Direct-mapped cache holding 32 words within eight 4-word lines

[Figure: the word address splits into a tag, a 3-bit line index in cache, and a 2-bit word offset in line; each cache line stores a valid bit and a tag, which is compared with the address tag (1 if equal) to detect a hit or a cache miss and to read out the specified word; main memory locations 0-3, 4-7, 8-11, 32-35, 36-39, 40-43, 64-67, 68-71, 72-75, 96-99, 100-103, 104-107, ... map many-to-one onto the eight cache lines.]

Page 33

Accessing a Direct-Mapped Cache

Example 18.4

Fig 18.5 Components of the 32-bit address in an example: a 16-bit line tag, a 12-bit line index in cache, and a 4-bit byte offset in line.

Show cache addressing for a byte-addressable memory with 32-bit addresses. Cache line width W = 16 B. Cache size L = 4096 lines (64 KB).

Solution

Byte offset in line is log2(16) = 4 b. Cache line index is log2(4096) = 12 b. This leaves 32 – 12 – 4 = 16 b for the tag.

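A minimal sketch of this address split (the field widths come from the example; the sample address is an arbitrary illustration):

# Example 18.4 address fields: 4-bit byte offset, 12-bit line index, 16-bit tag
OFFSET_BITS, INDEX_BITS = 4, 12

def split_address(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x1234 0x567 0x8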

Page 34

Direct-Mapped Cache Behavior

Fig 18.4 (repeated) [Figure: the word address split into a tag, 3-bit line index in cache, and 2-bit word offset in line; stored tags and valid bits; comparator output (1 if equal) indicating hit or miss and reading out the specified word.]

1: miss; line containing words 3, 2, 1, 0 fetched
7: miss; line containing words 7, 6, 5, 4 fetched

Page 35

18.4 Set-Associative Cache

Fig 18.6 Two-way set-associative cache holding 32 words of

data within 4-word lines and 2-line sets.

[Figure: the word address splits into a tag, a 2-bit set index in cache, and a 2-bit word offset in line; main memory locations map many-to-one onto the sets; tag comparison (1 if equal) detects a cache miss.]

Page 36

Accessing a Set-Associative Cache

[Figure: the 32-bit address splits into a 17-bit line tag, an 11-bit set index in cache, and a 4-bit byte offset in line; the address in cache is used to read out two candidate lines, whose tags are then compared against the address tag.]

Page 37

Cache Address Mapping

Example 18.6

A 64 KB four-way set-associative cache is byte-addressable and contains 32 B lines. Memory addresses are 32 b wide.

a. How wide are the tags in this cache?

b. Which main memory addresses are mapped to set number 5?

Solution

a. Address (32 b) = 5-b byte offset + 9-b set index + 18-b tag

b. Addresses that have their 9-bit set index equal to 5. These are of the general form 2^14 a + 2^5 × 5 + b, with 0 ≤ b ≤ 31; e.g., 160-191, 16 544-16 575, ...

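A small sketch (using the geometry derived above) confirming which addresses land in set 5:

# Example 18.6 geometry: 32 B lines (5 offset bits), 512 sets (9 index bits), 18-bit tag
OFFSET_BITS, INDEX_BITS = 5, 9

def set_index(addr: int) -> int:
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

# addresses of the form 2**14 * a + 32 * 5 + b (0 <= b <= 31) all map to set 5
assert all(set_index(2**14 * a + 32 * 5 + b) == 5
           for a in range(4) for b in range(32))
print(set_index(160), set_index(16_544), set_index(16_575))   # 5 5 5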

Page 38

18.5 Cache and Main Memory

The writing problem:

Write-through slows down the cache to allow main to catch up

Write-back or copy-back is less problematic, but still hurts

performance due to two main memory accesses in some cases

Solution: Provide write buffers for the cache so that it does not have to wait for main memory to catch up

Harvard architecture: separate instruction and data memories

von Neumann architecture: one memory for instructions and data

Split cache: separate instruction and data caches (L1)
Unified cache: holds instructions and data (L1, L2, L3)

Page 39

Faster Main-Cache Data Transfers

Fig 18.8 A 256 Mb DRAM chip organized as a 32M × 8 memory module: four such chips could form a 128 MB main memory unit

[Figure: row address decoder, memory matrix with a selected row, and column multiplexer.]

Page 40

18.6 Improving Cache Performance

For a given cache size, the following design issues and tradeoffs exist:

Line width (2^W). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.

Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses. More on this later.

Line replacement policy. Usually the LRU (least recently used) algorithm or some approximation thereof; not an issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice (a brief sketch follows this list).

Write policy. Modern caches are very fast, so write-through is seldom a good choice. We usually implement write-back or copy-back, using write buffers to soften the impact of main memory latency.
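As an illustration of the replacement-policy point above, a minimal sketch of true LRU for a single set of a 4-way set-associative cache (illustrative only; real caches usually implement an approximation of LRU in hardware):

from collections import OrderedDict

class LRUSet:
    """One set of a 4-way set-associative cache with true LRU replacement (sketch)."""
    def __init__(self, ways: int = 4):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> data; order records recency of use

    def access(self, tag) -> bool:
        if tag in self.lines:               # hit: mark line as most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) == self.ways:    # set full: evict the least recently used line
            self.lines.popitem(last=False)
        self.lines[tag] = None              # allocate the line (data omitted in this sketch)
        return False

s = LRUSet()
print([s.access(t) for t in [1, 2, 3, 4, 1, 5, 2]])
# [False, False, False, False, True, False, False]: tag 2 was the LRU victim when 5 arrived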

Page 41

Effect of Associativity on Cache Performance

Fig 18.9 Performance improvement of caches with increased associativity

[Figure: plot legend showing 2-way, 8-way, and 32-way associativity.]

Page 42

19 Mass Memory Concepts

Today’s main memory is huge, but still inadequate for all needs

• Magnetic disks provide extended and back-up storage

• Optical disks & disk arrays are other mass storage options

Topics in This Chapter

19.1 Disk Memory Basics
19.2 Organizing Data on Disk
19.3 Disk Performance
19.4 Disk Caching
19.5 Disk Arrays and RAID
19.6 Other Types of Mass Memory

Ngày đăng: 11/10/2021, 14:39