Slide 1: Group 7
1. Đỗ Luật Khoa
2. Lương Quang Tùng
3. Trần Thanh Phương
4. Phan Thanh Duy
5. Đặng Thanh Hùng
6. Thái Tiểu Minh
7. Nguyễn Thị Thanh Xuân
Slide 2: Multiprocessor systems
Slide 3: What is a multiprocessor system?
A multiprocessor is a tightly coupled computer system having two or more processing units (multiple processors), each sharing main memory and peripherals, in order to simultaneously process programs (according to Wikipedia).
The goals are high performance and reduced energy consumption.
Slide 4: Flynn Classification
Flynn [1966] proposed a simple model of categorizing all computers that is still useful today. He looked at the parallelism in the instruction and data streams.
(Michael J. Flynn: born May 20, 1934, New York City.)
Slide 5: What category can a machine fall into in the Flynn Classification?
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:
Slide 6:
- Single instruction stream, single data stream (SISD): this category is the uniprocessor.
- Single instruction stream, multiple data streams (SIMD): the same instruction is executed by multiple processors using different data streams.
- Multiple instruction streams, single data stream (MISD): no commercial multiprocessor of this type has been built to date.
- Multiple instruction streams, multiple data streams (MIMD): each processor fetches its own instructions and operates on its own data.
Slide 7: Because an MIMD can exploit thread-level parallelism, it is the architecture of choice for general-purpose multiprocessors.
MIMDs offer flexibility, and they can build on the cost-performance advantages of off-the-shelf processors.
Slide 8: MIMD Model
Each processor executes its own instruction stream. In many cases, each processor executes a different process. A process is a segment of code that may be run independently; the process state contains all the information necessary to execute that program on a processor.
Multiple processors may also execute a single program, sharing the code and most of their address space.
To use an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute. These software threads consist of hundreds to millions of instructions that may be executed in parallel.
Slide 9: Centralized shared-memory architectures
These had at most a few dozen processor chips (and fewer than 100 cores) in 2006, and they suit multiprocessors with small processor counts. By using multiple point-to-point connections, or a switch, and adding additional memory banks, the design can scale somewhat.
Because there is a single main memory that has a symmetric relationship to all processors and a uniform access time from any processor, these machines are called symmetric (shared-memory) multiprocessors (SMPs) or uniform memory access (UMA) multiprocessors.
Slide 10: This type of symmetric shared-memory architecture is currently by far the most popular organization.
Slide 11: MIMD Model
To support larger processor counts, memory must be distributed among the processors rather than centralized. Distributing the memory among the nodes has two major benefits:
First, it is a cost-effective way to scale the memory bandwidth.
Second, it reduces the latency for accesses to local memory.
Slide 12: The second group consists of multiprocessors with physically distributed memory.
Slide 13: MIMD Model
The key disadvantages of a distributed-memory architecture are that communicating data between processors becomes somewhat more complex, and that more effort in software is required to take advantage of the increased memory bandwidth afforded by distributed memories.
Slide 14: Memory consistency models
Slide 15: What is a memory consistency model?
The memory consistency model of a shared-memory multiprocessor specifies how memory behaves with respect to read and write operations from multiple processors. In particular, it specifies when a value written by one processor can be read by another processor. The model affects:
- System implementation: hardware, OS, languages, compilers
- Programming correctness
- Performance
Slide 16: Sequential Consistency
Lamport's definition: a multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
There are two aspects to sequential consistency:
(1) maintaining program order among operations from individual processors;
(2) maintaining a single sequential order among operations from all processors.
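For tiny programs, aspects (1) and (2) can be checked mechanically by enumerating every interleaving that preserves each processor's program order. A minimal sketch (the Dekker-style two-processor programs are our own illustrative choice, not from the slides):

```python
from itertools import combinations

# Litmus test (illustrative):  P1: X = 1; u = Y      P2: Y = 1; v = X
P1 = [("write", "X", 1), ("read", "Y", "u")]
P2 = [("write", "Y", 1), ("read", "X", "v")]

def sc_outcomes(t0, t1):
    """Every (u, v) outcome reachable under sequential consistency:
    operations from the two processors interleave into one sequential
    order, but each processor keeps its own program order."""
    outcomes = set()
    n = len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):  # positions of t0's ops
        i0 = i1 = 0
        mem, regs = {"X": 0, "Y": 0}, {}
        for i in range(n):
            if i in slots:
                kind, var, x = t0[i0]; i0 += 1
            else:
                kind, var, x = t1[i1]; i1 += 1
            if kind == "write":
                mem[var] = x
            else:
                regs[x] = mem[var]
        outcomes.add((regs["u"], regs["v"]))
    return outcomes

print(sorted(sc_outcomes(P1, P2)))  # (0, 0) never appears under SC
```

Because each read follows its own processor's write in program order, the outcome (u,v)=(0,0) is unreachable; that is exactly the guarantee the relaxed models below give up.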
Slide 17: Sequential Consistency
Conceptual representation of sequential consistency, with several processors sharing a common logical memory.
Slide 18: Sample programs to illustrate sequential consistency
Slide 19: Relaxed Consistency Models: The Basics
The basic idea behind relaxed memory models is to enable more optimizations by eliminating some of the constraints that sequential consistency places on the overlap and reordering of memory operations.
Sequential consistency requires the illusion of program order and atomicity to be maintained for all operations; relaxed models typically allow certain memory operations to execute out of program order or non-atomically.
Several relaxed memory consistency models:
- Relaxing the Write to Read Program Order (total store ordering or processor consistency).
- Relaxing the Write to Write Program Order (partial store ordering).
- Relaxing the Read to Read and Read to Write Program Orders (weak ordering, the PowerPC consistency model, and release consistency).
Slide 20: Relaxing the Write to Read Program Order
SPARC V8 Total Store Ordering (TSO) [SFC91, SUN91]
Slide 21: Relaxing the Write to Read Program Order
(Figure: example programs with the outcomes (u,v,w,x)=(1,1,0,0) and (u,v,w,x)=(1,2,0,0).)
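A store buffer is the hardware mechanism behind the write-to-read relaxation: a processor's write sits in a FIFO buffer, visible to its own later reads but invisible to other processors, before draining to memory. A minimal toy model (our own sketch; the class and method names are invented for illustration):

```python
class TSOCore:
    """Toy core with a FIFO store buffer. This models only the
    write-to-read relaxation of TSO, nothing else."""

    def __init__(self, mem):
        self.mem, self.buf = mem, []

    def write(self, var, val):
        self.buf.append((var, val))       # buffered, not yet in memory

    def read(self, var):
        for v, val in reversed(self.buf): # a core sees its own stores...
            if v == var:
                return val
        return self.mem[var]              # ...otherwise reads memory

    def drain(self):
        for v, val in self.buf:           # buffer eventually flushes
            self.mem[v] = val
        self.buf.clear()

mem = {"X": 0, "Y": 0}
p1, p2 = TSOCore(mem), TSOCore(mem)
p1.write("X", 1)   # buffered: invisible to p2
p2.write("Y", 1)   # buffered: invisible to p1
u = p1.read("Y")   # reads memory: 0
v = p2.read("X")   # reads memory: 0
p1.drain(); p2.drain()
print((u, v))      # (0, 0): forbidden under SC, allowed under TSO
```

The reads effectively complete before the earlier writes of their own processors, producing the Dekker-test outcome that sequential consistency rules out.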
Slide 22: Relaxing the Write to Read Program Order
Slide 23: Conditions for Processor Consistency (PC)
(Figure: (c) the outcome (u,v,w)=(1,1,0); (d) the outcome (u,v,w,x)=(1,0,1,0).)
Memory sub-operations must execute in a sequential order that satisfies the following conditions:
(a) sub-operations appear in this sequence in the order specified by the program order requirement of the figure,
(b) the order among sub-operations satisfies the coherence requirement, and
(c) a read sub-operation R(i) returns the value of either the last write sub-operation W(i) to the same location that appears before the read in this sequence, or the last write operation to the location that is before the read in program order, whichever occurs later in the execution sequence.
Slide 24: Relaxing the Write to Write Program Order
Slide 25: Partial store ordering (PSO) model example
Because PSO allows writes to different locations to complete out of program order, it also allows the outcomes (u,v)=(0,0), (0,1), or (1,0).
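The effect of relaxing write-to-write order can be seen by brute force on a message-passing litmus test (the program below, with a data variable and a flag, is our own illustrative choice; the slide's example program is not shown):

```python
from itertools import combinations

def outcomes(t0, t1):
    """All (u, v) results over interleavings of two straight-line threads,
    each thread's operations kept in the order given."""
    res, n = set(), len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):
        i0 = i1 = 0
        mem, regs = {"data": 0, "flag": 0}, {}
        for i in range(n):
            if i in slots:
                kind, var, x = t0[i0]; i0 += 1
            else:
                kind, var, x = t1[i1]; i1 += 1
            if kind == "write":
                mem[var] = x
            else:
                regs[x] = mem[var]
        res.add((regs["u"], regs["v"]))
    return res

writer = [("write", "data", 1), ("write", "flag", 1)]
reader = [("read", "flag", "u"), ("read", "data", "v")]

sc = outcomes(writer, reader)                 # program order respected
pso = sc | outcomes(writer[::-1], reader)     # PSO: the two writes may swap
print(sorted(sc), sorted(pso - sc))
```

Under program order, seeing flag=1 implies seeing data=1; once the writes may swap, the "message received but data stale" outcome (u,v)=(1,0) appears.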
Slide 26: Relaxing the Read to Read and Read to Write Program Order
By Dubois et al. [DSB86]
Slide 27: Weak ordering model
(Figure: the two outcomes (u,v)=(1,0) and (u,v)=(0,1).)
Slide 28: Relaxing the Read to Read and Read to Write Program Order
[GLL+90, GGH93b]
Slide 29: Release consistency
If P1 enters the critical section first, the result is (u,v)=(1,1); if P2 gains access to the critical section before P1, we can get the result (u,v)=(0,0).
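The critical-section scenario can be reconstructed with ordinary locks, whose acquire and release operations provide exactly the ordering that release consistency requires at synchronization points. The variable names A and B and the two thread bodies below are our own assumptions, not the slide's program:

```python
import threading

# P1 writes two locations inside a critical section; P2 reads both inside
# the same critical section.  Acquire/release ordering guarantees that P2
# sees either both writes or neither.
A = B = 0
u = v = None
lock = threading.Lock()

def p1():
    global A, B
    with lock:       # acquire
        A = 1
        B = 1        # release on block exit

def p2():
    global u, v
    with lock:
        u = B
        v = A

t1 = threading.Thread(target=p1)
t2 = threading.Thread(target=p2)
t1.start(); t2.start()
t1.join(); t2.join()
print((u, v))        # either (1, 1) or (0, 0), never a torn (1, 0)
```

Which of the two outcomes occurs depends only on which thread wins the lock, matching the slide's two cases.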
Slide 30: Comparing the WO and RC models
Slide 31
Slide 32:
Thread: A thread includes the program counter, the register state, and the stack. It is a lightweight process; whereas threads commonly share a single address space, processes don't.
Process: A process includes one or more threads, the address space, and the operating system state. Hence, a process switch usually invokes the operating system, but a thread switch does not.
Superscalar: An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
Slide 33
Slide 34: Multithreading on a superscalar
- A superscalar with no multithreading: the use of issue slots is limited by a lack of ILP. A major stall, such as an instruction cache miss, can leave the entire processor idle.
- A superscalar with coarse-grained multithreading: the long stalls are partially hidden by switching to another thread that uses the resources of the processor. ILP limitations still lead to idle cycles within each clock cycle. Furthermore, since thread switching only occurs when there is a stall, and the new thread has a start-up period, there are likely to be some fully idle cycles remaining.
- A superscalar with fine-grained multithreading: the interleaving of threads eliminates fully empty slots. Because only one thread issues instructions in a given clock cycle, however, ILP limitations still lead to idle slots within some clock cycles.
- A superscalar with simultaneous multithreading: TLP and ILP are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle.
Slide 35: Figure: The speedup from using multithreading on one core of an i7 processor averages 1.31 for the PARSEC benchmarks, and the energy efficiency improvement is 1.07.
This data was collected and analyzed by Esmaeilzadeh et al. [2011].
Slide 36: Cache for Multiprocessors
Slide 37: Cache for Multiprocessors
Slides 38-40: (Figure sequence: several processors' caches each hold a copy of A = 3; one processor then writes A = 7, leaving the other cached copies stale.)
Slide 41: Cache coherency solutions
Slide 42: Software-based vs. hardware-based
Software-based:
- Compiler-based, or with run-time system support
- With or without hardware assist
- A tough problem, because perfect information is needed in the presence of memory aliasing and explicit parallelism
We focus on hardware-based solutions, as they are more common.
Slide 43: Hardware Solutions
The schemes can be classified based on:
- Shared caches vs. snoopy schemes vs. directory schemes
- Write-through vs. write-back (ownership-based) protocols
- Update vs. invalidation protocols
- Dirty-sharing vs. no-dirty-sharing protocols
Slide 44: Snoopy Cache Coherence Schemes
A distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism. This is the most commonly used method in commercial multiprocessors.
Slide 45: Write-Through Schemes
All processor writes result in an update of the local cache and a global bus write.
Advantage: simple to implement.
Disadvantage: since ~15% of references are writes, this scheme consumes tremendous bus bandwidth, so only a few processors can be supported.
⇒ Need for dual-tagging caches in some cases
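A back-of-envelope sketch of that bandwidth argument (every number below except the ~15% write fraction is an illustrative assumption, not from the slides):

```python
# If ~15% of references are writes and every write goes on the shared
# bus, bus bandwidth caps the processor count.
refs_per_sec_per_cpu = 200e6   # hypothetical reference rate per processor
write_fraction = 0.15          # from the slide: ~15% of references
bytes_per_bus_write = 8        # hypothetical: address + data word
bus_bandwidth = 1e9            # hypothetical: 1 GB/s shared bus

traffic_per_cpu = refs_per_sec_per_cpu * write_fraction * bytes_per_bus_write
max_cpus = int(bus_bandwidth // traffic_per_cpu)
print(max_cpus)                # a handful of processors saturate the bus
```

With these (made-up but plausible) figures, four processors already consume the whole bus, which is why write-back/ownership schemes follow on the next slide.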
Slide 46: Write-Back/Ownership Schemes
When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth. Most bus-based multiprocessors nowadays use such schemes.
Many variants of ownership-based protocols exist:
- Goodman's write-once scheme
- Berkeley ownership scheme
- Firefly update protocol
Slide 47: Invalidation vs. Update Strategies
Invalidation: on a write, all other cached copies are invalidated. Works well when there are multiple writes by one PE before the data is read by another PE. Usually the default.
Update: on a write, all other cached copies are updated. Works well with a single producer and many consumers of data, but junk data accumulates in large caches (e.g. on process migration).
Slide 48: Directory-Based Cache Coherence
Key idea: keep track, in a global directory (in main memory), of which processors are caching a location and its state.
Slide 49: Directory-Based Protocol
Slide 50: Directory Protocol
Every memory block has associated directory information, which keeps track of copies of cached blocks and their states. On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, and only if necessary. In scalable networks, communication with the directory and the copies is through network transactions.
Slide 51: Directory Protocol on Shared Memory
- k processors.
- With each cache block in memory: k presence bits, 1 dirty bit.
- With each cache block in cache: 1 valid bit and 1 dirty (owner) bit.
- Read from main memory by processor i:
  - If dirty bit OFF, then { read from main memory; turn p[i] ON; }
  - If dirty bit ON, then { recall line from dirty processor (set its cache state to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i; }
- Write to main memory by processor i:
  - If dirty bit OFF, then { supply data to i; send invalidations to all caches that have the block; turn dirty bit ON; turn p[i] ON; }
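The presence-bit/dirty-bit rules above can be sketched directly as a toy model. The class and method names are our own; data movement is modeled only as state changes, and since the dirty-bit-ON write case is cut off on the slide, invalidating the old owner is our assumed completion:

```python
class DirectoryEntry:
    """Directory state for one memory block: k presence bits plus one
    dirty bit, following the slide's read/write rules."""

    def __init__(self, k):
        self.p = [False] * k   # p[i]: processor i has a cached copy
        self.dirty = False

    def read_by(self, i):
        if self.dirty:
            # Recall the line from the dirty processor (its copy becomes
            # shared), update memory, and turn the dirty bit OFF.
            self.dirty = False
        self.p[i] = True       # turn p[i] ON

    def write_by(self, i):
        # Supply data to i, invalidate every other cached copy,
        # and turn the dirty bit ON.
        self.p = [j == i for j in range(len(self.p))]
        self.dirty = True

entry = DirectoryEntry(4)
entry.read_by(0); entry.read_by(2)
print(entry.p, entry.dirty)    # [True, False, True, False] False
entry.write_by(1)              # invalidates caches 0 and 2; 1 becomes owner
print(entry.p, entry.dirty)    # [False, True, False, False] True
```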
Slide 52: Directory Protocol on Shared Memory
Slide 53: Directory Protocol
Three states:
- Shared: one or more processors have the block cached, and the value in memory is up to date.
- Uncached: no processor has a copy.
- Exclusive: exactly one processor (the owner) has a copy, and the value in memory is out-of-date.
In addition to the cache state, we must track which processors have data when in the shared state (usually a bit vector: 1 if the processor has a copy).
Keep it simple(r):
- Writes to non-exclusive data ⇒ write miss
- The processor blocks until the access completes
- Assume messages are received and acted upon in the order sent
Slide 54: Directory Protocol
No bus, and we don't want to broadcast:
- The interconnect is no longer a single arbitration point
- All messages have explicit responses
Terms (typically three processors are involved): the local node, where the request originates; the home node, where the memory location of an address resides; and the remote node, which has a copy of the block, whether exclusive or shared.
Example messages on the next slide: P = processor number, A = address.
Slide 55: Directory Protocol Messages
- Read miss: processor P reads data at address A; make P a read sharer and request data.
- Write miss: processor P has a write miss at address A; make P the exclusive owner and request data.
- Invalidate: invalidate a shared copy at address A.
- Fetch: fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared.
- Fetch/invalidate: fetch the block at address A and send it to its home directory; invalidate the block in the cache.
- Data value reply: return a data value from the home memory (read miss response).
- Data write back: write back a data value for address A (invalidate response).
Slide 56: State-Transition Diagram for One Block in a Directory-Based System
- The states are identical to the snoopy case; the transactions are very similar.
- Transitions are caused by read misses, write misses, invalidates, and data fetch requests.
- The cache generates read-miss and write-miss messages to the home directory.
- Write misses that were broadcast on the bus for snooping become explicit invalidate and data fetch requests.
- Note: on a write, the cache block is bigger than the written word, so we need to read the full cache block.
Slide 57: CPU-Cache State Machine
(State diagram: state machine for CPU requests, for each memory block. Labeled transitions include:)
- Invalid → Shared (read-only): CPU read; send Read Miss message to home directory.
- Invalid → Exclusive: CPU write; send Write Miss message to home directory.
- Shared: CPU read hit. Shared → Exclusive: CPU write; send Write Miss message to home directory.
- Exclusive: CPU read hit, CPU write hit.
- Exclusive → Shared: Fetch; send Data Write Back message to home directory.
- Exclusive, CPU read miss: send Data Write Back message and Read Miss to home directory.
- Exclusive, CPU write miss: send Data Write Back message and Write Miss to home directory.
Slide 58: Transition Diagram for the Directory
- Same states and structure as the transition diagram for an individual cache.
- Two actions: (1) update the directory state; (2) send messages to satisfy requests.
- The directory tracks all copies of each memory block.
- It also indicates an action that updates the sharing set, Sharers, as well as sending a message.
Slide 59: Directory State Machine
(State diagram: state machine for directory requests, for each memory block. E.g. a Write Miss moves the block to Exclusive (read/write), sets Sharers = {P}, and sends a Data Value Reply message; an Exclusive block is written back first.)
Slide 60: Example Directory Protocol
A message sent to the directory causes two actions: update the directory, and send more messages to satisfy the request.
If the block is Uncached, the copy in memory is the current value; the only possible requests for that block are:
- Read miss: the requesting processor is sent data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
- Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached; Sharers indicates the identity of the owner.
If the block is Shared, the memory value is up to date:
- Read miss: the requesting processor is sent data from memory and added to the sharing set.
- Write miss: the requesting processor is sent the value. All processors in Sharers are sent invalidate messages, Sharers is set to just the requesting processor, and the state of the block is made Exclusive.
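The Uncached/Shared/Exclusive rules above can be sketched as a small state machine (our own sketch; message sending is modeled as returned values rather than real network transactions, and names are invented for illustration):

```python
class Directory:
    """Per-block directory: three states plus the sharing set."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def read_miss(self, p):
        if self.state == "Exclusive":
            # Fetch the block from the owner, write it back to memory,
            # then let the owner keep a shared copy.
            self.state = "Shared"
        elif self.state == "Uncached":
            self.state = "Shared"
        self.sharers.add(p)        # reply with Data Value Reply to p
        return "DataValueReply"

    def write_miss(self, p):
        if self.state == "Shared":
            invalidated = self.sharers - {p}   # invalidate other sharers
        elif self.state == "Exclusive":
            invalidated = set(self.sharers)    # fetch/invalidate old owner
        else:
            invalidated = set()
        self.sharers = {p}         # p becomes the only (owning) copy
        self.state = "Exclusive"
        return invalidated         # processors that must be invalidated

d = Directory()
d.read_miss(0); d.read_miss(1)
print(d.state, sorted(d.sharers))      # Shared [0, 1]
print(sorted(d.write_miss(2)))         # [0, 1]  (invalidations sent)
print(d.state, sorted(d.sharers))      # Exclusive [2]
```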