Slide 1: Multiprocessor
Group 8:
Nguyễn phúc Ánh – 13070221
Lê Minh Nam – 13070249
Nguyễn Hữu Hiếu – 12073119
Mai Văn Tinh – 13070270
Lê Thanh Phương – 13070254
Lý Đoàn Duy Khánh – 13070238
Slide 3: 1. Introduction to Multiprocessor Systems
Slide 4: Introduction to Multiprocessor Systems
A multiprocessor is a tightly coupled computer system with two or more processing units (multiple processors) that share main memory and peripherals, allowing the complete system to process programs simultaneously.
Slide 5: Introduction to Multiprocessor Systems
Why do we need multiprocessors?
• The need to improve system performance: uniprocessor speed keeps improving, but the improvement is limited.
• Growth in data-intensive applications: databases, file servers.
• Improved understanding of how to use multiprocessors effectively.
=> Solution: improve performance by connecting multiple microprocessors together.
Slide 6: Introduction to Multiprocessor Systems
Slide 7: Flynn's Taxonomy
Flynn's taxonomy classifies parallel machines into four categories based on two questions:
• How many instruction streams?
• How many data streams?
Each stream has two possible states, single or multiple, giving four categories:
• SISD
• SIMD
• MISD
• MIMD
Slide 8: Flynn's Taxonomy
SISD: Single Instruction stream, Single Data stream
• A uniprocessor.
• Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
• Single data: only one data stream is used as input during any one clock cycle.
• Instructions are executed sequentially.
• Examples: IBM 701, IBM 1620, IBM 7090.
Slide 9: Flynn's Taxonomy
SIMD: Single Instruction stream, Multiple Data streams
• The same instruction is executed by multiple processing units, each operating on its own data stream.
Slide 10: Flynn's Taxonomy
MISD: Multiple Instruction streams, Single Data stream
• Rarely used; intended for special-purpose computations, e.g., multiple cryptography algorithms attempting to crack a single coded message.
Slide 11: Flynn's Taxonomy
MIMD: Multiple Instruction streams, Multiple Data streams
• Each processor executes its own instructions and operates on its own data.
• Includes multi-core processors.
• Used for general-purpose parallel computers.
• Examples: IBM 370/168 MP, Univac 1100/80.
Slide 12: 2. Synchronization
Slide 13: Typical use of a lock
    while (!acquire(lock))
        ;  /* spin */
    /* computation on shared data (critical section) */
    release(lock);
Acquire is built on a read-modify-write primitive. The basic principle is atomic exchange; common primitives, sketched below, are:
• Test-and-set
• Fetch-and-increment
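As a concrete illustration, here is a minimal C11 sketch of the two primitives using <stdatomic.h> (my construction; the function names are illustrative, not from the slides):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* clear => free */
    atomic_int  counter   = 0;

    /* Test-and-set: atomically set the flag and report its old value. */
    bool test_and_set(void) {
        return atomic_flag_test_and_set(&lock_flag); /* true => was set */
    }

    /* Fetch-and-increment: atomically add 1 and return the old value. */
    int fetch_and_increment(void) {
        return atomic_fetch_add(&counter, 1);
    }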
Slide 14: Issues for Synchronization
• An uninterruptable instruction to fetch and update memory (atomic operation).
• User-level synchronization operations built using this primitive.
• For large-scale multiprocessors, synchronization can become a bottleneck; techniques are needed to reduce the contention and latency of synchronization.
Slide 15: Uninterruptable Instruction to Fetch and Update Memory
Atomic exchange: interchange a value in a register for a value in memory.
• 0 => synchronization variable is free
• 1 => synchronization variable is locked and unavailable
o Set the register to 1 and swap.
o The new value in the register determines success in getting the lock:
  0 if you succeeded in setting the lock (you were first)
  1 if another processor had already claimed access
o The key is that the exchange operation is indivisible.
o Release the lock simply by writing a 0.
o Note that every execution requires both a read and a write.
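A minimal C11 sketch of this exchange-based lock (my construction, not code from the slides):

    #include <stdatomic.h>

    atomic_int lock = 0;     /* 0 => free, 1 => locked and unavailable */

    void acquire(atomic_int *l) {
        /* set a value of 1 and swap; reading back 0 means we were first */
        while (atomic_exchange(l, 1) == 1)
            ;                /* another processor had claimed access */
    }

    void release(atomic_int *l) {
        atomic_store(l, 0);  /* release the lock simply by writing a 0 */
    }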
Slide 16: Uninterruptable Instruction to Fetch and Update Memory
• Test-and-set: tests a value and sets it if the value passes the test.
• Fetch-and-increment: returns the value of a memory location and atomically increments it.
o 0 => synchronization variable is free
Slide 17: Load linked & store conditional
• It is hard to have a read and a write in one instruction (as needed for atomic exchange and other read-modify-write operations).
o It makes coherence more difficult, since the hardware cannot allow any operations between the read and the write, and yet must not deadlock.
• So, use 2 instructions instead: load linked (or load locked) + store conditional.
Slide 18: Load linked & store conditional
• LL r,x loads the value of x into register r and saves the address x into a link register.
• SC r,x stores r into address x only if it is the first store after LL r,x. Success is reported by returning a value (r=1); otherwise the store fails and r=0 is returned.
Slide 19: Load linked & store conditional
Example doing atomic exchange with LL & SC:
    try:  MOV  R3,R4     ; move exchange value
          LL   R2,0(R1)  ; load linked
          SC   R3,0(R1)  ; store conditional
          BEQZ R3,try    ; branch if store fails (R3 = 0)
          MOV  R4,R2     ; put load value in R4
Slide 20: Load linked & store conditional
Example doing fetch & increment with LL & SC:
    try:  LL     R2,0(R1)  ; load linked
          DADDUI R3,R2,#1  ; increment
          SC     R3,0(R1)  ; store conditional
          BEQZ   R3,try    ; branch if store fails
Slide 21: Spin Locks
The processor continuously tries to acquire the lock, spinning around a loop until it succeeds:
            li   R2,#1
    lockit: exch R2,0(R1)  ; atomic exchange
            bnez R2,lockit ; already locked?
Slide 22: Barrier Synchronization
• All processes have to wait at a synchronization point, e.g., the end of parallel do loops.
• Processes don't progress until they all reach the barrier.
• Phase (i+1) does not begin until every process completes phase i.
Slide 23: Barrier Synchronization
Low-performance implementation (sketched in C below): use a counter initialized with the number of processes. Each arriving process decrements the counter (atomic fetch-and-add(-1)) and busy-waits; the last process to arrive signals that all may progress (broadcast).
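A minimal C11 sketch of this counter barrier (my construction; single-use only, so a reusable version would need sense reversal):

    #include <stdatomic.h>

    #define NPROCS 8                  /* assumed number of processes */

    atomic_int count = NPROCS;        /* counter initialized with #processes */
    atomic_int go    = 0;             /* progress flag (the broadcast) */

    void barrier(void) {
        if (atomic_fetch_add(&count, -1) == 1) {
            atomic_store(&go, 1);     /* last arrival: broadcast progress */
        } else {
            while (atomic_load(&go) == 0)
                ;                     /* busy-wait for the broadcast */
        }
    }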
Slide 24: Synchronization Mechanisms for Larger-Scale Multiprocessors
• There is a race condition for acquiring a lock that has just been released: all waiting processors suffer read and write misses, giving O(n²) bus transactions for n contending processes.
• Potential improvements (a backoff sketch follows this list):
o Exponential backoff
o Queuing locks (software or hardware)
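A sketch of exponential backoff in C11 (illustrative; the delay constants are assumptions of mine):

    #include <stdatomic.h>

    void acquire_backoff(atomic_int *lock) {
        int delay = 1;
        while (atomic_exchange(lock, 1) == 1) {
            /* failed: stay off the bus for a while before retrying */
            for (volatile int i = 0; i < delay; i++)
                ;
            if (delay < 1024)
                delay *= 2;          /* double the wait after each failure */
        }
    }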
Slide 26: Queuing Locks
• Basic idea: a queue of waiting processors is maintained in shared memory for each lock (best for bus-based machines).
o Each processor performs an atomic operation to obtain a memory location (an element of an array) on which to spin.
o Upon a release, the lock can be handed off directly to the next waiting processor (see the sketch below).
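A software array-based queuing lock can be sketched in C11 as follows (my construction; MAXPROCS and the modular slot arithmetic are assumptions):

    #include <stdatomic.h>

    #define MAXPROCS 64

    atomic_int slots[MAXPROCS] = { 1 };  /* slot 0 starts as the grant */
    atomic_int next_slot = 0;

    int acquire_queued(void) {
        /* atomic operation to obtain a memory location on which to spin */
        int my = atomic_fetch_add(&next_slot, 1) % MAXPROCS;
        while (atomic_load(&slots[my]) == 0)
            ;                            /* spin only on our own element */
        atomic_store(&slots[my], 0);     /* consume the grant */
        return my;
    }

    void release_queued(int my) {
        /* hand the lock directly to the next waiting processor */
        atomic_store(&slots[(my + 1) % MAXPROCS], 1);
    }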
Slide 28: Relaxed Consistency Models
• Eager release consistency
• Lazy release consistency
• Entry consistency
Slide 29: What?
Slide 30: Memory consistency
• A memory consistency model specifies the order in which reads and writes to different memory locations appear to execute.
• It is a contract between the hardware, the compiler, and the programmer:
o Hardware and the compiler will not violate the ordering specified.
o The programmer will not assume a stricter order than that of the model.
• The model provides mechanisms so the user can enforce a stricter order than that provided by the model.
Slide 31: Relaxed Consistency Models
The key idea in relaxed consistency models is to allow reads and writes to complete out of order, but to use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent.
There is a variety of relaxed models, classified according to which read and write orderings they relax.
Slide 32: Relaxed Consistency Models
• Relaxing the W→R ordering yields a model known as total store ordering or processor consistency.
• Relaxing the W→W ordering yields a model known as partial store order.
• Relaxing the R→W and R→R orderings yields a variety of models, including weak ordering, the PowerPC consistency model, and release consistency, depending on the details of the ordering restrictions and how synchronization operations enforce ordering.
Slide 33: Sequential Consistency
Sequential consistency (Lamport): "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program."
Slide 34: Sequential Consistency
• Sequential consistency says the machine behaves as if the processors take turns in an arbitrary order.
• The program behaves as if the threads take turns executing instructions (not necessarily in any fair order), i.e., only one thread executes an instruction at a given time.
Slide 35: Sequential Consistency
Slide 36: Sufficient Conditions for Sequential Consistency
• Every process issues memory operations in program order.
• After a write is issued, the process waits for the write to complete before issuing its next operation (write completes → issue next).
• After a read is issued, the process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor (as it must have, if its value is being returned), then the processor should wait until the write has performed with respect to all processors.
Slide 37: Processor Consistency
• Before a read is allowed to perform with respect to any other processor, all previous reads must be performed (R→R).
• Before a write is allowed to perform with respect to any other processor, all previous accesses (reads and writes) must be performed (R,W→W).
Slide 38: Example
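A classic illustration of the W→R relaxation is the store-buffer litmus test, sketched here in C11 (my construction, not necessarily the example from the slide):

    #include <stdatomic.h>

    atomic_int x = 0, y = 0;

    /* Under sequential consistency, r1 == 0 and r2 == 0 cannot both
       happen. Once W->R reordering is allowed (total store ordering /
       processor consistency), both threads may read 0. */
    void thread1(int *r1) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* W x */
        *r1 = atomic_load_explicit(&y, memory_order_relaxed); /* R y */
    }

    void thread2(int *r2) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);   /* W y */
        *r2 = atomic_load_explicit(&x, memory_order_relaxed); /* R x */
    }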
Slide 39: Relaxed Consistency Models
Release consistency:
• Eager release consistency
• Lazy release consistency
• Entry consistency
Slide 40: Weak Consistency
Accesses are divided into ordinary shared accesses and synchronization accesses. Conditions for weak consistency (illustrated in the sketch below):
• Before an ordinary read/write access is allowed to perform with respect to any other processor, all previous synchronization accesses must be performed.
• Before a synchronization access is allowed to perform with respect to any other processor, all previous ordinary read/write accesses must be performed.
• Synchronization accesses are sequentially consistent.
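These conditions can be approximated in C11, where a sequentially consistent atomic plays the role of the synchronization access (a sketch under that assumption; the variable names are mine):

    #include <stdatomic.h>

    int a, b;                  /* ordinary shared accesses */
    atomic_int sync_var = 0;   /* synchronization access */

    void producer(void) {
        a = 1;                 /* ordinary writes: may complete out of order */
        b = 2;
        /* all previous ordinary accesses perform before the sync access */
        atomic_store_explicit(&sync_var, 1, memory_order_seq_cst);
    }

    int consumer(void) {
        /* sync accesses themselves are sequentially consistent */
        while (atomic_load_explicit(&sync_var, memory_order_seq_cst) == 0)
            ;
        return a + b;          /* ordinary reads perform after the sync */
    }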
Slide 41: Example
Slide 42: Release Consistency
A problem with weak consistency is that, when a synchronization variable is accessed, the data store does not know whether the access happens because the process has finished writing shared data or is about to start reading it. Release consistency provides this knowledge by differentiating between entering and leaving a critical region, using two operations:
• acquire, which tells the data store that a critical region is about to be entered;
• release, which tells the data store that a critical region has just been exited.
Slide 43: Release Consistency
Categorization of shared memory accesses
Slide 44: Release Consistency: Properly-Labeled Programs
Slide 45: Release Consistency
The rules of release consistency are as follows (a C11 sketch follows):
• Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully.
• Before a release is allowed to be performed, all previous reads and writes done by the process must have been completed.
• The acquire and release accesses must be processor consistent with respect to one another.
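A minimal C11 sketch of acquire and release around a critical region (my construction; the lock variable stands in for the synchronization variable):

    #include <stdatomic.h>

    int shared_data;           /* data written inside the critical region */
    atomic_int lock = 0;

    void critical_update(int v) {
        /* acquire: a critical region is about to be entered */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
            ;
        shared_data = v;       /* ordinary accesses inside the region */
        /* release: all previous reads/writes complete before it performs */
        atomic_store_explicit(&lock, 0, memory_order_release);
    }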
Slide 46: Example
Slide 47: Release Consistency
There are two forms of release consistency:
• Eager release consistency, in which all updates are propagated immediately when a process executes a release.
• Lazy release consistency, in which updates are not propagated when a release is performed; instead, they are fetched later, when another process performs an acquire.
Slide 48: Release Consistency
Slide 49: Eager versus Lazy
Slide 50: Entry Consistency
Formally, a memory exhibits entry consistency if it meets all the following conditions (sketched below):
• An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
• Before an exclusive-mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in non-exclusive mode.
• After an exclusive-mode access to a synchronization variable has been performed, any other process's next non-exclusive-mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
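One way to picture entry consistency is to pair each shared datum with its own synchronization variable, as in this hedged C11 sketch (the struct and names are mine, not from the slides):

    #include <stdatomic.h>

    /* Each guarded datum has its own synchronization variable, so an
       acquire makes only that datum consistent, not all shared memory. */
    struct guarded_int {
        atomic_int sync;       /* the synchronization variable, 0 = free */
        int        value;      /* the guarded shared data */
    };

    void update(struct guarded_int *g, int v) {
        /* exclusive-mode access: no other process may hold g->sync */
        while (atomic_exchange_explicit(&g->sync, 1, memory_order_acquire) == 1)
            ;
        g->value = v;          /* only this datum is made consistent */
        atomic_store_explicit(&g->sync, 0, memory_order_release);
    }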
Slide 51: Entry Consistency: Example
Slide 53: Definition and Characteristics of Superscalar
• Superscalar processing is the ability to initiate multiple instructions during the same clock cycle.
• A typical superscalar processor fetches and decodes the incoming instruction stream several instructions at a time.
• Superscalar architectures exploit the potential of ILP (instruction-level parallelism).
Slide 54: Definition and Characteristics of Superscalar
Slide 55: Uninterrupted Stream of Instructions
• The outcomes of conditional branch instructions are usually predicted in advance to ensure an uninterrupted stream of instructions.
• Instructions are initiated for execution in parallel based on the availability of operand data, rather than their original program sequence. This is referred to as dynamic instruction scheduling.
• Upon completion, instruction results are re-sequenced into the original program order.
Slide 56: Superscalar Execution Example
Slide 57: Complicated Example; Optimizing the Complicated Example
Slide 58: Superscalar Execution Example, with Register Renaming for WAR and WAW Dependencies
Slide 59: Comparison Between Pipelining & Superscalar
• Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock. Superscalar execution involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction within a single core.
• In pipelining, an instruction hands off to the next execution subunit once it is done decoding. In superscalar execution, multiple execution subunits are able to do the same thing in parallel.
• Pipelining sequences unrelated activities such that they use different components at the same time; superscalar execution provides multiple sub-components capable of doing the same task simultaneously, with the processor deciding how to do it.
Slide 60: Limitations of Superscalar
The performance improvement available from superscalar techniques is limited by two key factors:
• the degree of intrinsic parallelism in the instruction stream, i.e., the limited amount of instruction-level parallelism, and
• the complexity and time cost of the dispatcher and associated dependency-checking logic.
Slide 61: 5. Cache Coherence
Slide 64: Cache Coherence
• PEA adds 1 to x. x is in PEA's cache, so there is a cache hit.
• If PEB reads x again (perhaps after synchronising with PEA), it will also see a cache hit. However, it will read a stale value of x.
Slide 65: Cache coherence hardware
• This problem is avoided by adding snooping hardware to the system interface. This hardware monitors the bus for transactions which affect locations cached in this processor.
• The cache also needs to generate invalidate transactions when it writes to shared locations.
Slide 66: Cache coherence hardware
• When PEA writes to x, its cache generates an invalidate transaction.
• When PEB's snooping hardware sees the invalidate-x transaction, it finds a copy of x in its cache and marks it invalid.
• PEB's next read of x will cause a cache miss and initiate a data bus transaction to read x from main memory.
Slide 67: Cache coherence hardware
• When PEA's snooping hardware sees the memory read for x, it detects the modified copy of x in its own cache.
• It issues a retry response, suspending PEB's transaction.
Slide 68: Cache coherence hardware
• PEA now writes (flushes) the modified cache line to main memory.
• PEB continues its suspended transaction and reads the correct value from main memory.
Slide 69: 6. Directory-Based Protocol
Slide 70: Scalable Approach: Directories
• Every memory block has associated directory information, tracking which nodes hold copies of the block.
• On a miss, the protocol communicates only with the nodes that have copies, and only if necessary.
• In scalable networks, communication with the directory and the copies is through network transactions.
• There are many alternatives for organizing directory information.
Slide 71: Basic Operation of Directory
• k processors.
• With each cache block in memory: k presence bits, 1 dirty bit.
• With each cache block in cache: 1 valid bit and 1 dirty (owner) bit.
• Read from main memory by processor i:
o If the dirty bit is OFF, then { read from main memory; turn p[i] ON; }
o If the dirty bit is ON, then { recall the line from the dirty processor (set its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to main memory by processor i:
o If the dirty bit is OFF, then { supply data to i; send invalidations to all caches that have the block; turn the dirty bit ON; turn p[i] ON; }
A C sketch of this algorithm follows.
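The slide's algorithm can be transcribed into C as follows (a sketch; the data types are mine, and the actual data movement is elided to comments):

    #include <stdbool.h>

    #define K 8                      /* k processors (assumed value) */

    struct dir_entry {               /* directory info per memory block */
        bool presence[K];            /* k presence bits */
        bool dirty;                  /* 1 dirty bit */
    };

    /* Read of a block by processor i. */
    void dir_read(struct dir_entry *d, int i) {
        if (!d->dirty) {
            d->presence[i] = true;   /* read from memory; turn p[i] ON */
        } else {
            /* recall the line from the dirty processor (its copy becomes
               shared), update memory, then supply the recalled data to i */
            d->dirty = false;
            d->presence[i] = true;
        }
    }

    /* Write of a block by processor i (dirty bit OFF case). */
    void dir_write(struct dir_entry *d, int i) {
        for (int j = 0; j < K; j++)  /* invalidate every other copy */
            d->presence[j] = false;
        d->presence[i] = true;       /* i becomes the sole (dirty) owner */
        d->dirty = true;
    }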
Slide 72: Directory Protocol
Three states:
• Shared: ≥ 1 processors have the data; memory is up-to-date.
• Uncached: no processor has it; not valid in any cache.
• Exclusive: 1 processor (the owner) has the data; memory is out-of-date.
In addition to the cache state, the protocol must track which processors have the data when it is in the shared state (usually a bit vector: 1 if the processor has a copy).
Keep it simple(r):
• Writes to non-exclusive data ⇒ write miss.
• The processor blocks until the access completes.
• Assume messages are received and acted upon in the order sent.