Advanced Computer Architecture - Lecture 36: Multiprocessors. This lecture will cover the following: cache coherence problem; example of invalidation scheme; coherence in distributed memory architecture; performance of cache coherence schemes; implementation complications; snooping cache contention; directory-based protocol; distributed shared memory; ...
Slide 1: CS 704
Advanced Computer Architecture
Lecture 36
Multiprocessors
(Cache Coherence Problem … Cont’d )
Prof Dr M Ashraf Chughtai
Slide 2: Today's Topics
Recap:
Example of Invalidation Scheme
Coherence in Distributed Memory Architecture
Performance of Cache Coherence Schemes
Summary
Slide 3: Recap: Cache Coherence Problem
Last time we discussed caches for multiprocessing in the symmetric shared-memory architecture, wherein each processor has the same relationship to the single memory.
We distinguished between private data and shared data, i.e., the data used by a single processor versus the data replicated in the caches of multiple processors for their simultaneous use.
Slide 4: Recap: Cache Coherence Problem
We then introduced the cache coherence problem in symmetric shared memory, i.e., the conflict in caching of shared data being read by multiple processors simultaneously, and explained it with the help of a typical shared-memory architecture where each processor has its own cache.
Slide 5: Recap: Cache Coherency Problem
In write-back caches, the value written back to memory depends on which cache flushes or writes back the value, and when.
We noticed that the cache coherency problem exists even on uniprocessors, due to the interaction between caches and I/O devices.
However, in multiprocessors the problem is performance-critical, where the order among multiple processes is crucial:
Slide 6: Recap: Order among multiple processes
For a single shared memory with no caches, a serial or total order is imposed on operations to a location; and for a single shared memory with caches, the serial order must be consistent, i.e., all processors must see writes to the location in the same order.
Considering this, we can say that a memory system is coherent if:
Slide 7: Recap: Order among multiple processes
– the operations issued by any particular process occur in the order issued by that process, and
– the value returned by a read is the value written by the last write to that location in the serial order.
Then we talked about write propagation and write serialization as the two features of a coherent system.
Slide 8: Recap: Multiprocessor Cache Coherence
We also noticed that, to implement cache coherence, multiprocessors extend both the bus transactions and the cache state transitions.
The cache controller snoops on bus events (write transactions) and invalidates or updates its cache.
Then we discussed the cache coherence protocols, which use different techniques to track the sharing status and maintain coherence.
Slide 9: Recap: Coherency Solutions
The two fundamental classes of coherence protocols are:
– Snooping Protocols
All cache controllers monitor, or snoop (spy) on, the bus to determine whether or not they have a copy of the block that is requested on the bus.
– Directory-Based Protocols
The sharing status of a block of physical memory is kept in one location, called the directory.
Slide 10: Recap: Basic Snooping Protocols
The snooping protocols are implemented using two techniques: write invalidate and write broadcast.
The write invalidate method ensures that a processor has exclusive access to a data item before it writes that item: all other cached copies are invalidated (canceled) on the write.
The write broadcast approach, on the other hand, updates all the cached copies of a data item whenever that item is written.
Slide 11: Recap: Write Invalidate versus Broadcast
We noticed that:
– Invalidate requires only one bus transaction for multiple writes to the same word, and it exploits spatial locality, i.e., one transaction covers writes to different words in the same block; and
– Broadcast has lower latency between a write and the subsequent read (a toy transaction count follows below).
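To make the recap concrete, here is a toy count of bus transactions for the case where one processor writes the same word several times before another processor reads it. It is an illustration only: the counting rules are simplified assumptions (one transaction per invalidation miss, one per broadcast update), not figures from the lecture.

```python
# Toy comparison of bus traffic for repeated writes to one word,
# followed by a single read from another processor.

def bus_transactions(writes_before_read, protocol):
    if protocol == "invalidate":
        # First write invalidates the other copies (1 transaction);
        # later writes hit locally; the other processor's eventual
        # read then misses once (1 more transaction).
        return 1 + 1
    elif protocol == "update":
        # Every write is broadcast on the bus; the later read hits,
        # which is why broadcast has lower write-to-read latency.
        return writes_before_read
    raise ValueError(protocol)

for n in (1, 4, 16):
    print(n, "writes:",
          "invalidate =", bus_transactions(n, "invalidate"),
          "update =", bus_transactions(n, "update"))
```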
Then we discussed the finite state machine
controller implementing the snooping protocols
Slide 12: Recap: An Example Snooping Protocol
This controller responds to requests from the processor and from the bus based on:
– the type of the request;
– its hit or miss status in the cache; and
– the state of the cache block specified in the request.
Furthermore, each block of memory is in one of three states, Shared, Exclusive, or Invalid (not in any cache), and each cache block is likewise in one of these three states.
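As a concrete picture of what the controller keeps for each cache block, here is a minimal Python sketch; the class and field names are invented for illustration and are not part of the lecture.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    INVALID = "Invalid"      # block not (validly) present in this cache
    SHARED = "Shared"        # clean copy; other caches may hold it too
    EXCLUSIVE = "Exclusive"  # dirty copy; this cache holds the only valid one

@dataclass
class CacheLine:
    state: State = State.INVALID
    tag: Optional[str] = None    # which memory block (e.g. "A1") is cached
    value: Optional[int] = None
```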
Slide 13: Example: Working of Finite State Machine Controller
Today we will continue our discussion of the finite state machine controller for the implementation of the snooping protocol, and will try to understand its working with the help of an example.
Here, we assume that two processors, P1 and P2, each having its own cache, share the main memory connected to a bus.
Slide 14: Example: Working of Finite State Machine Controller
The status of the processors, the bus transaction, and the memory is depicted in a table for each step of the state machine. For each step of operation, the table shows the state of the machine for each processor, the cache address and value cached, the bus action, and the shared-memory status.
Initially the cache state is Invalid (i.e., the block of memory is not in the cache); and …
Slide 15: Example: Working of Finite State Machine Controller
Memory blocks A1 and A2 map to the same cache block, where address A1 is not equal to A2.
At Step 1 – P1 writes 10 to A1:
A write miss on the bus occurs, and the state transition from Invalid to Exclusive takes place.
Slides 16-17: Example: Working of Finite State Machine Controller (step tables shown as figures)
Slide 18: Example: Working of Finite State Machine Controller
At Step 3 – P2 reads A1:
i) As P2 is initially in the Invalid state, a read miss on the bus occurs; its controller state changes from Invalid to Shared.
Slide 19: Example: Working of Finite State Machine Controller (step table shown as a figure)
Slide 20: Example: Working of Finite State Machine Controller
ii) P1, being in the Exclusive state, sees the remote read; write-back is asserted and its state changes from Exclusive to Shared.
iii) The value (10) is read from the shared memory at address A1 into both P1's and P2's caches at A1, and both the P1 and P2 controllers are then in the Shared state.
Slide 21: Example: Working of Finite State Machine Controller
At Step 4 – P2 writes 20 to A1:
i) P1 sees a remote write, so the state of its controller changes from Shared to Invalid.
ii) P2 sees a CPU write, so it places a write miss on the bus, changes its state from Shared to Exclusive, and writes the value 20 to A1 in its cache.
iii) The memory at address A1 is not updated by this write and still holds the value 10 written back in Step 3.
Slides 22-24: Example: Working of Finite State Machine Controller (remaining step tables shown as figures)
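The whole example can be replayed with a small simulation. The sketch below is a simplified write-invalidate model using the three states above; the Cache and Bus classes and their method names are invented for illustration, each cache holds a single block (so A1 and A2 conflict), and only Steps 1, 3, and 4 described in the text are replayed.

```python
from enum import Enum

class St(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"

class Cache:
    """One-block write-back cache with a snooping, write-invalidate controller."""
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.state, self.tag, self.value = St.INVALID, None, None
        bus.caches.append(self)

    # ---- processor side ---------------------------------------------------
    def cpu_read(self, addr):
        if self.state is St.INVALID or self.tag != addr:        # read miss
            self._writeback_if_dirty()
            self.value = self.bus.read_miss(self, addr)
            self.tag, self.state = addr, St.SHARED
        return self.value

    def cpu_write(self, addr, value):
        if self.state is not St.EXCLUSIVE or self.tag != addr:  # write miss
            self._writeback_if_dirty()
            self.bus.write_miss(self, addr)
            self.tag, self.state = addr, St.EXCLUSIVE
        self.value = value                                       # write hit

    # ---- bus (snooping) side ------------------------------------------------
    def snoop_read(self, addr):
        if self.tag == addr and self.state is St.EXCLUSIVE:
            self.bus.memory[addr] = self.value     # write back, then share
            self.state = St.SHARED

    def snoop_write(self, addr):
        if self.tag == addr and self.state is not St.INVALID:
            if self.state is St.EXCLUSIVE:
                self.bus.memory[addr] = self.value # write back dirty data
            self.state = St.INVALID                # invalidate our copy

    def _writeback_if_dirty(self):
        if self.state is St.EXCLUSIVE:
            self.bus.memory[self.tag] = self.value

class Bus:
    def __init__(self):
        self.caches, self.memory = [], {}

    def read_miss(self, requester, addr):
        for c in self.caches:
            if c is not requester:
                c.snoop_read(addr)                 # the owner writes back first
        return self.memory.get(addr, 0)

    def write_miss(self, requester, addr):
        for c in self.caches:
            if c is not requester:
                c.snoop_write(addr)                # other copies are invalidated

bus = Bus()
p1, p2 = Cache("P1", bus), Cache("P2", bus)

p1.cpu_write("A1", 10)    # Step 1: P1 goes Invalid -> Exclusive, value 10
p2.cpu_read("A1")         # Step 3: P1 writes back and goes Shared; P2 Shared
p2.cpu_write("A1", 20)    # Step 4: P1 invalidated; P2 Exclusive, value 20

for c in (p1, p2):
    print(c.name, c.state.value, c.tag, c.value)
print("memory:", bus.memory)
```

Running the script leaves P1 Invalid and P2 Exclusive with the value 20, while memory still holds the 10 written back in Step 3, matching the step tables in the slides.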
Slide 25: Implementation Complications
With this example, we have observed that the finite state machine implementation of the snooping protocols works well. However, the following implementation complications have been observed.
Slide 26: Implementation Complications
Write races occur because a processor cannot update its cache until it gets the bus; otherwise, another processor may get the bus first and then write the same cache block!
We know that a bus transaction is a two-step process:
– Arbitrate for the bus
– Place the miss on the bus and complete the operation
If a miss occurs to the block while waiting for the bus, the controller must handle it (an invalidate may be needed) and then restart.
Slide 27: Implementation Complications
Furthermore, the bus may be a split-transaction bus, so that it can have multiple outstanding transactions for a block.
Multiple misses can then interleave, allowing two caches to grab the block in the Exclusive state, so the controller must track and prevent multiple misses for one block; a sketch of the two-step bus process follows below.
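The two-step bus transaction can be outlined as follows. This is only a sketch: the Bus helpers (acquire, place_write_miss, release) are invented stand-ins for the arbitration hardware, and the point is simply where the re-check after arbitration belongs.

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2

@dataclass
class Line:
    state: State = State.SHARED
    value: int = 0

class Bus:
    """Toy stand-in for bus arbitration; real hardware serializes requesters."""
    def acquire(self): pass                    # step 1: arbitrate for the bus
    def release(self): pass
    def place_write_miss(self, addr):          # step 2: place miss, complete op
        print("write miss on bus for", addr)

def cpu_write(bus, line, addr, value):
    # The cache is NOT updated before the bus is obtained.
    bus.acquire()
    # Re-check after the wait: another processor may have gotten the bus
    # first and written (hence invalidated) the same cache block.
    if line.state is not State.EXCLUSIVE:
        bus.place_write_miss(addr)
        line.state = State.EXCLUSIVE
    line.value = value
    bus.release()

cpu_write(Bus(), Line(), "A1", 10)
```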
Slide 28: Snooping Cache Conflict
In the snooping cache method, the CPU accesses the cache while every bus transaction checks the cache tags.
Processors continuously snoop on the address bus, and if the address matches a tag, the controller either invalidates or updates the block.
Since every bus transaction checks the cache tags, there could be interference with the CPU's own cache accesses.
Slide 29: Snooping Cache Contention
There are two ways to reduce this interference:
1: A duplicate set of tags for the L1 cache
– the CPU uses a different set of tags from the snoop logic
– the CPU gets stalled during a cache access only when the snoop has detected a copy in the cache and the tags need to be updated
Slide 30: Snooping Cache Contention
2: Multi-level caches with inclusion:
the L2 cache already provides the duplicate set of tags, provided L2 obeys inclusion with the L1 cache; here
– the contents of the primary cache (L1) are also in the secondary cache (L2)
– most CPU activity is directed to L1
– snoop activity is directed to L2
Slide 31: Snooping Cache Contention
– If the snoop gets a hit in L2, it arbitrates for L1 to update the state and possibly fetch the data; this will stall the CPU
– This can be combined with the "duplicate tags" approach to further reduce contention (a sketch of the filtering idea follows below)
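A rough sketch of the filtering idea, assuming a duplicated set of L1 tags and an inclusive L2 (all names are illustrative): a snoop first consults the duplicate tags, and only when they report a possible copy does it disturb the CPU's access to L1.

```python
class SnoopFilter:
    """Duplicate set of L1 tags consulted by bus snoops, so that most bus
    transactions never touch the tags the CPU is currently using."""
    def __init__(self):
        self.l1_tags = set()     # duplicate copy of the L1 tag array
        self.l2_tags = set()     # inclusive L2: superset of L1 contents

    def cpu_fill(self, tag):
        self.l1_tags.add(tag)
        self.l2_tags.add(tag)    # inclusion: everything in L1 is also in L2

    def snoop(self, tag):
        if tag not in self.l2_tags:      # most snoops are filtered here,
            return "ignore"              # without stalling the CPU
        if tag in self.l1_tags:
            return "stall CPU, update L1 and L2"
        return "update L2 only"

f = SnoopFilter()
f.cpu_fill(0x40)
print(f.snoop(0x80))   # -> ignore
print(f.snoop(0x40))   # -> stall CPU, update L1 and L2
```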
Slide 32: Snooping Cache Variations (figure)
Slide 33: Snooping Cache Variations
Slide 34: Four-State Machine
The bus serializes writes: getting the bus ensures that no one else can perform a memory operation.
On a miss in a write-back cache, another cache may have the desired copy and it may be dirty, so it must reply.
Add an extra state bit to each cache block to record whether it is shared or not; this adds a fourth state, Modified, used for exclusive (dirty) writes.
Slide 35: Snooping Cache Variations: Berkeley Protocol
The main idea is to allow cache-to-cache transfers on the shared bus. It adds the notion of an "owner":
– the cache that has the block in a Dirty state is the owner of that block; the last one to write the block is the owner
– the owner is responsible for transferring the data if a read occurs and for updating main memory; if a block is not owned by any cache, memory is the owner
Slide 36: Snooping Cache Variations: Berkeley Protocol (Summary)
The Berkeley protocol uses the states Owned Exclusive, Owned Shared, Shared, and Invalid.
Slide 37: Snooping Cache Variations (cont'd)
• If a read is sourced from memory, the block becomes Private Clean
• If a read is sourced from another cache, the block becomes Shared
• A write can be performed in the cache if the block is held Private Clean or dirty
Slide 38: Snoop Cache Extensions
(State-transition diagram with extension points labeled A, B, and C. The states shown are Invalid, Shared, and Exclusive; the transition labels include CPU read hit, CPU write hit, CPU Write / Place Write Miss on Bus?, Remote Read / Place Data on Bus?, Remote Read / Write back block, and Remote Write or Miss due to address conflict / Write back block.)
Slide 39: Snoop Cache Extensions
Extensions:
A: Berkeley Protocol
– Fourth state: Ownership
– Shared -> Modified requires an invalidate only (an upgrade request); it does not read memory
B: MESI Protocol
– Clean exclusive state (no miss for private data on write)
C: Illinois Protocol
– Cache supplies data when in the shared state (no memory access)
An informal summary of these extra states follows below.
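As an informal summary, and not an exact encoding of any one protocol, the block states touched by these extensions can be listed as follows; the names are descriptive rather than the official ones.

```python
from enum import Enum, auto

class BlockState(Enum):
    INVALID = auto()          # basic 3-state protocol
    SHARED = auto()           # clean copy, possibly also in other caches
    MODIFIED = auto()         # dirty, only valid copy; its cache is the owner
    OWNED_SHARED = auto()     # A (Berkeley): dirty but shared; the owner
                              # supplies data and updates memory
    EXCLUSIVE_CLEAN = auto()  # B (MESI): clean and private, so a later write
                              # needs no bus transaction
    # C (Illinois) adds behaviour rather than a state: a cache holding the
    # block in SHARED supplies the data on a remote read miss.
```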
Slide 40: Larger Multiprocessors
Use a separate memory per processor; local or remote access is made via the memory controller.
Slide 41: Larger Multiprocessors
Keeping the sharing information per memory block rather than per cache block has plus and minus points:
– PLUS: in memory => a simpler protocol compared to the centralized/one-location approach
– MINUS: in memory => the directory size is a function of memory size, rather than of cache size as in the simpler protocol
Slide 42: Directory-Based Protocol: Distributed Shared Memory
Slide 43: Directory-Based Protocol
The directory-based protocol is similar to the snoopy protocol. The three states of the protocol are:
– Shared: one or more processors have the data cached, and the memory value is up to date
– Uncached: no processor has a copy of the block
– Exclusive: exactly one processor (the owner) has the data cached, and the memory value is out of date
Slide 44: Directory-Based Protocol
In addition to the cache state, the directory must track which processors have the data when the block is in the Shared state (usually with a bit vector: bit p is 1 if processor p has a copy); a sketch follows below.
Keep it simple(r):
– Writes to non-exclusive data => write miss
– The processor blocks until the access completes
– Assume messages are received and acted upon in the order they were sent
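A minimal sketch of the per-block directory entry this implies, with illustrative field names: the directory state plus a bit vector holding one bit per processor.

```python
from dataclasses import dataclass
from enum import Enum

class DirState(Enum):
    UNCACHED = 0     # no cache has a copy; memory holds the current value
    SHARED = 1       # one or more caches hold a clean copy
    EXCLUSIVE = 2    # exactly one cache (the owner) holds the copy

@dataclass
class DirectoryEntry:
    state: DirState = DirState.UNCACHED
    sharers: int = 0          # bit vector: bit p set => processor p has a copy

    def add_sharer(self, p: int) -> None:
        self.sharers |= 1 << p

    def has_copy(self, p: int) -> bool:
        return bool(self.sharers & (1 << p))

entry = DirectoryEntry(state=DirState.SHARED)
entry.add_sharer(3)
print(entry.has_copy(3), bin(entry.sharers))   # -> True 0b1000
```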
Slide 45: Directory Protocol … Cont'd
There is no bus and we do not want to broadcast:
– the interconnect is no longer a single arbitration point
– all messages have explicit responses
Typically three processors (nodes) are involved:
– the local node, where a request originates
– the home node, where the memory location of an address resides
– a remote node, which has a copy of the cache block, whether exclusive or shared
Slide 46: Directory Protocol … Cont'd
Example messages are listed below; here P is used for the processor number and A for the address.
Message type | Source | Destination | Msg content
Read miss | Local cache | Home directory | P, A
  Processor P reads data at address A; make P a read sharer and arrange to send the data back.
Write miss | Local cache | Home directory | P, A
  Processor P writes data at address A; make P the exclusive owner and arrange to send the data back.
Slide 47: Directory Protocol Messages
Invalidate | Home directory | Remote cache | A
  Invalidate a shared copy of the data at address A.
Fetch | Home directory | Remote cache | A
  Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to Shared.
Fetch/invalidate | Home directory | Remote cache | A
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Slide 48: Directory Protocol Messages
Data value reply | Home directory | Local cache | Data
  Return a data value from the home memory (read miss response).
Data write-back | Remote cache | Home directory | A, Data
  Write back a data value for address A (invalidate response).
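The message set can also be written down as a small enum for reference; the entries follow the table above, with comments noting source, destination, and contents (the identifiers are illustrative).

```python
from enum import Enum, auto

class Msg(Enum):
    READ_MISS = auto()         # local cache -> home directory  (P, A)
    WRITE_MISS = auto()        # local cache -> home directory  (P, A)
    INVALIDATE = auto()        # home directory -> remote cache (A)
    FETCH = auto()             # home directory -> remote cache (A)
    FETCH_INVALIDATE = auto()  # home directory -> remote cache (A)
    DATA_VALUE_REPLY = auto()  # home directory -> local cache  (data); read miss response
    DATA_WRITE_BACK = auto()   # remote cache -> home directory (A, data); invalidate response
```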
Slide 49: State Transition Diagram for an Individual Cache Block in a Directory-Based System
The states are identical to the snoopy case, and the transactions are very similar.
Transactions are caused by read misses, write misses, invalidates, and data fetch requests.
Slide 50: State Transition Diagram for an Individual Cache Block in a Directory-Based System
Write misses that were broadcast on the bus for snooping result in explicit invalidate and data fetch requests.
Note: on a write miss, the cache block is bigger than the word being written, so the full cache block still needs to be read.
Slide 51: CPU-Cache State Machine
(State diagram for a cache block in a directory-based system, with the states Invalid, Shared (read only), and Exclusive (read/write). A CPU read in the Invalid state sends a Read Miss message to the home directory; the CPU read hit and CPU write hit transitions are self-loops.)
Slide 52: State Transition Diagram for the Directory
Here the same states and a similar structure are shown as in the transition diagram for an individual cache block.
Two actions are performed:
1: update the directory state, and
2: send messages to satisfy requests.
The controller tracks all copies of each memory block and also indicates any action that updates the sharing set, called Sharers, as well as sending a message.
Slide 53: Directory State Machine
(State diagram for the directory, with the states Uncached, Shared (read only), and Exclusive (read/write); the transition labels include Write back block, and Write Miss: Sharers = {P}; send Data Value Reply message.)
Slide 54: Example Directory Protocol
A message sent to the directory causes two actions:
– update the directory, and
– send further messages to satisfy the request.
If the block is in the Uncached state, the copy in memory is the current value; the only possible requests for that block are:
– Read miss
– Write miss
Slide 55: Example Directory Protocol
– Read miss: the requesting processor is sent the data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
– Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached; Sharers indicates the identity of the owner.
Slide 56: Example Directory Protocol
If the block is in the Shared state, the memory value is up to date; the read miss and write miss activities are:
– Read miss: the requesting processor is sent the data from memory, and the requesting processor is added to the sharing set.
– Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, Sharers is set to the identity of the requesting processor, and the state of the block is made Exclusive.
A sketch of these directory actions follows below.
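The directory-side actions for the Uncached and Shared cases described above can be condensed into a short handler. This is an illustrative sketch, not the lecture's controller: the Entry class, the send callback, and the message strings are invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum

class DirState(Enum):
    UNCACHED = "Uncached"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"

@dataclass
class Entry:
    state: DirState = DirState.UNCACHED
    sharers: set = field(default_factory=set)   # processors holding a copy

def handle_request(entry, msg, p, memory, addr, send):
    """Home-directory handler for a block in the Uncached or Shared state."""
    if entry.state is DirState.UNCACHED:
        send(p, f"data value reply: {memory[addr]}")   # memory is current
        if msg == "read miss":
            entry.sharers = {p}                # requester is the only sharer
            entry.state = DirState.SHARED
        elif msg == "write miss":
            entry.sharers = {p}                # Sharers records the owner
            entry.state = DirState.EXCLUSIVE   # only valid copy is now cached

    elif entry.state is DirState.SHARED:       # memory value is up to date
        send(p, f"data value reply: {memory[addr]}")
        if msg == "read miss":
            entry.sharers.add(p)               # add requester to sharing set
        elif msg == "write miss":
            for q in entry.sharers - {p}:
                send(q, f"invalidate {addr}")  # invalidate every other copy
            entry.sharers = {p}
            entry.state = DirState.EXCLUSIVE

entry, memory = Entry(), {"A": 10}
handle_request(entry, "read miss", 1, memory, "A", lambda d, m: print(d, m))
handle_request(entry, "write miss", 2, memory, "A", lambda d, m: print(d, m))
print(entry.state.name, entry.sharers)   # -> EXCLUSIVE {2}
```

In the usage at the bottom, a read miss from processor 1 moves the block from Uncached to Shared, and a later write miss from processor 2 invalidates processor 1's copy and leaves the block Exclusive with Sharers = {2}.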