Slide 1: Group 7
1. Đỗ Luật Khoa
2. Lương Quang Tùng
3. Trần Thanh Phương
4. Phan Thanh Duy
5. Đặng Thanh Hùng
6. Thái Tiểu Minh
7. Nguyễn Thị Thanh Xuân
Slide 2: Multiprocessor systems
Slide 3: What is a multiprocessor system?
A multiprocessor is a tightly coupled computer system having two or more processing units (multiple processors), each sharing main memory and peripherals, in order to simultaneously process programs (according to Wikipedia).
The goals are high performance and reduced energy consumption.
Slide 4: Flynn Classification
Flynn [1966] proposed a simple model of categorizing all computers that is still useful today. He looked at the parallelism in the instruction and data streams.
(Michael J. Flynn: born May 20, 1934, New York City.)
Slide 5: What category can a machine fall into in the Flynn Classification?
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:
Slide 6:
- Single instruction stream, single data stream (SISD): this category is the uniprocessor.
- Single instruction stream, multiple data streams (SIMD): the same instruction is executed by multiple processors using different data streams.
- Multiple instruction streams, single data stream (MISD): no commercial multiprocessor of this type has been built to date.
- Multiple instruction streams, multiple data streams (MIMD): each processor fetches its own instructions and operates on its own data.
Slide 7: Because an MIMD can exploit thread-level parallelism, it is the architecture of choice for general-purpose multiprocessors.
MIMDs offer flexibility, and they can build on the cost-performance advantages of off-the-shelf processors.
Slide 8: MIMD Model
Each processor executes its own instruction stream. In many cases, each processor executes a different process. A process is a segment of code that may be run independently; the process state contains all the information necessary to execute that program on a processor.
Multiple processors may also execute a single program, sharing the code and most of their address space.
To use an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute. These software threads consist of hundreds to millions of instructions that may be executed in parallel.
Slide 9: Centralized shared-memory architectures
These had at most a few dozen processor chips (and fewer than 100 cores) in 2006, and they suit multiprocessors with small processor counts. By using multiple point-to-point connections, or a switch, and adding additional memory banks, the design can scale somewhat.
Because there is a single main memory that has a symmetric relationship to all processors and a uniform access time from any processor, these machines are called symmetric (shared-memory) multiprocessors (SMPs) or uniform memory access (UMA) multiprocessors.
Slide 10: This type of symmetric shared-memory architecture is currently by far the most popular organization.
Slide 11: MIMD Model
To support larger processor counts, memory must be distributed among the processors rather than centralized. Distributing the memory among the nodes has two major benefits:
First, it is a cost-effective way to scale the memory bandwidth.
Second, it reduces the latency for accesses to local memory.
Slide 12: The second group consists of multiprocessors with physically distributed memory.
Slide 13: MIMD Model
The key disadvantages of a distributed-memory architecture are that communicating data between processors becomes somewhat more complex, and that more effort in software is required to take advantage of the increased memory bandwidth afforded by distributed memories.
Slide 14: Memory consistency models
Slide 15: What is a memory consistency model?
The memory consistency model of a shared-memory multiprocessor specifies how memory behaves with respect to read and write operations from multiple processors. In particular, it specifies when a value written by one processor can be read by another processor. The model affects:
- System implementation: hardware, OS, languages, compilers
- Programming correctness
- Performance
Slide 16: Sequential Consistency
Lamport's definition: a multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
There are two aspects to sequential consistency:
(1) maintaining program order among operations from individual processors;
(2) maintaining a single sequential order among operations from all processors.
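For tiny programs, aspects (1) and (2) can be checked mechanically by enumerating every interleaving that preserves each processor's program order. A minimal sketch (the Dekker-style two-processor programs are our own illustrative choice, not from the slides):

```python
from itertools import combinations

# Litmus test (illustrative):  P1: X = 1; u = Y      P2: Y = 1; v = X
P1 = [("write", "X", 1), ("read", "Y", "u")]
P2 = [("write", "Y", 1), ("read", "X", "v")]

def sc_outcomes(t0, t1):
    """Every (u, v) outcome reachable under sequential consistency:
    operations from the two processors interleave into one sequential
    order, but each processor keeps its own program order."""
    outcomes = set()
    n = len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):  # positions of t0's ops
        i0 = i1 = 0
        mem, regs = {"X": 0, "Y": 0}, {}
        for i in range(n):
            if i in slots:
                kind, var, x = t0[i0]; i0 += 1
            else:
                kind, var, x = t1[i1]; i1 += 1
            if kind == "write":
                mem[var] = x
            else:
                regs[x] = mem[var]
        outcomes.add((regs["u"], regs["v"]))
    return outcomes

print(sorted(sc_outcomes(P1, P2)))  # (0, 0) never appears under SC
```

Because each read follows its own processor's write in program order, the outcome (u,v)=(0,0) is unreachable; that is exactly the guarantee the relaxed models below give up.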
Slide 17: Sequential Consistency
Conceptual representation of sequential consistency, with several processors sharing a common logical memory.
Slide 18: Sample programs to illustrate sequential consistency
Slide 19: Relaxed Consistency Models: The Basics
The basic idea behind relaxed memory models is to enable more optimizations by eliminating some of the constraints that sequential consistency places on the overlap and reordering of memory operations.
Sequential consistency requires the illusion of program order and atomicity to be maintained for all operations; relaxed models typically allow certain memory operations to execute out of program order or non-atomically.
Several relaxed memory consistency models:
- Relaxing the Write to Read Program Order (total store ordering or processor consistency).
- Relaxing the Write to Write Program Order (partial store ordering).
- Relaxing the Read to Read and Read to Write Program Orders (weak ordering, the PowerPC consistency model, and release consistency).
Slide 20: Relaxing the Write to Read Program Order
SPARC V8 Total Store Ordering (TSO) [SFC91, SUN91]
Slide 21: Relaxing the Write to Read Program Order
(Figure: example programs with the outcomes (u,v,w,x)=(1,1,0,0) and (u,v,w,x)=(1,2,0,0).)
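A store buffer is the hardware mechanism behind the write-to-read relaxation: a processor's write sits in a FIFO buffer, visible to its own later reads but invisible to other processors, before draining to memory. A minimal toy model (our own sketch; the class and method names are invented for illustration):

```python
class TSOCore:
    """Toy core with a FIFO store buffer. This models only the
    write-to-read relaxation of TSO, nothing else."""

    def __init__(self, mem):
        self.mem, self.buf = mem, []

    def write(self, var, val):
        self.buf.append((var, val))       # buffered, not yet in memory

    def read(self, var):
        for v, val in reversed(self.buf): # a core sees its own stores...
            if v == var:
                return val
        return self.mem[var]              # ...otherwise reads memory

    def drain(self):
        for v, val in self.buf:           # buffer eventually flushes
            self.mem[v] = val
        self.buf.clear()

mem = {"X": 0, "Y": 0}
p1, p2 = TSOCore(mem), TSOCore(mem)
p1.write("X", 1)   # buffered: invisible to p2
p2.write("Y", 1)   # buffered: invisible to p1
u = p1.read("Y")   # reads memory: 0
v = p2.read("X")   # reads memory: 0
p1.drain(); p2.drain()
print((u, v))      # (0, 0): forbidden under SC, allowed under TSO
```

The reads effectively complete before the earlier writes of their own processors, producing the Dekker-test outcome that sequential consistency rules out.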
Slide 22: Relaxing the Write to Read Program Order
Slide 23: Conditions for Processor Consistency (PC)
(Figure: (c) the outcome (u,v,w)=(1,1,0); (d) the outcome (u,v,w,x)=(1,0,1,0).)
Memory sub-operations must execute in a sequential order that satisfies the following conditions:
(a) sub-operations appear in this sequence in the order specified by the program order requirement of the figure,
(b) the order among sub-operations satisfies the coherence requirement, and
(c) a read sub-operation R(i) returns the value of either the last write sub-operation W(i) to the same location that appears before the read in this sequence, or the last write operation to the location that is before the read in program order, whichever occurs later in the execution sequence.
Slide 24: Relaxing the Write to Write Program Order
Slide 25: Partial store ordering (PSO) model example
Because PSO allows writes to different locations to complete out of program order, it also allows the outcomes (u,v)=(0,0), (0,1), or (1,0).
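The effect of relaxing write-to-write order can be seen by brute force on a message-passing litmus test (the program below, with a data variable and a flag, is our own illustrative choice; the slide's example program is not shown):

```python
from itertools import combinations

def outcomes(t0, t1):
    """All (u, v) results over interleavings of two straight-line threads,
    each thread's operations kept in the order given."""
    res, n = set(), len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):
        i0 = i1 = 0
        mem, regs = {"data": 0, "flag": 0}, {}
        for i in range(n):
            if i in slots:
                kind, var, x = t0[i0]; i0 += 1
            else:
                kind, var, x = t1[i1]; i1 += 1
            if kind == "write":
                mem[var] = x
            else:
                regs[x] = mem[var]
        res.add((regs["u"], regs["v"]))
    return res

writer = [("write", "data", 1), ("write", "flag", 1)]
reader = [("read", "flag", "u"), ("read", "data", "v")]

sc = outcomes(writer, reader)                 # program order respected
pso = sc | outcomes(writer[::-1], reader)     # PSO: the two writes may swap
print(sorted(sc), sorted(pso - sc))
```

Under program order, seeing flag=1 implies seeing data=1; once the writes may swap, the "message received but data stale" outcome (u,v)=(1,0) appears.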
Slide 26: Relaxing the Read to Read and Read to Write Program Order
By Dubois et al. [DSB86]
Slide 27: Weak ordering model
(Figure: the two outcomes (u,v)=(1,0) and (u,v)=(0,1).)
Slide 28: Relaxing the Read to Read and Read to Write Program Order
[GLL+90, GGH93b]
Slide 29: Release consistency
If P1 enters the critical section first, the result is (u,v)=(1,1); if P2 gains access to the critical section before P1, we can get the result (u,v)=(0,0).
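The critical-section scenario can be reconstructed with ordinary locks, whose acquire and release operations provide exactly the ordering that release consistency requires at synchronization points. The variable names A and B and the two thread bodies below are our own assumptions, not the slide's program:

```python
import threading

# P1 writes two locations inside a critical section; P2 reads both inside
# the same critical section.  Acquire/release ordering guarantees that P2
# sees either both writes or neither.
A = B = 0
u = v = None
lock = threading.Lock()

def p1():
    global A, B
    with lock:       # acquire
        A = 1
        B = 1        # release on block exit

def p2():
    global u, v
    with lock:
        u = B
        v = A

t1 = threading.Thread(target=p1)
t2 = threading.Thread(target=p2)
t1.start(); t2.start()
t1.join(); t2.join()
print((u, v))        # either (1, 1) or (0, 0), never a torn (1, 0)
```

Which of the two outcomes occurs depends only on which thread wins the lock, matching the slide's two cases.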
Slide 30: Comparing the WO and RC models
Slide 31
Slide 32:
Thread: A thread includes the program counter, the register state, and the stack. It is a lightweight process; whereas threads commonly share a single address space, processes don't.
Process: A process includes one or more threads, the address space, and the operating system state. Hence, a process switch usually invokes the operating system, but a thread switch does not.
Superscalar: An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
Slide 33
Slide 34: Multithreading on a superscalar
- A superscalar with no multithreading: the use of issue slots is limited by a lack of ILP. A major stall, such as an instruction cache miss, can leave the entire processor idle.
- A superscalar with coarse-grained multithreading: the long stalls are partially hidden by switching to another thread that uses the resources of the processor. ILP limitations still lead to idle cycles within each clock cycle. Furthermore, since thread switching only occurs when there is a stall, and the new thread has a start-up period, there are likely to be some fully idle cycles remaining.
- A superscalar with fine-grained multithreading: the interleaving of threads eliminates fully empty slots. Because only one thread issues instructions in a given clock cycle, however, ILP limitations still lead to idle slots within some clock cycles.
- A superscalar with simultaneous multithreading: TLP and ILP are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle.
Slide 35: Figure: The speedup from using multithreading on one core of an i7 processor averages 1.31 for the PARSEC benchmarks, and the energy efficiency improvement is 1.07.
This data was collected and analyzed by Esmaeilzadeh et al. [2011].
Slide 36: Cache for Multiprocessors
Slide 37: Cache for Multiprocessors
Slides 38-40: (Figure sequence: several processors' caches each hold a copy of A = 3; one processor then writes A = 7, leaving the other cached copies stale.)
Slide 41: Cache coherency solutions
Slide 42: Software-based vs. hardware-based
Software-based:
- Compiler-based, or with run-time system support
- With or without hardware assist
- A tough problem, because perfect information is needed in the presence of memory aliasing and explicit parallelism
We focus on hardware-based solutions, as they are more common.
Slide 43: Hardware Solutions
The schemes can be classified based on:
- Shared caches vs. snoopy schemes vs. directory schemes
- Write-through vs. write-back (ownership-based) protocols
- Update vs. invalidation protocols
- Dirty-sharing vs. no-dirty-sharing protocols
Slide 44: Snoopy Cache Coherence Schemes
A distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism. This is the most commonly used method in commercial multiprocessors.
Slide 45: Write-Through Schemes
All processor writes result in an update of the local cache and a global bus write.
Advantage: simple to implement.
Disadvantage: since ~15% of references are writes, this scheme consumes tremendous bus bandwidth, so only a few processors can be supported.
⇒ Need for dual-tagging caches in some cases
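A back-of-envelope sketch of that bandwidth argument (every number below except the ~15% write fraction is an illustrative assumption, not from the slides):

```python
# If ~15% of references are writes and every write goes on the shared
# bus, bus bandwidth caps the processor count.
refs_per_sec_per_cpu = 200e6   # hypothetical reference rate per processor
write_fraction = 0.15          # from the slide: ~15% of references
bytes_per_bus_write = 8        # hypothetical: address + data word
bus_bandwidth = 1e9            # hypothetical: 1 GB/s shared bus

traffic_per_cpu = refs_per_sec_per_cpu * write_fraction * bytes_per_bus_write
max_cpus = int(bus_bandwidth // traffic_per_cpu)
print(max_cpus)                # a handful of processors saturate the bus
```

With these (made-up but plausible) figures, four processors already consume the whole bus, which is why write-back/ownership schemes follow on the next slide.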
Slide 46: Write-Back/Ownership Schemes
When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth. Most bus-based multiprocessors nowadays use such schemes.
Many variants of ownership-based protocols exist:
- Goodman's write-once scheme
- Berkeley ownership scheme
- Firefly update protocol
Slide 47: Invalidation vs. Update Strategies
Invalidation: on a write, all other cached copies are invalidated. Works well when there are multiple writes by one PE before the data is read by another PE. Usually the default.
Update: on a write, all other cached copies are updated. Works well with a single producer and many consumers of data, but junk data accumulates in large caches (e.g. on process migration).
Slide 48: Directory-Based Cache Coherence
Key idea: keep track, in a global directory (in main memory), of which processors are caching a location and its state.
Slide 49: Directory-Based Protocol
Slide 50: Directory Protocol
Every memory block has associated directory information, which keeps track of copies of cached blocks and their states. On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, and only if necessary. In scalable networks, communication with the directory and the copies is through network transactions.
Slide 51: Directory Protocol on Shared Memory
- k processors.
- With each cache block in memory: k presence bits, 1 dirty bit.
- With each cache block in cache: 1 valid bit and 1 dirty (owner) bit.
- Read from main memory by processor i:
  - If dirty bit OFF, then { read from main memory; turn p[i] ON; }
  - If dirty bit ON, then { recall line from dirty processor (set its cache state to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i; }
- Write to main memory by processor i:
  - If dirty bit OFF, then { supply data to i; send invalidations to all caches that have the block; turn dirty bit ON; turn p[i] ON; }
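The presence-bit/dirty-bit rules above can be sketched directly as a toy model. The class and method names are our own; data movement is modeled only as state changes, and since the dirty-bit-ON write case is cut off on the slide, invalidating the old owner is our assumed completion:

```python
class DirectoryEntry:
    """Directory state for one memory block: k presence bits plus one
    dirty bit, following the slide's read/write rules."""

    def __init__(self, k):
        self.p = [False] * k   # p[i]: processor i has a cached copy
        self.dirty = False

    def read_by(self, i):
        if self.dirty:
            # Recall the line from the dirty processor (its copy becomes
            # shared), update memory, and turn the dirty bit OFF.
            self.dirty = False
        self.p[i] = True       # turn p[i] ON

    def write_by(self, i):
        # Supply data to i, invalidate every other cached copy,
        # and turn the dirty bit ON.
        self.p = [j == i for j in range(len(self.p))]
        self.dirty = True

entry = DirectoryEntry(4)
entry.read_by(0); entry.read_by(2)
print(entry.p, entry.dirty)    # [True, False, True, False] False
entry.write_by(1)              # invalidates caches 0 and 2; 1 becomes owner
print(entry.p, entry.dirty)    # [False, True, False, False] True
```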
Slide 52: Directory Protocol on Shared Memory
Slide 53: Directory Protocol
Three states:
- Shared: one or more processors have the block cached, and the value in memory is up to date.
- Uncached: no processor has a copy.
- Exclusive: exactly one processor (the owner) has a copy, and the value in memory is out-of-date.
In addition to the cache state, we must track which processors have data when in the shared state (usually a bit vector: 1 if the processor has a copy).
Keep it simple(r):
- Writes to non-exclusive data ⇒ write miss
- The processor blocks until the access completes
- Assume messages are received and acted upon in the order sent
Slide 54: Directory Protocol
No bus, and we don't want to broadcast:
- The interconnect is no longer a single arbitration point
- All messages have explicit responses
Terms (typically three processors are involved): the local node, where the request originates; the home node, where the memory location of an address resides; and the remote node, which has a copy of the block, whether exclusive or shared.
Example messages on the next slide: P = processor number, A = address.
Slide 55: Directory Protocol Messages
- Read miss: processor P reads data at address A; make P a read sharer and request data.
- Write miss: processor P has a write miss at address A; make P the exclusive owner and request data.
- Invalidate: invalidate a shared copy at address A.
- Fetch: fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared.
- Fetch/invalidate: fetch the block at address A and send it to its home directory; invalidate the block in the cache.
- Data value reply: return a data value from the home memory (read miss response).
- Data write back: write back a data value for address A (invalidate response).
Slide 56: State-Transition Diagram for One Block in a Directory-Based System
- The states are identical to the snoopy case; the transactions are very similar.
- Transitions are caused by read misses, write misses, invalidates, and data fetch requests.
- The cache generates read-miss and write-miss messages to the home directory.
- Write misses that were broadcast on the bus for snooping become explicit invalidate and data fetch requests.
- Note: on a write, the cache block is bigger than the written word, so we need to read the full cache block.
Slide 57: CPU-Cache State Machine
(State diagram: state machine for CPU requests, for each memory block. Labeled transitions include:)
- Invalid → Shared (read-only): CPU read; send Read Miss message to home directory.
- Invalid → Exclusive: CPU write; send Write Miss message to home directory.
- Shared: CPU read hit. Shared → Exclusive: CPU write; send Write Miss message to home directory.
- Exclusive: CPU read hit, CPU write hit.
- Exclusive → Shared: Fetch; send Data Write Back message to home directory.
- Exclusive, CPU read miss: send Data Write Back message and Read Miss to home directory.
- Exclusive, CPU write miss: send Data Write Back message and Write Miss to home directory.
Slide 58: Transition Diagram for the Directory
- Same states and structure as the transition diagram for an individual cache.
- Two actions: (1) update the directory state; (2) send messages to satisfy requests.
- The directory tracks all copies of each memory block.
- It also indicates an action that updates the sharing set, Sharers, as well as sending a message.
Slide 59: Directory State Machine
(State diagram: state machine for directory requests, for each memory block. E.g. a Write Miss moves the block to Exclusive (read/write), sets Sharers = {P}, and sends a Data Value Reply message; an Exclusive block is written back first.)
Slide 60: Example Directory Protocol
A message sent to the directory causes two actions: update the directory, and send more messages to satisfy the request.
If the block is Uncached, the copy in memory is the current value; the only possible requests for that block are:
- Read miss: the requesting processor is sent data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
- Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached; Sharers indicates the identity of the owner.
If the block is Shared, the memory value is up to date:
- Read miss: the requesting processor is sent data from memory and added to the sharing set.
- Write miss: the requesting processor is sent the value. All processors in Sharers are sent invalidate messages, Sharers is set to just the requesting processor, and the state of the block is made Exclusive.
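The Uncached/Shared/Exclusive rules above can be sketched as a small state machine (our own sketch; message sending is modeled as returned values rather than real network transactions, and names are invented for illustration):

```python
class Directory:
    """Per-block directory: three states plus the sharing set."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def read_miss(self, p):
        if self.state == "Exclusive":
            # Fetch the block from the owner, write it back to memory,
            # then let the owner keep a shared copy.
            self.state = "Shared"
        elif self.state == "Uncached":
            self.state = "Shared"
        self.sharers.add(p)        # reply with Data Value Reply to p
        return "DataValueReply"

    def write_miss(self, p):
        if self.state == "Shared":
            invalidated = self.sharers - {p}   # invalidate other sharers
        elif self.state == "Exclusive":
            invalidated = set(self.sharers)    # fetch/invalidate old owner
        else:
            invalidated = set()
        self.sharers = {p}         # p becomes the only (owning) copy
        self.state = "Exclusive"
        return invalidated         # processors that must be invalidated

d = Directory()
d.read_miss(0); d.read_miss(1)
print(d.state, sorted(d.sharers))      # Shared [0, 1]
print(sorted(d.write_miss(2)))         # [0, 1]  (invalidations sent)
print(d.state, sorted(d.sharers))      # Exclusive [2]
```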