Slide 1: Multiprocessor
Group 8:
Nguyễn phúc Ánh – 13070221
Lê Minh Nam – 13070249
Nguyễn Hữu Hiếu – 12073119
Mai Văn Tinh – 13070270
Lê Thanh Phương – 13070254
Lý Đoàn Duy Khánh – 13070238
Slide 3: 1. Introduction to Multiprocessor Systems
Slide 4: Introduction to Multiprocessor Systems
A multiprocessor is a tightly coupled computer system with two or more processing units (multiple processors) that share main memory and peripherals, allowing the complete system to process programs simultaneously.
Slide 5: Introduction to Multiprocessor Systems
Why do we need multiprocessors?
• The need to improve system performance: uniprocessor speed keeps improving, but the improvement is limited.
• Growth in data-intensive applications: databases, file servers.
• Improved understanding of how to use multiprocessors effectively.
=> Solution: improve performance by connecting multiple microprocessors together.
Slide 6: Introduction to Multiprocessor Systems
Slide 7: Flynn's Taxonomy
Flynn's taxonomy classifies parallel machines into four categories based on two questions:
• How many instruction streams?
• How many data streams?
Each stream has two possible states, single or multiple, giving four categories:
• SISD
• SIMD
• MISD
• MIMD
Slide 8: Flynn's Taxonomy
SISD: Single Instruction stream, Single Data stream
• A uniprocessor.
• Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
• Single data: only one data stream is used as input during any one clock cycle.
• Instructions are executed sequentially.
• Examples: IBM 701, IBM 1620, IBM 7090.
Slide 9: Flynn's Taxonomy
SIMD: Single Instruction stream, Multiple Data streams
• The same instruction is executed by multiple processing units, each operating on its own data stream.
Slide 10: Flynn's Taxonomy
MISD: Multiple Instruction streams, Single Data stream
• Rarely used; intended for special-purpose computations, e.g., multiple cryptography algorithms attempting to crack a single coded message.
Slide 11: Flynn's Taxonomy
MIMD: Multiple Instruction streams, Multiple Data streams
• Each processor executes its own instructions and operates on its own data.
• Includes multi-core processors.
• Used for general-purpose parallel computers.
• Examples: IBM 370/168 MP, Univac 1100/80.
Slide 12: 2. Synchronization
Slide 13: Typical use of a lock
    while (!acquire(lock))
        ;  /* spin */
    /* computation on shared data (critical section) */
    release(lock);
Acquire is built on a read-modify-write primitive. The basic principle is atomic exchange; common primitives, sketched below, are:
• Test-and-set
• Fetch-and-increment
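As a concrete illustration, here is a minimal C11 sketch of the two primitives using <stdatomic.h> (my construction; the function names are illustrative, not from the slides):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* clear => free */
    atomic_int  counter   = 0;

    /* Test-and-set: atomically set the flag and report its old value. */
    bool test_and_set(void) {
        return atomic_flag_test_and_set(&lock_flag); /* true => was set */
    }

    /* Fetch-and-increment: atomically add 1 and return the old value. */
    int fetch_and_increment(void) {
        return atomic_fetch_add(&counter, 1);
    }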
Slide 14: Issues for Synchronization
• An uninterruptable instruction to fetch and update memory (atomic operation).
• User-level synchronization operations built using this primitive.
• For large-scale multiprocessors, synchronization can become a bottleneck; techniques are needed to reduce the contention and latency of synchronization.
Slide 15: Uninterruptable Instruction to Fetch and Update Memory
Atomic exchange: interchange a value in a register for a value in memory.
• 0 => synchronization variable is free
• 1 => synchronization variable is locked and unavailable
o Set the register to 1 and swap.
o The new value in the register determines success in getting the lock:
  0 if you succeeded in setting the lock (you were first)
  1 if another processor had already claimed access
o The key is that the exchange operation is indivisible.
o Release the lock simply by writing a 0.
o Note that every execution requires both a read and a write.
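A minimal C11 sketch of this exchange-based lock (my construction, not code from the slides):

    #include <stdatomic.h>

    atomic_int lock = 0;     /* 0 => free, 1 => locked and unavailable */

    void acquire(atomic_int *l) {
        /* set a value of 1 and swap; reading back 0 means we were first */
        while (atomic_exchange(l, 1) == 1)
            ;                /* another processor had claimed access */
    }

    void release(atomic_int *l) {
        atomic_store(l, 0);  /* release the lock simply by writing a 0 */
    }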
Slide 16: Uninterruptable Instruction to Fetch and Update Memory
• Test-and-set: tests a value and sets it if the value passes the test.
• Fetch-and-increment: returns the value of a memory location and atomically increments it.
o 0 => synchronization variable is free
Slide 17: Load linked & store conditional
• It is hard to have a read and a write in one instruction (as needed for atomic exchange and other read-modify-write operations).
o It makes coherence more difficult, since the hardware cannot allow any operations between the read and the write, and yet must not deadlock.
• So, use 2 instructions instead: load linked (or load locked) + store conditional.
Slide 18: Load linked & store conditional
• LL r,x loads the value of x into register r and saves the address x into a link register.
• SC r,x stores r into address x only if it is the first store after LL r,x. Success is reported by returning a value (r=1); otherwise the store fails and r=0 is returned.
Slide 19: Load linked & store conditional
Example doing atomic exchange with LL & SC:
    try:  MOV  R3,R4     ; move exchange value
          LL   R2,0(R1)  ; load linked
          SC   R3,0(R1)  ; store conditional
          BEQZ R3,try    ; branch if store fails (R3 = 0)
          MOV  R4,R2     ; put load value in R4
Slide 20: Load linked & store conditional
Example doing fetch & increment with LL & SC:
    try:  LL     R2,0(R1)  ; load linked
          DADDUI R3,R2,#1  ; increment
          SC     R3,0(R1)  ; store conditional
          BEQZ   R3,try    ; branch if store fails
Slide 21: Spin Locks
The processor continuously tries to acquire the lock, spinning around a loop until it succeeds:
            li   R2,#1
    lockit: exch R2,0(R1)  ; atomic exchange
            bnez R2,lockit ; already locked?
Slide 22: Barrier Synchronization
• All processes have to wait at a synchronization point, e.g., the end of parallel do loops.
• Processes don't progress until they all reach the barrier.
• Phase (i+1) does not begin until every process completes phase i.
Slide 23: Barrier Synchronization
Low-performance implementation (sketched in C below): use a counter initialized with the number of processes. Each arriving process decrements the counter (atomic fetch-and-add(-1)) and busy-waits; the last process to arrive signals that all may progress (broadcast).
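A minimal C11 sketch of this counter barrier (my construction; single-use only, so a reusable version would need sense reversal):

    #include <stdatomic.h>

    #define NPROCS 8                  /* assumed number of processes */

    atomic_int count = NPROCS;        /* counter initialized with #processes */
    atomic_int go    = 0;             /* progress flag (the broadcast) */

    void barrier(void) {
        if (atomic_fetch_add(&count, -1) == 1) {
            atomic_store(&go, 1);     /* last arrival: broadcast progress */
        } else {
            while (atomic_load(&go) == 0)
                ;                     /* busy-wait for the broadcast */
        }
    }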
Slide 24: Synchronization Mechanisms for Larger-Scale Multiprocessors
• There is a race condition for acquiring a lock that has just been released: all waiting processors suffer read and write misses, giving O(n²) bus transactions for n contending processes.
• Potential improvements (a backoff sketch follows this list):
o Exponential backoff
o Queuing locks (software or hardware)
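A sketch of exponential backoff in C11 (illustrative; the delay constants are assumptions of mine):

    #include <stdatomic.h>

    void acquire_backoff(atomic_int *lock) {
        int delay = 1;
        while (atomic_exchange(lock, 1) == 1) {
            /* failed: stay off the bus for a while before retrying */
            for (volatile int i = 0; i < delay; i++)
                ;
            if (delay < 1024)
                delay *= 2;          /* double the wait after each failure */
        }
    }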
Slide 26: Queuing Locks
• Basic idea: a queue of waiting processors is maintained in shared memory for each lock (best for bus-based machines).
o Each processor performs an atomic operation to obtain a memory location (an element of an array) on which to spin.
o Upon a release, the lock can be handed off directly to the next waiting processor (see the sketch below).
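A software array-based queuing lock can be sketched in C11 as follows (my construction; MAXPROCS and the modular slot arithmetic are assumptions):

    #include <stdatomic.h>

    #define MAXPROCS 64

    atomic_int slots[MAXPROCS] = { 1 };  /* slot 0 starts as the grant */
    atomic_int next_slot = 0;

    int acquire_queued(void) {
        /* atomic operation to obtain a memory location on which to spin */
        int my = atomic_fetch_add(&next_slot, 1) % MAXPROCS;
        while (atomic_load(&slots[my]) == 0)
            ;                            /* spin only on our own element */
        atomic_store(&slots[my], 0);     /* consume the grant */
        return my;
    }

    void release_queued(int my) {
        /* hand the lock directly to the next waiting processor */
        atomic_store(&slots[(my + 1) % MAXPROCS], 1);
    }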
Slide 28: Relaxed Consistency Models
• Eager release consistency
• Lazy release consistency
• Entry consistency
Slide 29: What?
Slide 30: Memory consistency
• A memory consistency model specifies the order in which reads and writes to different memory locations appear to execute.
• It is a contract between the hardware, the compiler, and the programmer:
o Hardware and the compiler will not violate the ordering specified.
o The programmer will not assume a stricter order than that of the model.
• The model provides mechanisms so the user can enforce a stricter order than that provided by the model.
Slide 31: Relaxed Consistency Models
The key idea in relaxed consistency models is to allow reads and writes to complete out of order, but to use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent.
There is a variety of relaxed models, classified according to which read and write orderings they relax.
Slide 32: Relaxed Consistency Models
• Relaxing the W→R ordering yields a model known as total store ordering or processor consistency.
• Relaxing the W→W ordering yields a model known as partial store order.
• Relaxing the R→W and R→R orderings yields a variety of models, including weak ordering, the PowerPC consistency model, and release consistency, depending on the details of the ordering restrictions and how synchronization operations enforce ordering.
Slide 33: Sequential Consistency
Sequential consistency (Lamport): "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program."
Slide 34: Sequential Consistency
• Sequential consistency says the machine behaves as if the processors take turns in an arbitrary order.
• The program behaves as if the threads take turns executing instructions (not necessarily in any fair order), i.e., only one thread executes an instruction at a given time.
Slide 35: Sequential Consistency
Slide 36: Sufficient Conditions for Sequential Consistency
• Every process issues memory operations in program order.
• After a write is issued, the process waits for the write to complete before issuing its next operation (write completes → issue next).
• After a read is issued, the process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor (as it must have, if its value is being returned), then the processor should wait until the write has performed with respect to all processors.
Slide 37: Processor Consistency
• Before a read is allowed to perform with respect to any other processor, all previous reads must be performed (R→R).
• Before a write is allowed to perform with respect to any other processor, all previous accesses (reads and writes) must be performed (R,W→W).
Slide 38: Example
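A classic illustration of the W→R relaxation is the store-buffer litmus test, sketched here in C11 (my construction, not necessarily the example from the slide):

    #include <stdatomic.h>

    atomic_int x = 0, y = 0;

    /* Under sequential consistency, r1 == 0 and r2 == 0 cannot both
       happen. Once W->R reordering is allowed (total store ordering /
       processor consistency), both threads may read 0. */
    void thread1(int *r1) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* W x */
        *r1 = atomic_load_explicit(&y, memory_order_relaxed); /* R y */
    }

    void thread2(int *r2) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);   /* W y */
        *r2 = atomic_load_explicit(&x, memory_order_relaxed); /* R x */
    }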
Slide 39: Relaxed Consistency Models
Release consistency:
• Eager release consistency
• Lazy release consistency
• Entry consistency
Slide 40: Weak Consistency
Accesses are divided into ordinary shared accesses and synchronization accesses. Conditions for weak consistency (illustrated in the sketch below):
• Before an ordinary read/write access is allowed to perform with respect to any other processor, all previous synchronization accesses must be performed.
• Before a synchronization access is allowed to perform with respect to any other processor, all previous ordinary read/write accesses must be performed.
• Synchronization accesses are sequentially consistent.
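These conditions can be approximated in C11, where a sequentially consistent atomic plays the role of the synchronization access (a sketch under that assumption; the variable names are mine):

    #include <stdatomic.h>

    int a, b;                  /* ordinary shared accesses */
    atomic_int sync_var = 0;   /* synchronization access */

    void producer(void) {
        a = 1;                 /* ordinary writes: may complete out of order */
        b = 2;
        /* all previous ordinary accesses perform before the sync access */
        atomic_store_explicit(&sync_var, 1, memory_order_seq_cst);
    }

    int consumer(void) {
        /* sync accesses themselves are sequentially consistent */
        while (atomic_load_explicit(&sync_var, memory_order_seq_cst) == 0)
            ;
        return a + b;          /* ordinary reads perform after the sync */
    }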
Slide 41: Example
Slide 42: Release Consistency
A problem with weak consistency is that, when a synchronization variable is accessed, the data store does not know whether the access happens because the process has finished writing shared data or is about to start reading it. Release consistency provides this knowledge by differentiating between entering and leaving a critical region, using two operations:
• acquire, which tells the data store that a critical region is about to be entered;
• release, which tells the data store that a critical region has just been exited.
Slide 43: Release Consistency
Categorization of shared memory accesses
Slide 44: Release Consistency: Properly-Labeled Programs
Slide 45: Release Consistency
The rules of release consistency are as follows (a C11 sketch follows):
• Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully.
• Before a release is allowed to be performed, all previous reads and writes done by the process must have been completed.
• The acquire and release accesses must be processor consistent with respect to one another.
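A minimal C11 sketch of acquire and release around a critical region (my construction; the lock variable stands in for the synchronization variable):

    #include <stdatomic.h>

    int shared_data;           /* data written inside the critical region */
    atomic_int lock = 0;

    void critical_update(int v) {
        /* acquire: a critical region is about to be entered */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
            ;
        shared_data = v;       /* ordinary accesses inside the region */
        /* release: all previous reads/writes complete before it performs */
        atomic_store_explicit(&lock, 0, memory_order_release);
    }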
Slide 46: Example
Slide 47: Release Consistency
There are two forms of release consistency:
• Eager release consistency, in which all updates are propagated immediately when a process executes a release.
• Lazy release consistency, in which updates are not propagated when a release is performed; instead, they are fetched later, when another process performs an acquire.
Slide 48: Release Consistency
Slide 49: Eager versus Lazy
Slide 50: Entry Consistency
Formally, a memory exhibits entry consistency if it meets all the following conditions (sketched below):
• An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
• Before an exclusive-mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in non-exclusive mode.
• After an exclusive-mode access to a synchronization variable has been performed, any other process's next non-exclusive-mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
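One way to picture entry consistency is to pair each shared datum with its own synchronization variable, as in this hedged C11 sketch (the struct and names are mine, not from the slides):

    #include <stdatomic.h>

    /* Each guarded datum has its own synchronization variable, so an
       acquire makes only that datum consistent, not all shared memory. */
    struct guarded_int {
        atomic_int sync;       /* the synchronization variable, 0 = free */
        int        value;      /* the guarded shared data */
    };

    void update(struct guarded_int *g, int v) {
        /* exclusive-mode access: no other process may hold g->sync */
        while (atomic_exchange_explicit(&g->sync, 1, memory_order_acquire) == 1)
            ;
        g->value = v;          /* only this datum is made consistent */
        atomic_store_explicit(&g->sync, 0, memory_order_release);
    }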
Slide 51: Entry Consistency: Example
Slide 53: Definition and Characteristics of Superscalar
• Superscalar processing is the ability to initiate multiple instructions during the same clock cycle.
• A typical superscalar processor fetches and decodes the incoming instruction stream several instructions at a time.
• Superscalar architectures exploit the potential of ILP (instruction-level parallelism).
Slide 54: Definition and Characteristics of Superscalar
Slide 55: Uninterrupted Stream of Instructions
• The outcomes of conditional branch instructions are usually predicted in advance to ensure an uninterrupted stream of instructions.
• Instructions are initiated for execution in parallel based on the availability of operand data, rather than their original program sequence. This is referred to as dynamic instruction scheduling.
• Upon completion, instruction results are re-sequenced into the original program order.
Slide 56: Superscalar Execution Example
Slide 57: Complicated Example; Optimizing the Complicated Example
Slide 58: Superscalar Execution Example, with Register Renaming for WAR and WAW Dependencies
Slide 59: Comparison Between Pipelining & Superscalar
• Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock. Superscalar execution involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction within a single core.
• In pipelining, an instruction hands off to the next execution subunit once it is done decoding. In superscalar execution, multiple execution subunits are able to do the same thing in parallel.
• Pipelining sequences unrelated activities such that they use different components at the same time; superscalar execution provides multiple sub-components capable of doing the same task simultaneously, with the processor deciding how to do it.
Slide 60: Limitations of Superscalar
The performance improvement available from superscalar techniques is limited by two key factors:
• the degree of intrinsic parallelism in the instruction stream, i.e., the limited amount of instruction-level parallelism, and
• the complexity and time cost of the dispatcher and associated dependency-checking logic.
Slide 61: 5. Cache Coherence
Slide 64: Cache Coherence
• PEA adds 1 to x. x is in PEA's cache, so there is a cache hit.
• If PEB reads x again (perhaps after synchronising with PEA), it will also see a cache hit. However, it will read a stale value of x.
Slide 65: Cache coherence hardware
• This problem is avoided by adding snooping hardware to the system interface. This hardware monitors the bus for transactions which affect locations cached in this processor.
• The cache also needs to generate invalidate transactions when it writes to shared locations.
Slide 66: Cache coherence hardware
• When PEA writes to x, its cache generates an invalidate transaction.
• When PEB's snooping hardware sees the invalidate-x transaction, it finds a copy of x in its cache and marks it invalid.
• PEB's next read of x will cause a cache miss and initiate a data bus transaction to read x from main memory.
Slide 67: Cache coherence hardware
• When PEA's snooping hardware sees the memory read for x, it detects the modified copy of x in its own cache.
• It issues a retry response, suspending PEB's transaction.
Slide 68: Cache coherence hardware
• PEA now writes (flushes) the modified cache line to main memory.
• PEB continues its suspended transaction and reads the correct value from main memory.
Slide 69: 6. Directory-Based Protocol
Slide 70: Scalable Approach: Directories
• Every memory block has associated directory information, tracking which nodes hold copies of the block.
• On a miss, the protocol communicates only with the nodes that have copies, and only if necessary.
• In scalable networks, communication with the directory and the copies is through network transactions.
• There are many alternatives for organizing directory information.
Slide 71: Basic Operation of Directory
• k processors.
• With each cache block in memory: k presence bits, 1 dirty bit.
• With each cache block in cache: 1 valid bit and 1 dirty (owner) bit.
• Read from main memory by processor i:
o If the dirty bit is OFF, then { read from main memory; turn p[i] ON; }
o If the dirty bit is ON, then { recall the line from the dirty processor (set its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to main memory by processor i:
o If the dirty bit is OFF, then { supply data to i; send invalidations to all caches that have the block; turn the dirty bit ON; turn p[i] ON; }
A C sketch of this algorithm follows.
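The slide's algorithm can be transcribed into C as follows (a sketch; the data types are mine, and the actual data movement is elided to comments):

    #include <stdbool.h>

    #define K 8                      /* k processors (assumed value) */

    struct dir_entry {               /* directory info per memory block */
        bool presence[K];            /* k presence bits */
        bool dirty;                  /* 1 dirty bit */
    };

    /* Read of a block by processor i. */
    void dir_read(struct dir_entry *d, int i) {
        if (!d->dirty) {
            d->presence[i] = true;   /* read from memory; turn p[i] ON */
        } else {
            /* recall the line from the dirty processor (its copy becomes
               shared), update memory, then supply the recalled data to i */
            d->dirty = false;
            d->presence[i] = true;
        }
    }

    /* Write of a block by processor i (dirty bit OFF case). */
    void dir_write(struct dir_entry *d, int i) {
        for (int j = 0; j < K; j++)  /* invalidate every other copy */
            d->presence[j] = false;
        d->presence[i] = true;       /* i becomes the sole (dirty) owner */
        d->dirty = true;
    }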
Slide 72: Directory Protocol
Three states:
• Shared: ≥ 1 processors have the data; memory is up-to-date.
• Uncached: no processor has it; not valid in any cache.
• Exclusive: 1 processor (the owner) has the data; memory is out-of-date.
In addition to the cache state, the protocol must track which processors have the data when it is in the shared state (usually a bit vector: 1 if the processor has a copy).
Keep it simple(r):
• Writes to non-exclusive data ⇒ write miss.
• The processor blocks until the access completes.
• Assume messages are received and acted upon in the order sent.