Trang 1
Group 3:
• 13070223 – Võ Thanh Biết
• 13070229 – Lưu Nguyễn Hoàng Hạnh
• 13070232 – Nguyễn Duy Hoàng
• 13070243 – Trần Duy Linh
• 13070244 – Nguyễn Thị Thúy Loan
• 13070251 – Phạm Ích Trí Nhân
• 13070258 – Nguyễn Anh Quốc
• 13070269 – Lê Thị Minh Thùy
Trang 2
Contents
• Multiprocessor
• What is a multiprocessor system?
• What category can it be in the Flynn classification?
• Synchronization: some techniques: spin lock, barrier, advantages/disadvantages;
synchronization for large-scale multiprocessors
• Memory consistency: the relaxed consistency models
• Multithreading: how does multithreading improve the performance of a uniprocessor without superscalar? With superscalar?
• Cache coherence problem in multicore systems
• Why keeping caches coherent on a multiprocessor is needed
• Brief explanation of the directory-based protocol; where is it most applicable?
• Explanation of the snoopy-based protocol; where is it most applicable?
• Some popular protocols in modern processors
• What is the MESI protocol?
• Sample
Trang 3
• What is a multiprocessor system?
• Multiprocessing is a type of processing in which two or more processors work together to process more than one program simultaneously.
• Advantages of Multiprocessor Systems:
• Reduced Cost
• Increased Reliability
• Increased Throughput
Trang 4Flynn classification
• Based on notions of instruction and data streams
• SISD (Single Instruction stream over a Single Data stream)
• SIMD (Single Instruction stream over Multiple Data streams)
• MISD (Multiple Instruction streams over a Single Data stream)
• MIMD (Multiple Instruction streams over Multiple Data streams)
Trang 5Flynn classification
Trang 6• Why Synchronize?
• Need to know when it is safe for different processes running on different processors to use shared data
Trang 7
(Figure: processor P2 acquires Lock(L), loads the shared variable, modifies it, stores it back, then does Release(L).)
Trang 8• Hardware support for synchronization
• Atomic instruction to fetch and update memory (atomic operation),
e.g. fetch-and-increment: returns the value of a memory location and atomically increments it after the fetch is done
• Atomic Read and Write for Multiprocessors:
load-linked (LL) and store-conditional (SC)
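C does not expose LL/SC directly; a C11 compare-and-swap loop is the usual portable analog, and on LL/SC machines the compiler lowers it to an LL/SC pair. A sketch of the fetch-and-increment primitive mentioned above, built this way (function name is ours):

```c
#include <stdatomic.h>

/* Atomic fetch-and-increment via a compare-and-swap loop.  On LL/SC
 * machines the compiler lowers this loop to a load-linked /
 * store-conditional pair that retries until the SC succeeds. */
int fetch_and_increment(atomic_int *loc)
{
    int old = atomic_load(loc);
    /* Retry until no other processor modified *loc in between;
     * on failure, old is reloaded with the current value. */
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;
    return old;   /* value of the location before the increment */
}
```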
Trang 10• spin locks: locks that a processor continuously tries to acquire,
spinning around a loop until it succeeds
Trang 11• spin locks:
Trang 12
• Spin locks (using test-and-set):

void spin_lock (spinlock_t *s)
{
    while (test_and_set (s) != 0)
        while (*s != 0)
            ;   /* spin on plain reads until the lock looks free */
}
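The same test-and-test-and-set lock can be written runnably with C11 atomics, using atomic_exchange as the test-and-set primitive (a sketch; the type and function names are ours):

```c
#include <stdatomic.h>

typedef atomic_int spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *s)
{
    /* test-and-test-and-set: only the outer exchange generates a
     * bus read-modify-write; the inner loop spins on cached reads. */
    while (atomic_exchange(s, 1) != 0)
        while (atomic_load(s) != 0)
            ;   /* spin locally until the lock looks free */
}

void spin_unlock(spinlock_t *s)
{
    atomic_store(s, 0);
}
```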
Trang 13
• Synchronization for large-scale multiprocessors:
• For large-scale MPs, synchronization can be a bottleneck; we need techniques to reduce the contention and latency of synchronization.
• Problem:
• Example: 20 processors spin on a lock held by 1 processor; a bus transaction takes 50 cycles:
• Read miss by all waiting processors to fetch the lock (20 × 50): 1000
• Write miss by the releasing processor, invalidating all copies: 50
• Read miss by all waiting processors (20 × 50): 1000
• Write miss by all waiting processors: one successful lock (50) plus invalidation of all copies (19 × 50): 1000
• Total time for 1 processor to acquire & release the lock: 3050 cycles
• Each time one processor gets the lock, it drops out of the competition, so the average is about 1525 cycles; 20 × 1525 ≈ 30,500 cycles for 20 processors to pass through the lock.
• The problem is contention for the lock and serialization of lock access: once the lock is free, all processors compete to see who gets it.
• Solutions:
• spin lock with exponential back-off
• queuing lock
Trang 14• Barrier Synchronization
• A very common synchronization primitive
• Wait until all threads have reached a point in the program before any are allowed to proceed further
• Uses two shared variables
• A counter that counts how many have arrived
• A flag that is set when the last processor arrives
Usage pattern:

repeat:
    computation;
    barrier();
    communication;
    barrier();

Trang 15
A simple implementation with a counter and a release flag:

lock (counterlock);              /* protect the counter */
if (count == 0) release = 0;     /* first arriver resets release */
count = count + 1;
unlock (counterlock);
if (count == total) {            /* last one to arrive */
    count = 0;                   /* reset counter */
    release = 1;                 /* let the others go */
} else {                         /* wait for more to come */
    spin (release == 1);         /* wait for release to be 1 */
}
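A runnable version of this counter-and-flag barrier, sketched with POSIX threads and C11 atomics (the thread count and names are ours; production barriers add sense reversal so the barrier can be reused safely):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4

static pthread_mutex_t counterlock = PTHREAD_MUTEX_INITIALIZER;
static int count = 0;              /* how many threads have arrived */
static atomic_int release_flag;    /* set when the last one arrives */

static void barrier(void)
{
    int arrived;
    pthread_mutex_lock(&counterlock);
    if (count == 0)
        atomic_store(&release_flag, 0);   /* first arriver resets */
    count = count + 1;
    arrived = count;
    pthread_mutex_unlock(&counterlock);

    if (arrived == NTHREADS) {            /* last arriver */
        count = 0;                        /* reset the counter */
        atomic_store(&release_flag, 1);   /* release everyone */
    } else {
        while (atomic_load(&release_flag) == 0)
            ;                             /* spin until released */
    }
}

static atomic_int before_count, after_ok;

static void *worker(void *arg)
{
    (void)arg;
    atomic_fetch_add(&before_count, 1);
    barrier();
    /* every thread must see all NTHREADS arrivals after the barrier */
    if (atomic_load(&before_count) == NTHREADS)
        atomic_fetch_add(&after_ok, 1);
    return 0;
}
```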
Trang 16• Barrier with many processors
• Have to update counter one by one – takes a long time
• Solution: use a combining tree of barriers
• Example: using a binary tree
• Pair up processors, each pair has its own barrier
• E.g at level 1 processors 0 and 1 synchronize on one barrier, processors 2 and 3 on another, etc.
• At next level, pair up pairs
• Processors 0 and 2 increment a count at level 2, processors 1 and 3 just wait for it to be released
• At level 3, 0 and 4 increment counter, while 1, 2, 3, 5, 6, and 7 just spin until this level 3 barrier is released
• At the highest level all processes will spin and a few “representatives” will be counted.
• Works well because each level is fast and there are few levels
• Only 2 increments per level, log2(numProc) levels
• For large numProc, 2*log2(numProc) still reasonably small
Trang 17• Contention even with test-and-test-and-set
• Every write goes to many, many spinning procs
• Making everybody test less often reduces contention for high-contention locks but hurts for low-contention locks
• Solution: exponential back-off
• If we have waited for a long time, lock is probably high-contention
• Every time we check and fail, double the time between checks
• Fast low-contention locks (checks frequent at first)
• Scalable high-contention locks (checks infrequent in long waits)
• Hardware support
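A minimal sketch of such an exponential back-off lock in C11 (the pause loop and the cap value are our choices, not from the slides):

```c
#include <stdatomic.h>

typedef atomic_int backoff_lock_t;   /* 0 = free, 1 = held */

/* Spin lock with exponential back-off: after each failed attempt,
 * double the pause before retrying, up to a cap.  Low-contention
 * acquires still succeed on the first try; high-contention waiters
 * generate exponentially less bus traffic. */
void backoff_lock(backoff_lock_t *l)
{
    int pause = 1;                   /* initial back-off, in spins */
    const int cap = 1 << 10;         /* bound on the back-off */
    while (atomic_exchange(l, 1) != 0) {
        for (volatile int i = 0; i < pause; i++)
            ;                        /* idle instead of re-testing */
        if (pause < cap)
            pause *= 2;              /* exponential growth */
    }
}

void backoff_unlock(backoff_lock_t *l)
{
    atomic_store(l, 0);
}
```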
Trang 18Cache Coherence
In a shared-memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed.
==> Coherence ensures that reading a location returns the latest value written to that location.
Trang 19Cache Coherence
Trang 21Snooping protocol
• Used in systems with a shared bus between the processors and memory modules
• Relies on a common channel (or bus) connecting the processors to main memory
• This enables all cache controllers to observe (or snoop) the activities of all other processors and take appropriate actions to prevent a processor from obtaining stale data
Trang 22Cache Coherence
Trang 23
• Cache State Bits: describe the state of every cache line (invalid, valid, shared, dirty…)
• Bus Monitor: monitoring hardware that can independently update the state of cache lines
• Bus Cache Cycles: broadcast invalidates or updates; they may or may not be part of bus read/write cycles
Trang 24Cache Coherence
Two possible solutions:
1. Update the copies in the other processors' caches (Write-Update)
2. Invalidate the copies in the other processors' caches (Write-Invalidate)
Trang 32Directory-based protocol
(Figure: three nodes, each with a CPU, cache, local memory, and directory, connected by an interconnection network.)
Trang 33• To implement the operations, a directory must track the state of each cache block:
• Shared (S): one or more processors have the block cached, and the value is up to date
• Uncached (U): no processor has a copy of the cache block
• Modified/Exclusive (E): exactly one processor has a copy of the cache block; that processor is called the owner of the block
Trang 34–55
Worked example (three CPUs; a bit vector per block tracks sharers; X initially holds 7 in memory, uncached):
• CPU 0 reads X: read miss; the directory supplies X = 7 from memory, sets CPU 0's presence bit, and CPU 0 caches X in the shared state.
• CPU 2 reads X: read miss; the directory supplies X = 7 and adds CPU 2 to the bit vector; CPU 0 and CPU 2 now share X.
• CPU 0 writes 6 to X: the directory invalidates CPU 2's copy; CPU 0 becomes the owner and holds X = 6 in the modified state (memory still holds 7).
• CPU 1 reads X: read miss; the directory recalls the line from owner CPU 0, which switches to shared and writes 6 back to memory; CPU 0 and CPU 1 now share X = 6.
• CPU 2 writes 5 to X: write miss; the directory invalidates the shared copies; CPU 2 becomes the owner with X = 5, written back to memory later.
• CPU 0 writes 4 to X: write miss; the directory recalls the line from owner CPU 2 (which writes 5 back to memory) and invalidates it; CPU 0 becomes the owner with X = 4.
Trang 57
MESI Protocol (1)
• A practical multiprocessor invalidate protocol which attempts to minimize bus usage.
• Allows use of a 'write back' scheme - i.e. main memory is not updated until a 'dirty' cache line is displaced.
• An extension of the usual cache tags, i.e. the invalid tag and 'dirty' tag of a normal write-back cache.
Trang 58MESI Protocol (2)
Any cache line can be in one of 4 states (2 bits):
• Modified - the cache line has been modified and differs from main memory; it is the only cached copy (the multiprocessor 'dirty' state)
• Exclusive - the cache line is the same as main memory and is the only cached copy
• Shared - same as main memory, but copies may exist in other caches
• Invalid - the line data is not valid (as in a simple cache)
Trang 59MESI Protocol (3)
• Cache line changes state as a function of memory access events.
• Event may be either
• Due to local processor activity (i.e cache access)
• Due to bus activity - as a result of snooping
• A cache line's state is affected by a bus event only if the snooped address matches the line
Trang 61MESI Local Read Hit
• The line must be in one of M, E, or S
• This must be the correct local value (if M, it must have been modified locally)
• Simply return value
• No state change
Trang 62MESI Local Read Miss (1)
• No other cache has a copy:
• Processor makes bus request to memory
• Value read to local cache, marked E
• One other cache has an E copy:
• Processor makes bus request to memory
• Snooping cache puts copy value on the bus
• Memory access is abandoned
• Local processor caches value
• Both lines set to S
Trang 63MESI Local Read Miss (2)
• Several caches have S copy
• Processor makes bus request to memory
• One cache puts copy value on the bus (arbitrated)
• Memory access is abandoned
• Local processor caches value
• Local copy set to S
• Other copies remain S
Trang 64MESI Local Read Miss (3)
• One cache has M copy
• Processor makes bus request to memory
• Snooping cache puts copy value on the bus
• Memory access is abandoned
• Local processor caches value
• Local copy tagged S
• Source (M) value copied back to memory
• Source value M -> S
Trang 65MESI Local Write Hit (1)
The line must be in one of M, E, or S:
• M
• Line is exclusive and already 'dirty'
• Update local cache value
• No state change
• E
• Update local cache value
• State E -> M
Trang 66MESI Local Write Hit (2)
• S
• Processor broadcasts an invalidate on bus
• Snooping processors with S copy change S->I
• Local cache value is updated
• Local state change S->M
Trang 67MESI Local Write Miss (1)
Detailed action depends on copies in other processors
Trang 68MESI Local Write Miss (2)
• Other copies, either one in state E or more in state S
• Value read from memory to local cache - bus transaction marked RWITM (read with intent to modify)
• Snooping processors see this and set their copy state to I
• Local copy updated & state set to M
Trang 69MESI Local Write Miss (3)
Another copy in state M
• Processor issues bus transaction marked RWITM
• Snooping processor sees this
• Takes control of bus
• Writes back its copy to memory
• Sets its copy state to I
Trang 70MESI Local Write Miss (4)
Another copy in state M (continued)
• Original local processor re-issues RWITM request
• Is now simple no-copy case
• Value read from memory to local cache
• Local copy value updated
• Local copy state set to M
Trang 71Putting it all together
• All of this information can be described compactly using a state transition diagram
• The diagram shows what happens to a cache line in a processor as memory access events occur
Trang 72–73
(State transition diagram: arcs labelled ReadMiss(sh), ReadMiss(ex), WriteHit, WriteMiss, Mem Read, and RWITM connect the four states; a legend distinguishes bus transactions from copy-back operations.)
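The transitions just described can be collected into a small state machine. A hedged C sketch (function and event names are ours, not from the slides; it models only the state changes, not the data movement or copy-back):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Local read: a hit (M, E, or S) leaves the state unchanged;
 * a miss loads the line as E if no other cache holds it, else S. */
mesi_t local_read(mesi_t s, int other_copies_exist)
{
    if (s != INVALID)
        return s;
    return other_copies_exist ? SHARED : EXCLUSIVE;
}

/* Local write: S broadcasts an invalidate, E upgrades silently,
 * I issues RWITM; every case ends in MODIFIED. */
mesi_t local_write(mesi_t s)
{
    (void)s;
    return MODIFIED;
}

/* Bus event snooped for a matching address: an RWITM invalidates
 * our copy; a plain bus read makes M (after copy-back) and E
 * drop to S. */
mesi_t snoop(mesi_t s, int bus_rwitm)
{
    if (s == INVALID)
        return INVALID;
    if (bus_rwitm)
        return INVALID;
    return SHARED;
}
```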
Trang 74MESI notes
• There are minor variations (particularly to do with write miss)
• Normal ‘write back’ when cache line is evicted is done if line state is M
• Multi-level caches
• If caches are inclusive, only the lowest level cache needs to snoop on the bus
Trang 75
Directory Schemes
• Avoid the broadcast required by snooping by keeping track, for each memory block, of which processors have it cached, and then using point-to-point messages over the interconnection network to maintain coherence
Trang 76
Basic Scheme (Censier & Feautrier)
• Assume "k" processors
• With each cache block in memory: k presence bits, and 1 dirty bit
• With each cache block in cache: 1 valid bit, and 1 dirty (owner) bit
• Read from main memory by PE-i:
• If dirty-bit is OFF then { read from main memory; turn p[i] ON; }
• If dirty-bit is ON then { recall line from dirty PE (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to PE-i; }
• Write to main memory by PE-i:
• If dirty-bit is OFF then { send invalidations to all PEs caching that block; turn dirty-bit ON; turn p[i] ON; }
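The bookkeeping above can be sketched in C. This is an illustrative model only (the names and the fixed k = 4 are ours): it tracks the presence bits and the dirty bit, not the actual data movement or the recall/invalidate messages:

```c
#include <stdbool.h>

#define K 4                       /* number of processors (PEs) */

/* Per-block directory state: k presence bits and one dirty bit,
 * as in the Censier & Feautrier scheme. */
typedef struct {
    bool p[K];                    /* p[i]: PE-i has a cached copy */
    bool dirty;                   /* some PE holds a modified copy */
} dir_entry;

/* Read of the block by PE-i. */
void dir_read(dir_entry *d, int i)
{
    if (d->dirty)
        d->dirty = false;         /* recall from owner, update memory */
    d->p[i] = true;               /* PE-i now caches the block */
}

/* Write to the block by PE-i. */
void dir_write(dir_entry *d, int i)
{
    for (int j = 0; j < K; j++)
        d->p[j] = false;          /* invalidate every other copy */
    d->p[i] = true;               /* PE-i is the new owner */
    d->dirty = true;
}
```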
Trang 77
MEMORY CONSISTENCY
Trang 78
What is a memory consistency model?
• Memory consistency model: the order in which memory operations will appear to execute
⇒ What value can a read return?
• Affects ease of programming and performance
Trang 79
Implicit Memory Model
• Sequential consistency (SC): the result of an execution appears as if all operations were executed in some sequential order, and the operations of each processor appear in this sequence in program order
Trang 80–82
Understanding Program Order
• Initially Flag1 = Flag2 = 0

P1 (Operation, Location, Value)    P2 (Operation, Location, Value)
Write, Flag1, 1                    Write, Flag2, 1
Read, Flag2, 0                     Read, Flag1, 0
Trang 83
Understanding Program Order
• Can happen if the hardware or compiler
• reorders the operations, or
• allocates Flag1 or Flag2 in registers
• Observed on AlphaServer, NUMA-Q, T3D/T3E, Ultra Enterprise Server
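Under SC, no execution lets both reads return 0: any interleaving that respects each processor's program order puts one write before the corresponding read. A brute-force check over the six legal interleavings (a self-contained sketch, not from the slides) confirms this:

```c
#include <stdbool.h>

/* Enumerate every sequentially consistent interleaving of
 *   P1: Write Flag1,1 ; Read Flag2     P2: Write Flag2,1 ; Read Flag1
 * and report whether both reads can return 0.  Operations 0,1 belong
 * to P1 and 2,3 to P2; program order within each processor is kept. */
bool both_reads_can_be_zero(void)
{
    /* the 6 interleavings of {0,1} and {2,3} preserving order */
    static const int orders[6][4] = {
        {0,1,2,3}, {0,2,1,3}, {0,2,3,1},
        {2,0,1,3}, {2,0,3,1}, {2,3,0,1},
    };
    for (int k = 0; k < 6; k++) {
        int flag1 = 0, flag2 = 0, r1 = -1, r2 = -1;
        for (int s = 0; s < 4; s++) {
            switch (orders[k][s]) {
            case 0: flag1 = 1;  break;   /* P1: Write Flag1,1 */
            case 1: r1 = flag2; break;   /* P1: Read Flag2    */
            case 2: flag2 = 1;  break;   /* P2: Write Flag2,1 */
            case 3: r2 = flag1; break;   /* P2: Read Flag1    */
            }
        }
        if (r1 == 0 && r2 == 0)
            return true;
    }
    return false;   /* no SC interleaving yields (0, 0) */
}
```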
Trang 84Understanding Atomicity
Trang 85Cache Coherence Protocols
How to propagate a write?
• Invalidate: remove old copies from other caches
• Update: update old copies in other caches to the new values
Trang 88Understanding Program Order: Summary
SC limits program order relaxation:
• Write => Read
• Write => Write
• Read => Read, Write
• Read others’ writes early
• Read own write early
• Unserialized writes to the same location
Alternative
• Give up sequential consistency
• Use relaxed models
Trang 89Classification for Relaxed Models
Typically described as system optimizations (system-centric)
Optimizations:
Program order relaxation:
• Write => Read
• Write => Write
• Read => Read, Write
Read others’ write early
Read own write early
All models provide safety net
All models maintain uniprocessor data and control dependences, write serialization
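The safety net on a relaxed model is an explicit ordering operation. As a C11 sketch (names are ours), a release store paired with an acquire load restores the producer/consumer ordering that program-order relaxation would otherwise break:

```c
#include <stdatomic.h>
#include <pthread.h>

static int data;                  /* ordinary, non-atomic payload */
static atomic_int ready;          /* the safety-net flag */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;
    /* release: the write to data cannot be reordered after this store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return 0;
}

static void *consumer(void *arg)
{
    /* acquire: reads after this load cannot be reordered before it */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                         /* spin until the producer is done */
    *(int *)arg = data;           /* guaranteed to observe 42 */
    return 0;
}
```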
Trang 90Some Current System-Centric Models
Trang 91System-Centric Models: Assessment
System-centric models provide higher performance than SC
Trang 93
What is multi-threading?
• A thread is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler
• A multi-threading processor executes multiple threads concurrently within the context of a single process; the threads share the resources of a single core: the computing units, the CPU caches, and the translation lookaside buffer (TLB), while different processes do not share these resources
• On a single processor, multi-threading is generally implemented by time-division multiplexing
• Multi-threading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism