Chapter 18 - Recovery and fault tolerance. This chapter discusses recovery and fault tolerance techniques used in a distributed operating system. Resiliency, which is a technique for minimizing the impact of a fault, is also discussed.
OS control functions in a distributed environment
• Special features of distributed OS control functions
– Termination detection
* Check whether all processes of a computation, which may operate
in different computers, have completed
– Election
* Elect a coordinator for a privileged function like resource allocation
Nature of a distributed control algorithm
• A distributed control function offers services to both
system and user processes
– It operates in parallel with its clients
• The following terminology is used to differentiate between
the distributed control algorithm and its clients
– Basic computation: Operation of a client
* Interprocess messages used by it are called basic messages
– Control computation: Operation of the control algorithm
* Interprocess messages exchanged in the control computation are
called control messages
– Basic part and control part of a process
Basic and control parts of a process Pi
• The basic part of Pi interacts with basic parts of other processes through
basic messages; analogously for the control part of Pi
• The control part provides services such as resource allocation to the
basic part
Correctness of a distributed control algorithm
• Processes of a distributed control algorithm exchange
control data and coordinate through control messages
– New correctness issues arise because
* Exchange of control messages incurs delays
Control data used in processes may become stale or may appear inconsistent
– Hence correctness has two new facets
Liveness and safety of distributed control algorithms
• Liveness: the system eventually performs its intended actions, e.g., a process requesting entry to a CS eventually enters it
• Safety: the system never enters an undesirable state, e.g., at most one process is in a CS for a data item at any time
Distributed mutual exclusion algorithms
• At any time, at most one process may be in a CS for a
data item ds
– Permission-based algorithms
* A process seeks the permission of a set of processes and enters a
CS only when all processes in the set have granted the permission
Ricart–Agrawala algorithm
• Steps of the algorithm
1 Process wishing to enter CS sends time-stamped requests to all other processes
2 When a process receives a request
a If it is not interested in entering CS, it sends a ‘go ahead’
immediately
b If it is also interested in entering CS, it sends a ‘go ahead’ only if the received request’s time-stamp < its own time-stamp
c If it is in a CS, it adds the request to the pending list
3 When a process receives n − 1 ‘go ahead’ replies, it enters CS
4 When a process exits a CS, it sends ‘go ahead’ replies to each request in its pending list
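The decision rule in step 2 can be shown as a minimal single-threaded Python sketch; the class and attribute names are illustrative, not from the text, and message passing is replaced by direct method calls.

```python
# Minimal sketch of the Ricart–Agrawala decision rule (steps 2a–2c).

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.wants_cs = False
        self.in_cs = False
        self.req_ts = None          # timestamp of own pending request
        self.pending = []           # deferred requesters (steps 2b, 2c)

    def receive_request(self, ts, sender):
        """Return True to send 'go ahead' now, else defer."""
        if self.in_cs:
            self.pending.append(sender)            # step 2c
            return False
        if self.wants_cs:
            # step 2b: grant only if the request is older (ties broken by pid)
            if (ts, sender) < (self.req_ts, self.pid):
                return True
            self.pending.append(sender)
            return False
        return True                                # step 2a

# P1 (ts=1) and P2 (ts=2) both want the CS; P1's older request wins.
p1, p2 = Process(1), Process(2)
p1.wants_cs, p1.req_ts = True, 1
p2.wants_cs, p2.req_ts = True, 2
print(p2.receive_request(1, 1))   # True: P2 grants P1's older request
print(p1.receive_request(2, 2))   # False: P1 defers P2's newer request
```

Breaking timestamp ties by process id is what makes the request ordering total, so two processes can never each defer the other.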
Basic and control actions in Ricart–Agrawala algorithm
(Figure: steps 1, 2(b), and 3 of the algorithm marked in the basic and control parts of a process)
Maekawa algorithm
• Each process has a request set of processes; it seeks the permission of only processes in the request set (Ri represents the request set of process Pi)
– Correctness is ensured through the following rules:
* For all Pi : Pi is included in Ri
* For all Pi , Pj : Ri ∩ Rj is non-null
– The algorithm requires 2 × √n messages per CS entry
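Grid quorums are one standard way to build request sets satisfying both rules: arrange the n = k × k processes in a grid and let Ri be the union of Pi's row and column. This construction is an illustration (the text does not prescribe it), but the two rules can be checked mechanically:

```python
import math

def grid_request_sets(n):
    """R_i = row(i) ∪ column(i) in a k x k grid, k = sqrt(n).
    |R_i| = 2*sqrt(n) - 1, and any two sets intersect (they always
    share the cell at row(i) x column(j))."""
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes n is a perfect square"
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + j for j in range(k)}
        col = {j * k + c for j in range(k)}
        sets.append(row | col)
    return sets

R = grid_request_sets(16)
print(len(R[0]))                                        # 7 = 2*4 - 1
print(all(i in R[i] for i in range(16)))                # rule 1 holds
print(all(R[i] & R[j] for i in range(16) for j in range(16)))  # rule 2 holds
```

The quorum size of about 2√n per process is what keeps the per-entry message count proportional to √n rather than n.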
Token-based algorithm for a ring
• Only the process holding the token can enter a CS
– An abstract ring is superimposed on a system (see next slide)
– A process wishing to enter a CS sends its request along the ring and enters the CS only when it receives the token
* A process not holding the token simply forwards the request to the next process
* If the process holds the token and is not in a CS
It sends the token to the next process
The token travels over the ring until it reaches the requester
* If the process holds the token and is in a CS
The request is entered in a request queue in the token
When the process finishes the CS, it sends token to the next process for delivery to the first process in its request queue
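The ring behavior above can be condensed into a single-threaded sketch in which the ring, the token's position, and the token's request queue are modeled directly (all names are illustrative):

```python
from collections import deque

class Ring:
    def __init__(self, n, holder):
        self.n = n
        self.holder = holder          # process currently holding the token
        self.in_cs = None             # process currently in the CS, if any
        self.queue = deque()          # request queue carried in the token

    def request(self, pid):
        if self.in_cs is not None:    # holder is in a CS: queue the request
            self.queue.append(pid)
        else:
            # token travels along the ring until it reaches the requester
            while self.holder != pid:
                self.holder = (self.holder + 1) % self.n
            self.in_cs = pid

    def exit_cs(self):
        self.in_cs = None
        if self.queue:                # deliver token to first queued requester
            self.request(self.queue.popleft())

ring = Ring(6, holder=4)
ring.request(2); print(ring.in_cs)   # 2: token moved 4->5->0->1->2
ring.request(5)                      # P2 still in CS: P5 queued in the token
ring.exit_cs(); print(ring.in_cs)    # 5: token passed on, P5 enters CS
```

Carrying the request queue inside the token is the detail that lets the exiting process forward the token directly toward the next waiting requester.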
Token-based algorithm for a ring
(a) The system
(b) Abstract ring for the system: P4 has the token, requests by P2 and
P6 exist in the token’s request queue
Raymond’s token-based algorithm
• Features of the algorithm
– The algorithm uses an abstract inverted tree to reduce the
number of messages. It has three invariants:
* Each process in the system belongs to the tree
* The token holder is the root of the tree
* Each process other than the token holder has exactly one out-edge, which points
to its parent in the tree
– Each process has a local request queue
* When it receives a request, it puts the requestor’s id in the queue
* When it makes a request, it puts its own id in the queue
Raymond’s token-based algorithm
(a) A system
(b) Abstract inverted tree for the system: P5 holds the token
Raymond’s token-based algorithm
1 Process wishes to make a request:
a It enters the request in its local queue
b Sends it on the out-edge if it has not sent a request earlier
2 When a process receives a request:
a It enters the request in its local queue
b Sends it on the out-edge if it has not sent a request earlier
3 When a process completes execution of a CS:
a It removes the first requester id from its queue
b Sends the token to that process and inverts the edge to it
4 When a process receives the token:
a It removes the first requester id from its queue
b If it is its own id, it enters a CS; otherwise, it sends the token to that
process and inverts the edge to it
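Steps 1–4 can be sketched as follows; `parent[]` holds each process's out-edge (`None` for the token holder) and `queue[]` the local request queues. Token receipt and CS completion are fused into one `release()` loop for brevity, and all names are illustrative:

```python
class Raymond:
    def __init__(self, parent, holder):
        self.parent = dict(parent)              # process -> parent in tree
        self.holder = holder                    # current token holder (root)
        self.queue = {p: [] for p in self.parent}

    def request(self, pid, child=None):
        """pid enqueues its own request, or one received from a child."""
        entry = pid if child is None else child
        send = not self.queue[pid]              # no request sent earlier
        self.queue[pid].append(entry)
        if self.parent[pid] is not None and send:
            self.request(self.parent[pid], pid) # forward on the out-edge

    def release(self):
        """Holder leaves the CS; the token follows queued requests down,
        inverting edges, until a process finds its own id and enters."""
        pid = self.holder
        while self.queue[pid]:
            nxt = self.queue[pid].pop(0)
            if nxt == pid:                      # its own id: enter the CS
                self.holder = pid
                return pid
            self.parent[pid], self.parent[nxt] = nxt, None  # invert edge
            pid = nxt
        self.holder = pid
        return pid

# P5 holds the token; P3 and then P1 request it
r = Raymond({1: 3, 3: 4, 4: 5, 5: None}, holder=5)
r.request(3); r.request(1)
first, second = r.release(), r.release()
print(first, second)        # 3 1: P3 enters the CS, then P1
```

Note how P1's request stops at P3, whose queue was already non-empty; edge inversion keeps the holder at the root, preserving the invariants above.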
An example of Raymond's algorithm
(a) Process P5 is in CS. Requests made by P3 and P1 have reached it
(b) Process P5 finishes execution of CS and passes token to P4
Distributed deadlock handling
• A distributed computation may wish to use resources
located in many nodes of the system
– Information about allocated resources and pending requests in many nodes has to be collected
– Correctness problems may arise due to
* Delays in obtaining information
* Inconsistency of information
– Consider building a global wait-for graph (WFG) by collecting
information about wait-for relationships in all nodes
* Inconsistent information due to delays may cause phantom
deadlocks, i.e., declaration of deadlock when none exists
Diffusion computation-based deadlock detection
• Diffusion computation: used to collect info about nodes
– Diffusion phase
* A computation that originates in one node spreads to other nodes
A control message called a query is sent along each edge
The first query received by a node is called an engaging query
On receiving it, the node sends queries along all its out-edges
– Information collection phase
* Each node sends information in response to each query
It sends a dummy reply that contains null information for a
non-engaging query
It collects information from all replies it received, adds its own information, and sends the result as the reply to the engaging query
Diffusion computation-based deadlock detection
1 When a process becomes blocked on a resource request: It initiates a diffusion computation as follows:
a Send queries along all out-edges in WFG
b Remember the number of queries sent; await replies
c After receiving a matching number of replies, declare a deadlock if
it has been continuously in blocked state after sending queries
2 When a process receives an engaging query: If it is blocked, it performs the following actions:
a Send queries along all out-edges in WFG
b Remember the number of queries sent; await replies
c After receiving a matching number of replies, it computes and sends
the reply to its engaging query, if it has been continuously blocked after sending queries
3 When a process receives a non-engaging query:
a Send dummy reply if continuously blocked after sending queries
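The steps above can be sketched over a static wait-for graph (WFG). An active process never replies, modeled as a `False` (missing) reply; blocking is frozen for the duration of the check, which stands in for "continuously blocked". All names are illustrative:

```python
def query(wfg, blocked, node, engaged):
    """True iff `node` eventually replies to this query."""
    if node not in blocked:
        return False                   # active process: no reply ever
    if node in engaged:
        return True                    # non-engaging query: dummy reply
    engaged.add(node)                  # engaging query: spread further
    return all(query(wfg, blocked, m, engaged) for m in wfg[node])

def detects_deadlock(wfg, blocked, initiator):
    engaged = {initiator}
    return all(query(wfg, blocked, m, engaged) for m in wfg[initiator])

wfg = {1: [2], 2: [3], 3: [1], 4: []}
d1 = detects_deadlock(wfg, blocked={1, 2, 3}, initiator=1)  # blocked cycle
wfg[3] = [4]                           # P3 now waits for active P4
d2 = detects_deadlock(wfg, blocked={1, 2, 3}, initiator=1)
print(d1, d2)   # True False
```

The second case shows why no phantom deadlock is declared: the query reaching active P4 is never answered, so the initiator never collects a matching number of replies.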
Illustration of diffusion computation-based distributed deadlock detection
• P2, P3 are blocked. P1 becomes blocked and sends a query but does
not receive a reply because P4 is not blocked
• P4 requests a resource held by P1, becomes blocked and sends a query.
It would receive a reply and declare a deadlock
Mitchell–Merritt algorithm for distributed deadlock detection
• It is an edge chasing algorithm—control messages are
sent over WFG edges to detect cycles
– A provision is made to ensure that the cycle has not been
broken before it was detected
* Each process is assigned a public label and a private label
The labels are identical when a process is created
The public label of a process changes when it gets blocked on a resource
Rules of Mitchell–Merritt algorithm
• Block rule changes the labels of a process when it blocks; z = inc(u, x),
where inc generates a unique label larger than both u and x
• The transmit rule changes the public label of a process waiting for a process
that has a larger public label: it copies that label. A deadlock is declared when a
process finds that its own label has returned to it along a cycle in the WFG
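The block and transmit rules can be sketched as follows. `inc()` is modeled with a global counter so each new label exceeds all labels issued so far, and the detect condition is simplified to "public label equals private label and equals the waited-for process's public label"; all names are illustrative:

```python
import itertools

_counter = itertools.count(1)

class Proc:
    def __init__(self):
        self.public = self.private = 0      # labels identical at creation

def inc(u, x):
    return max(u, x) + next(_counter)       # unique label larger than both

def block(p, q):
    """Block rule: p blocks waiting for q and takes a fresh larger label."""
    p.public = p.private = inc(p.public, q.public)

def transmit(p, q):
    """Transmit rule: p, waiting for q, copies q's larger public label.
    Detect: if p's own label (public == private) has come back to it
    through the process it waits for, a cycle exists."""
    if q.public > p.public:
        p.public = q.public
        return False
    return p.public == p.private == q.public

p1, p2, p3 = Proc(), Proc(), Proc()
block(p1, p2); block(p2, p3); block(p3, p1)     # cycle P1->P2->P3->P1
transmit(p1, p2); transmit(p2, p3); transmit(p1, p2)
detected = transmit(p3, p1)
print(detected)     # True: P3's label traveled round the cycle
```

Only the process whose label makes the full round declares the deadlock, which is why a broken cycle is never reported: a label cannot return to its originator unless every edge of the cycle still exists.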
Distributed deadlock prevention
• Cycles are prevented as follows:
– A pair (local time, node id) is used to time-stamp creation of a process
– Resource conflicts are resolved using these timestamps, e.g., through wait–die or wound–wait rules, so that wait-for edges can point in only one direction and cycles cannot form
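A wait-die style rule over such timestamps can be sketched in a few lines (the specific wait-die choice is an assumption for illustration; the text only fixes the timestamp format):

```python
# Wait-die sketch: an older requester may wait for a younger holder;
# a younger requester is aborted ("dies"). Wait-for edges therefore
# always point from older to younger processes, so no cycle can form.

def on_conflict(requester_ts, holder_ts):
    """Both timestamps are (local_time, node_id) pairs; smaller = older.
    Tuple comparison gives a total order even when local times tie."""
    return "wait" if requester_ts < holder_ts else "abort"

print(on_conflict((10, 1), (12, 2)))   # 'wait': older requester waits
print(on_conflict((15, 3), (12, 2)))   # 'abort': younger requester dies
```

Appending the node id to the local time is what makes timestamps unique across nodes without any clock synchronization.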
Distributed scheduling algorithms
• Computational load in nodes is balanced through the
technique of process migration
Distributed scheduling algorithms
• Issues in distributed scheduling
– Kinds of process migration
* Preemptive migration requires transfer of state—hard to implement
* Non-preemptive migration is performed while creating a process—avoids need to transfer state
– Identifying nodes for process migration by quantifying ‘load’
* Heavily loaded nodes become sender nodes, lightly loaded nodes become receiver nodes
Use CPU utilization as the criterion: causes overhead
Use length of ready queue: easier to implement
– Stability of a scheduling algorithm: An algorithm is unstable if, under some conditions, its overhead is unbounded
* Excessive shuffling of processes between nodes causes instability
Distributed scheduling algorithms
• Three kinds of distributed scheduling algorithms
– Sender initiated algorithms
* Thresholds on load are used to identify senders and receivers
* A sender node migrates a process non-preemptively at its creation
Sender node polls other nodes to identify a receiver node
Instability at high load: prevent by limiting the amount of polling
– Receiver initiated algorithm
* When a process completes, the node checks whether it has become a receiver and migrates a process preemptively to itself
* No instability
At high load, senders are easy to find
Distributed scheduling algorithms
• Three kinds of distributed scheduling algorithms (contd)
– Symmetrically initiated algorithms
* Has features of both sender and receiver initiated algorithms
Behaves like sender initiated algorithm at low loads
Behaves like receiver initiated algorithm at high loads
– Outline of a symmetrically initiated algorithm
* Each node maintains lists of senders, receivers and ok nodes
* A sender node polls nodes in the receivers list
If the node is a receiver node, a process is migrated
If the node is not a receiver, it is put into appropriate list
* Analogously, a receiver node polls nodes in the senders list
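The symmetrically initiated outline can be sketched as below: nodes are classified by ready-queue length against two thresholds (the threshold values are assumed for illustration), and a sender polls its receivers list, refiling stale entries. All names are illustrative:

```python
LOW, HIGH = 2, 5        # illustrative ready-queue thresholds

def classify(queue_len):
    if queue_len > HIGH:
        return "sender"
    if queue_len < LOW:
        return "receiver"
    return "ok"

def poll_receivers(loads, receivers):
    """Poll receivers in order; migrate one process to the first node
    that really is a receiver, dropping mis-listed nodes from the list."""
    for node in list(receivers):
        if classify(loads[node]) == "receiver":
            loads[node] += 1                # migrate one process to it
            return node
        receivers.remove(node)              # stale entry: refile the node
    return None

loads = {"A": 7, "B": 4, "C": 1}
receivers = ["B", "C"]
target = poll_receivers(loads, receivers)
print(classify(loads["A"]), target)     # sender C (B turned out 'ok')
```

A symmetric `poll_senders` on a receiver node would mirror this loop; keeping the lists fresh through polling is what lets the algorithm behave sender-initiated at low load and receiver-initiated at high load.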
Performance of distributed scheduling algorithms
Distributed termination detection
• Processes of a distributed computation execute in
different nodes of a distributed system
– These processes perform work assigned to them
* A process is active when it is performing work, and passive when it
has no work
* Work is assigned to a process through a message
Hence it may become active on receiving a message
– The distributed termination condition (DTC) holds when such a
computation has completed. It consists of two parts:
* All processes of the computation are passive
* No basic messages are in transit
Distributed termination detection
• A diffusion computation-based algorithm
– Following assumptions are made
* Processes are not created or destroyed dynamically during operation of the algorithm
* Interprocess communication channels are FIFO
* Processes communicate through synchronous communication, i.e., the process sending a message is blocked until a response is
received
Distributed termination detection—A diffusion computation-based algorithm
1 When a process becomes passive
a Sends “shall I declare termination?” messages on all edges
b After receiving replies to all messages: Declares termination if all replies are “yes”
2 When a process receives an engaging query
a Send queries on all edges, except the one along which it received engaging query
b After receiving replies to all messages: Send a “yes” reply to process from which it received engaging query if all received replies are “yes”; otherwise, send a “no” reply
3 When a process receives a non-engaging query
a Send a “yes” reply
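Under the stated assumptions (synchronous communication, so no basic messages are in transit), the steps above reduce to a recursive reply collection over a static snapshot; all names are illustrative:

```python
def ask(edges, passive, node, visited):
    """Reply to a query: 'yes' (True) or 'no' (False)."""
    if node in visited:
        return True                    # non-engaging query: reply "yes"
    visited.add(node)                  # engaging query: spread further
    if node not in passive:
        return False                   # active process: reply "no"
    return all(ask(edges, passive, m, visited) for m in edges[node])

def declare_termination(edges, passive, initiator):
    visited = {initiator}
    return initiator in passive and all(
        ask(edges, passive, m, visited) for m in edges[initiator])

edges = {1: [2, 3], 2: [3], 3: [1]}
t1 = declare_termination(edges, passive={1, 2, 3}, initiator=1)
t2 = declare_termination(edges, passive={1, 3}, initiator=1)
print(t1, t2)   # True False: P2 being active blocks the second declaration
```

The `visited` set plays the role of the engaging/non-engaging distinction: only the first query a node receives spreads further, so every edge carries at most one query.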
Election algorithms
• Algorithms for a ring topology
– Algorithm 1:
1 Process Pi initiates an election by sending an (“elect me”, Pi) message
2 Every process Pj receiving an (“elect me”, Pi) message sends an
(“elect me”, Pj) message and then forwards Pi’s message
3 Eventually, each process receives the “elect me” messages of every
other process; it elects the highest priority process as leader
a It sends a “new coordinator” message to inform others
– Algorithm 2: Refinement of algorithm 1
* In Step 2, Pj sends only one message: its own message if its
priority is higher than Pi’s; otherwise, it sends Pi’s message
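Algorithm 2 can be compressed into a single pass around the ring: one circulating message whose payload is replaced by a higher-priority id at each hop (priority == pid here, an assumption for illustration):

```python
def ring_election(pids, initiator_idx):
    """One circuit of the ring; the surviving message carries the winner."""
    n = len(pids)
    msg = pids[initiator_idx]            # ("elect me", pid) payload
    for step in range(1, n + 1):
        pj = pids[(initiator_idx + step) % n]
        msg = max(msg, pj)               # Pj forwards the higher-priority id
    return msg                           # message returning to the initiator

leader = ring_election([3, 7, 1, 5], initiator_idx=2)
print(leader)   # 7: highest priority process elected
```

Suppressing lower-priority messages is exactly the refinement: in the worst case (ids in descending ring order) it still costs O(n²) messages, but on average far fewer than Algorithm 1.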
Election algorithms
• Bully algorithm
1 A process Pi initiates an election by sending (“elect me”, Pi) messages
to all higher priority processes and starting a time-out interval T1
a If a time-out occurs, it sends a “new coordinator” message to lower priority processes
b If it receives a “don’t you dare” message from a higher priority
process Pj , it starts another time-out interval T2
i If a time-out occurs, it assumes that all higher priority processes have failed and starts a new election
2 When a process Pj receives an (“elect me”, Pi) message from a lower
priority process
a It sends a “don’t you dare” message to its sender
b Starts a new election by sending (“elect me”, Pj) messages
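The eventual outcome of these time-out driven exchanges can be sketched recursively: the initiator wins only if every higher priority process has failed, otherwise the lowest surviving higher-priority process takes over the election (priority == pid is an assumption for illustration):

```python
def bully(pids, failed, initiator):
    """Outcome of a bully election started by `initiator`."""
    higher = sorted(p for p in pids if p > initiator and p not in failed)
    if not higher:
        return initiator        # time-out T1 expires: no "don't you dare"
    # a higher-priority survivor sends "don't you dare" and restarts
    return bully(pids, failed, higher[0])

pids = {1, 2, 3, 4, 5}
c1 = bully(pids, failed={4, 5}, initiator=1)
c2 = bully(pids, failed=set(), initiator=3)
print(c1, c2)   # 3 5: the highest surviving process always wins
```

The name "bully" comes from this takeover behavior: a live higher-priority process always suppresses the election of any lower-priority one.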
Resource allocation in a distributed system
1. Pi requests the resource allocator for a specific resource
2. The resource allocator consults the name server and finds the id of the resource
3. The resource allocator informs the requester and the resource manager of the allocation
Process migration
• Process migration is performed for load balancing
– Difficulties
* Process state is distributed in various data structures of the OS
* Process id’s may change due to migration
Process id’s are used in interprocess communication
Solution: Use global process ids as in Sun cluster
* Delivery of messages requires a special provision
A node receiving a message would redirect it if the destination process has migrated out of it
» This residual state causes poor reliability
Alternatively, all processes may be informed when a process is migrated
» Requires a complex protocol