How To Resolve Network Partitions In Seconds With Lightbend Enterprise Suite
Strategies To Seamlessly Recover
From Split Brain Scenarios
WHITE PAPER
Table Of Contents
Executive Summary And Key Takeaways
The Question Is When, Not If, The Network Will Fail
Distributed Systems Raise Network Complexity
Reactive Systems Can Heal Themselves, But Not Network Partitions
The Problem
High-Level Solution
Four Strategies To Resolve Network Partitions
Strategy 1 - Keep Majority
Strategy 2 - Static Quorum
Strategy 3 - Keep Oldest
Strategy 4 - Keep Referee
Split Brain Resolution From Lightbend
Akka SBR
Cluster Management SBR
The Benefits
Serve Customers Better
Eliminate Expensive Downtime
Immediate Time-To-Value
Summary
Executive Summary And Key Takeaways
In the era of highly-distributed, real-time applications, network issues often result in so-called “split brain” scenarios, where one or more nodes in a cluster suddenly become unresponsive. This results in data inconsistencies between distributed services that can cause cascading failures and downtime.
While the industry has turned to Reactive systems to solve issues of application/service-level resilience, elasticity, and consistency, network partitions occur outside of the area of concern addressed by these architectures.
Given the inevitability of network partitions and the impossibility of truly eradicating them, the best solution is to have predetermined strategies that fit business requirements for dealing with this recurring issue quickly and with minimal disruption to overall system responsiveness.
Lightbend Enterprise Suite, the commercial component of Lightbend Reactive Platform, offers four such strategies as part of its Split Brain Resolver (SBR) feature. These strategies, called “Keep Majority,” “Static Quorum,” “Keep Oldest,” and “Keep Referee,” can be seamlessly executed in a matter of seconds if and when network partitions occur.
Users can take advantage of the SBR feature either in development at the code level or at the production/operations level.
By avoiding data inconsistencies, failures, and downtime, users of Lightbend Enterprise Suite can better serve their customers, leading to increased retention, growth, and market share.
The Question Is When, Not If, The Network Will Fail
“The network is reliable.”
One Of The Fallacies Of Distributed Computing
Network issues are unavoidable in today’s complex environments. To put it more colloquially: networks are flaky. Most users are aware of this and even accustomed to it, and are willing to handle random unresponsiveness from time to time. However, user tolerance for network flakiness has limits. If a specific website or app repeatedly experiences problems, patience wears thin. As network issues mount, it becomes increasingly likely that a user will interact with the offending website or app less often or even abandon it altogether.
This is not to say that users are the only ones impacted by network issues. In a world full of APIs and interconnected systems, network problems affecting one system can easily impact other connected or dependent systems. Users interacting with one of those applications through a front-end will likely notice the problem within a short amount of time, but in some cases, it might take many hours or even days for such problems to become apparent. For the sake of convenience, this paper will use user experience to highlight the pernicious effects of network issues.
Most websites and apps access a database or have some form of data persistence layer. Communication from the application layer to the persistence layer is often over a network. Thus, for the duration of network issues, problems, and outages, the application becomes unable to perform its normal duties and user experience starts to suffer.
Network problems can also span a variety of locations. They can be widespread across the entire network, occur locally within data centers, or even arise in a single router or on-premise server. A complete outage is the nightmare scenario, but even small network hiccups can result in lost revenue. For example, high network traffic often creates very slow response times. In many cases, these slowdowns are actually worse than broken connections because, even with proper monitoring tools, the offending issue is non-obvious and difficult to diagnose and fix.
Distributed Systems Raise Network Complexity
With distributed systems, various application components (e.g., individual microservices and Fast Data pipelines) communicate with each other via some form of messaging. One component asks another component for some information. A component may communicate a variety of information to other components.
Figure 1 - Component Messaging
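To make this messaging style concrete, here is a minimal Scala sketch of one component asking another for information and reacting to the reply asynchronously. The component and message names (PricingComponent, StockRequest) and the MessageBus trait are hypothetical placeholders for whatever messaging layer (Akka, Kafka, HTTP, and so on) a real system would use.

```scala
// Hypothetical sketch of message-driven components: one component asks
// another for information and reacts to the reply asynchronously.
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Messages exchanged between components
final case class StockRequest(sku: String)
final case class StockReply(sku: String, unitsAvailable: Int)

// Stand-in for the messaging layer that sits between the components
trait MessageBus {
  def ask(msg: StockRequest): Future[StockReply]
}

// The "pricing" component asks the "inventory" component for information.
class PricingComponent(bus: MessageBus) {
  def quote(sku: String): Future[BigDecimal] =
    bus.ask(StockRequest(sku)).map { reply =>
      // React to the answer when (and if) it arrives; the network sits between
      // the two components, so the Future may also fail or time out.
      if (reply.unitsAvailable > 0) BigDecimal(10) else BigDecimal(15)
    }
}
```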
Given that, distributed applications can serve to make network issues better or worse, depending on how well they are designed. Poorly-designed systems crumble when network problems occur. Well-designed systems recover gracefully when the impacted components stop responding or respond very slowly. The latest breed of distributed systems is designed from the ground up to be fully prepared for inevitable network issues. These systems typically have well-defined default or compensating actions that activate when needed. This allows the overall application to continue to function for users even when an application component stops working. This new breed of systems is known as Reactive systems.
Reactive Systems Can Heal Themselves, But Not Network Partitions
Reactive systems are designed to maintain a level of responsiveness at all times, scale elastically to meet fluctuations in demand, and remain highly resilient against failures with built-in self-healing capabilities. Lightbend, a leader in the Reactive movement, codified the Reactive principles of responsiveness, resilience, and elasticity, all backed by a message-driven architecture, in the Reactive Manifesto in 2013. Since then, Reactive has gone from being a virtually unacknowledged technique for building applications, used by only fringe projects within a select few corporations, to becoming part of the overall platform strategy for some of the biggest companies in the world.
Compared to a traditional system, in which small failures can cause a system-wide crash, Reactive systems are designed to isolate the offending application or cluster node and restart a new instance somewhere else. However, at the overall network level, which may span the entire globe, there exists a fundamental problem with network partitions in distributed systems: it is impossible to tell whether an unresponsive node is the result of a partition in the network (known as a “split brain” scenario) or of an actual machine crash.
Network partitions and node failures are indistinguishable to the observer: a node may see that there is a problem with another node, but it cannot tell whether that node has crashed and will never be available again or whether there is a network issue that might heal after some time. Processes may also become unresponsive for other reasons, such as overload, CPU starvation, or long garbage collection pauses, leading to further confusion.
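One way to make this point concrete is to model it in types. In the illustrative Scala sketch below (not an Akka API), every possible cause on the remote side collapses into the single observation a monitoring node can actually make: the peer is unreachable.

```scala
// What actually happened on the remote node: never directly observable.
sealed trait ActualCause
case object MachineCrashed     extends ActualCause
case object NetworkPartitioned extends ActualCause
case object LongGcPause        extends ActualCause
case object CpuStarvation      extends ActualCause

// What a monitoring node can actually observe about a peer.
sealed trait Observation
case object Reachable   extends Observation
case object Unreachable extends Observation

object FailureDetectorView {
  // Every possible cause collapses to the same observation, which is exactly
  // why a node cannot tell a crashed peer from one behind a partition.
  def observe(cause: ActualCause): Observation = Unreachable
}
```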
As such, even the most well-designed Reactive systems require additional tooling to quickly and decisively tackle large-scale network issues. The next section explores how networking problems, in particular network partitions, impact Reactive systems.
The Problem
“The network is homogeneous.”
One Of The Fallacies Of Distributed Computing
In Reactive systems, challenges arise when heterogeneous software components, such as collaborating groups of individual microservices, exchange important messages with each other. Important messages must be delivered and processed, and any failure to deliver and process an important message will result in some form of inconsistent state.
When the network fails in a distributed system environment, it effectively creates a partition between the systems on each side of the network outage. In most cases, the network has failed while all of the systems are still running. The systems on each side of the network outage can no longer communicate across the partition. It is as if an impenetrable wall has been placed between the systems on both sides of the network outage. This is known as a split brain scenario.
Figure 2 - Network Partition
As shown in Figure 2, the network between the left and right nodes is broken. The connections between the nodes on each side of the partition are cut.
To illustrate the impact of network partitions, let’s consider two examples.
In the first one, we’ll look at an order processing system consisting of just two microservices: order and customer. The responsibility of the order microservice is to create new orders, and the customer microservice is responsible for reserving customer credit.
When users interact with this system and place an order, the order service creates a new order and sends an order created message to the customer service. The customer service receives the order created message and reserves the credit. It then sends a customer credit reserved message back to the order service. The order service receives the message and changes the order state to approved.
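Assuming a simple point-to-point messaging layer, the happy-path protocol just described might be sketched as follows. The message and service names mirror the example; the Transport trait and the destination names are hypothetical.

```scala
// Protocol messages exchanged between the order and customer microservices.
final case class OrderCreated(orderId: String, customerId: String, amount: BigDecimal)
final case class CreditReserved(orderId: String)

// Hypothetical one-way transport; in a real deployment this call crosses the network.
trait Transport {
  def send[A](destination: String, message: A): Unit
}

// Naive customer service: reserves credit every time an OrderCreated arrives.
class CustomerService(transport: Transport) {
  def onOrderCreated(msg: OrderCreated): Unit = {
    reserveCredit(msg.customerId, msg.amount)
    transport.send("order-service", CreditReserved(msg.orderId))
  }
  private def reserveCredit(customerId: String, amount: BigDecimal): Unit =
    println(s"reserved $amount of credit for customer $customerId")
}

// Order service: sends the order and approves it once credit is confirmed.
class OrderService(transport: Transport) {
  def placeOrder(order: OrderCreated): Unit =
    transport.send("customer-service", order)
  def onCreditReserved(msg: CreditReserved): Unit =
    println(s"order ${msg.orderId} approved")
}
```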
Let’s now consider the impact of a network partition on this system.
To begin with, the order service sends the customer service an order created message. The customer service then receives the message and reserves the credit as it should.
Figure 3 - Send Message Successfully
The customer service then attempts to send a credit reserved message back to the order service. But suddenly the customer service falls off the network and the message cannot be sent.
Figure 4 - Send Message Fails
The order service never hears back from the customer service, so it resends the order created message. It receives no response, so it retries repeatedly.
Figure 5 - Message Send Retry Loop
While the order service is caught in this retry loop, the network detects that the customer service is offline and efforts begin to bring it back online. When that eventually occurs, the order service successfully sends the order created message and the customer service receives it. But a naive implementation of the customer service would then reserve the credit again, which is the incorrect course of action.
Figure 6 - Message Sent Again
As demonstrated by this example, in the absence of a network partition handling strategy, businesses must make sure to incorporate a robust at-least-once delivery mechanism into the design.
Unfortunately, implementing such a mechanism is not trivial. For example, the common retry loop approach is brittle and has a number of complexities that, if not handled properly, will leave the system in an inconsistent state. In these circumstances, and many others that are beyond the scope of this paper, the only viable option is to have an effective network partition / split brain resolution strategy.
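As one illustration of the kind of machinery involved, the sketch below (reusing the hypothetical Transport and message types from the earlier snippet) makes the customer service idempotent: it remembers which order IDs it has already processed, so a redelivered order created message does not reserve credit twice. Even with this in place, a partition resolution strategy is still needed.

```scala
import scala.collection.mutable

// Idempotent customer service: at-least-once delivery may redeliver the same
// OrderCreated, so we deduplicate by orderId before reserving credit.
class IdempotentCustomerService(transport: Transport) {
  private val processed = mutable.Set.empty[String] // in production: durable storage

  def onOrderCreated(msg: OrderCreated): Unit = {
    if (!processed.contains(msg.orderId)) {
      processed += msg.orderId
      reserveCredit(msg.customerId, msg.amount)
    }
    // Always re-acknowledge, even for duplicates, so the sender stops retrying.
    transport.send("order-service", CreditReserved(msg.orderId))
  }

  private def reserveCredit(customerId: String, amount: BigDecimal): Unit =
    println(s"reserved $amount of credit for customer $customerId")
}
```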
In the second illustrative example, we will use an in-person meeting with a group of seven coworkers.
To begin with, all seven members of the meeting are freely communicating back and forth.
Figure 7 - People in Meeting
Suddenly, a wall appears that splits the meeting in two, dividing the coworkers into one group of four people and another of three. The wall is solid and soundproof, preventing any communication between the two groups. No one on either side of the wall can ascertain what the other group is planning, making collaboration impossible.
Figure 8 - Wall Splits the Group
Let’s say that this is a very important meeting and it must continue, regardless of the presence of a giant impenetrable wall. What should each group do? Both groups could sit idly and wait for the wall to disappear. Or they could try to determine a strategy that would get the entire group back together.
Most responses to this situation would result in some confusion and disruption to the meeting. What if each group decided to continue the meeting on its own? That could result in decisions being made by the two split groups based on incomplete information, information that is only known to the people on the other side of the wall.
A better approach is for the smaller group to leave its side of the wall and rejoin the larger group, as shown in Figure 9.
Figure 9 - Group B Joins Group A
A key point here is that both groups had to independently arrive at the same conclusion: the majority stays where they are and the minority group moves to rejoin the majority. This works for an odd number of people, but what if there is an even split? Say there were eight in the meeting and the split was four and four. In this situation, tie-breaking rules could help. Perhaps the plan is to go to the side with the highest-ranking employee.
Returning to the first example with the order and customer microservices, we see that a network partition that cuts off communication between them will interrupt the normal order processing workflow.
Figure 10 - Network Partition Between Services
When that happens, just as with the meeting room example, each side needs to independently detect and resolve the issue. In other words, the system must be capable of running on each side of the partition, detecting that there is a problem, and deciding which side stays up and which side shuts down. The winning side should also be capable of restarting all the processes that were running on the losing side.
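The decision each side must reach can be thought of as a pure function over cluster membership. The Scala sketch below assumes every node knows the full member list and applies a simple keep-majority rule with a deterministic tie-break (lowest node address survives), so both sides of the partition independently compute consistent answers. It illustrates the idea only and is not Lightbend's implementation.

```scala
// Which side of a partition should keep running? Each side computes this
// independently from the membership it can still reach.
sealed trait Decision
case object Survive  extends Decision // keep running and take over the other side's work
case object DownSelf extends Decision // shut down the services on this side

final case class Node(address: String)

object PartitionDecision {
  def decide(allMembers: Set[Node], reachableFromHere: Set[Node]): Decision = {
    val mySide    = reachableFromHere.size
    val otherSide = allMembers.size - mySide
    if (mySide > otherSide) Survive
    else if (mySide < otherSide) DownSelf
    else {
      // Even split: break the tie deterministically so both sides agree,
      // here by letting the side holding the lowest address survive.
      val lowest = allMembers.map(_.address).min
      if (reachableFromHere.exists(_.address == lowest)) Survive else DownSelf
    }
  }
}
```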
Figure 11 - Cluster Network Partition
Now consider a more realistic example. Say there are 20 microservices running on a cluster of five nodes (a node, in this case, could be a real server or VM), with four microservices running on each node (see Figure 11 above). The partition has cut off three of the nodes on one side and two nodes on the other side of the partition.
In order to detect and resolve a network partition in an environment like this, a number of things must occur:
3. Each node must be constantly checking whether it can talk to the other nodes in the cluster (a minimal sketch of such a check appears after this list).
4. When a network partition occurs, the monitoring component on each node needs to determine which nodes are still accessible and which ones are not.
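A minimal version of the reachability check in item 3 could look like the sketch below. The Heartbeat trait stands in for whatever failure-detection mechanism the cluster actually uses (it is an assumption, not a prescribed API), and Node is the same illustrative type used in the earlier decision sketch.

```scala
import scala.concurrent.duration._

// Hypothetical heartbeat mechanism: true if the peer answered within the timeout.
trait Heartbeat {
  def ping(peer: Node, timeout: FiniteDuration): Boolean
}

// Per-node reachability check (item 3 above), run on a fixed schedule.
final class ReachabilityMonitor(self: Node, peers: Set[Node], heartbeat: Heartbeat) {
  @volatile private var reachable: Set[Node] = peers + self

  def checkOnce(): Set[Node] = {
    val answered = peers.filter(p => heartbeat.ping(p, 500.millis))
    reachable = answered + self
    // Item 4: this set is what each node feeds into the partition decision
    // (e.g. PartitionDecision.decide above).
    reachable
  }
}
```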
In an environment where all of the above is in place, here is what should happen when the network partition occurs: the node-level monitoring detects that three nodes can still communicate with each other on one side of the partition, and the other two nodes detect that they can only communicate with each other.
Figure 12 - Split Brain Recovery
The two nodes on the minority side each shut down the microservices that are currently running on those nodes. The three nodes on the majority side move the eight processes that were running on the minority side over and begin to run them, provided there is sufficient capacity on the majority side. See Figure 12 above.
If there is insufficient capacity to host these additional microservices, it will be necessary to add one or more nodes to the majority side without administrator intervention, which in turn requires that the system be able to scale automatically.
In short, based on the environment, its capabilities, and any constraints it is subject to, there can be multiple strategies for resolving issues caused by network partitions. The next section explores four such strategies.
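For readers who want a preview of what selecting a strategy looks like in practice, the snippet below shows how the open-source Akka Split Brain Resolver (Akka 2.6 and later) is enabled and pointed at the keep-majority strategy. The configuration keys shown belong to that open-source module and are included only as an illustration; the Lightbend Enterprise Suite SBR described in this paper is configured through its own provider as documented by Lightbend.

```scala
import com.typesafe.config.ConfigFactory

// Illustrative only: enabling the open-source Akka Split Brain Resolver
// (Akka 2.6+) and selecting the keep-majority strategy via configuration.
object SbrConfigSketch {
  val sbrConfig = ConfigFactory.parseString(
    """
    akka.cluster {
      downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
      split-brain-resolver {
        active-strategy = keep-majority   # alternatives include static-quorum and keep-oldest
        stable-after    = 20s             # wait for the cluster view to stabilize before acting
      }
    }
    """)
}
```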