How To Resolve Network Partitions In Seconds With Lightbend Enterprise Suite
Strategies To Seamlessly Recover
From Split Brain Scenarios
WHITE PAPER
Table Of Contents
Executive Summary And Key Takeaways
The Question Is When, Not If, The Network Will Fail
Distributed Systems Raise Network Complexity
Reactive Systems Can Heal Themselves, But Not Network Partitions
The Problem
High-Level Solution
Four Strategies To Resolve Network Partitions
Strategy 1 - Keep Majority
Strategy 2 - Static Quorum
Strategy 3 - Keep Oldest
Strategy 4 - Keep Referee
Split Brain Resolution From Lightbend
Akka SBR
Cluster Management SBR
The Benefits
Serve Customers Better
Eliminate Expensive Downtime
Immediate Time-To-Value
Summary
Executive Summary And Key Takeaways
In the era of highly-distributed, real-time applications, network issues often result in so-called “split brain” scenarios, where one or more nodes in a cluster suddenly become unresponsive. This results in data inconsistencies between distributed services that can cause cascading failures and downtime.
While the industry has turned to Reactive systems to solve issues of application/service-level resilience, elasticity, and consistency, network partitions occur outside of the area of concern addressed by these architectures.
Given the inevitability of network partitions and the impossibility of truly eradicating them, the best solution is to have predetermined strategies that fit business requirements for dealing with this recurring issue quickly and with minimal disruption to overall system responsiveness.
Lightbend Enterprise Suite, the commercial component of Lightbend Reactive Platform, offers four such strategies as part of its Split Brain Resolver (SBR) feature. These strategies, called “Keep Majority,” “Static Quorum,” “Keep Oldest,” and “Keep Referee,” can be seamlessly executed in a matter of seconds if and when network partitions occur.
Users can take advantage of the SBR feature either in development at the code level or at the production/operations level.
By avoiding data inconsistencies, failures, and downtime, users of Lightbend Enterprise Suite can better serve their customers, leading to increased retention, growth, and market share.
The Question Is When, Not If, The Network Will Fail
“The network is reliable.”
One Of The Fallacies Of Distributed Computing
Network issues are unavoidable in today’s complex environments. To put it more colloquially: networks are flaky. Most users are aware of this and even accustomed to it, and are willing to handle random unresponsiveness from time to time. However, user tolerance for network flakiness has limits. If a specific website or app repeatedly experiences problems, patience wears thin. As network issues mount, it becomes increasingly likely that a user will interact with the offending website or app less often or even abandon it altogether.
This is not to say that users are the only ones impacted by network issues. In a world full of APIs and interconnected systems, network problems affecting one system can easily impact other connected or dependent systems. Users interacting with one of those applications through a front-end will likely notice the problem within a short amount of time, but in some cases, it might take many hours or even days for such problems to become apparent. For the sake of convenience, this paper will use user experience to highlight the pernicious effects of network issues.
Most websites and apps access a database or have some form of data persistence layer. Communication from the application layer to the persistence layer is often over a network. Thus, for the duration of network issues, problems, and outages, the application becomes unable to perform its normal duties and user experience starts to suffer.
Network problems can also span a variety of locations. They can be widespread across the entire network, occur locally within data centers, or even arise in a single router or on-premise server. A complete outage is the nightmare scenario, but even small network hiccups can result in lost revenue. For example, high network traffic often creates very slow response times. In many cases, these slowdowns are actually worse than broken connections because, even with proper monitoring tools, the offending issue is non-obvious and difficult to diagnose and fix.
Distributed Systems Raise Network Complexity
With distributed systems, various application components (e.g., individual microservices and Fast Data pipelines) communicate with each other via some form of messaging. One component asks another component for some information. A component may communicate a variety of information to other components.
Figure 1 - Component Messaging
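To make this messaging style concrete, here is a minimal Scala sketch of one component asking another for information and reacting to the reply asynchronously. The component and message names (PricingComponent, StockRequest) and the MessageBus trait are hypothetical placeholders for whatever messaging layer (Akka, Kafka, HTTP, and so on) a real system would use.

```scala
// Hypothetical sketch of message-driven components: one component asks
// another for information and reacts to the reply asynchronously.
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Messages exchanged between components
final case class StockRequest(sku: String)
final case class StockReply(sku: String, unitsAvailable: Int)

// Stand-in for the messaging layer that sits between the components
trait MessageBus {
  def ask(msg: StockRequest): Future[StockReply]
}

// The "pricing" component asks the "inventory" component for information.
class PricingComponent(bus: MessageBus) {
  def quote(sku: String): Future[BigDecimal] =
    bus.ask(StockRequest(sku)).map { reply =>
      // React to the answer when (and if) it arrives; the network sits between
      // the two components, so the Future may also fail or time out.
      if (reply.unitsAvailable > 0) BigDecimal(10) else BigDecimal(15)
    }
}
```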
Given that, distributed applications can serve to make network issues better or worse, depending on how well they are designed. Poorly-designed systems crumble when network problems occur. Well-designed systems recover gracefully when the impacted components stop responding or respond very slowly. The latest breed of distributed systems is designed from the ground up to be fully prepared for inevitable network issues. These systems typically have well-defined default or compensating actions that activate when needed. This allows the overall application to continue to function for users even when an application component stops working. This new breed of systems is known as Reactive systems.
Reactive Systems Can Heal Themselves, But Not Network Partitions
Reactive systems are designed to maintain a level of responsiveness at all times, scale elastically to meet fluctuations in demand, and remain highly resilient against failures with built-in self-healing capabilities. Lightbend, a leader in the Reactive movement, codified the Reactive principles of responsiveness, resilience, and elasticity, all backed by a message-driven architecture, in the Reactive Manifesto in 2013. Since then, Reactive has gone from being a virtually unacknowledged technique for building applications, used by only fringe projects within a select few corporations, to becoming part of the overall platform strategy for some of the biggest companies in the world.
Compared to a traditional system, in which small failures can cause a system-wide crash, Reactive systems are designed to isolate the offending application or cluster node and restart a new instance somewhere else. However, at the overall network level, which may span the entire globe, there exists a fundamental problem with network partitions in distributed systems: it is impossible to tell whether an unresponsive node is the result of a partition in the network (known as a “split brain” scenario) or of an actual machine crash.
Network partitions and node failures are indistinguishable to the observer: a node may see that there is a problem with another node, but it cannot tell whether that node has crashed and will never be available again or whether there is a network issue that might heal after some time. Processes may also become unresponsive for other reasons, such as overload, CPU starvation, or long garbage collection pauses, leading to further confusion.
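One way to make this point concrete is to model it in types. In the illustrative Scala sketch below (not an Akka API), every possible cause on the remote side collapses into the single observation a monitoring node can actually make: the peer is unreachable.

```scala
// What actually happened on the remote node: never directly observable.
sealed trait ActualCause
case object MachineCrashed     extends ActualCause
case object NetworkPartitioned extends ActualCause
case object LongGcPause        extends ActualCause
case object CpuStarvation      extends ActualCause

// What a monitoring node can actually observe about a peer.
sealed trait Observation
case object Reachable   extends Observation
case object Unreachable extends Observation

object FailureDetectorView {
  // Every possible cause collapses to the same observation, which is exactly
  // why a node cannot tell a crashed peer from one behind a partition.
  def observe(cause: ActualCause): Observation = Unreachable
}
```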
As such, even the most well-designed Reactive systems require additional tooling to quickly and decisively tackle large-scale network issues. The next section explores how networking problems, in particular network partitions, impact Reactive systems.
The Problem
“The network is homogeneous.”
One Of The Fallacies Of Distributed Computing
In Reactive systems, challenges arise when heterogeneous software components, such as collaborating groups of individual microservices, exchange important messages with each other. Important messages must be delivered and processed, and any failure to deliver and process an important message will result in some form of inconsistent state.
When the network fails in a distributed system environment, it effectively creates a partition between the systems on each side of the network outage. In most cases, the network has failed while all of the systems are still running. The systems on each side of the network outage can no longer communicate across the partition. It is as if an impenetrable wall has been placed between the systems on both sides of the network outage. This is known as a split brain scenario.
Figure 2 - Network Partition
As shown in Figure 2, the network between the left and right nodes is broken. The connections between the nodes on each side of the partition are cut.
To illustrate the impact of network partitions, let’s consider two examples.
In the first one, we’ll look at an order processing system consisting of just two microservices: order and customer. The responsibility of the order microservice is to create new orders, and the customer microservice is responsible for reserving customer credit.
When users interact with this system and place an order, the order service creates a new order and sends an order created message to the customer service. The customer service receives the order created message and reserves the credit. It then sends a customer credit reserved message back to the order service. The order service receives the message and changes the order state to approved.
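Assuming a simple point-to-point messaging layer, the happy-path protocol just described might be sketched as follows. The message and service names mirror the example; the Transport trait and the destination names are hypothetical.

```scala
// Protocol messages exchanged between the order and customer microservices.
final case class OrderCreated(orderId: String, customerId: String, amount: BigDecimal)
final case class CreditReserved(orderId: String)

// Hypothetical one-way transport; in a real deployment this call crosses the network.
trait Transport {
  def send[A](destination: String, message: A): Unit
}

// Naive customer service: reserves credit every time an OrderCreated arrives.
class CustomerService(transport: Transport) {
  def onOrderCreated(msg: OrderCreated): Unit = {
    reserveCredit(msg.customerId, msg.amount)
    transport.send("order-service", CreditReserved(msg.orderId))
  }
  private def reserveCredit(customerId: String, amount: BigDecimal): Unit =
    println(s"reserved $amount of credit for customer $customerId")
}

// Order service: sends the order and approves it once credit is confirmed.
class OrderService(transport: Transport) {
  def placeOrder(order: OrderCreated): Unit =
    transport.send("customer-service", order)
  def onCreditReserved(msg: CreditReserved): Unit =
    println(s"order ${msg.orderId} approved")
}
```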
Let’s now consider the impact of a network partition on this system.
To begin with, the order service sends the customer service an order created message. The customer service then receives the message and reserves the credit as it should.
Figure 3 - Send Message Successfully
The customer service then attempts to send a credit reserved message back to the order service. But suddenly the customer service falls off the network and the message cannot be sent.
Figure 4 - Send Message Fails
The order service never hears back from the customer service, so it resends the order created message. It receives no response, so it retries repeatedly.
Figure 5 - Message Send Retry Loop
While the order service is caught in this retry loop, the network detects that the customer service is offline and efforts begin to bring it back online. When that eventually occurs, the order service successfully sends the order created message and the customer service receives it. But a naive implementation of the customer service would then reserve the credit again, which is the incorrect course of action.
Figure 6 - Message Sent Again
As demonstrated by this example, in the absence of a network partition handling strategy, businesses must make sure to incorporate a robust at-least-once delivery mechanism into the design.
Unfortunately, implementing such a mechanism is not trivial. For example, the common retry loop approach is brittle and has a number of complexities that, if not handled properly, will leave the system in an inconsistent state. In these circumstances, and many others that are beyond the scope of this paper, the only viable option is to have an effective network partition / split brain resolution strategy.
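As one illustration of the kind of machinery involved, the sketch below (reusing the hypothetical Transport and message types from the earlier snippet) makes the customer service idempotent: it remembers which order IDs it has already processed, so a redelivered order created message does not reserve credit twice. Even with this in place, a partition resolution strategy is still needed.

```scala
import scala.collection.mutable

// Idempotent customer service: at-least-once delivery may redeliver the same
// OrderCreated, so we deduplicate by orderId before reserving credit.
class IdempotentCustomerService(transport: Transport) {
  private val processed = mutable.Set.empty[String] // in production: durable storage

  def onOrderCreated(msg: OrderCreated): Unit = {
    if (!processed.contains(msg.orderId)) {
      processed += msg.orderId
      reserveCredit(msg.customerId, msg.amount)
    }
    // Always re-acknowledge, even for duplicates, so the sender stops retrying.
    transport.send("order-service", CreditReserved(msg.orderId))
  }

  private def reserveCredit(customerId: String, amount: BigDecimal): Unit =
    println(s"reserved $amount of credit for customer $customerId")
}
```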
In the second illustrative example, we will use an in-person meeting with a group of seven coworkers.
To begin with, all seven members of the meeting are freely communicating back and forth.
Figure 7 - People in Meeting
Suddenly, a wall appears that splits the meeting in two, dividing the coworkers into one group of four people and another of three. The wall is solid and soundproof, preventing any communication between the two groups. No one on either side of the wall can ascertain what the other group is planning, making collaboration impossible.
Figure 8 - Wall Splits the Group
Let’s say that this is a very important meeting and it must continue, regardless of the presence of a giant impenetrable wall. What should each group do? Both groups could sit idly and wait for the wall to disappear. Or they could try to determine a strategy that would get the entire group back together.
Most responses to this situation would result in some confusion and disruption to the meeting. What if each group decided to continue the meeting on its own? That could result in decisions being made by the two split groups based on incomplete information, information that is only known to the people on the other side of the wall.
A better approach is for the smaller group to leave its side of the wall and rejoin the larger group, as shown in Figure 9.
Figure 9 - Group B Joins Group A
A key point here is that both groups had to independently arrive at the same conclusion: the majority stays where they are and the minority group moves to rejoin the majority. This works for an odd number of people, but what if there is an even split? Say there were eight in the meeting and the split was four and four. In this situation, tie-breaking rules could help. Perhaps the plan is to go to the side with the highest-ranking employee.
Returning to the first example with the order and customer microservices, we see that a network partition that cuts off communication between them will interrupt the normal order processing workflow.
Figure 10 - Network Partition Between Services
When that happens, just as with the meeting room example, each side needs to independently detect and resolve the issue. In other words, the system must be capable of running on each side of the partition, detecting that there is a problem, and deciding which side stays up and which side shuts down. The winning side should also be capable of restarting all the processes that were running on the losing side.
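The decision each side must reach can be thought of as a pure function over cluster membership. The Scala sketch below assumes every node knows the full member list and applies a simple keep-majority rule with a deterministic tie-break (lowest node address survives), so both sides of the partition independently compute consistent answers. It illustrates the idea only and is not Lightbend's implementation.

```scala
// Which side of a partition should keep running? Each side computes this
// independently from the membership it can still reach.
sealed trait Decision
case object Survive  extends Decision // keep running and take over the other side's work
case object DownSelf extends Decision // shut down the services on this side

final case class Node(address: String)

object PartitionDecision {
  def decide(allMembers: Set[Node], reachableFromHere: Set[Node]): Decision = {
    val mySide    = reachableFromHere.size
    val otherSide = allMembers.size - mySide
    if (mySide > otherSide) Survive
    else if (mySide < otherSide) DownSelf
    else {
      // Even split: break the tie deterministically so both sides agree,
      // here by letting the side holding the lowest address survive.
      val lowest = allMembers.map(_.address).min
      if (reachableFromHere.exists(_.address == lowest)) Survive else DownSelf
    }
  }
}
```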
Figure 11 - Cluster Network Partition
Now consider a more realistic example. Say there are 20 microservices running on a cluster of five nodes (a node, in this case, could be a real server or VM), with four microservices running on each node (see Figure 11 above). The partition has cut off three of the nodes on one side and two nodes on the other side of the partition.
In order to detect and resolve a network partition in an environment like this, a number of things must occur:
3. Each node must be constantly checking whether it can talk to the other nodes in the cluster (a minimal sketch of such a check appears after this list).
4. When a network partition occurs, the monitoring component on each node needs to determine which nodes are still accessible and which ones are not.
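A minimal version of the reachability check in item 3 could look like the sketch below. The Heartbeat trait stands in for whatever failure-detection mechanism the cluster actually uses (it is an assumption, not a prescribed API), and Node is the same illustrative type used in the earlier decision sketch.

```scala
import scala.concurrent.duration._

// Hypothetical heartbeat mechanism: true if the peer answered within the timeout.
trait Heartbeat {
  def ping(peer: Node, timeout: FiniteDuration): Boolean
}

// Per-node reachability check (item 3 above), run on a fixed schedule.
final class ReachabilityMonitor(self: Node, peers: Set[Node], heartbeat: Heartbeat) {
  @volatile private var reachable: Set[Node] = peers + self

  def checkOnce(): Set[Node] = {
    val answered = peers.filter(p => heartbeat.ping(p, 500.millis))
    reachable = answered + self
    // Item 4: this set is what each node feeds into the partition decision
    // (e.g. PartitionDecision.decide above).
    reachable
  }
}
```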
In an environment where all of the above is in place, here is what should happen when the network partition occurs: the node-level monitoring detects that three nodes can still communicate with each other on one side of the partition, and the other two nodes detect that they can only communicate with each other.
Figure 12 - Split Brain Recovery
The two nodes on the minority side each shut down the microservices that are currently running on those nodes. The three nodes on the majority side move the eight processes that were running on the minority side over and begin to run them, provided there is sufficient capacity on the majority side. See Figure 12 above.
If there is insufficient capacity to host these additional microservices, it will be necessary to add one or more nodes to the majority side without administrator intervention, which in turn requires that the system be able to scale automatically.
In short, based on the environment, its capabilities, and any constraints it is subject to, there can be multiple strategies for resolving issues caused by network partitions. The next section explores four such strategies.
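For readers who want a preview of what selecting a strategy looks like in practice, the snippet below shows how the open-source Akka Split Brain Resolver (Akka 2.6 and later) is enabled and pointed at the keep-majority strategy. The configuration keys shown belong to that open-source module and are included only as an illustration; the Lightbend Enterprise Suite SBR described in this paper is configured through its own provider as documented by Lightbend.

```scala
import com.typesafe.config.ConfigFactory

// Illustrative only: enabling the open-source Akka Split Brain Resolver
// (Akka 2.6+) and selecting the keep-majority strategy via configuration.
object SbrConfigSketch {
  val sbrConfig = ConfigFactory.parseString(
    """
    akka.cluster {
      downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
      split-brain-resolver {
        active-strategy = keep-majority   # alternatives include static-quorum and keep-oldest
        stable-after    = 20s             # wait for the cluster view to stabilize before acting
      }
    }
    """)
}
```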