
Ted Dunning and Ellen Friedman

Data Where You Want It

Geo-Distribution of Big Data and Analytics

Beijing  Boston  Farnham  Sebastopol  Tokyo


Data Where You Want It

by Ted Dunning and Ellen Friedman

Copyright © 2017 Ted Dunning and Ellen Friedman. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt

Copyeditor: Holly Bauer Forsyth

Interior Designer: David Futato

Cover Designer: Randy Comer

February 2017: First Edition



Table of Contents

Why Geo-Distribution Matters
Goals of Modern Geo-Distributed Systems
Moving Data: Replication and Mirroring
Clouds and Geo-distribution
Global Data Governance
Containers for Big Data
Use Case: Geo-Replication in Telecom
It’s Actually Broken Even If It Works Most of the Time
Use Case: Shared Inventory in Ad Tech
Additional Resources


Why Geo-Distribution Matters

“Data where you want it; compute where you need it.”

Thirty years ago, if someone in North America or Europe mentioned they had accepted a position at a Tokyo-based firm, your first question likely would’ve been, “When are you relocating to Tokyo?” Today the question has become, “Are you planning to relocate?” Remote communication combined with frequent flights has left the question open in global business teams.

Just as we now think differently about how people work together, so too a shift is needed in how we build and use global data infrastructure in order to address modern business challenges. We need systems that allow data to live where it should. We should be able to think of data—on premise or in the cloud—as part of a global system. In short, our data infrastructure should give us data that can be accessed where, when, and by whom we want, and not by anyone else.

The idea of working with data centers that are geographically distant isn’t new. There’s an emerging need among many big data organizations for globally distributed but still cohesive data that meets the challenges of consistency, low latency, ease of administration, and low cost, even at huge scale.

In the past, many people built organizations around a central headquarters plus regional teams that functioned independently, each with its own regional database, possibly reporting back to HQ monthly or weekly. Data was copied and backed up at a remote location for disaster recovery, typically daily, particularly if the data was critical.


But these hierarchical approaches aren’t good enough anymore. With a global view via the Internet, people expect to touch and respond to business from anywhere, at any time. To really get this done, you need data to reside in many places, with low latency coordination of updates, and still be robust against communication failures on the Internet. Data infrastructure often needs to be shared by many applications. We may need a data object to live in more than one place—that’s geo-distributed data. This includes cloud computing, because cloud is really just one more “place,” as suggested by Figure 1-1.

Figure 1-1. Emerging technologies address the need for data that can be shared and updated globally, at massive scale, with very low latency and high consistency. There’s also a need for fine-grained control over who has access. Here, on-premise data centers are shown as rectangles that share data directly to distant locations or form a hybrid system with public or private cloud.

The challenges posed by the required scale and speed alone are substantial. For example, IoT sensor data systems commonly move data at a rate of tens of gigabits per second, and some exceed terabits per second.

The need for truly geo-distributed data—both in terms of storage and computation—requires new approaches, and these new approaches are the topic of this report. They have emerged bit by bit rather than all at once, but the accumulated change is now large enough to warrant a fresh look even from experienced practitioners.

Previously, systems that needed data to be available globally would have explicitly copied data from place to place instead of using platform-level capabilities to automatically propagate changes. In practice, however, it pays to think of data as a global system in which the same data objects are shared across different locations. This geo-distribution of data, combined with appropriate design patterns, can make it much simpler to build applications and can result in more reliable, scalable systems that span the globe. Similarly, the management of computation in global systems has historically been very haphazard. This is improving with containers, which allow precise deployment of a precisely known distribution of code.

In this report, we describe the requirements and challenges of such a system as well as examples of specific technologies designed to meet them. We also include a collection of real-world use cases where low-latency geo-distribution of very large-scale data and computation provides a competitive edge.

Goals of Modern Geo-Distributed Systems

As we examine some of the emerging technologies that address the need for geo-distributed systems, keep these goals in mind. Many modern systems need to take into account the facts that:

• Data comes from many sources, in many forms, and from many locations.

• Sometimes data has to stay in a particular location—for example, to meet government regulations or to optimize performance.

• In many other cases, a global view is required—for example, for reporting or machine learning in which global models are built from much more than just local activity.

• Central data often needs to be distributed in order to be shared, as for inventory control, model deployment, accurate predictive maintenance for large-scale industrial systems, or for disaster recovery.

• Computation (and the associated data) sometimes needs to be near the edge, such as in industrial control systems, and simultaneously in a central analytics facility that has access to data from across the entire enterprise.

• To stay competitive in modern settings, data agility and a microservices approach may be required.

With these demands, how do new technologies meet the challenges?

Global View: Data Storage and Computation

One reason for geo-distribution of data is to put data at a remote site as part of a disaster recovery plan, but movement of data between active data centers is also key for efficient use of data in many situations. It’s generally more efficient to access data locally, so storing the same data in multiple places and being able to replicate updates with low latency are valuable capabilities. One key is to be able to specify how and where data should end up without saying exactly which bits to move. We discuss new options for data movement in the next section of this report. Although local access is generally desirable, you should have the choice of accessing data remotely as well, and we discuss some of the ways that can be done more easily in “Global Data Governance”.

Data movement is the most important issue in geo-distributed systems, but computation is a factor, too. This is especially true where very high volume and rate of data production occurs, as with IoT sensor data. Edge computing is becoming increasingly important in such situations. In the free report Data: Emerging Trends and Technologies, Alistair Croll refers to “…a renewed interest in computing at the edges—Cisco calls it ‘fog computing…’” An application that has been developed and tested at a central data center needs to be deployed to multiple locations. How can you do this efficiently and with confidence that it will perform as needed at the new locations? We address that topic in “Containers for Big Data”.

Moving Data: Replication and Mirroring

Remote mirroring and geo-distributed data replication can be done in a variety of ways. Traditional methods for data movement were not designed to provide the large scale, low latency, and low cost that modern systems demand. We’ll examine capabilities needed to do this efficiently, the challenges in doing so, and how some emerging technologies begin to address those challenges.


Remote Mirroring

One of the most basic reasons for moving data across a distance is to mirror data to a cluster at a remote site as part of a disaster recovery plan. Data mirroring is generally a scheduled event. In efficient systems, after the initial mirror is established, subsequent mirrors are incremental. This incremental updating is particularly desirable for large-scale systems because moving fewer bytes decreases the time required and chance for error.

It’s also useful to guarantee that the mirror copy is a fully consistent image of the source. That is to say that if multiple files are changing in the source unit being mirrored, there needs to be a mechanism to ensure that the mirror copy reflects the exact state of the source at a point in time, rather than different states for different files at different times as the mirroring was done.

An example of how this can be accomplished efficiently is seen in the design of the mirroring process of the MapR Converged Data Platform. Mirroring is done at the level of a volume, which acts like a directory and can contain files, directories, NoSQL database tables, and message streams. First, point-in-time snapshots of the source and destination volumes are made. Blocks in the source that have changed since the last mirror update are transferred to the destination snapshot. Once all changed blocks have been transferred, the destination mirror and the destination snapshot are swapped in a single atomic operation. The use of a snapshot also means that the source volume remains available for reads and writes throughout the mirroring process. The effect is point-in-time updates to the mirror destination.
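The same pattern can be expressed compactly in code. The sketch below is only an illustration of snapshot-based incremental mirroring under simplified, in-memory assumptions; it is not the MapR implementation, and the Volume and Mirror classes are invented for the example.

```python
"""Minimal sketch of snapshot-based incremental mirroring.
Hypothetical in-memory 'volumes' stand in for real storage; this is not the
MapR implementation, only the shape of the algorithm described above."""

import copy


class Volume:
    def __init__(self, blocks=None):
        self.blocks, self.versions, self.clock = {}, {}, 0
        for block_id, data in (blocks or {}).items():
            self.write(block_id, data)

    def write(self, block_id, data):
        self.clock += 1
        self.blocks[block_id] = data
        self.versions[block_id] = self.clock

    def snapshot(self):
        # A point-in-time, read-only copy; the live volume stays writable.
        return copy.deepcopy(self)


class Mirror:
    def __init__(self):
        self.image = {}        # the published, consistent copy
        self.last_clock = 0    # how far the mirror has caught up

    def update_from(self, source):
        snap = source.snapshot()              # freeze the source once
        staging = dict(self.image)            # build the next image off to the side
        for block_id, version in snap.versions.items():
            if version > self.last_clock:     # only blocks changed since the last mirror
                staging[block_id] = snap.blocks[block_id]
        # Publish in a single step so readers never see a half-updated image.
        self.image, self.last_clock = staging, snap.clock


if __name__ == "__main__":
    vol, mirror = Volume({"a": b"1", "b": b"2"}), Mirror()
    mirror.update_from(vol)          # initial mirror copies every block
    vol.write("a", b"1-updated")     # only block "a" changes afterward
    mirror.update_from(vol)          # incremental update moves just that block
    print(mirror.image)              # {'a': b'1-updated', 'b': b'2'}
```

The key design point is the final single-step publish, which is what gives the destination a consistent point-in-time view even though individual blocks arrive over a period of time.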

Remote Replication

Another common requirement with geo-distributed data is to update a remote database with near–real time replication. Traditionally, this was often done by exporting report tables in regional data centers and then copying them to central centers to be imported into a central database. To get data on a more real-time basis, change data capture (CDC) systems were used to set up master-slave replication of databases. This required copying a snapshot of the local table to the remote system followed by starting a log-shipping system to copy updates to the remote replica. The initial copy could take considerable time, so several rounds of incremental copying might be needed to get an up-to-date clone of the database. Setting these systems up has traditionally been a bit tricky, and they are often difficult to keep up.
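A rough sketch of that catch-up process looks like the following. The table and change log here are plain Python structures invented for illustration; real CDC and log-shipping systems differ in many details, but the shape of an initial bulk copy followed by repeated incremental rounds is the same.

```python
"""Sketch of the snapshot-plus-log-shipping pattern for database replication.
The 'database' is a dict plus an append-only change log, both invented for
illustration; real CDC systems are considerably more involved."""


def replicate(source_table, change_log, snapshot_position, max_lag=0):
    """Clone `source_table`, then catch up by applying the change log."""
    replica = dict(source_table)      # 1. bulk copy of the initial snapshot
    applied = snapshot_position       # log position captured with the snapshot
    # 2. Ship the log in rounds: each round applies whatever accumulated while
    #    the previous round (or the slow initial copy) was running. In a live
    #    system this repeats until the remaining lag is small enough to switch
    #    over to continuous streaming of updates.
    while len(change_log) - applied > max_lag:
        for key, value in change_log[applied:]:
            replica[key] = value
        applied = len(change_log)
    return replica, applied


if __name__ == "__main__":
    table = {"order:1": "new"}
    log = [("order:1", "paid"), ("order:2", "new")]   # changes after the snapshot
    replica, position = replicate(table, log, snapshot_position=0)
    print(replica)    # {'order:1': 'paid', 'order:2': 'new'}
```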

Increasingly, however, systems need to respond to changing situations in seconds or milliseconds. Similarly, it isn’t just a matter of moving data to the center; data has to flow outward as well. Likewise, substantial amounts of computation need to happen at the edge, near the source of data.

These new requirements are increasing the complexity of the replication patterns, and that is making it harder to maintain data consistency in these systems. Also, we want to be able to separate concerns, leaving content questions to application developers and data motion questions to administrators, who should not need to know much about exactly what data is being moved. Systems also must be resilient to interruptions caused by network partitions (and maintain consistency as much as possible), so we generally want multi-master replication, where updates can be made to different copies of the data. There’s also the issue of near–real time replication of data objects other than databases, such as message streams. These capabilities are just beginning to be available in some emerging big data technologies, as we discuss later in the report.

Why Multi-Master Geo-Replication Matters

Multi-master, bi-directional data replication across data centers can reduce latency, simplify architectures, and protect against data loss due to network issues, as shown in Figure 1-2. This style of replication allows faster access to data, but replication loops must be avoided.

If a system does not have a built-in way to detect and break loops, manual effort is required to change replication patterns when failures happen in order to maintain consistency. In MapR table replication, for example, updates remember their origin so that update loops can be avoided. Table 1-1 shows some benefits of multi-master updates.


Figure 1-2. Advantages of multi-master geo-replication. In master-slave replication (A), data sources only write to one master and rely on replication to move data to the other location. But a network partition could easily prevent insertion, resulting in data loss. In multi-master replication (B), data is ingested near the source with less chance for loss. Bi-directional replication updates both databases in near–real time.

Table 1-1. Types of geo-distributed replication for different tools


Conflict Resolution: The Question of Consistency

Achieving consistency in a distributed system while maintaining performance (low latency) is not easy. In addition, maintaining availability requires some compromises. Consider two approaches taken by big data technologies that offer multi-master replication: Cassandra and MapR Converged Data Platform.

Eventual consistency: Cassandra

Cassandra deals with the tradeoffs between consistency, availability, and performance by allowing replicas of data to be temporarily inconsistent in order to increase availability of the database. A configurable option allows you to specify how many replicas must be written as well as how many must be read. This is true for local clusters or with geo-replication. With the default setting, these numbers are set low to improve performance, at some cost in consistency. It is assumed that in practice, applications will use parameters such that inconsistencies will be detected on read and corrected—providing eventual consistency.
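For concreteness, here is roughly how per-request consistency levels are set with the DataStax Python driver for Cassandra. The cluster addresses, keyspace, and table are placeholders, and the appropriate levels for a given replication factor should be taken from the Cassandra documentation rather than from this sketch.

```python
# Example of tunable, per-request consistency with the DataStax Python driver.
# Addresses, keyspace, and table names are placeholders for illustration.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("my_keyspace")

# Require a quorum of replicas to acknowledge the write...
write = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, "hello"))

# ...and read from a quorum as well, so read and write replica sets overlap.
# Weaker levels such as ONE lower latency and raise availability at some cost
# in consistency; in multi-datacenter clusters, LOCAL_QUORUM keeps the quorum
# within the local data center to avoid cross-DC round trips.
read = SimpleStatement(
    "SELECT payload FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, (42,)).one()
```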

Part of the difficulty with this approach is the partial dependency on the correct configuration for consistency: you have to configure the applications correctly or it may not work. For this reason, the default setting may be considered somewhat dangerous, and if data loss is a particular concern, then non-default settings are preferred (see the Cassandra documentation or testing sites such as Aphyr.com for more in-depth explanations). The goal of Cassandra’s design is to generally prioritize availability over consistency.

Strict local consistency: MapR Converged Data Platform

MapR’s Converged Data Platform takes a different tack by making consistency within a data center non-negotiable while allowing table replication to fall behind on network failures. Conflict resolution between delayed updates from different sources is done at the platform level using timestamp resolution for updates—the last write wins. Replication between tables is achieved at the lowest level by shipping logs to replicas, but the origin of each update is recorded so that loops in the replication pattern can be broken.
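The two mechanisms described here, last-write-wins timestamps and origin tags that break replication loops, can be sketched independently of any particular platform. The toy replica below is an illustration under simplified assumptions, not the MapR implementation.

```python
"""Toy multi-master replica illustrating last-write-wins conflict resolution
and loop avoidance via origin tags. Illustrative only; not the MapR design."""

import time


class Replica:
    def __init__(self, name, peers=None):
        self.name = name
        self.peers = peers or []          # other replicas we ship updates to
        self.store = {}                   # key -> (timestamp, origin, value)

    def put(self, key, value):
        """A local write: stamp it with a timestamp and this replica's name."""
        self._apply(key, (time.time(), self.name, value))

    def _apply(self, key, update):
        timestamp, origin, value = update
        current = self.store.get(key)
        # Last write wins: keep the update only if it is newer than what we have
        # (ties broken by origin name so every replica converges the same way).
        if current is None or (timestamp, origin) > (current[0], current[1]):
            self.store[key] = update
            self._replicate(key, update)

    def _replicate(self, key, update):
        origin = update[1]
        for peer in self.peers:
            # Loop avoidance: never ship an update back to the replica that
            # originated it, so bi-directional links do not echo forever.
            if peer.name != origin:
                peer._apply(key, update)


if __name__ == "__main__":
    east, west = Replica("east"), Replica("west")
    east.peers, west.peers = [west], [east]
    east.put("user:42", "green")
    west.put("user:42", "blue")           # the later write wins on both replicas
    print(east.store["user:42"][2], west.store["user:42"][2])  # blue blue
```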

To achieve strict local consistency, there can be a slight reduction in availability (possibly seconds per year) if machines or local network links fail, which may not be a concern.


Beyond Database Replication: Streaming Data

The increasing interest in streaming data from continuous events, such as IoT sensor data or clickstream data from web traffic, underlines the need for easy, reliable geo-distributed replication of message streams. As with real-time replication of database updates, it’s preferable to have the capability for multi-master stream replication across data centers.

Apache Kafka is a popular tool for stream transport. Streaming data is assigned to a topic, which in turn is divided into partitions for load balancing and improved performance. The order of messages is maintained within each topic partition. Kafka is generally run on a separate cluster from the stream processing technologies, such as Apache Spark Streaming, Apache Apex, or Apache Flink, as well as from the main data persistence. In the Kafka cluster, a single topic partition replica must fit on a single machine, but multiple partitions generally reside on each machine.
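To make the topic and partition vocabulary concrete, here is a minimal producer example using the kafka-python client; the broker address, topic, and key are placeholders. Messages that share a key are hashed to the same partition, which is how per-partition ordering is preserved.

```python
# Small Kafka producer example using the kafka-python client.
# Broker address, topic, and key are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1:9092")

# All messages with the same key are hashed to the same partition of the
# "sensor-readings" topic, so readings from one device stay in order;
# different keys spread across partitions for load balancing.
for reading in [b"21.5", b"21.7", b"21.6"]:
    producer.send("sensor-readings", key=b"device-17", value=reading)

producer.flush()   # block until the brokers have acknowledged the sends
```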

Kafka addresses the need to move data between data centers via a program called MirrorMaker that consumes messages from a topic and re-publishes them to a remote Kafka cluster. MirrorMaker can’t replicate a single topic bi-directionally (so no multi-master replication), and message offsets are not maintained between copies.

Another technology for streaming data message transport is MapR Streams. The MapR Converged Data Platform supports streams, directories, files, and tables in the same system. Typically, the same cluster is used for both data persistence and stream processing applications (such as Spark, Apex, Flink, etc.). MapR Streams supports the Kafka API, and like Kafka, streaming data is assigned to a topic that has topic partitions. Unlike Kafka, a topic partition in MapR is distributed across the cluster. Furthermore, with MapR, many topics are collected together into an object known as a Stream. Policies such as time-to-live, data access control via ACEs (Access Control Expressions), or geo-replication are applied at the Stream level.

Like MapR direct table replication, geo-replication of MapR Streams is near–real time, bi-directional, and multi-master with loop avoidance. The correct sequence of messages in a topic partition is maintained during geo-replication; message and consumer offsets are preserved. This is true for replication between on-premise clusters in distant data centers as well as between an on-premise cluster and a cloud cluster for hybrid cloud streaming architecture. Other examples of geo-replication of MapR Streams are found in the telecom and ad-tech use cases at the end of this report.

Clouds and Geo-distribution

Cloud computing is one of the major forces driving common adoption of geo-distributed computing. The simple reason is that public clouds allow just about anybody to have multiple data centers. And once you have multiple data centers, the practical choice is to either have good support for geo-distributed data or wind up with data chaos.

The Core Trend to Cloud

The reason public clouds lead to multiple data centers is that it has always been desirable to have multiple data centers (for disaster recovery if for no other reason), and public clouds make it easier to have them, since you don’t have to go to unfamiliar places to provision the hardware or staff them. Global business presence and widely distributed sources of data make multiple data centers even more attractive, since you can have data close to your users, thus making it easier for them to get data as well as provide it. In contrast, having a single centralized data center introduces long latencies and decreases reliability due to the distance that data must traverse.

Not all clouds are public clouds. The point of cloud computing doesn’t require that the cloud be run by an external team. The core idea is to treat computing as a fungible commodity that can quickly and efficiently be repurposed to different needs using virtualization or container technology. More and more companies are reorganizing their on-premise computing as private clouds to increase resource utilization. There are also specialized clouds available that meet special needs such as heightened levels of security, telecommunications support, or prebuilt healthcare systems.

Cloud Neutrality

As more options become available for cloud computing, both publicly and privately, the concept of cloud-neutrality is becoming very important. The idea is that having multiple cloud options only makes a difference if you can change your mind and aren’t completely locked into just one of them. If your applications are cloud-neutral, then you can take advantage of the competition between various public cloud vendors and in-house facilities by using each kind just for what they do best. In addition, having all of your cloud-based computation handled by a single vendor runs the risk of catastrophic failures due to a common platform vulnerability or a coordinated failure. Spreading critical functions across multiple vendors can mitigate this risk.

The common result of all this is that many large companies have one or more on-premise data centers and are increasingly looking to also have significant presence in the clouds provided by one or even multiple vendors.

Cloud Bursting and Load Leveling

Commercial clouds have a very different cost model (by the hour or minute) than the fixed-asset depreciation model of on-premise data centers or private clouds. This difference means that commercial clouds can be very cost effective for loads that have a high peak-to-valley ratio, or that are intermittent with a low duty cycle. On-premise systems are often much more cost effective for relatively constant processing loads with high utilization. Architectures such as the one shown in Figure 1-3 can make use of the arbitrage between these cost models by pushing the variable compute loads into a commercial cloud and retaining constant loads in-house.


Figure 1-3. This diagram shows how a variety of compute models can be used together if geo-distributed data movement is available. An on-premise data center can auto-magically replicate data to a core cloud-based cluster. That core cluster can burst to larger sizes when necessary in response to changing compute needs. Similar techniques can be used to extend data structures to or from a private cloud.

During peak loads, additional cloud resources can be recruited to handle the higher loads (a temporary cloud burst), then these resources can be released as the peak drops. A core cloud cluster typically remains after the burst to provide a durable home for any data required for the next burst.

The residue of work that remains after off-loading variable loads into a commercial cloud is a very steady compute load that makes ideal use of the cost advantages of on-premise or private cloud computing. The overall cost savings can easily be 2:1 or more relative to a pure on-premise or a pure commercial cloud strategy. Making this hybrid cloud architecture work, however, requires the ability to replicate data between private cloud, on-premise, core cloud, and burst compute systems. Without good platform support, this can be a show stopper for these architectures. Strict cloud neutrality of at least some applications is also very helpful.
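A back-of-the-envelope calculation shows where a figure like 2:1 can come from. All of the prices and load numbers below are invented for illustration; the point is simply that amortized hardware suits the steady base load while hourly cloud pricing suits the rare peaks.

```python
# Back-of-the-envelope cost comparison for hybrid cloud bursting.
# All prices and load numbers are invented for illustration only.
STEADY_NODES = 100         # constant baseline load, runs 24x7
PEAK_EXTRA_NODES = 300     # additional nodes needed only during peaks
PEAK_HOURS_PER_YEAR = 500  # total duration of the peaks
HOURS_PER_YEAR = 8760

ON_PREM_NODE_HOUR = 0.05   # amortized hardware + ops cost per node-hour
CLOUD_NODE_HOUR = 0.25     # on-demand cloud price per node-hour

# Pure on-premise: must provision for the peak, even though peaks are rare.
pure_on_prem = (STEADY_NODES + PEAK_EXTRA_NODES) * HOURS_PER_YEAR * ON_PREM_NODE_HOUR

# Pure cloud: pay on-demand rates for everything, including the steady base.
pure_cloud = (STEADY_NODES * HOURS_PER_YEAR
              + PEAK_EXTRA_NODES * PEAK_HOURS_PER_YEAR) * CLOUD_NODE_HOUR

# Hybrid: steady load stays on cheap, highly utilized on-prem hardware;
# only the bursts are paid for at cloud rates.
hybrid = (STEADY_NODES * HOURS_PER_YEAR * ON_PREM_NODE_HOUR
          + PEAK_EXTRA_NODES * PEAK_HOURS_PER_YEAR * CLOUD_NODE_HOUR)

print(f"pure on-prem: ${pure_on_prem:,.0f}")   # $175,200
print(f"pure cloud:   ${pure_cloud:,.0f}")     # $256,500
print(f"hybrid:       ${hybrid:,.0f}")         # $81,300  (roughly 2:1 vs. on-prem)
```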
