
Andy Oram

Streaming Data

Concepts that Drive Innovative Analytics


Streaming Data

by Andy Oram

Copyright © 2019 O’Reilly Media. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Rachel Roumeliotis

Developmental Editor: Jeff Bleiel

Production Editor: Christopher Faucher

Copyeditor: Octal Publishing, LLC

Proofreader: Nan Barber

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

March 2019: First Edition

Revision History for the First Edition

2019-03-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492038092 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Data, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Mesosphere. See our statement of editorial independence.

Table of Contents

1. Streaming Data: Concepts That Drive Innovative Analytics

From Data to Insight

Example of a Complete Project

Tracing the Movements of Big Data Processing

Machine Learning

New Architectures Lead to New Development Patterns

Conclusion

CHAPTER 1

Streaming Data: Concepts That Drive Innovative Analytics

Managers and staff who are responsible for planning, hiring, and resource allocation throughout an organization need to consider the fast-growing impact of data and analytics. If this topic doesn’t fill you with excitement, you can at least take on its study out of a well-justified concern that your industry is undergoing profound change. Organizations everywhere are disrupting business, government, and society through the use of these analytics. You need to understand the field to help your organization survive—and hopefully to grow and contribute to the common good.

This report deals in particular with streaming data, a set of tools and practices for quick decision making in response to fast-changing events. Here are a few examples of how such analytics are changing businesses:

• Online services for movies and other content, such as Netflix and Amazon.com, analyze customer behavior to make better predictions, as made famous by the Netflix Prize.

• Walmart, a long-term champion of efficiency, is using these technologies to improve the customer experience in multiple ways.

• Banks are detecting fraud in the use of credit cards and mobile payments.

• John Deere classifies plants in the fields in order to reduce pesticide use, while robots record their collection of fruit to predict yields.

It is difficult to quantify the growth of streaming analytics, for many reasons. Observers don’t track it as a distinct discipline, and it involves not a simple purchase, but a continuous process of changing educational and organizational roles. Furthermore, many businesses would like to move much faster into streaming analytics but are hampered by the shortage of qualified staff. As an indication of the attention analytics in general are getting, it’s worth noting an assessment from mid-2018 by the highly respected International Data Corporation (IDC), estimating an annual growth of nearly 12% in “big data and business analytics (BDA) solutions.”

Trade publications highlight many benefits—and sometimes risks—of streaming data but offer the general reader little insight into what these things actually do and what companies need to do to make them work. On the other hand, people who try to delve into the technology quickly find themselves in a maze of overlapping programs and tools (“You should use Kubernetes to manage Docker.”). This report tries to bring streaming data into focus for the average professional reader. It uses everyday situations and terms to describe what these technologies are meant to achieve and how they achieve those ends. Along the way, the report reveals the challenges raised by trying to organize and process data. You will learn the following:

• The essential steps in turning raw data into insights

• Major flaws in the data available to most organizations and the processing required to make the data usable

• A history of the goals adopted by data-driven organizations and how tools evolved along with these goals

• Major categories of streaming data tools, with examples in each category and how they contribute to the essential goals

As a high-level introduction for people without deep knowledge of programming or digital technology, this report doesn’t delve into mathematics or programming details. I don’t recommend particular technologies, although I mention some of the popular or historically significant ones. Most of all, I don’t suggest how you can use streaming data in your organization, because that is highly dependent on your industry, the expertise of your employees, your needs, and the direction in which you want to take the organization.

From Data to Insight

We will now begin our journey toward extracting insights from data. More and more of it is being collected each year, presenting institutions with the question of which data they will find useful. Here are a few common sources of data:

Logs from computer services

Because servers usually write information about the activities they perform to a log file, these files become a valuable source of information for assessing popularity and improving content as well as for performance and security.

Postings to social media and other popular websites

For instance, you can easily ask Twitter to send you all postings that contain certain words.

Data generated by devices

This includes data generated in settings such as medical and industrial environments.

Streaming data also often operates on batches of data sent as files. Sales information and other transactions often come in this way.

Techniques for streaming data have been developed to address the vast acceleration that has taken place over the past couple of decades, both in the accumulation of data and in the speed with which business decisions are made. Information from online purchases can be incorporated into marketing campaigns within milliseconds, and walk-in retail stores are adding data from their point-of-sale systems to speed up their decisions, too. Just-in-time (JIT) manufacturing was invented in the 1970s, but has recently become far more sophisticated thanks to new data-processing techniques. Therefore, although some decisions are made in traditional ways by saving up data and processing it as a batch, this process must be supplemented by methods that incorporate recent data that has just streamed in.

The same data can be handled in a streaming and a nonstreaming manner. For instance, online sales can be used in a streaming manner to alter what appears on the home page of the retailer (“Hottest items in our catalog!”) and then be saved for use in more long-term projects such as planning the next product to make. So when I use the term “streaming data,” it applies to data that is used in a certain way—and also to the technical and organizational methods that uncover new value in that data.

Sometimes, regulations in your industry as well as general laws such as the European General Data Protection Regulation (GDPR) might place limits on how you can use data. These will have an effect on what data you exclude from consideration or the data that you de-identify or aggregate over large groups of people.

Having chosen data sources, two challenges you face at the very start are fixing quality problems with the data and extracting the useful items from the raw data. We examine those challenges in the following subsections.

Issues of Data Quality

Data scientists routinely report that 80% or even 90% of their time is spent getting data into a state where it can be used to inform business decisions. We can’t understand the reality of data analytics without grasping what a messy world we live in and how many quality problems are presented by the data that comes in for analysis. This is true whether we’re dealing with data entered by humans into databases, data collected from devices in the Internet of Things (IoT), or data left as a trail when people visit websites and make purchases.

Consider a few sources of error in order to see why errors are so pervasive:

• People entering data will routinely make mistakes. For instance, someone unfamiliar with the Spanish name “Aguilar” can easily enter it as “Agular” or “Agiular.” People can also put data in the wrong fields, such as typing a name into a field meant for the social security number. Because addresses vary so much, they might choose different fields for some parts of an address, such as an apartment number or rural route. This inevitable inaccuracy is a source of suffering for people whose governments or businesses insist that their ID exactly matches some official record. Computer processing is no panacea, because the programmer rarely anticipates all variations in input. Several systems that process names reject common characters such as apostrophes, leading to situations where the name “O’Reilly” is rejected or automatically changed to “OReilly” or “Oreilly.” The program might reject addresses due to an arbitrary rule that every address must include a ZIP code, which doesn’t exist in all countries. And, of course, a buggy program can simply insert corrupted data.

• Devices can stop collecting or sending results for arbitrary periods of time, leaving gaps in the incoming stream of data. They can also produce wildly incorrect results, perhaps because they were jostled or because the environment became too hot or cold for them.

• Large streams of incoming data can overwhelm the network or the intake buffers in the systems receiving the data, causing data to be lost.

• Glitches in processing, or even problems on the storage media, can lead to lost or truncated data. This can even affect a single record.

The impacts on analytics can range from minuscule (leaving a potentially valuable customer out of a campaign) to dire (producing totally wrong results because of some corrupt inputs). These are the sorts of things that keep data scientists up late at night, figuring out ways to straighten out bad data. Numerous commercial tools have also sprung up to automate the task. If you find 50 instances of “Aguilar” in a field, you can guess that “Agiular” was meant to be “Aguilar,” and either have it corrected automatically or present it to a human for further investigation. Similarly, if you find “Chicago” in the state field, you can move it to the city field. Modern analytics, sometimes using sophisticated artificial intelligence (AI), can take unknown datasets and generate the rules that fix such inconsistencies.
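As an illustration, here is a minimal Python sketch of that kind of automated correction, using the standard library’s difflib module; the sample values, the frequency threshold, and the similarity cutoff are all assumptions you would tune for real data:

from collections import Counter
from difflib import get_close_matches

# Values observed in one field, including two likely misspellings.
observed = ["Aguilar"] * 50 + ["Agiular", "Agular"]

# Treat values that appear often as trusted spellings.
counts = Counter(observed)
trusted = [value for value, count in counts.items() if count >= 10]

for value, count in counts.items():
    if value in trusted:
        continue
    # Propose the closest trusted value; a human could review the
    # guess, or the pipeline could apply it automatically.
    guess = get_close_matches(value, trusted, n=1, cutoff=0.8)
    if guess:
        print(f"{value!r} (seen {count}x) is probably {guess[0]!r}")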

Extracting Kernels of Truth

Although some data is entered in a way that’s easy to extract and process, a majority of data is unstructured and difficult to make sense of. To help you get on the right path, let’s walk through a common example of how to extract data from a standardized log format, also known as Combined Log Format, that monitors a visitor’s experience on your website. Web servers write a line of information to the log file about each request and response the servers handle. An example of such a line follows (broken into several lines to fit the page):

74.208.4.93 - andyo [04/Dec/2018:08:24:12 -0500]

"GET /somenewwidget.php HTTP/2.0"

200 465 https://www.bing.com/search?q=some+new+widget

"Mozilla/5.0 (Android; Mobile; rv:13.1)"

The first two fields (ignoring the hyphen) provide a pretty good way to identify a unique visitor. In this case, a visitor named andyo visited from IP address 74.208.4.93. Because andyo is most likely behind a firewall (and might have a different IP address tomorrow), our application focuses on what he does during the first few minutes he visits your website. The IP address and username should stay the same during that time.

From subsequent fields, you know that he found your website through a search for “some new widget” on the Bing search engine (see the URL on the third line), and that after visiting Bing he pulled up the appropriate web page on your site for that widget. You can learn these things because the web page that generated the log entry is shown on the second line after the word GET. The URL on the third line shows the referring URL or “Referer” (spelled without the doubled “r”). This is the previous page andyo was on before coming to the page that generated the log entry.

Suppose that you’re interested in the sequence of page visits that lead to a purchase. Take a look at the rest of this session (which is much simpler than most real-life visits to retail sites):

"Mozilla/5.0 (Android; Mobile; rv:13.1)"

In this scenario, andyo checked the widget’s price in the first entry, put a widget into his shopping cart in the second entry, and then bought it in the third entry, receiving your “thank you” page. andyo might have visited other pages on your site after this, but your current application cares only about the sequence that led to the sale. If a lot of people go through a similar sequence, you might make tweaks to your website. For example, you can put the price of a product on that product’s page so that visitors can see the price without undergoing an extra click and waiting for an extra page to load.

The Combined Log Format is a little difficult to read, but it is laid out in a fairly formal way so that a programmer with modest skills can extract the desired fields. The application just described needs only the IP address, username, Referer URL, and URL of the page the user is currently looking at. So, you might want to write a program that extracts those fields and restructures them in some format that’s easy for an application to parse and search through. The most popular such format is JSON, supported by libraries in virtually every programming language as well as many database engines. The data you extract might then look like this:

{
  "ipAddress": "74.208.4.93",
  "username": "andyo",
  "url": "/somenewwidget.php",
  "referer": "https://www.bing.com/search?q=some+new+widget"
}

By reducing the noisy input to a few items in a strict structure suitable for automated processing, your streaming data application can accept millions of data items in that format and quickly process them. Besides extracting fields from the log format, the application has parsed one field (the GET command) to extract from it the URL you want.
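As a minimal sketch, such an extraction program might look like the following in Python, assuming each log entry arrives as a single line in standard Combined Log Format (which quotes the Referer and user-agent fields); the JSON field names are the hypothetical ones shown above:

import json
import re

# Groups for the only fields this application needs: IP address,
# username, the URL inside the GET command, and the Referer.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[[^\]]+\] '
    r'"GET (?P<url>\S+) [^"]+" \d+ \d+ '
    r'"(?P<referer>[^"]*)"'
)

def log_line_to_json(line):
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed entry; a real pipeline would count these
    return json.dumps({
        "ipAddress": match.group("ip"),
        "username": match.group("user"),
        "url": match.group("url"),
        "referer": match.group("referer"),
    }, indent=2)

entry = ('74.208.4.93 - andyo [04/Dec/2018:08:24:12 -0500] '
         '"GET /somenewwidget.php HTTP/2.0" 200 465 '
         '"https://www.bing.com/search?q=some+new+widget" '
         '"Mozilla/5.0 (Android; Mobile; rv:13.1)"')
print(log_line_to_json(entry))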

This section has hopefully given you a sense of what streaming data looks like, particularly how you can capture input and put it in a useful format for extracting insights. Now, let’s look at the process of handling big data, which will help you to choose the proper datasets and tools. Instead of a complex example using advanced tools, which would be confusing at this stage, let’s take a simple research project that demonstrates the same basic steps that streaming data projects go through.

Example of a Complete Project

We can illustrate the general steps involved in data processing, starting with data acquisition and cleaning, through a small project I carried out using simple, conventional analytics (which produces an easier example than streaming data). My goal was to illustrate the common observation that the rise and fall of large businesses take place faster nowadays than they did in earlier decades. The steps were as follows:

Step 1: Find a source of data

This can be a major research task. But in this case, a web search turned up a reputable source for businesses in the Fortune 500 for every year since the index was started. I could have looked for other data sources—for instance, many analysts prefer the Dow Jones index for such comparisons—but I decided Fortune 500 was acceptable.

Step 2: Ingest the data

For streaming data, this might involve some means of transferring files from your source into your own data warehouse or data lake. Whether you use files or some real-time source, you then connect the source to the processing tool that you use, such as Spark. Having found a website with the desired data, I had to grab a bunch of web pages formatted as tables in HTML. I wrote a program that extracted the information I wanted and created a simple format that showed the name of the business and the year in which it appeared in the Fortune 500. If there had been a need to clean the data, such as looking for misspelled company names, I would have done that also during this step.

Step 3: Run the analysis

The first step in analysis was to find each business name in each year that it appeared in the Fortune 500. The analysis could have followed each business’s rise and fall more precisely (for instance, if it went gradually from 400 to 200 and then back down), but I contented myself with determining the earliest and latest years of a continuous appearance in the Fortune 500. The earliest and latest dates determined the length of time that the business was on the index. After doing this for each business, I aggregated the data for all businesses to show the average tenure of businesses over the years.

Step 4: Store the results

For this project, I could store results as a simple table, but more complex storage would be required for large sets of output.

Step 5: Present the results

Having created a simple two-dimensional dataset—average length of time that businesses had been on the index, versus the year—I wrote a program to generate a GIF of a chart, with the years on the X axis and the length of the business’s tenure on the Y axis.
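For the curious, a compressed sketch of steps 3 through 5 might look like the following, assuming Python with pandas and matplotlib, and assuming the ingestion step produced a hypothetical fortune500.csv file with one (business, year) pair per appearance:

import matplotlib.pyplot as plt
import pandas as pd

# Step 3: run the analysis. Each row records one appearance.
df = pd.read_csv("fortune500.csv")  # columns: business, year

# Earliest and latest year per business; the difference is its tenure.
# (For simplicity this ignores gaps in a business's run on the list.)
tenure = df.groupby("business")["year"].agg(first="min", last="max")
tenure["length"] = tenure["last"] - tenure["first"] + 1

# Average tenure of the businesses appearing in each year.
merged = df.merge(tenure["length"], left_on="business", right_index=True)
avg_by_year = merged.groupby("year")["length"].mean()

# Step 4: store the results as a simple table.
avg_by_year.to_csv("average_tenure_by_year.csv")

# Step 5: present the results (the original produced a GIF; PNG here).
avg_by_year.plot(xlabel="Year", ylabel="Average tenure (years)")
plt.savefig("average_tenure.png")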

What was the result of these analytics? I succeeded at producing a chart showing the results we had expected: a downward trend in length of time that businesses were on the Fortune 500. But there was an enormous spike in the middle—not a small upturn that could be the result of a legitimate trend, but an impossibly high value that stood out like a bobby pin in one year on my chart. This shows once again that every dataset has quality problems. Most likely, the HTML on a web page for a single row was distorted in a way that caused my simple parsing tool to insert a meaningless data value. Every data processing project will deal with such anomalies.

Now that we’ve covered the basic issues that organizations must handle during data processing, let’s launch into the tools for streaming data. These have evolved rapidly over a period lasting a little more than one decade because businesses are constantly trying to find better ways to collect and use data. Different tools were created to solve different, evolving needs, and many of these tools coexist today while each continues to develop. We cover the various tasks required by streaming data and how tools have evolved to solve the problems organizations face as their use of data changes.

Tracing the Movements of Big Data Processing

The following sections introduce the categories of tools used for streaming data (with a handful of examples for each category) and explain their purpose. As organizations face new challenges, they go through cycles of coming up against the limitations of existing tools and having to make a choice between altering those tools and developing new ones. As we walk through the decade-long history of big data, notice how new categories of tools might be appropriate for your organization. Our history looks at the following:

• Batch processing of big datasets

• Data storage for big data

• New architectures for pipelined applications

• Streaming analytics

• Orchestrating computing requirements

Batch Processing of Big Datasets: MapReduce and Hadoop

Modern data sizes have reached such proportions that the computer field must continually find new Greek prefixes to describe them (exabytes, zettabytes, etc.). Organizations found around the turn of the millennium that they could not handle this data just by scaling up the processes they already used. In addition, the data is related differently, and new computing algorithms can extract value much more quickly. In this section, we go back to a critical development in computing with big data.

We could probably trace big data’s start to a radical innovation at Google in the mid-2000s. For its central data-processing task, indexing the web, the company abandoned typical queries into relational databases for a sort of brute-force data filtering. Called MapReduce, this paradigm is documented in a classic 2004 paper by Jeffrey Dean and Sanjay Ghemawat.

What motivated MapReduce? Business people often talk about “deriving insights from data.” Technically, that generally involves aggregating thousands or even millions of pieces of data, or reducing these millions of pieces to one key number. You might be doing something as simple as summing (what were our expenses this past quarter?) or averaging (how much did we earn on a single item?). However, a reduction operation could also be much more complex, such as determining the expected time to failure for a factory part. In any case, you are trying to reduce a welter of data to a single number that means something to you. Hence the Reduce part of the MapReduce algorithm family.

The input to your Reduce function must be very consistent and ordered, so you must usually start your application by doing some sort of extraction, counting, or other operation on each input record. We saw this in our log file example—and this is where the Map function comes in. In Google’s original application, a Map function crawled through an enormous number of web pages to create a huge list of word–document pairs showing where each word occurred. The Reduce function would then show how many times each word appeared in each document. (Dean and Ghemawat said, “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.”)

Because a Map operation works on items of data one by one, it produces output in an unpredictable order. For instance, if you are trying to find customers in sales data, and the data is ordered in some other way such as date of purchase, you will end up with multiple records for a single customer scattered randomly through the stream of output from the Map function. Thus, another step, involving sorting or “shuffling,” lies between the Map and the Reduce. This step is particularly important if you run multiple Reduce jobs that take data with different characteristics; the Shuffle step can feed the right data to each Reduce job. For instance, you might filter the output so that sales data from each store goes to a different Reduce job. The earlier Fortune 500 example contained Map operations to extract individual pairs of business–year combinations, and a Reduce operation to find the average length of time spent by a business on the index.
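To make the three phases concrete, here is a minimal single-machine sketch in Python of the word-counting application described above; real MapReduce frameworks run these same phases in parallel across many machines, and the sample documents are invented:

from collections import defaultdict

documents = {
    "doc1": "new widget ships with new features",
    "doc2": "widget pricing for the new widget",
}

# Map: process items one by one, emitting (key, value) pairs.
# The key is a (word, document) pair, as in Google's indexing example.
mapped = []
for doc_id, text in documents.items():
    for word in text.split():
        mapped.append(((word, doc_id), 1))

# Shuffle: sort and group the unordered Map output by key, so each
# Reduce call sees all the values for one key together.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: collapse each group to one meaningful number, here the
# count of times a word appeared in a document.
counts = {key: sum(values) for key, values in groups.items()}
print(counts[("new", "doc1")])  # prints 2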

MapReduce relies on massive compute power. Hence, it is designed for distributed, parallel processing. A free and open source version, Hadoop, was soon released and quickly became a key tool to cite on a programmer’s résumé. The Apache Foundation, which had sprung into being years earlier to coordinate the development of the open source Apache web server, took over administration of the Hadoop project. That brought the foundation into the age of big data, and it now hosts many of the projects mentioned in this report.

Hadoop was by no means the last word in processing big data. You’ll soon see that streaming data introduces even more demands. But first, let’s look at how all this data can be stored.

Data Storage for Big Data: NoSQL

As we saw from our earlier example of log-file processing, modern data applications structure and process data in unique ways. This data can be oddly structured, with different fields in each record, and poorly structured, containing text or images. To store and query such data, organizations turned to a new family of databases known loosely as NoSQL.
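For instance, a document-oriented NoSQL store can accept both of the following records even though they share only one field (the records are hypothetical, continuing the earlier retail example):

{
  "username": "andyo",
  "cart": ["somenewwidget"],
  "referer": "https://www.bing.com/search?q=some+new+widget"
}

{
  "username": "imani",
  "loyaltyNumber": 48212
}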
