

Beijing Boston Farnham Sebastopol Tokyo

The Big Data Transformation

by Alice LaPlante

Copyright © 2017 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Tim McGovern and Debbie Hardin
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

November 2016: First Edition

Revision History for the First Edition

2016-11-03: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Big Data Transformation, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.


Table of Contents

1. Introduction
    Big Data: A Brief Primer
    A Crowded Marketplace for Big Data Analytical Databases
    Yes, You Need Another Database: Finding the Right Tool for the Job
    Sorting Through the Hype

2. Where Do You Start? Follow the Example of This Data-Storage Company
    Aligning Technologists and Business Stakeholders
    Achieving the “Outrageous” with Big Data
    Monetizing Big Data
    Why Vertica?
    Choosing the Right Analytical Database
    Look for the Hot Buttons

3. The Center of Excellence Model: Advice from Criteo
    Keeping the Business on the Right Big-Data Path
    The Risks of Not Having a CoE
    The Best Candidates for a Big Data CoE

4. Is Hadoop a Panacea for All Things Big Data? YP℠ Says No
    YP Transforms Itself Through Big Data

5. Cerner Scales for Success
    A Mammoth Proof of Concept
    Providing Better Patient Outcomes
    Vertica: Helping to Keep the Lights On
    Crunching the Numbers

6. Whatever You Do, Don’t Do This, Warns Etsy
    Don’t Forget to Consider Your End User When Designing Your Analytics System
    Don’t Underestimate Demand for Big-Data Analytics
    Don’t Be Naïve About How Fast Big Data Grows
    Don’t Discard Data
    Don’t Get Burdened with Too Much “Technical Debt”
    Don’t Forget to Consider How You’re Going to Get Data into Your New Database
    Don’t Build the Great Wall of China Between Your Data Engineering Department and the Rest of the Company
    Don’t Go Big Before You’ve Tried It Small
    Don’t Think Big Data Is Simply a Technical Shift


CHAPTER 1

Introduction

We are in the age of data. Recorded data is doubling in size every two years, and by 2020 we will have captured as many digital bits as there are stars in the universe, reaching a staggering 44 zettabytes, or 44 trillion gigabytes. Included in these figures is the business data generated by enterprise applications as well as the human data generated by social media sites like Facebook, LinkedIn, Twitter, and YouTube.

Big Data: A Brief Primer

Gartner’s description of big data—which focuses on the “three Vs”: volume, velocity, and variety—has become commonplace. Big data has all of these characteristics. There’s a lot of it, it moves swiftly, and it comes from a diverse range of sources.

A more pragmatic definition is this: you know you have big data when you possess diverse datasets from multiple sources that are too large to cost-effectively manage and analyze within a reasonable timeframe when using your traditional IT infrastructures. This data can include structured data as found in relational databases as well as unstructured data such as documents, audio, and video.

IDG estimates that big data will drive the transformation of IT through 2025. Key decision-makers at enterprises understand this: eighty percent of enterprises have initiated big data–driven projects as top strategic priorities. And these projects are happening across virtually all industries. Table 1-1 lists just a few examples.


Table 1-1. Transforming business processes across industries

Automotive: Auto sensors reporting vehicle location problems
Financial services: Risk, fraud detection, portfolio analysis, new product development
Manufacturing: Quality assurance, warranty analyses
Healthcare: Patient sensors, monitoring, electronic health records, quality of care
Oil and gas: Drilling exploration sensor analyses
Retail: Consumer sentiment analyses, optimized marketing, personalized targeting, market basket analysis, intelligent forecasting, inventory management
Utilities: Smart meter analyses for network capacity, smart grid
Law enforcement: Threat analysis, social media monitoring, photo analysis, traffic optimization
Advertising: Customer targeting, location-based advertising, personalized retargeting, churn

… to have the right tool for the job. Gartner calls this “best-fit engineering.”

This is especially true when it comes to databases. Databases form the heart of big data. They’ve been around for a half century, but they have evolved almost beyond recognition during that time. Today’s databases for big data analytics are completely different animals than the mainframe databases from the 1960s and 1970s, although SQL has been a constant for the last 20 to 30 years.

There have been four primary waves in this database evolution:

Mainframe databases

The first databases were fairly simple and used by government, financial services, and telecommunications organizations to process what (at the time) they thought were large volumes of transactions. But there was no attempt to optimize either putting the data into the databases or getting it out again. And they were expensive—not every business could afford one.


Online transactional processing (OLTP) databases

The birth of the relational database using the client/server model finally brought affordable computing to all businesses. These databases became even more widely accessible through the Internet in the form of dynamic web applications and customer relationship management (CRM), enterprise resource planning (ERP), and ecommerce systems.

Data warehouses

The next wave enabled businesses to combine transactional data—for example, from human resources, sales, and finance—together with operational software to gain analytical insight into their customers, employees, and operations. Several database vendors seized leadership roles during this time. Some were new and some were extensions of traditional OLTP databases. In addition, an entire industry that brought forth business intelligence (BI) as well as extract, transform, and load (ETL) tools was born.

Big data analytics platforms

During the fourth wave, leading businesses began recognizing that data is their most important asset. But handling the volume, variety, and velocity of big data far outstripped the capabilities of traditional data warehouses. In particular, previous waves of databases had focused on optimizing how to get data into the databases. These new databases were centered on getting actionable insight out of them. The result: today’s analytical databases can analyze massive volumes of data, both structured and unstructured, at unprecedented speeds. Users can easily query the data, extract reports, and otherwise access the data to make better business decisions much faster than was possible previously. (Think hours instead of days and seconds/minutes instead of hours.)

One example of an analytical database—the one we’ll explore in this document—is Vertica from Hewlett Packard Enterprise (HPE). Vertica is a massively parallel processing (MPP) database, which means it spreads the data across a cluster of servers, making it possible for systems to share the query-processing workload. Created by legendary database guru and Turing Award winner Michael Stonebraker, and then acquired by HP, the Vertica Analytics Platform was purpose-built from its very first line of code to optimize big-data analytics.
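To make the idea concrete, here is a minimal sketch of what issuing an analytic query against such a cluster can look like from an application. It assumes the open source vertica_python client; the host, credentials, and the page_views table are illustrative placeholders, not details from this report.

```python
# Minimal sketch: running an analytic aggregate against a Vertica cluster.
# Assumes the open source vertica_python client; the host, credentials, and
# the page_views table are hypothetical placeholders.
import vertica_python

conn_info = {
    "host": "vertica-node1.example.com",  # any node in the cluster can accept the query
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

query = """
    SELECT DATE_TRUNC('day', event_time) AS day,
           COUNT(*)                      AS views
    FROM page_views
    WHERE event_time >= CURRENT_DATE - 30
    GROUP BY 1
    ORDER BY 1
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(query)  # the scan and aggregation are spread across the nodes (MPP)
    for day, views in cur.fetchall():
        print(day, views)
```

The application talks to a single node, but the cluster divides the scan and aggregation among the servers that hold the data, which is what the MPP description above refers to.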


Three things in particular set Vertica apart, according to Colin Mahony, senior vice president and general manager for HPE Software Big Data:

• Its creators saw how rapidly the volume of data was growing, and designed a system capable of scaling to handle it from the ground up.

• They also understood all the different analytical workloads that businesses would want to run against their data.

• They realized that getting superb performance from the database in a cost-effective way was a top priority for businesses.

Yes, You Need Another Database: Finding the Right Tool for the Job

According to Gartner, data volumes are growing 30 percent to 40 percent annually, whereas IT budgets are increasing by only 4 percent. Businesses have more data to deal with than they have money. They probably have a traditional data warehouse, but the sheer size of the data coming in is overwhelming it. They can go the data lake route and set it up on Hadoop, which will save money while capturing all the data coming in, but it won’t help them much with the analytics that started off the entire cycle. This is why these businesses are turning to analytical databases.

Analytical databases typically sit next to the system of record—whether that’s Hadoop, Oracle, or Microsoft—to perform speedy analytics of big data.

In short: people assume a database is a database, but that’s not true. Here’s a metaphor created by Steve Sarsfield, a product-marketing manager at HPE, to articulate the situation (illustrated in Figure 1-1):

If you say “I need a hammer,” the correct tool you need is determined by what you’re going to do with it.


Figure 1-1. Different hammers are good for different things

The same scenario is true for databases. Depending on what you want to do, you would choose a different database, whether an MPP analytical database like Vertica, an XML database, or a NoSQL database—you must choose the right tool for the job you need to do. You should choose based upon three factors: structure, size, and analytics. Let’s look a little more closely at each:

Still, though, the three main considerations remain structure, size, and analytics. Vertica’s sweet spot, for example, is performing long, deep queries of structured data at rest that have fixed schemas. But even then there are ways to stretch the spectrum of what Vertica can do by using technologies such as Kafka and Flex Tables, as demonstrated in Figure 1-2.

Figure 1-2. Stretching the spectrum of what Vertica can do
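As a rough illustration of the Flex Tables end of that spectrum, the sketch below creates a flex table and loads newline-delimited JSON into it so its keys can be queried without a predefined schema. The table name, file path, and connection details are hypothetical, and the syntax follows Vertica’s documented flex-table pattern rather than anything shown in this report, so verify it against your own environment.

```python
# Rough sketch: using a Vertica flex table to absorb semi-structured JSON
# without defining a schema up front. The table name, file path, and
# connection details are hypothetical; the flex-table syntax follows
# Vertica's documented pattern, so verify it against your own version.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # A flex table keeps each record's keys and values in a binary map column.
    cur.execute("CREATE FLEX TABLE raw_events()")
    # Load newline-delimited JSON; keys become queryable as virtual columns.
    cur.execute("COPY raw_events FROM '/data/events.json' PARSER fjsonparser()")
    # Query a JSON key as if it were an ordinary column.
    cur.execute("SELECT event_type, COUNT(*) FROM raw_events GROUP BY 1")
    print(cur.fetchall())
    conn.commit()
```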

In the end, the factors that drive your database decision are the same forces that drive IT decisions in general. You want to:

Increase revenues

You do this by investing in big-data analytics solutions that allow you to reach more customers, develop new product offerings, focus on customer satisfaction, and understand your customers’ buying patterns.

Enhance efficiency

You need to choose big data analytics solutions that reduce software-licensing costs, enable you to perform processes more efficiently, take advantage of new data sources effectively, and accelerate the speed at which that information is turned into knowledge.

Improve compliance

Finally, your analytics database must help you to comply with local, state, federal, and industry regulations and ensure that your reporting passes the robust tests that regulatory mandates place on it. Plus, your database must be secure to protect the privacy of the information it contains, so that it’s not stolen or exposed to the world.


Sorting Through the Hype

There’s so much hype about big data that it can be difficult to know what to believe. We maintain that one size doesn’t fit all when it comes to big-data analytical databases. The top-performing organizations are those that have figured out how to optimize each part of their data pipelines and workloads with the right technologies. The job of vendors in this market: to keep up with standards so that businesses don’t need to rip and replace their data schemas, queries, or frontend tools as their needs evolve.

In this document, we show the real-world ways that leading businesses are using Vertica in combination with other best-in-class big-data solutions to solve real business challenges.


CHAPTER 2

Where Do You Start? Follow the Example of This Data-Storage Company

When selling big data to your company, you need to know your audience. Big data can deliver massive benefits to the business, but you must know your audience’s interests.

For example, you might know that big data gets you the following:

• 360-degree customer view (improving customer “stickiness”) via cloud services

• Rapid iteration (improving product innovation) via engineering informatics

• Force multipliers (reducing support costs) via support automation

But if others within the business don’t realize what these benefitsmean to them, that’s when you need to begin evangelizing:

• Envision the big-picture business value you could be getting from big data


• Communicate that vision to the business and then explain what’s required from them to make it succeed

• Think in terms of revenues, costs, competitiveness, and stickiness, among other benefits

Table 2-1 shows what the various stakeholders you need to convince want to hear.

Table 2-1. Know your audience

Analysts want: SQL and ODBC; ACID for consistency; the ability to integrate big-data solutions into current BI and reporting tools.
Business owners want: new revenue; sheer speed for critical answers; increased operational efficiency.
IT professionals want: an MPP shared-nothing architecture; lower TCO from a reduced footprint.
Data scientists want: R for in-database analytics; tools to creatively explore the big data.

Aligning Technologists and Business Stakeholders

Larry Lancaster, a former chief data scientist at a company offering hardware and software solutions for data storage and backup, thinks that getting business strategists in line with what technologists know is right is a universal challenge in IT. “Tech people talk in a language that the business people don’t understand,” says Lancaster. “You need someone to bridge the gap. Someone who understands from both sides what’s needed, and what will eventually be delivered,” he says.

The best way to win the hearts and minds of business stakeholders: show them what’s possible. “The answer is to find a problem, and make an example of fixing it,” says Lancaster.

The good news is that today’s business executives are well aware of the power of data. But the bad news is that there’s been a certain amount of disappointment in the marketplace. “We hear stories about companies that threw millions into Hadoop, but got nothing out of it,” laments Lancaster. These disappointments make executives reticent to invest large sums.


Lancaster’s advice is to pick one of two strategies: either start small and slowly build success over time, or make an outrageous claim to get people’s attention. Here’s his advice on the gradual tactic:

The first approach is to find one use case, and work it up yourself, in a day or two. Don’t bother with complicated technology; use Excel. When you get results, work to gain visibility. Talk to people above you. Tell them you were able to analyze this data and that Bob in marketing got an extra 5 percent response rate, or that your support team closed cases 10 times faster.

Typically, all it takes is one or two persons to do what Lancaster calls “a little big-data magic” to convince people of the value of the technology.

The other approach is to pick something that is incredibly aggressive and make an outrageous statement. Says Lancaster:

Intrigue people. Bring out amazing facts of what other people are doing with data, and persuade the powers that be that you can do it, too.

Achieving the “Outrageous” with Big Data

Lancaster knows about taking the second route. As chief data scientist, he built an analytics environment from the ground up that completely eliminated Level 1 and Level 2 support tickets.

Imagine telling a business that it could almost completely make routine support calls disappear. No one would pass up that opportunity.

“You absolutely have their attention,” said Lancaster.

This company offered businesses a unique storage value proposition in what it calls predictive flash storage. Rather than forcing businesses to choose between hard drives (cheap but slow) and solid-state drives (SSDs—fast but expensive) for storage, it offered the best of both worlds. By using predictive analytics, it built systems that were very smart about what data went onto the different types of storage. For example, data that businesses were going to read randomly went onto the SSDs. Data for sequential reads—or perhaps no reads at all—was put on the hard drives.

How did they accomplish all this? By collecting massive amounts of data from all the devices in the field through telemetry, and sending it back to the company’s analytics database, Vertica, for analysis.
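A hedged sketch of what that kind of telemetry ingestion might look like appears below: device readings are batched and bulk-loaded through a COPY statement. The sensor_readings table, the CSV layout, and the connection details are hypothetical; cursor.copy() is the vertica_python client helper that streams data into a COPY ... FROM STDIN.

```python
# Hedged sketch of one way device telemetry could be batched and bulk-loaded
# into Vertica. The sensor_readings table, CSV layout, and connection details
# are hypothetical; cursor.copy() is the vertica_python helper that streams
# data through a COPY ... FROM STDIN statement.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "telemetry"}

def load_batch(readings):
    """readings: iterable of (array_id, metric, value, observed_at) tuples."""
    data = "".join(",".join(str(field) for field in row) + "\n" for row in readings)
    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        cur.copy(
            "COPY sensor_readings (array_id, metric, value, observed_at) "
            "FROM STDIN DELIMITER ',' ABORT ON ERROR",
            data,
        )
        conn.commit()

# Example: a tiny batch of data points phoned home from one array.
load_batch([
    ("array-0042", "read_latency_ms", 1.8, "2016-05-01 12:00:00"),
    ("array-0042", "cache_hit_pct", 97.4, "2016-05-01 12:00:00"),
])
```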


Lancaster said it would be very difficult—if not impossible—to size deployments or use the correct algorithms to make predictive storage products work without a tight feedback loop to engineering.

We delivered a successful product only because we collected enough information, which went straight to the engineers, who kept iterating and optimizing the product. No other storage vendor understands workloads better than us. They just don’t have the telemetry out there.

And the data generated by the telemetry was huge. The company was taking in 10,000 to 100,000 data points per minute from each array in the field. And when you have that much data and begin running analytics on it, you realize you could do a lot more, according to Lancaster.

We wanted to increase how much it was paying off for us, but we needed to do bigger queries faster. We had a team of data scientists and didn’t want them twiddling their thumbs. That’s what brought us to Vertica.

Without Vertica helping to analyze the telemetry data, they would have had a traditional support team, opening cases on problems in the field and escalating harder issues to engineers, who would then need to simulate processes in the lab.

“We’re talking about a very labor-intensive, slow process,” said Lancaster, who believes that the entire company has a better understanding of the way storage works in the real world than any other storage vendor—simply because it has the data.

As a result of the Vertica deployment, this business opens and closes 80 percent of its support cases automatically. Ninety percent are automatically opened. There’s no need to call customers up and ask them to gather data or send log posts. Cases that would ordinarily take days to resolve get closed in an hour.

The company also uses Vertica to audit all of the storage that its customers have deployed to understand how much of it is protected. “We know with local snapshots, how much of it is replicated for disaster recovery, how much incremental space is required to increase retention time, and so on,” said Lancaster. This allows them to go to customers with proactive service recommendations for protecting their data in the most cost-effective manner.


Monetizing Big Data

Lancaster believes that any company could find aspects of support, marketing, or product engineering that could improve by at least two orders of magnitude in terms of efficiency, cost, and performance if it utilized data as much as his organization did.

More than that, businesses should be figuring out ways to monetize the data.

For example, Lancaster’s company built a professional services offering that included dedicating an engineer to a customer account, not just for the storage but also for the host side of the environment, to optimize reliability and performance. This offering was fairly expensive for customers to purchase. In the end, because of analyses performed in Vertica, the organization was able to automate nearly all of the service’s function. Yet customers were still willing to pay top dollar for it. Says Lancaster:

Enterprises would all sign up for it, so we were able to add 10 percent to our revenues simply by better leveraging the data we were already collecting. Anyone could take their data and discover a similar revenue windfall.

Already, in most industries, there are wars as businesses race for a competitive edge based on data.

For example, look at Tesla, which brings back telemetry from every car it sells, every second, and is constantly working on optimizing designs based on what customers are actually doing with their vehicles. “That’s the way to do it,” says Lancaster.

But as he began to use Vertica more and more, he realized that the performance benefits achievable were another order of magnitude beyond what you would expect with just the column-store efficiency.

It’s because Vertica allows you to do some very efficient types of encoding on your data. So all of the low cardinality columns that would have been wasting space in a row store end up taking almost no space at all.
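The sketch below illustrates the kind of encoding Lancaster is describing: a Vertica projection that declares run-length encoding (RLE) on low-cardinality columns and sorts on them so long runs of repeated values collapse to almost nothing. The table, projection, and column names are hypothetical, and the CREATE PROJECTION / ENCODING RLE syntax follows Vertica’s documented form, so treat it as an illustration rather than a tuning recommendation from this report.

```python
# Hedged sketch of the encoding Lancaster describes: a Vertica projection that
# applies run-length encoding (RLE) to low-cardinality columns and sorts on
# them so long runs of repeated values take almost no space. The table,
# projection, and column names are hypothetical (and assume the table already
# exists); the CREATE PROJECTION / ENCODING RLE syntax follows Vertica's
# documented form.
import vertica_python

DDL = """
CREATE PROJECTION sensor_readings_rle (
    array_id    ENCODING RLE,   -- thousands of distinct values at most
    metric      ENCODING RLE,   -- a few dozen distinct values
    observed_at,
    value
)
AS SELECT array_id, metric, observed_at, value
   FROM sensor_readings
   ORDER BY array_id, metric, observed_at
   SEGMENTED BY HASH(array_id) ALL NODES;
"""

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "telemetry"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(DDL)
    conn.commit()
```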

According to Lancaster, Vertica is the data warehouse the market needed for 20 years, but didn’t have. “Aggressive encoding coming together with late materialization in a column store, I have to say, was a pivotal technological accomplishment that’s changed the database landscape dramatically,” he says.

On smaller Vertica queries, his team of data scientists was experiencing subsecond latencies. On the large ones, it was getting sub-10-second latencies.

It’s absolutely amazing. It’s game changing. People can sit at their desktops now, manipulate data, come up with new ideas, and iterate without having to run a batch and go home. It’s a dramatic increase in productivity.

What else did they do with the data? Says Lancaster, “It was more like, ‘What didn’t we do with the data?’ By the time we hired BI people, everything we wanted was uploaded into Vertica, not just telemetry, but also Salesforce, and a lot of other business systems, and we had this data warehouse dream in place,” he says.

Choosing the Right Analytical Database

As you do your research, you’ll find that big data platforms are often suited for special purposes. But you want a general solution with lots of features, such as the following:


Even before being acquired by what was at that point HP, Vertica was the biggest big data pure-play analytical database. A feature-rich general solution, it had everything that Lancaster’s organization needed:

• Scale-out MPP architecture

• SQL database with ACID compliance

• R-integrated window functions, distributed R

Vertica’s performance-first design makes big data smaller in motionwith the following design features:

Even when it didn’t use in-line compression, the company still achieved approximately a 25-times reduction in storage footprint with Vertica post-compression. This resulted in radically lower TCO for the same performance and significantly better performance for the same TCO.

Look for the Hot Buttons

So, how do you get your company started on a big-data project?

“Just find a problem your business is having,” advised Lancaster. “Look for a hot button. And instead of hiring a new executive to solve that problem, hire a data scientist.”

Say your product is falling behind in the market—that means your feedback to engineering or product development isn’t fast enough. And if you’re bleeding too much in support, that’s because you don’t have sufficient information about what’s happening in the field.

“Bring in a data scientist,” advises Lancaster. “Solve the problem with data.”

Of course, showing an initial ROI is essential—as is having a vision, and a champion. “You have to demonstrate value,” says Lancaster. “Once you do that, things will grow from there.”


CHAPTER 3

The Center of Excellence Model: Advice from Criteo

Could you benefit from a big-data CoE? Criteo has, and it has some advice for those who would like to create one for their business. According to Justin Coffey, a senior staff development lead at the performance marketing technology company, whether you formally call it a CoE or not, your big-data analytics initiatives should be led by a team that promotes collaboration with and between users and technologists throughout your organization. This team should also identify and spread best practices around big-data analytics to drive business- or customer-valued results. HPE uses the term “data democratization” to describe organizations that increase access to data from a variety of internal groups in this way.


That being said, even though the model tends to be variable across companies, the work of the CoE tends to be quite similar, including (but not limited to) the following:

• Defining a common set of best practices and work standards around big data

• Assessing (or helping others to assess) whether they are utilizing big data and analytics to best advantage, using the aforementioned best practices

• Providing guidance and support to assist engineers, programmers, end users, data scientists, and other stakeholders to implement these best practices

Coffey is fond of introducing Criteo as “the largest tech company you’ve never heard of.” The business drives conversions for advertisers across multiple online channels: mobile, banner ads, and email. Criteo pays for the display ads, charges for traffic to its advertisers, and optimizes for conversions. Based in Paris, it has 2,200 employees in more than 30 offices worldwide, with more than 400 engineers and more than 100 data analysts.

Criteo enables ecommerce companies to effectively engage and convert their customers by using large volumes of granular data. It has established one of the biggest European R&D centers dedicated to performance marketing technology in Paris and an international R&D hub in Palo Alto. By choosing Vertica, Criteo gets deep insights across tremendous data loads, enabling it to optimize the performance of its display ads delivered in real time for each individual consumer across mobile, apps, and desktop.

The breadth and scale of Criteo’s analytics stack is breathtaking. Fifty billion total events are logged per day. Three billion banners are served per day. More than one billion unique users per month visit its advertisers’ websites. Its Hadoop cluster ingests more than 25 TB a day. The system makes 15 million predictions per second out of seven datacenters running more than 15,000 servers, with more than five petabytes under management.


Overall, however, it’s a fairly simple stack, as Figure 3-1 illustrates. Criteo decided to use:

• Hadoop to store raw data

• HPE Vertica database for data warehousing

• Tableau as the frontend data analysis and reporting tool

With a thousand users (up to 300 simultaneously during peak periods), the right setup and optimization of the Tableau server was critical to ensure the best possible performance.

Figure 3-1. The performance marketing technology company’s big-data analytics stack

Criteo started by using Hadoop for internal analytics, but soon found that its users were unhappy with query performance, and that direct reporting on top of Hadoop was unrealistic. “We have petabytes available for querying and add 20 TB to it every day,” says Coffey.

Using a Hadoop framework as a calculation engine and HPE Vertica to analyze structured and unstructured data, Criteo generates intelligence and profit from big data. The company has experienced double-digit growth since its inception, and Vertica allows it to keep up with the ever-growing volume of data. Criteo uses Vertica to distribute and order data to optimize for specific query scenarios. Its Vertica cluster is 75 TB on 50 CPU-heavy nodes and growing.


Observed Coffey, “Vertica can do many things, but is best at accelerating ad hoc queries.” He made a decision to load the business-critical subset of the firm’s Hive data warehouse into Vertica, and to not allow data to be built or loaded from anywhere else.

The result: with a modicum of tuning, and nearly no day-to-day maintenance, analytic query throughput skyrocketed. Criteo loads about 2 TB of data per day into Vertica. It arrives mostly in daily batches and takes about an hour to load via Hadoop streaming jobs that use the Vertica command-line tool (vsql) to bulk insert.
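For a sense of the mechanics, here is a hedged sketch of a bulk insert in that spirit: rows are streamed into the vsql command-line tool, which runs a COPY ... FROM STDIN statement. The table name, file layout, and the specific vsql flags are assumptions for illustration only; Criteo’s actual jobs are not shown in this report.

```python
# Hedged sketch of a bulk insert in the spirit of Criteo's daily loads:
# tab-separated rows are streamed into the vsql command-line tool, which
# runs a COPY ... FROM STDIN. Table name, file layout, and the vsql flags
# are assumptions for illustration; authentication is omitted.
import subprocess

COPY_SQL = (
    "COPY daily_clicks (click_date, campaign_id, clicks) "
    "FROM STDIN DELIMITER E'\\t' DIRECT"
)

vsql_cmd = [
    "vsql",
    "-h", "vertica.example.com",  # target node
    "-U", "etl_user",             # load account
    "-d", "analytics",            # database
    "-c", COPY_SQL,               # statement that reads the piped rows
]

# In a Hadoop streaming job this stdin would come from the job's output;
# here a local TSV file stands in for it.
with open("daily_clicks.tsv", "rb") as tsv:
    subprocess.run(vsql_cmd, stdin=tsv, check=True)
```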

Here are the recommended best practices from Criteo:

Without question, the most important thing is to simplify

For example: sole-sourcing data for Vertica from Hadoop provides an implicit backup. It also allows for easy replication to multiple clusters. Because you can’t be an expert in everything, focus is key. Plus, it’s easier to train colleagues to contribute to a simple architecture.

Optimizations tend to make systems complex

If your system is already distributed (for example, in Hadoop or Vertica), scale out (or perhaps up) until that no longer works. In Coffey’s opinion, it’s okay to waste some CPU cycles. “Hadoop was practically designed for it,” states Coffey. “Vertica lets us do things we were otherwise incapable of doing and with very little DBA overhead—we actually don’t have a Vertica database administrator—and our users consistently tell us it’s their favorite tool we provide.”

Coffey estimates that, thanks to its flexible projections, performance with Vertica can be orders of magnitude better than Hadoop solutions with very little effort.

Keeping the Business on the Right Big-Data Path

Although Criteo doesn’t formally call it a “Center of Excellence,” it does have a central team dedicated to making sure that all activities around big-data analytics follow best practices. Says Coffey:

It fits the definition of a Center of Excellence because we have a mix of professionals who understand how databases work at the inner…
