
Big Data All Stars



Introduction

Those of us looking to take a significant step towards creating a data-driven business sometimes need a little inspiration from those that have traveled the path we are looking to tread. This book presents a series of real-world stories from those on the big data frontier who have moved beyond experimentation to creating sustainable, successful big data solutions within their organizations. Read these stories to get an inside look at nine “big data all-stars” who have been recognized by MapR and Datanami as having achieved great success in the expanding field of big data. Use the examples in this guide to help you develop your own methods, approaches, and best practices for creating big data solutions within your organization. Whether you are a business analyst, data scientist, enterprise architect, IT administrator, or developer, you’ll gain key insights from these big data luminaries—insights that will help you tackle the big data challenges you face in your own company.


How comScore Uses Hadoop and MapR to Build its Business

Michael Brown, CTO at comScore

comScore uses MapR to manage and scale their Hadoop cluster of 450 servers, create more files, process more data faster, and produce better streaming and random I/O results. MapR allows comScore to easily access data in the cluster and just as easily store it in a variety of warehouse environments.

Making Good Things Happen at Wells Fargo

Paul Cao, Director of Data Services for Wells Fargo’s Capital Markets business

Wells Fargo uses MapR to serve the company’s data needs across the entire banking business, which involves a variety of data types including reference data, market data, and structured and unstructured data, all under the same umbrella. Using NoSQL and Hadoop, their solution requires the utmost in security, ease of ingest, ability to scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.

Coping with Big Data at Experian–“Don’t Wait, Don’t Stop”

Tom Thomas, Director of IT at Experian

Experian uses MapR to store in-bound source data. The files are then available for analysts to query with SQL via Hive, without the need to build and load a structured database. Experian is now able to achieve significantly more processing power and storage space, and clients have access to deeper data.

Trevor Mason and Big Data: Doing What Comes Naturally

Trevor Mason, Vice President for Technology Research at IRI

IRI used MapR to maximize file system performance, facilitate the use of a large number of smaller files, and send files via FTP from the mainframe directly to the cluster. With Hadoop, they have been able to speed up data processing while reducing mainframe load, saving more than $1.5 million.

Leveraging Big Data to Economically Fuel Growth

Kevin McClowry, Director of Analytics Application Development at TransUnion

TransUnion uses a hybrid architecture made of commercial databases and Hadoop so that their analysts can work with data in a way that was previously out of reach. The company is introducing the analytics architecture worldwide and sizing it to fit the needs and resources of each country’s operation.


Making Big Data Work for a Major Oil & Gas Equipment Manufacturer

Warren Sharp, Big Data Engineer at National Oilwell Varco (NOV)

NOV created a data platform for time-series data from sensors and control systems to support deep analytics and machine learning. The organization is now able to build, test, and deliver complicated condition-based maintenance models and applications.

The NIH Pushes the Boundaries of Health Research with Data Analytics

Chuck Lynch, Chief Knowledge Officer at National Institutes of Health

The National Institutes of Health created a five-server cluster that enables the office to effectively apply analytical tools to newly-shared data. NIH can now do things with health science data it couldn’t do before, and in the process, advance medicine.

Keeping an Eye on the Analytic End Game at UnitedHealthcare

Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare

UnitedHealthcare uses Hadoop as a basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims, prescriptions, plan participants, care providers, and claim review outcomes. They can now identify mispaid claims in a systematic, consistent way.

Creating Flexible Big Data Solutions for Drug Discovery

David Tester, Application Architect at Novartis Institutes for Biomedical Research

Novartis Institutes for Biomedical Research built a workflow system that uses Hadoop for performance and robustness. Bioinformaticians use their familiar tools and metadata to write complex workflows, and researchers can take advantage of the tens of thousands of experiments that public organizations have conducted.


How comScore Uses Hadoop and MapR to Build its Business

Michael Brown, CTO at comScore

When comScore was founded in 1999, Mike Brown, the company’s first engineer, was immediately immersed in the world of Big Data.

The company was created to provide digital marketing intelligence and digital media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown’s job was to create the architecture and design to support the founders’ ambitious plans.

It worked. Over the past 15 years comScore has built a highly successful business and a customer base composed of some of the world’s top companies—Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC, to name just a few. Overall the company has more than 2,100 clients worldwide. Measurements are derived from 172 countries, with 43 markets reported.

To service this extensive client base, well over 1.8 trillion interactions are captured monthly, equal to about 40% of the monthly page views of the entire Internet. This is Big Data on steroids.

Brown, who was named CTO in 2012, continues to grow and evolve the company’s IT infrastructure to keep pace with this constantly increasing data deluge.

“We were a Dell shop from the beginning. In 2002 we put together our own grid processing stack to tie all our systems together in order to deal with the fast growing data volumes,” Brown recalls.

Introducing Unified Digital Measurement

In addition to its ongoing business, in 2009 the company embarked on a new initiative called Unified Digital Measurement (UDM), which directly addresses the frequent disparity between census-based site analytics data and panel-based audience measurement data. UDM blends these two approaches into a “best of breed” approach that combines person-level measurement from the two million person comScore global panel with census-informed consumption to account for 100 percent of a client’s audience.

UDM helped prompt a new round of IT infrastructure upgrades. “The volume of data was growing rapidly and processing requirements were growing dramatically as well,” Brown says. “In addition, our clients were asking us to turn the data around much faster. So we looked into building our own stack again, but decided we’d be better off adopting a well accepted, open source, heavy duty processing model—Hadoop.”

With the implementation of Hadoop, comScore continued to expand its server cluster. Multiple servers also meant they had to solve the Hadoop shuffle problem. During the high volume, parallel processing of data sets coming in from around the world, data is scattered across the server farm. To count the number of events, all this data has to be gathered, or “shuffled,” into one location.
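
To make the shuffle step concrete, here is a minimal sketch of the map-shuffle-reduce pattern in plain Python, with dictionaries standing in for the data movement a real Hadoop cluster performs across machines. The event records and field names are invented for illustration and are not comScore data.

```python
from collections import defaultdict
from itertools import chain

# Toy per-server event logs standing in for data scattered across a server farm
# (event names and country codes are invented for illustration).
server_logs = [
    [("page_view", "us"), ("ad_click", "us"), ("page_view", "uk")],  # server 1
    [("page_view", "us"), ("page_view", "de")],                      # server 2
]

def map_phase(records):
    """Emit (key, 1) pairs for each event, as a mapper would."""
    return [((event, country), 1) for event, country in records]

def shuffle_phase(mapped):
    """Group every value for the same key together -- the 'shuffle' step that
    moves intermediate data so each key ends up in one place."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts to produce per-key event totals."""
    return {key: sum(values) for key, values in groups.items()}

mapped = list(chain.from_iterable(map_phase(log) for log in server_logs))
print(reduce_phase(shuffle_phase(mapped)))
# {('page_view', 'us'): 2, ('ad_click', 'us'): 1, ('page_view', 'uk'): 1, ('page_view', 'de'): 1}
```

The shuffle_phase above is the step the text refers to: before any counting can happen, all values for a given key have to be brought together in one location.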

comScore needed a Hadoop platform that could not only scale, but also provide data protection and high availability, while being easy to use. It was requirements like these that led Brown to adopt the MapR distribution for Hadoop.

He was not disappointed—by using the MapR distro, the company is able to more easily manage and scale their Hadoop cluster, create more files and process more data faster, and produce better streaming and random I/O results than other Hadoop distributions. “With MapR we see a 3X performance increase running the same data and the same code—the jobs just run faster.”

In addition, the MapR solution provides the requisite data protection and disaster recovery functions: “MapR has built in to the design an automated DR strategy,” Brown notes.

Solving the Shuffle

He said they leveraged a feature in MapR known as volumes to directly address the shuffle problem. “It allows us to make this process run superfast. We reduced the processing time from 36 hours to three hours—no new hardware, no new software, no new anything, just a design change. This is just what we needed to colocate the data for efficient processing.”

Using volumes to optimize processing was one of several unique solutions that Brown and his team applied to processing comScore’s massive amounts of data. Another innovation is pre-sorting the data before it is loaded into the Hadoop cluster. Sorting optimizes the data’s storage compression ratio, from the usual ratio of 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade of benefits: more efficient processing with far fewer IOPS, less data to read from disk, and less equipment which, in turn, means savings on power, cooling and floor space.
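
The effect of pre-sorting on compression can be reproduced with a small, self-contained experiment. The sketch below generates synthetic log-style records (the fields are invented for illustration) and compares gzip ratios before and after sorting; actual ratios depend entirely on the data, so the 3:1 and 8:1 figures above are comScore’s, not something this script will output.

```python
import gzip
import random

# Synthetic log-style records: sorting groups similar keys together, which lets
# the compressor find longer repeated runs within its window.
random.seed(0)
sites = [f"site-{i:04d}" for i in range(200)]
records = [f"{random.choice(sites)},page_view,{random.randint(1, 28):02d}\n"
           for _ in range(200_000)]

def gzip_size(lines):
    """Return the gzip-compressed size of the concatenated lines, in bytes."""
    return len(gzip.compress("".join(lines).encode("utf-8")))

raw_bytes = len("".join(records).encode("utf-8"))
print(f"unsorted compression ratio:   {raw_bytes / gzip_size(records):.1f}:1")
print(f"pre-sorted compression ratio: {raw_bytes / gzip_size(sorted(records)):.1f}:1")
```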

“HDFS is great internally,” says Brown. “But to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount HDFS as NFS and then use native tools, whether they’re in Windows, Unix, Linux or whatever. NFS allowed our enterprise to easily access data in the cluster and just as easily store it in a variety of warehouse environments.”
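
What “use native tools” means in practice is that once the cluster filesystem is NFS-mounted, ordinary file APIs work against cluster paths. The sketch below copies result files out of the cluster with nothing but the standard library; the /mapr/... mount point, directory layout, and file names are assumptions for illustration, not comScore’s actual environment.

```python
import shutil
from pathlib import Path

# Hypothetical NFS mount point for the cluster filesystem (path is an assumption).
CLUSTER = Path("/mapr/my.cluster.com/projects/measurement")
LOCAL_EXPORT = Path("/data/warehouse_staging")

def copy_day_to_warehouse(day: str) -> int:
    """Copy one day's result files out of the cluster with plain file I/O,
    with no separate HDFS export step."""
    copied = 0
    for src in (CLUSTER / day).glob("part-*"):
        shutil.copy(src, LOCAL_EXPORT / f"{day}_{src.name}")
        copied += 1
    return copied

if __name__ == "__main__":
    print(copy_day_to_warehouse("2015-06-01"), "files copied")
```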

For the near future, Brown says the comScore IT infrastructure will continue to scale to meet new customer demand. The Hadoop cluster has grown to 450 servers with 17,000 cores and more than 10 petabytes of disk.

MapR’s distro of Hadoop is also helping to support a major new product announced in 2012 and enjoying rapid growth. Known as validated Campaign Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single, third-party source. vCE also allows the identification of non-human traffic and fraudulent delivery.

Start Small

When asked if he had any advice for his peers in IT who are also wrestling with Big Data projects, Brown commented, “We all know we have to process mountains of data, but when you begin developing your environment, start small. Cut out a subset of the data and work on that first while testing your code and making sure everything functions properly. Get some small wins. Then you can move on to the big stuff.”


Making Good Things Happen at Wells Fargo

Paul Cao, Director of Data Services for Wells Fargo’s Capital Markets business

When Paul Cao joined Wells Fargo several years ago, his timing was perfect. Big Data analytic technology had just made a major leap forward, providing him with the tools he needed to implement an ambitious program designed to meet the company’s analytic needs.

Wells Fargo is big—a nationwide, community-based financial services company with $1.8 trillion in assets. It provides its various services through 8,700 locations as well as on the Internet and through mobile apps. The company has some 265,000 employees and offices in 36 countries. They generate a lot of data.

Cao has been working with data for twenty years. Now, as the Director of Data Services for Wells Fargo’s Capital Markets business, he is creating systems that support the Business Intelligence and analytic needs of its far-flung operations.

Meeting Customer and Regulatory Needs

“We receive massive amounts of data from a variety of different systems, covering all types of securities (equity, fixed income, FX, etc.) from around the world,” Cao says. “Many of our models reflect the interactions between these systems—it’s multi-layered. The analytic solutions we offer are not only driven by customers’ needs, but by regulatory considerations as well.

“We serve the company’s data needs across the entire banking business and so we work with a variety of data types including reference data, market data, structured and unstructured data, all under the same umbrella,” he continues.

“Because of the broad scope of the data we are dealing with, we needed tools that could handle the volume, speed and variety of data as well as all the requirements that had to be met in order to process that data. Just one example is market tick data. For North American cash equities, we are dealing with up to three million ticks per second, a huge amount of data that includes all the different price points for the various equity stocks and the movement of those stocks.”

Enterprise NoSQL on Hadoop

Cao says that given his experience with various Big Data solutions in the past and the recent revolution in the technology, he and his team were well aware of the limitations of more traditional relational databases. So they concentrated their attention on solutions that support NoSQL and Hadoop. They wanted to deal with vendors like MapR that could provide commercial support for the Hadoop distribution rather than relying on open source channels. The vendors had to meet criteria such as their ability to provide the utmost in security, ease of ingest, ability to scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.

Cao explains that he is partnering with the Wells Fargo Enterprise Data & Analytics and Enterprise Technology Infrastructure teams to develop a platform servicing many different kinds of capital markets related data, including files of all sizes and real time and batch data from a variety of sources within Wells Fargo. Multi-tenancy is a must to cost-efficiently and securely share IT resources and allow different business lines, data providers and data consumer applications to coexist on the same cluster with true job isolation and customized security. The MapR solution, for example, provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement, job execution and network access.

Dramatic Change to Handling Data

“The new technology we are introducing is not an incremental change—this is a dramatic change in the way we are handling data,” Cao says. “Among our challenges is to get users to accept working with the new Hadoop and NoSQL infrastructure, which is so different from what they were used to. Within Data Services, we have been fortunate to have people who not only know the new technology, but really know the business. This domain expertise is essential to an understanding of how to deploy and apply the new technologies to solve essential business problems and work successfully with our users.”

When asked what advice he would pass on to others working with Big Data, Cao reiterates his emphasis on gaining a solid understanding of the new technologies along with a comprehensive knowledge of their business domain. “This allows you to marry business and technology to solve business problems,” he concludes. “You’ll be able to understand your users’ concerns and work with them to make good things happen.”


Coping with Big Data at Experian–“Don’t Wait, Don’t Stop”

Tom Thomas, Director of IT at Experian

Experian is no stranger to Big Data. The company can trace its origins back to 1803 when a group of London merchants began swapping information on customers who had failed to meet their debts.

Fast forward 211 years. The rapid growth of the credit reference industry and the market for credit risk management services set the stage for the reliance on increasing amounts of consumer and business data that has culminated in an explosion of Big Data. Data that is Experian’s life’s blood.

With global revenues of $4.8 billion ($2.4 billion in North America) and 16,000 employees worldwide (6,000 in North America), Experian is an international information services organization working with a majority of the world’s largest companies. It has four primary business lines: credit services, decision analytics, direct-to-consumer products, and a marketing services group.

Tom Thomas is the director of the Data Development Technology Group within the Consumer Services Division. “Our group provides production operations support as well as technology solutions for our various business units including Automotive, Business, Collections, Consumer, Fraud, and various Data Lab joint-development initiatives,” he explains. “I work closely with Norbert Frohlich and Dave Garnier, our lead developers. They are responsible for the design and development of our various solutions, including those that leverage MapR Hadoop environments.”

Processing More Data in Less Time

Until recently, the Group had been getting by, as Thomas puts it, “…with solutions running on a couple of Windows servers and a SAN.” But as the company added new products and new sets of data quality rules, more data had to be processed in the same or less time. It was time to upgrade. But simply adding to the existing Windows/SAN system wasn’t an option—too cumbersome and expensive.

So the group upgraded to a Linux-based HPC cluster with—for the time being—six nodes. Says Thomas, “We have a single customer solution right now. But as we get new customers who can use this kind of capability, we can add additional nodes and storage and processing capacity at the same time.”


NFS Provides Direct Access to Data

“All our solutions leverage MapR NFS functionality,” he continues. “This allows us to transition from our previous internal or SAN storage to Hadoop by mounting the cluster directly. In turn, this provides us with access to the data via HDFS and Hadoop environment tools, such as Hive.”
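
As a concrete illustration of that workflow, the sketch below declares a Hive external table over files that are already sitting in the cluster and queries it with plain SQL. It is a minimal example only: the connection details, table name, columns, and file location are invented for illustration, and it assumes the PyHive client and a reachable HiveServer2 endpoint rather than anything specific to Experian’s environment.

```python
from pyhive import hive  # assumes the PyHive package and a HiveServer2 endpoint

# Connection details are placeholders, not Experian's actual environment.
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst", database="default")
cur = conn.cursor()

# Declare an external table over files that already sit in the cluster
# (e.g. landed over the NFS mount) -- no separate database load step.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS inbound_source (
        account_id STRING,
        event_date STRING,
        metric_value DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/inbound/current'
""")

# Analysts can then query with ordinary SQL via Hive.
cur.execute("""
    SELECT event_date, COUNT(*) AS records, AVG(metric_value) AS avg_metric
    FROM inbound_source
    GROUP BY event_date
    ORDER BY event_date
""")
for row in cur.fetchall():
    print(row)
```

Because the table is external, dropping it leaves the underlying files in place, which matches the idea of querying landed data without building and loading a separate structured database.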

ETL tools like DMX-h from Syncsort also figured prominently in the new infrastructure, as does MapR NFS. MapR is the only distribution for Apache Hadoop that leverages the full power of the NFS protocol for remote access to shared disks across the network.

“Our first solution includes well-known and defined metrics and aggregations,” Thomas says. “We leverage DMX-h to determine metrics for each record and pre-aggregate other metrics, which are then stored in Hadoop to be used in downstream analytics as well as real-time rules-based actions. Our second solution follows a traditional data operations flow, except in this case we use DMX-h to prepare in-bound source data that is then stored in MapR Hadoop. Then we run Experian-proprietary models that read the data via Hive and create client-specific and industry-unique results.

Data Analysts Use SQL to Query on Hadoop

“Our latest endeavor copies data files from a legacy dual application server and SAN product solution to a MapR Hadoop cluster quite easily, as facilitated by the MapR NFS functionality,” Thomas continues. “The files are then available for analysts to query with SQL via Hive – without the need to build and load a structured database. Since we are just starting to work with this data, we are not ‘stuck’ with that initial database schema that we would have developed, and thus eliminated that rework time. Our analysts have Tableau and DMX-h available to them, and will generate our initial reports and any analytics data files. Once the useful data, reports, and results formats are firmed up, we will work

Trang 13

equipped with SAN storage. Two of the servers from the cluster are also application servers running SmartLoad code and components. The result is a more efficient use of hardware with no need for separate servers to run the application.

Improved Speed to Market

Here’s how Thomas summarizes the benefits of the upgraded system to both the company and its customers: “We are realizing increased processing speed which leads to shorter delivery times. In addition, reduced storage expenses means that we can store more, not acquire less. Both the company’s internal operations and our clients have access to deeper data supporting and aiding insights into their business areas.

“Overall, we are seeing reduced storage expenses while gaining processing and storage capabilities and capacities,” he adds. “This translates into an improved speed to market for our business units. It also positions our Group to grow our Hadoop ecosystem to meet future Big Data requirements.”

And when it comes to being a Big Data All Star in today’s information-intensive world, Thomas’ advice is short and to the point: “Don’t wait and don’t stop.”

By taking advantage of the Hadoop cluster, the team was able to realize substantially more processing power and storage space, without the costs associated with traditional blade servers equipped with SAN storage.

Trevor Mason and Big Data: Doing What Comes Naturally

Trevor Mason, Vice President for Technology Research at IRI

Mason is the vice president for Technology Research at IRI, a 30-year-old Chicago-based company that provides information, analytics, business intelligence and domain expertise for the world’s leading CPG, retail and healthcare companies.

“I’ve always had a love of mathematics and proved to be a natural when it came to computer science,” Mason says. “So I combined both disciplines and it has been my interest ever since. I joined IRI 20 years ago to work with Big Data (although it wasn’t called that back then). Today I head up a group that is responsible for forward-looking research into tools and systems for processing, analyzing and managing massive amounts of data. Our mission is two-fold: keep technology costs as low as possible while providing our clients with the state-of-the-art analytic and intelligence tools they need to drive their insights.”

Big Data Challenges

Recent challenges facing Mason and his team included a mix of business and technological issues. They were attempting to realize significant cost reductions by reducing mainframe load, and continue to reduce mainframe support risk that is increasing due to the imminent retirement of key mainframe support personnel. At the same time, they wanted to build the foundations for a more cost effective, flexible and expandable data processing and storage environment.

The technical problem was equally challenging. The team wanted to achieve random extraction rates averaging 600,000 records per second, peaking to over one million records per second from a 15 TB fact table. This table feeds a large multi-TB downstream client-facing reporting farm. Given IRI’s emphasis on economy, the solution had to be very efficient, using only 16 to 24 nodes.
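
Those targets are easier to reason about per node. The short calculation below spells out what the stated rates imply across the 16-to-24-node range; the record size used to estimate bandwidth is an assumed illustrative value, not a figure from IRI.

```python
# Rough per-node throughput implied by the stated targets.
PEAK_RECORDS_PER_SEC = 1_000_000
AVG_RECORDS_PER_SEC = 600_000
ASSUMED_RECORD_BYTES = 200  # illustrative assumption only, not an IRI figure

for nodes in (16, 24):
    avg_per_node = AVG_RECORDS_PER_SEC / nodes
    peak_per_node = PEAK_RECORDS_PER_SEC / nodes
    peak_mb_per_node = peak_per_node * ASSUMED_RECORD_BYTES / 1e6
    print(f"{nodes} nodes: ~{avg_per_node:,.0f} rec/s average, "
          f"~{peak_per_node:,.0f} rec/s peak "
          f"(~{peak_mb_per_node:.0f} MB/s at {ASSUMED_RECORD_BYTES} B/record)")
```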

“We looked at traditional warehouse technologies, but Hadoop was by far the most cost effective solution,” Mason says. “Within Hadoop we investigated all the main distributions and various hardware options before settling on MapR on a Cisco UCS (Unified Computing System) cluster.”

The fact table resides on the mainframe where it is updated and maintained daily. These functions are very complex and proved costly to migrate to the cluster. However, the extraction process, which represents the majority of the


“The solution was to keep the update and maintenance processes on the mainframe and maintain a synchronized copy on the Hadoop cluster by using our mainframe change logging process,” he notes. “All extraction processes go against the Hadoop cluster, significantly reducing the mainframe load. This met our objective of maximum performance with minimal new development.”

The team chose MapR to maximize file system performance, facilitate the use of a large number of smaller files, and take full advantage of its NFS capability so files could be sent via FTP from the mainframe directly to the cluster.
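
One way to picture that last point: because the cluster filesystem is NFS-mounted on an edge node, the FTP landing directory can simply be a cluster path, so mainframe extracts arrive in Hadoop-visible storage with no separate import step. The sketch below uses Python’s standard ftplib; the host name, credentials, and /mapr/... path are placeholders, not details of IRI’s setup.

```python
from ftplib import FTP

# Host, credentials and paths are placeholders for illustration; the key idea is
# that the FTP server's landing directory is a directory on the NFS-mounted
# cluster filesystem, so an upload lands directly in cluster storage.
FTP_HOST = "edge-node.example.com"
LANDING_DIR = "/mapr/prod.cluster/iri/landing/daily"  # NFS-mounted cluster path

def ship_extract(local_path: str, remote_name: str) -> None:
    """Send one mainframe extract file straight into the cluster via FTP."""
    with FTP(FTP_HOST) as ftp:
        ftp.login(user="loader", passwd="change-me")
        ftp.cwd(LANDING_DIR)
        with open(local_path, "rb") as fh:
            ftp.storbinary(f"STOR {remote_name}", fh)

if __name__ == "__main__":
    ship_extract("fact_table_delta.dat", "fact_table_delta_20150601.dat")
```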

Shaking up the System

They also gave their system a real workout. Recalls Mason, “To maximize efficiency we had to see how far we could push the hardware and software before it broke. After several months of pushing the system to its limits, we weeded out several issues, including a bad disk, a bad node, and incorrect OS, network and driver settings. We worked closely with our vendors to root out and correct these issues.”

Overall, he says, the development took about six months followed by two months of final testing and running in parallel with the regular production processes. He also stressed that “Much kudos go to the IRI engineering team and Zaloni consulting team who worked together to implement all the minute details needed to create the current fully functional production system in only six months.”

To accomplish their ambitious goals, the team took some unique approaches. For instance, the methods they used to organize the data and structure the extraction process allowed them to achieve between two million and three million records per second extraction rates on a 16-node cluster.

They also developed a way to always have a consistent view of the data used in the extraction process while continuously updating it.
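
The text does not spell out how that consistent view is maintained, so the following is only a generic sketch of one common approach, a build-then-swap snapshot: readers always work from an immutable version of the table while the next version is prepared off to the side. The class and field names are invented for illustration and are not IRI’s actual mechanism.

```python
import threading

class SnapshotTable:
    """Readers always see one immutable snapshot; a refresh builds the next
    version off to the side and swaps it in atomically."""
    def __init__(self, rows):
        self._snapshot = tuple(rows)   # immutable, safe to hand out to readers
        self._lock = threading.Lock()

    def read(self):
        return self._snapshot          # consistent view, never half-updated

    def refresh(self, new_rows):
        candidate = tuple(new_rows)    # build the new version first
        with self._lock:
            self._snapshot = candidate  # single atomic reference swap

table = SnapshotTable([("sku-1", 10), ("sku-2", 12)])
snapshot = table.read()                # an extraction works off this view
table.refresh([("sku-1", 11), ("sku-2", 12), ("sku-3", 5)])
print(snapshot)                        # old view stays internally consistent
print(table.read())                    # new view visible after the swap
```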

By far one of the most effective additions to the IRI IT infrastructure was the implementation of Hadoop. Before Hadoop, the technology team relied on the mainframe running 24×7 to process the data in accordance with their customers’ tight timelines. With Hadoop, they have been able to speed up the process while reducing mainframe load. The result: annual savings of more than $1.5 million.

Says Mason, “Hadoop is not only saving us money, it also provides a flexible platform that can easily scale to meet future corporate growth. We can do a lot more in terms of offering our customers unique analytic insights—the Hadoop platform and all its supporting tools allow us to work with large datasets in a highly parallel manner.

“IRI specialized in Big Data before the term became popular—this is not new to us,” he concludes. “Big Data has been our business now for more than 30 years. Our objective is to continue to find ways to collect, process and manage Big Data efficiently so we can provide our clients with leading insights to drive their business growth.”

And finally, when asked what advice he might have for others who would like to become Big Data All Stars, Mason is very clear: “Find and implement efficient and innovative ways to solve critical Big Data processing and management problems that result in tangible value to the company.”

