Introduction
Those of us looking to take a significant step towards creating a data-driven business sometimes need a little inspiration from those that have traveled the path we are looking to tread. This book presents a series of real-world stories from those on the big data frontier who have moved beyond experimentation to creating sustainable, successful big data solutions within their organizations. Read these stories to get an inside look at nine “big data all-stars” who have been recognized by MapR and Datanami as having achieved great success in the expanding field of big data. Use the examples in this guide to help you develop your own methods, approaches, and best practices for creating big data solutions within your organization. Whether you are a business analyst, data scientist, enterprise architect, IT administrator, or developer, you’ll gain key insights from these big data luminaries—insights that will help you tackle the big data challenges you face in your own company.
How comScore Uses Hadoop and MapR to Build its Business
Michael Brown, CTO at comScore
comScore uses MapR to manage and scale their Hadoop cluster of 450 servers, create more files, process more data faster, and produce better streaming and random I/O results. MapR allows comScore to easily access data in the cluster and just as easily store it in a variety of warehouse environments.
Making Good Things Happen at Wells Fargo
Paul Cao, Director of Data Services for Wells Fargo’s Capital Markets business
Wells Fargo uses MapR to serve the company’s data needs across the entire banking business, which involve a variety of data types including reference data, market data, and structured and unstructured data, all under the same umbrella. Using NoSQL and Hadoop, their solution requires the utmost in security, ease of ingest, ability to scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.
Coping with Big Data at Experian–“Don’t Wait, Don’t Stop”
Tom Thomas, Director of IT at Experian
Experian uses MapR to store in-bound source data. The files are then available for analysts to query with SQL via Hive, without the need to build and load a structured database. Experian is now able to achieve significantly more processing power and storage space, and clients have access to deeper data.
Trevor Mason and Big Data: Doing What Comes Naturally
Trevor Mason, Vice President Technology Research at IRI
IRI used MapR to maximize file system performance, facilitate the use of a large number of smaller files, and send files via FTP from the mainframe directly to the cluster. With Hadoop, they have been able to speed up data processing while reducing mainframe load, saving more than $1.5 million.
Leveraging Big Data to Economically Fuel Growth
Kevin McClowry, Director of Analytics Application Development at TransUnion
TransUnion uses a hybrid architecture made of commercial databases and Hadoop so that their analysts can work with data in a way that was previously out of reach. The company is introducing the analytics architecture worldwide and sizing it to fit the needs and resources of each country’s operation.
Making Big Data Work for a Major Oil & Gas Equipment Manufacturer
Warren Sharp, Big Data Engineer at National Oilwell VARCO (NOV)
NOV created a data platform for time-series data from sensors and control systems to support deep analytics and machine learning. The organization is now able to build, test, and deliver complicated condition-based maintenance models and applications.
The NIH Pushes the Boundaries of Health Research with Data Analytics
Chuck Lynch, Chief Knowledge Officer at National Institutes of Health
The National Institutes of Health created a five-server cluster that enables the office to effectively apply analytical tools to newly shared data. NIH can now do things with health science data it couldn’t do before, and in the process, advance medicine.
Keeping an Eye on the Analytic End Game at UnitedHealthcare
Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare
UnitedHealthcare uses Hadoop as a basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims, prescriptions, plan participants, care providers, and claim review outcomes. They can now identify mispaid claims in a systematic, consistent way.
Creating Flexible Big Data Solutions for Drug Discovery
David Tester, Application Architect at Novartis Institutes for Biomedical Research
Novartis Institutes for Biomedical Research built a workflow system that uses Hadoop for performance and robustness. Bioinformaticians use their familiar tools and metadata to write complex workflows, and researchers can take advantage of the tens of thousands of experiments that public organizations have conducted.
How comScore Uses Hadoop and MapR to Build its Business
Michael Brown, CTO at comScore
When comScore was founded in 1999, Mike Brown, the company’s first engineer, was immediately immersed in the world of Big Data.
The company was created to provide digital marketing intelligence and digital media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown’s job was to create the architecture and design to support the founders’ ambitious plans.
It worked. Over the past 15 years comScore has built a highly successful business and a customer base composed of some of the world’s top companies—Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC, to name just a few. Overall the company has more than 2,100 clients worldwide. Measurements are derived from 172 countries, with 43 markets reported.
To service this extensive client base, well over 1.8 trillion interactions are captured monthly, equal to about 40% of the monthly page views of the entire Internet. This is Big Data on steroids.
Brown, who was named CTO in 2012, continues to grow and evolve the company’s IT infrastructure to keep pace with this constantly increasing data deluge.
“We were a Dell shop from the beginning. In 2002 we put together our own grid processing stack to tie all our systems together in order to deal with the fast-growing data volumes,” Brown recalls.
Introducing Unified Digital Measurement
In addition to its ongoing business, in 2009 the company embarked on a new initiative called Unified Digital Measurement (UDM), which directly addresses the frequent disparity between census-based site analytics data and panel-based audience measurement data. UDM blends the two into a “best of breed” approach that combines person-level measurement from the two-million-person comScore global panel with census-informed consumption to account for 100 percent of a client’s audience.
UDM helped prompt a new round of IT infrastructure upgrades. “The volume of data was growing rapidly and processing requirements were growing dramatically as well,” Brown says.
“In addition, our clients were asking us to turn the data around much faster. So we looked into building our own stack again, but decided we’d be better off adopting a well-accepted, open source, heavy-duty processing model—Hadoop.”
With the implementation of Hadoop, comScore continued to expand its server cluster. Multiple servers also meant they had to solve the Hadoop shuffle problem. During the high-volume, parallel processing of data sets coming in from around the world, data is scattered across the server farm. To count the number of events, all this data has to be gathered, or “shuffled,” into one location.
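To make the shuffle concrete, here is a minimal sketch of such an event count written as Hadoop Streaming-style mapper and reducer scripts in Python. This is not comScore’s code; the field positions and file names are assumptions for illustration. The point is that every mapper emits partial counts that the framework must shuffle across the network so that all values for a given key land on one reducer.

    # mapper.py - reads raw records on stdin and emits one (event_type, 1) pair per record.
    # Assumes a tab-delimited input where the second column holds the event name (hypothetical).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue                      # skip malformed records
        event_type = fields[1]
        print("%s\t1" % event_type)

    # reducer.py - after the shuffle, receives all pairs for each key in sorted order and sums them.
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, total))
            current_key, total = key, 0
        total += int(count)
    if current_key is not None:
        print("%s\t%d" % (current_key, total))

Every byte the reducers consume has to be pulled together from wherever the mappers ran, which is exactly the data movement comScore needed to tame.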
comScore needed a Hadoop platform that could not only scale, but also provide data protection, high availability, and ease of use. It was requirements like these that led Brown to adopt the MapR distribution for Hadoop.
He was not disappointed—by using the MapR distro, the company is able to more easily manage and scale their Hadoop cluster, create more files and process more data faster, and produce better streaming and random I/O results than other Hadoop distributions. “With MapR we see a 3X performance increase running the same data and the same code—the jobs just run faster.”
In addition, the MapR solution provides the requisite data protection and disaster recovery functions: “MapR has built in to the design an automated DR strategy,” Brown notes.
Solving the Shuffle
He said they leveraged a feature in MapR known as volumes to directly address the shuffle problem. “It allows us to make this process run superfast. We reduced the processing time from 36 hours to three hours—no new hardware, no new software, no new anything, just a design change. This is just what we needed to colocate the data for efficient processing.”
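For readers unfamiliar with MapR volumes: a volume is a managed directory tree whose data placement can be pinned to a subset of nodes. A rough sketch of that idea is below; the volume name, mount path, and topology label are invented, and this is a generic illustration of the mechanism rather than comScore’s actual configuration.

    # Sketch: create a MapR volume pinned to one node topology so related data is colocated.
    # Volume name, path, and topology are hypothetical placeholders.
    import subprocess

    def create_colocated_volume(name, mount_path, topology):
        """Create a MapR volume whose data is restricted to the given node topology."""
        subprocess.run([
            "maprcli", "volume", "create",
            "-name", name,
            "-path", mount_path,
            "-topology", topology,
        ], check=True)

    if __name__ == "__main__":
        create_colocated_volume("events_daily", "/data/events/daily", "/data/rack01")

Because a job’s data can be written into such a volume, the records that need to be combined are already sitting on the same set of nodes when the next stage reads them.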
Using volumes to optimize processing was one of several unique solutions that Brown and his team applied to processing comScore’s massive amounts of data. Another innovation is pre-sorting the data before it is loaded into the Hadoop cluster. Sorting improves the data’s storage compression ratio from the usual 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade of benefits: more efficient processing with far fewer IOPS, less data to read from disk, and less equipment which, in turn, means savings on power, cooling and floor space.
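A small, self-contained illustration of why pre-sorting helps: block compressors such as gzip find longer repeated runs when similar records sit next to each other. The data below is synthetic and the exact ratios will differ from comScore’s 3:1 and 8:1 figures; it simply demonstrates the effect.

    # Illustrative only: sorted input compresses better than unsorted input because
    # repeated values end up adjacent. The records here are synthetic.
    import gzip
    import random

    random.seed(42)
    sites = ["siteA", "siteB", "siteC", "siteD"]
    records = ["%s,%d\n" % (random.choice(sites), random.randint(0, 5)) for _ in range(100000)]

    unsorted_blob = "".join(records).encode("utf-8")
    sorted_blob = "".join(sorted(records)).encode("utf-8")

    def ratio(raw):
        return len(raw) / len(gzip.compress(raw))

    print("unsorted compression ratio: %.1f:1" % ratio(unsorted_blob))
    print("sorted   compression ratio: %.1f:1" % ratio(sorted_blob))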
“HDFS is great internally,” says Brown. “But to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount HDFS as NFS and then use native tools, whether they’re in Windows, Unix, Linux or whatever. NFS allowed our enterprise to easily access data in the cluster and just as easily store it in a variety of warehouse environments.”
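Because an NFS-mounted cluster looks like an ordinary directory tree, standard file APIs work against it with no export step. A minimal sketch is below; the mount point and file names are assumptions for illustration, not comScore’s actual layout.

    # Assuming the cluster is NFS-mounted at /mapr/my.cluster.com, ordinary file I/O just works.
    # The mount point and paths are hypothetical.
    import csv

    CLUSTER_DIR = "/mapr/my.cluster.com/data/events/daily"

    # Write a file straight into the cluster with a native tool...
    with open(CLUSTER_DIR + "/sample.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event", "count"])
        writer.writerow(["page_view", 12345])

    # ...and read it back the same way, or hand the same path to any warehouse loader.
    with open(CLUSTER_DIR + "/sample.csv") as f:
        for row in csv.reader(f):
            print(row)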
For the near future, Brown says the comScore IT infrastructure will continue to scale to meet new customer demand. The Hadoop cluster has grown to 450 servers with 17,000 cores and more than 10 petabytes of disk.
MapR’s distro of Hadoop is also helping to support a major new product announced in 2012 and enjoying rapid growth. Known as validated Campaign Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single, third-party source. vCE also allows the identification of non-human traffic and fraudulent delivery.
Start Small
When asked if he had any advice for his peers in IT who are also wrestling with Big Data projects, Brown commented, “We all know we have to process mountains of data, but when you begin developing your environment, start small. Cut out a subset of the data and work on that first while testing your code and making sure everything functions properly. Get some small wins. Then you can move on to the big stuff.”
Making Good Things Happen at Wells Fargo
Paul Cao, Director of Data Services for Wells Fargo’s Capital Markets business
When Paul Cao joined Wells Fargo several years ago, his timing was perfect. Big Data analytic technology had just made a major leap forward, providing him with the tools he needed to implement an ambitious program designed to meet the company’s analytic needs.
Wells Fargo is big—a nationwide, community-based financial services company with $1.8 trillion in assets. It provides its various services through 8,700 locations as well as on the Internet and through mobile apps. The company has some 265,000 employees and offices in 36 countries. They generate a lot of data.
Cao has been working with data for twenty years. Now, as the Director of Data Services for Wells Fargo’s Capital Markets business, he is creating systems that support the Business Intelligence and analytic needs of its far-flung operations.
Meeting Customer and Regulatory Needs
“We receive massive amounts of data from a variety of different systems, covering all types of securities (equity, fixed income, FX, etc.) from around the world,” Cao says. “Many of our models reflect the interactions between these systems—it’s multi-layered. The analytic solutions we offer are not only driven by customers’ needs, but by regulatory considerations as well.
“We serve the company’s data needs across the entire banking business and so we work with a variety of data types including reference data, market data, structured and unstructured data, all under the same umbrella,” he continues.
“Because of the broad scope of the data we are dealing with, we needed tools that could handle the volume, speed and variety of data as well as all the requirements that had to be met in order to process that data. Just one example is market tick data. For North American cash equities, we are dealing with up to three million ticks per second, a huge amount of data that includes all the different price points for the various equity stocks and the movement of those stocks.”
Enterprise NoSQL on Hadoop
Cao says that given his experience with various Big Data solutions in the past and the recent revolution in the technology, he and his team were well aware of the limitations of more traditional relational databases. So they concentrated their attention on solutions that support NoSQL and Hadoop.
They wanted to deal with vendors like MapR that could provide commercial support for the Hadoop distribution rather than relying on open source channels. The vendors had to meet criteria such as their ability to provide the utmost in security, ease of ingest, ability to scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.
Cao explains that he is partnering with the Wells Fargo Enterprise Data & Analytics and Enterprise Technology Infrastructure teams to develop a platform servicing many different kinds of capital markets related data, including files of all sizes and real-time and batch data from a variety of sources within Wells Fargo. Multi-tenancy is a must to cost-efficiently and securely share IT resources and allow different business lines, data providers and data consumer applications to coexist on the same cluster with true job isolation and customized security. The MapR solution, for example, provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement, job execution and network access.
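One common way that kind of logical partitioning is expressed on a MapR cluster is a volume per tenant, each with its own mount path, node topology, and storage quota. The tenant names, paths, and quota sizes below are invented for illustration, and the sketch shows the general pattern rather than Wells Fargo’s actual configuration.

    # Sketch: one volume per tenant so each business line gets its own mount path,
    # data placement (topology), and storage quota. All names and values are hypothetical.
    import subprocess

    TENANTS = [
        {"name": "equities",     "topology": "/data/rack01", "quota": "50T"},
        {"name": "fixed_income", "topology": "/data/rack02", "quota": "30T"},
    ]

    for t in TENANTS:
        subprocess.run([
            "maprcli", "volume", "create",
            "-name", "tenant_%s" % t["name"],
            "-path", "/tenants/%s" % t["name"],
            "-topology", t["topology"],
            "-quota", t["quota"],
        ], check=True)

Per-volume placement and quotas then keep one tenant’s jobs and data from interfering with another’s while everything shares the same physical cluster.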
Dramatic Change to Handling Data
“The new technology we are introducing is not an incremental change—this is a dramatic change in the way we are handling data,” Cao says. “Among our challenges is to get users to accept working with the new Hadoop and NoSQL infrastructure, which is so different from what they were used to. Within Data Services, we have been fortunate to have people who not only know the new technology, but really know the business. This domain expertise is essential to an understanding of how to deploy and apply the new technologies to solve essential business problems and work successfully with our users.”
When asked what advice he would pass on to others working with Big Data, Cao reiterates his emphasis on gaining a solid understanding of the new technologies along with a comprehensive knowledge of their business domain. “This allows you to marry business and technology to solve business problems,” he concludes. “You’ll be able to understand your users’ concerns and work with them to make good things happen.”
Coping with Big Data at Experian–“Don’t Wait, Don’t Stop”
Tom Thomas, Director of IT at Experian
Experian is no stranger to Big Data. The company can trace its origins back to 1803, when a group of London merchants began swapping information on customers who had failed to meet their debts.
Fast forward 211 years. The rapid growth of the credit reference industry and the market for credit risk management services set the stage for the reliance on increasing amounts of consumer and business data that has culminated in an explosion of Big Data, data that is Experian’s life’s blood.
With global revenues of $4.8 billion ($2.4 billion in North America) and 16,000 employees worldwide (6,000 in North America), Experian is an international information services organization working with a majority of the world’s largest companies. It has four primary business lines: credit services, decision analytics, direct-to-consumer products, and a marketing services group.
Tom Thomas is the director of the Data Development Technology Group within the Consumer Services Division. “Our group provides production operations support as well as technology solutions for our various business units including Automotive, Business, Collections, Consumer, Fraud, and various Data Lab joint-development initiatives,” he explains. “I work closely with Norbert Frohlich and Dave Garnier, our lead developers. They are responsible for the design and development of our various solutions, including those that leverage MapR Hadoop environments.”
Processing More Data in Less Time
Until recently, the Group had been getting by, as Thomas puts it, “…with solutions running on a couple of Windows servers and a SAN.” But as the company added new products and new sets of data quality rules, more data had to be processed in the same or less time. It was time to upgrade. But simply adding to the existing Windows/SAN system wasn’t an option—too cumbersome and expensive.
So the group upgraded to a Linux-based HPC cluster with—for the time being—six nodes. Says Thomas, “We have a single customer solution right now. But as we get new customers who can use this kind of capability, we can add additional nodes and storage and processing capacity at the same time.”
NFS Provides Direct Access to Data
“All our solutions leverage MapR NFS functionality,” he continues. “This allows us to transition from our previous internal or SAN storage to Hadoop by mounting the cluster directly. In turn, this provides us with access to the data via HDFS and Hadoop environment tools, such as Hive.”
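For the analysts described later in this piece, that access typically looks like an ordinary SQL session against Hive. A minimal sketch using the third-party PyHive client is below; the host, table, and column names are placeholders, not Experian’s actual schema.

    # Minimal sketch: querying data stored in the cluster with SQL via Hive.
    # Host, port, credentials, table, and column names are hypothetical placeholders.
    from pyhive import hive   # pip install pyhive

    conn = hive.Connection(host="hive-gateway.example.com", port=10000, username="analyst")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT account_segment, COUNT(*) AS record_count "
        "FROM inbound_source_data "
        "GROUP BY account_segment"
    )
    for segment, count in cursor.fetchall():
        print(segment, count)
    conn.close()

No structured database has to be built and loaded first; the tables are defined over files that were simply copied onto the NFS-mounted cluster.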
ETL tools like DMX-h from Syncsort also figured prominently in the new infrastructure, as does MapR NFS. MapR is the only distribution for Apache Hadoop that leverages the full power of the NFS protocol for remote access to shared disks across the network.
“Our first solution includes well-known and defined metrics and aggregations,” Thomas says. “We leverage DMX-h to determine metrics for each record and pre-aggregate other metrics, which are then stored in Hadoop to be used in downstream analytics as well as real-time rules-based actions. Our second solution follows a traditional data operations flow, except in this case we use DMX-h to prepare in-bound source data that is then stored in MapR Hadoop. Then we run Experian-proprietary models that read the data via Hive and create client-specific and industry-unique results.”
Data Analysts Use SQL to Query on Hadoop
“Our latest endeavor copies data files from a legacy dual application server and SAN product solution to a MapR Hadoop cluster quite easily as facilitated by the MapR NFS functionality,” Thomas continues. “The files are then available for analysts to query with SQL via Hive, without the need to build and load a structured database. Since we are just starting to work with this data, we are not ‘stuck’ with that initial database schema that we would have developed, and thus eliminated that rework time. Our analysts have Tableau and DMX-h available to them, and will generate our initial reports and any analytics data files. Once the useful data, reports, and results formats are firmed up, we will work
Trang 13equipped with SAN storage Two of the servers from the cluster are also application servers running SmartLoad code and components The result is a more efficient use of hardware with no need for separate servers to run the application
Improved Speed to Market
Here’s how Thomas summarizes the benefits of the upgraded system to both the company and its customers: “We are realizing increased processing speed, which leads to shorter delivery times. In addition, reduced storage expenses mean that we can store more, not acquire less. Both the company’s internal operations and our clients have access to deeper data supporting and aiding insights into their business areas.
“Overall, we are seeing reduced storage expenses while gaining processing and storage capabilities and capacities,” he adds. “This translates into an improved speed to market for our business units. It also positions our Group to grow our Hadoop ecosystem to meet future Big Data requirements.”
And when it comes to being a Big Data All Star in today’s information-intensive world, Thomas’ advice is short and to the point: “Don’t wait and don’t stop.”
Trevor Mason and Big Data: Doing What Comes Naturally
Trevor Mason, Vice President for Technology Research at IRI
Mason is the vice president for Technology Research at IRI, a 30-year-old Chicago-based company that provides information, analytics, business intelligence and domain expertise for the world’s leading CPG, retail and healthcare companies.
“I’ve always had a love of mathematics and proved to be a natural when it came to computer science,” Mason says. “So I combined both disciplines and it has been my interest ever since I joined IRI 20 years ago to work with Big Data (although it wasn’t called that back then). Today I head up a group that is responsible for forward-looking research into tools and systems for processing, analyzing and managing massive amounts of data. Our mission is two-fold: keep technology costs as low as possible while providing our clients with the state-of-the-art analytic and intelligence tools they need to drive their insights.”
Big Data Challenges
Recent challenges facing Mason and his team included a mix of business and technological issues. They were attempting to realize significant cost reductions by reducing mainframe load, and to continue to reduce mainframe support risk that is increasing due to the imminent retirement of key mainframe support personnel. At the same time, they wanted to build the foundations for a more cost-effective, flexible and expandable data processing and storage environment.
The technical problem was equally challenging. The team wanted to achieve random extraction rates averaging 600,000 records per second, peaking to over one million records per second, from a 15 TB fact table. This table feeds a large multi-TB downstream client-facing reporting farm. Given IRI’s emphasis on economy, the solution had to be very efficient, using only 16 to 24 nodes.
“We looked at traditional warehouse technologies, but Hadoop was by far the most cost-effective solution,” Mason says. “Within Hadoop we investigated all the main distributions and various hardware options before settling on MapR on a Cisco UCS (Unified Computing System) cluster.”
The fact table resides on the mainframe, where it is updated and maintained daily. These functions are very complex and proved costly to migrate to the cluster. However, the extraction process, which represents the majority of the
“The solution was to keep the update and maintenance processes on the mainframe and maintain a synchronized copy on the Hadoop cluster by using our mainframe change logging process,” he notes. “All extraction processes go against the Hadoop cluster, significantly reducing the mainframe load. This met our objective of maximum performance with minimal new development.”
The team chose MapR to maximize file system performance, facilitate the use of a large number of smaller files, and take full advantage of its NFS capability so files could be sent via FTP from the mainframe directly to the cluster.
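Since the NFS mount makes the cluster’s file system reachable like any other directory, the receiving end of such a transfer needs nothing special. A rough sketch of the sending side, using Python’s standard ftplib, is below; the host, credentials, and paths are placeholders, and IRI’s actual transfers were driven by the mainframe’s own tooling rather than a Python script.

    # Sketch of the sending side of an FTP transfer into an NFS-backed cluster directory.
    # Host, credentials, and paths are hypothetical.
    from ftplib import FTP

    with FTP("ftp-gateway.example.com") as ftp:
        ftp.login(user="loader", passwd="secret")
        ftp.cwd("/mapr/my.cluster.com/facts/incoming")   # directory backed by the cluster via NFS
        with open("daily_extract.dat", "rb") as f:
            ftp.storbinary("STOR daily_extract.dat", f)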
Shaking up the System
They also gave their system a real workout. Recalls Mason, “To maximize efficiency we had to see how far we could push the hardware and software before it broke. After several months of pushing the system to its limits, we weeded out several issues, including a bad disk, a bad node, and incorrect OS, network and driver settings. We worked closely with our vendors to root out and correct these issues.”
Overall, he says, the development took about six months, followed by two months of final testing and running in parallel with the regular production processes. He also stressed that “much kudos goes to the IRI engineering team and Zaloni consulting team, who worked together to implement all the minute details needed to create the current fully functional production system in only six months.”
To accomplish their ambitious goals, the team took some unique approaches. For instance, the methods they used to organize the data and structure the extraction process allowed them to achieve extraction rates of between two million and three million records per second on a 16-node cluster.
They also developed a way to always have a consistent view of the data used in the extraction process while continuously updating it.
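The article doesn’t spell out how that consistent view is achieved. One common pattern on a MapR cluster, offered here purely as an assumed illustration and not as IRI’s method, is to run extractions against a point-in-time volume snapshot while updates continue on the live volume.

    # Illustrative pattern only (not necessarily IRI's approach): extract from a snapshot
    # so readers see a frozen view while the live volume keeps receiving updates.
    # The volume name and paths are hypothetical.
    import datetime
    import subprocess

    VOLUME = "fact_table"
    snap_name = "extract_%s" % datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

    # Take a point-in-time snapshot of the volume.
    subprocess.run(
        ["maprcli", "volume", "snapshot", "create",
         "-volume", VOLUME, "-snapshotname", snap_name],
        check=True,
    )

    # Extraction jobs then read the frozen copy exposed under the volume's .snapshot
    # directory while updates continue against the live path.
    print("read from /mapr/my.cluster.com/facts/.snapshot/%s/" % snap_name)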
By far one of the most effective additions to the IRI IT infrastructure was the implementation of Hadoop. Before Hadoop, the technology team relied on the mainframe running 24×7 to process the data in accordance with their customers’ tight timelines. With Hadoop, they have been able to speed up the process while reducing mainframe load. The result: annual savings of more than $1.5 million.
Says Mason, “Hadoop is not only saving us money, it also provides a flexible platform that can easily scale to meet future corporate growth. We can do a lot more in terms of offering our customers unique analytic insights—the Hadoop platform and all its supporting tools allow us to work with large datasets in a highly parallel manner.
“IRI specialized in Big Data before the term became popular—this is not new to us,” he concludes. “Big Data has been our business now for more than 30 years. Our objective is to continue to find ways to collect, process and manage Big Data efficiently so we can provide our clients with leading insights to drive their business growth.”
And finally, when asked what advice he might have for others who would like to become Big Data All Stars, Mason is very clear: “Find and implement efficient and innovative ways to solve critical Big Data processing and management problems that result in tangible value to the company.”