Presto Big Data Analysis Beyond Hadoop
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
R for Big Data
Indrajit Roy, HP Labs
October 2013
Team:
A tale of three researchers
(Systems + PL) talk about data mining problems!
A Big Data story
Once upon a time, a customer in distress had…
… 2+ billion rows of financial data (TBs of data)
… wanted to model defaults on mortgage and credit cards
… by running regression analysis
… Alas!
… traditional databases don’t support regression analysis
… custom code can take from hours to days
Moral of the story:
Customers need a platform and a programming model for complex analysis
Big Data has many facets
>1M customer transactions/hr
>1B user graph
>40B photos
>7 TB/day
Complex analytics at scale
Event processing at scale (not today’s talk)
Example: PageRank using matrices
Power method: computes the dominant eigenvector
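A minimal sketch of the power method in plain R, to make the matrix view concrete. The 4-page graph, the damping factor d = 0.85, and all variable names are illustrative assumptions, not data from the talk.

```r
# PageRank by the power method on a tiny, made-up 4-page graph.
# Each row of `edges` is c(from, to).
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 1), c(4, 3))
n <- 4

# Column-stochastic transition matrix: M[to, from] = 1 / outdegree(from).
M <- matrix(0, n, n)
out_degree <- tabulate(edges[, 1], nbins = n)
for (k in seq_len(nrow(edges))) {
  from <- edges[k, 1]
  to <- edges[k, 2]
  M[to, from] <- 1 / out_degree[from]
}

d <- 0.85                   # damping factor (assumed value)
z <- rep((1 - d) / n, n)    # teleport term
p <- rep(1 / n, n)          # start from the uniform distribution

# Repeated multiplication converges to the dominant eigenvector
# of the damped transition matrix.
for (iter in 1:200) {
  p_new <- as.vector(d * (M %*% p) + z)
  converged <- sum(abs(p_new - p)) < 1e-10
  p <- p_new
  if (converged) break
}
```

Every page in this toy graph has at least one outgoing link, so the ranks keep summing to 1 and no special handling of dangling pages is needed.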
[Figure: landscape of analytics systems (very simplified view). Labels: SQL, database; machine learning, images, graphs; Hadoop; RDBMS (column store); R/Matlab; scale + complex analytics]
Large scale analytics frameworks
Data-parallel frameworks – MapReduce/Dryad (2004)
Process each record in parallel
Use case: Computing sufficient statistics, analytics queries
Graph-centric frameworks – Pregel/GraphLab (2010)
Process each vertex in parallel
Use case: Graphical models
Array-based frameworks
Process blocks of array in parallel
Use case: Linear Algebra Operations
Our approach*
*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.
Enter the world of R
Why is R popular?
Extremely data driven
• Load data, analyze using functions
• From sum, mean, median to regression, PageRank, and others
• Plot!
Simple data structures: Arrays and Data-frames
• Almost everything is a package, even the GUI
• Community driven
• Use install.packages() to get your favorite package
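A short sketch of that load, analyze, plot workflow. It uses the `mtcars` dataset that ships with R; the choice of dataset and of the `mpg ~ wt` model is only for illustration.

```r
# Load a built-in data frame, summarize it, fit a regression, and plot.
data(mtcars)

avg_mpg <- mean(mtcars$mpg)      # simple summaries
med_mpg <- median(mtcars$mpg)

fit <- lm(mpg ~ wt, data = mtcars)   # linear regression: mileage vs. weight
slope <- coef(fit)[["wt"]]

# Plot only in an interactive session, so scripts do not write plot files.
if (interactive()) {
  plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
  abline(fit)
}
```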
Almost functional
Superassignment (<<-) has a global effect
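A small sketch of the difference between ordinary assignment and superassignment inside a function. The function and variable names are made up for the example.

```r
counter <- 0

# Ordinary assignment creates a local copy; the outer counter is untouched.
bump_local <- function() {
  counter <- counter + 1
  counter
}

# Superassignment modifies the variable in the enclosing environment.
bump_global <- function() {
  counter <<- counter + 1
  invisible(counter)
}

local_result <- bump_local()
after_local <- counter      # outer value after the local bump

bump_global()
after_global <- counter     # outer value after the superassignment
```

Strictly, `<<-` searches the enclosing environments for an existing binding and falls back to the global environment when it finds none, which is why the language is only "almost" functional.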
Trang 15© Copyright 2012 Hewlett-Packard Development Company, L.P The information contained herein is subject to change without notice
15
Example 3: Classes and objects
> setClass("person", representation (name= "character", age =
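The slide text above is cut off. As a separate, self-contained sketch of S4 classes and objects, the following defines a similar class; the `"numeric"` type for `age`, the object `alice`, and the `describe` generic are assumptions made for illustration, not a reconstruction of the slide.

```r
library(methods)

# Define an S4 class with two typed slots (slot types are assumed here).
setClass("person", representation(name = "character", age = "numeric"))

# Create an object and read a slot with @.
alice <- new("person", name = "Alice", age = 30)

# Define a generic function and a method specialized for the class.
setGeneric("describe", function(obj) standardGeneric("describe"))
setMethod("describe", "person", function(obj) {
  paste(obj@name, "is", obj@age)
})

msg <- describe(alice)
```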
But R is …
• Not parallel
• Not distributed
• Limited by dataset size
Enter Distributed R
Challenge 1: R has limited support for parallelism
• R is single-threaded
• Multi-process solutions offered by extensions
• Threads/processes share data through pipes or network
− Time-inefficient (sending copies)
− Space-inefficient (extra copies)
[Figure: R processes on Server 1 and Server 2, each holding its own copy of the data, with copies sent over the network]
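A rough way to see the cost of sharing by copy, using only base R: serialize a vector the way it would be written to a pipe or socket, and count the bytes per worker. The vector size and the worker count are hypothetical.

```r
set.seed(1)
n <- 1e6
x <- rnorm(n)                      # about 8 MB of doubles

# Bytes that must be shipped to hand one worker a copy of x.
payload <- serialize(x, connection = NULL)
bytes_per_copy <- length(payload)

# The receiver rebuilds a full, separate copy from those bytes.
x_copy <- unserialize(payload)

workers <- 4                       # hypothetical number of worker processes
total_bytes <- workers * bytes_per_copy
```

Both costs named on the slide show up here: time to move `total_bytes`, and memory for one extra copy of the data in every worker.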
Challenge 2: R is memory bound
• Data needs to fit in DRAM
• Current research solution:
− Uses custom bigarray objects with limited functionality
− Even simple operations like x+y may not work
Challenge 3: Sparse datasets cause load imbalance
Computation + communication imbalance !
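A synthetic sketch of why equal-sized partitions of a sparse matrix can be unbalanced. The heavy-tailed row counts below are generated, not taken from any dataset in the talk, and the partition count is an assumption.

```r
set.seed(42)
n_rows <- 10000

# Synthetic heavy-tailed distribution: a few rows hold very many nonzeros.
nnz_per_row <- pmin(ceiling(1 / runif(n_rows)^1.2), n_rows)

# Split the rows into 8 partitions with the same number of rows each.
n_parts <- 8
part_id <- rep(seq_len(n_parts), each = n_rows / n_parts)

# Work per partition is proportional to its nonzeros, not its row count.
nnz_per_part <- tapply(nnz_per_row, part_id, sum)
imbalance <- max(nnz_per_part) / mean(nnz_per_part)
```

An `imbalance` value well above 1 means the slowest partition holds far more nonzeros than the average one, so the other workers sit idle waiting for it.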
Enhancement #1: Distributed data structures
darray
• Relies on user-defined partitioning
• Also supports distributed data frames
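To illustrate user-defined partitioning without a cluster, here is a plain-R stand-in that splits a matrix into row blocks held in a list. It is not the Distributed R `darray` API; `make_partitions` and `rows_per_block` are hypothetical names.

```r
# Plain-R emulation of a row-partitioned array (single machine only).
make_partitions <- function(mat, rows_per_block) {
  starts <- seq(1, nrow(mat), by = rows_per_block)
  lapply(starts, function(s) {
    e <- min(s + rows_per_block - 1, nrow(mat))
    mat[s:e, , drop = FALSE]      # keep each block a matrix
  })
}

A <- matrix(1:24, nrow = 6, ncol = 4)

# The user picks the block size, and with it the number of partitions.
parts <- make_partitions(A, rows_per_block = 2)
n_parts <- length(parts)

# Stacking the blocks in order gives back the original array.
reassembled <- do.call(rbind, parts)
```

In a distributed setting each block would live on a different worker; here the list only mimics that layout.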
Enhancement #2: Distributed loop
foreach
• Express computations over partitions
• Execute across the cluster
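# Body of one PageRank iteration; the enclosing loop header and the
# opening of the foreach(...) call are missing from this text.
function(p = splits(P, i), m = splits(M, i),
         x = splits(P_old), z = splits(Z, i)) {
  p <- (m %*% x) + z   # %*% is R's matrix product
  update(p) })         # publish the new value of partition p
P_old <- P
}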
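The fragment above depends on the Distributed R runtime. The same per-partition update can be sketched on one machine in plain R, with `lapply` standing in for the distributed loop. The matrix, the block size `s`, and the damping factor folded into `M` are assumptions made for this sketch.

```r
set.seed(7)
N <- 6        # number of vertices (assumed)
s <- 2        # rows per partition (assumed)
d <- 0.85     # damping factor (assumed)

# Random column-stochastic matrix, with the damping factor folded in.
M <- matrix(runif(N * N), N, N)
M <- d * sweep(M, 2, colSums(M), "/")
Z <- matrix((1 - d) / N, N, 1)
P <- matrix(1 / N, N, 1)

# Row indices of each partition.
row_blocks <- split(seq_len(N), ceiling(seq_len(N) / s))

for (iter in 1:50) {
  P_old <- P
  # Each partition needs its own rows of M and Z, plus all of P_old.
  new_parts <- lapply(row_blocks, function(rows) {
    m <- M[rows, , drop = FALSE]
    z <- Z[rows, , drop = FALSE]
    (m %*% P_old) + z
  })
  P <- do.call(rbind, unname(new_parts))
}
```

Because each partition only writes its own rows of `P`, the per-partition computations are independent and could run in parallel.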
[Figure: architecture diagram with labels: Worker, Executor pool]
• Scheduler: performs I/O and task scheduling
• Worker: executes tasks and I/O operations
Locality based computation: Part 1
Ship functions to data
[Figure: partitions P1 to P4 alongside partitions M1 to M4]
[Figure: "Run Task" step on partition P1, with partitions M1 to M4]
Efficiently sharing data
Goal: Zero-copy sharing across cores
Immutable partitions enable safe sharing
Versioned distributed arrays
Data sharing challenges
Problems with data sharing
• Garbage collection
• Header conflicts
[Figure: two R instances sharing one R object, which has a header and a data part; the shared header is corrupted]
[Figure: shared R object layout relative to page boundaries]
Applications in Distributed R
Netflix recommendation: matrix factorization, 130 LOC*
Fewer than 140 lines of code
*LOC for core of the applications
Distributed R has good performance
Algorithm: PageRank (power method)
Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB
Setup: 8 SL 390 servers, 8 cores/server, 96GB RAM
*Shorter is better
Back to the Big Data story
Scalable and high performance
• Regression on multi-billion rows in minutes
• Graph algorithms on billions of vertices and edges in minutes
Ease of programming?
• Same familiar programming model as R, easy for data scientists
• Distributed algorithms in hundreds of lines: clustering, classification, regression, graph algorithms, …
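One common way to run linear regression over partitioned rows is to compute small sufficient statistics per partition and combine them. The sketch below shows that idea in plain R. It is an illustration under assumed synthetic data, not necessarily how Distributed R implements regression.

```r
set.seed(3)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))        # intercept plus two predictors
beta_true <- c(2, -1, 0.5)               # coefficients used to generate y
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.1))

# Assign rows to 4 partitions (hypothetical layout).
part_id <- rep(1:4, length.out = n)

# Each partition computes only X'X (3 x 3) and X'y (3 x 1).
stats <- lapply(1:4, function(k) {
  Xk <- X[part_id == k, , drop = FALSE]
  yk <- y[part_id == k]
  list(xtx = crossprod(Xk), xty = crossprod(Xk, yk))
})

# Combining is a sum of small matrices, independent of the row count.
XtX <- Reduce("+", lapply(stats, function(s) s$xtx))
Xty <- Reduce("+", lapply(stats, function(s) s$xty))

# Solve the normal equations for the coefficients.
beta_hat <- as.vector(solve(XtX, Xty))
```

Only the small per-partition matrices cross partition boundaries, which is why this pattern scales to very large row counts.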
Related work
Non-R systems
• MapReduce, Spark (UC Berkeley), Piccolo (NYU), MadLINQ (Microsoft)
• Pregel (Google), GraphLab (CMU)
• Star-P (MIT), Julia
compiler improvements, package contributions, …
Prof Andrew Chien (U Chicago)
Prof Renato Figueiredo (UFL)
http://www.hpl.hp.com/research/distributedr.htm