Presto: Big Data Analysis Beyond Hadoop (SPLASH 2013)

© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

R for Big Data

Indrajit Roy, HP Labs

October 2013

Team:

A tale of three researchers

(Systems + PL) talk about data mining problems!

A Big Data story

Once upon a time, a customer in distress had…

… 2+ billion rows of financial data (TBs of data)

… wanted to model defaults on mortgages and credit cards

… by running regression analysis

… Alas!

… traditional databases don’t support regression analysis

… custom code can take from hours to days

Moral of the story:

Customers need a platform and a programming model for complex analysis

Big Data has many facets

>1M customer transactions/hr

>1B user graph

>40B photos

>7 TB/day

Complex analytics at scale

Event processing at scale (not today’s talk)

Example: PageRank using matrices

Power method: computes the dominant eigenvector
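
A minimal sketch of the power method for PageRank in base R. The 4-page graph, the damping factor of 0.85, and the helper name `pagerank` are illustrative assumptions, not taken from the talk.

```r
# Power-method PageRank on a tiny illustrative graph (assumed data).
# Each row of `edges` is a link "from -> to".
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 1), c(4, 3))
n <- 4
d <- 0.85  # damping factor (assumed; a conventional choice)

# Column-stochastic link matrix: M[to, from] = 1 / outdegree(from).
M <- matrix(0, n, n)
outdeg <- tabulate(edges[, 1], nbins = n)
for (k in seq_len(nrow(edges))) {
  from <- edges[k, 1]
  to <- edges[k, 2]
  M[to, from] <- 1 / outdeg[from]
}

# Hypothetical helper: repeat p <- d * M p + z until p stops changing.
pagerank <- function(M, d = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(M)
  p <- rep(1 / n, n)
  z <- rep((1 - d) / n, n)
  for (iter in seq_len(max_iter)) {
    p_new <- d * as.vector(M %*% p) + z
    if (sum(abs(p_new - p)) < tol) break
    p <- p_new
  }
  p_new
}

ranks <- pagerank(M, d)
```

The fixed point of this iteration is the dominant eigenvector of the damped link matrix, scaled to sum to 1. Page 3 collects the most incoming links and ends up with the highest rank.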

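[Figure (very simplified view): systems positioned by scale and by analytic complexity. Axis labels: Scale, Complex Analytics. Workload labels: SQL, database; machine learning, images, graphs. Systems shown: Hadoop, RDBMS (column store), R/Matlab.]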

Large-scale analytics frameworks

Data-parallel frameworks – MapReduce/Dryad (2004)

Process each record in parallel

Use case: Computing sufficient statistics, analytics queries

Graph-centric frameworks – Pregel/GraphLab (2010)

Process each vertex in parallel

Use case: Graphical models

Array-based frameworks

Process blocks of array in parallel

Use case: Linear Algebra Operations

Our approach*

*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.

Enter the world of R

Why is R popular?

Extremely data driven

• Load data, analyze using functions

• From sum, mean, median to regression, PageRank, and others

• Plot!

Simple data structures: arrays and data frames

Almost everything is a package, even the GUI

• Community driven

• Use install.packages() to get your favorite package
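
A short sketch of the load, analyze, plot workflow described above. It uses the built-in `mtcars` dataset as a stand-in, an assumption chosen only to keep the example self-contained.

```r
# Load data: mtcars ships with R, so nothing needs to be downloaded.
data(mtcars)

# Analyze using functions, from simple summaries ...
mean_mpg <- mean(mtcars$mpg)
median_mpg <- median(mtcars$mpg)

# ... to regression: fuel economy as a linear function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]

# Plot! (uncomment in an interactive session)
# plot(mtcars$wt, mtcars$mpg); abline(fit)
```

The negative slope says that heavier cars in this dataset get fewer miles per gallon.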

Almost functional

Superassignment (<<-) has a global effect: it assigns to a variable in an enclosing scope instead of creating a local one
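
A small illustration of why R is only "almost" functional. The names `counter`, `bump_local`, and `bump_global` are made up for this sketch.

```r
counter <- 0

# Ordinary assignment creates a local variable; the outer counter is untouched.
bump_local <- function() {
  counter <- counter + 1
  invisible(counter)
}

# Superassignment walks up the enclosing scopes and modifies the outer counter.
bump_global <- function() {
  counter <<- counter + 1
  invisible(counter)
}

bump_local()
after_local <- counter    # still 0

bump_global()
after_global <- counter   # now 1
```

Functions that use `<<-` have side effects, which matters when deciding whether work can safely run in parallel.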

Example 3: Classes and objects

> setClass("person", representation (name= "character", age =
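
The snippet above is cut off on the slide. A hedged sketch of what a complete S4 example can look like, assuming `age` is numeric and adding a hypothetical generic called `describe`:

```r
library(methods)  # S4 support; attached by default in recent R, harmless otherwise

# Class with two typed slots (the type of `age` is an assumption).
setClass("person", representation(name = "character", age = "numeric"))

# Create an object; slot types are checked at construction time.
p <- new("person", name = "Ada", age = 36)

# Hypothetical generic and a method specialized for "person".
setGeneric("describe", function(obj) standardGeneric("describe"))
setMethod("describe", "person", function(obj) {
  paste0(obj@name, " (", obj@age, ")")
})

label <- describe(p)  # "Ada (36)"
```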

But R is …

• Not parallel

• Not distributed

• Limited by dataset size

Enter Distributed R

Challenge 1: R has limited support for parallelism

• R is single-threaded

• Multi-process solutions offered by extensions

• Threads/processes share data through pipes or network

− Time-inefficient (sending copies)

− Space-inefficient (extra copies)

[Figure: Server 1 and Server 2 each run R processes holding their own copy of the data; data moves between the servers as network copies.]
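
A rough illustration of the copying cost, using only base R. It serializes a vector the way it would be prepared for a pipe or socket and reads it back; no actual second process is started, which is a simplification.

```r
set.seed(42)
x <- rnorm(1e5)  # about 800 KB of doubles

# Bytes that would have to travel to another R process.
payload <- serialize(x, connection = NULL)

# The receiving side rebuilds its own, separate copy of the data.
y <- unserialize(payload)

bytes_sent <- length(payload)
```

The payload is at least as large as the raw data, and sender and receiver each hold a full copy, which is the time and space overhead listed above.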

Challenge 2: R is memory bound

• Data needs to fit in DRAM

• Current research solution:

− Uses custom bigarray objects with limited functionality

− Even simple operations like x+y may not work

Challenge 3: Sparse datasets cause load imbalance

Computation and communication imbalance!
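
A synthetic sketch of how the imbalance arises. The skewed degree sequence below is an assumption standing in for a real sparse graph: row i of the adjacency matrix is given ceiling(n / i) nonzeros, and rows are split into equal-sized blocks.

```r
n <- 1000                        # vertices, i.e. rows of the adjacency matrix
deg <- ceiling(n / seq_len(n))   # synthetic skewed nonzero counts per row

# Partition rows into 4 blocks with the same number of rows each.
n_parts <- 4
part <- rep(seq_len(n_parts), each = n / n_parts)

# Work per partition is driven by nonzeros, not by row count.
nnz_per_part <- as.vector(tapply(deg, part, sum))

# Ratio of the busiest partition to the average one.
imbalance <- max(nnz_per_part) / mean(nnz_per_part)
```

Equal row counts still leave the first block with far more nonzeros than the last, so the worker holding it becomes the straggler.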

Enhancement #1: Distributed data structures

darray

• Relies on user-defined partitioning

• Also supports distributed data frames
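
A single-machine sketch of user-defined partitioning in base R. It does not use the Distributed R `darray` API; the helper `make_row_blocks` is hypothetical and only mimics the idea of an array stored as blocks.

```r
# Hypothetical helper: split a matrix into row blocks of a user-chosen size.
make_row_blocks <- function(A, rows_per_block) {
  starts <- seq(1, nrow(A), by = rows_per_block)
  lapply(starts, function(s) {
    e <- min(s + rows_per_block - 1, nrow(A))
    A[s:e, , drop = FALSE]
  })
}

A <- matrix(1:20, nrow = 10, ncol = 2)

# The user picks the block size; here 10 rows become blocks of 4, 4 and 2.
parts <- make_row_blocks(A, rows_per_block = 4)
block_rows <- sapply(parts, nrow)

# Stacking the blocks back together recovers the original array.
reassembled <- do.call(rbind, parts)
```

In a distributed setting each block would live on some worker, and the block size chosen by the user decides how much work each task receives.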

Enhancement #2: Distributed loop

foreach

• Express computations over partitions

• Execute across the cluster

foreach(i, …,
  function(p = splits(P, i), m = splits(M, i),
           x = splits(P_old), z = splits(Z, i)) {
    p <- (m*x)+z
    update(p)
  })

P_old <- P

}
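
A base-R emulation of the loop above, meant only to show the data flow. It assumes the product is a matrix product (`%*%`), that damping is folded into M, and that `lapply` plus `rbind` can stand in for `foreach` and `update`; the 4-page graph is made up.

```r
n <- 4
d <- 0.85            # damping factor (assumed)
rows_per_block <- 2  # user-chosen partition size

# Column-stochastic link matrix of an illustrative graph, scaled by d.
M <- d * matrix(c(0,   0, 1, 0,
                  0.5, 0, 0, 0,
                  0.5, 1, 0, 1,
                  0,   0, 0, 0), nrow = n, byrow = TRUE)
Z <- matrix((1 - d) / n, nrow = n, ncol = 1)

# Row indices owned by each partition.
idx <- unname(split(seq_len(n), ceiling(seq_len(n) / rows_per_block)))

P <- matrix(1 / n, nrow = n, ncol = 1)
for (iter in 1:200) {
  P_old <- P
  # One task per partition: row block i of M and Z, plus all of P_old.
  new_parts <- lapply(idx, function(rows) {
    M[rows, , drop = FALSE] %*% P_old + Z[rows, , drop = FALSE]
  })
  # Stands in for update(p): publish the new partitions of P.
  P <- do.call(rbind, new_parts)
}
```

Each task reads one row block of M and the whole previous rank vector, and writes only its own block of P, so tasks within an iteration are independent.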

[Figure: system architecture, showing a Worker with its executor pool.]

• Scheduler: performs I/O and task scheduling

• Worker: executes tasks and I/O operations

Locality-based computation: Part 1

Ship functions to data

[Figure: array partitions P1 to P4 and M1 to M4.]

[Figure: "Run Task" on partition P1, with partitions M1 to M4.]

Efficiently sharing data

Goal: Zero-copy sharing across cores

Immutable partitions → safe sharing

Versioned distributed arrays
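
A toy sketch of the versioning idea, not the mechanism used by the system itself. The store `new_versioned` and its `read` and `update` functions are hypothetical: every update publishes a new version and never touches an old one, so a reader holding an earlier version is unaffected.

```r
# Hypothetical versioned store built from a closure.
new_versioned <- function(initial) {
  versions <- list(initial)
  list(
    # Read a given version (the latest by default).
    read = function(v = length(versions)) versions[[v]],
    # Publish a new version and return its number; old versions stay as they are.
    update = function(value) {
      versions[[length(versions) + 1]] <<- value
      length(versions)
    }
  )
}

part <- new_versioned(c(1, 2, 3))

snapshot <- part$read()            # a reader holds version 1
v2 <- part$update(snapshot * 10)   # a writer publishes version 2
```

Because published versions are never modified, several readers can share the same version without locks or defensive copies.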

Data sharing challenges

Problems with data sharing

• Garbage collection

• Header conflicts

[Figure: two R instances sharing one R object, which consists of a header and a data part; the shared header is corrupted.]

[Figure: R object layout relative to page boundaries.]

Applications in Distributed R

Netflix recommendation (matrix factorization): 130 LOC*

Fewer than 140 lines of code

*LOC for core of the applications

Distributed R has good performance

Algorithm: PageRank (power method)

Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB

Setup: 8 SL 390 servers, 8 cores/server, 96GB RAM

*Shorter is better

Back to the Big Data story

Scalable and high performance

• Regression on multi-billion rows in minutes

• Graph algorithms on billions of vertices and edges in minutes

Ease of programming?

• Same familiar programming model as R, easy for data scientists

• Distributed algorithms in hundreds of lines: clustering, classification, regression, graph algorithms, …
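
One common way to make regression scale over partitioned rows is to combine small per-partition summaries. The sketch below shows that idea in base R on synthetic data; it is an assumption about how such an algorithm can be structured, not the implementation from the talk.

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))  # intercept plus two synthetic predictors
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n, sd = 0.1))

# Assign rows to 4 partitions (round-robin, for illustration).
part <- rep(1:4, length.out = n)

# Each partition contributes only t(X) %*% X and t(X) %*% y for its rows.
stats <- lapply(1:4, function(k) {
  Xk <- X[part == k, , drop = FALSE]
  yk <- y[part == k]
  list(xtx = crossprod(Xk), xty = crossprod(Xk, yk))
})

# Combine the summaries and solve the normal equations once.
xtx <- Reduce(`+`, lapply(stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(stats, `[[`, "xty"))
beta <- as.vector(solve(xtx, xty))
```

Only a 3-by-3 matrix and a 3-element vector leave each partition, regardless of how many rows it holds, and the result matches an ordinary least-squares fit on the full data.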

Related work

Non-R systems

• MapReduce, Spark (UC Berkeley), Piccolo (NYU), MadLINQ (Microsoft)

• Pregel (Google), GraphLab (CMU)

• Star-P (MIT), Julia

compiler improvements, package contributions, …

Prof Andrew Chien (U Chicago)

Prof Renato Figueiredo (UFL)

http://www.hpl.hp.com/research/distributedr.htm
