
NIST stonebraker Big Data Means at Least


Big Data Means at Least Three Different Things…

Michael Stonebraker

The Meaning of Big Data - 3 V’s

• Big Volume

— With simple (SQL) analytics

— With complex (non SQL) analytics


Big Volume - Little Analytics

• Well addressed by data warehouse crowd

• Who are pretty good at SQL analytics on


In My Opinion…

• Column stores will win

• Factor of 50 or so faster than row stores
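A rough sketch of why a column layout suits simple SQL-style analytics: an aggregate over one attribute reads only that attribute, while a row layout pulls every field of every row through the scan. The schema, the data, and the "values touched" counter are assumptions made up for illustration. This toy does not reproduce the factor-of-50 figure, which depends on real engines, compression, and hardware.

```python
# Toy comparison of row vs. column layout (illustrative only).
# The schema and data below are invented for this example.

rows = [
    # (trade_id, symbol, price, quantity, venue)
    (1, "AAA", 10.0, 100, "X"),
    (2, "BBB", 20.0, 200, "Y"),
    (3, "AAA", 11.0, 300, "X"),
    (4, "CCC", 30.0, 400, "Z"),
]

# Column layout: one contiguous list per attribute.
columns = {
    "trade_id": [r[0] for r in rows],
    "symbol": [r[1] for r in rows],
    "price": [r[2] for r in rows],
    "quantity": [r[3] for r in rows],
    "venue": [r[4] for r in rows],
}


def sum_quantity_row_store(table):
    """SUM(quantity) over a row layout: every field of every row is read."""
    total, touched = 0, 0
    for row in table:
        touched += len(row)  # the whole row passes through the scan
        total += row[3]
    return total, touched


def sum_quantity_column_store(cols):
    """Same aggregate over a column layout: only 'quantity' is read."""
    values = cols["quantity"]
    return sum(values), len(values)


row_total, row_touched = sum_quantity_row_store(rows)
col_total, col_touched = sum_quantity_column_store(columns)
print(row_total, row_touched)
print(col_total, col_touched)
```

Both scans return the same total; the difference is how many values each one has to read to get there, and that gap widens as tables get wider.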


Big Data - Big Analytics

• Complex math operations (machine learning, clustering, trend detection, …)

• A dozen or so common ‘inner loops’

— SVD decomposition


Now Make It Interesting …

• Do this for all pairs of 4000 stocks

— The data is the following 4000 x 2000 matrix

Stock  t1  t2  t3  t4  t5  t6  t7  …  t2000
S1
S2
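The surviving text does not say what "this" is. Assuming it is a per-pair statistic such as the covariance of two price series, the all-pairs computation over a stocks x time matrix might look like the sketch below. The choice of covariance, the function names, and the tiny 3 x 4 matrix are all assumptions for illustration.

```python
from itertools import combinations


def covariance(xs, ys):
    """Sample covariance of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def all_pairs_covariance(matrix):
    """Covariance for every unordered pair of rows (one row per stock)."""
    return {
        (i, j): covariance(matrix[i], matrix[j])
        for i, j in combinations(range(len(matrix)), 2)
    }


# Tiny stand-in for the 4000 x 2000 matrix: 3 stocks, 4 time steps.
prices = [
    [1.0, 2.0, 3.0, 4.0],  # S1
    [2.0, 4.0, 6.0, 8.0],  # S2 moves with S1
    [4.0, 3.0, 2.0, 1.0],  # S3 moves against S1
]

cov = all_pairs_covariance(prices)
for pair, value in sorted(cov.items()):
    print(pair, round(value, 4))
```

With 4000 stocks the same loop covers 4000 * 3999 / 2 = 7,998,000 pairs, each over 2000 time steps, which is why the slide treats it as a "big math" workload rather than a SQL aggregate.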

— Leave out outliers

— Just on securities with a market cap over $10B

These Requirements Arise in Many Other Domains

• Auto insurance

— Sensor in your car (driving behavior and location)

— Reward safe driving (no jackrabbit stops, stay out of bad neighborhoods)

• Ad placement on the web

— Cluster customer sessions

• Lots of science apps

— Genomics, satellite imagery, astronomy, weather, …

In My Opinion…

• The focus will shift quickly from “small math” to “big math” in many domains

• I.e., this stuff will become mainstream…


Solution Options

R, SAS, MATLAB, et al

• Weak or non-existent data management

• File system storage

• R doesn’t scale and is not a parallel system

Solution Options

RDBMS alone

• SQL simulator (MadLib) is slooooow (analytics * .01)

• Coding operations as UDFs still requires you to simulate arrays on top of tables - sloooow

support iteration
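To make "simulate arrays on top of tables" concrete, here is a minimal sketch that stores each matrix as (i, j, v) rows and multiplies two matrices with a join and a GROUP BY. It uses Python's standard-library sqlite3 as a stand-in RDBMS; the table names and the helper function are assumptions for illustration, and this is not how MADlib itself is implemented.

```python
import sqlite3


def matmul_via_sql(a, b):
    """Multiply two dense matrices by encoding each as (i, j, v) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mat_a (i INTEGER, j INTEGER, v REAL)")
    conn.execute("CREATE TABLE mat_b (i INTEGER, j INTEGER, v REAL)")
    conn.executemany(
        "INSERT INTO mat_a VALUES (?, ?, ?)",
        [(i, j, v) for i, row in enumerate(a) for j, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO mat_b VALUES (?, ?, ?)",
        [(i, j, v) for i, row in enumerate(b) for j, v in enumerate(row)],
    )
    # C[i][j] = sum over k of A[i][k] * B[k][j], expressed as join + aggregate.
    cursor = conn.execute(
        "SELECT mat_a.i, mat_b.j, SUM(mat_a.v * mat_b.v) "
        "FROM mat_a JOIN mat_b ON mat_a.j = mat_b.i "
        "GROUP BY mat_a.i, mat_b.j"
    )
    result = [[0.0] * len(b[0]) for _ in a]
    for i, j, v in cursor:
        result[i][j] = v
    conn.close()
    return result


product = matmul_via_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Every cell becomes a row carrying two explicit index columns, and the inner loop of the multiply becomes a join. That overhead is the cost the slide is pointing at.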

Solution Options

R + RDBMS

• Have to extract and transform the data from RDBMS table to R data format

• Need to learn 2 systems

• And R still doesn’t scale and is not a parallel system


Solution Options

• New Array DBMS designed with this market in mind


An Example Array Engine DB

SciDB (SciDB.org)

• All-in-one:

• Data is updated via time-travel; not overwritten

• Supports uncertain data, provenance

• Open source
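The "updated via time-travel; not overwritten" bullet describes a no-overwrite model: an update adds a new version and older versions stay readable. A minimal sketch of that idea follows. The class and method names are hypothetical; this is not SciDB's API or storage format.

```python
class VersionedArray:
    """Toy no-overwrite 1-D array: every update creates a new version."""

    def __init__(self, initial):
        self._base = list(initial)  # version 0
        self._deltas = []           # one {index: value} dict per later version

    def update(self, changes):
        """Record changes as a new version and return its version number."""
        self._deltas.append(dict(changes))
        return len(self._deltas)

    def read(self, version=None):
        """Materialize the array as of `version` (latest if omitted)."""
        if version is None:
            version = len(self._deltas)
        if not 0 <= version <= len(self._deltas):
            raise ValueError(f"unknown version {version}")
        snapshot = list(self._base)
        for delta in self._deltas[:version]:
            for index, value in delta.items():
                snapshot[index] = value
        return snapshot


arr = VersionedArray([1, 2, 3])
v1 = arr.update({1: 20})
v2 = arr.update({0: 10})
print(arr.read(0), arr.read(v1), arr.read())
```

Reading an old version replays only the deltas up to that point, so history is never lost; a real engine would store and index these deltas far more carefully.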

Big Velocity

• Trading volumes going through the roof on Wall Street – breaking infrastructure

• Sensor tagging of {cars, people, …} creates a firehose to ingest

• The web empowers end users to submit transactions – sending volume through the roof

• PDAs let them submit transactions from anywhere…

Two Different Solutions

• Big pattern - little state (electronic trading)

— Find me a ‘strawberry’ followed within 100 msec by a ‘banana’

• Complex event processing (CEP) is focused on this problem

— Patterns in a firehose
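The ‘strawberry’ followed by ‘banana’ pattern can be sketched as a small stream matcher. The only state it keeps is the timestamps of recent ‘strawberry’ events, which is the "little state" part. The function name, the event format, and the choice that one ‘banana’ consumes all pending ‘strawberry’ events are assumptions, not the semantics of any particular CEP product.

```python
from collections import deque


def followed_within(events, first, second, window_ms):
    """Yield (t_first, t_second) when `second` arrives within `window_ms`
    after a `first`. `events` is an iterable of (timestamp_ms, symbol)
    pairs in timestamp order."""
    pending = deque()  # timestamps of recent `first` events
    for ts, symbol in events:
        # Forget `first` events that are too old to match anything.
        while pending and ts - pending[0] > window_ms:
            pending.popleft()
        if symbol == first:
            pending.append(ts)
        elif symbol == second:
            for t_first in pending:
                yield (t_first, ts)
            pending.clear()  # assumed: each `first` matches at most once


stream = [
    (0, "strawberry"),
    (50, "apple"),
    (80, "banana"),       # 80 ms after a strawberry: match
    (300, "strawberry"),
    (450, "banana"),      # 150 ms after a strawberry: too late
]

matches = list(followed_within(stream, "strawberry", "banana", 100))
print(matches)
```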

Two Different Solutions

• Big state - little pattern

— For every security, assemble my real-time global position

— And alert me if my exposure is greater than X

• Looks like high performance OLTP

— Want to update a database at very high speed
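By contrast, the "big state" case is mostly keyed updates plus a trivial check. A minimal in-memory sketch, assuming exposure means absolute position times last trade price (the slide does not define it) and using a hypothetical PositionTracker class:

```python
from collections import defaultdict


class PositionTracker:
    """Keeps a running position per security and flags large exposures."""

    def __init__(self, exposure_limit):
        self.exposure_limit = exposure_limit     # the slide's "X"
        self.positions = defaultdict(float)      # security -> signed quantity

    def apply_trade(self, security, quantity, price):
        """Apply one trade; return an alert string if the limit is exceeded."""
        self.positions[security] += quantity
        # Assumed definition: exposure = |position| * last trade price.
        exposure = abs(self.positions[security]) * price
        if exposure > self.exposure_limit:
            return f"ALERT {security}: exposure {exposure:.2f}"
        return None


tracker = PositionTracker(exposure_limit=1000.0)
first = tracker.apply_trade("IBM", 5, 100.0)     # exposure 500
second = tracker.apply_trade("IBM", 10, 100.0)   # exposure 1500
third = tracker.apply_trade("IBM", -12, 100.0)   # exposure 300
print(first, second, third)
```

The pattern is a single comparison; the hard part is applying these updates to a very large, shared, durable state at high speed, which is why the slide says it looks like OLTP. A plain dict gives none of those guarantees.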

My Suspicion

• You have 3-4 Big state - little pattern problems for every one Big pattern – little state problem


Why Not Use Old SQL?

No SQL

• Give up SQL

Cassandra and Mongo are moving to (yup) SQL

• Give up ACID

decision to tear your hair out by doing it in user code

need ACID tomorrow?


VoltDB: an example of New SQL

• A main memory SQL engine


In My Opinion

• ACID is good

• High level languages are good

• Standards (i.e., SQL) are good


Big Variety

• Typical enterprise has 5000 operational systems

• And what about all the rest of your data?

The World of Data Integration

[Diagram labels: enterprise data warehouse; text; the rest of your data]


Summary

• The rest of your data (public and private)

information

Data Tamer in a Nutshell

• Apply machine learning and statistics to perform automatic:

• With a human assist if necessary


Data Tamer

• MIT research project

• Looking for more integration problems


Take away

• One size does not fit all

• Plan on (say) 6 DBMS architectures

• Elephants are not competitive

— Have a bad ‘innovator’s dilemma’ problem


• Hub is at M.I.T

• Looking for more partners…

Date posted: 29/08/2022, 22:35