

Marshall Presser

Data Warehousing with Greenplum

Open Source Massively Parallel Data Analytics

SECOND EDITION

Beijing • Boston • Farnham • Sebastopol • Tokyo


Data Warehousing with Greenplum

by Marshall Presser

Copyright © 2019 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisition Editor: Michelle Smith

Development Editor: Corbin Collins

Production Editor: Deborah Baker

Copyeditor: Bob Russell, Octal Publishing, LLC

Proofreader: Charles Roumeliotis

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

July 2019: Second Edition

Revision History for the Second Edition

2019-06-07: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Warehousing with Greenplum, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Pivotal. See our statement of editorial independence.


Table of Contents

Foreword to the Second Edition
Foreword to the First Edition
Preface
1. Introducing the Greenplum Database
   Problems with the Traditional Data Warehouse
   Responses to the Challenge
   A Brief Greenplum History
   What Is Massively Parallel Processing?
   The Greenplum Database Architecture
   Additional Resources
2. What’s New in Greenplum?
   What’s New in Greenplum 5?
   What’s New in Greenplum 6?
   Additional Resources
3. Deploying Greenplum
   Custom(er)-Built Clusters
   Greenplum Building Blocks
   Public Cloud
   Private Cloud
   Greenplum for Kubernetes
   Choosing a Greenplum Deployment
   Additional Resources
4. Organizing Data in Greenplum
   Distributing Data
   Polymorphic Storage
   Partitioning Data
   Orientation
   Compression
   Append-Optimized Tables
   External Tables
   Indexing
   Additional Resources
5. Loading Data
   INSERT Statements
   \COPY Command
   The gpfdist Process
   The gpload Tool
   Additional Resources
6. Gaining Analytic Insight
   Data Science on Greenplum with Apache MADlib
   Apache MADlib
   Text Analytics
   Brief Overview of GPText Architecture
   Additional Resources
7. Monitoring and Managing Greenplum
   Greenplum Command Center
   Workload Management
   Greenplum Management Tools
   Additional Resources
8. Accessing External Data
   dblink
   Foreign Data Wrappers
   Platform Extension Framework
   Greenplum Stream Server
   Greenplum-Kafka Integration
   Greenplum-Informatica Connector
   GemFire-Greenplum Connector
   Greenplum-Spark Connector
   Amazon S3
   External Web Tables
   Additional Resources
9. Optimizing Query Response
   Fast Query Response Explained
   GPORCA Recent Accomplishments
   Additional Resources


Foreword to the Second Edition

My journey with Pivotal began in 2014 at Morgan Stanley, where I am the global head of database engineering. We wanted to address two challenges:

• The ever-increasing volume and velocity of data that needed to be acquired, processed, and stored for long periods of time (more than seven years, in some cases)

• The need to satisfy the growing ad hoc query requirements of our business users

Nearly all the data in this problem space was structured, and our user base and business intelligence tool suite used the universal language of SQL. Upon analysis, we realized that we needed a new data store to resolve these issues.

A team of experienced technology professionals spanning multiple organizational levels evaluated the pain points of our current data store product suite in order to select the next-generation platform. The team’s charter was to identify the contenders, define a set of evaluation criteria, and perform an impartial evaluation. Some of the key requirements for this new data store were that the product could easily scale, provide dramatic query response time improvements, be ACID and ANSI compliant, leverage deep data compression, and support a software-only implementation model. We also needed a vendor that had real-world enterprise experience, understood the problem space, and could meet our current and future needs. We conducted a paper exercise on 12 products followed by two comprehensive proofs-of-concept (PoCs) with our key application stakeholders. We tested each product’s utility suite (load, unload, backup, restore), its scalability capability along with linear query performance, and the product’s ability to recover seamlessly from server crashes (high availability) without causing an application outage. This extensive level of testing allowed us to gain an intimate knowledge of the products, how to manage them, and even some insight into how their service organizations dealt with software defects. We chose Greenplum due to its superior query performance using a columnar architecture, ease of migration and server management, parallel in-database analytics, the product’s vision and roadmap, and strong management commitment and financial backing.

Supporting our Greenplum decision was Pivotal’s commitment to our success. Our users had strict timelines for their migration to the Greenplum platform. During the PoC and our initial stress tests, we discovered some areas that required improvement. Our deployment schedule was aggressive, and software fixes and updates were needed at a faster cadence than Greenplum’s previous software-release cycle. Scott Yara, one of Greenplum’s founders, was actively engaged with our account, and he responded to our needs by assigning Ivan Novick, Greenplum’s current product manager, to work with us and adapt their processes to meet our need for faster software defect repair and enhancement delivery. This demonstrated Pivotal’s strong customer focus and commitment to Morgan Stanley.

To improve the working relationship even further and align our engineering teams, Pivotal established a Pivotal Tracker (issue tracker, similar to Jira) account, which shortened the feedback loop and improved Morgan Stanley’s communication with the Pivotal engineering teams. We had direct access to key engineers and visibility into their sprints. This close engagement allowed us to do more with Greenplum at a faster pace.

Our initial Greenplum projects were highly successful and our plant doubled annually. The partnership with Pivotal evolved and Pivotal agreed to support our introduction of Postgres into our environment, even though Postgres was not a Pivotal offering at the time.

As we became customer zero on Pivotal Postgres, we aligned our online transaction processing (OLTP) and big data analytic offerings on a Postgres foundation. Eventually, Pivotal would go all in with Postgres by open sourcing Greenplum and offering Pivotal Postgres as a generally available product. Making Greenplum the first open source massively parallel processing (MPP) database built on Postgres gave customers direct access to the code base and allowed Pivotal to tap into the extremely vibrant and eager community that wanted to promote Postgres and the open source paradigm. This showed Pivotal’s commitment to open source and allowed them to leverage open source code for core Postgres features and direct their focus on key distinguishing features of Greenplum such as an MPP optimizer, replicated tables, workload manager (WLM), range partitioning, and graphical user interface (GUI) command center. Greenplum continues to integrate their product with key open source compute paradigms. For example, with Pivotal’s Platform Extension Framework (PXF), Greenplum can read and write to Hadoop Distributed File System (HDFS) and its various popular formats such as Parquet. Greenplum also has read/write connectors to Spark and Kafka. In addition, Greenplum has not neglected the cloud, where they have the capability to write to an Amazon Web Services (AWS) Amazon Simple Storage Service (Amazon S3) object store and have hybrid cloud solutions that run on any of the major cloud vendors. The cloud management model is appealing to Morgan Stanley because managing large big data platforms on-premises is challenging. The cloud offers near-instant provisioning, flexible and reliable hardware options, near-unlimited scalability, and snapshot backups. Pivotal’s strategic direction of leveraging open source Postgres and investing in the cloud aligns with Morgan Stanley’s strategic vision.

The Morgan Stanley Greenplum plant is in the top five of the Greenplum customer footprints due to the contributions of many teams within Morgan Stanley. As our analytic compute requirements grow and evolve, Morgan Stanley will continue to leverage technology to solve complex business problems and drive innovation.

— Howard Goldberg
Executive Director, Global Head of Database Engineering
Morgan Stanley


Foreword to the First Edition

In the mid-1980s, the phrase “data warehouse” was not in use. The concept of collecting data from disparate sources, finding a historical record, and then integrating it all into one repository was barely technically possible. The biggest relational databases in the world did not exceed 50 GB in size. The microprocessor revolution was just getting underway, and two companies stood out: Tandem, which lashed together microprocessors and distributed online transaction processing (OLTP) across the cluster; and Teradata, which clustered microprocessors and distributed data to solve the big data problem. Teradata was named from the concept of a terabyte of data (1,000 GB), an unimaginable amount of data at the time.

Until the early 2000s Teradata owned the big data space, offering its software on a cluster of proprietary servers that scaled beyond its original 1 TB target. The database market seemed set and stagnant, with Teradata at the high end; Oracle and Microsoft’s SQL Server product in the OLTP space; and others working to hold on to their diminishing share.

But in 1999, a new competitor, soon to be renamed Netezza, entered the market with a new proprietary hardware design and a new indexing technology, and began to take market share from Teradata.

By 2005, other competitors, encouraged by Netezza’s success, entered the market. Two of these entrants are noteworthy. In 2003, Greenplum entered the market with a product based on PostgreSQL; it utilized the larger memory in modern servers to good effect with a data flow architecture, and reduced costs by deploying on commodity hardware. In 2005, Vertica was founded based on a major reengineering of the columnar architecture first implemented by Sybase. The database world would never again be stagnant.

This book is about Greenplum, and there are several important characteristics of this technology that are worth pointing out. The concept of flowing data from one step in the query execution plan to another without writing it to disk was not invented at Greenplum, but it implemented this concept effectively. This resulted in a significant performance advantage.

Just as important, Greenplum elected to deploy on regular, nonproprietary hardware. This provided several advantages. First, Greenplum did not need to spend R&D dollars engineering systems. Next, customers could buy hardware from their favorite providers using any volume purchase agreements that they already might have had in place. In addition, Greenplum could take advantage of the fact that the hardware vendors tended to leapfrog one another in price and performance every four to six months. Greenplum was achieving a 5–15% price/performance boost several times a year, for free. Finally, the hardware vendors became a sales channel. Big players like IBM, Dell, and HP would push Greenplum over other players if they could make the hardware sale.

Building Greenplum on top of PostgreSQL was also noteworthy. Not only did this allow Greenplum to offer a mature product much sooner, it could use system administration, backup and restore, and other PostgreSQL assets without incurring the cost of building them from scratch. The architecture of PostgreSQL, which was designed for extensibility by a community, provided a foundation from which Greenplum could continuously grow core functionality.

Vertica was proving that a full implementation of a columnar architecture offered a distinct advantage for complex queries against big data, so Greenplum quickly added a sophisticated columnar capability to its product. Other vendors were much slower to react and then could only offer parts of the columnar architecture in response. The ability to extend the core paid off quickly, and Greenplum’s implementation of columnar still provides a distinct advantage in price and performance.

Further, Greenplum saw an opportunity to make a very significant advance in the way big data systems optimize queries, and thus the GPORCA optimizer was developed and deployed.


During the years following 2006, these advantages paid off and Greenplum’s market share grew dramatically until 2010.

In early 2010, the company decided to focus on a part of the data warehouse space for which sophisticated analytics were the key. This strategy was in place when EMC acquired Greenplum in the middle of that year. The EMC/Greenplum match was odd. First, the niche approach toward analytics and away from data warehousing and big data would not scale to the size required by such a large enterprise. Next, the fundamental shared-nothing architecture was an odd fit in a company whose primary products were shared storage devices. Despite this, EMC worked diligently to make a fit and it made a significant financial investment to make it go. In 2011, Greenplum implemented a new strategy and went “all-in” on Hadoop. It was no longer “all-in” on the Greenplum Database.

In 2013, EMC spun the Greenplum division into a new company, Pivotal Software. From that time to the present, several important decisions were made with regard to the Greenplum Database. Importantly, the product is now open sourced. Like many open source products, the bulk of the work is done by Pivotal, but a community is growing. The growth is fueled by another important decision: to reemphasize the use of PostgreSQL at the core.

The result of this is a vibrant Greenplum product that retains the aforementioned core value proposition: the product runs on hardware from your favorite supplier; the product is fast and supports both columnar and tabular tables; the product is extensible; and Pivotal has an ambitious plan in place that is feasible.

The bottom line is that the Greenplum Database is capable of winning any fair competition and should be considered every time.

I am a fan.

— Rob Klopp
Ex-CIO, US Social Security Administration
Author of The Database Fog Blog


Preface

Welcome to Pivotal Greenplum, the open source massively parallel analytic database.

Why Are We Rewriting This Book?

Greenplum has come a long way since we wrote this book in 2016. There have been two major releases: version 5, in 2017, and version 6, announced in March 2019. While the basic principles of using Pivotal Greenplum as an analytic data warehouse have not changed, there are many new features that simply did not exist at the time of the first edition of this book. Some changes for this edition are in response to Greenplum customer requests. Others have arisen from analysis of code users have sent in. Yet others are derived from Pivotal’s project to move Greenplum to more recent versions of PostgreSQL than the version used as a base more than a decade ago. New features will be highlighted where discussed in the book with a note that looks like this:

New in Greenplum Version 6

This feature first appears in Greenplum version 6.

In addition, technology changes in the past two years have opened up new approaches to building Greenplum clusters. These are discussed in Chapter 3. The tools available to do analytics have also grown and are discussed in Chapter 6. Greenplum Command Center has added many new capabilities that are discussed in Chapter 7.


The need to do federated queries and integrate data from a wide variety of sources has led to new tools for using and ingesting data into Greenplum. This is discussed in Chapter 8.

Why Did We Write This Book in the First Place?

When we at Pivotal decided to open-source the Pivotal Greenplum Database, we decided that an open source software project should have more information than that found in online documentation. As a result, we believed we should provide a nontechnical introduction to Greenplum that does not live on a vendor’s website. Many other open source projects, especially those under the Apache Software Foundation, have books, and as Greenplum is an important project, it should as well. Our goal is to introduce the features and architecture of Greenplum to a wider audience.

Who Are the “We”?

Marshall Presser is the lead author of this book, but many others at Pivotal have contributed content, advice, editing, proofreading, suggestions, topics, and so on. Their names are listed in the Acknowledgments section and, when appropriate, in sections where they have written extensively. It takes a village to raise a child, and it can take a crowd to source a book.

Who Should Read This Book?

Anyone with a background in IT, relational databases, big data, or analytics can profit from reading this book. It is not designed for experienced Greenplum users who are interested in the more technical features or those expecting detailed technical discussion of optimal query and loading performance, and so on. We provide pointers to more detailed information if you’re interested in a deeper dive.

What the Book Covers

This book covers the basic features of the Greenplum Database, beginning with an introduction to the Greenplum architecture and then describing data organization and storage, data loading, running queries, and doing analytics in the database, including text analytics.


In addition, there is material on monitoring and managing Greenplum, deployment options, and some other topics as well.

What It Doesn’t Cover

We don’t cover query tuning, memory management, best practices for indexes, adjusting the collection of database parameters (known as GUCs), or converting to Greenplum from other relational database systems. These topics and others are all valuable, but they are covered elsewhere and would bog down this introduction with too much detail.

Where You Can Find More Information

At the end of each chapter, there is a section pointing to more information on the topic.

How to Read This Book

It’s been our experience that a good understanding of the Greenplum architecture goes a long way. An understanding of the basic architecture makes the sections on data distribution and data loading seem intuitive. Conversely, a lack of understanding of the architecture will make the rest of the book more difficult to comprehend.

We would suggest that you begin with Chapter 1 and Chapter 4 and then peruse the rest of the book as your interests dictate. If you prefer, you’re welcome to start at the beginning and work your way in chronological order to the end. That works, too!

Acknowledgments

I owe a huge debt to my colleagues at Pivotal who helped explicitly with this work and from whom I’ve learned so much in my years at Pivotal and Greenplum. I cannot name them all, but you know who you are.

Special callouts to the section contributors (in alphabetical order by last name):

• Oak Barrett for Greenplum Management Tools

• Kaushik Das for Data Science on Greenplum


• Scott Kahler for Greenplum Management Tools

• John Knapp for Greenplum-GemFire Connector

• Jing Li for Greenplum Command Center

• Frank McQuillan for Data Science on Greenplum

• Venkatesh Raghavan for GPORCA, Optimizing Query Response

• Craig Sylvester and Bharath Sitaraman for GPText

Other contributors, reviewers, and colleagues:

• Jim Campbell, Craig Sylvester, Venkatesh Raghavan, and Frank McQuillan for their yeoman work in reading the text and helping improve it no end

• Cesar Rojas for encouraging Pivotal to back the book project

• Jacque Istok and Dormain Drewitz for encouraging me to write this book

• Ivan Novick especially for the Greenplum list of achievements and the Agile development information

• Elisabeth Hendrickson for her really useful content suggestions

• Bob Glithero for working out some marketing issues

• Jon Roberts, Scott Kahler, Mike Goddard, Derek Comingore, Louis Mugnano, Rob Eckhardt, Ivan Novick, and Dan Baskette for the questions they answered


Chapter 1: Introducing the Greenplum Database

Problems with the Traditional Data Warehouse

Sometime near the end of the twentieth century, there was a notion in the data community that the traditional relational data warehouse was floundering. As data volumes began to increase in size, the data warehouses of the time were beginning to run out of power and not scaling up in performance. Data loads were struggling to fit in their allotted time slots. More complicated analysis of the data was often pushed to analytic workstations, and the data transfer times were a significant fraction of the total analytic processing times. Furthermore, given the technology of the time, the analytics had to be run in-memory, and memory sizes were often only a fraction of the size of the data. This led to sampling the data, which can work well for some techniques but not for others, such as outlier detection. Ad hoc queries on the data presented performance challenges to the warehouse. The database community sought to provide responses to these challenges.


Responses to the Challenge

One alternative was NoSQL. Advocates of this position contended that SQL itself was not scalable and that performing analytics on large datasets required a new computing paradigm. Although the NoSQL advocates had successes in many use cases, they encountered some difficulties. There are many varieties of NoSQL databases, often with incompatible underlying models. Existing tools had years of experience in speaking to relational systems. There was a smaller community that understood NoSQL better than SQL, and analytics in this environment was still immature. The NoSQL movement morphed into a Not Only SQL movement, in which both paradigms were used when appropriate.

Another alternative was Hadoop. Originally a project to index the World Wide Web, Hadoop soon became a more general data analytics platform. MapReduce was its original programming model; this required developers to be skilled in Java and have a fairly good understanding of the underlying architecture to write performant code. Eventually, higher-level constructs emerged that allowed programmers to write code in Pig or even let analysts use HiveQL, SQL on top of Hadoop. However, SQL was never as complete or performant as a true relational system.

In recent years, Spark has emerged as an in-memory analytics platform. Its use is rapidly growing as the dramatic drop in price of memory modules makes it feasible to build large memory servers and clusters. Spark is particularly useful in iterative algorithms and large in-memory calculations, and its ecosystem is growing. Spark is still not as mature as older technologies, such as relational systems, and it has no native storage model.

Yet another response was the emergence of clustered relational systems, often called massively parallel processing (MPP) systems. The first entrant into this world was Teradata, in the mid- to late 1980s. In these systems, the relational data, traditionally housed in a single system image, is dispersed into many systems. This model owes much to the scientific computing world, which discovered MPP before the relational world. The challenge faced by the MPP relational world was to make the parallel nature transparent to the user community so coding methods did not require change or sophisticated knowledge of the underlying cluster.


A Brief Greenplum History

Greenplum took the MPP approach to deal with the limitations of the traditional data warehouse. It was originally founded in 2003 by Scott Yara and Luke Lonergan as a merger of two companies, Didera and Metapa. Its purpose was to produce an analytic data warehouse with three major goals: rapid query response, rapid data loading, and rapid analytics by moving the analytics to the data.

It is important to note that Greenplum is an analytic data warehouse and not a transactional relational database. Although Greenplum does have the notion of a transaction, which is useful for extract, transform, and load (ETL) jobs, that’s not its purpose. It’s not designed for transactional purposes like ticket reservation systems, air traffic control, or the like. Successful Greenplum deployments include but are not limited to the following:

• Fraud analytics

• Financial risk management

• Cybersecurity

• Customer churn reduction

• Predictive maintenance analytics

• Manufacturing optimization

• Smart cars and internet of things (IoT) analytics

• Insurance claims reporting and pricing analytics

• Health-care claim reporting and treatment evaluations

• Student performance prediction and dropout prevention

• Advertising effectiveness

• Traditional data warehouses and business intelligence (BI)

From the beginning, Greenplum was based on PostgreSQL, the popular and widely used open source database. Greenplum kept in sync with PostgreSQL releases until it forked from the main PostgreSQL line at version 8.2.15.

The first version from this new company arrived in 2005, called Bizgres. In the same year, Greenplum and Sun Microsystems formed a partnership to build a 48-disk, 4-CPU, appliance-like product, following the success of the Netezza appliance. What distinguishes the two is that Netezza required special hardware, whereas all Greenplum products have always run on commodity servers, never requiring a special hardware boost.

2007 saw the first publicly known Greenplum product, version 3.0. Later releases added many new features, most notably mirroring and high availability (HA)—at a time when the underlying PostgreSQL could not provide any of these things.

In 2010, a consolidation began in the MPP database world. Many smaller companies were purchased by larger ones. EMC purchased Greenplum in July 2010, just after the release of Greenplum version 4.0. EMC packaged Greenplum into a hardware platform, the Data Computing Appliance (DCA). Although Greenplum began as a pure software play, with customers providing their own hardware platform, the DCA became the most popular platform.

2011 saw the release of the first paper describing Greenplum’s approach to in-database machine learning and analytics, MADlib. (MADlib is described in more detail in Chapter 6.) In 2012, EMC purchased Pivotal Labs, a well-established San Francisco–based company that specialized in application development incorporating pair programming, Agile methods, and involving the customer in the development process. This proved to be important not only for the future development process of Greenplum, but also for giving a name to Greenplum’s 2013 spinoff from EMC. The spinoff was called Pivotal Software and included assets from EMC as well as from VMware. These included the Java-centric Spring Framework, RabbitMQ, the Platform as a Service (PaaS) Cloud Foundry, and the in-memory data grid Apache Geode, available commercially as GemFire.

In 2015, Pivotal announced that it would adopt an open source strategy for its product set. Pivotal would donate much of the software to the Apache Foundation and the software then would be freely licensed under the Apache rules. However, it maintained a subscription-based enterprise version of the software, which it continues to sell and support.

The Pivotal data products now include the following:


• Pivotal GemFire, the commercial version of Apache Geode

Officially, the open source version is known as the Greenplum Database and the commercial version is the Pivotal Greenplum Database. With the exception of some features that are proprietary and available only with the commercial edition, the products are the same.

By 2015, many customers were beginning to require open source. Greenplum’s adoption of an open source strategy saw Greenplum community contributions to the software as well as involvement of PostgreSQL contributors. Pivotal sees the move to open source as having several advantages:

• Avoidance of vendor lock-in

• Ability to attract talent in Greenplum development

• Faster feature addition to Greenplum with community involvement

• Greater ability to eventually merge Greenplum to the current PostgreSQL version

• Meeting the demand of many customers for open source

There are several distinctions between the commercial Pivotal Greenplum and the open source Greenplum. Pivotal Greenplum offers the following:

• 24/7 premium support

• Database installers and production-ready releases

• GP Command Center—GUI management console

• GP Workload Management—dynamic, rule-based resource management

• GPText—Apache Solr–based text analytics

• Greenplum-Kafka Integration

• Greenplum Stream Server

• gpcopy utility—to migrate data between Greenplum systems

• GemFire-Greenplum Connector—data transfer between Pivotal Greenplum and the Pivotal GemFire low-latency, in-memory data grid

• QuickLZ compression


• Open Database Connectivity (ODBC) and Object Linking and Embedding, Database (OLEDB) drivers for Pivotal Greenplum

GPText is covered in Chapter 6; GP Command Center and GP Workload Management are discussed in Chapter 7; and Greenplum-Kafka Integration, Greenplum Stream Server Integration, and the GemFire-Greenplum Connector are discussed in Chapter 8.

2015 also saw the arrival of the Greenplum development organization’s use of an Agile development methodology; in 2016, there were 10 releases of Pivotal Greenplum, which included such features as the release of the GPORCA optimizer, a high-powered, highly parallel cost-based optimizer for big data. In addition, Greenplum added features like a more sophisticated Workload Manager to deal with issues of concurrency and runaway queries, and the adoption of a resilient connection pooling mechanism. The Agile release strategy allows Greenplum to quickly incorporate both customer requests as well as ecosystem features.

With the wider adoption of cloud-based systems in data warehous‐ing, Greenplum added support for Amazon Simple Storage Service(Amazon S3) files for data as well as support for running PivotalGreenplum in both Amazon Web Services (AWS) as well as Micro‐soft’s Azure and, recently, Google Cloud Platform (GCP) 2016 saw

an improved Command Center monitoring and management tooland the release of the second-generation of native text analytics inPivotal Greenplum But perhaps most significant is Pivotal’s com‐mitment to reintegrate Greenplum into more modern versions ofPostgreSQL, eventually leading to PostgreSQL 9.x support This isbeneficial in many ways Greenplum will acquire many of the fea‐tures and performance improvements added to PostgreSQL in thepast decade In return, Pivotal then can contribute back to the com‐munity

Pivotal released Greenplum 5 in mid-2017. In Greenplum 5, the development team cleaned up many diversions from mainline PostgreSQL, focusing on where the MPP nature of Greenplum matters and where it doesn’t. In doing this, the code base is now considerably smaller and thus easier to manage and support.


It included features such as the following:

• JSON support, which is of interest to those linking Greenplum and MongoDB and translating JSON into a relational format

• XML enhancements such as an increased set of functions for importing XML data into Greenplum

• PostgreSQL-based Analyze, which is an order of magnitude faster in generating table statistics

• Enhanced vacuum performance

• Lazy transaction IDs, which translate into fewer vacuum operations

• Universally unique identifier (UUID) data type

• Raster PostGIS

• User-defined function (UDF) default parameters
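As a small taste of the JSON support listed above, JSON fields can be extracted directly in SQL; the document in this sketch is invented for illustration:

```sql
-- The ->> operator extracts a JSON field as text
SELECT ('{"company": "Acme Widget", "employees": 120}'::json) ->> 'company'
    AS company;
-- company
-- -----------
-- Acme Widget
```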

What Is Massively Parallel Processing?

To best understand how MPP came to the analytic database world, it’s useful to begin with scientific computing.

Stymied by the amount of time required to do complex mathematical calculations, Seymour Cray introduced vectorized operations in the Cray-1 in the early 1970s. In this architecture, the CPU acts on all the elements of the vector simultaneously, or in parallel, speeding the computation dramatically. As Cray computers became more expensive and budgets for science were static or shrinking, the scientific community expanded the notion of parallelism by dividing complex problems into small portions and dividing the work among a number of independent, small, inexpensive computers. This group of computers became known as a cluster. Tools to decompose complex problems were originally scarce and much expertise was required to be successful. The original attempts to extend the MPP architecture to data analytics were difficult. However, a number of small companies discovered that it was possible to start with standard SQL relational databases, distribute the data among the servers in the cluster, and transparently parallelize operations. Users could write SQL code without knowing the data distribution. Greenplum was one of the pioneers in this endeavor.


Here’s a small example of how MPP works. Suppose that there is a box of 1,200 business cards. The task is to scan all the cards and find the names of all those who work for Acme Widget. If a person can scan one card per second, it would take that one person 20 minutes to find all those people whose card says Acme Widget.

Let’s try it again, but this time distribute the cards into 10 equal piles of 120 cards each and recruit 10 people to scan the cards, each one scanning the cards in one pile. If they simultaneously scanned at the rate of 1 card per second, they would all finish in about 2 minutes. This cuts down the time required by a factor of 10.

This idea of data and workload distribution is at the heart of MPP database technology. In an MPP database, the data is distributed in chunks to all the nodes in the cluster. In the Greenplum database, these chunks of data and the processes that operate on them are known as segments. In an MPP database, as in the business card example, the amount of work distributed to each segment should be approximately the same to achieve optimal performance.
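In Greenplum’s SQL, this distribution is declared when a table is created, and the system column gp_segment_id reveals where each row landed. A sketch, using a hypothetical sales table:

```sql
-- Rows are hashed on customer_id and spread across all segments
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    amount      numeric
) DISTRIBUTED BY (customer_id);

-- A quick check that the rows are spread evenly, as in the card example
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

A lopsided result from the second query indicates a poor choice of distribution key, the database analogue of giving one card-scanner a much bigger pile than the others.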

Of course, Greenplum is not the only MPP technology or even the only MPP database. Hadoop is a common MPP data storage and analytics tool. Spark also has an MPP architecture. Pivotal GemFire is an in-memory data-grid MPP architecture. These are all very different from Greenplum because they do not natively speak standard SQL.

The Greenplum Database Architecture

The Greenplum Database employs a shared-nothing architecture. This means that each server or node in the cluster has its own independent operating system (OS), memory, and storage infrastructure. Its name notwithstanding, in fact there is something shared, and that is the network connection between the nodes that allows them to communicate and transfer data as necessary. Figure 1-1 illustrates the Greenplum Database architecture.


Figure 1-1. The Greenplum MPP architecture

Master and Standby Master

Greenplum users and database administrators (DBAs) connect to a master server, which houses the metadata for the entire system. This metadata is stored in a PostgreSQL database derivative. When the Greenplum instance on the master server receives a SQL statement, it parses it, examines the metadata repository, forms a plan to execute that statement, passes the plan to the workers, and awaits the result. In some circumstances, the master must perform some of the computation.

Only metadata is stored on the master. All the user data is stored on the segment servers, the worker nodes in the cluster. In addition to the master, all production systems should also have a standby server. The standby is a passive member of the cluster, whose job is to receive mirrored copies of changes made to the master’s metadata. In case of a master failure, the standby has a copy of the metadata, preventing the master from becoming a single point of failure.

Some Greenplum clusters use the standby as an ETL server because it has unused memory and CPU capacity. This might be satisfactory when the master is working, but in times of a failover to the standby, the standby is doing the ETL work as well as the role of the master. This can become a choke point in the architecture.


Segments and Segment Hosts

Greenplum distributes user data into what are often known as shards, but are called segments in Greenplum. A segment host is the server on which the segments reside. Typically, there are several segments running on each segment server. In a Greenplum installation with eight segment servers, each might have six segments for a total of 48 segments. Every user table in Greenplum will have its data distributed in all 48 segments. (We go into more detail on distributing data in Chapter 4.) Unless directed by Greenplum support, users or DBAs should never connect to the segments themselves except through the master.

A single Greenplum segment server runs multiple segments. Thus, all other things being equal, it will run faster than a single-instance database running on the same server. That said, you should never use a single-instance Greenplum installation for a business-critical process because it provides no HA or failover in case of hardware, software, or storage error.
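From a session on the master, the layout of segments across hosts can be inspected in the gp_segment_configuration system catalog table, without ever connecting to a segment directly. For example:

```sql
-- One row per segment instance; content -1 identifies the master and standby
SELECT content, role, hostname, port
FROM gp_segment_configuration
ORDER BY content, role;
```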

Private Interconnect

The master must communicate with the segments, and the segments must communicate with one another. They do this on a private User Datagram Protocol (UDP) network that is distinct from the public network on which users connect to the master. This private network is critical: were the segments to communicate on the public network, user downloads and other heavy loads would greatly affect Greenplum performance. Greenplum requires a 10 Gb network and strongly urges using 10 Gb switches for redundancy.

Other than the master, the standby, and the segment servers, some other servers might be plumbed into the private interconnect network. Greenplum will use these to do fast parallel data loading and unloading. We discuss this topic in Chapter 5.

Mirror Segments

In addition to the redundancy provided by the standby master, Greenplum strongly urges the creation of mirror segments. These are segments that maintain a copy of the data on a primary segment, the one that actually does the work. Should either a primary segment or the host housing a primary segment fail, the mirrored segment contains all of the data on the primary segment. Of course, the primary and its mirror must reside on different segment hosts. When a segment fails, the system automatically fails over from the primary to the mirrored segment, but operations in flight fail and must be restarted. DBAs can run a process to recover the failed segment to synchronize it with the current state of the databases.
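That recovery process is driven by the gprecoverseg utility. Before running it, a DBA can list the segment instances the system has marked down:

```sql
-- status 'd' means the segment instance is down; 'u' means up
SELECT content, role, hostname, port
FROM gp_segment_configuration
WHERE status = 'd';
```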

Additional Resources

There is also a wealth of information about Greenplum available on the Pivotal website.

Latest Documentation

When you see a Greenplum documentation page URL that includes the word “latest,” this refers to the latest Greenplum release. That web page can also get you to documentation for all supported releases.

Greenplum Best Practices Guide

If you’re implementing Greenplum, you should read the Best Practices guide. As this book should make clear, there are a number of things to consider to make good use of the power of the Greenplum Database. If you’re accustomed to single-node databases, you are likely to miss some of the important issues that this document helps explain. No one should build or manage a production cluster without reading it and following its suggestions.

Greenplum Cluster Concepts Guide

For a discussion on building Greenplum clusters, the Greenplum Cluster Concepts Guide (PDF) is invaluable for understanding cluster issues that pertain to appliances, cloud-based Greenplum, or customer-built commodity clusters.

PivotalGuru (Formerly Greenplum Guru)

Pivotal’s Jon Roberts has a PivotalGuru website that discusses issues and techniques that many Greenplum users find valuable. Although Jon is a long-time Pivotal employee, the content on PivotalGuru is his own and is not supported by Pivotal.

Pivotal Greenplum Blogs

Members of the Greenplum engineering, field services, and data science groups blog about Greenplum, providing insights and use cases not found in many other locations. These tend to be more technical in nature.

Greenplum YouTube Channel

The Greenplum YouTube channel has informative content on a variety of technical topics. Most of the videos are general and will be helpful to those without prior experience with Greenplum. Others, such as the discussion on the Pivotal Query Optimizer, go into considerable depth.

In addition to this material, there are some interesting discussions about the history of Greenplum that warrant examination:

• Greenplum’s origin (video)

• Jacque Istok on Greenplum in the wild (video)

Greenplum Knowledge Base

The Pivotal Greenplum Knowledge Base houses general questions, their answers, and some detailed discussions on deeper topics. Some topics in the knowledge base can get a bit esoteric. Probably not a place to start, this resource is more suited to experienced users.


As part of the Greenplum open source initiative, Pivotal formed two Google groups tasked with facilitating discussion about Greenplum. The first group deals with user questions about Greenplum; Pivotal data personnel monitor the group and provide answers in a timely fashion. The second group is a conversation vehicle for the Greenplum development community. (To view these, you need to be a member of the Google groups gpdb-dev and gpdb-users.) In the spirit of transparency, Pivotal invites interested parties to listen in and learn about the development process and potentially contribute. There are two mailing lists that provide insight into the current and future state of the Greenplum database: gpdb-users@greenplum.org, on issues arising from the user community, and a second list for Greenplum developers.

Other Sources

Internet searches on Greenplum return many hits. It’s wise to remember that not everything on the internet is accurate. In particular, as Greenplum evolves over time, comments made in years past might no longer reflect the current state of things.


CHAPTER 2

What’s New in Greenplum?

Technology products evolve over time. Greenplum forked from the mainline branch of PostgreSQL at release 8.2.15, but continued to add new PostgreSQL features. PostgreSQL also evolved over time, and Pivotal began the process of reintegrating Greenplum into PostgreSQL with the goals of introducing the useful new features of later releases of PostgreSQL into Greenplum while also adding Greenplum-specific features into PostgreSQL.

This process began in release 5 of Greenplum in 2017 and continues with release 6 of Greenplum in 2019.

What’s New in Greenplum 5?

New in Greenplum Version 5

Following is a list of the new features in Greenplum 5. See later sections of the book for more details on some of these features.

R and Python data science modules

These are collections of open source packages that data scientists find useful. They can be used in conjunction with the procedural languages for writing sophisticated analytic routines.

New datatypes

JSON, UUID, and improved XML support

Enhanced query optimization

The GPORCA query optimizer has increased support for more complex queries.

PXF extension format for integrating external data

PXF is a framework for accessing data external to Greenplum. This is discussed in Chapter 8.

analyzedb enhancement

Critical for good query performance is the understanding of the size and data contents of tables. This utility was enhanced to cover more use cases and provide increased performance.

PL/Container for untrusted languages

Python and R are untrusted languages because they contain OS callouts. As a result, only the database superuser can create functions in these languages. Greenplum 5 added the ability to run such functions in a container isolated from the OS proper so superuser powers are not required.

Improved backup and restore and incremental backup

Enhancements to the tools used to back up and restore Greenplum. These will be discussed in Chapter 7.

Resource groups to improve concurrency

The ability to control queries cannot be overestimated in an analytic environment. The new resource group mechanism is discussed in Chapter 7.

Greenplum-Kafka integration

Kafka has emerged as a leading technology for data dissemination and integration for real-time and near-real-time data streaming. Its integration with Greenplum is discussed in Chapter 8.


Enhanced monitoring with Pivotal Greenplum Command Center 4

Critical for efficient use of Greenplum is the ability to understand what is occurring in Greenplum now and in the past. This is discussed in Chapter 7.
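As a sketch of the resource group mechanism listed above (the group name, role name, and limit values here are illustrative, not recommendations):

```sql
-- Cap concurrency, CPU share, and memory share for a class of users
CREATE RESOURCE GROUP rg_analysts WITH
    (CONCURRENCY=10, CPU_RATE_LIMIT=20, MEMORY_LIMIT=20);

-- Queries from this role are now governed by the group's limits
ALTER ROLE analyst RESOURCE GROUP rg_analysts;
```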

What’s New in Greenplum 6?

New in Greenplum Version 6

This is a list of the new features in Greenplum 6. Some features are explored in more detail later in the book.

Greenplum 6 continued the integration of later PostgreSQL releases and is now fully compatible with PostgreSQL 9.4. Pivotal is on a quest to add more PostgreSQL compatibility with each new major release.

PostgreSQL 9.4 merged

Pivotal Greenplum now has integrated the PostgreSQL 9.4 code base. This opens up new features and absorbs many performance improvements.

Write-ahead logging (WAL) replication

WAL is a PostgreSQL method for assuring data integrity. Though beyond the scope of this book, more information about it is located in the High Availability section of the Administrator Guide.

Row-level locking

Prior to Greenplum 6, updates to tables required locking the entire table. The introduction of locks on single rows can improve performance by a factor of 50.

Foreign data wrapper

The foreign data wrapper API allows Greenplum to access other data sources as though they were PostgreSQL or Greenplum tables. This is discussed in Chapter 8.

PostgreSQL extension (e.g., pgaudit)

The inclusion of PostgreSQL 9.4 code brings along many utilities that depend upon 9.4 functionality in the database. pgaudit is a contributed tool that makes use of that.


Recursive common table expressions (CTEs)

CTEs are like temporary tables, but they only exist for the duration of the SQL statement. Recursive CTEs reference themselves and are useful in querying hierarchical data.

JSON, FTS, GIN indexes

These are specialized indexes for multicolumn and text-based searches. They are not discussed in this book.

Vastly improved online transaction processing (OLTP) performance

Greenplum 6 now uses row-level locking for data manipulation language (DML) operations. This has an enormous impact on the performance of these operations, which often occur in ETL and data cleansing.

Replicated tables

Replicated tables have long been requested by customers. These are discussed in Chapter 4.

zStandard compression

zStandard is a fast lossless compression algorithm.

More efficient cluster expansion

Cluster expansion, though a rare event, requires computational and disk access resources to redistribute the data. A new algorithm minimizes this time.

Greenplum on Kubernetes

Useful in on-premises cloud deployment; this is discussed in Chapter 3.

More optimizer enhancements

Other than performance improvements, these are mostly transparent to the user community.

Diskquota

The Diskquota extension provides disk usage enforcement for database objects. It sets a limit on the amount of disk space that a schema or a role can use.
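Of the features above, recursive CTEs are the easiest to show in a short example. This sketch walks a hypothetical employees table that stores an org chart (the table and column names are invented for illustration):

```sql
-- Each employee row points at its manager; the top of the hierarchy
-- has a NULL manager_id
WITH RECURSIVE org AS (
    -- Anchor member: start from the top of the hierarchy
    SELECT id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
  UNION ALL
    -- Recursive member: direct reports of rows already found
    SELECT e.id, e.name, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id
)
SELECT name, depth FROM org ORDER BY depth, name;
```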


CHAPTER 3

Deploying Greenplum

Greenplum remains a software company. That said, its analytic data warehouse requires a computing environment, and there are many options. As the computing world continues to evolve, Greenplum’s deployment options have embraced these changes. The evolution of faster networks, high-speed memory-based storage, and multicore CPUs has led to a rethinking of how to build “bare metal” Greenplum clusters. The advances in facilities offered in public and private clouds make them more attractive as deployment options for Greenplum. And lastly, the emergence of container-based products like Kubernetes provides yet another deployment option.

Custom(er)-Built Clusters

For those customers who wish to deploy Greenplum on hardware in their own datacenter, Greenplum has always provided a cluster-aware installer but assumed that the customer had correctly built the cluster. This strategy provided a certain amount of flexibility. For example, customers could configure exactly the number of segment hosts they required and could add hosts when needed. They had no restrictions on which brand of network gear to use, how much memory per node, or the number or size of the disks. On the other hand, building a cluster is considerably more difficult than configuring a single server. To this end, Greenplum has a number of facilities that assist customers in building clusters.

Today, there is much greater experience and understanding in building MPP clusters, but a decade ago, this was much less true and
