DOCUMENT INFORMATION

Title: Data Analysis Using SQL and Excel®
Author: Gordon S. Linoff
Type: Book
Edition: Second Edition
Pages: 795
File size: 28.03 MB


Data Analysis Using SQL and Excel®

Second Edition

Gordon S. Linoff

Copyright © 2016 by John Wiley & Sons, Inc., Indianapolis, Indiana

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2015950486

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Excel is a registered trademark of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.


About the Author 

Gordon S. Linoff has been working with databases, big data, and data mining for almost longer than he can remember. With decades of experience on the practice of using data effectively, he is a recognized expert in the field of data mining.

Gordon started using spreadsheets while a student at MIT, on the original Compaq Portable, the world’s first luggable computer. Not very many years later, he managed a development group at the now-defunct Thinking Machines Corporation, tasked with building a massively parallel relational database for decision support.

After Thinking Machines’ demise, he founded Data Miners in 1998 with his friend and former colleague Michael J. A. Berry (who left in 2012). Since then, he has worked on a wide diversity of projects across many different companies. He has taught hundreds of classes around the world on data mining and survival analysis through SAS Institute, a leader in statistical and business analytics software. He is also an avid contributor to Stack Overflow, particularly on questions related to databases, having the highest score in 2014.

Together with Michael Berry, Gordon has written several influential books on data mining, including Data Mining Techniques for Marketing, Sales, and Customer Support, the first book on data mining to achieve a third edition.

Gordon lives in New York with Giuseppe Scalia, his partner of 25 years.


Acknowledgments

Although this book has only one name on the cover, many people have helped me both specifically on this book and more generally in understanding data, analysis, and presentation.

I first met Michael Berry in 1990. We later founded Data Miners together, and he has been helpful on all fronts. He reviewed the chapters, tested the SQL code in the examples, and helped anonymize the data. His insights have been helpful and his debugging skills have made the examples much more accurate. His wife, Stephanie Jack, also deserves special praise for her patience and willingness to share Michael’s time.

The original idea for the book came from Nick Drake, who then worked at Datran Media. A statistician by training, Nick was looking for a book that would help him use databases for data analysis. Bob Elliott, at the time my editor at Wiley, liked the idea.

Throughout the chapters, the understanding of data processing is based on dataflows, which Craig Stanfill of Ab Initio Corporation first introduced me to long ago when we worked together at Thinking Machines Corporation.

Along the way, I have learned a lot from many people. Anne Milley of SAS Institute first suggested that I learn survival analysis. Will Potts, now working at CapitalOne, then taught me much of what I know about the subject. Brij Masand helped extend the ideas to practical forecasting applications. Chi Kong Ho and his team at the New York Times provided valuable feedback for applying survival analysis to customer value calculations.

Stuart Ward from the New York Times and Zaiying Huang spent countless hours explaining and discussing statistical concepts. Harrison Sohmer, also of the New York Times, taught me many Excel tricks, some of which I’ve been able to include in the book.

Jamie MacLennan and the SQL Server team at Microsoft have been helpful in answering my questions about the product.

Over the past few years, I have been a major contributor to Stack Overflow. Along the way, I have learned an incredible amount about SQL and about how to explain concepts.

A handful of people whom I’ve never met in person have helped in various ways. Richard Stallman invented emacs and the Free Software Foundation; emacs provided the basis for the calendar table. Rob Bovey of Applications Professional, Inc. created the X-Y chart labeler used in several chapters. The Census data set was created by the folks at the Missouri Census Data Center. Juice Analytics inspired the example for Worksheet bar charts in Chapter 5 (and thanks to Alex Wimbush, who pointed me in their direction). Edwin Straver of Frontline Systems answered several questions about Solver.

Over the years, many colleagues, friends, and students have provided inspiration, questions, and answers. There are too many to list them all, but I want to particularly thank Eran Abikhzer, Christian Albright, Michael Benigno, Emily Cohen, Carol D’Andrea, Sonia Dubin, Lounette Dyer, Victor Fu, Josh Goff, Richard Greenburg, Gregory Lampshire, Mikhail Levdanski, Savvas Mavridis, Fiona McNeill, Karen Kennedy McConlogue, Steven Mullaney, Courage Noko, Laura Palmer, Alan Parker, Ashit Patel, Ronnie Rowton, Vishal Santoshi, Adam Schwebber, Kent Taylor, John Trustman, John Wallace, David Wang, and Zhilang Zhao. I would also like to thank the folks in the SAS Institute Training group who have organized, reviewed, and sponsored my data mining classes for many years, giving me the opportunity to meet many interesting and diverse people involved with data mining.

I also thank all those friends and family I’ve visited while writing this book and who (for the most part) allowed me the space and time to work—my mother, my father, my sister Debbie, my brother Joe, my in-laws Raimonda Scalia, Ugo Scalia, and Terry Sparacio, and my friends Jon Mosley, Paul Houlihan, Leonid Poretsky, Anthony DiCarlo, and Maciej Zworski. On the other hand, my cat Luna, who spent many hours curled up next to me, will miss my writing.

Finally, acknowledgments would be incomplete without thanking Giuseppe Scalia, my partner through seven books, who has managed to maintain my sanity through all of them.

Thank you, everyone!

Contents at a Glance

Chapter 2 What’s in a Table? Getting Started with Data Exploration
Chapter 4 Where Is It All Happening? Location, Location, Location
Chapter 6 How Long Will Customers Last? Survival Analysis to
Chapter 7 Factors Affecting Survival: The What and Why of
Chapter 8 Customer Purchases and Other Repeated Events
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Chapter 12 The Best-Fit Line: Linear Regression Models
Chapter 13 Building Customer Signatures for Further Analysis
Chapter 14 Performance Is the Issue: Using SQL Effectively

Contents

Picturing the Structure of the Data
Picturing Data Analysis Using Dataflows
LOOKUP: Looking Up Values in One Table in Another
CROSSJOIN: Generating the Cartesian Product of Two Tables
Subqueries and Common Table Expressions
Chapter 2 What’s in a Table? Getting Started with Data Exploration
Stacked Columns
Ranges Based on the Number of Digits, Using Numeric
Ranges Based on the Number of Digits, Using String
More Values to Explore—Min, Max, and Mode
Exploring Values in Two Columns
From Summarizing One Column to Summarizing All Columns
How Different Are the Averages?
Ratios and Their Statistics
Chapter 4 Where Is It All Happening? Location, Location, Location
Euclidian Method
Dates and Times in Databases
Intervals (Durations)
Starting to Investigate Dates
How Long Between Two Dates?
Counting Active Customers by Day
Simple Chart Animation in Excel
Order Date to Ship Date
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand
Background on Survival Analysis
Comparing Different Groups of Customers
Comparing Survival over Time
Important Measures Derived from Survival
Using Survival for Customer Value Calculations
Calculating the Number of Existing Customers on July 1st
Chapter 7 Factors Affecting Survival: The What and Why of
Which Factors Are Important and When
Calculating One Hazard Probability Using a Time Window
How Many Days in a Row Do Customers Make Purchases?
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Are Duplicates Explained by Multiple Ship Dates or Prices?
Which Products Tend to be Sold Multiple Times Within
Products and Customer Worth
Product Geographic Distribution
Which Products Have Broad Appeal Versus Local Appeal
Which Customers Have Particular Products?
Investigating Products within Households but Not within
The Simplest Association Rules
Heterogeneous Associations
Extending Association Rules
Introduction to Directed Data Mining
Lookup Model for Most Popular Product
Lookup Model for Order Size
Lookup Model for Probability of Response
How Accurate Are the Models?
Naïve Bayesian Models (Evidence Models)
Chapter 12 The Best-Fit Line: Linear Regression Models
LINEST() for Logarithmic, Exponential, and Power Curves
Measuring Goodness of Fit Using R²
Direct Calculation of Best-Fit Line Coefficients
Calculating the Coefficients
More Than One Input Variable
Chapter 13 Building Customer Signatures for Further Analysis
What Is a Customer Signature?
Designing Customer Signatures
Operations to Build Customer Signatures
Summarizing Customer Behaviors
Chapter 14 Performance Is the Issue: Using SQL Effectively
Query Engines and Performance
Parallel Full Table Scan
Reference Only the Columns and Tables That Are Needed
Pros and Cons: Different Ways of Expressing the Same Thing
Pre-aggregation Fixes the Performance Problem

Foreword

Gordon Linoff and I have written three and a half books together. (Four, if we get to count the second edition of Data Mining Techniques as a whole new book; it didn't feel like any less work.) Neither of us has written a book without the other before, so I must admit to a tiny twinge of regret upon first seeing the cover of this one without my name on it next to Gordon's. The feeling passed very quickly as recollections of the authorial life came flooding back—vacations spent at the keyboard instead of in or on the lake, opportunities missed, relationships strained. More importantly, this is a book that only Gordon Linoff could have written. His unique combination of talents and experiences informs every chapter.

I first met Gordon at Thinking Machines Corporation, a now long-defunct manufacturer of parallel supercomputers where we both worked in the late eighties and early nineties. Among other roles, Gordon managed the implementation of a parallel relational database designed to support complex analytical queries on very large databases. The design point for this database was radically different from other relational database systems available at the time in that no trade-offs were made to support transaction processing. The requirements for a system designed to quickly retrieve or update a single record are quite different from the requirements for a system to scan and join huge tables. Jettisoning the requirement to support transaction processing made for a cleaner, more efficient database for analytical processing. This part of Gordon's background means he understands SQL for data analysis literally from the inside out.

Just as a database designed to answer big important questions has a different structure from one designed to process many individual transactions, a book about using databases to answer big important questions requires a different approach to SQL. Many books on SQL are written for database administrators. Others are written for users wishing to prepare simple reports. Still others attempt to introduce some particular dialect of SQL in every detail. This one is written for data analysts, data miners, and anyone who wants to extract maximum information value from large corporate databases. Jettisoning the requirement to address all the disparate types of database users makes this a better, more focused book for the intended audience. In short, this is a book about how to use databases the way we ourselves use them.

Even more important than Gordon's database technology background are his many years of experience as a data mining consultant. This has given him a deep understanding of the kinds of questions businesses need to ask and of the data they are likely to have available to answer them. Years spent exploring corporate databases have given Gordon an intuitive feel for how to approach the kinds of problems that crop up time and again across many different business domains:

How to take advantage of geographic data. A zip code field looks much richer when you realize that from zip code you can get to latitude and longitude, and from latitude and longitude you can get to distance. It looks richer still when you realize that you can use it to join in Census Bureau data to get at important attributes, such as population density, median income, percentage of people on public assistance, and the like.

How to take advantage of dates. Order dates, ship dates, enrollment dates, birth dates. Corporate data is full of dates. These fields look richer when you understand how to turn dates into tenures, analyze purchases by day of week, and track trends in fulfillment time. They look richer still when you know how to use this data to analyze time-to-event problems such as time to next purchase or expected remaining lifetime of a customer relationship.

How to build data mining models directly in SQL. This book shows you how to do things in SQL that you probably never imagined possible, including generating association rules for market basket analysis, building regression models, and implementing naïve Bayesian models and scorecards.

How to prepare data for use with data mining tools. Although more than most people realize can be done using just SQL and Excel, eventually you will want to use more specialized data mining tools. These tools need data in a specific format known as a customer signature. This book shows you how to create these data mining extracts.
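As a rough illustration of the first two items above (latitude and longitude to distance, and dates to tenures), the following is a minimal sketch only. The book itself works in SQL and Excel; Python is used here purely for illustration. The function names `haversine_km` and `tenure_days`, the choice of the haversine formula, and the Earth-radius constant are assumptions of this sketch, not something taken from the book.

```python
import math
from datetime import date

# Mean Earth radius in kilometers; an assumed constant for this sketch.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def tenure_days(start, stop):
    """Tenure in whole days between a start date and a stop (or cutoff) date."""
    return (stop - start).days


if __name__ == "__main__":
    # Two points on the equator, half a world apart (hypothetical inputs).
    print(haversine_km(0.0, 0.0, 0.0, 180.0))
    # A customer who started on 1 January 2016, measured on 1 March 2016.
    print(tenure_days(date(2016, 1, 1), date(2016, 3, 1)))
```

In practice the latitude and longitude would come from a lookup keyed on zip code, and the start and stop dates from the customer or order tables; both are left out here because their layout is not described in this front matter.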

The book is rich in examples and they all use real data. This point is worth saying more about. Unrealistic datasets lead to unrealistic results. This is frustrating to the student. In real life, the more you know about the business context, the better your data mining results will be. Subject matter expertise gives you a head start. You know what variables ought to be predictive and have good ideas about new ones to derive. Fake data does not reward these good ideas because patterns that should be in the data are missing and patterns that shouldn't be there have been introduced inadvertently. Real data is hard to come by, not least because real data may reveal more than its owners are willing to share about their business operations. As a result, many books and courses make do with artificially constructed datasets. Best of all, the datasets used in the book are all available for download at www.wiley.com/go/dataanalysisusingsqlandexcel2e.

I reviewed the chapters of this book as they were written. This process was very beneficial to my own use of SQL and Excel. The exercise of thinking about the fairly complex queries used in the examples greatly increased my understanding of how SQL actually works. As a result, I have lost my fear of nested queries, multi-way joins, giant case statements, and other formerly daunting aspects of the language. In well over a decade of collaboration, I have always turned to Gordon for help using SQL and Excel to best advantage. Now, I can turn to this book. And you can, too.

—Michael J. A. Berry


Introduction

The first edition of this book set out to explain data analysis from an eminently practical perspective, using the familiar tools of SQL and Excel. The guiding principle of the book was to start with questions and guide the reader through the solutions, both from a business perspective and a technical perspective. This approach proved to be quite successful.

Much has changed in the ten years since I started writing the first edition. The tools themselves have changed. In those days, Excel did not have a Ribbon, for instance. And, window functions were rare in databases. The world that analysts inhabit has also changed, with tools such as Python and R and NoSQL databases becoming more common. However, relational databases are still in widespread use, and SQL is, if anything, even more relevant today as technology spreads through businesses big and small. Excel still seems to be the reporting and presentation tool of choice for many business users. Big data is no longer a future frontier; it is a problem, a challenge, and an opportunity that we face on a daily basis.

The second edition has been revised and updated to reflect the changes in the underlying software, with more examples and more techniques, and an additional chapter on database performance. In doing so, I have strived to keep the strengths from the first edition. The book is still organized around the principles of data, analysis, and presentation—three capabilities that are rarely treated together. Examples are organized around questions, with a discussion of both the business relevance and the technical approaches to the problems. The examples carry through to actual code. The data, the code, and the Excel examples are all available on the companion website.


The motivation for this approach originally came from a colleague, Nick Drake, who is a statistician by training. Once upon a time, he was looking for a book that would explain how to use SQL for the complex queries needed for data analysis. Books on SQL tend to cover either basic query constructs or the details of how databases work. None come strictly from a perspective of analyzing data, and none are structured around answering questions about data. Of the many books on statistics, none address the simple fact that most of the data being used resides in relational databases. This book fills that gap.

My other books on data mining, written with Michael Berry, focus on advanced algorithms and case studies. By contrast, this book focuses on the “how-to.” It starts by describing data stored in databases and continues through preparing and producing results. Interspersed are stories based on my experience in the field, explaining how results might be applied and why some things work and other things do not. The examples are so practical that the data used for them is available on the book’s companion website (www.wiley.com/go/dataanalysisusingsqlandexcel2e).

One of the truisms about data warehouses and analysis databases in general is that they don’t actually do anything. Yes, they store data. Yes, they bring together data from different sources, cleansing and clarifying along the way. Yes, they define business dimensions, store transactions about customers, and, perhaps, summarize important data. (And, yes, all these are very important!) However, data in a database resides on many spinning disks and in complex data structures in a computer’s memory. So much data, so little information.

How can we exploit this data, particularly data that describes customers? The many fancy algorithms for statistical modeling and data mining all have a simple rule: “garbage-in, garbage-out.” The results of even the most sophisticated techniques are only as good as the data being used (and the assumptions being fed into the model). Data is central to the task of understanding customers, understanding products, and understanding markets.

The chapters in this book cover different aspects of data and several important analytic techniques that are readily supported by SQL and Excel. The analytic techniques range from exploratory data analysis to survival analysis, from market basket analysis to naïve Bayesian models, and from simple animations to regression. Of course, the potential range of possible techniques is much larger than can be presented in one book. These methods have proven useful over time and are applicable in many different areas.

And finally, data and analysis are not enough. Data must be analyzed, and the results must be presented to the right audience. To fully exploit its value, we must transform data into stories and scenarios, charts and metrics and insights.

Date posted: 11/08/2023, 15:36