Data Analysis Using SQL and Excel®, Second Edition
Gordon S. Linoff
Copyright © 2016 by John Wiley & Sons, Inc., Indianapolis, Indiana
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2015950486
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Excel is a registered trademark of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
About the Author
Gordon S. Linoff has been working with databases, big data, and data mining for almost longer than he can remember. With decades of experience on the practice of using data effectively, he is a recognized expert in the field of data mining.
Gordon started using spreadsheets while a student at MIT, on the original Compaq Portable, the world’s first luggable computer. Not very many years later, he managed a development group at the now-defunct Thinking Machines Corporation, tasked with building a massively parallel relational database for decision support.
After Thinking Machines’ demise, he founded Data Miners in 1998 with his friend and former colleague Michael J. A. Berry (who left in 2012). Since then, he has worked on a wide diversity of projects across many different companies. He has taught hundreds of classes around the world on data mining and survival analysis through SAS Institute, a leader in statistical and business analytics software. He is also an avid contributor to Stack Overflow, particularly on questions related to databases, having the highest score in 2014.
Together with Michael Berry, Gordon has written several influential books on data mining, including Data Mining Techniques for Marketing, Sales, and Customer Support, the first book on data mining to achieve a third edition.
Gordon lives in New York with Giuseppe Scalia, his partner of 25 years.
Acknowledgments
Although this book has only one name on the cover, many people have helped me both specifically on this book and more generally in understanding data, analysis, and presentation.
I first met Michael Berry in 1990. We later founded Data Miners together, and he has been helpful on all fronts. He reviewed the chapters, tested the SQL code in the examples, and helped anonymize the data. His insights have been helpful and his debugging skills have made the examples much more accurate. His wife, Stephanie Jack, also deserves special praise for her patience and willingness to share Michael’s time.
The original idea for the book came from Nick Drake, who then worked at Datran Media. A statistician by training, Nick was looking for a book that would help him use databases for data analysis. Bob Elliott, at the time my editor at Wiley, liked the idea.
Throughout the chapters, the understanding of data processing is based on dataflows, which Craig Stanfill of Ab Initio Corporation first introduced me to long ago when we worked together at Thinking Machines Corporation.
Along the way, I have learned a lot from many people. Anne Milley of SAS Institute first suggested that I learn survival analysis. Will Potts, now working at CapitalOne, then taught me much of what I know about the subject. Brij Masand helped extend the ideas to practical forecasting applications. Chi Kong Ho and his team at the New York Times provided valuable feedback for applying survival analysis to customer value calculations.
Stuart Ward from the New York Times and Zaiying Huang spent countless hours explaining and discussing statistical concepts. Harrison Sohmer, also of the New York Times, taught me many Excel tricks, some of which I’ve been able to include in the book.
Jamie MacLennan and the SQL Server team at Microsoft have been helpful in answering my questions about the product.
Over the past few years, I have been a major contributor to Stack Overflow. Along the way, I have learned an incredible amount about SQL and about how to explain concepts.
A handful of people whom I’ve never met in person have helped in various ways. Richard Stallman invented emacs and the Free Software Foundation; emacs provided the basis for the calendar table. Rob Bovey of Applications Professional, Inc. created the X-Y chart labeler used in several chapters. The Census data set was created by the folks at the Missouri Census Data Center. Juice Analytics inspired the example for Worksheet bar charts in Chapter 5 (and thanks to Alex Wimbush, who pointed me in their direction). Edwin Straver of Frontline Systems answered several questions about Solver.
Over the years, many colleagues, friends, and students have provided inspiration, questions, and answers. There are too many to list them all, but I want to particularly thank Eran Abikhzer, Christian Albright, Michael Benigno, Emily Cohen, Carol D’Andrea, Sonia Dubin, Lounette Dyer, Victor Fu, Josh Goff, Richard Greenburg, Gregory Lampshire, Mikhail Levdanski, Savvas Mavridis, Fiona McNeill, Karen Kennedy McConlogue, Steven Mullaney, Courage Noko, Laura Palmer, Alan Parker, Ashit Patel, Ronnie Rowton, Vishal Santoshi, Adam Schwebber, Kent Taylor, John Trustman, John Wallace, David Wang, and Zhilang Zhao. I would also like to thank the folks in the SAS Institute Training group who have organized, reviewed, and sponsored my data mining classes for many years, giving me the opportunity to meet many interesting and diverse people involved with data mining.
I also thank all those friends and family I’ve visited while writing this book and who (for the most part) allowed me the space and time to work: my mother, my father, my sister Debbie, my brother Joe, my in-laws Raimonda Scalia, Ugo Scalia, and Terry Sparacio, and my friends Jon Mosley, Paul Houlihan, Leonid Poretsky, Anthony DiCarlo, and Maciej Zworski. On the other hand, my cat Luna, who spent many hours curled up next to me, will miss my writing.
Finally, acknowledgments would be incomplete without thanking Giuseppe Scalia, my partner through seven books, who has managed to maintain my sanity through all of them.
Thank you, everyone!
Contents at a Glance
Chapter 2 What’s in a Table? Getting Started with Data Exploration
Chapter 4 Where Is It All Happening? Location, Location, Location
Chapter 6 How Long Will Customers Last? Survival Analysis to
Chapter 7 Factors Affecting Survival: The What and Why of
Chapter 8 Customer Purchases and Other Repeated Events
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Chapter 12 The Best-Fit Line: Linear Regression Models
Chapter 13 Building Customer Signatures for Further Analysis
Chapter 14 Performance Is the Issue: Using SQL Effectively
Contents
Picturing the Structure of the Data
Picturing Data Analysis Using Dataflows
LOOKUP: Looking Up Values in One Table in Another
CROSSJOIN: Generating the Cartesian Product of Two Tables
Subqueries and Common Table Expressions
Chapter 2 What’s in a Table? Getting Started with Data Exploration
Stacked Columns
Ranges Based on the Number of Digits, Using Numeric
Ranges Based on the Number of Digits, Using String
More Values to Explore: Min, Max, and Mode
Exploring Values in Two Columns
From Summarizing One Column to Summarizing All Columns
How Different Are the Averages?
Ratios and Their Statistics
Chapter 4 Where Is It All Happening? Location, Location, Location
Euclidian Method
Dates and Times in Databases
Intervals (Durations)
Starting to Investigate Dates
How Long Between Two Dates?
Counting Active Customers by Day
Simple Chart Animation in Excel
Order Date to Ship Date
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand
Background on Survival Analysis
Comparing Different Groups of Customers
Comparing Survival over Time
Important Measures Derived from Survival
Using Survival for Customer Value Calculations
Calculating the Number of Existing Customers on July 1st
Chapter 7 Factors Affecting Survival: The What and Why of
Which Factors Are Important and When
Calculating One Hazard Probability Using a Time Window
How Many Days in a Row Do Customers Make Purchases?
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Are Duplicates Explained by Multiple Ship Dates or Prices?
Which Products Tend to be Sold Multiple Times Within
Products and Customer Worth
Product Geographic Distribution
Which Products Have Broad Appeal Versus Local Appeal
Which Customers Have Particular Products?
Investigating Products within Households but Not within
The Simplest Association Rules
Heterogeneous Associations
Extending Association Rules
Introduction to Directed Data Mining
Lookup Model for Most Popular Product
Lookup Model for Order Size
Lookup Model for Probability of Response
How Accurate Are the Models?
Naïve Bayesian Models (Evidence Models)
Chapter 12 The Best-Fit Line: Linear Regression Models
LINEST() for Logarithmic, Exponential, and Power Curves
Measuring Goodness of Fit Using R²
Direct Calculation of Best-Fit Line Coefficients
Calculating the Coefficients
More Than One Input Variable
Chapter 13 Building Customer Signatures for Further Analysis
What Is a Customer Signature?
Designing Customer Signatures
Operations to Build Customer Signatures
Summarizing Customer Behaviors
Chapter 14 Performance Is the Issue: Using SQL Effectively
Query Engines and Performance
Parallel Full Table Scan
Reference Only the Columns and Tables That Are Needed
Pros and Cons: Different Ways of Expressing the Same Thing
Pre-aggregation Fixes the Performance Problem
Foreword
Gordon Linoff and I have written three and a half books together. (Four, if we get to count the second edition of Data Mining Techniques as a whole new book; it didn't feel like any less work.) Neither of us has written a book without the other before, so I must admit to a tiny twinge of regret upon first seeing the cover of this one without my name on it next to Gordon's. The feeling passed very quickly as recollections of the authorial life came flooding back: vacations spent at the keyboard instead of in or on the lake, opportunities missed, relationships strained. More importantly, this is a book that only Gordon Linoff could have written. His unique combination of talents and experiences informs every chapter.
I first met Gordon at Thinking Machines Corporation, a now long-defunct manufacturer of parallel supercomputers where we both worked in the late eighties and early nineties. Among other roles, Gordon managed the implementation of a parallel relational database designed to support complex analytical queries on very large databases. The design point for this database was radically different from other relational database systems available at the time in that no trade-offs were made to support transaction processing. The requirements for a system designed to quickly retrieve or update a single record are quite different from the requirements for a system to scan and join huge tables. Jettisoning the requirement to support transaction processing made for a cleaner, more efficient database for analytical processing. This part of Gordon's background means he understands SQL for data analysis literally from the inside out.
Just as a database designed to answer big important questions has a different structure from one designed to process many individual transactions, a book about using databases to answer big important questions requires a different approach to SQL. Many books on SQL are written for database administrators. Others are written for users wishing to prepare simple reports. Still others attempt to introduce some particular dialect of SQL in every detail. This one is written for data analysts, data miners, and anyone who wants to extract maximum information value from large corporate databases. Jettisoning the requirement to address all the disparate types of database users makes this a better, more focused book for the intended audience. In short, this is a book about how to use databases the way we ourselves use them.
Even more important than Gordon's database technology background are his many years of experience as a data mining consultant. This has given him a deep understanding of the kinds of questions businesses need to ask and of the data they are likely to have available to answer them. Years spent exploring corporate databases have given Gordon an intuitive feel for how to approach the kinds of problems that crop up time and again across many different business domains:
■ How to take advantage of geographic data. A zip code field looks much richer when you realize that from zip code you can get to latitude and longitude, and from latitude and longitude you can get to distance. It looks richer still when you realize that you can use it to join in Census Bureau data to get at important attributes, such as population density, median income, percentage of people on public assistance, and the like.
■ How to take advantage of dates. Order dates, ship dates, enrollment dates, birth dates. Corporate data is full of dates. These fields look richer when you understand how to turn dates into tenures, analyze purchases by day of week, and track trends in fulfillment time. They look richer still when you know how to use this data to analyze time-to-event problems such as time to next purchase or expected remaining lifetime of a customer relationship.
■ How to build data mining models directly in SQL. This book shows you how to do things in SQL that you probably never imagined possible, including generating association rules for market basket analysis, building regression models, and implementing naïve Bayesian models and scorecards.
■ How to prepare data for use with data mining tools. Although more than most people realize can be done using just SQL and Excel, eventually you will want to use more specialized data mining tools. These tools need data in a specific format known as a customer signature. This book shows you how to create these data mining extracts.
The book is rich in examples and they all use real data. This point is worth saying more about. Unrealistic datasets lead to unrealistic results. This is frustrating to the student. In real life, the more you know about the business context, the better your data mining results will be. Subject matter expertise gives you a head start. You know what variables ought to be predictive and have good ideas about new ones to derive. Fake data does not reward these good ideas because patterns that should be in the data are missing and patterns that shouldn't be there have been introduced inadvertently. Real data is hard to come by, not least because real data may reveal more than its owners are willing to share about their business operations. As a result, many books and courses make do with artificially constructed datasets. Best of all, the datasets used in the book are all available for download at www.wiley.com/go/dataanalysisusingsqlandexcel2e.
I reviewed the chapters of this book as they were written. This process was very beneficial to my own use of SQL and Excel. The exercise of thinking about the fairly complex queries used in the examples greatly increased my understanding of how SQL actually works. As a result, I have lost my fear of nested queries, multi-way joins, giant case statements, and other formerly daunting aspects of the language. In well over a decade of collaboration, I have always turned to Gordon for help using SQL and Excel to best advantage. Now, I can turn to this book. And you can, too.
—Michael J A Berry
Introduction
The first edition of this book set out to explain data analysis from an eminently practical perspective, using the familiar tools of SQL and Excel. The guiding principle of the book was to start with questions and guide the reader through the solutions, both from a business perspective and a technical perspective. This approach proved to be quite successful.
Much has changed in the ten years since I started writing the first edition. The tools themselves have changed. In those days, Excel did not have a Ribbon, for instance. And, window functions were rare in databases. The world that analysts inhabit has also changed, with tools such as Python and R and NoSQL databases becoming more common. However, relational databases are still in widespread use, and SQL is, if anything, even more relevant today as technology spreads through businesses big and small. Excel still seems to be the reporting and presentation tool of choice for many business users. Big data is no longer a future frontier; it is a problem, a challenge, and an opportunity that we face on a daily basis.
The second edition has been revised and updated to reflect the changes in the underlying software, with more examples and more techniques, and an additional chapter on database performance. In doing so, I have strived to keep the strengths from the first edition. The book is still organized around the principles of data, analysis, and presentation, three capabilities that are rarely treated together. Examples are organized around questions, with a discussion of both the business relevance and the technical approaches to the problems. The examples carry through to actual code. The data, the code, and the Excel examples are all available on the companion website.
The motivation for this approach originally came from a colleague, Nick Drake, who is a statistician by training. Once upon a time, he was looking for a book that would explain how to use SQL for the complex queries needed for data analysis. Books on SQL tend to cover either basic query constructs or the details of how databases work. None come strictly from a perspective of analyzing data, and none are structured around answering questions about data. Of the many books on statistics, none address the simple fact that most of the data being used resides in relational databases. This book fills that gap.
My other books on data mining, written with Michael Berry, focus on advanced algorithms and case studies. By contrast, this book focuses on the “how-to.” It starts by describing data stored in databases and continues through preparing and producing results. Interspersed are stories based on my experience in the field, explaining how results might be applied and why some things work and other things do not. The examples are so practical that the data used for them is available on the book’s companion website (www.wiley.com/go/dataanalysisusingsqlandexcel2e).
One of the truisms about data warehouses and analysis databases in general is that they don’t actually do anything. Yes, they store data. Yes, they bring together data from different sources, cleansing and clarifying along the way. Yes, they define business dimensions, store transactions about customers, and, perhaps, summarize important data. (And, yes, all these are very important!) However, data in a database resides on many spinning disks and in complex data structures in a computer’s memory. So much data, so little information.
How can we exploit this data, particularly data that describes customers? The many fancy algorithms for statistical modeling and data mining all have a simple rule: “garbage-in, garbage-out.” The results of even the most sophisticated techniques are only as good as the data being used (and the assumptions being fed into the model). Data is central to the task of understanding customers, understanding products, and understanding markets.
The chapters in this book cover different aspects of data and several important analytic techniques that are readily supported by SQL and Excel. The analytic techniques range from exploratory data analysis to survival analysis, from market basket analysis to naïve Bayesian models, and from simple animations to regression. Of course, the potential range of possible techniques is much larger than can be presented in one book. These methods have proven useful over time and are applicable in many different areas.
And finally, data and analysis are not enough. Data must be analyzed, and the results must be presented to the right audience. To fully exploit its value, we must transform data into stories and scenarios, charts and metrics and insights.