Data Analysis Using SQL and Excel


Published simultaneously in Canada

01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or to obtain technical support, please contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317) 572-3993, or fax (317) 572-4002.

Library of Congress Cataloging-in-Publication Data:

of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Excel is a registered trademark of Microsoft Corporation in the United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.


To Giuseppe for sixteen years, five books, and counting


About the Author

Gordon Linoff (gordon@data-miners.com) is a recognized expert in the field of data mining. He has more than twenty-five years of experience working with companies large and small to analyze customer data and to help design data warehouses. His passion for SQL and relational databases dates to the early 1990s, when he was building a relational database engine designed for large corporate data warehouses at the now-defunct Thinking Machines Corporation. Since then, he has had the opportunity to work with all the leading database vendors, including Microsoft, Oracle, and IBM.

With his colleague Michael Berry, Gordon has written three of the most popular books on data mining, starting with Data Mining Techniques for Marketing, Sales, and Customer Support. In addition to writing books on data mining, he also teaches courses on data mining, and has taught thousands of students on four continents.

Gordon is currently a principal at Data Miners, a consulting company he and Michael Berry founded in 1998. Data Miners is devoted to doing and teaching data mining and customer-centric data analysis.


Johnna VanHoose Dinse

Anniversary Logo Design: Richard Pacifico

Contents

Foreword
What Is an Entity-Relationship Diagram?
Subqueries for Naming Variables
Subqueries for Handling Summaries
Rewriting the “IN” as a JOIN
Correlated Subqueries
Subqueries for UNION ALL

Chapter 2 What’s In a Table? Getting Started with Data Exploration
A Basic Chart: Column Charts
Creating the Column Chart
Formatting the Column Chart
Useful Variations on the Column Chart
X-Y Charts (Scatter Plots)
Histograms
Cumulative Histograms of Counts
Histograms (Frequencies) for Numeric Values
Ranges Based on the Number of Digits, Using
Ranges Based on the Number of Digits, Using
More Refined Ranges: First Digit Plus Number of Digits
Breaking Numerics into Equal-Sized Groups
Minimum and Maximum Values
The Most Common Value (Mode)
Calculating Mode Using Standard SQL
Calculating Mode Using SQL Extensions
Calculating Mode Using String Operations
Strings Starting or Ending with Spaces
Handling Upper- and Lowercase
What Characters Are in a String?
What Are Average Sales By State?
How Often Are Products Repeated within a Single Order?
Direct Counting Approach
Comparison of Distinct Counts to Overall Counts
Which State Has the Most American Express Users?
From Summarizing One Column to Summarizing
Good Summary for One Column
Query to Get All Columns in a Table
Using SQL to Generate Summary Code

Chapter 3 How Different Is Different?
Confidence and Probability
How Many Californians?
Null Hypothesis and Confidence
How Many Customers Are Still Active?
Given the Count, What Is the Probability?
Given the Probability, What Is the Number of Stops?
The Rate or the Number?
Ratios, and Their Statistics
Standard Error of a Proportion
Confidence Interval on Proportions
Difference of Proportions
Conservative Lower Bounds
Chi-Square Calculation
Chi-Square Distribution

Chapter 4 Where Is It All Happening? Location, Location, Location
Definition of Latitude and Longitude
Degrees, Minutes, Seconds, and All That
Distance between Two Locations
Finding All Zip Codes within a Given Distance
Finding Nearest Zip Code in Excel
Pictures with Zip Codes
The Scatter Plot Map
Who Uses Solar Power for Heating?
Where Are the Customers?
Wealthiest Zip Code in a State?
Zip Code with the Most Orders in Each State
Interesting Hierarchies in Geographic Data
Calculating County Wealth
Identifying Counties
Distribution of Values of Wealth
Which Zip Code Is Wealthiest Relative to Its County?
County with Highest Relative Order Penetration
State Boundaries on Scatter Plots of Zip Codes
Plotting State Boundaries
Pictures of State Boundaries
Some Fundamentals of Dates and Times in Databases
Extracting Components of Dates and Times
Converting to Standard Formats
Intervals (Durations)
Verifying that Dates Have No Times
Comparing Counts by Date
Orderlines Shipped and Billed
Customers Shipped and Billed
Number of Different Bill and Ship Dates per Order
Counts of Orders and Order Sizes
Items as Measured by Number of Units
Items as Measured by Distinct Products
Size as Measured by Dollars
Billing Date by Day of the Week
Changes in Day of the Week by Year
Comparison of Days of the Week for Two Dates
A Business Problem about Days of the Week
Outline of a Solution
Using a Calendar Table Instead
How Many Customers on a Given Day?
How Many Customers Every Day?
How Many Customers of Different Types?
How Many Customers by Tenure Segment?
Order Date to Ship Date
Order Date to Ship Date by Year
Creating the One-Year Excel Table
Creating and Customizing the Chart

Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Point Estimate for Survival
Calculating Survival for All Tenures
Calculating Survival in SQL
Step 1: Create the Survival Table
Step 2: Load POPT and STOPT
Step 3: Calculate Cumulative Population
Step 4: Calculate the Hazard
Step 5: Calculate the Survival
Step 6: Fix ENDTENURE and NUMDAYS in Last Row
Generalizing the SQL
A Simple Customer Retention Calculation
Comparison between Retention and Survival
Simple Example of Hazard and Survival
What Happens to a Mixture
Constant Hazard Corresponding to Survival
Summarizing the Markets
Stratifying by Market
How Has a Particular Hazard Changed over Time?
What Is Customer Survival by Year of Start?
What Did Survival Look Like in the Past?
Point Estimate of Survival
Median Customer Tenure
Average Customer Lifetime
Confidence in the Hazards
Estimating Future Revenue for One Future Start
SQL Day-by-Day Approach
SQL Summary Approach
Estimated Revenue for a Simple Group of Existing Customers
Estimated Second Year Revenue for a Homogenous Group
Pre-calculating Yearly Revenue by Tenure
Estimated Future Revenue for All Customers

Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
Explanation of the Approach
Using Averages to Compare Numeric Variables
Recognizing Left Truncation
Effect of Left Truncation
How to Fix Left Truncation, Conceptually
Estimating Hazard Probability for One Tenure
Estimating Hazard Probabilities for All Tenures
Time Windows = Left Truncation + Right Censoring
Calculating One Hazard Probability Using a Time Window
All Hazard Probabilities for a Time Window
Comparison of Hazards by Stops in Year
A Cohort-Based Approach
The Survival Analysis Approach
Other Identifying Information
How Many New Customers Appear Each Year?
Which Households Are Increasing Purchase
Comparison of Earliest and Latest Values
Calculating the Earliest and Latest Values
Comparing the First and Last Values
Comparison of First Year Values and Last Year Values
Trend from the Best Fit Line
Calculating the Slope
Idea behind the Calculation
Calculating Next Purchase Date Using SQL
From Next Purchase Date to Time-to-Event
Stratifying Time-to-Event

Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Scatter Plot of Products
Duplicate Products in Orders
Histogram of Number of Units
Products Associated with One-Time Customers
Products Associated with the Best Customers
Combinations of Two Products
Number of Two-Way Combinations
Generating All Two-Way Combinations
Examples of Combinations
Variations on Combinations
Combinations of Product Groups
Multi-Way Combinations
Households Not Orders
Combinations within a Household
Investigating Products within Households but
Multiple Purchases of the Same Product
Associations and Rules
Zero-Way Association Rules
What Is the Distribution of Probabilities?
What Do Zero-Way Associations Tell Us?
Example of One-Way Association Rules
Generating All One-Way Rules
One-Way Rules with Evaluation Information
One-Way Rules on Product Groups
Calculating Product Group Rules Using an
Multi-Way Associations
Rules Using Attributes of Products
Rules with Different Left- and Right-Hand Sides
Before and After: Sequential Associations

Chapter 10 Data Mining Models in SQL
Yes-or-No Models with Propensity Scores
Estimating Numeric Values
What Is the Best Zip Code?
A Basic Look-Alike Model
Look-Alike Using Z-Scores
Example of Nearest Neighbor Model
Calculating Most Popular Product Group
Evaluating the Lookup Model
Using a Profiling Lookup Model for Prediction
Using Binary Classification Instead
Most Basic Example: No Dimensions
Adding More Dimensions
Examining Nonstationarity
Evaluating the Model Using an Average Value Chart
The Overall Probability as a Model
Exploring Different Dimensions
How Accurate Are the Models?
Adding More Dimensions
Some Ideas in Probability

Chapter 11 The Best-Fit Line: Linear Regression Models
Tenure and Amount Paid
Properties of the Best-fit Line
What Does Best-Fit Mean?
Preserving the Averages
Trend Lines in Charts
Best-fit Line in Scatter Plots
Logarithmic, Power, and Exponential Trend Curves
Polynomial Trend Curves
Best-fit Using LINEST() Function
Returning Values in Multiple Cells
Calculating Expected Values
LINEST() for Logarithmic, Exponential, and Power Curves
Doing the Calculation
Calculating the Best-Fit Line in SQL
Price Frequency for $20 Books
Price Elasticity Model in SQL
Price Elasticity Average Value Chart
Customer Stops during the First Year
Weighted Best-Fit Line in a Chart
Weighted Best-Fit in SQL
Weighted Best-Fit Using Solver
The Weighted Best-Fit Line
Solver Is Better Than Guessing
Multiple Regression in Excel
Investigating Each Variable Separately
Building a Model with Three Input Variables
Using Solver for Multiple Regression
Choosing Input Variables One-By-One
Multiple Regression in SQL

Chapter 12 Building Customer Signatures for Further Analysis
Sources of Data for the Customer Signature
Current Customer Snapshot
Initial Customer Information
Self-Reported Information
External Data (Demographic and So On)
About Their Neighbors
Transaction Summaries
Using Customer Signatures
Predictive and Profile Modeling
Repository of Customer-Centric Business Metrics
Using an Existing Table as the Driving Table
Derived Table as the Driving Table
Customer Dimension Lookup Tables
Extracting Features
Geographic Location Information
Product Descriptions
Calculating Slope for Time Series
Calculating Slope from Pivoted Time Series
Calculating Slope for a Regular Time Series
Calculating Slope for an Irregular Time Series
Year, Month, and Day of Month
Return a Handful of Rows

Foreword

Gordon Linoff and I have written three and a half books together. (Four, if we get to count the second edition of Data Mining Techniques as a whole new book; it didn’t feel like any less work.) Neither of us has written a book without the other before, so I must admit to a tiny twinge of regret upon first seeing the cover of this one without my name on it next to Gordon’s. The feeling passed very quickly as recollections of the authorial life came flooding back: vacations spent at the keyboard instead of in or on the lake, opportunities missed, relationships strained. More importantly, this is a book that only Gordon Linoff could have written. His unique combination of talents and experiences informs every chapter.

I first met Gordon at Thinking Machines Corporation, a now long-defunct manufacturer of parallel supercomputers where we both worked in the late eighties and early nineties. Among other roles, Gordon managed the implementation of a parallel relational database designed to support complex analytical queries on very large databases. The design point for this database was radically different from other relational database systems available at the time in that no trade-offs were made to support transaction processing. The requirements for a system designed to quickly retrieve or update a single record are quite different from the requirements for a system to scan and join huge tables. Jettisoning the requirement to support transaction processing made for a cleaner, more efficient database for analytical processing. This part of Gordon’s background means he understands SQL for data analysis literally from the inside out.

Just as a database designed to answer big important questions has a different structure from one designed to process many individual transactions, a book about using databases to answer big important questions requires a different approach to SQL. Many books on SQL are written for database administrators. Others are written for users wishing to prepare simple reports. Still others attempt to introduce some particular dialect of SQL in every detail. This one is written for data analysts, data miners, and anyone who wants to extract maximum information value from large corporate databases. Jettisoning the requirement to address all the disparate types of database user makes this a better, more focused book for the intended audience. In short, this is a book about how to use databases the way we ourselves use them.

Even more important than Gordon’s database technology background is his many years as a data mining consultant. This has given him a deep understanding of the kinds of questions businesses need to ask and of the data they are likely to have available to answer them. Years spent exploring corporate databases have given Gordon an intuitive feel for how to approach the kinds of problems that crop up time and again across many different business domains:

■ How to take advantage of geographic data. A zip code field looks much richer when you realize that from zip code you can get to latitude and longitude and from latitude and longitude you can get to distance. It looks richer still when you realize that you can use it to join in census bureau data to get at important attributes such as population density, median income, percentage of people on public assistance, and the like.

■ How to take advantage of dates. Order dates, ship dates, enrollment dates, birth dates. Corporate data is full of dates. These fields look richer when you understand how to turn dates into tenures, analyze purchases by day of week, and track trends in fulfillment time. They look richer still when you know how to use this data to analyze time-to-event problems such as time to next purchase or expected remaining lifetime of a customer relationship.

■ How to build data mining models directly in SQL. This book shows you how to do things in SQL that you probably never imagined possible, including generating association rules for market basket analysis, building regression models, and implementing naïve Bayesian models and scorecards.

■ How to prepare data for use with data mining tools. Although more than most people realize can be done using just SQL and Excel, eventually you will want to use more specialized data mining tools. These tools need data in a specific format known as a customer signature. This book shows you how to create these data mining extracts.
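As a rough illustration of the first point in the list above (zip code to latitude and longitude, and from there to distance), here is a minimal Python sketch. It uses the haversine formula, which is one common way to estimate great-circle distance and is not necessarily the formula the book itself uses. The function name `haversine_miles`, the Earth radius constant, and the zip code coordinates are all assumptions made for the example, not values from the book's datasets.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # assumed mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical zip code centroids (illustrative values only).
ZIP_CENTROIDS = {
    "10001": (40.75, -73.99),
    "02138": (42.38, -71.13),
}

distance = haversine_miles(*ZIP_CENTROIDS["10001"], *ZIP_CENTROIDS["02138"])
print(round(distance), "miles")
```

With a real zip code reference table, the same lookup would typically be a join from a customer table to the table holding each zip code's latitude and longitude.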

The book is rich in examples and they all use real data. This point is worth saying more about. Unrealistic datasets lead to unrealistic results. This is frustrating to the student. In real life, the more you know about the business context, the better your data mining results will be. Subject matter expertise gives you a head start. You know what variables ought to be predictive and have good ideas about new ones to derive. Fake data does not reward these good ideas because patterns that should be in the data are missing and patterns that shouldn’t be there have been introduced inadvertently. Real data is hard to come by, not least because real data may reveal more than its owners are willing to share about their business operations. As a result, many books and courses make do with artificially constructed datasets. Best of all, the datasets used in the book are all available for download at the companion web site and

I reviewed the chapters of this book as they were written. This process was very beneficial to my own use of SQL and Excel. The exercise of thinking about the fairly complex queries used in the examples greatly increased my understanding of how SQL actually works. As a result, I have lost my fear of nested queries, multi-way joins, giant case statements, and other formerly daunting aspects of the language. In well over a decade of collaboration, I have always turned to Gordon for help using SQL and Excel to best advantage. Now, I can turn to this book. And you can too.

— Michael J A Berry

Acknowledgments

Although this book has only one name on the cover, there are many people over a long period of time who have helped me both specifically on this book and more generally in understanding data, analysis, and presentation.

Michael Berry, my business partner and colleague since 1998 at Data Miners, has been tremendously helpful on all fronts. He reviewed the chapters, tested the SQL code in the examples, and helped anonymize the data. His insights have been helpful and his debugging skills have made the examples much more accurate. His wife, Stephanie Jack, also deserves special praise for her patience and willingness to share Michael’s time.

Bob Elliott, my editor at Wiley, and the Wiley team not only accepted my original idea for this book, but have exhibited patience and understanding as I refined the ideas and layout.

Matt Keiser, President of Datran Marketing, and Howard Lehrman (formerly of Datran) were kind enough to provide computing power for testing many of the examples. Nick Drake, also of Datran, inspired the book, by asking for a SQL reference focused on data analysis.

Throughout the chapters, the understanding of data processing is based on dataflows, which Craig Stanfill of Ab Initio Corporation first introduced me to, once upon a time when we worked together at Thinking Machines Corporation.

Stuart Ward and Zaiying Huang (from the New York Times) have spent countless hours over the past several years explaining statistical concepts to me. Harrison Sohmer, also of the New York Times, taught me many Excel tricks, some of which I’ve been able to include in the book.

Anne Milley of SAS Institute originally suggested that I learn survival analysis. Will Potts, now at CapitalOne, taught me much of what I know about the subject, including helping to develop two of the earliest survival analysis–based forecasts (and finally convincing me that hazard probabilities really can’t be negative). Brij Masand, a colleague at Data Miners, helped extend this knowledge to practical forecasting applications.

Jamie MacLennan and the SQL Server team at Microsoft have been helpful in answering my questions about the product.

There are a handful of people whom I’ve never met in person who have helped in various ways. Richard Stallman invented emacs and the Free Software Foundation; emacs provided the basis for the calendar table. Rob Bovey of Applications Professional, Inc. created the X-Y chart labeler used in several chapters. Robert Clair at the Census Bureau answered some questions via email. Juice Analytics inspired the example for Worksheet bar charts in Chapter 5 (and thanks to Alex Wimbush who pointed me in their direction). Edwin Straver of Frontline Systems answered several questions about Solver, introduced in Chapter 11.

Over the years, many colleagues, friends, and students have provided inspiration, questions, and answers. There are too many to list all of them, but I want to particularly thank Eran Abikhzer, Michael Benigno, Emily Cohen, Carol D’Andrea, Sonia Dubin, Lounette Dyer, Josh Goff, Richard Greenburg, Gregory Lampshire, Fiona McNeill, Karen Kennedy McConlogue, Alan Parker, Ashit Patel, Ronnie Rowton, Adam Schwebber, John Trustman, John Wallace, Kathleen Wright, and Zhilang Zhao. I would also like to thank the folks in the SAS Institute Training group who have organized, reviewed, and sponsored our data mining classes for many years, giving me the opportunity to meet many interesting and diverse people involved with data mining.

I also thank all those friends and family I’ve visited while writing this book and who (for the most part) allowed me the space and time to work: my mother, my father, my sister Debbie, my brother Joe, my in-laws Raimonda Scalia, Ugo Scalia, and Terry Sparacio, and my friends Jon Mosley, Paul Houlihan, Joe Hughes, and Maciej Zworski.

Finally, acknowledgments would be incomplete without thanking my life partner, Giuseppe Scalia, who has managed to maintain our sanity for the past year while I wrote this book.

Thank you everyone.

Introduction

Data. Analysis. Presentation. These three key capabilities are needed for effectively transforming data into information. And yet, these three topics are rarely treated together. Other books focus on one or the other: on the details of relational databases, or on applying statistics to business problems, or on using Excel. This book approaches the challenges of data analysis from a more holistic perspective, and with the aim of explaining the relevant ideas both to people responsible for analyzing data and to people who want to use such information, responsibly.

The motivation for this approach came from a colleague, Nick Drake, who is a statistician by training. Once upon a time, he was looking for a book that would explain how to use SQL for the complex queries needed for data analysis. There are many books on SQL, few focused on using the language for queries, and none that come strictly from a perspective of analyzing data. Similarly, there are many books on statistics, none of which address the simple fact that most of the data being used resides in relational databases. This book is intended to fill that gap.

There are many approaches to data analysis. My earlier books, written with Michael Berry, focus on the more advanced algorithms and case studies usually falling under the heading “data mining.” By contrast, this book focuses on the “how-to.” It starts by describing data stored in databases and continues through preparing and producing results. Interspersed are stories based on my experience in the field, explaining how results might be applied and why some things work and other things do not. The examples are so practical that the data used for them is available on the companion web site and at www.data-miners.com.

One of the truisms about data warehouses and analysis databases in general is that they don’t actually do anything. Yes, they store data. Yes, they bring together data from different sources, cleansing and clarifying along the way. Yes, they define business dimensions, store transactions about customers, and, perhaps, summarize important data. (And, yes, all these are very important!) However, data in a database resides on so many spinning disks and in complex data structures in a computer’s memory. So much data. So little information.

Oil deposits and diamonds hidden in rich seams beneath the surface of the earth are worth much less than gasoline at the pump or Tiffany diamond rings. Prospectors can make a quick buck on such deposits. On the other hand, the companies willing to invest the dollars to transform and process the raw materials into marketable goods are the ones that uncover the long-term riches.

This book is about the basic tools needed for exploiting data, particularly data that describes customers. There are many fancy algorithms for statistical modeling and data mining. However, “garbage-in, garbage-out.” The results of even the most sophisticated techniques are only as good as the data being used. Data is central to the task of understanding customers, understanding products, and understanding markets.

The chapters in this book discuss different aspects of data and several different analytic techniques. The analytic techniques range from exploratory data analysis to survival analysis, from market basket analysis to naïve Bayesian models, from simple animations to regression. Of course, the potential range of possible techniques is much larger than can be presented in one book. The methods have proven useful over time and are applicable in many different areas.

And finally, data and analysis are not enough. Data must be analyzed, and the results must be presented to the right audience. To fully exploit its value, we must transform data into stories and scenarios, charts and metrics, and insights.

Overview of the Book and Technology

This book focuses on three key technological areas used for transforming data into actionable information:

data is SQL

the most powerful feature of Excel is the charting capability, which turns columns of numbers into pictures

These three technologies are presented together, because they are all interrelated. SQL answers the question “how do we pull data from a database?” Statistics answers the question “how is it relevant?” And Excel makes it possible to convince other people of the veracity of what we find.

The description of data processing is organized around the SQL language. Although there are extensions of SQL and other very powerful data manipulation languages, SQL is common to most databases. And, databases such as Oracle, IBM’s DB2, and Microsoft SQL Server are common in the business world, storing the vast majority of business data transactions. Other databases such as mysql are available at no cost and easily downloaded. The good news is that all relational databases support SQL as a query language. However, just as England and the United States have been described as “two countries separated by a common language,” each database supports a slightly different dialect of SQL. The Appendix contains a list of commonly used functions and how they are represented in various different dialects.

Similarly, there are beautiful presentation tools and professional graphics packages. However, very rare and exceptional is the workplace computer that does not have Excel (or an equivalent spreadsheet).

Statistics and data mining techniques do not always require advanced tools. Some very important techniques are readily available using the combination of SQL and Excel, including survival analysis, naïve Bayesian models, and association rules. In fact, the methods in this book are often more powerful than the methods available in many statistics and data mining tools, precisely because they are close to the data and customizable for specific applications. The explanation of the techniques covers both the basic ideas and the extensions that may not be available in other tools.

The chapters describing the various techniques provide a solid introduction to modeling and data exploration, in the context of familiar tools and data. They also highlight when the more advanced tools are useful, because there is not a simpler solution using more readily available tools.

In the interests of full disclosure, I should admit that in the early 1990s I worked on a package called Darwin at a company called Thinking Machines. In the intervening years, this package has become much more powerful and user-friendly, and has now grown into Oracle Data Mining. In addition to Oracle, SQL Server offers data mining extensions within the tool, an exciting development that brings advanced data analysis even closer to the data.

This book does not discuss such functionality at all. The methods in the chapters have been chosen for their general applicability to data stored in relational databases. The explicit purpose is not to focus on a particular relational database engine. In many ways, the methods discussed here complement such extensions.


How This Book Is Organized

The twelve chapters in this book fall roughly into three parts. The first three introduce key concepts of SQL, Excel, and statistics. The six middle chapters discuss various methods of exploring data, and techniques specifically suited to SQL and Excel. The last three focus on the idea of modeling, in the sense of statistics and data mining.

Each chapter explains some aspect of data analysis using SQL and Excel from several different perspectives, including:

www.data-miners.com.

SQL is a concise language that is sometimes difficult to follow. Dataflows, graphical representations of data processing that explain data manipulations, are used to illustrate how the SQL works.

Results are presented in charts and tables, sprinkled throughout the book. In addition, important features of Excel are highlighted, and interesting uses of Excel graphics are explained. Each chapter has a couple of technical asides, typically explaining some aspect of a technique or an interesting bit of history associated with the methods described in the chapter.

Introductory Chapters

The first chapter, “A Data Miner Looks at SQL,” introduces SQL from the perspective of data analysis. This is the querying part of the SQL language, where data stored in databases is extracted using SQL queries.

Ultimately, data about customers and about the business is stored in SQL databases. This chapter introduces entity-relationship diagrams to describe the structure of the data: the tables and columns and how they relate to each other. It also introduces dataflows to describe the processing of queries; dataflows provide a graphical explanation of how data is processed.

The first chapter also describes the datasets used for examples throughout the book (and which are also available on the companion web site). This data includes tables describing retail purchases, tables describing mobile telephone customers, and reference tables that describe zip codes and the calendar.


The second chapter, “What’s In a Table? Getting Started with Data Exploration,” introduces Excel for exploratory data analysis and presentation. Of many useful capabilities in Excel, perhaps the most useful are charts. As the ancient Chinese saying goes, “a picture paints a thousand words,” and Excel makes it possible to paint pictures using data. Such charts are not only useful aesthetically, but more practically for Word documents, PowerPoint, email, the Web, and so on.

Charts are not a means unto themselves. This chapter starts down the road of exploratory data analysis, using charts to convey interesting summaries of data. In addition, this chapter discusses summarizing columns in a table, as well as the interesting idea of using SQL to generate SQL queries.
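To make the idea of SQL that generates SQL a little more concrete, here is a minimal sketch using Python's standard sqlite3 module. It is only an illustration under stated assumptions: the `orders` table and its columns are hypothetical, the summary measures chosen are arbitrary, and the table-valued `pragma_table_info` function is specific to SQLite (version 3.16 or later). Other databases expose column metadata differently, and the book's own queries may take another approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for this illustration.
conn.execute("CREATE TABLE orders (orderid INTEGER, state TEXT, totalprice REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "NY", 20.0), (2, "NY", 35.5), (3, "CA", None)],
)

# SQL that writes SQL: the query returns one summary SELECT per column.
generator = """
SELECT 'SELECT ''' || name || ''' AS colname, '
    || 'COUNT(*) AS numrows, '
    || 'COUNT(' || name || ') AS numvalues, '
    || 'COUNT(DISTINCT ' || name || ') AS numdistinct '
    || 'FROM orders'
FROM pragma_table_info('orders')
"""
generated = [row[0] for row in conn.execute(generator)]

# Combine the generated statements into a single query and run it.
summary_sql = " UNION ALL ".join(generated)
summary = conn.execute(summary_sql).fetchall()

for colname, numrows, numvalues, numdistinct in summary:
    print(colname, numrows, numvalues, numdistinct)
```

The generated statements are plain text, so they could just as well be pasted into a query tool or a spreadsheet before being run.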

Chapter 3, “How Different Is Different?”, explains some key concepts of descriptive statistics, such as averages, p-values, and the chi-square test. The purpose of this chapter is to show how to use such statistics on data residing in tables. The particular statistics and statistical tests are chosen for their practicality, and the chapter focuses on applying the methods, not explaining the underlying theory. Conveniently, most of the statistical tests that we want to do are feasible in Excel and even in SQL.

SQL Techniques

Several techniques are suited very well for the combination of SQL and Excel.

Chapter 4, “Where Is It All Happening? Location, Location, Location,” explains geography and how to incorporate geographic information into data analysis. Geography starts with locations, described by latitude and longitude. There are then various levels of geography, such as census blocks, zip code tabulation areas, and the more familiar counties and states, all of which have information available from the Census Bureau. This chapter also discusses various methods for comparing results at different levels of geography. And, finally, no discussion of geography would be complete without maps. Using Excel, it is possible to build very rudimentary maps.

Chapter 5, “It’s a Matter of Time,” discusses another key attribute of customer behavior, when things occur. This chapter describes how to access features of dates and times in databases, and then how to use this information to understand customers.

The chapter includes examples for accurately making year-over-year comparisons, for summarizing by day of the week, for measuring durations in days, weeks, and months, and for calculating the number of active customers by day, historically. The chapter ends with a simple animation in Excel.
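Two of the calculations mentioned for Chapter 5, durations in days and the number of customers active on a given day, can be sketched as follows. This is an illustration only, again using Python's sqlite3 module. The `subs` table, its columns, and the sample rows are hypothetical; `julianday()` is SQLite's date function, and other SQL dialects use different functions for date arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical subscription table; a NULL stop_date means still active.
conn.execute("CREATE TABLE subs (customerid TEXT, start_date TEXT, stop_date TEXT)")
conn.executemany(
    "INSERT INTO subs VALUES (?, ?, ?)",
    [
        ("a", "2005-01-10", "2005-03-01"),
        ("b", "2005-02-01", None),
        ("c", "2005-02-15", "2005-02-20"),
    ],
)


def active_on(day):
    """Count customers who started on or before `day` and had not yet stopped."""
    sql = """SELECT COUNT(*) FROM subs
             WHERE start_date <= ? AND (stop_date IS NULL OR stop_date > ?)"""
    return conn.execute(sql, (day, day)).fetchone()[0]


def tenures(as_of):
    """Tenure in days per customer, measured to the stop date or to `as_of`."""
    sql = """SELECT customerid,
                    CAST(julianday(COALESCE(stop_date, ?)) - julianday(start_date)
                         AS INTEGER) AS tenure_days
             FROM subs
             ORDER BY customerid"""
    return conn.execute(sql, (as_of,)).fetchall()


print(active_on("2005-02-16"))
print(tenures("2005-12-31"))
```

Storing dates as ISO-formatted text, as assumed here, lets ordinary string comparison act as date comparison in SQLite. A database with a native date type would compare the dates directly.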

Chapters 6 and 7, “How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value” and “Factors Affecting Survival: The What and Why of Customer Tenure,” explain one of the most important analytic
