Data Analysis Using
Copyright © 2008 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services or to obtain technical support, please contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317) 572-3993, or fax (317) 572-4002.
Library of Congress Cataloging-in-Publication Data:
of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Excel is a registered trademark of Microsoft Corporation in the United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
To Giuseppe for sixteen years, five books, and counting
About the Author
Gordon Linoff (gordon@data-miners.com) is a recognized expert in the field of data mining. He has more than twenty-five years of experience working with companies large and small to analyze customer data and to help design data warehouses. His passion for SQL and relational databases dates to the early 1990s, when he was building a relational database engine designed for large corporate data warehouses at the now-defunct Thinking Machines Corporation. Since then, he has had the opportunity to work with all the leading database vendors, including Microsoft, Oracle, and IBM.
With his colleague Michael Berry, Gordon has written three of the most popular books on data mining, starting with Data Mining Techniques for Marketing, Sales, and Customer Support. In addition to writing books on data mining, he also teaches courses on data mining, and has taught thousands of students on four continents.
Gordon is currently a principal at Data Miners, a consulting company he and Michael Berry founded in 1998. Data Miners is devoted to doing and teaching data mining and customer-centric data analysis.
Johnna VanHoose Dinse
Anniversary Logo Design
Richard Pacifico
Contents
Foreword
What Is an Entity-Relationship Diagram?
Subqueries for Naming Variables
Subqueries for Handling Summaries
Rewriting the “IN” as a JOIN
Correlated Subqueries
Subqueries for UNION ALL
Chapter 2 What’s In a Table? Getting Started with Data Exploration
A Basic Chart: Column Charts
Creating the Column Chart
Formatting the Column Chart
Useful Variations on the Column Chart
X-Y Charts (Scatter Plots)
Histograms
Cumulative Histograms of Counts
Histograms (Frequencies) for Numeric Values
Ranges Based on the Number of Digits, Using
Ranges Based on the Number of Digits, Using
More Refined Ranges: First Digit Plus Number of Digits
Breaking Numerics into Equal-Sized Groups
Minimum and Maximum Values
The Most Common Value (Mode)
Calculating Mode Using Standard SQL
Calculating Mode Using SQL Extensions
Calculating Mode Using String Operations
Strings Starting or Ending with Spaces
Handling Upper- and Lowercase
What Characters Are in a String?
What Are Average Sales By State?
How Often Are Products Repeated within a Single Order?
Direct Counting Approach
Comparison of Distinct Counts to Overall Counts
Which State Has the Most American Express Users?
From Summarizing One Column to Summarizing
Good Summary for One Column
Query to Get All Columns in a Table
Using SQL to Generate Summary Code
Chapter 3 How Different Is Different?
Confidence and Probability
How Many Californians?
Null Hypothesis and Confidence
How Many Customers Are Still Active?
Given the Count, What Is the Probability?
Given the Probability, What Is the Number of Stops?
The Rate or the Number?
Ratios, and Their Statistics
Standard Error of a Proportion
Confidence Interval on Proportions
Difference of Proportions
Conservative Lower Bounds
Chi-Square Calculation
Chi-Square Distribution
Chapter 4 Where Is It All Happening? Location, Location, Location
Definition of Latitude and Longitude
Degrees, Minutes, Seconds, and All That
Distance between Two Locations
Finding All Zip Codes within a Given Distance
Finding Nearest Zip Code in Excel
Pictures with Zip Codes
The Scatter Plot Map
Who Uses Solar Power for Heating?
Where Are the Customers?
Wealthiest Zip Code in a State?
Zip Code with the Most Orders in Each State
Interesting Hierarchies in Geographic Data
Calculating County Wealth
Identifying Counties
Distribution of Values of Wealth
Which Zip Code Is Wealthiest Relative to Its County?
County with Highest Relative Order Penetration
State Boundaries on Scatter Plots of Zip Codes
Plotting State Boundaries
Pictures of State Boundaries
Some Fundamentals of Dates and Times in Databases
Extracting Components of Dates and Times
Converting to Standard Formats
Intervals (Durations)
Verifying that Dates Have No Times
Comparing Counts by Date
Orderlines Shipped and Billed
Customers Shipped and Billed
Number of Different Bill and Ship Dates per Order
Counts of Orders and Order Sizes
Items as Measured by Number of Units
Items as Measured by Distinct Products
Size as Measured by Dollars
Billing Date by Day of the Week
Changes in Day of the Week by Year
Comparison of Days of the Week for Two Dates
A Business Problem about Days of the Week
Outline of a Solution
Using a Calendar Table Instead
How Many Customers on a Given Day?
How Many Customers Every Day?
How Many Customers of Different Types?
How Many Customers by Tenure Segment?
Order Date to Ship Date
Order Date to Ship Date by Year
Creating the One-Year Excel Table
Creating and Customizing the Chart
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Point Estimate for Survival
Calculating Survival for All Tenures
Calculating Survival in SQL
Step 1: Create the Survival Table
Step 2: Load POPT and STOPT
Step 3: Calculate Cumulative Population
Step 4: Calculate the Hazard
Step 5: Calculate the Survival
Step 6: Fix ENDTENURE and NUMDAYS in Last Row
Generalizing the SQL
A Simple Customer Retention Calculation
Comparison between Retention and Survival
Simple Example of Hazard and Survival
What Happens to a Mixture
Constant Hazard Corresponding to Survival
Summarizing the Markets
Stratifying by Market
How Has a Particular Hazard Changed over Time?
What Is Customer Survival by Year of Start?
What Did Survival Look Like in the Past?
Point Estimate of Survival
Median Customer Tenure
Average Customer Lifetime
Confidence in the Hazards
Estimating Future Revenue for One Future Start
SQL Day-by-Day Approach
SQL Summary Approach
Estimated Revenue for a Simple Group of Existing Customers
Estimated Second Year Revenue for a Homogenous Group
Pre-calculating Yearly Revenue by Tenure
Estimated Future Revenue for All Customers
Chapter 7 Factors Affecting Survival: The What and
Explanation of the Approach
Using Averages to Compare Numeric Variables
Recognizing Left Truncation
Effect of Left Truncation
How to Fix Left Truncation, Conceptually
Estimating Hazard Probability for One Tenure
Estimating Hazard Probabilities for All Tenures
Time Windows = Left Truncation + Right Censoring
Calculating One Hazard Probability Using a Time Window
All Hazard Probabilities for a Time Window
Comparison of Hazards by Stops in Year
A Cohort-Based Approach
The Survival Analysis Approach
Other Identifying Information
How Many New Customers Appear Each Year?
Which Households Are Increasing Purchase
Comparison of Earliest and Latest Values
Calculating the Earliest and Latest Values
Comparing the First and Last Values
Comparison of First Year Values and Last Year Values
Trend from the Best Fit Line
Calculating the Slope
Idea behind the Calculation
Calculating Next Purchase Date Using SQL
From Next Purchase Date to Time-to-Event
Stratifying Time-to-Event
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Scatter Plot of Products
Duplicate Products in Orders
Histogram of Number of Units
Products Associated with One-Time Customers
Products Associated with the Best Customers
Combinations of Two Products
Number of Two-Way Combinations
Generating All Two-Way Combinations
Examples of Combinations
Variations on Combinations
Combinations of Product Groups
Multi-Way Combinations
Households Not Orders
Combinations within a Household
Investigating Products within Households but
Multiple Purchases of the Same Product
Associations and Rules
Zero-Way Association Rules
What Is the Distribution of Probabilities?
What Do Zero-Way Associations Tell Us?
Example of One-Way Association Rules
Generating All One-Way Rules
One-Way Rules with Evaluation Information
One-Way Rules on Product Groups
Calculating Product Group Rules Using an
Multi-Way Associations
Rules Using Attributes of Products
Rules with Different Left- and Right-Hand Sides
Before and After: Sequential Associations
Chapter 10 Data Mining Models in SQL
Yes-or-No Models with Propensity Scores
Estimating Numeric Values
What Is the Best Zip Code?
A Basic Look-Alike Model
Look-Alike Using Z-Scores
Example of Nearest Neighbor Model
Calculating Most Popular Product Group
Evaluating the Lookup Model
Using a Profiling Lookup Model for Prediction
Using Binary Classification Instead
Most Basic Example: No Dimensions
Adding More Dimensions
Examining Nonstationarity
Evaluating the Model Using an Average Value Chart
The Overall Probability as a Model
Exploring Different Dimensions
How Accurate Are the Models?
Adding More Dimensions
Some Ideas in Probability
Chapter 11 The Best-Fit Line: Linear Regression Models
Tenure and Amount Paid
Properties of the Best-fit Line
What Does Best-Fit Mean?
Preserving the Averages
Trend Lines in Charts
Best-fit Line in Scatter Plots
Logarithmic, Power, and Exponential Trend Curves
Polynomial Trend Curves
Best-fit Using LINEST() Function
Returning Values in Multiple Cells
Calculating Expected Values
LINEST() for Logarithmic, Exponential, and Power Curves
Doing the Calculation
Calculating the Best-Fit Line in SQL
Price Frequency for $20 Books
Price Elasticity Model in SQL
Price Elasticity Average Value Chart
Customer Stops during the First Year
Weighted Best-Fit Line in a Chart
Weighted Best-Fit in SQL
Weighted Best-Fit Using Solver
The Weighted Best-Fit Line
Solver Is Better Than Guessing
Multiple Regression in Excel
Investigating Each Variable Separately
Building a Model with Three Input Variables
Using Solver for Multiple Regression
Choosing Input Variables One-By-One
Multiple Regression in SQL
Chapter 12 Building Customer Signatures for Further Analysis
Sources of Data for the Customer Signature
Current Customer Snapshot
Initial Customer Information
Self-Reported Information
External Data (Demographic and So On)
About Their Neighbors
Transaction Summaries
Using Customer Signatures
Predictive and Profile Modeling
Repository of Customer-Centric Business Metrics
Using an Existing Table as the Driving Table
Derived Table as the Driving Table
Customer Dimension Lookup Tables
Extracting Features
Geographic Location Information
Product Descriptions
Calculating Slope for Time Series
Calculating Slope from Pivoted Time Series
Calculating Slope for a Regular Time Series
Calculating Slope for an Irregular Time Series
Year, Month, and Day of Month
Return a Handful of Rows
Foreword
Gordon Linoff and I have written three and a half books together. (Four, if we get to count the second edition of Data Mining Techniques as a whole new book; it didn’t feel like any less work.) Neither of us has written a book without the other before, so I must admit to a tiny twinge of regret upon first seeing the cover of this one without my name on it next to Gordon’s. The feeling passed very quickly as recollections of the authorial life came flooding back: vacations spent at the keyboard instead of in or on the lake, opportunities missed, relationships strained. More importantly, this is a book that only Gordon Linoff could have written. His unique combination of talents and experiences informs every chapter.
I first met Gordon at Thinking Machines Corporation, a now long-defunct manufacturer of parallel supercomputers where we both worked in the late eighties and early nineties. Among other roles, Gordon managed the implementation of a parallel relational database designed to support complex analytical queries on very large databases. The design point for this database was radically different from other relational database systems available at the time in that no trade-offs were made to support transaction processing. The requirements for a system designed to quickly retrieve or update a single record are quite different from the requirements for a system to scan and join huge tables. Jettisoning the requirement to support transaction processing made for a cleaner, more efficient database for analytical processing. This part of Gordon’s background means he understands SQL for data analysis literally from the inside out.
Just as a database designed to answer big important questions has a different structure from one designed to process many individual transactions, a book about using databases to answer big important questions requires a different approach to SQL. Many books on SQL are written for database administrators. Others are written for users wishing to prepare simple reports. Still others attempt to introduce some particular dialect of SQL in every detail. This one is written for data analysts, data miners, and anyone who wants to extract maximum information value from large corporate databases. Jettisoning the requirement to address all the disparate types of database user makes this a better, more focused book for the intended audience. In short, this is a book about how to use databases the way we ourselves use them.
Even more important than Gordon’s database technology background is his many years as a data mining consultant. This has given him a deep understanding of the kinds of questions businesses need to ask and of the data they are likely to have available to answer them. Years spent exploring corporate databases have given Gordon an intuitive feel for how to approach the kinds of problems that crop up time and again across many different business domains:
■■ How to take advantage of geographic data. A zip code field looks much richer when you realize that from zip code you can get to latitude and longitude and from latitude and longitude you can get to distance. It looks richer still when you realize that you can use it to join in census bureau data to get at important attributes such as population density, median income, percentage of people on public assistance, and the like.
■■ How to take advantage of dates. Order dates, ship dates, enrollment dates, birth dates. Corporate data is full of dates. These fields look richer when you understand how to turn dates into tenures, analyze purchases by day of week, and track trends in fulfillment time. They look richer still when you know how to use this data to analyze time-to-event problems such as time to next purchase or expected remaining lifetime of a customer relationship.
■■ How to build data mining models directly in SQL. This book shows you how to do things in SQL that you probably never imagined possible, including generating association rules for market basket analysis, building regression models, and implementing naïve Bayesian models and scorecards.
■■ How to prepare data for use with data mining tools. Although more than most people realize can be done using just SQL and Excel, eventually you will want to use more specialized data mining tools. These tools need data in a specific format known as a customer signature. This book shows you how to create these data mining extracts.
The book is rich in examples and they all use real data. This point is worth saying more about. Unrealistic datasets lead to unrealistic results. This is frustrating to the student. In real life, the more you know about the business context, the better your data mining results will be. Subject matter expertise gives you a head start. You know what variables ought to be predictive and have good ideas about new ones to derive. Fake data does not reward these good ideas because patterns that should be in the data are missing and patterns that shouldn’t be there have been introduced inadvertently. Real data is hard to come by, not least because real data may reveal more than its owners are willing to share about their business operations. As a result, many books and courses make do with artificially constructed datasets. Best of all, the datasets used in the book are all available for download at the companion web site and
I reviewed the chapters of this book as they were written. This process was very beneficial to my own use of SQL and Excel. The exercise of thinking about the fairly complex queries used in the examples greatly increased my understanding of how SQL actually works. As a result, I have lost my fear of nested queries, multi-way joins, giant case statements, and other formerly daunting aspects of the language. In well over a decade of collaboration, I have always turned to Gordon for help using SQL and Excel to best advantage. Now, I can turn to this book. And you can too.
— Michael J A Berry
Acknowledgments
Although this book has only one name on the cover, there are many people over a long period of time who have helped me both specifically on this book and more generally in understanding data, analysis, and presentation.
Michael Berry, my business partner and colleague since 1998 at Data Miners, has been tremendously helpful on all fronts. He reviewed the chapters, tested the SQL code in the examples, and helped anonymize the data. His insights have been helpful and his debugging skills have made the examples much more accurate. His wife, Stephanie Jack, also deserves special praise for her patience and willingness to share Michael’s time.
Bob Elliott, my editor at Wiley, and the Wiley team not only accepted my original idea for this book, but have exhibited patience and understanding as I refined the ideas and layout.
Matt Keiser, President of Datran Marketing, and Howard Lehrman (formerly of Datran) were kind enough to provide computing power for testing many of the examples. Nick Drake, also of Datran, inspired the book by asking for a SQL reference focused on data analysis.
Throughout the chapters, the understanding of data processing is based on dataflows, which Craig Stanfill of Ab Initio Corporation first introduced me to, once upon a time when we worked together at Thinking Machines Corporation.
Stuart Ward and Zaiying Huang (from the New York Times) have spent countless hours over the past several years explaining statistical concepts to me. Harrison Sohmer, also of the New York Times, taught me many Excel tricks, some of which I’ve been able to include in the book.
Anne Milley of SAS Institute originally suggested that I learn survival analysis. Will Potts, now at CapitalOne, taught me much of what I know about the subject, including helping to develop two of the earliest survival analysis–based forecasts (and finally convincing me that hazard probabilities really can’t be negative). Brij Masand, a colleague at Data Miners, helped extend this knowledge to practical forecasting applications.
Jamie MacLennan and the SQL Server team at Microsoft have been helpful in answering my questions about the product.
There are a handful of people whom I’ve never met in person who have helped in various ways. Richard Stallman invented emacs and the Free Software Foundation; emacs provided the basis for the calendar table. Rob Bovey of Applications Professional, Inc. created the X-Y chart labeler used in several chapters. Robert Clair at the Census Bureau answered some questions via email. Juice Analytics inspired the example for Worksheet bar charts in Chapter 5 (and thanks to Alex Wimbush who pointed me in their direction). Edwin Straver of Frontline Systems answered several questions about Solver, introduced in Chapter 11.
Over the years, many colleagues, friends, and students have provided inspiration, questions, and answers. There are too many to list all of them, but I want to particularly thank Eran Abikhzer, Michael Benigno, Emily Cohen, Carol D’Andrea, Sonia Dubin, Lounette Dyer, Josh Goff, Richard Greenburg, Gregory Lampshire, Fiona McNeill, Karen Kennedy McConlogue, Alan Parker, Ashit Patel, Ronnie Rowton, Adam Schwebber, John Trustman, John Wallace, Kathleen Wright, and Zhilang Zhao. I would also like to thank the folks in the SAS Institute Training group who have organized, reviewed, and sponsored our data mining classes for many years, giving me the opportunity to meet many interesting and diverse people involved with data mining.
I also thank all those friends and family I’ve visited while writing this book and who (for the most part) allowed me the space and time to work: my mother, my father, my sister Debbie, my brother Joe, my in-laws Raimonda Scalia, Ugo Scalia, and Terry Sparacio, and my friends Jon Mosley, Paul Houlihan, Joe Hughes, and Maciej Zworski.
Finally, acknowledgments would be incomplete without thanking my life partner, Giuseppe Scalia, who has managed to maintain our sanity for the past year while I wrote this book.
Thank you everyone.
Introduction
Data. Analysis. Presentation. These three key capabilities are needed for effectively transforming data into information. And yet, these three topics are rarely treated together. Other books focus on one or the other: on the details of relational databases, or on applying statistics to business problems, or on using Excel. This book approaches the challenges of data analysis from a more holistic perspective, and with the aim of explaining the relevant ideas both to people responsible for analyzing data and to people who want to use such information, responsibly.
The motivation for this approach came from a colleague, Nick Drake, who is a statistician by training. Once upon a time, he was looking for a book that would explain how to use SQL for the complex queries needed for data analysis. There are many books on SQL, few focused on using the language for queries, and none that come strictly from a perspective of analyzing data. Similarly, there are many books on statistics, none of which address the simple fact that most of the data being used resides in relational databases. This book is intended to fill that gap.
There are many approaches to data analysis. My earlier books, written with Michael Berry, focus on the more advanced algorithms and case studies usually falling under the heading “data mining.” By contrast, this book focuses on the “how-to.” It starts by describing data stored in databases and continues through preparing and producing results. Interspersed are stories based on my experience in the field, explaining how results might be applied and why some things work and other things do not. The examples are so practical that the data used for them is available on the companion web site and at www.data-miners.com.
One of the truisms about data warehouses and analysis databases in general is that they don’t actually do anything. Yes, they store data. Yes, they bring together data from different sources, cleansing and clarifying along the way. Yes, they define business dimensions, store transactions about customers, and, perhaps, summarize important data. (And, yes, all these are very important!) However, data in a database resides on so many spinning disks and in complex data structures in a computer’s memory. So much data. So little information.
Oil deposits and diamonds hidden in rich seams beneath the surface of the earth are worth much less than gasoline at the pump or Tiffany diamond rings. Prospectors can make a quick buck on such deposits. On the other hand, the companies willing to invest the dollars to transform and process the raw materials into marketable goods are the ones that uncover the long-term riches.
This book is about the basic tools needed for exploiting data, particularly data that describes customers. There are many fancy algorithms for statistical modeling and data mining. However, “garbage-in, garbage-out.” The results of even the most sophisticated techniques are only as good as the data being used. Data is central to the task of understanding customers, understanding products, and understanding markets.
The chapters in this book discuss different aspects of data and several different analytic techniques. The analytic techniques range from exploratory data analysis to survival analysis, from market basket analysis to naïve Bayesian models, from simple animations to regression. Of course, the potential range of possible techniques is much larger than can be presented in one book. The methods have proven useful over time and are applicable in many different areas.
And finally, data and analysis are not enough. Data must be analyzed, and the results must be presented to the right audience. To fully exploit its value, we must transform data into stories and scenarios, charts and metrics, and insights.
Overview of the Book and Technology
This book focuses on three key technological areas used for transforming data into actionable information:
data is SQL
the most powerful feature of Excel is the charting capability, which turns columns of numbers into pictures
These three technologies are presented together, because they are all interrelated. SQL answers the question “how do we pull data from a database?” Statistics answers the question “how is it relevant?” And Excel makes it possible to convince other people of the veracity of what we find.
The description of data processing is organized around the SQL language. Although there are extensions of SQL and other very powerful data manipulation languages, SQL is common to most databases. And, databases such as Oracle, IBM’s DB2, and Microsoft SQL Server are common in the business world, storing the vast majority of business data transactions. Other databases such as MySQL are available at no cost and easily downloaded. The good news is that all relational databases support SQL as a query language. However, just as England and the United States have been described as “two countries separated by a common language,” each database supports a slightly different dialect of SQL. The Appendix contains a list of commonly used functions and how they are represented in various dialects.
Similarly, there are beautiful presentation tools and professional graphics packages. However, very rare and exceptional is the workplace computer that does not have Excel (or an equivalent spreadsheet).
Statistics and data mining techniques do not always require advanced tools. Some very important techniques are readily available using the combination of SQL and Excel, including survival analysis, naïve Bayesian models, and association rules. In fact, the methods in this book are often more powerful than the methods available in many statistics and data mining tools, precisely because they are close to the data and customizable for specific applications. The explanation of the techniques covers both the basic ideas and the extensions that may not be available in other tools.
The chapters describing the various techniques provide a solid introduction to modeling and data exploration, in the context of familiar tools and data. They also highlight when the more advanced tools are useful, because there is not a simpler solution using more readily available tools.
In the interests of full disclosure, I should admit that in the early 1990s I worked on a package called Darwin at a company called Thinking Machines. In the intervening years, this package has become much more powerful and user-friendly, and has now grown into Oracle Data Mining. In addition to Oracle, SQL Server offers data mining extensions within the tool, an exciting development that brings advanced data analysis even closer to the data.
This book does not discuss such functionality at all. The methods in the chapters have been chosen for their general applicability to data stored in relational databases. The explicit purpose is not to focus on a particular relational database engine. In many ways, the methods discussed here complement such extensions.
How This Book Is Organized
The twelve chapters in this book fall roughly into three parts. The first three introduce key concepts of SQL, Excel, and statistics. The six middle chapters discuss various methods of exploring data, and techniques specifically suited to SQL and Excel. The last three focus on the idea of modeling, in the sense of statistics and data mining.
Each chapter explains some aspect of data analysis using SQL and Excel from several different perspectives, including:
www.data-miners.com.
SQL is a concise language that is sometimes difficult to follow. Dataflows, graphical representations of data processing that explain data manipulations, are used to illustrate how the SQL works.
Results are presented in charts and tables, sprinkled throughout the book. In addition, important features of Excel are highlighted, and interesting uses of Excel graphics are explained. Each chapter has a couple of technical asides, typically explaining some aspect of a technique or an interesting bit of history associated with the methods described in the chapter.
Introductory Chapters
The first chapter, “A Data Miner Looks at SQL,” introduces SQL from the perspective of data analysis. This is the querying part of the SQL language, where data stored in databases is extracted using SQL queries.
Ultimately, data about customers and about the business is stored in SQL databases. This chapter introduces entity-relationship diagrams to describe the structure of the data: the tables and columns and how they relate to each other. It also introduces dataflows to describe the processing of queries; dataflows provide a graphical explanation of how data is processed.
The first chapter also describes the datasets used for examples throughout the book (and which are also available on the companion web site). This data includes tables describing retail purchases, tables describing mobile telephone customers, and reference tables that describe zip codes and the calendar.
The second chapter, “What’s In a Table? Getting Started with Data Exploration,” introduces Excel for exploratory data analysis and presentation. Of many useful capabilities in Excel, perhaps the most useful are charts. As the ancient Chinese saying goes, “a picture paints a thousand words,” and Excel makes it possible to paint pictures using data. Such charts are not only useful aesthetically, but more practically for Word documents, PowerPoint, email, the Web, and so on.
Charts are not a means unto themselves. This chapter starts down the road of exploratory data analysis, using charts to convey interesting summaries of data. In addition, this chapter discusses summarizing columns in a table, as well as the interesting idea of using SQL to generate SQL queries.
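As a rough illustration of the idea of generating SQL from SQL metadata, the sketch below reads a table's column names from the database and builds one summary query per column. It is written in Python against SQLite only so that it is self-contained; the `orders` table, its columns, and the helper `summary_queries` are hypothetical and are not taken from the book, and other databases expose column metadata in different ways.

```python
import sqlite3

def summary_queries(conn, table):
    """Build one summary SELECT per column of `table` from its metadata."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    template = (
        "SELECT '{col}' AS colname, COUNT(*) AS numrows, "
        "COUNT(DISTINCT {col}) AS numvalues, "
        "SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS numnulls, "
        "MIN({col}) AS minval, MAX({col}) AS maxval FROM {table}"
    )
    return [template.format(col=col, table=table) for col in columns]

# Hypothetical data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (state TEXT, totalprice REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NY", 20.0), ("NY", 35.5), ("CA", 12.0), (None, 8.0)],
)

# Combine the generated statements into a single query and run it.
combined = " UNION ALL ".join(summary_queries(conn, "orders"))
summary = conn.execute(combined).fetchall()
for row in summary:
    print(row)
```

Each output row describes one column: its name, the number of rows, the number of distinct non-NULL values, the number of NULLs, and the minimum and maximum. Building the SQL text from metadata means the same summary can be produced for a table with many columns without typing each query by hand.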
Chapter 3, “How Different Is Different?”, explains some key concepts of descriptive statistics, such as averages, p-values, and the chi-square test. The purpose of this chapter is to show how to use such statistics on data residing in tables. The particular statistics and statistical tests are chosen for their practicality, and the chapter focuses on applying the methods, not explaining the underlying theory. Conveniently, most of the statistical tests that we want to do are feasible in Excel and even in SQL.
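To make the chi-square test mentioned above a little more concrete, here is a minimal sketch of the standard calculation in plain Python: expected counts come from the row and column totals, and the statistic sums the squared differences between observed and expected counts, each divided by the expected count. The function name and the counts are made up for illustration; this is the textbook formula, not code from the book.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column categories were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: rows are two customer groups, columns are stopped / still active.
counts = [[10, 20], [30, 40]]
chi2, dof = chi_square(counts)
print(round(chi2, 4), dof)
```

The statistic and the degrees of freedom are the two inputs needed to look up a p-value, for example with Excel's CHIDIST() function.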
SQL Techniques
Several techniques are suited very well for the combination of SQL and Excel.
Chapter 4, “Where Is It All Happening? Location, Location, Location,” explains geography and how to incorporate geographic information into data analysis. Geography starts with locations, described by latitude and longitude. There are then various levels of geography, such as census blocks, zip code tabulation areas, and the more familiar counties and states, all of which have information available from the Census Bureau. This chapter also discusses various methods for comparing results at different levels of geography. And, finally, no discussion of geography would be complete without maps. Using Excel, it is possible to build very rudimentary maps.
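One way to see how latitude and longitude lead to distance is the haversine formula for great-circle distance, sketched below in Python. This is one common choice and is not necessarily the formula the chapter uses; the Earth radius constant, the function name, and the sample coordinates are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius; an assumption of this sketch

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Hypothetical coordinates, roughly New York City and Boston.
nyc = (40.71, -74.01)
boston = (42.36, -71.06)
print(round(distance_miles(*nyc, *boston)))
```

The same arithmetic can be written with the trigonometric functions available in most SQL dialects or in Excel, which is what makes a zip code table with latitude and longitude columns usable for distance questions.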
Chapter 5, “It’s a Matter of Time,” discusses another key attribute of customer behavior, when things occur. This chapter describes how to access features of dates and times in databases, and then how to use this information to understand customers.
The chapter includes examples for accurately making year-over-year comparisons, for summarizing by day of the week, for measuring durations in days, weeks, and months, and for calculating the number of active customers by day, historically. The chapter ends with a simple animation in Excel.
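Two of the calculations named above, durations in days and the number of customers active on a given day, can be sketched with a few lines of SQL. The example below runs the SQL through Python's sqlite3 module so that it is self-contained; the `subs` table, its columns, and the sample rows are hypothetical, and the date functions shown are SQLite's, so other dialects will spell them differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subs (customer_id INTEGER, start_date TEXT, stop_date TEXT)"
)
# Hypothetical customers; a NULL stop_date means the customer is still active.
conn.executemany("INSERT INTO subs VALUES (?, ?, ?)", [
    (1, "2006-01-15", "2006-03-01"),
    (2, "2006-02-01", None),
    (3, "2006-02-20", "2006-02-25"),
])

def active_on(conn, day):
    """Count customers who had started by `day` and had not yet stopped."""
    sql = ("SELECT COUNT(*) FROM subs WHERE start_date <= ? "
           "AND (stop_date IS NULL OR stop_date > ?)")
    return conn.execute(sql, (day, day)).fetchone()[0]

def tenures(conn, cutoff):
    """Tenure in days per customer, measured to the stop date or to `cutoff`."""
    sql = ("SELECT customer_id, "
           "CAST(julianday(COALESCE(stop_date, ?)) - julianday(start_date) AS INTEGER) "
           "FROM subs ORDER BY customer_id")
    return conn.execute(sql, (cutoff,)).fetchall()

print(active_on(conn, "2006-02-22"))
print(tenures(conn, "2006-03-31"))
```

Storing dates as ISO-formatted text (YYYY-MM-DD) is what lets the plain string comparisons in `active_on` behave as date comparisons in SQLite; a database with a native date type would compare the dates directly.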
Chapters 6 and 7, “How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value” and “Factors Affecting Survival: The What and Why of Customer Tenure,” explain one of the most important analytic