Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner (Shmueli, Patel, and Bruce; 2010-10-26)

Foreword

Preface to the second edition

Preface to the first edition

Acknowledgments

PART I PRELIMINARIES

Chapter 1 Introduction

1.1 What Is Data Mining?

1.2 Where Is Data Mining Used?

1.3 Origins of Data Mining

1.4 Rapid Growth of Data Mining

1.5 Why Are There So Many Different Methods?

1.6 Terminology and Notation

1.7 Road Maps to This Book

Chapter 2 Overview of the Data Mining Process

2.1 Introduction

2.2 Core Ideas in Data Mining

2.3 Supervised and Unsupervised Learning

2.4 Steps in Data Mining

Chapter 3 Data Visualization

3.1 Uses of Data Visualization

4.7 Principal Components Analysis

4.8 Dimension Reduction Using Regression Models

4.9 Dimension Reduction Using Classification and Regression Trees

PROBLEMS

PART III PERFORMANCE EVALUATION

Chapter 5 Evaluating Classification and Predictive Performance

5.1 Introduction

5.2 Judging Classification Performance

5.3 Evaluating Predictive Performance

6.2 Explanatory versus Predictive Modeling

6.3 Estimating the Regression Equation and Prediction

6.4 Variable Selection in Linear Regression

PROBLEMS

Chapter 7 k-Nearest Neighbors (k-NN)

7.1 k-NN Classifier (categorical outcome)

7.2 k-NN for a Numerical Response

7.3 Advantages and Shortcomings of k-NN Algorithms

PROBLEMS

Chapter 8 Naive Bayes

8.1 Introduction

8.2 Applying the Full (Exact) Bayesian Classifier

8.3 Advantages and Shortcomings of the Naive Bayes Classifier

9.6 Classification Rules from Trees

9.7 Classification Trees for More Than Two Classes

10.2 Logistic Regression Model

10.3 Evaluating Classification Performance

10.4 Example of Complete Analysis: Predicting Delayed Flights

10.5 Appendix: Logistic Regression for Profiling

PROBLEMS

Chapter 11 Neural Nets

11.1 Introduction

11.2 Concept and Structure of a Neural Network

11.3 Fitting a Network to Data

11.4 Required User Input

11.5 Exploring the Relationship Between Predictors and Response

11.6 Advantages and Weaknesses of Neural Networks

PROBLEMS

Chapter 12 Discriminant Analysis

12.1 Introduction

12.2 Distance of an Observation from a Class

12.3 Fisher’s Linear Classification Functions

12.4 Classification Performance of Discriminant Analysis

12.5 Prior Probabilities

12.6 Unequal Misclassification Costs

12.7 Classifying More Than Two Classes

12.8 Advantages and Weaknesses

13.3 Generating Candidate Rules

13.4 Selecting Strong Rules

14.3 Measuring Distance Between Two Clusters

14.4 Hierarchical (Agglomerative) Clustering

14.5 Nonhierarchical Clustering: The k-Means Algorithm

PROBLEMS

PART VI FORECASTING TIME SERIES

Chapter 15 Handling Time Series

15.1 Introduction

15.2 Explanatory versus Predictive Modeling

15.3 Popular Forecasting Methods in Business

15.4 Time Series Components

15.5 Data Partitioning

PROBLEMS

Chapter 16 Regression-Based Forecasting

16.1 Model with Trend

16.2 Model with Seasonality

16.3 Model with Trend and Seasonality

16.4 Autocorrelation and ARIMA Models

Chapter 17 Smoothing Methods

17.1 Introduction

17.2 Moving Average

17.3 Simple Exponential Smoothing

17.4 Advanced Exponential Smoothing

18.3 Tayko Software Cataloger

18.4 Segmenting Consumers of Bath Soap

Index

To our families

Boaz and Noa

Tehmi, Arjun, and in memory of Aneesh

Liz, Lisa, and Allison

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Shmueli, Galit, 1971-

Data mining for business intelligence: concepts, techniques, and applications in Microsoft Office Excel with XLMiner / Galit Shmueli, Nitin R. Patel, Peter C. Bruce. – 2nd ed.

p. cm.

Includes bibliographical references and index.

ISBN 978-0-470-52682-8 (cloth)

1. Business–Data processing. 2. Data mining. 3. Microsoft Excel (Computer file) I. Patel, Nitin R. (Nitin Ratilal) II. Bruce, Peter C., 1953- III. Title.

HF5548.2.S44843 2010

005.54–dc22

2010005152

Foreword

Data mining, the art of extracting useful information from large amounts of data, is of growing importance in today’s world. Your e-mail spam filter relies at least in part on rules that a data mining algorithm has learned from examining millions of e-mail messages that have been classified as spam or not spam. Real-time data mining methods enable Web-based merchants to tell you that “customers who purchased x are also likely to purchase y.” Data mining helps banks determine which applicants are likely to default on loans, helps tax authorities identify which tax returns are most likely to be fraudulent, and helps catalog merchants target those customers most likely to purchase.

And data mining is not just about numbers: text mining techniques help search engines like Google and Yahoo find what you are looking for by ordering documents according to their relevance to your query. In the process they have effectively monetized search by ordering sponsored ads that are relevant to your query.

The amount of data flowing from, to, and through enterprises of all sorts is enormous, and growing rapidly, more rapidly than the capabilities of organizations to use it. Successful enterprises are those that make effective use of the abundance of data to which they have access: to make better predictions, better decisions, and better strategies. The margin over a competitor may be small (they, after all, have access to the same methods for making effective use of information), hence the need to take advantage of every possible avenue to advantage.

At no time has the need been greater for quantitatively skilled managerial expertise. Successful managers now need to know about the possibilities and limitations of data mining. But at what level? A high-level overview can provide a general idea of what data mining can do for the enterprise but fails to provide the intuition that could be attained by actually building models with real data. A very technical approach from the computer science, database, or statistical standpoint can get bogged down in detail that has little bearing on decision making.

It is essential that managers be able to translate business or other functional problems into the appropriate statistical problem before it can be “handed off” to a technical team. But it is difficult for managers to do this with confidence unless they have actually had hands-on experience developing models for a variety of real problems using real data. That is the perspective of this book: the use of real data, actual cases, and an Excel-based program to build and compare models with a minimal learning curve.

DARYL PREGIBON

Google Inc., 2006

Preface to the Second Edition

Since the book’s appearance in early 2007, it has been used in many classes, ranging from dedicated data mining classes to more general business intelligence courses. Following feedback from instructors teaching both MBA and undergraduate courses, as well as students, we revised some of the existing chapters and added coverage of two new topics that are central in data mining: data visualization and time series forecasting.

We have added a set of three chapters on time series forecasting (Chapters 15–17), which present the most commonly used forecasting tools in the business world. They include a set of new datasets and exercises, and a new case (in Chapter 18).

The chapter on data visualization provides comprehensive coverage of basic and advanced visualization techniques that support the exploratory step of data mining. We also provide a discussion of interactive visualization principles and tools, and the chapter exercises include assignments to familiarize readers with interactive visualization in practice.

In the new edition we have created separate chapters for the k-nearest-neighbor and naive Bayes methods. The explanation of the naive Bayes classifier is now clearer, and additional exercises have been added to both chapters. Another addition is the set of brief chapter summaries at the beginning of each chapter.

We have also reorganized the order of some chapters, following readers’ feedback. The chapters are now grouped into seven parts: Preliminaries, Data Exploration and Dimension Reduction, Performance Evaluation, Prediction and Classification Methods, Mining Relationships Among Records, Forecasting Time Series, and Cases. The new organization is aimed at helping instructors of various types of courses to choose subsets of topics to teach.

Two-semester data mining courses could cover in detail data exploration and dimension reduction and supervised learning in one term (choosing the type and amount of prediction and classification methods according to the course flavor and the audience interest). Forecasting time series and unsupervised learning can be covered in the second term.

Single-semester data mining courses would do best to concentrate on the first parts of the book, and only introduce time series forecasting as time allows. This is especially true if a dedicated forecasting course is offered in the program.

General business intelligence courses would best focus on the first three parts, then choose a small number of prediction/classification methods for illustration, and present the mining relationships chapters. All these can be covered via a few cases, where students read the relevant chapters that support the analysis done in the case.

A set of data mining courses that constitute a concentration can be built according to the sequence of parts in the book. The first three parts (Preliminaries, Data Exploration and Dimension Reduction, and Performance Evaluation) should serve as requirements for the next courses. Cases can be used either within appropriate topic courses or as project-type courses.

In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many data mining competition datasets available). From our experience and other instructors’ experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of data mining and the challenges that arise in the process.

Preface to the First Edition

This book arose out of a data mining course at MIT’s Sloan School of Management and was refined during its use in data mining courses at the University of Maryland’s R. H. Smith School of Business and at statistics.com. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor’s goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.

Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:

1. To provide both a theoretical and a practical understanding of the key methods of classification, prediction, reduction, and exploration that are at the heart of data mining.

2. To provide a business decision-making context for these methods.

3. Using real business cases, to illustrate the application and interpretation of these methods.

The presentation of the cases in the book is structured so that the reader can follow along and implement the algorithms on his or her own with a very low learning hurdle.

Just as a natural science course without a lab component would seem incomplete, a data mining course without practical work with actual data is missing a key ingredient. The MIT data mining course that gave rise to this book followed an introductory quantitative course that relied on Excel; this made its practical work universally accessible. Using Excel for data mining seemed a natural progression.

An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative datasets) are provided in an Excel add-in, XLMiner. Data for both the

www.dataminingbook.com

Although the genesis for this book lay in the need for a case-oriented guide to teaching data mining, analysts and consultants who are considering the application of data mining techniques in contexts where they are not currently in use will also find this a useful, practical guide.

Acknowledgments

The authors thank the many people who assisted us in improving the first edition and improving it further in the second edition. Anthony Babinec, who has been using drafts of this book for years in his data mining courses at statistics.com, provided us with detailed and expert corrections. Similarly, Dan Toy and John Elder IV greeted our project with enthusiasm and provided detailed and useful comments on earlier drafts. Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on both editions; Bruce McCullough and Adam Hughes did the same for the first edition. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business, provided invaluable comments and helpful suggestions. Useful comments and feedback have also come from the many instructors, too numerous to mention, who have used the book in their classes.

From the Smith School of Business at the University of Maryland, colleagues Shrivardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and comments. We thank Robert Windle, and MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts. And we thank the many MBA students for fruitful discussions and interesting data mining projects that have helped shape and improve the book.

This book would not have seen the light of day without the nurturing support of the faculty at the Sloan School of Management at MIT. Our special thanks to Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering. We are grateful to the MBA students at Sloan for stimulating discussions in the class that led to refinement of the notes as well as XLMiner.

Chris Albright, Gregory Piatetsky-Shapiro, Wayne Winston, and Uday Karmarkar gave us helpful advice on the use of XLMiner. Anand Bodapati provided both data and advice. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case.

We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. They helped us develop a balanced philosophical perspective on the emerging field of data mining.

Our thanks to Ajay Sathe, who energetically shepherded XLMiner’s development over the years and continues to do so, and to his colleagues on the XLMiner team: Suresh Ankolekar, Poonam Baviskar, Kuber Deokar, Rupali Desai, Yogesh Gajjar, Ajit Ghanekar, Ayan Khare, Bharat Lande, Dipankar Mukhopadhyay, S. V. Sabnis, Usha Sathe, Anurag Srivastava, V. Subramaniam, Ramesh Raman, and Sanhita Yeolkar.

Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book to the starting gate. We are also grateful to Ashwini Kumthekar, Achala Sabane, Michael Shapard, and Heidi Sestrich who assisted with typesetting, figures, and indexing, and to Valerie Troiano who has shepherded many instructors through the use of XLMiner and early drafts of this text.

We also thank Catherine Plaisant at the University of Maryland’s Human-Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter, Marietta Tretter at Texas A&M for her helpful comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman for feedback and suggestions on the data visualization chapter and overall design tips.

Part One

Preliminaries

Chapter 1

Introduction

1.1 What Is Data Mining?

The field of data mining is still relatively new and in a state of evolution. The first International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995, and there are a variety of definitions of data mining.

A concise definition that captures the essence of data mining is:

Extracting useful information from large data sets. (Hand et al., 2001)

A slightly longer version is:

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Berry and Linoff, 1997, p. 5)

Berry and Linoff later had cause to regret the 1997 reference to “automatic or semi-automatic means,” feeling that it shortchanged the role of data exploration and analysis (Berry and Linoff, 2000).

Another definition comes from the Gartner Group, the information technology research firm:

[Data Mining is] the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques. (http://www.gartner.com/6_help/glossary, accessed May 14, 2010)

A summary of the variety of methods encompassed in the term data mining is given at the beginning of Chapter 2.

1.2 Where Is Data Mining Used?

Data mining is used in a variety of fields and applications. The military use data mining to learn what roles various factors play in the accuracy of bombs. Intelligence agencies might use it to determine which of a huge quantity of intercepted communications are of interest. Security specialists might use these methods to determine whether a packet of network data constitutes a threat. Medical researchers might use it to predict the likelihood of a cancer relapse.

Although data mining methods and tools have general applicability, most examples in this book are chosen from the business world. Some common business questions that one might address through data mining methods include:

1. From a large list of prospective customers, which are most likely to respond? We can use classification techniques (logistic regression, classification trees, or other methods) to identify those individuals whose demographic and other data most closely match those of our best existing customers. Similarly, we can use prediction techniques to forecast how much individual prospects will spend.

2. Which customers are most likely to commit, for example, fraud (or might already have committed it)? We can use classification methods to identify (say) medical reimbursement applications that have a higher probability of involving fraud and give them greater attention.

3. Which loan applicants are likely to default? We can use classification techniques to identify them (or logistic regression to assign a “probability of default” value).

4. Which customers are most likely to abandon a subscription service (telephone, magazine, etc.)? Again, we can use classification techniques to identify them (or logistic regression to assign a “probability of leaving” value). In this way, discounts or other enticements can be proffered selectively.
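The “probability of leaving” idea in item 4 can be sketched in a few lines of Python. This is an illustration only, not the book's XLMiner workflow: the predictor names, the intercept and coefficients, the four customers, and the 0.5 cutoff are all invented. In practice the coefficients would be estimated from historical records.

```python
import math

def churn_probability(months_as_customer, support_calls):
    """Logistic model: map a linear score to a probability between 0 and 1.

    The intercept and coefficients below are made-up numbers for illustration.
    """
    score = -1.0 - 0.05 * months_as_customer + 0.6 * support_calls
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical customers: (id, months as customer, support calls last quarter)
customers = [("A", 48, 0), ("B", 6, 4), ("C", 24, 2), ("D", 3, 1)]

# Score every customer and rank from most to least likely to leave.
scored = [(cid, churn_probability(m, calls)) for cid, m, calls in customers]
scored.sort(key=lambda pair: pair[1], reverse=True)

# Offer the retention discount only to customers above a chosen cutoff.
CUTOFF = 0.5
targeted = [cid for cid, p in scored if p > CUTOFF]

for cid, p in scored:
    print(f"{cid}: probability of leaving = {p:.2f}")
print("Offer discount to:", targeted)
```

With these made-up numbers only customer B exceeds the cutoff. Lowering the cutoff would reach more customers at a higher total cost of discounts.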

1.3 Origins of Data Mining

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. But the core tenets of classical statistics (computing is difficult and data are scarce) do not apply in data mining applications where both data and computing power are plentiful.

This gives rise to Daryl Pregibon’s description of data mining as “statistics at scale and speed” (Pregibon, 1999). A useful extension of this is “statistics at scale, speed, and simplicity.” Simplicity in this case refers not to the simplicity of algorithms but, rather, to simplicity in the logic of inference. Due to the scarcity of data in the classical statistical setting, the same sample is used to make an estimate and also to determine how reliable that estimate might be. As a result, the logic of the confidence intervals and hypothesis tests used for inference may seem elusive for many, and their limitations are not well appreciated. By contrast, the data mining paradigm of fitting a model with one sample and assessing its performance with another sample is easily understood.
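The paradigm of fitting a model with one sample and assessing it with another can be sketched as follows. The data are synthetic, and the 60/40 split, the straight-line model, and the helper names `fit_line` and `rmse` are assumptions chosen for brevity rather than anything prescribed by the book.

```python
import random

random.seed(1)  # fixed seed so the split is reproducible

# Synthetic records (x, y): y is roughly 3 + 2x plus noise. Invented for illustration.
records = []
for _ in range(200):
    x = random.uniform(0, 10)
    records.append((x, 3.0 + 2.0 * x + random.gauss(0, 1)))

# Partition: 60% of records to fit the model, 40% held out to assess it.
random.shuffle(records)
training, validation = records[:120], records[120:]

def fit_line(sample):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(sample, intercept, slope):
    """Root-mean-squared prediction error of the line on a sample."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in sample]
    return (sum(squared) / len(squared)) ** 0.5

intercept, slope = fit_line(training)  # uses the training records only
training_rmse = rmse(training, intercept, slope)
validation_rmse = rmse(validation, intercept, slope)  # records the model never saw

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
print(f"training RMSE = {training_rmse:.2f}, validation RMSE = {validation_rmse:.2f}")
```

The validation figure is the one to report, since it comes from records that played no part in fitting the line.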

Computer science has brought us machine learning techniques, such as trees and neural networks, that rely on computational intensity and are less structured than classical statistical models. In addition, the growing field of database management is also part of the picture.

The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining. In comparison to statistics, data mining deals with large datasets in open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.

As a result, the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.
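Overfitting can be made concrete with a small experiment. The sketch below is an assumption-laden illustration on synthetic data: the "memorizer" (which simply returns the outcome of the closest training record) stands in for any model flexible enough to fit the noise, and the one-parameter line stands in for a model that captures only the signal.

```python
import random

random.seed(7)

def make_sample(n):
    """Signal y = 2x plus random noise with standard deviation 1 (synthetic)."""
    sample = []
    for _ in range(n):
        x = random.uniform(0, 10)
        sample.append((x, 2.0 * x + random.gauss(0, 1)))
    return sample

training = make_sample(300)
validation = make_sample(300)

def predict_memorizer(x):
    """Return the y of the closest training record, so the noise is reproduced too."""
    nearest = min(training, key=lambda rec: abs(rec[0] - x))
    return nearest[1]

# Simple model: least-squares line through the origin, fit on the training sample.
slope = sum(x * y for x, y in training) / sum(x * x for x, _ in training)

def predict_line(x):
    return slope * x

def mse(predict, sample):
    """Mean squared prediction error of a model on a sample."""
    return sum((y - predict(x)) ** 2 for x, y in sample) / len(sample)

for name, model in [("memorizer", predict_memorizer), ("line", predict_line)]:
    print(f"{name:9s} training MSE = {mse(model, training):.2f}  "
          f"validation MSE = {mse(model, validation):.2f}")
```

The memorizer's training error is zero because it reproduces every training record exactly, yet its error on fresh data is clearly larger than that. Judged on the training sample alone it would look like the better model, which is why performance is assessed on data not used for fitting.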

1.4 Rapid Growth of Data Mining

Perhaps the most important factor propelling the growth of data mining is the growth of data. The mass retailer Wal-Mart in 2003 captured 20 million transactions per day in a 10-terabyte database (a terabyte is 1 million megabytes). In 1950, the largest companies had only enough data to occupy, in electronic form, several dozen megabytes. Lyman and Varian (2003) estimate that 5 exabytes of information were produced in 2002, double what was produced in 1999 (1 exabyte is 1 million terabytes); 40% of this was produced in the United States.

The growth of data is driven not simply by an expanding economy and knowledge base but by the decreasing cost and increasing availability of automatic data capture mechanisms. Not only are more events being recorded, but more information per event is captured. Scannable bar codes, point-of-sale (POS) devices, mouse click trails, and global positioning system (GPS) data are examples.

The growth of the Internet has created a vast new arena for information generation. Many of the same actions that people undertake in retail shopping, exploring a library, or catalog shopping have close analogs on the Internet, and all can now be measured in the most minute detail. In marketing, a shift in focus from products and services to a focus on the customer and his or her needs has created a demand for detailed data on customers.

The operational databases used to record individual transactions in support of routine business activity can handle simple queries but are not adequate for more complex and aggregate analysis. Data from these operational databases are therefore extracted, transformed, and exported to a data warehouse, a large integrated data storage facility that ties together the decision support systems of an enterprise. Smaller data marts devoted to a single subject may also be part of the system. They may include data from external sources (e.g., credit rating data).

Many of the exploratory and analytical techniques used in data mining would not be possible without today’s computational power. The constantly declining cost of data storage and retrieval has made it possible to build the facilities required to store and make available vast amounts of data. In short, the rapid and continuing improvement in computing capacity is an essential enabler of the growth of data mining.

1.5 Why Are There So Many Different Methods?

As can be seen in this book or any other resource on data mining, there are many different methods for prediction and classification. You might ask yourself why they coexist and whether some are better than others. The answer is that each method has advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line.

Different methods can lead to different results, and their performance can vary. It is therefore customary in data mining to apply several different methods and select the one that is most useful for the goal at hand.
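The contrast between the two kinds of separating rule can be sketched in code. The eight households below are invented (they are not the data behind Figure 1.1), as are the units and the thresholds in both rules. The records were deliberately chosen so that a diagonal boundary fits them well, and on other data the ranking could reverse.

```python
# Invented records: (income in $000s, lot size in 000s sq ft, is_buyer).
households = [
    (90, 20, True), (110, 17, True), (60, 22, True), (80, 19, True),
    (50, 18, False), (65, 17, False), (45, 21, False), (70, 16, False),
]

def axis_parallel_rule(income, lot_size):
    """One vertical and one horizontal cut: buyer if both values are large."""
    return income > 60 and lot_size > 18

def diagonal_rule(income, lot_size):
    """A single diagonal line: buyer if a weighted sum exceeds a threshold."""
    return income + 5 * lot_size > 160

def accuracy(rule):
    """Share of households whose predicted class matches the actual class."""
    hits = sum(rule(income, lot) == is_buyer
               for income, lot, is_buyer in households)
    return hits / len(households)

axis_accuracy = accuracy(axis_parallel_rule)
diagonal_accuracy = accuracy(diagonal_rule)

print(f"horizontal/vertical cuts: {axis_accuracy:.2f}")
print(f"single diagonal line:     {diagonal_accuracy:.2f}")
```

On these records the axis-parallel cuts misclassify two buyers (one with a high income but a small lot, one with a large lot but a modest income), while the diagonal line classifies all eight correctly.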

FIGURE 1.1 TWO METHODS FOR SEPARATING BUYERS FROM NONBUYERS

1.6 Terminology and Notation

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or target variable. To a statistician, it is the dependent variable or the response. Here is a summary of terms used:

Algorithm: Refers to a specific procedure used to implement a particular data mining technique: classification tree, discriminant analysis, and the like.

Attribute: See Predictor.

Case: See Observation.

Confidence: Has a specific meaning in association rules of the type “IF A and B are purchased, C is also purchased.” Confidence is the conditional probability that C will be purchased IF A and B are purchased.

Confidence: Also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent Variable: See Response.

Estimation: See Prediction.

Feature: See Predictor.

Holdout Sample: Is a sample of data not used in fitting a model, used to assess the performance of that model; this book uses the term validation set or, if one is used in the problem, test set instead of holdout sample.

Input Variable: See Predictor.

Model: Refers to an algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).

Observation: Is the unit of analysis on which the measurements are taken (a customer, a transaction, etc.); also called case, record, pattern, or row. (Each row typically represents a record; each column, a variable.)

Outcome Variable: See Response.

Output Variable: See Response.

P(A | B): Is the conditional probability of event A occurring given that event B has occurred. Read as “the probability that A will occur given that B has occurred.”

Pattern: Is a set of measurements on an observation (e.g., the height, weight, and age of a person).

Prediction: The prediction of the value of a continuous output variable; also called estimation.

Predictor: Usually denoted by X, is also called a feature, input variable, independent variable, or from a database perspective, a field.

Record: See Observation.

Response: Usually denoted by Y, is the variable being predicted in supervised learning; also called dependent variable, output variable, target variable, or outcome variable.

Score: Refers to a predicted value or class. Scoring new data means to use a model developed with training data to predict output values in new data.

Success Class: Is the class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase).

Supervised Learning: Refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.

Test Data (or test set): Refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.

Training Data (or training set): Refers to that portion of data used to fit a model.

Unsupervised Learning: Refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (e.g., whether it falls into clusters).

Validation Data (or validation set): Refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.

Variable: Is any measurement on the records, including both the input (X) variables and the output (Y) variable.
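Two of the glossary entries, Confidence (in the association-rule sense) and P(A | B), describe the same calculation. A minimal sketch, assuming six invented transactions and a hypothetical helper named `confidence`:

```python
# Invented transactions; each is the set of items bought together.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C", "D"},
    {"B", "C"},
    {"A", "C"},
    {"A", "B", "C"},
]

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions.

    Counts the transactions containing the antecedent items, then the share
    of those that also contain the consequent items. Assumes at least one
    transaction contains the antecedent.
    """
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

# Rule: IF A and B are purchased, C is also purchased.
rule_confidence = confidence({"A", "B"}, {"C"})
print(f"confidence = P(C | A and B) = {rule_confidence:.2f}")
```

Four of the six transactions contain both A and B, and three of those four also contain C, so the rule's confidence here is 3/4.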

FIGURE 1.2 DATA MINING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES INDICATE CHAPTER NUMBERS

1.7 Road Maps to This Book

The book covers many of the widely used predictive and classification methods as well as other data mining tools. Figure 1.2 outlines data mining from a process perspective and where the topics in this book fit in. Chapter numbers are indicated beside the topic. Table 1.1 provides a different perspective: It organizes data mining procedures according to the type and structure of the data.

TABLE 1.1 ORGANIZATION OF DATA MINING METHODS IN THIS BOOK, ACCORDING TO THE

a Numbers in parentheses indicate chapter number.

Order of Topics

The book is divided into seven parts: Part I (Chapters 1–2) gives a general overview of data mining and its components. Part II (Chapters 3–4) focuses on the early stage of data exploration and dimension reduction in which typically the most effort is expended.

Part III (Chapter 5) discusses performance evaluation. Although it contains a single chapter, we discuss a variety of topics, from predictive performance metrics to misclassification costs. The principles covered in this part are crucial for the proper evaluation and comparison of supervised learning methods.

Part IV includes seven chapters (Chapters 6–12), covering a variety of popular supervised learning methods (for classification and/or prediction). Within this part, the topics are generally organized according to the level of sophistication of the algorithms, their popularity, and ease of understanding.

Part V focuses on unsupervised learning, presenting association rules (Chapter 13) and cluster analysis (Chapter 14).

Part VI includes three chapters (Chapters 15–17), with the focus on forecasting time series. The first chapter covers general issues related to handling and understanding time series. The next two chapters present two popular forecasting approaches: regression-based forecasting and smoothing methods.

Finally, Part VII includes a set of cases.

Although the topics in the book can be covered in the order of the chapters, each chapter stands alone. It is advised, however, to read Parts I–III before proceeding to the chapters in Parts IV–V, and similarly Chapter 15 should precede other chapters in Part VI.

USING XLMINER SOFTWARE

To facilitate hands-on data mining experience, this book comes with access to XLMiner, a comprehensive data mining add-in for Excel. For those familiar with Excel, the use of an Excel add-in dramatically shortens the software learning curve. XLMiner will help you get started quickly on data mining and offers a variety of methods for analyzing data. The illustrations, exercises, and cases in this book are written in relation to this
