Foreword
Preface to the second edition
Preface to the first edition
Acknowledgments
PART I PRELIMINARIES
Chapter 1 Introduction
1.1 What Is Data Mining?
1.2 Where Is Data Mining Used?
1.3 Origins of Data Mining
1.4 Rapid Growth of Data Mining
1.5 Why Are There So Many Different Methods?
1.6 Terminology and Notation
1.7 Road Maps to This Book
Chapter 2 Overview of the Data Mining Process
2.1 Introduction
2.2 Core Ideas in Data Mining
2.3 Supervised and Unsupervised Learning
2.4 Steps in Data Mining
Chapter 3 Data Visualization
3.1 Uses of Data Visualization
4.7 Principal Components Analysis
4.8 Dimension Reduction Using Regression Models
4.9 Dimension Reduction Using Classification and Regression Trees
PROBLEMS
PART III PERFORMANCE EVALUATION
Chapter 5 Evaluating Classification and Predictive Performance
5.1 Introduction
5.2 Judging Classification Performance
5.3 Evaluating Predictive Performance
6.2 Explanatory versus Predictive Modeling
6.3 Estimating the Regression Equation and Prediction
6.4 Variable Selection in Linear Regression
PROBLEMS
Chapter 7 k-Nearest Neighbors (k-NN)
7.1 k-NN Classifier (categorical outcome)
7.2 k-NN for a Numerical Response
7.3 Advantages and Shortcomings of k-NN Algorithms
PROBLEMS
Chapter 8 Naive Bayes
8.1 Introduction
8.2 Applying the Full (Exact) Bayesian Classifier
8.3 Advantages and Shortcomings of the Naive Bayes Classifier
9.6 Classification Rules from Trees
9.7 Classification Trees for More Than Two Classes
10.2 Logistic Regression Model
10.3 Evaluating Classification Performance
10.4 Example of Complete Analysis: Predicting Delayed Flights
10.5 Appendix: Logistic Regression for Profiling
PROBLEMS
Chapter 11 Neural Nets
11.1 Introduction
11.2 Concept and Structure of a Neural Network
11.3 Fitting a Network to Data
11.4 Required User Input
11.5 Exploring the Relationship Between Predictors and Response
11.6 Advantages and Weaknesses of Neural Networks
PROBLEMS
Chapter 12 Discriminant Analysis
12.1 Introduction
12.2 Distance of an Observation from a Class
12.3 Fisher’s Linear Classification Functions
12.4 Classification Performance of Discriminant Analysis
12.5 Prior Probabilities
12.6 Unequal Misclassification Costs
12.7 Classifying More Than Two Classes
12.8 Advantages and Weaknesses
13.3 Generating Candidate Rules
13.4 Selecting Strong Rules
14.3 Measuring Distance Between Two Clusters
14.4 Hierarchical (Agglomerative) Clustering
14.5 Nonhierarchical Clustering: The k-Means Algorithm
PROBLEMS
PART VI FORECASTING TIME SERIES
Chapter 15 Handling Time Series
15.1 Introduction
15.2 Explanatory versus Predictive Modeling
15.3 Popular Forecasting Methods in Business
15.4 Time Series Components
15.5 Data Partitioning
PROBLEMS
Chapter 16 Regression-Based Forecasting
16.1 Model with Trend
16.2 Model with Seasonality
16.3 Model with Trend and Seasonality
16.4 Autocorrelation and ARIMA Models
Chapter 17 Smoothing Methods
17.1 Introduction
17.2 Moving Average
17.3 Simple Exponential Smoothing
17.4 Advanced Exponential Smoothing
18.3 Tayko Software Cataloger
18.4 Segmenting Consumers of Bath Soap
Index
To our families
Boaz and Noa
Tehmi, Arjun, and in memory of Aneesh
Liz, Lisa, and Allison
Copyright 2010 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Shmueli, Galit, 1971-
Data mining for business intelligence: concepts, techniques, and applications in Microsoft Office Excel with XLMiner / Galit Shmueli, Nitin R. Patel, Peter C. Bruce. – 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-52682-8 (cloth)
1. Business–Data processing. 2. Data mining. 3. Microsoft Excel (Computer file) I. Patel, Nitin R. (Nitin Ratilal) II. Bruce, Peter C., 1953- III. Title.
HF5548.2.S44843 2010
005.54–dc22
2010005152
Foreword
Data mining, the art of extracting useful information from large amounts of data, is of growing importance in today’s world. Your e-mail spam filter relies at least in part on rules that a data mining algorithm has learned from examining millions of e-mail messages that have been classified as spam or not spam. Real-time data mining methods enable Web-based merchants to tell you that “customers who purchased x are also likely to purchase y.” Data mining helps banks determine which applicants are likely to default on loans, helps tax authorities identify which tax returns are most likely to be fraudulent, and helps catalog merchants target those customers most likely to purchase.
And data mining is not just about numbers: text mining techniques help search engines like Google and Yahoo find what you are looking for by ordering documents according to their relevance to your query. In the process they have effectively monetized search by ordering sponsored ads that are relevant to your query.
The amount of data flowing from, to, and through enterprises of all sorts is enormous and growing rapidly, more rapidly than the capabilities of organizations to use it. Successful enterprises are those that make effective use of the abundance of data to which they have access: to make better predictions, better decisions, and better strategies. The margin over a competitor may be small (they, after all, have access to the same methods for making effective use of information), hence the need to exploit every possible avenue to advantage.
At no time has the need been greater for quantitatively skilled managerial expertise. Successful managers now need to know about the possibilities and limitations of data mining. But at what level? A high-level overview can provide a general idea of what data mining can do for the enterprise but fails to provide the intuition that could be attained by actually building models with real data. A very technical approach from the computer science, database, or statistical standpoint can get bogged down in detail that has little bearing on decision making.
It is essential that managers be able to translate business or other functional problems into the appropriate statistical problem before it can be “handed off” to a technical team. But it is difficult for managers to do this with confidence unless they have actually had hands-on experience developing models for a variety of real problems using real data. That is the perspective of this book: the use of real data, actual cases, and an Excel-based program to build and compare models with a minimal learning curve.
DARYL PREGIBON
Google Inc., 2006
Preface to the Second Edition
Since the book’s appearance in early 2007, it has been used in many classes, ranging from dedicated data mining classes to more general business intelligence courses. Following feedback from instructors teaching both MBA and undergraduate courses, as well as students, we revised some of the existing chapters and added coverage of two new topics that are central in data mining: data visualization and time series forecasting.
We have added a set of three chapters on time series forecasting (Chapters 15–17), which present the most commonly used forecasting tools in the business world. They include a set of new datasets and exercises, and a new case (in Chapter 18).
The chapter on data visualization provides comprehensive coverage of basic and advanced visualization techniques that support the exploratory step of data mining. We also provide a discussion of interactive visualization principles and tools, and the chapter exercises include assignments to familiarize readers with interactive visualization in practice.
In the new edition we have created separate chapters for the k-nearest-neighbor and naive Bayes methods. The explanation of the naive Bayes classifier is now clearer, and additional exercises have been added to both chapters. Another addition is the brief summary at the beginning of each chapter.
We have also reorganized the order of some chapters, following readers’ feedback. The chapters are now grouped into seven parts: Preliminaries, Data Exploration and Dimension Reduction, Performance Evaluation, Prediction and Classification Methods, Mining Relationships Among Records, Forecasting Time Series, and Cases. The new organization is aimed at helping instructors of various types of courses to choose subsets of topics to teach.
Two-semester data mining courses could cover in detail data exploration and dimension reduction and supervised learning in one term (choosing the type and number of prediction and classification methods according to the course flavor and the audience interest). Forecasting time series and unsupervised learning can be covered in the second term.
Single-semester data mining courses would do best to concentrate on the first parts of the book, and only introduce time series forecasting as time allows. This is especially true if a dedicated forecasting course is offered in the program.
General business intelligence courses would best focus on the first three parts, then choose a small number of prediction/classification methods for illustration, and present the mining relationships chapters. All these can be covered via a few cases, where students read the relevant chapters that support the analysis done in the case.
A set of data mining courses that constitute a concentration can be built according to the sequence of parts in the book. The first three parts (Preliminaries, Data Exploration and Dimension Reduction, and Performance Evaluation) should serve as requirements for the next courses. Cases can be used either within appropriate topic courses or as project-type courses.
In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many data mining competition datasets available). From our experience and other instructors’ experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of data mining and the challenges that arise in the process.
Preface to the First Edition
This book arose out of a data mining course at MIT’s Sloan School of Management and was refined during its use in data mining courses at the University of Maryland’s R. H. Smith School of Business and at statistics.com. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor’s goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.
Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:
1. To provide both a theoretical and a practical understanding of the key methods of classification, prediction, reduction, and exploration that are at the heart of data mining.
2. To provide a business decision-making context for these methods.
3. Using real business cases, to illustrate the application and interpretation of these methods.
The presentation of the cases in the book is structured so that the reader can follow along and implement the algorithms on his or her own with a very low learning hurdle.
Just as a natural science course without a lab component would seem incomplete, a data mining course without practical work with actual data is missing a key ingredient. The MIT data mining course that gave rise to this book followed an introductory quantitative course that relied on Excel; this made its practical work universally accessible. Using Excel for data mining seemed a natural progression.
An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative datasets) are provided in an Excel add-in, XLMiner. Data for both the
www.dataminingbook.com
Although the genesis for this book lay in the need for a case-oriented guide to teaching data mining, analysts and consultants who are considering the application of data mining techniques in contexts where they are not currently in use will also find this a useful, practical guide.
Acknowledgments
The authors thank the many people who assisted us in improving the first edition and improving it further in the second edition. Anthony Babinec, who has been using drafts of this book for years in his data mining courses at statistics.com, provided us with detailed and expert corrections. Similarly, Dan Toy and John Elder IV greeted our project with enthusiasm and provided detailed and useful comments on earlier drafts. Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on both editions; Bruce McCullough and Adam Hughes did the same for the first edition. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business, provided invaluable comments and helpful suggestions. Useful comments and feedback have also come from the many instructors, too numerous to mention, who have used the book in their classes.
From the Smith School of Business at the University of Maryland, colleagues Shrivardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and comments. We thank Robert Windle, and MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts. And we thank the many MBA students for fruitful discussions and interesting data mining projects that have helped shape and improve the book.
This book would not have seen the light of day without the nurturing support of the faculty at the Sloan School of Management at MIT. Our special thanks to Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering. We are grateful to the MBA students at Sloan for stimulating discussions in the class that led to refinement of the notes as well as XLMiner.
Chris Albright, Gregory Piatetsky-Shapiro, Wayne Winston, and Uday Karmarkar gave us helpful advice on the use of XLMiner. Anand Bodapati provided both data and advice. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case.
We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. They helped us develop a balanced philosophical perspective on the emerging field of data mining.
Our thanks to Ajay Sathe, who energetically shepherded XLMiner’s development over the years and continues to do so, and to his colleagues on the XLMiner team: Suresh Ankolekar, Poonam Baviskar, Kuber Deokar, Rupali Desai, Yogesh Gajjar, Ajit Ghanekar, Ayan Khare, Bharat Lande, Dipankar Mukhopadhyay, S. V. Sabnis, Usha Sathe, Anurag Srivastava, V. Subramaniam, Ramesh Raman, and Sanhita Yeolkar.
Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book to the starting gate. We are also grateful to Ashwini Kumthekar, Achala Sabane, Michael Shapard, and Heidi Sestrich, who assisted with typesetting, figures, and indexing, and to Valerie Troiano, who has shepherded many instructors through the use of XLMiner and early drafts of this text.
We also thank Catherine Plaisant at the University of Maryland’s Human-Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter, Marietta Tretter at Texas A&M for her helpful comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman for feedback and suggestions on the data visualization chapter and overall design tips.
Part One
Preliminaries
Chapter 1
Introduction
1.1 What Is Data Mining?
The field of data mining is still relatively new and in a state of evolution. The first International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995, and there are a variety of definitions of data mining.
A concise definition that captures the essence of data mining is:
Extracting useful information from large data sets. (Hand et al., 2001)
A slightly longer version is:
Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Berry and Linoff, 1997, p. 5)
Berry and Linoff later had cause to regret the 1997 reference to “automatic and semi-automatic means,” feeling that it shortchanged the role of data exploration and analysis (Berry and Linoff, 2000).
Another definition comes from the Gartner Group, the information technology research firm:
[Data Mining is] the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques. (http://www.gartner.com/6_help/glossary, accessed May 14, 2010)
A summary of the variety of methods encompassed in the term data mining is given at the beginning of Chapter 2.
1.2 Where Is Data Mining Used?
Data mining is used in a variety of fields and applications. The military use data mining to learn what roles various factors play in the accuracy of bombs. Intelligence agencies might use it to determine which of a huge quantity of intercepted communications are of interest. Security specialists might use these methods to determine whether a packet of network data constitutes a threat. Medical researchers might use it to predict the likelihood of a cancer relapse.
Although data mining methods and tools have general applicability, most examples in this book are chosen from the business world. Some common business questions that one might address through data mining methods include:
1. From a large list of prospective customers, which are most likely to respond? We can use classification techniques (logistic regression, classification trees, or other methods) to identify those individuals whose demographic and other data most closely match those of our best existing customers. Similarly, we can use prediction techniques to forecast how much individual prospects will spend.
2. Which customers are most likely to commit, for example, fraud (or might already have committed it)? We can use classification methods to identify (say) medical reimbursement applications that have a higher probability of involving fraud and give them greater attention.
3. Which loan applicants are likely to default? We can use classification techniques to identify them (or logistic regression to assign a “probability of default” value).
4. Which customers are most likely to abandon a subscription service (telephone, magazine, etc.)? Again, we can use classification techniques to identify them (or logistic regression to assign a “probability of leaving” value). In this way, discounts or other enticements can be proffered selectively.
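To make the idea of assigning a “probability of default” concrete, the following minimal sketch shows how a logistic model turns a weighted score into a probability. The predictors (debt ratio, years employed), the coefficient values, and the 0.5 cutoff are all hypothetical and chosen only for illustration; in practice the coefficients would be estimated from data, as discussed in the chapter on logistic regression.

```python
import math


def logistic(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, NOT fitted to any real data.
INTERCEPT = -1.0
COEF_DEBT_RATIO = 3.0        # a higher debt-to-income ratio raises the score
COEF_YEARS_EMPLOYED = -0.2   # longer employment lowers the score


def probability_of_default(debt_ratio: float, years_employed: float) -> float:
    """Combine the predictors linearly, then squash the score into a probability."""
    score = (INTERCEPT
             + COEF_DEBT_RATIO * debt_ratio
             + COEF_YEARS_EMPLOYED * years_employed)
    return logistic(score)


# Two hypothetical applicants: (debt ratio, years employed).
applicants = {"A": (0.9, 1.0), "B": (0.2, 10.0)}
probs = {name: probability_of_default(*x) for name, x in applicants.items()}

# Flag applicants whose estimated probability reaches an assumed cutoff of 0.5.
flagged = [name for name, p in probs.items() if p >= 0.5]
print(probs, flagged)
```

The cutoff is a business choice rather than a property of the model: lowering it flags more applicants for attention, raising it flags fewer.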
1.3 Origins of Data Mining
Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. But the core tenets of classical statistics (computing is difficult and data are scarce) do not apply in data mining applications where both data and computing power are plentiful.
This gives rise to Daryl Pregibon’s description of data mining as “statistics at scale and speed” (Pregibon, 1999). A useful extension of this is “statistics at scale, speed, and simplicity.” Simplicity in this case refers not to the simplicity of algorithms but, rather, to simplicity in the logic of inference. Due to the scarcity of data in the classical statistical setting, the same sample is used to make an estimate and also to determine how reliable that estimate might be. As a result, the logic of the confidence intervals and hypothesis tests used for inference may seem elusive for many, and their limitations are not well appreciated. By contrast, the data mining paradigm of fitting a model with one sample and assessing its performance with another sample is easily understood.
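The paradigm of fitting a model with one sample and assessing it with another can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are synthetic (a straight line plus random noise), the 60/40 split is arbitrary, and the helper names `fit_line` and `rmse` are introduced here for the sketch rather than taken from the book or from XLMiner.

```python
import random

random.seed(1)

# Synthetic data for illustration only: y = 2x plus random noise.
data = [(float(x), 2.0 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)

# One sample is used to fit the model, the other to assess it.
train, valid = data[:60], data[60:]


def fit_line(rows):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(rows, slope, intercept):
    """Root-mean-squared prediction error of the fitted line on the given rows."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return (sum(squared) / len(rows)) ** 0.5


slope, intercept = fit_line(train)           # fitted on the training sample only
valid_rmse = rmse(valid, slope, intercept)   # assessed on records not used in fitting
print(round(slope, 3), round(valid_rmse, 3))
```

Because the assessment records played no part in the fitting, the error measured on them is a direct estimate of how the model might perform on new data, with no appeal to sampling theory.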
Computer science has brought us machine learning techniques, such as trees and neural networks, that rely on computational intensity and are less structured than classical statistical models. In addition, the growing field of database management is also part of the picture.
The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining. In comparison to statistics, data mining deals with large datasets in open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.
As a result, the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.
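Overfitting can be illustrated with a deliberately extreme model: one that simply memorizes the training sample. The sketch below uses synthetic data (an assumed linear signal plus random noise) and a hypothetical `memorizer` function that predicts the outcome of the closest training record. It reproduces every training value exactly, noise included, so its training error is zero, yet the same noise is carried into its predictions for records it has not seen.

```python
import random

random.seed(7)

# Synthetic data for illustration only: signal y = 2x plus random noise.
data = [(float(x), 2.0 * x + random.gauss(0, 5)) for x in range(60)]
random.shuffle(data)
train, valid = data[:40], data[40:]


def memorizer(x):
    """Predict with the outcome of the closest training record.

    On training records this returns the record's own outcome, so the model
    fits the noise as well as the signal.
    """
    return min(train, key=lambda row: abs(row[0] - x))[1]


def mse(rows, predict):
    """Mean squared prediction error over the given rows."""
    return sum((y - predict(x)) ** 2 for x, y in rows) / len(rows)


train_mse = mse(train, memorizer)   # exactly zero: every training point is reproduced
valid_mse = mse(valid, memorizer)   # positive: the memorized noise does not generalize
print(train_mse, valid_mse)
```

A perfect fit to the training sample is therefore no evidence of a good model; only the error on data held out from fitting reveals how much of the fit was noise.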
1.4 Rapid Growth of Data Mining
Perhaps the most important factor propelling the growth of data mining is the growth of data. The mass retailer Wal-Mart in 2003 captured 20 million transactions per day in a 10-terabyte database (a terabyte is 1 million megabytes). In 1950, the largest companies had only enough data to occupy, in electronic form, several dozen megabytes. Lyman and Varian (2003) estimate that 5 exabytes of information were produced in 2002, double what was produced in 1999 (1 exabyte is 1 million terabytes); 40% of this was produced in the United States.
The growth of data is driven not simply by an expanding economy and knowledge base but by the decreasing cost and increasing availability of automatic data capture mechanisms. Not only are more events being recorded, but more information per event is captured. Scannable bar codes, point-of-sale (POS) devices, mouse click trails, and global positioning system (GPS) data are examples.
The growth of the Internet has created a vast new arena for information generation. Many of the same actions that people undertake in retail shopping, exploring a library, or catalog shopping have close analogs on the Internet, and all can now be measured in the most minute detail. In marketing, a shift in focus from products and services to a focus on the customer and his or her needs has created a demand for detailed data on customers.
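The storage figures quoted at the start of this section can be put on a common scale with a few lines of arithmetic. The sketch below uses only the conversions and quantities stated in the text (1 terabyte = 1 million megabytes, 1 exabyte = 1 million terabytes); the variable names are introduced here for illustration.

```python
# Unit conversions as stated in the text (decimal units).
MB_PER_TB = 10**6   # 1 terabyte = 1 million megabytes
TB_PER_EB = 10**6   # 1 exabyte  = 1 million terabytes

# Wal-Mart's 2003 database: 10 terabytes, expressed in megabytes.
walmart_db_mb = 10 * MB_PER_TB

# Lyman and Varian (2003): 5 exabytes produced in 2002, expressed in terabytes.
info_2002_eb = 5
info_2002_tb = info_2002_eb * TB_PER_EB

# "Double what was produced in 1999" implies half as much in 1999.
info_1999_eb = info_2002_eb / 2

# 40% of the 2002 total was produced in the United States.
us_share_eb = info_2002_eb * 40 / 100

# How many 10-terabyte databases would hold the 2002 total?
databases_needed = info_2002_tb // 10

print(walmart_db_mb, info_2002_tb, info_1999_eb, us_share_eb, databases_needed)
```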
The operational databases used to record individual transactions in support of routine business activity can handle simple queries but are not adequate for more complex and aggregate analysis. Data from these operational databases are therefore extracted, transformed, and exported to a data warehouse, a large integrated data storage facility that ties together the decision support systems of an enterprise. Smaller data marts devoted to a single subject may also be part of the system. They may include data from external sources (e.g., credit rating data).
Many of the exploratory and analytical techniques used in data mining would not be possible without today’s computational power. The constantly declining cost of data storage and retrieval has made it possible to build the facilities required to store and make available vast amounts of data. In short, the rapid and continuing improvement in computing capacity is an essential enabler of the growth of data mining.
1.5 Why Are There So Many Different Methods?
As can be seen in this book or any other resource on data mining, there are many different methods for prediction and classification. You might ask yourself why they coexist and whether some are better than others. The answer is that each method has advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line.
Different methods can lead to different results, and their performance can vary. It is therefore customary in data mining to apply several different methods and select the one that is most useful for the goal at hand.
FIGURE 1.1 TWO METHODS FOR SEPARATING BUYERS FROM NONBUYERS
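The contrast between the two kinds of boundary in Figure 1.1 can be sketched with a toy example. The households below are hypothetical (they are not the data behind the figure), and both rules are written by hand rather than learned; the point is only that a rule built from horizontal and vertical cuts and a rule built from a single diagonal line can fit the same sample equally well and still disagree about a new record.

```python
# Hypothetical households: (income in $000s, lot size in 000s sq ft, is_buyer).
households = [
    (90, 20, True), (70, 22, True), (110, 17, True), (60, 24, True),
    (50, 18, False), (40, 21, False), (75, 15, False), (55, 16, False),
]


def rectangle_rule(income, lot_size):
    """Horizontal and vertical boundaries only: one threshold per variable."""
    return income > 58 and lot_size > 16


def diagonal_rule(income, lot_size):
    """A single straight-line boundary that combines both variables."""
    return income + 5 * lot_size > 170


def accuracy(rule):
    """Share of households whose buyer status the rule reproduces."""
    hits = sum(rule(income, lot) == buyer for income, lot, buyer in households)
    return hits / len(households)


acc_rectangle = accuracy(rectangle_rule)
acc_diagonal = accuracy(diagonal_rule)

# A new household close to both boundaries: the two rules classify it differently.
new_household = (59, 17)
disagree = rectangle_rule(*new_household) != diagonal_rule(*new_household)
print(acc_rectangle, acc_diagonal, disagree)
```

Since both rules separate this small sample perfectly, the sample alone cannot say which is better; that is one reason for trying several methods and comparing them on data not used for fitting.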
1.6 Terminology and Notation
Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or target variable. To a statistician, it is the dependent variable or the response. Here is a summary of terms used:
Algorithm Refers to a specific procedure used to implement a particular data mining technique: classification tree, discriminant analysis, and the like.
Attribute See Predictor.
Case See Observation.
Confidence Has a specific meaning in association rules of the type “IF A and B are purchased, C is also purchased.” Confidence is the conditional probability that C will be purchased IF A and B are purchased.
Confidence Also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.
Dependent Variable See Response.
Estimation See Prediction.
Feature See Predictor.
Holdout Sample Is a sample of data not used in fitting a model, used to assess the performance of that model; this book uses the term validation set or, if one is used in the problem, test set instead of holdout sample.
Input Variable See Predictor.
Model Refers to an algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).
Observation Is the unit of analysis on which the measurements are taken (a customer, a transaction, etc.); also called case, record, pattern, or row. (Each row typically represents a record; each column, a variable.)
Outcome Variable See Response.
Output Variable See Response.
P(A | B) Is the conditional probability of event A occurring given that event B has occurred. Read as “the probability that A will occur given that B has occurred.”
Pattern Is a set of measurements on an observation (e.g., the height, weight, and age of a person).
Prediction The prediction of the value of a continuous output variable; also called estimation.
Predictor Usually denoted by X, is also called a feature, input variable, independent variable, or, from a database perspective, a field.
Record See Observation.
Response Usually denoted by Y, is the variable being predicted in supervised learning; also called dependent variable, output variable, target variable, or outcome variable.
Score Refers to a predicted value or class. Scoring new data means to use a model developed with training data to predict output values in new data.
Success Class Is the class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase).
Supervised Learning Refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.
Test Data (or test set) Refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.
Training Data (or training set) Refers to that portion of data used to fit a model.
Unsupervised Learning Refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (e.g., whether it falls into clusters).
Validation Data (or validation set) Refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.
Variable Is any measurement on the records, including both the input (X) variables and the output (Y) variable.
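The association-rule meaning of confidence defined above is simply a conditional probability, P(C | A and B), estimated from transaction counts. The sketch below computes it for a handful of hypothetical market baskets; the item names and the `confidence` helper are introduced here for illustration only.

```python
# Hypothetical market baskets; each set lists the items in one transaction.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) as a ratio of transaction counts."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        raise ValueError("the antecedent never occurs in the baskets")
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


# Rule: IF A and B are purchased, C is also purchased.
# A and B appear together in 4 baskets, and 3 of those also contain C.
conf_ab_c = confidence({"A", "B"}, {"C"}, transactions)
print(conf_ab_c)
```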
FIGURE 1.2 DATA MINING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES INDICATE CHAPTER NUMBERS
1.7 Road Maps to This Book
The book covers many of the widely used predictive and classification methods as well as other data mining tools. Figure 1.2 outlines data mining from a process perspective and where the topics in this book fit in. Chapter numbers are indicated beside the topic. Table 1.1 provides a different perspective: It organizes data mining procedures according to the type and structure of the data.
TABLE 1.1 ORGANIZATION OF DATA MINING METHODS IN THIS BOOK, ACCORDING TO THE
aNumbers in parentheses indicate chapter number
Order of Topics
The book is divided into seven parts: Part I (Chapters 1–2) gives a general overview of data mining and its components. Part II (Chapters 3–4) focuses on the early stage of data exploration and dimension reduction in which typically the most effort is expended.
Part III (Chapter 5) discusses performance evaluation. Although it contains a single chapter, we discuss a variety of topics, from predictive performance metrics to misclassification costs. The principles covered in this part are crucial for the proper evaluation and comparison of supervised learning methods.
Part IV includes seven chapters (Chapters 6–12), covering a variety of popular supervised learning methods (for classification and/or prediction). Within this part, the topics are generally organized according to the level of sophistication of the algorithms, their popularity, and ease of understanding.
Part V focuses on unsupervised learning, presenting association rules (Chapter 13) and cluster analysis (Chapter 14).
Part VI includes three chapters (Chapters 15–17), with the focus on forecasting time series. The first chapter covers general issues related to handling and understanding time series. The next two chapters present two popular forecasting approaches: regression-based forecasting and smoothing methods.
Finally, Part VII includes a set of cases.
Although the topics in the book can be covered in the order of the chapters, each chapter stands alone. It is advised, however, to read Parts I–III before proceeding to the chapters in Parts IV–V, and similarly Chapter 15 should precede other chapters in Part VI.
USING XLMINER SOFTWARE
To facilitate hands-on data mining experience, this book comes with access to XLMiner, a comprehensive data mining add-in for Excel. For those familiar with Excel, the use of an Excel add-in dramatically shortens the software learning curve. XLMiner will help you get started quickly on data mining and offers a variety of methods for analyzing data. The illustrations, exercises, and cases in this book are written in relation to this