Contents
Overview
Training a Data Mining Model
Building a Data Mining Model with
Browsing the Dependency Network
Lab A: Creating a Decision Tree with
Review
Module 17: Introduction to Data Mining
purpose, without the express written permission of Microsoft Corporation. If, however, your only means of access is electronic, permission to print one copy is hereby granted.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2000 Microsoft Corporation. All rights reserved.
Microsoft, BackOffice, MS-DOS, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the U.S.A. and/or other countries.
The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent any real individual, company, product, or event, unless otherwise noted.
Other product and company names mentioned herein may be the trademarks of their respective owners.
Instructor Notes
This module introduces students to data mining and explains how to build and browse data mining models by using Microsoft® SQL Server™ 2000 Analysis Services. Students will learn fundamental data mining terminology, concepts, techniques, and algorithms.
This is an overview module that focuses on the use of built-in Analysis Manager wizards. It is not intended to provide in-depth knowledge of data mining.
After completing this module, students will be able to:
• Describe data mining characteristics, applications, and modeling techniques.
• Describe the process of training a model.
• Use the online analytical processing (OLAP) Mining Model Wizard to edit, process, and explore the decision trees.
• Analyze relational data relationships in the dependency network browser.
• Describe the steps required to build a clustering model by using OLAP data.
Materials and Preparation
This section lists the required materials and preparation tasks that you need to teach this module.
Required Materials
To teach this module, you need Microsoft PowerPoint® file 2074A_17.ppt.
Preparation Tasks
To prepare for this module, you should:
• Read all the materials for this module.
• Read the instructor notes and margin notes.
• Practice combining the lecture with the demonstrations.
• Complete the lab.
• Review the Trainer Preparation presentation for this module on the Trainer Materials compact disc.
• Review any relevant white papers that are located on the Trainer Materials compact disc.
Presentation: 40 Minutes
Lab: 20 Minutes
Demonstration: Determining Why Students Attend College
The following demonstration procedures provide information that will not fit in the margin notes or is not appropriate for student notes.
In this demonstration, you will create a data mining model by using a decision tree with relational data. Specifically, you will create a decision tree that determines why students attend college.
You will create a new OLAP database with a data source connecting to the Module 17 relational database.
1. In Analysis Manager, expand the Analysis Servers folder, right-click your local server, and then click New Database.
2. Enter Module 17 as the database name, and then click OK.
3. Expand the Module 17 database, right-click the Data Sources folder, and then click New Data Source.
4. On the Provider tab of the Data Link Properties dialog box, click Microsoft OLE DB Provider for SQL Server, and then click Next.
5. In Step 1, type localhost
6. In Step 2, click Use Windows NT Integrated security.
7. In Step 3, click Module 17 in the list of databases, and then click OK.
In this procedure, you will create the data mining model by selecting the source, case table, data mining technique, and key column.
1. In the Module 17 database, right-click the Mining Models folder, and then click New Mining Model.
2. On the welcome page, click Next.
3. In the Select source type step of the Mining Model Wizard, click Relational data, and then click Next.
Point out that either relational tables or OLAP cubes can be used as source data. For this model, you are accessing relational data.
4. In the Select case tables step, in the Available tables list, click College Plans, and then click Next.
5. In the Select data mining technique step, in the Technique list, click Microsoft Decision Trees, and then click Next.
Two algorithms ship with Analysis Services: Microsoft Decision Trees and Microsoft Clustering. Use the Decision Trees algorithm for this demonstration.
6. In the Select the key column step, in the Case key column list, click StudentID, and then click Next.
Demonstration: 10 Minutes
To select input and predictable columns for the mining model
1. In the Select input and predictable columns step of the Mining Model Wizard, in the Available columns list, click CollegePlans at the bottom of the column list.
2. Click the top arrow (>) to choose CollegePlans as a predictable column.
3. In the Available columns list, click Gender, and then click the bottom arrow (>) to choose that column as an input column.
4. In the Available columns list, click ParentIncome, and then click the bottom arrow (>) to choose that column as an input column.
5. In the Available columns list, click IQ, and then click the bottom arrow (>) to select that column as an input column.
6. In the Available columns list, click ParentEncouragement, and then click the bottom arrow (>) to select that column as an input column. Click Next.
In this procedure, you name the model, initiate processing, and then close the wizard.
1. In the Finish the mining model wizard step, in the Model name box, type CollegePlans
2. Click Finish to create and process the model.
3. When the model has completed processing, click Close to close the Process dialog box.
1. In the Relational Mining Model Editor, click the Content tab.
2. In the Content Detail pane, click the All node.
View the Totals tab of the Attributes pane, and point out that more than 67 percent of the students interviewed do not plan to attend college.
3. Click the Parent Encouragement = Encouraged node.
Point out to the students that parental encouragement is the most dominant attribute in this model. More than 57 percent of students who are encouraged by their parents plan to attend college.
4. Click the Parent Encouragement = Not Encouraged node.
Fewer than 7 percent of students who are not encouraged by their parents plan to attend college.
5. Close the Relational Mining Model Editor.
Module Strategy
Use the following strategy to present this module:
The structure of this module is multiple demonstrations showing students how to build and browse various types of data mining models. Except for the first example about students attending college, the demonstrations are documented directly in the student manual. Integrate your lecture with live demonstration following the procedures included in the student notes. Encourage students to follow along with your demonstrations on their computers. Some students may choose to watch your demonstrations only, which is also acceptable.
• Introducing Data Mining
  The case study introduces students to data mining. Data mining may be new to many students and should be described in very simple terms, highlighting the business applications and uses. Emphasize to students why this technology is useful and complementary to the other forms of analysis they have been exposed to. Then describe the various data mining techniques that are available.
• Training a Data Mining Model
  Describe the process required to create a data mining model. Define training data and cases.
• Building a Data Mining Model with OLAP Data
  Introduce students to the membership card scenario. Use the membership card scenario to step students through the process of building a data mining model with OLAP data by using the Mining Model Wizard. Describe each step in the process: selecting the data mining technique, selecting the case, selecting the training data, creating a dimension and virtual cube, and browsing the data mining model.
• Browsing the Dependency Network
  Demonstrate how to browse the dependency network. Explain that the Dependency Network Browser can be used to view all the relationships in your model.
Overview
• Introducing Data Mining
• Training a Data Mining Model
• Building a Data Mining Model with OLAP Data
This module provides you with an introduction to Microsoft® SQL Server™ 2000 Analysis Services Data Mining.
The objective of the module is to introduce you to both data mining principles and applications while exploring the Analysis Services wizard-driven interface for creating data mining models.
After completing this module, you will be able to:
• Describe data mining characteristics, applications, and modeling techniques.
• Describe the process of training a model.
• Use the online analytical processing (OLAP) Mining Model Wizard to edit, process, and explore the decision trees.
• Analyze relational data relationships in the dependency network browser.
• Describe the steps required to build a clustering model by using OLAP data.
In this module, you will learn about data mining, how data mining can be used to address business application requirements, and how to create data mining models by using the Analysis Manager.
Introducing Data Mining
• Defining Data Mining
• Data Mining Applications
• Data Mining Models
• Introductory Example
• Exploring the Decision Tree
This section introduces data mining concepts, including:
• Defining data mining.
• Discussing how data mining can be applied to common business applications.
• Describing what data mining models are available.
• Presenting a simple example of how data mining can be used.
• Exploring the decision tree.
Topic Objective
To introduce the concept of data mining.
Lead-in
In this section, you will be introduced to a simple case study example. In that example, data mining will be defined, common applications and techniques discussed, and its role in the data warehouse explored.
Defining Data Mining
• Is the Process of Deducing Meaningful Patterns and Rules from Large Quantities of Data
• Searches for Patterns in Data Rather than Answering Predefined Questions
• Is Used To:
  • Provide historical insights
  • Predict future values or outcomes
  • Close the loop for analysis
In many organizations, data volumes are so large that it is difficult, even for the most seasoned analyst, to identify the key information most relevant to managing the business.
Data mining is the automatic or semi-automatic process of deducing meaningful patterns and rules from large quantities of data. These patterns provide valuable insights to business managers and offer information that may be overlooked by more traditional manual methods of analysis.
Data mining programs search for patterns in data rather than answer predefined questions. Because of this, they can be used for knowledge discovery in addition to hypothesis testing.
Data mining is used to:
• Provide insight into historical data.
• Predict future values or outcomes based on historical patterns.
• Close the analysis loop by taking action based on the information derived from the analysis.
Topic Objective
To provide a definition of data mining.
Lead-in
Data mining provides a means by which the system deduces knowledge from the data by identifying correlations and other patterns in the data.
Data Mining Applications
• Advertising on the Internet
  • “What banner will I display to this visitor?”
  • “What other products is this customer likely to buy?”
• Detecting Fraud
  • “Is this insurance claim a fraud?”
• Pricing Insurance
  • “How much of a discount will I offer to this customer?”
• Managing Credit Risk
  • “Will I approve the loan for this customer?”
Data mining techniques are used in a variety of applications. This section provides some interesting examples.
Advertising on the Internet
You can use data mining to classify groups of customers with similar information into segments for targeting advertising or special offers.
Following are two Internet customer examples:
• An e-commerce Web site sells sporting equipment. When a customer registers, a database management system collects information about the customer, such as gender, marital status, favorite sport, and age.
  By using data mining techniques, the Web site displays a masculine banner ad with a golfing motif for the male, golf-loving, 40-year-old who returns to the Web site after registering.
• When you purchase merchandise on the Internet, you are sometimes offered additional merchandise that the Web site predicts you might be interested in, for example, a book similar to the one you are currently purchasing. Such recommendations are based on data mining techniques that search out purchase patterns of customers who purchased the same book you are now buying. The system recommends: “If you like xyz books, check out the additional books below.”
Detecting Fraud
You can use a data mining system to identify characteristics of suspicious insurance claims by analyzing characteristics of legitimate and fraudulent claims. For example, specific types of injuries that are difficult to diagnose, such as neck and back injuries, may be more likely candidates for a fraudulent claim.
applications. We are now going to talk about some common uses.
Delivery Tips
Incorporate your own examples of how data mining is used to solve business problems. Ask students for examples from their businesses.
Point out that data mining is no longer an art used by just PhDs. This technology is available and useful to a variety of businesses.
Pricing Insurance
In the insurance industry, you use data mining techniques to analyze historical data such as age, marital status, gender, and driving history. All these factors play a role in predicting the likelihood that a specific driver will get into an automobile accident. Data mining techniques help you to weigh and factor these data points into pricing for an individual insurance policy.
Managing Credit Risk
When you apply for a loan, the bank collects a broad range of information about you, such as income, years of employment at your current job, marital status, and credit standing.
By using data mining techniques applied to historical loan application information, the bank can predict whether you are a good or bad credit risk and can use this information when deciding on loan approval.
Data Mining Models
• Analysis Services Models
Analysis Services Models
Analysis Services includes two data mining techniques: Microsoft Clustering and Microsoft Decision Trees.
Clustering
You use the clustering technique to group data records that are similar to each other. Clustering is sometimes loosely associated with nearest-neighbor methods, although k-nearest neighbor is, strictly speaking, a classification technique. You often use clustering as the starting point for market or customer analysis.
For example, you may want to segment your market so that you can offer customized programs and pricing to specific customer groups. With clustering, you can segment your customers into groups with similar characteristics.
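The segmentation idea can be illustrated with a minimal k-means sketch. This is an assumption made for illustration only: the module does not say which algorithm Microsoft Clustering uses internally, and the customer figures (age, yearly income in thousands) are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as deterministic starting centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(v) / len(members) for v in zip(*members))
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels


# Hypothetical customers: (age, yearly income in thousands).
CUSTOMERS = [(25, 30), (27, 32), (24, 28), (55, 90), (58, 95), (60, 88)]
centroids, labels = kmeans(CUSTOMERS, k=2)
print(labels)
```

Customers that share a label fall in the same segment and could be offered the same program or pricing.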
Decision Trees
Decision trees are a popular method for both classifying and predicting. By using a series of questions and rules to categorize data cases, you can predict the likelihood of certain types of cases having a specific outcome.
For example, insurance companies use a decision tree to predict the likelihood of high claims by analyzing statistical data organized by a set of rules.
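As a rough sketch of how a series of questions leads to a predicted likelihood, the following code walks a small hand-written tree. The questions, thresholds, field names (prior_claims, age), and probabilities are all hypothetical and are not taken from any real insurance model.

```python
# Each inner node asks a yes/no question; each leaf holds a likelihood.
# All rules and numbers below are hypothetical, for illustration only.
TREE = {
    "question": lambda case: case["prior_claims"] >= 2,
    "yes": {"probability_high_claim": 0.60},
    "no": {
        "question": lambda case: case["age"] < 25,
        "yes": {"probability_high_claim": 0.30},
        "no": {"probability_high_claim": 0.10},
    },
}


def predict(node, case):
    """Follow the questions from the root until a leaf is reached."""
    while "question" in node:
        node = node["yes"] if node["question"](case) else node["no"]
    return node["probability_high_claim"]


print(predict(TREE, {"prior_claims": 3, "age": 40}))
print(predict(TREE, {"prior_claims": 0, "age": 22}))
```

In a mined model the questions and leaf likelihoods are derived from the training data rather than written by hand.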
Topic Objective
To describe different data mining models and how they apply to data analysis.
Lead-in
A variety of data mining models are available. These techniques represent different approaches to classification and prediction.
Delivery Tip
Do not spend much time describing the different models. Simply discuss that various models are available for analysis and that Microsoft provides two of the models in Analysis Services.
Other Models
Analysis Services provides two types of data mining models: clustering and decision trees. However, users may define their own models or use other proprietary data mining algorithms. Common data mining models include market basket analysis, memory-based reasoning, and neural networks.
Market Basket Analysis (Affinity Grouping)
Market basket analysis, sometimes called affinity grouping, is used for finding groups of items that occur frequently together in a single transaction.
For example, customers who buy gin may also purchase tonic water, which is a frequent accompaniment. Customers who buy potato chips frequently buy potato chip dips on the same shopping trip. Understanding when products sell together helps a retail store manage placement of items on shelves to maximize affinity group purchases.
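A minimal sketch of the idea is to count how often each pair of items appears in the same transaction and keep the pairs that reach a chosen threshold. The transactions and the min_count threshold below are hypothetical, and production market basket tools typically use more elaborate measures than raw pair counts.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least min_count transactions."""
    pair_counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair is counted once per transaction.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Hypothetical shopping baskets.
TRANSACTIONS = [
    ["gin", "tonic water", "limes"],
    ["gin", "tonic water"],
    ["potato chips", "chip dip"],
    ["potato chips", "chip dip", "gin"],
    ["tonic water", "limes"],
]
pairs = frequent_pairs(TRANSACTIONS, min_count=2)
print(pairs)
```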
Memory-Based Reasoning
Memory-based reasoning (MBR) is a directed data mining technique that is used for prediction and classification. MBR identifies the known instances that are the nearest neighbors of an unknown instance and, from that information, makes predictions about the unknown instance.
For example, if a patient exhibits a series of symptoms, doctors apply their experience with similar patients to diagnose the current case. The doctors perform their diagnoses by using a form of MBR.
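The nearest-neighbor idea behind MBR can be sketched as a simple majority vote among the k closest known cases. The patient records (temperature in degrees Celsius, days of coughing), the labels, and the choice of k = 3 are hypothetical and serve only to illustrate the technique.

```python
from collections import Counter


def classify_by_neighbors(known_cases, new_case, k=3):
    """Label a new case by majority vote among its k nearest known cases."""
    def squared_distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, new_case))

    by_distance = sorted(known_cases, key=lambda item: squared_distance(item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical known patients: ((temperature in C, days of coughing), diagnosis).
KNOWN_PATIENTS = [
    ((38.5, 5), "flu"),
    ((39.0, 4), "flu"),
    ((38.8, 6), "flu"),
    ((36.8, 1), "cold"),
    ((37.0, 2), "cold"),
    ((36.9, 3), "cold"),
]
print(classify_by_neighbors(KNOWN_PATIENTS, (38.7, 5)))
print(classify_by_neighbors(KNOWN_PATIENTS, (37.1, 2)))
```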
Neural Networks
Suppose you want to sell your car. Several factors affect the sales price, such as the age of the car, its condition, its manufacturer and model, and so forth. By analyzing historical car prices, a neural network can relate a series of input factors to an output factor in order to predict the sales price.
Summary of Models
The following table defines commonly used data mining models and their typical usages.
Market basket analysis (affinity grouping)
Clustering or affinity grouping
Introductory Example
Why Do High School Students Attend College?
A survey was conducted recently in the United States asking high-school seniors to answer the following five questions:
1. What is your gender?
2. What is your parents' income?
3. What is your IQ?
4. Do your parents encourage or not encourage you to go to college?
5. Do you plan to attend college?
Data from the survey was compiled into a table shown in the preceding illustration.
Glancing at the table, you cannot easily determine how many students plan to attend college and how many do not. You can see that roughly 50 percent will attend based on the first 22 records of this file. This result may or may not be representative of the whole set of 9,000 cases.
To determine how many students plan to attend college, you can execute a query that counts students grouped by those planning on attending and those not planning on attending.
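Such a counting query might look like the following sketch, which loads a few rows into an in-memory SQLite table and groups them by the CollegePlans column. The column names follow the demonstration, but the sample rows, the stored values of CollegePlans, and the use of SQLite in place of the Module 17 SQL Server database are assumptions made for illustration.

```python
import sqlite3


def count_college_plans(rows):
    """Return {CollegePlans value: number of students} by using GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CollegePlans ("
        "StudentID INTEGER PRIMARY KEY, Gender TEXT, ParentIncome INTEGER, "
        "IQ INTEGER, ParentEncouragement TEXT, CollegePlans TEXT)"
    )
    conn.executemany("INSERT INTO CollegePlans VALUES (?, ?, ?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT CollegePlans, COUNT(*) FROM CollegePlans GROUP BY CollegePlans"
    )
    counts = dict(cursor.fetchall())
    conn.close()
    return counts


# Hypothetical sample rows; the survey itself holds about 9,000 cases.
SAMPLE_ROWS = [
    (1, "Male", 45000, 112, "Encouraged", "Plans to attend"),
    (2, "Female", 38000, 98, "Not Encouraged", "Does not plan to attend"),
    (3, "Female", 52000, 121, "Encouraged", "Plans to attend"),
    (4, "Male", 30000, 101, "Not Encouraged", "Does not plan to attend"),
    (5, "Male", 41000, 95, "Encouraged", "Does not plan to attend"),
]
counts = count_college_plans(SAMPLE_ROWS)
print(counts)
```

A query like this answers a predefined question; it does not reveal which attributes drive the outcome.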
Suppose you are interested in determining the attribute or combination of attributes that has the highest potential for predicting the likelihood that a student will attend college. This is a more complex question and involves segmenting the data based on the various attributes you collect.
To answer the question, you can spend several hours exploring the data manually, or you can use data mining to explore the data automatically.
Topic Objective
To introduce an example of how data mining can be used for prediction.
Lead-in
What do you think is the principal attribute for predicting whether students attend college? What, if anything, can you conclude from the information in the table?
Delivery Tips
Browse the actual relational table data when discussing the case study. You can find the CollegePlans table in the Module 17 SQL Server 2000 database.
Ask students what they think are the most dominant attributes that will predict whether a student plans to attend college.
Demonstration: Determining Why Students Attend College
In this demonstration, you will create a data mining model by using a decision tree with relational data. Specifically, you will create a decision tree that determines what causes students to attend college.
Topic Objective
To demonstrate how to create a data mining model by using a decision tree with relational data.
Lead-in
In this demonstration, you will learn how to create a decision tree that determines what causes students to attend college.
Delivery Tips
The steps for this demonstration are included in the Instructor Notes.
Encourage students to follow your demonstration on their computers.
Exploring the Decision Tree
[Figure: decision tree of college plans]
All Students: Attend College 33% Yes, 67% No
Split: Parental Encouragement?
  Parents Encourage = Yes: Attend College 57% Yes, 43% No
    High IQ: Attend College 74% Yes, 26% No
    Low IQ: Attend College 29% Yes, 71% No
  Parents Encourage = No: Attend College 6% Yes, 94% No
    Child nodes: Attend College 18% Yes, 82% No; 9% Yes, 91% No; 4% Yes, 96% No
The most dominant attribute is always the first rule in the decision tree.
• Students who received encouragement from their parents had a 57.27 percent probability of planning to attend. This is much higher than the general population. Of the students who were encouraged by their parents:
  • Those with an IQ higher than 110.25 had more than a 74 percent probability of attending college.
  • Those who also had parents with a high income were even more likely to attend college, at 77 percent.
• Students who did not receive encouragement had a very low probability, 6.22 percent, of planning to attend. Of the students who were not encouraged by their parents:
  • Those students with a very high IQ had a higher probability than those with a lower IQ. Of students with an IQ higher than 118.25, 17.96 percent plan to attend, versus 3.52 percent of students with an IQ lower than 99.25.
  • Parental income had no impact on the likelihood of planning to attend college if the student was exceptionally smart, with an IQ higher than 118.25.
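One common way to find the most dominant attribute is to pick the attribute whose split yields the largest information gain. The sketch below uses that criterion as a stand-in: the module does not state which scoring method Microsoft Decision Trees applies, and the eight cases are hypothetical rather than drawn from the survey.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of outcome labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(cases, attribute, target):
    """Reduction in entropy of the target after splitting on the attribute."""
    base = entropy([case[target] for case in cases])
    groups = defaultdict(list)
    for case in cases:
        groups[case[attribute]].append(case[target])
    remainder = sum(len(g) / len(cases) * entropy(g) for g in groups.values())
    return base - remainder


def dominant_attribute(cases, attributes, target):
    """Attribute with the highest information gain; it becomes the first split."""
    return max(attributes, key=lambda a: information_gain(cases, a, target))


def make_case(encouragement, gender, plans):
    return {"ParentEncouragement": encouragement, "Gender": gender,
            "CollegePlans": plans}


# Hypothetical cases, not taken from the survey data.
CASES = [
    make_case("Encouraged", "Male", "Yes"),
    make_case("Encouraged", "Female", "Yes"),
    make_case("Encouraged", "Male", "Yes"),
    make_case("Encouraged", "Female", "No"),
    make_case("Not Encouraged", "Male", "No"),
    make_case("Not Encouraged", "Female", "No"),
    make_case("Not Encouraged", "Male", "No"),
    make_case("Not Encouraged", "Female", "No"),
]
best = dominant_attribute(CASES, ["Gender", "ParentEncouragement"], "CollegePlans")
print(best)
```

Applying the same selection again within each branch produces the deeper levels of the tree.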
Topic Objective
To demonstrate how data mining is applied by using a decision tree.
Lead-in
Looking at all the students interviewed, roughly 33 percent plan to attend and the remaining do not plan to attend.
Delivery Tips
After switching to the slide, ask students the following question: Of the collected attributes, which do you think is most likely to have an impact on a student’s decision to attend college?
Then use the build slide to step through the results.
Switch to Analysis Manager to show the same results in the Relational Mining Model Editor.
Tip
This example demonstrates that data mining allows you to validate or discredit specific hypotheses. Data mining also helps you identify patterns that you may not expect or notice by analyzing the data manually.
Training a Data Mining Model
[Figure: training and prediction. Labels: Data, DM Engine, Mining Model; Data To Predict, DM Engine, Predicted Data]
To create a model, you must assemble a set of data where the attributes to be predicted are known. Such a data set is called the training data. During the training process, data is inserted into the data mining model. The data mining model analyzes the training data and looks for rules and patterns that can be used later to predict the values of the predictable columns.
You perform training by processing the data mining model in Analysis Manager.
The training data has two characteristics:
• It is typically historical data.
• It is statistically representative of the cases for which you are building a predictive model.
The case is the basic unit for analysis in the mining model. The case is the element that is used for classifying and grouping the data.
As depicted in the preceding illustration, the data mining engine evaluates the cases identified in the training data and creates the model based on the algorithm selected. When the model is built, it can be applied to future data to predict outcomes or classify data.
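The two phases, training on cases with known outcomes and then predicting for new cases, can be sketched with a deliberately simple stand-in model that remembers the most common outcome for each value of one input column. The function names, the cases, and the model itself are hypothetical and are far simpler than the algorithms in Analysis Services.

```python
from collections import Counter, defaultdict


def train(training_cases, input_column, predictable_column):
    """Learn the most common outcome for each value of the input column."""
    outcomes = defaultdict(Counter)
    for case in training_cases:
        outcomes[case[input_column]][case[predictable_column]] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in outcomes.items()}


def predict(model, new_cases, input_column, default=None):
    """Apply the trained model to cases whose outcome is not yet known."""
    return [model.get(case[input_column], default) for case in new_cases]


# Hypothetical historical cases in which the outcome is known.
TRAINING_CASES = [
    {"ParentEncouragement": "Encouraged", "CollegePlans": "Yes"},
    {"ParentEncouragement": "Encouraged", "CollegePlans": "Yes"},
    {"ParentEncouragement": "Encouraged", "CollegePlans": "No"},
    {"ParentEncouragement": "Not Encouraged", "CollegePlans": "No"},
    {"ParentEncouragement": "Not Encouraged", "CollegePlans": "No"},
]
# Hypothetical new cases to predict.
NEW_CASES = [
    {"ParentEncouragement": "Not Encouraged"},
    {"ParentEncouragement": "Encouraged"},
]
model = train(TRAINING_CASES, "ParentEncouragement", "CollegePlans")
predictions = predict(model, NEW_CASES, "ParentEncouragement")
print(predictions)
```

The quality of the predictions depends on how representative the training cases are of the new cases.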
Topic Objective
To explain the methodology for creating a mining model and to define terminology.
Lead-in
When creating a data mining model, you need a training data set. This is typically historical data where the attributes to be predicted are known.
Delivery Tip
Use the build slide to explain how Analysis Server evaluates training data to build a data mining model, and then uses the model to predict future outcomes based on new data sets.
Building a Data Mining Model with OLAP Data
• Introducing the Membership Card Scenario
• Selecting the Data Mining Technique
• Selecting the Case
• Selecting Predicted Entity
• Selecting Training Data
• Creating a Dimension and Virtual Cube
• Browsing the Data Mining Model
You can use the Mining Model Wizard in Analysis Manager to create a data mining model. This section uses the Membership Card scenario to demonstrate the creation of a data mining model.
Building and reviewing a data mining model entails several steps:
1. Selecting the data mining technique.
2. Identifying the case.
3. Selecting the entity to be predicted.
4. Identifying the training data.
5. Optionally creating a dimension and virtual cube from the resulting model.
6. Processing the model and browsing the results.
Topic Objective
To describe the steps used to build a data mining model with OLAP data.
Lead-in
There are a variety of steps involved in building a data mining model with OLAP data.
Introducing the Membership Card Scenario
• Identify opportunities for enhancing services at each current card level
• Market programs based on customer demographics
• Find membership card selection patterns
• Select Customer as the mined dimension
• Select the Member Card property as the pattern identifier
• Use Customer demographics to train the model
• Browse the decision tree
The Vice President of Marketing of Foodmart wants to evaluate current member card programs. To improve customer retention and satisfaction, she specifically wants to identify opportunities for enhancing services provided at each card level:
• Golden
• Silver
• Bronze
• Normal
Demographic information about customers is available. The information includes:
• Gender
• Marital status
• Yearly income
• Education level
In this card membership scenario, you will learn how historical data in the Foodmart 2000 Sales cube predicts the likelihood of customers applying for different levels of membership cards based on a variety of attributes.
wants to evaluate the current member card programs.
Delivery Tip
Use this example to describe each of the following pages in this section.