Chapter 1 ■ Introduction to Data Mining in SQL Server 2008
Figure 1-1 Student table
In contrast, the data mining approach for this problem is almost the reverse
of the query-and-explore method. Instead of guessing a hypothesis and trying
it out in different ways, you ask the question in terms of the data that can
support many hypotheses, and allow your data mining system to explore them
for you.
In this case, you indicate that the columns IQ, Gender, ParentIncome,
and ParentEncouragement are to be used as hypotheses in determining
CollegePlans. As the data mining system passes over the data, it analyzes the
influence of each input column on the target column.
Figure 1-2 shows the hypothetical result of a decision tree algorithm
operating on this data set. In this case, each path from the root node to the
leaf node forms a rule about the data. Looking at this tree, you see that
students with IQs greater than 100 who are encouraged by their parents are
highly likely to attend college. In this case, you have extracted knowledge
from the data.
As shown here, data mining applies algorithms such as decision trees,
clustering, association, time series, and so on to a data set, and then analyzes
its contents. This analysis produces patterns, which can be explored for
valuable information. Depending on the underlying algorithm, these patterns
can be in the form of trees, rules, clusters, or simply a set of mathematical
formulas. The information found in the patterns can be used for reporting (to
guide marketing strategies, for instance) and for prediction. For example, if
you could collect data about undecided students, you could select those who
are likely to be interested in continued education and preemptively market to
that audience.
Figure 1-2 Decision tree (each node shows an Attend College distribution; splits: IQ > 100 versus IQ ≤ 100, and Encouragement = Encouraged versus Encouragement = Not Encouraged; node distributions in extracted order: 55% Yes/45% No, 35% Yes/65% No, 79% Yes/21% No, 69% Yes/31% No, 94% Yes/6% No)
Business Problems for Data Mining
Data mining techniques can be used in virtually all business applications,
answering various types of business questions. In truth, given the software
available today, all you need is the motivation and the know-how. In general,
data mining can be applied whenever something could be known, but is not.
The following examples describe some scenarios:
Recommendation generation— What products or services should you
offer to your customers? Generating recommendations is an important
business challenge for retailers and service providers. Customers who
are provided appropriate and timely recommendations are likely to be
more valuable (because they purchase more) and more loyal (because
they feel a stronger relationship to the vendor). For example, if you go to
online stores such as Amazon.com or Barnesandnoble.com to purchase
an item, you are provided with recommendations about other items
you may be interested in. These recommendations are derived by
using data mining to analyze the purchase behavior of all of the retailer’s
customers, and applying the derived rules to your personal information.
Anomaly detection— How do you know whether your data is “good”
or not? Data mining can analyze your data and pick out those items that
don’t fit with the rest. Credit card companies use data mining–driven
anomaly detection to determine if a particular transaction is valid. If
the data mining system flags the transaction as anomalous, you get a
call to see if it was really you who used your card. Insurance
companies also use anomaly detection to determine if claims are fraudulent.
Because these companies process thousands of claims a day, it is
impossible to investigate each case, and data mining can identify which claims
are likely to be false. Anomaly detection can even be used to validate
data entry, checking whether the data entered is correct at the point
of entry.
Churn analysis— Which customers are most likely to switch to a
competitor? The telecom, banking, and insurance industries face severe
competition. On average, obtaining a single new mobile phone subscriber
costs more than $200. Every business would like to retain as many
customers as possible. Churn analysis can help marketing managers identify
the customers who are likely to leave and why, and as a result, they can
improve customer relations and retain customers.
Risk management— Should a loan be approved for a particular
customer? Since the subprime mortgage meltdown, this is the single most
common question in banking. Data mining techniques are used to
determine the risk of a loan application, helping the loan officer make
appropriate decisions on the cost and validity of each application.
Customer segmentation— How do you think of your customers? Are
your customers the indescribable masses, or can you learn more about
your customers to have a more intimate and appropriate discussion with
them? Customer segmentation determines the behavioral and descriptive
profiles for your customers. These profiles are then used to provide
personalized marketing programs and strategies that are appropriate for
each group.
Targeted ads— Web retailers or portal sites like to personalize their
content for their Web customers. Using navigation or online purchase
patterns, these sites can use data mining solutions to display targeted
advertisements to their Web navigators.
Forecasting— How many cases of wine will you sell next week in this
store? What will the inventory level be in one month? Data mining
forecasting techniques can be used to answer these types of time-related
questions.
Data Mining Tasks
For each question that can be asked of a data mining system, there are many
tasks that may be applied. In some cases, an answer will become obvious
with the application of a single task. In others, you will explore and combine
multiple tasks to arrive at a solution. The following sections describe the
general data mining tasks.
Classification
Classification is the most common data mining task. Business problems such
as churn analysis, risk management, and targeted advertising usually involve
classification.
Classification is the act of assigning a category to each case. Each case
contains a set of attributes, one of which is the class attribute. The task requires
finding a model that describes the class attribute as a function of input
attributes. In the College Plans data set shown in Figure 1-1, the class is the
CollegePlans attribute with two states: Yes and No. A classification model will
use the other attributes of a case (the input attributes) to determine patterns
about the class (the output attribute). Data mining algorithms that require a
target to learn against are considered supervised algorithms.
Typical classification algorithms include decision trees, neural networks, and
Naïve Bayes.
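To make the idea of describing the class attribute as a function of input attributes concrete, here is a minimal sketch of a one-level split, the basic step a decision tree repeats. The cases, the boolean columns HighIQ and Encouraged, and the scoring by training accuracy are all assumptions made for illustration; real decision tree algorithms, including the one in SQL Server, use more refined split scores.

```python
from collections import Counter, defaultdict

# Invented cases. Only the idea of the columns comes from the College Plans
# example; the values and the boolean encoding are assumptions.
cases = [
    {"HighIQ": True,  "Encouraged": True,  "CollegePlans": "Yes"},
    {"HighIQ": True,  "Encouraged": True,  "CollegePlans": "Yes"},
    {"HighIQ": True,  "Encouraged": False, "CollegePlans": "Yes"},
    {"HighIQ": True,  "Encouraged": False, "CollegePlans": "Yes"},
    {"HighIQ": False, "Encouraged": True,  "CollegePlans": "Yes"},
    {"HighIQ": False, "Encouraged": True,  "CollegePlans": "No"},
    {"HighIQ": False, "Encouraged": False, "CollegePlans": "No"},
    {"HighIQ": False, "Encouraged": False, "CollegePlans": "No"},
]


def split_accuracy(cases, attribute, target):
    """Accuracy of predicting the majority class within each attribute value."""
    groups = defaultdict(Counter)
    for case in cases:
        groups[case[attribute]][case[target]] += 1
    # In each group, the majority class is the prediction for every member.
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(cases)


def best_split(cases, inputs, target):
    """Pick the input attribute whose one-level split classifies the most cases."""
    return max(inputs, key=lambda attribute: split_accuracy(cases, attribute, target))


print(best_split(cases, ["HighIQ", "Encouraged"], "CollegePlans"))  # prints HighIQ
```

A full tree would apply the same selection again inside each resulting group, which is how a path from the root to a leaf becomes a rule.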
Clustering
Clustering is also called segmentation. It is used to identify natural groupings of
cases based on a set of attributes. Cases within the same group have more or
less similar attribute values.
Figure 1-3 shows a very simple customer data set containing two attributes:
Age and Income. The clustering algorithm groups the data set into three
segments based on these two attributes. Cluster 1 contains a younger population
with low income. Cluster 2 contains middle-aged customers with higher income.
Cluster 3 is a group of older individuals with a relatively low income.
Clustering is an unsupervised data mining task. There is no single attribute
used to guide the training process, so all input attributes are treated equally.
Most clustering algorithms build the model through a number of iterations,
and stop when the model converges (that is, the boundaries of these segments
are stabilized).
Figure 1-3 Clustering
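The iterate-until-convergence behavior described above can be sketched with plain k-means, one common clustering technique (an assumption here; the text does not name a specific algorithm). The customer values, with income in thousands, and the choice of starting centroids are invented for illustration.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Plain k-means: reassign points and recompute centroids until stable."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Hypothetical (Age, Income in thousands) pairs.
customers = [
    (22, 20), (25, 24), (27, 22),   # younger, low income
    (40, 80), (45, 90), (43, 85),   # middle-aged, higher income
    (68, 30), (72, 28), (70, 32),   # older, relatively low income
]
assignment, centroids = kmeans(customers, [customers[0], customers[3], customers[6]])
print(assignment)  # prints [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

In practice the attributes would usually be normalized first, so that a column with a large numeric range such as Income does not dominate the distance calculation.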
Association
Association is also called market basket analysis. A typical association business
problem is to analyze a sales transaction table and identify those products
that often appear in the same shopping basket. The common usage of association
is to identify common sets of items and rules for the purpose of cross-selling, as
shown in Figure 1-4.
Figure 1-4 Product association (item nodes include Juice, Beef, and Donut)
In terms of association, each piece of information is considered an item.
The association task has two goals: to find those items that appear together
frequently, and from that, to determine rules about the associations.
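The two goals can be illustrated with a small sketch: counting how often item pairs appear together, and measuring how reliably one item implies another. The baskets, the minimum support threshold, and the function names are assumptions for illustration, and production algorithms handle item sets larger than pairs.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets using the item names from Figure 1-4.
baskets = [
    {"Juice", "Beef", "Donut"},
    {"Juice", "Donut"},
    {"Beef", "Donut"},
    {"Juice", "Donut"},
    {"Beef"},
]


def frequent_pairs(baskets, min_support):
    """Count item pairs and keep those found in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing antecedent that also contain consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


print(frequent_pairs(baskets, min_support=2))
print(confidence(baskets, "Juice", "Donut"))  # prints 1.0
print(confidence(baskets, "Donut", "Juice"))  # prints 0.75
```

Note that the rule is directional: every basket with Juice also has a Donut, but not every basket with a Donut has Juice.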
Regression
The regression task is similar to classification, except that instead of looking for
patterns that describe a class, the goal is to find patterns to determine a numerical
value. Simple line-fitting techniques are an example of regression, where
the result is a function to determine the output based on the values of the
inputs. More advanced forms of regression support categorical inputs as well
as numerical inputs. The most popular techniques used for regression are
linear regression and logistic regression. Other techniques supported by SQL
Server Data Mining are regression trees (part of the Microsoft Decision Trees
algorithm) and neural networks.
Regression is used to solve many business problems. For example, you can
predict a coupon redemption rate based on the face value, distribution method,
distribution volume, and season, or predict wind velocities based on
temperature, air pressure, and humidity.
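As a sketch of the simplest case, the following fits a straight line to one numerical input using ordinary least squares. The coupon face values and redemption rates are invented, chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: coupon face value (dollars) and redemption rate (percent).
face_values = [0.5, 1.0, 1.5, 2.0]
redemption_rates = [2.0, 3.0, 4.0, 5.0]

slope, intercept = fit_line(face_values, redemption_rates)
predicted = slope * 1.25 + intercept  # predicted rate for a $1.25 coupon
print(slope, intercept, predicted)
```

With several inputs, such as distribution method and season, the same idea extends to multiple regression, where categorical inputs are first encoded as numbers.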
Forecasting
Forecasting is yet another important data mining task. What will the stock
value of Microsoft Corporation (NASDAQ symbol MSFT) be tomorrow? What
will the sales amount of wine be next month? Forecasting can help answer
these questions. As input, it takes sequences of numbers indicating a series
of values through time, and then it estimates future values of those series
using a variety of machine-learning and statistical techniques that deal with
seasonality, trending, and noisiness of data.
Figure 1-5 shows two curves. The solid line curve is the actual time-series
data on Microsoft stock value, and the dotted curve is a time-series model that
predicts values based on past values.
Figure 1-5 Time series (MSFT 3-year price history)
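One of the simplest ways to predict a value from past values is exponential smoothing, sketched below. This is an illustrative stand-in rather than the technique behind Figure 1-5: it ignores trend and seasonality, and the series and the smoothing weight alpha are assumptions.

```python
def exponential_smoothing_forecast(series, alpha):
    """One-step-ahead forecast: a weighted blend of each new value and the prior level."""
    level = series[0]
    for value in series[1:]:
        # alpha near 1 follows recent values closely; alpha near 0 smooths heavily.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical weekly sales figures.
weekly_sales = [10, 12, 14]
forecast = exponential_smoothing_forecast(weekly_sales, alpha=0.5)
print(forecast)  # prints 12.5
```

Because the series is rising steadily, the forecast lags behind the latest value, which is why practical forecasting techniques add explicit handling for trend and seasonality.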
Sequence Analysis
Sequence analysis is used to find patterns in a series of events called a sequence.
For example, a DNA sequence is a long series composed of four different states:
A, G, C, and T. A click sequence on the Web contains a series of URLs. In certain
circumstances, you may model customer purchases as a sequence of data. For
example, a customer first buys a computer, then buys speakers, and
finally buys a webcam. Sequence and time-series data are similar in that
they contain adjacent observations that are order-dependent. The difference is
that where a time series contains numerical data, a sequence contains
discrete states.
Figure 1-6 shows Web click sequences from a news website. Each node
is a URL category, and the lines represent transitions between them. Each
transition is associated with a weight, representing the probability of the
transition between one URL and another.
Figure 1-6 Web navigation sequence (nodes: Home Page, Business, Weather, Science; edges labeled with transition probabilities)
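Transition weights like those in Figure 1-6 can be estimated by counting how often one state follows another. The sketch below does this for a handful of invented click sessions; the session data and the function name are assumptions, and the probabilities it prints are not those from the figure.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each state with the state that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    probabilities = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probabilities[state] = {s: n / total for s, n in followers.items()}
    return probabilities


# Hypothetical click sessions over URL categories.
sessions = [
    ["Home", "Business", "Weather"],
    ["Home", "Weather"],
    ["Home", "Business", "Science"],
    ["Home", "Business"],
]
probabilities = transition_probabilities(sessions)
print(probabilities["Home"])  # prints {'Business': 0.75, 'Weather': 0.25}
```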
Deviation Analysis
Deviation analysis is used to find rare cases that behave very differently from
the norm. Deviation analysis is widely applicable, the most common usage
being credit card fraud detection. Identifying abnormal cases among millions
of transactions is a very challenging task. Other applications include network
intrusion detection, manufacturing error analysis, and so on.
There is no standard technique for deviation analysis. Usually, analysts
apply decision trees, clustering, or neural network algorithms for this task.
Data Mining Project Cycle
From the initial business problem formation through to deployment and
sustained management, most data mining projects pass through the same
phases.
Business Problem Formation
What are the problems you are trying to solve? What techniques are you going
to apply to solve the problem? How do you know if you will be successful?
These are important questions to ask before embarking on any project.
You may find that a simple OLAP, reporting, or data integration solution
is sufficient. A predictive or data mining solution involves determining
the unknown, relying on a belief that making sense of that unknown will add
value. This is a shaky precipice from which to begin any business endeavor.
Luckily, successful data mining solutions have been shown to have an average
return on investment (ROI) of 150 percent, which makes justification easier.
Data Collection
Business data is stored in many systems across an enterprise. For example,
at Microsoft, there are hundreds of online transaction processing (OLTP)
databases and more than 70 data warehouses. The first step is to pull the
relevant data into a database or a data mart where the data analysis is applied.
For example, if you want to analyze your website’s click stream, the first step
is to download the log data from your web servers.
Sometimes you might be lucky and find that there is already an existing
data warehouse on the subject of your analysis. However, in many cases, the
data in the data warehouse is not rich enough and must be supplemented with
additional data. For example, the log data from the web servers contains only
data about web behavior and little (if any) data about the customers. You may
need to gather customer information from other company systems or purchase
demographic data to build models that meet your business requirements.
Data Cleaning and Transformation
Data cleaning and transformation are the most resource-consuming steps in
a data mining project. The purpose of data cleaning is to remove noise and
irrelevant information from the data set. The purpose of data transformation is
to modify the source data in ways that make it useful for mining.
Various techniques are applied to clean and transform data, including the
following:
Numerical transformation— For continuous data such as income and
age, a typical transformation is to bin (or discretize) the data into buckets.
For example, you may want to bin Age into five predefined age groups.
SQL Server Data Mining has automatic discretization methods, but if
you have meaningful groupings, they may be more informative both
from a business sense and an algorithmic sense. Additionally,
continuous data is often normalized. Normalization rescales all numerical
values to fit a range (such as between 0 and 1) or to have a specific
standard deviation (such as 1).
Grouping— Discrete data often has more distinct values than are
useful. You can group these values to reduce the model complexity. For
example, the column Profession may have many different types of
engineers, such as Software Engineer, Telecom Engineer, Mechanical
Engineer, and so on. You can group all of these professions into the single
value Engineer.
Aggregation— Aggregation is an important transformation to derive
additional value from your data. Suppose you want to group customers
based on their phone usage. If the call detail record information is too
detailed for the model, you must aggregate all the calls into a few
derived attributes such as the total number of calls and the average call
duration. These derived attributes can later be used in the model.
Missing value handling— Most data sets contain missing values. This
can be caused by many different things. For example, you may have two
customer tables coming from two OLTP databases that, when merged,
have missing values because the tables are not aligned. Another example
occurs when customers don’t supply data values such as age. Another is
when you have stock market values with blanks because the markets are
closed on weekends and holidays.
Addressing missing values is important, because it is reflected in the
business value of your solution. You may need to retain the missing
data (for example, customers who refuse to report their age may have
other interesting things in common). You may need to discard the entire
record (having too many unknowns could pollute your model). Or, you
may simply be able to replace missing values with some other value
(such as the previous value for time-series data such as stock market
values, or the most popular value). For more advanced cases, you can use
data mining to predict the most likely value for each missing case.
Removing outliers— Outliers are abnormal data and can be real or (as
is often the case) errors. Abnormal data has an effect on the quality of
your results. Typically, the best way to deal with outliers is simply to
remove them before beginning the analysis. For example, you could
remove the 0.5 percent of customers with the highest or lowest incomes to
eliminate any cases of people having negative or extremely unlikely
incomes.
SQL Server Integration Services (SSIS), which is included with Microsoft SQL
Server, is an excellent tool for performing data cleaning and transformation
tasks.
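Three of the transformations described above (binning, normalization to a range, and replacing a missing value with the previous value) are simple enough to sketch directly. The function names, age boundaries, and sample values are assumptions for illustration; in a SQL Server project this work would typically be done in SSIS or in the database rather than in hand-written code.

```python
def bin_value(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries are ascending upper limits; four limits define five buckets.
    """
    for index, upper in enumerate(boundaries):
        if value < upper:
            return index
    return len(boundaries)


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - low) / (high - low) for v in values]


def forward_fill(values):
    """Replace each missing value (None) with the most recent known value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical age groups: under 20, 20-34, 35-49, 50-64, 65 and over.
age_boundaries = [20, 35, 50, 65]
print(bin_value(37, age_boundaries))                # prints 2
print(min_max_normalize([20, 30, 40]))              # prints [0.0, 0.5, 1.0]
print(forward_fill([30.1, None, None, 30.4]))       # prints [30.1, 30.1, 30.1, 30.4]
```

The forward fill mirrors the stock market case: values missing over a weekend take the last closing value. A leading missing value stays None because there is no earlier value to copy.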
Model Building
Model building is the core of data mining, though it is not as time- and
resource-intensive as data transformation. When you understand the shape of
the business problem and the type of data mining task, it is relatively easy to
pick algorithms that are suitable. Usually, you don’t know which algorithm
is the best fit for the problem until you have built the model. The accuracy
of an algorithm depends on the nature of the data. For example, a decision
tree algorithm is usually a very good choice for classification problems. However,
if the relationships among attributes are complicated, a neural network may
perform better.
A good approach is to build multiple models using different algorithms,
and then compare the accuracy of these models. Even with a single algorithm,
you can tune the parameter settings to optimize the model accuracy.
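Comparing models comes down to scoring each one against the same set of cases it did not see during training. In the sketch below, two hand-written rules stand in for trained models; the rules, the holdout cases, and the names are all hypothetical.

```python
def accuracy(model, cases, target):
    """Fraction of cases for which the model predicts the target value."""
    correct = sum(model(case) == case[target] for case in cases)
    return correct / len(cases)


# Two stand-in "models"; a real project would compare trained models instead.
def always_yes(case):
    return "Yes"


def iq_rule(case):
    return "Yes" if case["IQ"] > 100 else "No"


# Invented holdout cases that were not used to build either model.
holdout = [
    {"IQ": 120, "CollegePlans": "Yes"},
    {"IQ": 95, "CollegePlans": "No"},
    {"IQ": 110, "CollegePlans": "Yes"},
    {"IQ": 90, "CollegePlans": "Yes"},
    {"IQ": 85, "CollegePlans": "No"},
]

models = {"always_yes": always_yes, "iq_rule": iq_rule}
scores = {name: accuracy(model, holdout, "CollegePlans") for name, model in models.items()}
best = max(scores, key=scores.get)
print(scores, best)  # prints {'always_yes': 0.6, 'iq_rule': 0.8} iq_rule
```

Including a trivial baseline such as always_yes is a useful habit: a model that cannot beat it has not learned anything from the inputs.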
Model Assessment
In the model assessment stage, you use tools to determine the accuracy of
the models that were created, and you examine the models to determine the
meaning of discovered patterns and how they apply to your business. For
example, a model may determine that Relationship = Husband ➪ Gender = Male
with 100-percent confidence. Although the rule is valid, it doesn’t contain any
business value. It is very important to work with business analysts who have
the proper domain knowledge to validate the discoveries.
Sometimes, the model doesn’t contain useful patterns. This is generally
because the set of variables in the model is not the right one to solve your
business problem. You may need to repeat the data cleaning and
transformation steps, or even redefine your problem in order to derive more
meaningful variables. Data mining is an exploratory process, and it often takes
a few iterations before you find the right model.
Reporting and Prediction
In many organizations, the goal of data miners is to deliver reports to marketing
executives. SQL Server Data Mining is integrated with SQL Server Reporting
Services to generate reports directly from data mining results. Reports may
contain predictions (such as lists of customers with the highest value potential)
or the rules found in the data mining analysis.
To provide predictions, you apply the selected model against new cases of
data. Consider a banking scenario where you build a model about loan risk
prediction. Every day there are thousands of new loan applications. You can