Wiley - Data Mining with Microsoft SQL Server 2008 (2009), part 02

Figure 1-1 Student table

In contrast, the data mining approach for this problem is almost the reverse of the query-and-explore method. Instead of guessing a hypothesis and trying it out in different ways, you ask the question in terms of the data that can support many hypotheses, and allow your data mining system to explore them for you.

In this case, you indicate that the columns IQ, Gender, ParentIncome, and ParentEncouragement are to be used as hypotheses in determining CollegePlans. As the data mining system passes over the data, it analyzes the influence of each input column on the target column.

Figure 1-2 shows the hypothetical result of a decision tree algorithm operating on this data set. In this case, each path from the root node to the leaf node forms a rule about the data. Looking at this tree, you see that students who have IQs greater than 100 and are encouraged by their parents are highly likely to attend college. In this case, you have extracted knowledge from the data.

As shown here, data mining applies algorithms such as decision trees, clustering, association, time series, and so on to a data set, and then analyzes its contents. This analysis produces patterns, which can be explored for valuable information. Depending on the underlying algorithm, these patterns can be in the form of trees, rules, clusters, or simply a set of mathematical formulas. The information found in the patterns can be used for reporting (to guide marketing strategies, for instance) and for prediction. For example, if you could collect data about undecided students, you could select those who are likely to be interested in continued education and preemptively market to that audience.

Attend College: 55% Yes, 45% No (all students)
    IQ > 100: 79% Yes, 21% No
        Encouragement = Encouraged: 94% Yes, 6% No
        Encouragement = Not Encouraged: 69% Yes, 31% No
    IQ ≤ 100: 35% Yes, 65% No

Figure 1-2 Decision tree
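
To make the idea that each root-to-leaf path forms a rule more concrete, the following is a minimal Python sketch that hard-codes the hypothetical tree of Figure 1-2 as nested conditions. The function name `college_probability` and its argument names are assumptions made for this illustration and are not part of SQL Server Data Mining; the probabilities are the ones read from the figure.

```python
def college_probability(iq, encouraged):
    """Return the estimated probability that a student attends college.

    Each return statement corresponds to one root-to-leaf path
    (one rule) of the hypothetical tree in Figure 1-2.
    """
    if iq > 100:
        if encouraged:
            return 0.94  # IQ > 100 and Encouragement = Encouraged
        return 0.69      # IQ > 100 and Encouragement = Not Encouraged
    return 0.35          # IQ <= 100 (no further split is shown)


if __name__ == "__main__":
    print(college_probability(iq=115, encouraged=True))
    print(college_probability(iq=95, encouraged=False))
```

A real decision tree algorithm would learn these splits and percentages from the data; here they are simply written down to show how a learned tree can be read as a set of rules and used for prediction.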

Business Problems for Data Mining

Data mining techniques can be used in virtually all business applications, answering various types of business questions. In truth, given the software available today, all you need is the motivation and the know-how. In general, data mining can be applied whenever something could be known, but is not. The following examples describe some scenarios:

Recommendation generation: What products or services should you offer to your customers? Generating recommendations is an important business challenge for retailers and service providers. Customers who are provided appropriate and timely recommendations are likely to be more valuable (because they purchase more) and more loyal (because they feel a stronger relationship to the vendor). For example, if you go to online stores such as Amazon.com or Barnesandnoble.com to purchase an item, you are provided with recommendations about other items you may be interested in. These recommendations are derived from using data mining to analyze purchase behavior of all of the retailer’s customers, and applying the derived rules to your personal information.

Anomaly detection: How do you know whether your data is "good" or not? Data mining can analyze your data and pick out those items that don’t fit with the rest. Credit card companies use data mining-driven anomaly detection to determine if a particular transaction is valid. If the data mining system flags the transaction as anomalous, you get a call to see if it was really you who used your card. Insurance companies also use anomaly detection to determine if claims are fraudulent. Because these companies process thousands of claims a day, it is impossible to investigate each case, and data mining can identify which claims are likely to be false. Anomaly detection can even be used to validate data entry, checking to see if the data entered is correct at the point of entry.

Churn analysis: Which customers are most likely to switch to a competitor? The telecom, banking, and insurance industries face severe competition. On average, obtaining a single new mobile phone subscriber costs more than $200. Every business would like to retain as many customers as possible. Churn analysis can help marketing managers identify the customers who are likely to leave and why, and as a result, they can improve customer relations and retain customers.

Risk management: Should a loan be approved for a particular customer? Since the subprime mortgage meltdown, this is the single most common question in banking. Data mining techniques are used to determine the risk of a loan application, helping the loan officer make appropriate decisions on the cost and validity of each application.

Customer segmentation: How do you think of your customers? Are your customers the indescribable masses, or can you learn more about your customers to have a more intimate and appropriate discussion with them? Customer segmentation determines the behavioral and descriptive profiles for your customers. These profiles are then used to provide personalized marketing programs and strategies that are appropriate for each group.

Targeted ads: Web retailers or portal sites like to personalize their content for their Web customers. Using navigation or online purchase patterns, these sites can use data mining solutions to display targeted advertisements to their visitors.

Forecasting: How many cases of wine will you sell next week in this store? What will the inventory level be in one month? Data mining forecasting techniques can be used to answer these types of time-related questions.

Data Mining Tasks

For each question that can be asked of a data mining system, there are many tasks that may be applied. In some cases, an answer will become obvious with the application of a single task. In others, you will explore and combine multiple tasks to arrive at a solution. The following sections describe the general data mining tasks.

Classification

Classification is the most common data mining task. Business problems such as churn analysis, risk management, and targeted advertising usually involve classification.

Classification is the act of assigning a category to each case. Each case contains a set of attributes, one of which is the class attribute. The task requires finding a model that describes the class attribute as a function of input attributes. In the College Plans data set shown in Figure 1-1, the class is the CollegePlans attribute with two states: Yes and No. A classification model will use the other attributes of a case (the input attributes) to determine patterns about the class (the output attribute). Data mining algorithms that require a target to learn against are considered supervised algorithms.

Typical classification algorithms include decision trees, neural networks, and Naïve Bayes.
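
As an illustration of learning a class attribute from input attributes, here is a minimal sketch of a Naïve Bayes classifier over categorical attributes. It is not the Microsoft Naïve Bayes implementation; the attribute names (`HighIQ`, `Encouraged`), the six training cases, and the add-one smoothing (which assumes two possible values per attribute) are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict


def train_naive_bayes(cases, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(case[target] for case in cases)
    # value_counts[(attribute, class_label)][value] -> number of cases
    value_counts = defaultdict(Counter)
    for case in cases:
        for attribute, value in case.items():
            if attribute != target:
                value_counts[(attribute, case[target])][value] += 1
    return class_counts, value_counts


def predict(model, case):
    """Return the class label with the highest (unnormalized) posterior score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in case.items():
            seen = value_counts[(attribute, label)]
            # Add-one smoothing; "+ 2" assumes two possible values per attribute.
            score *= (seen[value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training cases; CollegePlans is the class (output) attribute.
training = [
    {"HighIQ": "Yes", "Encouraged": "Yes", "CollegePlans": "Yes"},
    {"HighIQ": "Yes", "Encouraged": "Yes", "CollegePlans": "Yes"},
    {"HighIQ": "Yes", "Encouraged": "No", "CollegePlans": "Yes"},
    {"HighIQ": "No", "Encouraged": "Yes", "CollegePlans": "No"},
    {"HighIQ": "No", "Encouraged": "No", "CollegePlans": "No"},
    {"HighIQ": "No", "Encouraged": "No", "CollegePlans": "No"},
]
model = train_naive_bayes(training, "CollegePlans")

if __name__ == "__main__":
    print(predict(model, {"HighIQ": "Yes", "Encouraged": "Yes"}))
```

The training step is supervised in the sense described above: every case carries a known `CollegePlans` value that the counts are grouped by.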

Clustering

Clustering is also called segmentation. It is used to identify natural groupings of cases based on a set of attributes. Cases within the same group have more or less similar attribute values.

Figure 1-3 shows a very simple customer data set containing two attributes: Age and Income. The clustering algorithm groups the data set into three segments based on these two attributes. Cluster 1 contains a younger population with low income. Cluster 2 contains middle-aged customers with higher income. Cluster 3 is a group of older individuals with a relatively low income.

Clustering is an unsupervised data mining task. There is no single attribute used to guide the training process, so all input attributes are treated equally. Most clustering algorithms build the model through a number of iterations, and stop when the model converges (that is, the boundaries of these segments are stabilized).

Figure 1-3 Clustering (customers plotted by Age and Income, grouped into Cluster 1, Cluster 2, and Cluster 3)
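
The iterate-until-convergence behavior described above can be sketched with k-means, one common clustering technique. This is a simplified illustration and not the Microsoft Clustering algorithm; the function name `k_means`, the sample points (Age, Income in thousands), and the starting centroids are assumptions.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Cluster 2-D points; stop when the assignments no longer change."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: the segment boundaries are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


if __name__ == "__main__":
    customers = [(22, 20), (25, 24), (27, 22),   # younger, low income
                 (42, 80), (45, 85), (48, 90),   # middle-aged, higher income
                 (66, 30), (70, 28), (68, 33)]   # older, relatively low income
    labels, centers = k_means(customers, [(20, 20), (45, 80), (70, 30)])
    print(labels)
    print(centers)
```

No attribute plays the role of a target here, which is what makes the task unsupervised. In practice the attributes would usually be normalized first so that neither Age nor Income dominates the distance calculation.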

Association

Association is also called market basket analysis. A typical association business problem is to analyze a sales transaction table and identify those products that often appear in the same shopping basket. The common usage of association is to identify common sets of items and rules for the purpose of cross-selling, as shown in Figure 1-4.

Figure 1-4 Product association (network of related products, including Juice, Beef, and Donut)

In terms of association, each piece of information is considered an item. The association task has two goals: to find those items that appear together frequently, and from that, to determine rules about the associations.
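
The two goals of the association task can be sketched as two small functions: one that counts item pairs appearing together, and one that measures how reliable a rule is. The function names, the `min_support` threshold, and the sample baskets are assumptions for illustration; this is not how the Microsoft Association Rules algorithm is implemented.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count item pairs and keep those found in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


if __name__ == "__main__":
    baskets = [
        {"Juice", "Donut"},
        {"Juice", "Donut", "Beef"},
        {"Beef", "Milk"},
        {"Juice", "Donut", "Milk"},
        {"Juice", "Beef"},
    ]
    print(frequent_pairs(baskets, min_support=3))
    print(rule_confidence(baskets, "Juice", "Donut"))
```

With these baskets, Juice and Donut appear together three times, and three of the four baskets containing Juice also contain Donut, so the rule "Juice implies Donut" has a confidence of 0.75.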

Regression

The regression task is similar to classification, except that instead of looking for patterns that describe a class, the goal is to find patterns to determine a numerical value. Simple linear line-fitting techniques are an example of regression, where the result is a function to determine the output based on the values of the inputs. More advanced forms of regression support categorical inputs as well as numerical inputs. The most popular techniques used for regression are linear regression and logistic regression. Other techniques supported by SQL Server Data Mining are regression trees (part of the Microsoft Decision Trees algorithm) and neural networks.

Regression is used to solve many business problems. For example, it can predict a coupon redemption rate based on the face value, distribution method, distribution volume, and season, or predict wind velocities based on temperature, air pressure, and humidity.
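
The simplest case mentioned above, linear line fitting with one numerical input, can be sketched with ordinary least squares. The function name `fit_line` and the coupon figures are invented for illustration; a realistic model would use several inputs, including categorical ones.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    face_value = [1.0, 2.0, 3.0, 4.0]        # hypothetical coupon face values
    redemption_rate = [3.0, 5.0, 7.0, 9.0]   # hypothetical redemption rates (%)
    slope, intercept = fit_line(face_value, redemption_rate)
    # Predict the rate for a face value not present in the data.
    print(slope * 5.0 + intercept)
```

The result of the fit is exactly the kind of function the text describes: given the input value, it returns a numerical output rather than a class.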

Forecasting

Forecasting is yet another important data mining task. What will the stock value of Microsoft Corporation (NASDAQ symbol MSFT) be tomorrow? What will the sales amount of wine be next month? Forecasting can help answer these questions. As input, it takes sequences of numbers indicating a series of values through time, and then it imputes future values of those series using a variety of machine-learning and statistical techniques that deal with seasonality, trending, and noisiness of data.

Figure 1-5 shows two curves. The solid line curve is the actual time-series data on Microsoft stock value, and the dotted curve is a time-series model that predicts values based on past values.

Figure 1-5 Time series (MSFT 3-year price history; vertical axis from 20 to 38)
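
As a small illustration of predicting a value from past values, the sketch below uses simple exponential smoothing. This technique is chosen only because it is short; it handles noise but neither trend nor seasonality, and it is not the algorithm SQL Server uses for time series. The function name, the smoothing factor `alpha`, and the price list are assumptions.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """Return the one-step-ahead forecast from simple exponential smoothing."""
    level = series[0]
    for value in series[1:]:
        # The new level blends the latest observation with the previous level.
        level = alpha * value + (1 - alpha) * level
    return level


if __name__ == "__main__":
    closing_prices = [20.0, 22.0, 24.0, 26.0]  # hypothetical daily values
    print(exponential_smoothing_forecast(closing_prices))
```

Because the example series rises steadily, the forecast lags behind the last observation, which shows why practical forecasting techniques also model trend and seasonality.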

Sequence Analysis

Sequence analysis is used to find patterns in a series of events called a sequence. For example, a DNA sequence is a long series composed of four different states: A, G, C, and T. A click sequence on the Web contains a series of URLs. In certain circumstances, you may model customer purchases as a sequence of data. For example, a customer first buys a computer, and then buys speakers, and finally buys a webcam. Both sequence and time-series data are similar in that they contain adjacent observations that are order-dependent. The difference is that where a time series contains numerical data, a sequence series contains discrete states.

Figure 1-6 shows Web click sequences from a news website. Each node is a URL category, and the lines represent transitions between them. Each transition is associated with a weight, representing the probability of the transition between one URL and another.

Figure 1-6 Web navigation sequence (nodes: Home Page, Business, Weather, Science; transitions weighted with probabilities from 0.1 to 0.3)
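
Transition weights like those in Figure 1-6 can be estimated by counting how often one state follows another. The sketch below does this for a few invented click sequences; the function name and the sample data are assumptions, and the resulting numbers are not the ones in the figure.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each state with the state that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


if __name__ == "__main__":
    clicks = [
        ["Home", "Business", "Weather"],
        ["Home", "Weather"],
        ["Home", "Business", "Science"],
        ["Home", "Science", "Weather"],
    ]
    for state, nexts in transition_probabilities(clicks).items():
        print(state, nexts)
```

The states are discrete categories rather than numbers, which is the distinction drawn above between sequence data and time-series data.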

Deviation Analysis

Deviation analysis is used to find rare cases that behave very differently from the norm. Deviation analysis is widely applicable, the most common usage being credit card fraud detection. Identifying abnormal cases among millions of transactions is a very challenging task. Other applications include network intrusion detection, manufacturing error analysis, and so on.

There is no standard technique for deviation analysis. Usually, analysts apply decision trees, clustering, or neural network algorithms for this task.

Data Mining Project Cycle

From the initial business problem formation through to deployment and sustained management, most data mining projects pass through the same phases.

Business Problem Formation

What are the problems you are trying to solve? What techniques are you going to apply to solve the problem? How do you know if you will be successful? These are important questions to ask before embarking on any project.

You may find that a simple OLAP, reporting, or data integration solution is sufficient. A predictive or data mining solution involves determining the unknown, relying on a belief that making sense of that unknown will add value. This is a shaky precipice from which to begin any business endeavor. Luckily, successful data mining solutions have been shown to have an average of 150-percent return on investment (ROI), so that makes justification easier.

Data Collection

Business data is stored in many systems across an enterprise. For example, at Microsoft, there are hundreds of online transaction processing (OLTP) databases and more than 70 data warehouses. The first step is to pull the relevant data into a database or a data mart where the data analysis is applied. For example, if you want to analyze your website’s click stream, the first step is to download the log data from your web servers.

Sometimes you might be lucky and find that there is already an existing data warehouse on the subject of your analysis. However, in many cases, the data in the data warehouse is not rich enough and must be supplemented with additional data. For example, the log data from the web servers contains only data about web behavior and little (if any) data about the customers. You may need to gather customer information from other company systems or purchase demographic data to build models that meet your business requirements.

Data Cleaning and Transformation

Data cleaning and transformation are the most resource-consuming steps in a data mining project. The purpose of data cleaning is to remove noise and irrelevant information from the data set. The purpose of data transformation is to modify the source data in ways that make it useful for mining.

Various techniques are applied to clean and transform data, including the following:

Numerical transformation: For continuous data such as income and age, a typical transformation is to bin (or discretize) the data into buckets. For example, you may want to bin Age into five predefined age groups. SQL Server Data Mining has automatic discretization methods, but if you have meaningful groupings, they may be more informative both from a business sense and an algorithmic sense. Additionally, continuous data is often normalized. Normalization maps all numerical values to a range (such as between 0 and 1) or to have a specific standard deviation (such as 1).

Grouping: Discrete data often has more distinct values than are useful. You can group these values to reduce the model complexity. For example, the column Profession may have many different types of engineers, such as Software Engineer, Telecom Engineer, Mechanical Engineer, and so on. You can group all of these professions to the single value Engineer.

Aggregation: Aggregation is an important transformation to derive additional value from your data. Suppose you want to group customers based on their phone usage. If the call detail record information is too detailed for the model, you must aggregate all the calls into a few derived attributes such as total number of calls and the average call duration. These derived attributes can later be used in the model.

Missing value handling: Most data sets contain missing values. This can be caused by many different things. For example, you may have two customer tables coming from two OLTP databases that, when merged, have missing values because the tables are not aligned. Another example occurs when customers don’t supply data values such as age. Another is when you have stock market values with blanks because the markets are closed on weekends and holidays.

Addressing missing values is important, because it is reflected in the business value of your solution. You may need to retain the missing data (for example, customers who refuse to report their age may have other interesting things in common). You may need to discard the entire record (having too many unknowns could pollute your model). Or, you may simply be able to replace missing values with some other value (such as the previous value for time-series data such as stock market values, or the most popular value). For more advanced cases, you can use data mining to predict the most likely value for each missing case.

Removing outliers: Outliers are abnormal data and can be real or (as is often the case) errors. Abnormal data has an effect on the quality of your results. The best way to deal with outliers typically is to simply remove them before beginning the analysis. For example, you could remove the 0.5 percent of customers with the highest or lowest income to eliminate any cases of people having negative or extremely unlikely incomes.
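
Several of the numerical techniques in this list are short enough to sketch directly. The following Python functions illustrate binning, normalization to the range 0 to 1, replacing a missing value with the previous value, and trimming outliers. All function names and the five age-group boundaries are assumptions made for this example, and trimming the same fraction from each end is one possible reading of the 0.5 percent rule.

```python
def bin_age(age):
    """Map an age to one of five predefined (hypothetical) age groups."""
    for upper, label in [(18, "0-17"), (30, "18-29"), (45, "30-44"), (65, "45-64")]:
        if age < upper:
            return label
    return "65+"


def min_max_normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1.

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def fill_with_previous(values):
    """Replace None with the most recent non-missing value (forward fill)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def trim_outliers(values, fraction=0.005):
    """Drop the given fraction of the lowest and of the highest values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]


if __name__ == "__main__":
    print(bin_age(29))
    print(min_max_normalize([10, 20, 30]))
    # None marks days on which the market was closed.
    print(fill_with_previous([30.1, None, None, 30.5]))
```

Forward fill suits time-series data such as stock values; for attributes like age, retaining the missing state or discarding the record may be the better choice, as discussed above.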

SQL Server Integration Services (SSIS), which is included with Microsoft SQL Server, is an excellent tool for performing data cleaning and transformation tasks.
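
Independently of any particular tool, the grouping and aggregation transformations described earlier can be prototyped in a few lines. The sketch below is plain Python, not SSIS; the function names and the record layout `(customer_id, duration_minutes)` are assumptions for illustration.

```python
from collections import defaultdict


def group_profession(profession):
    """Collapse every '... Engineer' title into the single value 'Engineer'."""
    return "Engineer" if profession.endswith("Engineer") else profession


def aggregate_calls(call_records):
    """Summarize (customer_id, duration_minutes) records per customer."""
    durations = defaultdict(list)
    for customer_id, minutes in call_records:
        durations[customer_id].append(minutes)
    return {
        customer_id: {
            "total_calls": len(minutes_list),
            "average_duration": sum(minutes_list) / len(minutes_list),
        }
        for customer_id, minutes_list in durations.items()
    }


if __name__ == "__main__":
    print(group_profession("Telecom Engineer"))
    print(aggregate_calls([("A", 2), ("A", 4), ("B", 10)]))
```

The aggregated output has one row per customer with two derived attributes, which is the shape a customer-level mining model would expect.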

Model Building

Model building is the core of data mining, though it is not as time- and resource-intensive as data transformation. When you understand the shape of the business problem and the type of data mining task, it is relatively easy to pick algorithms that are suitable. Usually, you don’t know which algorithm is the best fit for the problem until you have built the model. The accuracy of an algorithm depends on the nature of the data. For example, a decision tree algorithm is usually a very good choice for classification problems. However, if the relationships among attributes are complicated, a neural network may perform better.

A good approach is to build multiple models using different algorithms, and then compare the accuracy of these models. Even with a single algorithm, you can tune the parameter settings to optimize the model accuracy.
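
The build-several-models-and-compare approach can be sketched as follows. The two candidate models are hand-written rules standing in for models produced by different algorithms, and the holdout cases are invented; all names here are assumptions for illustration.

```python
def accuracy(model, cases):
    """Fraction of (inputs, label) cases that the model labels correctly."""
    correct = sum(model(inputs) == label for inputs, label in cases)
    return correct / len(cases)


# Two hypothetical candidate models for CollegePlans.
def iq_only_model(inputs):
    return "Yes" if inputs["IQ"] > 100 else "No"


def iq_and_encouragement_model(inputs):
    return "Yes" if inputs["IQ"] > 100 and inputs["Encouraged"] else "No"


# Hypothetical cases held out from training, each with a known outcome.
holdout = [
    ({"IQ": 120, "Encouraged": True}, "Yes"),
    ({"IQ": 110, "Encouraged": False}, "No"),
    ({"IQ": 95, "Encouraged": True}, "No"),
    ({"IQ": 105, "Encouraged": True}, "Yes"),
]

candidates = {
    "iq_only": iq_only_model,
    "iq_and_encouragement": iq_and_encouragement_model,
}
scores = {name: accuracy(model, holdout) for name, model in candidates.items()}
best = max(scores, key=scores.get)

if __name__ == "__main__":
    print(scores)
    print(best)
```

Measuring accuracy on cases that were not used to build the models gives a fairer comparison than measuring it on the training data itself.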

Model Assessment

In the model assessment stage, you use tools to determine the accuracy of the models that were created, and you examine the models to determine the meaning of discovered patterns and how they apply to your business. For example, a model may determine that Relationship = Husband ➪ Gender = Male with 100-percent confidence. Although the rule is valid, it doesn’t contain any business value. It is very important to work with business analysts who have the proper domain knowledge to validate the discoveries.

Sometimes, the model doesn’t contain useful patterns. This is generally because the set of variables in the model is not the right one to solve your business problem. You may need to repeat the data cleaning and transformation steps, or even redefine your problem in order to derive more meaningful variables. Data mining is an exploratory process, and it often takes a few iterations before you find the right model.

Reporting and Prediction

In many organizations, the goal of data miners is to deliver reports to marketing executives. SQL Server Data Mining is integrated with SQL Server Reporting Services to generate reports directly from data mining results. Reports may contain predictions (such as lists of customers with the highest value potential) or the rules found in the data mining analysis.

To provide predictions, you apply the selected model against new cases of data. Consider a banking scenario where you build a model about loan risk prediction. Every day there are thousands of new loan applications. You can

Title	Introduction to Data Mining in SQL Server 2008
Author	Maclennan
Subject	Data Mining
Type	presentation
Year of publication	2008