Data Mining with Analysis Services

IN THIS CHAPTER

Overview of the data mining process
Creating mining structures and models
Evaluating model accuracy
Deploying data mining functionality in applications
Mining algorithms and viewers
Mining integration with Analysis Services cubes
Many business questions can be answered directly by querying a database — for example, "What is the most popular page on our website?" or "Who are our top customers?" Other, often more important, questions require deeper exploration — for example, the most popular paths through the website or the common characteristics of top customers. Data mining provides the tools to answer such non-obvious questions.

The term data mining has suffered from a great deal of misuse. One favorite anecdote is the marketing person who intended to "mine" data in a spreadsheet by staring at it until inspiration struck. In this book, data mining is not something performed by intuition, direct query, or simple statistics. Instead, it is the algorithmic discovery of non-obvious information from large quantities of data.
Analysis Services implements algorithms to extract information addressing several
categories of questions:
■ Segmentation: Groups items with similar characteristics. For example, develop profiles of top customers or spot suspect values on a data entry page.
■ Classification: Places items into categories. For example, determine which customers are likely to respond to a marketing campaign or which e-mails are likely to be spam.
■ Association: Sometimes called market basket analysis, this determines which items tend to occur together. For example, which web pages are normally viewed together on the site, or "Customers who bought this book also bought..."
■ Estimation: Estimates a value. For example, estimating revenue from a customer or the life span of a piece of equipment.
■ Forecasting: Predicts what a time series will look like in the future. For example, when will we run out of disk space, or what revenue do we expect in the upcoming quarter?
The Data Mining Process
A traditional use of data mining is to train a data mining model using data for which an outcome is already known and then use that model to predict the outcome of new data as it becomes available. This use of data mining requires several steps, only some of which happen within Analysis Services:
■ Business and data understanding: Understand the questions that are important and the data available to answer those questions. Insights gained must be relevant to business goals to be of use. Data must be of acceptable quality and relevance to obtain reliable answers.
■ Prepare data: The effort to get data ready for mining can range from simple to painstaking, depending on the situation. Some of the tasks to consider include the following:
■ Eliminate rows of low data quality. Here, the measure of quality is domain specific, but it may include too small an underlying sample size, values outside of expected norms, or failing any test that proves the row describes an impossible or highly improbable case.
■ General cleaning by scaling, formatting, and so on; and by eliminating duplicates, invalid values, or inconsistent values.
■ Analysis Services accepts a single primary case table and, optionally, one or more child nested tables. If the source data is spread among several tables, then denormalization by creating views or preprocessing will be required.
■ Erratic time series data may benefit from smoothing. Smoothing algorithms remove the dramatic variations from noisy data at the cost of accuracy, so experimentation may be necessary to choose an algorithm that does not adversely impact the data mining outcome.
■ Derived attributes can be useful in the modeling process, typically either calculating a value from other attributes (e.g., Profit = Income − Cost) or simplifying the range of a complex domain (e.g., mapping numeric survey responses to High, Medium, or Low).
Some types of preparation can be accomplished within the Analysis Services data source view using named queries and named calculations. When possible, this is highly recommended, as it avoids reprocessing data sets if changes become necessary.
■ Finally, it is necessary to split the prepared data into two data sets: a training data set that is used to set up the model, and a testing data set that is used to evaluate the model's accuracy. Testing data can be held out either in the mining structure itself or during the data preparation process. The Integration Services Row Sampling and Percentage Sampling transforms are useful to randomly split data, typically saving 20 to 30 percent of rows for testing (a short sketch of such a split appears after this list).
■ Model: Analysis Services models are built by first defining a data mining structure that specifies the tables to be used as input. Then, data mining models (different algorithms) are added to the structure. Finally, all the models within the structure are trained simultaneously using the training data.
■ Evaluate: Evaluating the accuracy and usefulness of the candidate mining models is simplified by Analysis Services' Mining Accuracy Chart. Use the testing data set to understand the expected accuracy of each model and compare it to business needs.
■ Deploy: Integrate prediction queries into applications to predict the outcomes of interest.
For a more detailed description of the data mining process, see www.crisp-dm.org.
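To make the data preparation steps concrete, here is a minimal Python (pandas/NumPy) sketch of a derived attribute and a random training/testing split. It is not Analysis Services code; the table, the column names, and the 70/30 ratio are illustrative assumptions only.

import numpy as np
import pandas as pd

# Hypothetical case table; the column names are made up for illustration.
cases = pd.DataFrame({
    "CustomerKey": np.arange(1, 1001),
    "Income": np.random.uniform(20_000, 120_000, 1000),
    "Cost": np.random.uniform(10_000, 80_000, 1000),
})

# Derived attribute: calculate a value from other attributes (Profit = Income - Cost).
cases["Profit"] = cases["Income"] - cases["Cost"]

# Random split, holding out roughly 30 percent of rows for testing,
# similar in spirit to the Percentage Sampling transform or the
# mining structure holdout described later in this chapter.
test_mask = np.random.rand(len(cases)) < 0.30
testing = cases[test_mask]
training = cases[~test_mask]

print(len(training), "training cases,", len(testing), "testing cases")

Within the toolset itself, the derived column would typically be a named calculation in the data source view, and the split would be handled by Integration Services or by the mining structure's holdout settings.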
While this process is typical of data mining tasks, it does not cover every situation. Occasionally, exploring a data set is an end in itself, providing a better understanding of the data and its relationships. The process in this case may just iterate between prepare/model/evaluate cycles. At the other end of the spectrum, an application may build, train, and query a model to accomplish a task, such as identifying outlier rows in a data set. Regardless of the situation, understanding this typical process will aid in building appropriate adaptations.
Modeling with Analysis Services
Open an Analysis Services project within Business Intelligence Development Studio to create a data mining structure. When deployed, the Analysis Services project will create an Analysis Services database on the target server. Often, data mining structures are deployed in conjunction with related cubes in the same database.
Begin the modeling process by telling Analysis Services where the training and testing data reside:
■ Define data source(s) that reference the location of data to be used in modeling.
■ Create data source views that include all training tables. When nested tables are used, the data source view must show the relationship between the case and nested tables.
For information on creating and managing data sources and data source views, see Chapter 71, "Building Multidimensional Cubes with Analysis Services."
Data Mining Wizard
The Data Mining Wizard steps through the process of defining a new data mining structure and optionally the first model within that structure. Right-click on the Mining Structures node within the Solution Explorer and choose New Mining Structure to start the wizard. The wizard consists of several pages:
■ Select the Definition Method: Options include relational (from existing relational database or data warehouse) or cube (from existing cube) source data. For this example, choose relational. (See the section "OLAP Integration" later in this chapter for differences between relational-based and cube-based mining structures.)
■ Create the Data Mining Structure: Choose the algorithm to use in the structure's first mining model. (See the "Algorithms" section in this chapter for common algorithm usage.) Alternatively, a mining structure can be created with no models, and one or more models can be added to the structure later.
■ Select Data Source View: Choose the data source view containing the source data table(s).
■ Specify Table Types: Choose the case table containing the source data and any associated nested tables. Nested tables always have one-to-many relationships with the case table, such as a list of orders as the case table and associated order line items in the nested table.
■ Specify the Training Data: Indicate how each column will be used by the model:
■ Key: Identify the column(s) that uniquely identify each case.
■ Input: Identify the columns to be used as inputs for making predictions; these often include the predictable columns as well. The Suggest button may aid in selection once the predictable columns have been identified, scoring columns by relevance based on a sample of the training data, but take care to avoid inputs with values that are unlikely to occur again as input to a trained model. For example, a customer ID, name, or address might be very effective at training a model, but once the model is built to look for a specific ID or address, it is very unlikely new customers will ever match those values. Conversely, gender and occupation values are very likely to reappear in new customer records.
■ Predictable: Identify all columns the model should be able to predict.
■ Specify Columns' Content and Data Type: Review and adjust the data type (Boolean, Date, Double, Long, Text) as needed. Review and adjust the content type as well; pressing the Detect button to calculate continuous versus discrete for numeric data types may help. Available content types include the following:
■ Key: Contains a value that, either alone or with other keys, uniquely identifies a row in the training table.
■ Key Sequence: Acts as a key and provides order to the rows in a table. It is used to order rows for the sequence-clustering algorithm.
■ Key Time: Acts as a key and provides order to the rows in a table based on a time scale. It is used to order rows for the time series algorithm.
■ Continuous: Continuous numeric data — often the result of some calculation or measurement, such as age, height, or price.
■ Discrete: Data that can be thought of as a choice from a list, such as occupation, model, or shipping method.
■ Discretized: Analysis Services will transform a continuous column into a set of discrete buckets, such as ages 0–10, 11–20, and so on. In addition to choosing this option, other column properties must be set once the wizard is complete. Open the mining structure, select the column, and then set the DiscretizationBucketCount and DiscretizationMethod properties to direct how the "bucketization" will be performed. (A conceptual sketch of bucketing follows this list of wizard pages.)
■ Ordered: Defines an ordering on the training data but without assigning significance to the values used to order. For example, if values of 5 and 10 are used to order two rows, then 10 simply comes after 5; it is not "twice as good" as 5.
■ Cyclical: Similar to ordered data but repeats values, thus defining a cycle in the data, such as day of month or month of quarter. This enables the mining model to account for cycles in the data, such as sales peaks at the end of a quarter or annually during the holidays.
■ Create Testing Set: In SQL Server 2008, the mining structure can hold both the training and the testing data directly, instead of manually splitting the data into separate tables. Specify the percentage or number of rows to be held out for testing models in this structure if testing data is included in the source table(s).
■ Completing the Wizard: Provide names for the overall mining structure and the first mining model within that structure. Select Allow Drill Thru to enable the direct examination of training cases from within the data mining viewers.
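As a conceptual illustration of the Discretized content type mentioned above, the following Python sketch buckets ages into fixed ten-year ranges. The values are made up, and this fixed-width rule only stands in for the idea; Analysis Services chooses bucket boundaries according to the DiscretizationMethod and DiscretizationBucketCount properties rather than this simple scheme.

import pandas as pd

# Ages to be bucketed; the values are made up for illustration.
ages = pd.Series([3, 7, 15, 22, 29, 34, 41, 48, 55, 63, 71, 88])

# Equal-width buckets of ten years each, similar in spirit to the
# 0-10, 11-20 example above. The server's discretization methods are
# more sophisticated; this is only a stand-in for the concept.
buckets = pd.cut(ages, bins=range(0, 101, 10))

print(pd.concat([ages, buckets], axis=1, keys=["Age", "Bucket"]))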
Once the wizard finishes, the new mining structure with a single mining model is created, and the new structure is opened in the Data Mining Designer. The initial Designer view, Mining Structure, enables columns to be added or removed from the structure, and column properties, such as Content (type) or DiscretizationMethod, to be modified.
Mining Models view
The Mining Models view of the Data Mining Designer enables different data mining algorithms to be configured on the data defined by the mining structure. Add new models as follows (see Figure 76-1):
FIGURE 76-1
Adding a new model to an existing structure
1. Right-click the structure/model matrix pane and choose New Mining Model.
2. Supply a name for the model.
3. Select the desired algorithm and click OK.
A mining model can optionally be filtered so that it is trained only on cases specific to a subset of the source data. For example, targeting different customer groups can be performed by training filtered models in a single mining structure. Right-click on a model and choose Set Model Filter to apply a filter to a model. Once set, the current filter is viewable in the model's properties.
In addition to the optional model filter, each mining model has both properties and algorithm parameters. Select a model (column) to view and change the properties common to all algorithms in the Properties pane, including Name, Description, and AllowDrillThrough. Right-click on a model and choose Set Algorithm Parameters to change an algorithm's default settings.
Once both the structure and model definitions are in place, the structure must be deployed to the target server to process and train the models. The process of deploying a model consists of two parts:
1. During the build phase, the structure definition (or changes to the definition as appropriate) is sent to the target Analysis Services server. Examine the progress of the build in the output pane.
2. During the process phase, the Analysis Services server queries the source data, caches that data in the mining structure, and trains the models with all the data that has not been either filtered out or held out for testing.
Before the first time a project is deployed, set the target server by right-clicking on the project in the Solution Explorer pane containing the mining structure and choosing Properties. Then select the Deployment topic and enter the appropriate server name, adjusting the target database name at the same time (deploying creates an Analysis Services database named, by default, after the project).
Deploy the structure by choosing either Process Model or Process Mining Structure and All Models from the context menu. The same options are available from the Mining Model menu as well. After processing, the Mining Model Viewer tab contains the processing results; here, one or more viewers are available depending on which models are included in the structure. The algorithm-specific viewers assist in understanding the rules and relationships discovered by the models (see the "Algorithms" section later in this chapter).
Model evaluation
Evaluate the trained models to determine which model predicts the outcome most reliably, and to decide whether the accuracy will be adequate to meet business goals. The Mining Accuracy Chart view provides tools for performing the evaluation.
The charts visible within this view are enabled by supplying data for testing under the Input Selection tab. Choose one of three sources:
■ Use mining model test cases: Uses test data held out in the mining structure but applies any model filters in selecting data for each model.
■ Use mining structure test cases: Uses test data held out in the mining structure, ignoring any model filters.
■ Specify a different data set: Allows the selection and mapping of an external table to supply test data. After selecting this option, press the ellipsis button to display the Specify Column Mapping dialog. Then press the Select Case Table button on the right-hand table and choose the table containing the test data. The joins between the selected table and the mining structure will map automatically for matching column names, or they can be manually mapped by drag-and-drop when a match is not found. Verify that each non-key column in the mining structure participates in a join.
If the value being predicted is discrete, then the Input Selection tab also allows choosing a particular outcome for evaluation. If a Predict Value is not selected, then accuracy for all outcomes is evaluated.
Lift charts and scatter plots
Once the source data and any Predict Value have been specified, switch to the Lift Chart tab, and verify that Lift Chart (Scatter Plot for continuous outcomes) is selected from the Chart Type list box (see Figure 76-2). Because the source data contains the predicted column(s), the lift chart can compare each model's prediction against the actual outcome. The lift chart plots this information on the Target Population % (percent of cases correct) versus Overall Population % (percent of cases tested) axes, so when 50 percent of the population has been checked, the perfect model will have predicted 50 percent correctly. In fact, the chart automatically includes two useful reference lines: the Ideal Model, which indicates the best possible performance, and the Random Guess, which indicates how often randomly assigned outcomes happen to be correct.
The profit chart extends the lift chart and aids in calculating the maximum return from marketing campaigns and similar efforts. Press the Settings button to specify the number of prospects, the fixed and per-case costs, and the expected return from a successfully identified case; then choose Profit Chart from the Chart Type list box. The resulting chart indicates profit versus population percent included, offering a guide as to how much of the population should be included in the effort, either by maximizing profit or by locating a point of diminishing returns.
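To see what lies behind such a chart, here is a rough Python sketch of the lift calculation: cases are sorted by the model's predicted probability, and the cumulative percentage of target cases captured is tracked as more of the population is included. The data is invented, and this only illustrates the calculation, not how Analysis Services renders the chart.

import numpy as np

# Hypothetical test results: actual outcome (1 = responded) and the
# model's predicted probability of a positive outcome for each case.
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
probability = np.array([0.92, 0.85, 0.81, 0.77, 0.66, 0.58,
                        0.55, 0.43, 0.41, 0.38, 0.22, 0.10])

# Sort cases by predicted probability, best prospects first.
order = np.argsort(-probability)
hits = actual[order]

# Percent of population included vs. percent of target cases captured.
population_pct = np.arange(1, len(hits) + 1) / len(hits) * 100
target_pct = np.cumsum(hits) / hits.sum() * 100

for p, t in zip(population_pct, target_pct):
    print(f"{p:5.1f}% of population -> {t:5.1f}% of targets")

A perfect model captures targets as fast as the ideal line allows, while a random guess captures them in proportion to the population included, which is exactly what the two reference lines on the lift chart represent.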
Classification matrix
The simplest view of model accuracy is offered by the Classification Matrix tab, which creates one table for each model, with predicted outcomes listed down the left side of the table and actual values across the top, similar to the example shown in Table 76-1. This example shows that for cases that were actually red, this model correctly predicted red for 95 and incorrectly predicted blue for 37. Likewise, for cases that were actually blue, the model correctly predicted blue 104 times while incorrectly predicting red 21 times.
TABLE 76-1

Example Classification Matrix

Predicted    Red (Actual)    Blue (Actual)
Red          95              21
Blue         37              104
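The same counts can be tallied directly from (actual, predicted) pairs. The short Python sketch below reproduces the layout of Table 76-1; the pairs are hard-coded to match the example rather than drawn from a real model.

from collections import Counter

# Hypothetical (actual, predicted) pairs matching the counts in Table 76-1.
results = ([("Red", "Red")] * 95 + [("Red", "Blue")] * 37 +
           [("Blue", "Blue")] * 104 + [("Blue", "Red")] * 21)

counts = Counter(results)
labels = ["Red", "Blue"]

# Rows are predicted values, columns are actual values, as in Table 76-1.
print("Predicted", *[f"{label} (Actual)" for label in labels], sep="\t")
for predicted in labels:
    row = [counts[(actual, predicted)] for actual in labels]
    print(predicted, *row, sep="\t")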
Cross validation
Cross validation is a very effective technique for evaluating a model for stability and how well it will generalize for unseen cases. The concept is to partition the available data into some number of equal-sized buckets called folds, and then train the model on all but one of those folds and test with the remaining fold, repeating until each of the folds has been used for testing. For example, if three folds were selected, the model would be trained on 2 and 3 and tested with 1, then trained on 1 and 3 and tested on 2, and finally trained on 1 and 2 and tested on 3.
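To make the fold rotation concrete, here is a short Python sketch (not Analysis Services code) that assigns case indices to folds at random and rotates the held-out fold exactly as described above; the fold count and number of cases are arbitrary.

import numpy as np

def cross_validation_folds(case_count, fold_count):
    """Yield (training, testing) index arrays, rotating the held-out fold."""
    indices = np.arange(case_count)
    np.random.shuffle(indices)                  # assign cases to folds at random
    folds = np.array_split(indices, fold_count)
    for held_out in range(fold_count):
        testing = folds[held_out]
        training = np.concatenate([fold for i, fold in enumerate(folds) if i != held_out])
        yield training, testing

# Twelve cases and three folds, as in the example above.
for number, (train_idx, test_idx) in enumerate(cross_validation_folds(12, 3), start=1):
    print(f"Fold {number}: train on {len(train_idx)} cases, test on {len(test_idx)} cases")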
Switch to the Cross Validation tab and specify the parameters for the evaluation:
■ Fold Count: The number of partitions into which the data will be placed.
■ Max Cases: The number of cases from which the folds will be constructed. For example, 1,000 cases and 10 folds will result in approximately 100 cases per fold. Because of the large amount of processing required to perform cross validation, it is often useful to limit the number of cases. Setting this value to 0 results in all cases being used.
■ Target Attribute and State: The prediction to validate.
■ Target Threshold: Sets the minimum probability required before assuming a positive result. For example, if you were identifying customers for an expensive marketing promotion, a minimum threshold of 80 percent likely to purchase could be set to target only the best prospects. Knowing that this threshold will be used enables a more realistic evaluation of the model.
Once the cross-validation has run, a report like the one shown in Figure 76-3 displays the outcome for each fold across a number of different measures. In addition to how well a model performs in each line item, the standard deviation of the results of each measure should be relatively small. If the variation between folds is large, then it is an indication that the model will not generalize well in practical use.
FIGURE 76-3
Cross Validation tab
Troubleshooting models
Models seldom approach perfection in the real world. If these evaluation techniques show a model falling short of your needs, then consider these common problems:
■ A non-random split of data into training and test data sets. If the split method used was based on a random algorithm, rerun the random algorithm to obtain a more random result.