Data Mining with Analysis Services

IN THIS CHAPTER

Overview of the data mining process
Creating mining structures and models
Evaluating model accuracy
Deploying data mining functionality in applications
Mining algorithms and viewers
Mining integration with Analysis Services cubes
Many business questions can be answered directly by querying a database — for example, "What is the most popular page on our website?" or "Who are our top customers?" Other, often more important, questions require deeper exploration — for example, the most popular paths through the website or the common characteristics of top customers. Data mining provides the tools to answer such non-obvious questions.

The term data mining has suffered from a great deal of misuse. One favorite anecdote is the marketing person who intended to "mine" data in a spreadsheet by staring at it until inspiration struck. In this book, data mining is not something performed by intuition, direct query, or simple statistics. Instead, it is the algorithmic discovery of non-obvious information from large quantities of data.
Analysis Services implements algorithms to extract information addressing several
categories of questions:
■ Segmentation: Groups items with similar characteristics. For example, develop profiles of top customers or spot suspect values on a data entry page.
■ Classification: Places items into categories. For example, determine which customers are likely to respond to a marketing campaign or which e-mails are likely to be spam.
■ Association: Sometimes called market basket analysis, this determines which items tend to occur together. For example, which web pages are normally viewed together on the site, or "Customers who bought this book also bought..."
■ Estimation: Estimates a value. For example, estimating revenue from a customer or the life span of a piece of equipment.
■ Forecasting: Predicts what a time series will look like in the future. For example, when will we run out of disk space, or what revenue do we expect in the upcoming quarter?
The Data Mining Process
A traditional use of data mining is to train a data mining model using data for which an outcome is already known and then use that model to predict the outcome of new data as it becomes available. This use of data mining requires several steps, only some of which happen within Analysis Services:
■ Business and data understanding: Understand the questions that are important and the data available to answer those questions. Insights gained must be relevant to business goals to be of use. Data must be of acceptable quality and relevance to obtain reliable answers.
■ Prepare data: The effort to get data ready for mining can range from simple to painstaking, depending on the situation. Some of the tasks to consider include the following:
■ Eliminate rows of low data quality. Here, the measure of quality is domain specific, but it may include too small an underlying sample size, values outside of expected norms, or failing any test that proves the row describes an impossible or highly improbable case.
■ General cleaning by scaling, formatting, and so on; and by eliminating duplicates, invalid values, or inconsistent values.
■ Analysis Services accepts a single primary case table and, optionally, one or more child nested tables. If the source data is spread among several tables, then denormalization by creating views or preprocessing will be required.
■ Erratic time series data may benefit from smoothing. Smoothing algorithms remove the dramatic variations from noisy data at the cost of accuracy, so experimentation may be necessary to choose an algorithm that does not adversely impact the data mining outcome.
■ Derived attributes can be useful in the modeling process, typically either calculating a value from other attributes (e.g., Profit = Income − Cost) or simplifying the range of a complex domain (e.g., mapping numeric survey responses to High, Medium, or Low).
Some types of preparation can be accomplished within the Analysis Services data source view using named queries and named calculations. When possible, this is highly recommended, as it avoids reprocessing data sets if changes become necessary.
■ Finally, it is necessary to split the prepared data into two data sets: a training data set that is used to set up the model, and a testing data set that is used to evaluate the model's accuracy. Testing data can be held out either in the mining structure itself or during the data preparation process. The Integration Services Row Sampling and Percentage Sampling transforms are useful to randomly split data, typically saving 20 to 30 percent of rows for testing (a short sketch of such a split appears after this list).
■ Model: Analysis Services models are built by first defining a data mining structure that specifies the tables to be used as input. Then, data mining models (different algorithms) are added to the structure. Finally, all the models within the structure are trained simultaneously using the training data.
■ Evaluate: Evaluating the accuracy and usefulness of the candidate mining models is simplified by Analysis Services' Mining Accuracy Chart. Use the testing data set to understand the expected accuracy of each model and compare it to business needs.
■ Deploy: Integrate prediction queries into applications to predict the outcomes of interest.
For a more detailed description of the data mining process, see www.crisp-dm.org.
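To make the data preparation steps concrete, here is a minimal Python (pandas/NumPy) sketch of a derived attribute and a random training/testing split. It is not Analysis Services code; the table, the column names, and the 70/30 ratio are illustrative assumptions only.

import numpy as np
import pandas as pd

# Hypothetical case table; the column names are made up for illustration.
cases = pd.DataFrame({
    "CustomerKey": np.arange(1, 1001),
    "Income": np.random.uniform(20_000, 120_000, 1000),
    "Cost": np.random.uniform(10_000, 80_000, 1000),
})

# Derived attribute: calculate a value from other attributes (Profit = Income - Cost).
cases["Profit"] = cases["Income"] - cases["Cost"]

# Random split, holding out roughly 30 percent of rows for testing,
# similar in spirit to the Percentage Sampling transform or the
# mining structure holdout described later in this chapter.
test_mask = np.random.rand(len(cases)) < 0.30
testing = cases[test_mask]
training = cases[~test_mask]

print(len(training), "training cases,", len(testing), "testing cases")

Within the toolset itself, the derived column would typically be a named calculation in the data source view, and the split would be handled by Integration Services or by the mining structure's holdout settings.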
While this process is typical of data mining tasks, it does not cover every situation. Occasionally, exploring a data set is an end in itself, providing a better understanding of the data and its relationships. The process in this case may just iterate between prepare/model/evaluate cycles. At the other end of the spectrum, an application may build, train, and query a model to accomplish a task, such as identifying outlier rows in a data set. Regardless of the situation, understanding this typical process will aid in building appropriate adaptations.
Modeling with Analysis Services
Open an Analysis Services project within Business Intelligence Development Studio to create a data mining structure. When deployed, the Analysis Services project will create an Analysis Services database on the target server. Often, data mining structures are deployed in conjunction with related cubes in the same database.
Begin the modeling process by telling Analysis Services where the training and testing data reside:
■ Define data source(s) that reference the location of data to be used in modeling.
■ Create data source views that include all training tables. When nested tables are used, the data source view must show the relationship between the case and nested tables.
For information on creating and managing data sources and data source views, see Chapter 71, "Building Multidimensional Cubes with Analysis Services."
Data Mining Wizard
The Data Mining Wizard steps through the process of defining a new data mining structure and optionally the first model within that structure. Right-click on the Mining Structures node within the Solution Explorer and choose New Mining Structure to start the wizard. The wizard consists of several pages:
■ Select the Definition Method: Options include relational (from existing relational database or data warehouse) or cube (from existing cube) source data. For this example, choose relational. (See the section "OLAP Integration" later in this chapter for differences between relational-based and cube-based mining structures.)
■ Create the Data Mining Structure: Choose the algorithm to use in the structure's first mining model. (See the "Algorithms" section in this chapter for common algorithm usage.) Alternatively, a mining structure can be created with no models, and one or more models can be added to the structure later.
■ Select Data Source View: Choose the data source view containing the source data table(s).
■ Specify Table Types: Choose the case table containing the source data and any associated nested tables. Nested tables always have one-to-many relationships with the case table, such as a list of orders as the case table and associated order line items in the nested table.
■ Specify the Training Data: Indicate how each column will be used by the model:
■ Key: Identify the column(s) that uniquely identify each case.
■ Input: Identify the columns to be used as inputs for making predictions; these often include the predictable columns as well. The Suggest button may aid in selection once the predictable columns have been identified, scoring columns by relevance based on a sample of the training data, but take care to avoid inputs with values that are unlikely to occur again as input to a trained model. For example, a customer ID, name, or address might be very effective at training a model, but once the model is built to look for a specific ID or address, it is very unlikely new customers will ever match those values. Conversely, gender and occupation values are very likely to reappear in new customer records.
■ Predictable: Identify all columns the model should be able to predict.
■ Specify Columns' Content and Data Type: Review and adjust the data type (Boolean, Date, Double, Long, Text) as needed. Review and adjust the content type as well; pressing the Detect button to calculate continuous versus discrete for numeric data types may help. Available content types include the following:
■ Key: Contains a value that, either alone or with other keys, uniquely identifies a row in the training table.
■ Key Sequence: Acts as a key and provides order to the rows in a table. It is used to order rows for the sequence-clustering algorithm.
■ Key Time: Acts as a key and provides order to the rows in a table based on a time scale. It is used to order rows for the time series algorithm.
■ Continuous: Continuous numeric data — often the result of some calculation or measurement, such as age, height, or price.
■ Discrete: Data that can be thought of as a choice from a list, such as occupation, model, or shipping method.
■ Discretized: Analysis Services will transform a continuous column into a set of discrete buckets, such as ages 0–10, 11–20, and so on. In addition to choosing this option, other column properties must be set once the wizard is complete. Open the mining structure, select the column, and then set the DiscretizationBucketCount and DiscretizationMethod properties to direct how the "bucketization" will be performed. (A conceptual sketch of bucketing follows this list of wizard pages.)
■ Ordered: Defines an ordering on the training data but without assigning significance to the values used to order. For example, if values of 5 and 10 are used to order two rows, then 10 simply comes after 5; it is not "twice as good" as 5.
■ Cyclical: Similar to ordered data but repeats values, thus defining a cycle in the data, such as day of month or month of quarter. This enables the mining model to account for cycles in the data, such as sales peaks at the end of a quarter or annually during the holidays.
■ Create Testing Set: In SQL Server 2008, the mining structure can hold both the training and the testing data directly, instead of manually splitting the data into separate tables. Specify the percentage or number of rows to be held out for testing models in this structure if testing data is included in the source table(s).
■ Completing the Wizard: Provide names for the overall mining structure and the first mining model within that structure. Select Allow Drill Thru to enable the direct examination of training cases from within the data mining viewers.
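As a conceptual illustration of the Discretized content type mentioned above, the following Python sketch buckets ages into fixed ten-year ranges. The values are made up, and this fixed-width rule only stands in for the idea; Analysis Services chooses bucket boundaries according to the DiscretizationMethod and DiscretizationBucketCount properties rather than this simple scheme.

import pandas as pd

# Ages to be bucketed; the values are made up for illustration.
ages = pd.Series([3, 7, 15, 22, 29, 34, 41, 48, 55, 63, 71, 88])

# Equal-width buckets of ten years each, similar in spirit to the
# 0-10, 11-20 example above. The server's discretization methods are
# more sophisticated; this is only a stand-in for the concept.
buckets = pd.cut(ages, bins=range(0, 101, 10))

print(pd.concat([ages, buckets], axis=1, keys=["Age", "Bucket"]))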
Once the wizard finishes, the new mining structure with a single mining model is created, and the new structure is opened in the Data Mining Designer. The initial Designer view, Mining Structure, enables columns to be added or removed from the structure, and column properties, such as Content (type) or DiscretizationMethod, to be modified.
Mining Models view
The Mining Models view of the Data Mining Designer enables different data mining algorithms to be configured on the data defined by the mining structure. Add new models as follows (see Figure 76-1):
FIGURE 76-1
Adding a new model to an existing structure
1. Right-click the structure/model matrix pane and choose New Mining Model.
2. Supply a name for the model.
3. Select the desired algorithm and click OK.
A mining model can optionally be filtered so that it is trained only on cases specific to a subset of the source data. For example, targeting different customer groups can be performed by training filtered models in a single mining structure. Right-click on a model and choose Set Model Filter to apply a filter to a model. Once set, the current filter is viewable in the model's properties.
In addition to the optional model filter, each mining model has both properties and algorithm parameters. Select a model (column) to view and change the properties common to all algorithms in the Properties pane, including Name, Description, and AllowDrillThrough. Right-click on a model and choose Set Algorithm Parameters to change an algorithm's default settings.
Once both the structure and model definitions are in place, the structure must be deployed to the target server to process and train the models. The process of deploying a model consists of two parts:
1. During the build phase, the structure definition (or changes to the definition as appropriate) is sent to the target Analysis Services server. Examine the progress of the build in the output pane.
2. During the process phase, the Analysis Services server queries the source data, caches that data in the mining structure, and trains the models with all the data that has not been either filtered out or held out for testing.
Before the first time a project is deployed, set the target server by right-clicking on the project in the Solution Explorer pane containing the mining structure and choosing Properties. Then select the Deployment topic and enter the appropriate server name, adjusting the target database name at the same time (deploying creates an Analysis Services database named, by default, after the project).
Deploy the structure by choosing either Process Model or Process Mining Structure and All Models from the context menu. The same options are available from the Mining Model menu as well. After processing, the Mining Model Viewer tab contains the processing results; here, one or more viewers are available depending on which models are included in the structure. The algorithm-specific viewers assist in understanding the rules and relationships discovered by the models (see the "Algorithms" section later in this chapter).
Model evaluation
Evaluate the trained models to determine which model predicts the outcome most reliably, and to decide whether the accuracy will be adequate to meet business goals. The Mining Accuracy Chart view provides tools for performing the evaluation.
The charts visible within this view are enabled by supplying data for testing under the Input Selection tab. Choose one of three sources:
■ Use mining model test cases: Uses test data held out in the mining structure but applies any model filters in selecting data for each model.
■ Use mining structure test cases: Uses test data held out in the mining structure, ignoring any model filters.
■ Specify a different data set: Allows the selection and mapping of an external table to supply test data. After selecting this option, press the ellipsis button to display the Specify Column Mapping dialog. Then press the Select Case Table button on the right-hand table and choose the table containing the test data. The joins between the selected table and the mining structure will map automatically for matching column names, or they can be manually mapped by drag-and-drop when a match is not found. Verify that each non-key column in the mining structure participates in a join.
If the value being predicted is discrete, then the Input Selection tab also allows choosing a particular outcome for evaluation. If a Predict Value is not selected, then accuracy for all outcomes is evaluated.
Lift charts and scatter plots
Once the source data and any Predict Value have been specified, switch to the Lift Chart tab, and verify that Lift Chart (Scatter Plot for continuous outcomes) is selected from the Chart Type list box (see Figure 76-2). Because the source data contains the predicted column(s), the lift chart can compare each model's prediction against the actual outcome. The lift chart plots this information on the Target Population % (percent of cases correct) versus Overall Population % (percent of cases tested) axes, so when 50 percent of the population has been checked, the perfect model will have predicted 50 percent correctly. In fact, the chart automatically includes two useful reference lines: the Ideal Model, which indicates the best possible performance, and the Random Guess, which indicates how often randomly assigned outcomes happen to be correct.
The profit chart extends the lift chart and aids in calculating the maximum return from marketing campaigns and similar efforts. Press the Settings button to specify the number of prospects, the fixed and per-case costs, and the expected return from a successfully identified case; then choose Profit Chart from the Chart Type list box. The resulting chart indicates profit versus population percent included, offering a guide as to how much of the population should be included in the effort, either by maximizing profit or by locating a point of diminishing returns.
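To see what lies behind such a chart, here is a rough Python sketch of the lift calculation: cases are sorted by the model's predicted probability, and the cumulative percentage of target cases captured is tracked as more of the population is included. The data is invented, and this only illustrates the calculation, not how Analysis Services renders the chart.

import numpy as np

# Hypothetical test results: actual outcome (1 = responded) and the
# model's predicted probability of a positive outcome for each case.
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
probability = np.array([0.92, 0.85, 0.81, 0.77, 0.66, 0.58,
                        0.55, 0.43, 0.41, 0.38, 0.22, 0.10])

# Sort cases by predicted probability, best prospects first.
order = np.argsort(-probability)
hits = actual[order]

# Percent of population included vs. percent of target cases captured.
population_pct = np.arange(1, len(hits) + 1) / len(hits) * 100
target_pct = np.cumsum(hits) / hits.sum() * 100

for p, t in zip(population_pct, target_pct):
    print(f"{p:5.1f}% of population -> {t:5.1f}% of targets")

A perfect model captures targets as fast as the ideal line allows, while a random guess captures them in proportion to the population included, which is exactly what the two reference lines on the lift chart represent.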
Classification matrix
The simplest view of model accuracy is offered by the Classification Matrix tab, which creates one table for each model, with predicted outcomes listed down the left side of the table and actual values across the top, similar to the example shown in Table 76-1. This example shows that for cases that were actually red, this model correctly predicted red for 95 and incorrectly predicted blue for 37. Likewise, for cases that were actually blue, the model correctly predicted blue 104 times while incorrectly predicting red 21 times.
TABLE 76-1

Example Classification Matrix

Predicted    Red (Actual)    Blue (Actual)
Red          95              21
Blue         37              104
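The same counts can be tallied directly from (actual, predicted) pairs. The short Python sketch below reproduces the layout of Table 76-1; the pairs are hard-coded to match the example rather than drawn from a real model.

from collections import Counter

# Hypothetical (actual, predicted) pairs matching the counts in Table 76-1.
results = ([("Red", "Red")] * 95 + [("Red", "Blue")] * 37 +
           [("Blue", "Blue")] * 104 + [("Blue", "Red")] * 21)

counts = Counter(results)
labels = ["Red", "Blue"]

# Rows are predicted values, columns are actual values, as in Table 76-1.
print("Predicted", *[f"{label} (Actual)" for label in labels], sep="\t")
for predicted in labels:
    row = [counts[(actual, predicted)] for actual in labels]
    print(predicted, *row, sep="\t")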
Cross validation
Cross validation is a very effective technique for evaluating a model for stability and how well it will generalize for unseen cases. The concept is to partition the available data into some number of equal-sized buckets called folds, and then train the model on all but one of those folds and test with the remaining fold, repeating until each of the folds has been used for testing. For example, if three folds were selected, the model would be trained on 2 and 3 and tested with 1, then trained on 1 and 3 and tested on 2, and finally trained on 1 and 2 and tested on 3.
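To make the fold rotation concrete, here is a short Python sketch (not Analysis Services code) that assigns case indices to folds at random and rotates the held-out fold exactly as described above; the fold count and number of cases are arbitrary.

import numpy as np

def cross_validation_folds(case_count, fold_count):
    """Yield (training, testing) index arrays, rotating the held-out fold."""
    indices = np.arange(case_count)
    np.random.shuffle(indices)                  # assign cases to folds at random
    folds = np.array_split(indices, fold_count)
    for held_out in range(fold_count):
        testing = folds[held_out]
        training = np.concatenate([fold for i, fold in enumerate(folds) if i != held_out])
        yield training, testing

# Twelve cases and three folds, as in the example above.
for number, (train_idx, test_idx) in enumerate(cross_validation_folds(12, 3), start=1):
    print(f"Fold {number}: train on {len(train_idx)} cases, test on {len(test_idx)} cases")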
Switch to the Cross Validation tab and specify the parameters for the evaluation:
■ Fold Count: The number of partitions into which the data will be placed.
■ Max Cases: The number of cases from which the folds will be constructed. For example, 1,000 cases and 10 folds will result in approximately 100 cases per fold. Because of the large amount of processing required to perform cross validation, it is often useful to limit the number of cases. Setting this value to 0 results in all cases being used.
■ Target Attribute and State: The prediction to validate.
■ Target Threshold: Sets the minimum probability required before assuming a positive result. For example, if you were identifying customers for an expensive marketing promotion, a minimum threshold of 80 percent likely to purchase could be set to target only the best prospects. Knowing that this threshold will be used enables a more realistic evaluation of the model.
Once the cross-validation has run, a report like the one shown in Figure 76-3 displays the outcome for each fold across a number of different measures. In addition to how well a model performs in each line item, the standard deviation of the results of each measure should be relatively small. If the variation between folds is large, then it is an indication that the model will not generalize well in practical use.
FIGURE 76-3
Cross Validation tab
Troubleshooting models
Models seldom approach perfection in the real world. If these evaluation techniques show a model falling short of your needs, then consider these common problems:
■ A non-random split of data into training and test data sets. If the split method used was based on a random algorithm, rerun the random algorithm to obtain a more random result.