■ Input columns are too case-specific (e.g., IDs, names, etc.). Adjust the mining structure to ignore data items containing values that occur in the training data but then never reappear for test or production data.
■ Too few rows (cases) in the training data set to accurately characterize the population of cases. Look for additional sources of data for best results. If additional data is not available, then better results may be obtained by limiting the special cases considered by an algorithm (e.g., increasing the MINIMUM_SUPPORT parameter).
■ If all models are closer to the Random Guess line than the Ideal Model line, then the input data does not correlate with the outcome being predicted.
Note that some algorithms, such as Time_Series, do not support the Mining Accuracy Chart view at all. Regardless of the tools available within the development environment, it is important to perform an evaluation of the trained model using test data held in reserve for that purpose. Then, modify the data and model definitions until the results meet the business goals at hand.
Deploying
Several methods are available for interfacing applications with data mining functionality:
■ Directly constructing XMLA, communicating with Analysis Services via SOAP. This exposes all functionality at the price of in-depth programming.
■ Analysis Management Objects (AMO) provides an environment for creating and managing mining structures and other metadata, but not for prediction queries.
■ The Data Mining Extensions (DMX) language supports most model creation and training tasks and has a robust prediction query capability. DMX can be sent to Analysis Services via the following:
■ ADOMD.NET for managed (.NET) languages
■ OLE DB for C++ code
■ ADO for other languages
DMX is a SQL-like language modified to accommodate mining structures and tasks. For purposes of performing prediction queries against a trained model, the primary language feature is the prediction join. As the following code example shows, the prediction join relates a mining model and a set of data to be predicted (cases). Because the DMX query is issued against the Analysis Services database, the model [TM Decision Tree] can be directly referenced, while the cases must be gathered via an OPENQUERY call against the relational database. The corresponding columns are matched in the ON clause like a standard relational join, and the WHERE and ORDER BY clauses function as expected.
DMX also adds a number of mining-specific functions such as the Predict and PredictProbability functions shown here, which return the most likely outcome and the probability of that outcome, respectively. Overall, this example returns a list of IDs, names, and probabilities for prospects who are more than 60 percent likely to purchase a bike, sorted by descending probability:
SELECT t.ProspectAlternateKey, t.FirstName, t.LastName,
  PredictProbability([TM Decision Tree].[Bike Buyer]) AS Prob
FROM [TM Decision Tree]
PREDICTION JOIN OPENQUERY([Adventure Works DW],
  'SELECT ProspectAlternateKey, FirstName, LastName, MaritalStatus,
   Gender, YearlyIncome, TotalChildren, NumberChildrenAtHome,
   Education, Occupation, HouseOwnerFlag, NumberCarsOwned,
   StateProvinceCode
   FROM dbo.ProspectiveBuyer;') AS t
ON
[TM Decision Tree].[Marital Status] = t.MaritalStatus AND
[TM Decision Tree].Gender = t.Gender AND
[TM Decision Tree].[Yearly Income] = t.YearlyIncome AND
[TM Decision Tree].[Total Children] = t.TotalChildren AND
[TM Decision Tree].[Number Children At Home] = t.NumberChildrenAtHome AND
[TM Decision Tree].Education = t.Education AND
[TM Decision Tree].Occupation = t.Occupation AND
[TM Decision Tree].[House Owner Flag] = t.HouseOwnerFlag AND
[TM Decision Tree].[Number Cars Owned] = t.NumberCarsOwned AND
[TM Decision Tree].Region = t.StateProvinceCode
WHERE PredictProbability([TM Decision Tree].[Bike Buyer]) > 0.60
AND Predict([TM Decision Tree].[Bike Buyer])=1
ORDER BY PredictProbability([TM Decision Tree].[Bike Buyer]) DESC
Another useful form of the prediction join is a singleton query, whereby data is provided directly by the application instead of read from a relational table, as shown in the next example. Because the names exactly match those of the mining model, a NATURAL PREDICTION JOIN is used, not requiring an ON clause. This example returns the probability that the listed case will purchase a bike (i.e., [Bike Buyer] = 1):
SELECT
PredictProbability([TM Decision Tree].[Bike Buyer],1)
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 47 AS [Age], '2-5 Miles' AS [Commute Distance],
 'Graduate Degree' AS [Education], 'M' AS [Gender],
 '1' AS [House Owner Flag], 'M' AS [Marital Status],
 2 AS [Number Cars Owned], 0 AS [Number Children At Home],
 'Professional' AS [Occupation], 'North America' AS [Region],
 0 AS [Total Children], 80000 AS [Yearly Income]) AS t
Business Intelligence Development Studio aids in the construction of DMX queries via the Query Builder within the mining model prediction view. Just like the Mining Accuracy Chart, select the model and case table to be queried, or alternatively press the singleton button in the toolbar to specify values. Specify SELECT columns and prediction functions in the grid at the bottom. SQL Server Management Studio also offers a DMX query type with metadata panes for drag-and-drop access to mining structure column names and prediction functions.
Numerous prediction functions are available, including the following:
■ Predict: Returns the expected outcome for a predictable column
■ PredictProbability: Returns the probability (between 0 and 1) of the expected outcome, or of a specific outcome if specified
■ PredictSupport: Returns the number of training cases on which the expected outcome is based, or on which a specific outcome is based if specified
■ PredictHistogram: Returns a nested table with all possible outcomes for a given case, listing probability, support, and other information for each outcome
■ Cluster: Returns the cluster to which a case is assigned (clustering algorithm specific)
■ ClusterProbability: Returns the probability the case belongs to a given cluster (clustering algorithm specific)
■ PredictSequence: Predicts the next values in a sequence (sequence clustering algorithm specific)
■ PredictAssociation: Predicts associative membership (association algorithm specific)
■ PredictTimeSeries: Predicts future values in a time series (time series algorithm specific). Like PredictHistogram, this function returns a nested table
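For example, the following query sketch combines several of these functions against the [TM Decision Tree] model used earlier in this chapter; the singleton input values are illustrative only, and FLATTENED expands the nested table returned by PredictHistogram into ordinary rows:
SELECT FLATTENED
  Predict([TM Decision Tree].[Bike Buyer]) AS Outcome,
  PredictProbability([TM Decision Tree].[Bike Buyer]) AS Prob,
  PredictHistogram([TM Decision Tree].[Bike Buyer])   -- nested table of all possible outcomes
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], 'S' AS [Marital Status],
        1 AS [Number Cars Owned], 60000 AS [Yearly Income]) AS t
Each row of the flattened result repeats the predicted outcome and its probability alongside one entry from the histogram.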
Algorithms
When working with data mining, it is useful to understand mining algorithm basics and when to apply
each algorithm. Table 76-2 summarizes common algorithm usage for the problem categories presented at the beginning of this chapter.
TABLE 76-2
Common Mining Algorithm Usage

Problem Type        Primary Algorithms
Segmentation        Clustering, Sequence Clustering
Classification      Decision Trees, Naive Bayes, Neural Network, Logistic Regression
Association         Association Rules, Decision Trees
Estimation          Decision Trees, Linear Regression, Logistic Regression, Neural Network
Forecasting         Time Series
Sequence Analysis   Sequence Clustering
These usage guidelines are useful as an orientation, but not every data mining problem falls neatly into
one of these types, and other algorithms will work for several of these problem types. Fortunately, with evaluation tools such as the lift chart, it’s usually simple to identify which algorithm provides the best results for a given problem.
Decision tree
This algorithm is the most accurate for many problems. It operates by building a decision tree beginning with the All node, corresponding to all the training cases (see Figure 76-4). Then, an attribute is chosen that best splits those cases into groups, and each of those groups is examined for an attribute that best splits those cases, and so on. The goal is to generate leaf nodes with a single predictable outcome. For example, if the goal is to identify who will purchase a bike, then leaf nodes should contain cases that are either bike buyers or not bike buyers, but no combinations (or as close to that goal as possible).
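Although this chapter builds its models through the wizard, a model of this kind can also be defined directly in DMX. The following sketch is illustrative only; the model name, the column list, and the COMPLEXITY_PENALTY value (which discourages overly deep trees) are assumptions rather than the chapter's exact definitions:
CREATE MINING MODEL [Bike Buyer Tree]      -- hypothetical model name
(
    [Customer Key]       LONG   KEY,
    [Age]                LONG   CONTINUOUS,
    [Yearly Income]      DOUBLE CONTINUOUS,
    [Number Cars Owned]  LONG   DISCRETE,
    [Commute Distance]   TEXT   DISCRETE,
    [Bike Buyer]         LONG   DISCRETE PREDICT   -- the predictable column
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.5)
The model would then be trained with an INSERT INTO statement (or by processing the containing structure) before any prediction joins are issued against it.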
FIGURE 76-4
Decision Tree Viewer
The Decision Tree Viewer shown in Figure 76-4 graphically displays the resulting tree. Age is the first attribute chosen in this example, splitting cases into groups such as under 35, 35 to 42, and so on. For the under-35 crowd, Number Cars Owned was chosen to further split the cases, while Commute Distance was chosen for the 56 to 70 cases. The Mining Legend pane displays the details of any selected node, including how the cases break out by the predictable variable (in this case, 796 buyers and 1,538 non-buyers) both in count and probability. Many more node levels can be expanded using the Show Level control in the toolbar or the expansion controls (+/-) on each node. Note that much of the tree is not expanded in this figure due to space restrictions.
The Dependency Network Viewer is also available for decision trees, displaying both input and predictable columns as nodes, with arrows indicating what predicts what. Move the slider to the bottom to see only the most significant predictions. Click on a node to highlight its relationships.
Linear regression
The linear regression algorithm is implemented as a variant of decision trees and is a good choice for continuous data that relates more or less linearly. The result of the regression is an equation in the form

Y = B0 + A1 ∗ (X1 + B1) + A2 ∗ (X2 + B2) + ...

where Y is the column being predicted, the Xi are the input columns, and the Ai/Bi are constants determined by the regression. Because this algorithm is a special case of decision trees, it shares the same mining viewers. While, by definition, the Tree Viewer will show a single All node, the Mining Legend pane displays the prediction equation. The equation can be either used directly or queried in the mining model via the Predict function. The Dependency Network Viewer provides a graphical interpretation of the weights used in the equation.
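A minimal DMX sketch of such a model follows; the names are hypothetical, and the REGRESSOR flag marks a continuous input column as a candidate for the regression formula:
CREATE MINING MODEL [Income Estimate]      -- hypothetical model name
(
    [Customer Key]    LONG   KEY,
    [Age]             LONG   CONTINUOUS REGRESSOR,
    [Total Children]  LONG   CONTINUOUS REGRESSOR,
    [Yearly Income]   DOUBLE CONTINUOUS PREDICT    -- the value being estimated
)
USING Microsoft_Linear_Regression
Once trained, a singleton query calling Predict([Yearly Income]) would return the estimate produced by the equation for the supplied inputs.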
Clustering
The clustering algorithm functions by gathering similar cases together into groups called clusters and then iteratively refining the cluster definition until no further improvement can be gained. This approach makes clustering uniquely suited for segmentation/profiling of populations. Several viewers display data from the finished model:
■ Cluster Diagram: This viewer displays each cluster as a shaded node with connecting lines between similar clusters — the darker the line, the more similar the clusters. Move the slider to the bottom to see only lines connecting the most similar clusters. Nodes are shaded darker to represent more cases. By default, the cases are counted from the entire population, but changing the Shading Variable and State pull-downs specifies shading to be based on particular variable values (e.g., which clusters contain homeowners).
■ Cluster Profiles: Unlike node shading in the Cluster Diagram Viewer, where one variable value can be examined at a time, the Cluster Profiles Viewer shows all variables and clusters in a single matrix. Each cell of the matrix is a graphical representation of that variable’s distribution in the given cluster (see Figure 76-5). Discrete variables are shown as stacked bars describing how many cases contain each of the possible variable values. Continuous variables are shown as diamond charts, with each diamond centered on the mean (average) value for cases in that cluster, while the top and bottom of the diamond are the mean +/− the standard deviation, respectively. Thus, the taller the diamond, the less uniform the variable values in that cluster. Click on a cell (chart) to see the full distribution for a cluster/variable combination in the Mining Legend, or hover over a cell for the same information in a tooltip. In Figure 76-5, the tooltip displayed shows the full population’s occupation distribution, while the Mining Legend shows Cluster 3’s total children distribution.
■ Cluster Characteristics: This view displays the list of characteristics that make up a cluster and the probability that each characteristic will appear.
■ Cluster Discrimination: Similar to the Characteristics Viewer, this shows which characteristics favor one cluster versus another. It also enables the comparison of a cluster to its own complement, clearly showing what is and is not in a given cluster.
Once you gain a better understanding of the clusters for a given model, it is often useful to rename each cluster to something more descriptive than the default “Cluster n.” From within either the Diagram or Profiles Viewer, right-click on a cluster and choose Rename Cluster to give it a new name.
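Cluster assignments can also be retrieved directly in DMX. The following singleton sketch assumes a clustering model named [TM Clustering] with inputs similar to the earlier examples; the model name and input values are illustrative:
SELECT
  Cluster() AS [Assigned Cluster],          -- most likely cluster for this case
  ClusterProbability() AS [Probability]     -- probability the case belongs to that cluster
FROM [TM Clustering]
NATURAL PREDICTION JOIN
(SELECT 47 AS [Age], 'M' AS [Marital Status],
        2 AS [Number Cars Owned], 80000 AS [Yearly Income]) AS t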
FIGURE 76-5
Cluster Profiles Viewer
Sequence clustering
As the name implies, this algorithm still gathers cases together into clusters, but based on a sequence of events or items, rather than on case attributes. For example, the sequence of web pages visited during user sessions can be used to define the most common paths through that website.
The nature of this algorithm requires input data with a nested table, whereby the parent row is the session or order (e.g., shopping cart ID) and the nested table contains the sequence of events during that session (e.g., order line items). In addition, the nested table’s key column must be marked as a Key Sequence content type in the mining structure.
Once the model is trained, the same four cluster viewers described above are available to describe the characteristics of each. In addition, the State Transition Viewer displays transitions between two items (e.g., a pair of web pages), along with the probability of that transition happening. Move the slider to the bottom to see only the most likely transitions. Select a node to highlight the possible transitions from that item to its possible successors. The short arrows that don’t connect to a second node denote a state that can be its own successor.
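Predictions against a sequence clustering model use the PredictSequence function listed earlier. The sketch below is illustrative only: the model name [Web Sequence Clustering], the nested table [Page Visits], and its columns are assumptions standing in for whatever sequence-keyed nested table the structure actually defines:
-- Given the pages already visited, predict the three most likely next pages
SELECT FLATTENED
  PredictSequence([Page Visits], 3)
FROM [Web Sequence Clustering]
NATURAL PREDICTION JOIN
(SELECT (SELECT 1 AS [Sequence ID], 'Home Page' AS [Page]
         UNION SELECT 2 AS [Sequence ID], 'Product List' AS [Page]) AS [Page Visits]) AS t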
Neural Network
This famous algorithm is generally slower than other alternatives, but often handles more complex situations. The network is built using input, hidden (middle), and output layers of neurons, whereby the output of each layer becomes the input of the next layer. Each neuron accepts inputs that are combined using weighted functions that determine the output. Training the network consists of determining the weights for each neuron.
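As with the other algorithms, such a model can be created directly in DMX. This is a purely illustrative sketch (the names are invented); HIDDEN_NODE_RATIO is the algorithm parameter that scales how many hidden-layer neurons are built:
CREATE MINING MODEL [Bike Buyer NN]        -- hypothetical model name
(
    [Customer Key]      LONG   KEY,
    [Age]               LONG   CONTINUOUS,
    [Yearly Income]     DOUBLE CONTINUOUS,
    [Commute Distance]  TEXT   DISCRETE,
    [Bike Buyer]        LONG   DISCRETE PREDICT
)
USING Microsoft_Neural_Network (HIDDEN_NODE_RATIO = 4.0)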
The Neural Network Viewer presents a list of characteristics (variable/value combinations) and how those characteristics favor given outputs (outcomes). Choose the two outcomes being compared in the Output area at the upper right (see Figure 76-6). Leaving the Input area in the upper left blank compares characteristics for the entire population, whereas specifying a combination of input values allows a portion of the population to be explored. For example, Figure 76-6 displays the characteristics that affect the buying decisions of adults less than 36 years of age with no children.
FIGURE 76-6
Neural Network Viewer
Logistic regression
Logistic regression is a special case of the neural network algorithm whereby no hidden layer of neurons is built. While logistic regression can be used for many tasks, it is especially suited for estimation problems for which linear regression would be a good fit. However, because the predicted value is discrete, the linear approach tends to predict values outside the allowed range — for example, predicting probabilities over 100 percent for a certain combination of inputs.
Because it is derived from the neural network algorithm, logistic regression shares the same viewer.
Naive Bayes
Naive Bayes is a very fast algorithm with accuracy that is adequate for many applications. It does not, however, operate on continuous variables. The Naive portion of its name derives from this algorithm’s assumption that every input is independent. For example, the probability of a married person purchasing a bike is computed from how often married and bike buyer appear together in the training data without considering any other columns. The probability of a new case is just the normalized product of the individual probabilities.
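As a rough sketch of that computation (the column values shown are illustrative, not taken from the chapter's model), the score for each outcome is the product of the per-input probabilities observed in the training data:

P(Bike Buyer = 1 | inputs) ∝ P(Bike Buyer = 1) ∗ P(Married | Bike Buyer = 1) ∗ P(2 Cars Owned | Bike Buyer = 1) ∗ ...

The same product is computed for Bike Buyer = 0, and the two scores are then normalized so they sum to 1.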
Several viewers display data from the finished model:
■ Dependency Network: Displays both input and predictable columns as nodes with arrows indicating what predicts what; a simple example is shown in Figure 76-7. Move the slider to the bottom to see only the most significant predictions. Click on a node to highlight its relationships.
■ Attribute Profiles: Similar in function to the Cluster Profiles Viewer, this shows all variables and predictable outcomes in a single matrix. Each cell of the matrix is a graphical representation of that variable’s distribution for a given outcome. Click on a cell (chart) to see the full distribution for that outcome/variable combination in the Mining Legend, or hover over a cell for the same information in a tooltip.
■ Attribute Characteristics: This viewer displays the list of characteristics associated with the selected outcome.
■ Attribute Discrimination: This viewer is similar to the Characteristics Viewer, but it shows which characteristics favor one outcome versus another.
Association rules
This algorithm operates by finding attributes that appear together in cases with sufficient frequency to be significant. These attribute groupings are called itemsets, which are in turn used to build the rules used to generate predictions. While Association Rules can be used for many tasks, it is especially suited to market basket analysis. Generally, data will be prepared for market basket analysis using a nested table, whereby the parent row is a transaction (e.g., Order) and the nested table contains the individual items.
Three viewers provide insight into a trained model:
■ Rules: Similar in layout and controls to Itemsets, but lists rules instead of itemsets. Each rule has the form A, B → C, meaning that cases that contain A and B are likely to contain C (e.g., people who bought pasta and sauce also bought cheese). Each rule is listed with its probability (likelihood of occurrence) and importance (usefulness in performing predictions).
■ Itemsets: Displays the list of itemsets discovered in the training data, each with its associated size (number of items in the set) and support (number of training cases in which this set appears). Several controls for filtering the list are provided, including the Filter Itemset text box, which searches for any string entered (e.g., “Region = Europe” will display only itemsets that include that string).
■ Dependency Network: Similar to the Dependency Network used for other algorithms, with nodes representing items in the market basket analysis. Note that nodes have a tendency to predict each other (dual-headed arrows). The slider will hide the less probable (not the less important) associations. Select a node to highlight its related nodes.
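Recommendations are generated from an association model with the PredictAssociation function listed earlier. The following sketch is illustrative only: the model name [Market Basket], the nested table [Order Items], and the product names are assumptions rather than objects defined in this chapter:
-- Recommend the three items most likely to accompany the items already in the basket
SELECT FLATTENED
  PredictAssociation([Market Basket].[Order Items], 3)
FROM [Market Basket]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'Mountain Bottle Cage' AS [Model]
         UNION SELECT 'Water Bottle' AS [Model]) AS [Order Items]) AS t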
FIGURE 76-7
Naive Bayes Dependency Network Viewer
Time series
The time series algorithm predicts the future values for a series of continuous data points (e.g., web traffic for the next six months given traffic history). Unlike the algorithms already presented, prediction does not require new cases on which to base the prediction, just the number of steps to extend the series into the future. Input data must contain a time key to provide the algorithm’s time attribute. Time keys can be defined using date, double, or long columns.
Once the algorithm has run, it generates a decision tree for each series being forecast. The decision tree defines one or more regions in the forecast and an equation for each region, which can be reviewed using the Decision Tree Viewer. For example, a node may be labeled Widget.Sales-4 < 10,000, which is interpreted as “use the equation in this node when widget sales from four time-steps back is less than 10,000.” Selecting a node will display two associated equations in the Mining Legend, and hovering over the node will display the equation as a tooltip — SQL Server 2008 added the second equation, providing better long-term forecasts by blending these different estimation techniques.
Note the Tree pull-down at the top of the viewer that enables the models for different series to be examined. Each node also displays a diamond chart whose width denotes the variance of the predicted attribute at that node. In other words, the narrower the diamond chart, the more accurate the prediction.
The second Time Series Viewer, labeled simply Charts, plots the actual and predicted values of the selected series over time. Choose the series to be plotted from the drop-down list in the upper-right corner of the chart. Use the Abs button to toggle between absolute (series) units and relative (percent change) values. The Show Deviations check box will add error bars to display expected variations on the predicted values, and the Prediction Steps control sets the number of predictions displayed. Drag the mouse to highlight the horizontal portion of interest and then click within the highlighted area to zoom into that region. Undo a zoom with the zoom controls on the toolbar.
Because prediction is not case based, the Mining Accuracy Chart does not function for this algorithm. Instead, keep later periods out of the training data and compare predicted values against the test data’s actuals.
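In DMX, such a forecast is a single query: no input cases are joined, only the number of future steps is requested. The model name [Sales Forecast] and the column [Amount] below are assumptions used for illustration:
-- Forecast the next six periods of the Amount series
SELECT FLATTENED
  PredictTimeSeries([Sales Forecast].[Amount], 6)
FROM [Sales Forecast]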
Cube Integration
Data mining can use Analysis Services cube data as input instead of using a relational table (see the first
page of the Data Mining Wizard section earlier in this chapter); cube data behaves much the same as
relational tables, with some important differences:
■ Whereas a relational table can be included from most any data source, the cube and the
mining structure that references it must be defined within the same project
■ The case “table” is defined by a single dimension and its related measure groups. When additional data mining attributes are needed, add them via a nested table
■ When selecting mining structure keys for a relational table, the usual choice is the primary key of the table. Choose mining structure keys from dimension data at the highest (least granular) level possible. For example, generating a quarterly forecast requires that quarter be chosen as the key time attribute, not the time dimension’s key (which is likely day or hour)
■ Data and content type defaults tend to be less reliable for cube data, so review and adjust type
properties as needed
■ Some dimension attributes based on numeric or date data may appear to the data mining interface with a text data type. A little background is required to understand why this happens: When a dimension is built, it is required to have the Key column property specified