■ Input columns are too case-specific (e.g., IDs, names, etc.). Adjust the mining structure to ignore data items containing values that occur in the training data but then never reappear for test or production data.
■ Too few rows (cases) in the training data set to accurately characterize the population of cases. Look for additional sources of data for best results. If additional data is not available, then better results may be obtained by limiting the special cases considered by an algorithm (e.g., increasing the MINIMUM_SUPPORT parameter).
■ If all models are closer to the Random Guess line than the Ideal Model line, then the input data does not correlate with the outcome being predicted.
Note that some algorithms, such as Time_Series, do not support the Mining Accuracy Chart view at all. Regardless of the tools available within the development environment, it is important to perform an evaluation of the trained model using test data held in reserve for that purpose. Then, modify the data and model definitions until the results meet the business goals at hand.
Deploying
Several methods are available for interfacing applications with data mining functionality:
■ Directly constructing XMLA, communicating with Analysis Services via SOAP. This exposes all functionality at the price of in-depth programming.
■ Analysis Management Objects (AMO) provides an environment for creating and managing mining structures and other metadata, but not for prediction queries.
■ The Data Mining Extensions (DMX) language supports most model creation and training tasks and has a robust prediction query capability. DMX can be sent to Analysis Services via the following:
■ ADOMD.NET for managed (.NET) languages
■ OLE DB for C++ code
■ ADO for other languages
DMX is a SQL-like language modified to accommodate mining structures and tasks. For purposes of performing prediction queries against a trained model, the primary language feature is the prediction join. As the following code example shows, the prediction join relates a mining model and a set of data to be predicted (cases). Because the DMX query is issued against the Analysis Services database, the model [TM Decision Tree] can be directly referenced, while the cases must be gathered via an OPENQUERY call against the relational database. The corresponding columns are matched in the ON clause like a standard relational join, and the WHERE and ORDER BY clauses function as expected.
DMX also adds a number of mining-specific functions such as the Predict and PredictProbability functions shown here, which return the most likely outcome and the probability of that outcome, respectively. Overall, this example returns a list of IDs, names, and probabilities for prospects who are more than 60 percent likely to purchase a bike, sorted by descending probability:
SELECT t.ProspectAlternateKey, t.FirstName, t.LastName,
  PredictProbability([TM Decision Tree].[Bike Buyer]) AS Prob
FROM [TM Decision Tree]
PREDICTION JOIN OPENQUERY([Adventure Works DW],
  'SELECT ProspectAlternateKey, FirstName, LastName, MaritalStatus,
   Gender, YearlyIncome, TotalChildren, NumberChildrenAtHome,
   Education, Occupation, HouseOwnerFlag, NumberCarsOwned,
   StateProvinceCode
   FROM dbo.ProspectiveBuyer;') AS t
ON
[TM Decision Tree].[Marital Status] = t.MaritalStatus AND
[TM Decision Tree].Gender = t.Gender AND
[TM Decision Tree].[Yearly Income] = t.YearlyIncome AND
[TM Decision Tree].[Total Children] = t.TotalChildren AND
[TM Decision Tree].[Number Children At Home] = t.NumberChildrenAtHome AND
[TM Decision Tree].Education = t.Education AND
[TM Decision Tree].Occupation = t.Occupation AND
[TM Decision Tree].[House Owner Flag] = t.HouseOwnerFlag AND
[TM Decision Tree].[Number Cars Owned] = t.NumberCarsOwned AND
[TM Decision Tree].Region = t.StateProvinceCode
WHERE PredictProbability([TM Decision Tree].[Bike Buyer]) > 0.60
AND Predict([TM Decision Tree].[Bike Buyer])=1
ORDER BY PredictProbability([TM Decision Tree].[Bike Buyer]) DESC
Another useful form of the prediction join is a singleton query, whereby data is provided directly by the application instead of read from a relational table, as shown in the next example. Because the names exactly match those of the mining model, a NATURAL PREDICTION JOIN is used, not requiring an ON clause. This example returns the probability that the listed case will purchase a bike (i.e., [Bike Buyer] = 1):
SELECT
PredictProbability([TM Decision Tree].[Bike Buyer],1)
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 47 AS [Age], '2-5 Miles' AS [Commute Distance],
 'Graduate Degree' AS [Education], 'M' AS [Gender],
 '1' AS [House Owner Flag], 'M' AS [Marital Status],
 2 AS [Number Cars Owned], 0 AS [Number Children At Home],
 'Professional' AS [Occupation], 'North America' AS [Region],
 0 AS [Total Children], 80000 AS [Yearly Income]) AS t
Business Intelligence Development Studio aids in the construction of DMX queries via the Query Builder within the mining model prediction view. Just like the Mining Accuracy Chart, select the model and case table to be queried, or alternatively press the singleton button in the toolbar to specify values. Specify SELECT columns and prediction functions in the grid at the bottom. SQL Server Management Studio also offers a DMX query type with metadata panes for drag-and-drop access to mining structure column names and prediction functions.
Numerous prediction functions are available, including the following:
■ Predict: Returns the expected outcome for a predictable column
■ PredictProbability: Returns the probability (between 0 and 1) of the expected outcome, or of a specific outcome if specified
■ PredictSupport: Returns the number of training cases on which the expected outcome is based, or on which a specific outcome is based if specified
■ PredictHistogram: Returns a nested table with all possible outcomes for a given case, listing probability, support, and other information for each outcome
■ Cluster: Returns the cluster to which a case is assigned (clustering algorithm specific)
■ ClusterProbability: Returns the probability the case belongs to a given cluster (clustering algorithm specific)
■ PredictSequence: Predicts the next values in a sequence (sequence clustering algorithm specific)
■ PredictAssociation: Predicts associative membership (association algorithm specific)
■ PredictTimeSeries: Predicts future values in a time series (time series algorithm specific). Like PredictHistogram, this function returns a nested table
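For example, the following query sketch combines several of these functions against the [TM Decision Tree] model used earlier in this chapter; the singleton input values are illustrative only, and FLATTENED expands the nested table returned by PredictHistogram into ordinary rows:
SELECT FLATTENED
  Predict([TM Decision Tree].[Bike Buyer]) AS Outcome,
  PredictProbability([TM Decision Tree].[Bike Buyer]) AS Prob,
  PredictHistogram([TM Decision Tree].[Bike Buyer])   -- nested table of all possible outcomes
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], 'S' AS [Marital Status],
        1 AS [Number Cars Owned], 60000 AS [Yearly Income]) AS t
Each row of the flattened result repeats the predicted outcome and its probability alongside one entry from the histogram.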
Algorithms
When working with data mining, it is useful to understand mining algorithm basics and when to apply
each algorithm. Table 76-2 summarizes common algorithm usage for the problem categories presented at the beginning of this chapter.
TABLE 76-2
Common Mining Algorithm Usage

Problem Type        Primary Algorithms
Segmentation        Clustering, Sequence Clustering
Classification      Decision Trees, Naive Bayes, Neural Network, Logistic Regression
Association         Association Rules, Decision Trees
Estimation          Decision Trees, Linear Regression, Logistic Regression, Neural Network
Forecasting         Time Series
Sequence Analysis   Sequence Clustering
These usage guidelines are useful as an orientation, but not every data mining problem falls neatly into
one of these types, and other algorithms will work for several of these problem types. Fortunately, with evaluation tools such as the lift chart, it’s usually simple to identify which algorithm provides the best results for a given problem.
Decision tree
This algorithm is the most accurate for many problems. It operates by building a decision tree beginning with the All node, corresponding to all the training cases (see Figure 76-4). Then, an attribute is chosen that best splits those cases into groups, and each of those groups is examined for an attribute that best splits those cases, and so on. The goal is to generate leaf nodes with a single predictable outcome. For example, if the goal is to identify who will purchase a bike, then leaf nodes should contain cases that are either bike buyers or not bike buyers, but no combinations (or as close to that goal as possible).
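Although this chapter builds its models through the wizard, a model of this kind can also be defined directly in DMX. The following sketch is illustrative only; the model name, the column list, and the COMPLEXITY_PENALTY value (which discourages overly deep trees) are assumptions rather than the chapter's exact definitions:
CREATE MINING MODEL [Bike Buyer Tree]      -- hypothetical model name
(
    [Customer Key]       LONG   KEY,
    [Age]                LONG   CONTINUOUS,
    [Yearly Income]      DOUBLE CONTINUOUS,
    [Number Cars Owned]  LONG   DISCRETE,
    [Commute Distance]   TEXT   DISCRETE,
    [Bike Buyer]         LONG   DISCRETE PREDICT   -- the predictable column
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.5)
The model would then be trained with an INSERT INTO statement (or by processing the containing structure) before any prediction joins are issued against it.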
FIGURE 76-4
Decision Tree Viewer
The Decision Tree Viewer shown in Figure 76-4 graphically displays the resulting tree. Age is the first attribute chosen in this example, splitting cases into groups such as under 35, 35 to 42, and so on. For the under-35 crowd, Number Cars Owned was chosen to further split the cases, while Commute Distance was chosen for the 56 to 70 cases. The Mining Legend pane displays the details of any selected node, including how the cases break out by the predictable variable (in this case, 796 buyers and 1,538 non-buyers) both in count and probability. Many more node levels can be expanded using the Show Level control in the toolbar or the expansion controls (+/-) on each node. Note that much of the tree is not expanded in this figure due to space restrictions.
The Dependency Network Viewer is also available for decision trees, displaying both input and predictable columns as nodes, with arrows indicating what predicts what. Move the slider to the bottom to see only the most significant predictions. Click on a node to highlight its relationships.
Linear regression
The linear regression algorithm is implemented as a variant of decision trees and is a good choice for continuous data that relates more or less linearly. The result of the regression is an equation in the form

Y = B0 + A1 ∗ (X1 + B1) + A2 ∗ (X2 + B2) + ...

where Y is the column being predicted, the Xi are the input columns, and the Ai/Bi are constants determined by the regression. Because this algorithm is a special case of decision trees, it shares the same mining viewers. While, by definition, the Tree Viewer will show a single All node, the Mining Legend pane displays the prediction equation. The equation can be either used directly or queried in the mining model via the Predict function. The Dependency Network Viewer provides a graphical interpretation of the weights used in the equation.
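A minimal DMX sketch of such a model follows; the names are hypothetical, and the REGRESSOR flag marks a continuous input column as a candidate for the regression formula:
CREATE MINING MODEL [Income Estimate]      -- hypothetical model name
(
    [Customer Key]    LONG   KEY,
    [Age]             LONG   CONTINUOUS REGRESSOR,
    [Total Children]  LONG   CONTINUOUS REGRESSOR,
    [Yearly Income]   DOUBLE CONTINUOUS PREDICT    -- the value being estimated
)
USING Microsoft_Linear_Regression
Once trained, a singleton query calling Predict([Yearly Income]) would return the estimate produced by the equation for the supplied inputs.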
Clustering
The clustering algorithm functions by gathering similar cases together into groups called clusters and then iteratively refining the cluster definition until no further improvement can be gained. This approach makes clustering uniquely suited for segmentation/profiling of populations. Several viewers display data from the finished model:
■ Cluster Diagram: This viewer displays each cluster as a shaded node with connecting lines between similar clusters — the darker the line, the more similar the clusters. Move the slider to the bottom to see only lines connecting the most similar clusters. Nodes are shaded darker to represent more cases. By default, the cases are counted from the entire population, but changing the Shading Variable and State pull-downs specifies shading to be based on particular variable values (e.g., which clusters contain homeowners).
■ Cluster Profiles: Unlike node shading in the Cluster Diagram Viewer, where one variable value can be examined at a time, the Cluster Profiles Viewer shows all variables and clusters in a single matrix. Each cell of the matrix is a graphical representation of that variable’s distribution in the given cluster (see Figure 76-5). Discrete variables are shown as stacked bars describing how many cases contain each of the possible variable values. Continuous variables are shown as diamond charts, with each diamond centered on the mean (average) value for cases in that cluster, while the top and bottom of the diamond are the mean +/− the standard deviation, respectively. Thus, the taller the diamond, the less uniform the variable values in that cluster. Click on a cell (chart) to see the full distribution for a cluster/variable combination in the Mining Legend, or hover over a cell for the same information in a tooltip. In Figure 76-5, the tooltip displayed shows the full population’s occupation distribution, while the Mining Legend shows Cluster 3’s total children distribution.
■ Cluster Characteristics: This view displays the list of characteristics that make up a cluster and the probability that each characteristic will appear.
■ Cluster Discrimination: Similar to the Characteristics Viewer, this shows which characteristics favor one cluster versus another. It also enables the comparison of a cluster to its own complement, clearly showing what is and is not in a given cluster.
Once you gain a better understanding of the clusters for a given model, it is often useful to rename each cluster to something more descriptive than the default “Cluster n.” From within either the Diagram or Profiles Viewer, right-click on a cluster and choose Rename Cluster to give it a new name.
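Cluster assignments can also be retrieved directly in DMX. The following singleton sketch assumes a clustering model named [TM Clustering] with inputs similar to the earlier examples; the model name and input values are illustrative:
SELECT
  Cluster() AS [Assigned Cluster],          -- most likely cluster for this case
  ClusterProbability() AS [Probability]     -- probability the case belongs to that cluster
FROM [TM Clustering]
NATURAL PREDICTION JOIN
(SELECT 47 AS [Age], 'M' AS [Marital Status],
        2 AS [Number Cars Owned], 80000 AS [Yearly Income]) AS t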
FIGURE 76-5
Cluster Profiles Viewer
Sequence clustering
As the name implies, this algorithm still gathers cases together into clusters, but based on a sequence of events or items, rather than on case attributes. For example, the sequence of web pages visited during user sessions can be used to define the most common paths through that website.
The nature of this algorithm requires input data with a nested table, whereby the parent row is the session or order (e.g., shopping cart ID) and the nested table contains the sequence of events during that session (e.g., order line items). In addition, the nested table’s key column must be marked as a Key Sequence content type in the mining structure.
Once the model is trained, the same four cluster viewers described above are available to describe the characteristics of each. In addition, the State Transition Viewer displays transitions between two items (e.g., a pair of web pages), along with the probability of that transition happening. Move the slider to the bottom to see only the most likely transitions. Select a node to highlight the possible transitions from that item to its possible successors. The short arrows that don’t connect to a second node denote a state that can be its own successor.
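Predictions against a sequence clustering model use the PredictSequence function listed earlier. The sketch below is illustrative only: the model name [Web Sequence Clustering], the nested table [Page Visits], and its columns are assumptions standing in for whatever sequence-keyed nested table the structure actually defines:
-- Given the pages already visited, predict the three most likely next pages
SELECT FLATTENED
  PredictSequence([Page Visits], 3)
FROM [Web Sequence Clustering]
NATURAL PREDICTION JOIN
(SELECT (SELECT 1 AS [Sequence ID], 'Home Page' AS [Page]
         UNION SELECT 2 AS [Sequence ID], 'Product List' AS [Page]) AS [Page Visits]) AS t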
Neural Network
This famous algorithm is generally slower than other alternatives, but often handles more complex situations. The network is built using input, hidden (middle), and output layers of neurons, whereby the output of each layer becomes the input of the next layer. Each neuron accepts inputs that are combined using weighted functions that determine the output. Training the network consists of determining the weights for each neuron.
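As with the other algorithms, such a model can be created directly in DMX. This is a purely illustrative sketch (the names are invented); HIDDEN_NODE_RATIO is the algorithm parameter that scales how many hidden-layer neurons are built:
CREATE MINING MODEL [Bike Buyer NN]        -- hypothetical model name
(
    [Customer Key]      LONG   KEY,
    [Age]               LONG   CONTINUOUS,
    [Yearly Income]     DOUBLE CONTINUOUS,
    [Commute Distance]  TEXT   DISCRETE,
    [Bike Buyer]        LONG   DISCRETE PREDICT
)
USING Microsoft_Neural_Network (HIDDEN_NODE_RATIO = 4.0)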
The Neural Network Viewer presents a list of characteristics (variable/value combinations) and how those characteristics favor given outputs (outcomes). Choose the two outcomes being compared in the Output area at the upper right (see Figure 76-6). Leaving the Input area in the upper left blank compares characteristics for the entire population, whereas specifying a combination of input values allows a portion of the population to be explored. For example, Figure 76-6 displays the characteristics that affect the buying decisions of adults less than 36 years of age with no children.
FIGURE 76-6
Neural Network Viewer
Logistic regression
Logistic regression is a special case of the neural network algorithm whereby no hidden layer of neurons is built. While logistic regression can be used for many tasks, it is especially suited for estimation problems for which linear regression would be a good fit. However, because the predicted value is discrete, the linear approach tends to predict values outside the allowed range — for example, predicting probabilities over 100 percent for a certain combination of inputs.
Because it is derived from the neural network algorithm, logistic regression shares the same viewer.
Naive Bayes
Naive Bayes is a very fast algorithm with accuracy that is adequate for many applications. It does not, however, operate on continuous variables. The Naive portion of its name derives from this algorithm’s assumption that every input is independent. For example, the probability of a married person purchasing a bike is computed from how often married and bike buyer appear together in the training data without considering any other columns. The probability of a new case is just the normalized product of the individual probabilities.
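As a rough sketch of that computation (the column values shown are illustrative, not taken from the chapter's model), the score for each outcome is the product of the per-input probabilities observed in the training data:

P(Bike Buyer = 1 | inputs) ∝ P(Bike Buyer = 1) ∗ P(Married | Bike Buyer = 1) ∗ P(2 Cars Owned | Bike Buyer = 1) ∗ ...

The same product is computed for Bike Buyer = 0, and the two scores are then normalized so they sum to 1.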
Several viewers display data from the finished model:
■ Dependency Network: Displays both input and predictable columns as nodes with arrows indicating what predicts what; a simple example is shown in Figure 76-7. Move the slider to the bottom to see only the most significant predictions. Click on a node to highlight its relationships.
■ Attribute Profiles: Similar in function to the Cluster Profiles Viewer, this shows all variables and predictable outcomes in a single matrix. Each cell of the matrix is a graphical representation of that variable’s distribution for a given outcome. Click on a cell (chart) to see the full distribution for that outcome/variable combination in the Mining Legend, or hover over a cell for the same information in a tooltip.
■ Attribute Characteristics: This viewer displays the list of characteristics associated with the selected outcome.
■ Attribute Discrimination: This viewer is similar to the Characteristics Viewer, but it shows which characteristics favor one outcome versus another.
Association rules
This algorithm operates by finding attributes that appear together in cases with sufficient frequency to be significant. These attribute groupings are called itemsets, which are in turn used to build the rules used to generate predictions. While Association Rules can be used for many tasks, it is especially suited to market basket analysis. Generally, data will be prepared for market basket analysis using a nested table, whereby the parent row is a transaction (e.g., Order) and the nested table contains the individual items.
Three viewers provide insight into a trained model:
■ Rules: Similar in layout and controls to Itemsets, but lists rules instead of itemsets. Each rule has the form A, B → C, meaning that cases that contain A and B are likely to contain C (e.g., people who bought pasta and sauce also bought cheese). Each rule is listed with its probability (likelihood of occurrence) and importance (usefulness in performing predictions).
■ Itemsets: Displays the list of itemsets discovered in the training data, each with its associated size (number of items in the set) and support (number of training cases in which this set appears). Several controls for filtering the list are provided, including the Filter Itemset text box, which searches for any string entered (e.g., “Region = Europe” will display only itemsets that include that string).
■ Dependency Network: Similar to the Dependency Network used for other algorithms, with nodes representing items in the market basket analysis. Note that nodes have a tendency to predict each other (dual-headed arrows). The slider will hide the less probable (not the less important) associations. Select a node to highlight its related nodes.
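Recommendations are generated from an association model with the PredictAssociation function listed earlier. The following sketch is illustrative only: the model name [Market Basket], the nested table [Order Items], and the product names are assumptions rather than objects defined in this chapter:
-- Recommend the three items most likely to accompany the items already in the basket
SELECT FLATTENED
  PredictAssociation([Market Basket].[Order Items], 3)
FROM [Market Basket]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'Mountain Bottle Cage' AS [Model]
         UNION SELECT 'Water Bottle' AS [Model]) AS [Order Items]) AS t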
FIGURE 76-7
Naive Bayes Dependency Network Viewer
Time series
The time series algorithm predicts the future values for a series of continuous data points (e.g., web traffic for the next six months given traffic history). Unlike the algorithms already presented, prediction does not require new cases on which to base the prediction, just the number of steps to extend the series into the future. Input data must contain a time key to provide the algorithm’s time attribute. Time keys can be defined using date, double, or long columns.
Once the algorithm has run, it generates a decision tree for each series being forecast. The decision tree defines one or more regions in the forecast and an equation for each region, which can be reviewed using the Decision Tree Viewer. For example, a node may be labeled Widget.Sales-4 < 10,000, which is interpreted as “use the equation in this node when widget sales from four time-steps back is less than 10,000.” Selecting a node will display two associated equations in the Mining Legend, and hovering over the node will display the equation as a tooltip — SQL Server 2008 added the second equation, providing better long-term forecasts by blending these different estimation techniques.
Note the Tree pull-down at the top of the viewer that enables the models for different series to be examined. Each node also displays a diamond chart whose width denotes the variance of the predicted attribute at that node. In other words, the narrower the diamond chart, the more accurate the prediction.
The second Time Series Viewer, labeled simply Charts, plots the actual and predicted values of the selected series over time. Choose the series to be plotted from the drop-down list in the upper-right corner of the chart. Use the Abs button to toggle between absolute (series) units and relative (percent change) values. The Show Deviations check box will add error bars to display expected variations on the predicted values, and the Prediction Steps control sets the number of predictions displayed. Drag the mouse to highlight the horizontal portion of interest and then click within the highlighted area to zoom into that region. Undo a zoom with the zoom controls on the toolbar.
Because prediction is not case based, the Mining Accuracy Chart does not function for this algorithm. Instead, keep later periods out of the training data and compare predicted values against the test data’s actuals.
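In DMX, such a forecast is a single query: no input cases are joined, only the number of future steps is requested. The model name [Sales Forecast] and the column [Amount] below are assumptions used for illustration:
-- Forecast the next six periods of the Amount series
SELECT FLATTENED
  PredictTimeSeries([Sales Forecast].[Amount], 6)
FROM [Sales Forecast]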
Cube Integration
Data mining can use Analysis Services cube data as input instead of using a relational table (see the first
page of the Data Mining Wizard section earlier in this chapter); cube data behaves much the same as
relational tables, with some important differences:
■ Whereas a relational table can be included from most any data source, the cube and the
mining structure that references it must be defined within the same project
■ The case “table” is defined by a single dimension and its related measure groups. When additional data mining attributes are needed, add them via a nested table
■ When selecting mining structure keys for a relational table, the usual choice is the primary key of the table. Choose mining structure keys from dimension data at the highest (least granular) level possible. For example, generating a quarterly forecast requires that quarter be chosen as the key time attribute, not the time dimension’s key (which is likely day or hour)
■ Data and content type defaults tend to be less reliable for cube data, so review and adjust type
properties as needed
■ Some dimension attributes based on numeric or date data may appear to the data mining interface with a text data type. A little background is required to understand why this happens: When a dimension is built, it is required to have the Key column property specified