FIGURE 51.57 Creating KPIs in the cube designer
As you create dimensions, you can even choose a data mining model as the basis for a dimension.
Basically, a data mining model is a reference structure that represents the grouping and predictive analysis of relational or multidimensional data. It is composed of rules, patterns, and other statistical information about the data it analyzed. The units of data being analyzed are called cases. A case set is simply a means of viewing the physical data, and different case sets can be constructed from the same physical data; basically, a case is defined from a particular point of view. If the algorithm you are using supports the view, you can use mining models to make predictions based on these findings.
Another aspect of a data mining model is its use of training data. This process determines the relative importance of each attribute in a data mining model. It does this by recursively partitioning data into smaller groups until no more splitting can occur. During this partitioning process, information is gathered from the attributes used to determine the split, and a probability can be established for each categorization of data in these splits. This type of data can be used to help determine factors about other data, utilizing these probabilities. This training data, in the form of dimensions, levels, member properties, and measures, is used to process the OLAP data mining model and further define the data mining column structure for the case set.
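The recursive-partitioning idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not SSAS code: the attribute names and toy cases are invented, and a real decision-tree algorithm would recurse on each resulting group until no further split improves things.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction gained by partitioning the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy cases: did the customer buy the product?
rows = [
    {"region": "west", "segment": "retail"},
    {"region": "west", "segment": "wholesale"},
    {"region": "east", "segment": "retail"},
    {"region": "east", "segment": "retail"},
]
labels = ["yes", "yes", "no", "no"]

# 'region' separates the labels perfectly, so it wins the split
gains = {a: information_gain(rows, labels, a) for a in ("region", "segment")}
best_split = max(gains, key=gains.get)
print(best_split)  # region
```

The probabilities mentioned above fall out of the same partitioning: within each split group, the label frequencies are the estimated probabilities used later for prediction.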
In SSAS, Microsoft provides several data mining algorithms (or techniques):

- Microsoft Association Rules identifies items that are likely to appear together in a transaction. The rules help predict when the presence of one item is likely with another item (which has appeared in the same type of transaction before).
- Microsoft Clustering groups the records of a dataset into clusters that contain similar characteristics. This is one of the best algorithms, and it can be used to find general groupings in data.
- Microsoft Sequence Clustering combines sequence analysis with clustering, and it identifies clusters of similarly ordered events in a sequence. The clusters can be used to predict the likely ordering of events in a sequence, based on known characteristics.
- Microsoft Decision Trees supports the prediction of both discrete and continuous attributes.
- Microsoft Linear Regression is a configuration variation of the Decision Trees algorithm, obtained by disabling splits. (The whole regression formula is built in a single root node.) The algorithm supports the prediction of continuous attributes.
- Microsoft Logistic Regression is a configuration variation of the Neural Network algorithm, obtained by eliminating the hidden layer. This algorithm supports the prediction of both discrete and continuous attributes.
- Microsoft Naive Bayes builds models that can be used for classification and predictive modeling. It supports only discrete attributes, and it considers all the input attributes to be independent, given the predictable attribute.
- Microsoft Neural Network uses multilayer networks to predict multiple attributes. It can be used for classification of discrete attributes as well as regression of continuous attributes.
- Microsoft Time Series is used to analyze time-related data, such as monthly sales data or yearly profits. The patterns it discovers can be used to predict values for future time steps across a time horizon.
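The clustering idea, finding natural groupings in multidimensional space, can be illustrated with a minimal k-means sketch in Python. K-means is a simple stand-in here, not the actual Microsoft Clustering implementation (which defaults to a more sophisticated expectation-maximization approach), and the sample points are invented.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=10):
    """Minimal k-means: assign each point to its nearest centroid, recompute."""
    # Deterministic init: spread starting centroids across the data
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groupings of (units_sold, returns) values
points = [(1, 2), (2, 1), (1, 1), (10, 12), (11, 11), (12, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```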
To create an OLAP data mining model, SSAS uses either an existing source OLAP cube or an existing relational database/data warehouse, a particular data mining technique/algorithm, a case dimension and level, a predicted entity, and, optionally, training data. The source OLAP cube provides the information needed to create a case set for the data mining model. You then select the data mining technique (decision tree, clustering, or one of the others). SSAS uses the dimension and level that you choose to establish key columns for the case sets. The case dimension and level provide a certain orientation for the data mining model into the cube for creating a case set. The predicted entity can be a measure from the source OLAP cube, a member property of the case dimension and level, or any member of another dimension in the source OLAP cube.
NOTE
The Data Mining Wizard can also create a new dimension for a source cube, and it enables users to query the data mining model data just as they would query OLAP data (using the DMX extensions to SQL or the mining structures browser).
In Visual Studio, you simply initiate the Data Mining Wizard by right-clicking the Mining Structures entry in the Solution Explorer. You cannot create new mining structures from SSMS. When you are past the wizard's splash screen, you have the option of creating your mining model from either an existing relational database (or data warehouse) or an existing OLAP cube (as shown in Figure 51.58).
You want to define a data mining model that can shed light on product (SKU) sales characteristics and that will be based on the data and structure you have created so far in your Comp Sales Unleashed cube. For this example, you choose to use the OLAP cube you already have (the existing cube method).
You must now select the data mining technique you think will help you find value in your cube's data. Clustering is probably the best one to start with because it finds natural groupings of data in a multidimensional space. It is useful when you want to see general groupings in your data, such as hot spots. You are trying to find just such things with sales of products (for example, things that sell together or belong together). Figure 51.59 shows the Microsoft Clustering data mining technique being selected.
Now you have to identify the source cube dimension to use to build the mining structure. As you can see in Figure 51.60, you choose Product Dimension to fit the mining intentions stated earlier.

You then select the case key, or point of view, for the mining analysis. Figure 51.61 illustrates the case to be based on the product dimension and at the SKU level (that is, the individual product level).
FIGURE 51.58 Selecting the definition method to use for the mining structure in the Data Mining Wizard
FIGURE 51.59 Using clustering to identify natural groups in the Data Mining Wizard
FIGURE 51.60 Identifying the product dimension as the basis for the mining structure in the
Data Mining Wizard
You now specify the attributes and measures as case-level columns of the new mining structure. Figure 51.62 shows the possible selections. You can simply choose all the data measures for this mining structure. Then you click the Next button.

As you can see in Figure 51.63, the next few wizard dialogs allow you to specify the mining structure columns' content and data types (use the defaults that were detected for most items unless we specifically describe something different), identify a filtered slice to use for the model training (you don't need one now because you want the whole cube), and finally identify the number of cases to be reserved for model testing (set the percentage of data for testing to about 33%).
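Reserving a percentage of cases for testing is a standard holdout split. A rough Python sketch of what that 33% setting amounts to (the cases here are invented placeholders, and the wizard's actual sampling mechanics may differ):

```python
import random

def holdout_split(cases, test_fraction=0.33, seed=42):
    """Shuffle the cases, reserving a fraction for model testing."""
    rng = random.Random(seed)       # fixed seed keeps the split repeatable
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)

cases = list(range(100))            # stand-ins for mining-model cases
train, test = holdout_split(cases)
print(len(train), len(test))        # 67 33
```

The held-out cases never influence training, so they give an honest measure of how well the model generalizes.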
The mining model is now specified and must be named and processed. Figure 51.64 shows what you have named the mining structure (Product Dimension MS) and the mining model itself (Product Dimension MM). Also, you select the Allow Drill Through option so you can look further into the data in the mining model after it is processed. Then you click the Finish button.
When the Data Mining Wizard is complete, the mining structure viewer pops up, with your mining structure's case-level column specification (on the center left) and its correlation to your cube (see Figure 51.65).
You must now process the mining structure to see what you come up with. You do this by selecting the Mining Model toolbar option and then the Process option. You then see the usual Process dialog, and you have to choose to run this (process the mining structure). After the mining structure processing completes, a quick click on the Cluster Diagram tab shows the results of the clustering analysis (see Figure 51.66). Notice that because you selected to allow drill through, you can simply right-click any of the clusters identified and see the data that is part of the cluster (by choosing Drill Through). This viewer clearly shows that there is some clustering of SKU values that might indicate products that sell together or belong together.

FIGURE 51.61 Identifying the basic unit of analysis for the mining model in the Data Mining Wizard

FIGURE 51.62 Specifying the measures for the mining model in the Data Mining Wizard

FIGURE 51.63 Specifying a column's content, slice filters, and model data training percentages
FIGURE 51.64 Naming the mining model and completing the Data Mining Wizard
FIGURE 51.65 Your new mining structure in the mining structure viewer
If you click the Cluster Profiles tab of this viewer, you see the data value profile characteristics that were processed (see Figure 51.67).

Figure 51.68 shows the clusters of data values for each data measure in the data mining model. This characteristic information gives you a good idea of what the actual data values are and how they cluster together.
FIGURE 51.66 Clustering results and drilling through to the data in the mining model viewer
FIGURE 51.67 Cluster data profiles in the mining model viewer
Finally, you can see the cluster node contents at the detail level by changing the mining model viewer type to the Microsoft Generic Content Tree Viewer, which is just below the Mining Model Viewer tab on top. Figure 51.69 shows the detail contents of each model node and its technical specification in a report format.

If you want, you can now build new cube dimensions that can help you do predictive modeling based on the findings of the data mining structures you just processed. In this way, you could predict sales units of one SKU and the number of naturally clustered SKUs quite easily (based on the past data mining analysis). This type of predictive modeling is very powerful.
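A rough sketch of how cluster findings can feed predictive modeling: assign a new SKU to its nearest cluster and borrow that cluster's typical measure value as the prediction. The centroids and measures below are invented placeholders, not values from the Comp Sales Unleashed cube, and SSAS does this internally rather than through code like this.

```python
def nearest_cluster(point, centroids):
    """Index of the centroid closest to the point (squared distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

# Hypothetical centroids over (returns, discount_rate) from a processed model,
# plus the average units sold observed in each cluster
centroids = [(4.0, 0.10), (1.0, 0.30)]
cluster_avg_units = [120.0, 15.0]

new_sku = (5.0, 0.12)               # known measures for a new SKU
c = nearest_cluster(new_sku, centroids)
predicted_units = cluster_avg_units[c]
print(c, predicted_units)           # 0 120.0
```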
FIGURE 51.68 Cluster characteristics of the data values for each measure in the mining
model viewer
FIGURE 51.69 The Microsoft Generic Content Tree Viewer of the cluster nodes in the mining
model viewer
SSIS
SSIS provides a robust means to move data between sources and targets. Data can be exported, validated, cleaned up, consolidated, transformed, and then imported into a destination of any kind. With any OLAP/SSAS implementation, you will undoubtedly have to transform, clean, or preprocess data in some way. You can now tap into SSIS capabilities from within the SSAS platform.
You can combine multiple column values into a single calculated destination column or divide values from a single source column into multiple destination columns. You might also need to translate values in operational systems. For example, many OLTP systems use product codes stored as numeric data, and few people are willing to memorize an entire collection of product codes. An entry of 100235 for a type of shampoo in a product dimension table is useless to a vice president of marketing who is interested in how much of that shampoo was sold in California in the past quarter.
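In Python terms, such a translation step amounts to a lookup against the product dimension. The codes and names below are invented examples; an SSIS package would express the same idea as a Lookup transformation rather than code.

```python
# Hypothetical lookup built from the product dimension table
product_names = {
    100235: "Silky Shine Shampoo 500ml",
    100236: "Silky Shine Conditioner 500ml",
}

def translate(fact_row):
    """Replace an opaque numeric product code with a readable name."""
    code = fact_row["product_code"]
    name = product_names.get(code, f"UNKNOWN CODE {code}")
    return {**fact_row, "product": name}

fact = {"product_code": 100235, "state": "CA", "units": 1200}
print(translate(fact)["product"])   # Silky Shine Shampoo 500ml
```

Flagging unmatched codes instead of dropping them keeps data-quality problems visible downstream.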
Cleanup and validation of data are critical to the data's value in the data warehouse. The old saying "garbage in, garbage out" applies. If data is missing, redundant, or inconsistent, high-level aggregations can be inaccurate, so you should at least know that these conditions exist. Perhaps data should be rejected for use in the warehouse until the source data can be reconciled. If the shampoo of interest to the vice president is called Shamp in one database and Shampoo in another, aggregations on either value would not produce complete information about the product.
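One way to reconcile such naming drift is fuzzy matching against a canonical list, sketched here with Python's standard difflib. The names and similarity cutoff are illustrative; in practice, automated matches like this should be reviewed by hand before they drive warehouse loads.

```python
import difflib

# Canonical product names agreed on for the warehouse (illustrative)
canonical = ["Shampoo", "Conditioner", "Body Wash"]

def reconcile(name, cutoff=0.6):
    """Map a source value onto its canonical name, or None if no close match."""
    match = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
    return match[0] if match else None

print(reconcile("Shamp"))      # Shampoo
print(reconcile("Shampoo"))    # Shampoo
print(reconcile("Widget"))     # None
```

Returning None for unmatched values lets the load process reject or quarantine them, as suggested above, instead of silently aggregating bad data.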
SSIS packages define the steps in a transformation workflow. You can execute the steps serially or in combinations of serial, parallel, and conditional execution. For more information on SSIS, refer to Chapter 46, "SQLCLR: Developing SQL Server Objects in .NET."
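The serial/parallel/conditional flow of a package can be sketched with Python's standard concurrent.futures. The step names are invented, and SSIS expresses this declaratively through precedence constraints in the designer rather than through code.

```python
from concurrent.futures import ThreadPoolExecutor

completed = []

def step(name):
    """A stand-in for one workflow step; real steps would move or clean data."""
    completed.append(name)
    return name

# Serial: the extract step must finish before anything else starts
step("extract")

# Parallel: two independent cleanup steps run concurrently
with ThreadPoolExecutor() as pool:
    list(pool.map(step, ["validate", "dedupe"]))

# Conditional: load only if every prior step completed
if {"extract", "validate", "dedupe"} <= set(completed):
    step("load")

print(completed[0], completed[-1])  # extract load
```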
OLAP Performance
Performance is a big emphasis of SSAS. Usage-based aggregation is at the heart of much of what you can do to help in this area. In addition, the proactive caching mechanism in SSAS has allowed much of what was previously a bottleneck (and a slowdown) to be circumvented.

When designing cubes for deployment, you should consider the data scope of all the data accesses (that is, all the OLAP queries that will ever touch the cube). You should build a cube only big enough to handle these known data scopes. If you don't have requirements for something, you shouldn't build it. This keeps things a smaller, more manageable size (that is, smaller cubes), which translates into faster overall performance for those who use the cube.
You can also take caching to the extreme by relocating the OLAP physical storage components to a solid-state disk device (that is, a persistent memory device). This can give you tenfold performance gains. The price of this type of technology has been dramatically