Data retrieved into a table takes on all the capabilities of an Excel table:
■ Table formatting and totals: Click inside the table, then choose a style from the Design tab, and the entire table’s formatting will change to that style. The Total Row check box here enables a row at the bottom of the table for summary functions (e.g., SUM, AVERAGE, COUNT) that will apply to all extracted data, regardless of how many rows are returned at the next refresh.
■ Conditional formatting: Select a column in the table, choose a format from the Conditional Formatting menu on the Home tab, and the color, data bars, or icons will overlay the table data to highlight variations in values.
■ Filter and sort: Clicking on the column header menu enables visible rows to be filtered by picking individual values in that column, by defining conditions (e.g., greater or less than value or average, top 10, etc.), or based on conditional formatting applied to that column. Similarly, the column can be sorted by either value or conditional formatting.
■ Add/Remove columns: Insert a new column into the table and enter an Excel formula into any cell within that column to create a calculated column. Additionally, entire columns can be eliminated from the table by deleting that column without the need to change the connection definition. Similarly, you can also remove rows from the table. However, these rows will reappear the next time the table is refreshed.
The latest data from the database can be retrieved at any time by right-clicking on the table and
choosing the Refresh item, or by choosing one of the Refresh options from the Data tab. None of the changes
made to the Excel table will change data in the source database.
PivotTables
PivotTables and PivotCharts are powerful analysis tools that work for both relational and Analysis
Services data. The way Excel interacts with the source data is fundamentally different between these two
types of data, however. For relational data sources, Excel reads the entire data set from the database as
soon as the PivotTable is created, storing it invisibly within the workbook in a PivotCache object.
This enables the PivotTable to respond to changes without querying the underlying data each time, but
it can make for a very large workbook when the data set is large.
By contrast, Analysis Services data sources are queried for each update to the PivotTable, keeping
the workbook size down and relying on the responsiveness of Analysis Services. PivotTables created
on Analysis Services data sources reflect the latest data with every change to the PivotTable, whereas
relationally based PivotTables only reflect new data when explicitly refreshed (or refreshed by the
connection definition).
Start a PivotTable by either choosing a connection from the Data tab or choosing PivotTable from
the Insert tab. The idea of pivoting data is to display summaries based on categories that are placed
as row and column headers. As categories are dropped onto the header areas, the table quickly
reformats itself to display values grouped by all the currently selected category values, as shown in
Figure 75-1.
FIGURE 75-1
Excel PivotTable based on Analysis Services cube
Once the PivotTable is added to a worksheet, available data fields are displayed in the PivotTable Field
List, ready for dragging onto one of the four table areas:
■ Values: The center of the table that displays data aggregates, such as the Internet Order Count
shown in Figure 75-1.
■ Row Labels: Category data that provides row headers on the left side of the table (e.g.,
Calendar Year in Figure 75-1).
■ Column Labels: Category data that provides column headers along the top of the table (e.g.,
State-Province in Figure 75-1).
■ Report Filter: Provides an overall filter for the PivotTable that does not change the layout of
the table (e.g., Country in Figure 75-1).
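The four areas described above correspond to a plain group-and-aggregate operation. The following Python sketch is only an illustration of that idea: the sample rows, the field names (borrowed from Figure 75-1), and the pivot helper are assumptions made for the example, not anything Excel exposes.

```python
from collections import defaultdict

# Hypothetical detail rows; the field names echo Figure 75-1.
orders = [
    {"Country": "Australia", "State-Province": "Victoria", "Calendar Year": 2003, "Order Count": 5},
    {"Country": "Australia", "State-Province": "Victoria", "Calendar Year": 2004, "Order Count": 7},
    {"Country": "Australia", "State-Province": "Queensland", "Calendar Year": 2003, "Order Count": 3},
    {"Country": "Canada", "State-Province": "Alberta", "Calendar Year": 2003, "Order Count": 4},
]

def pivot(rows, row_label, column_label, value, report_filter=None):
    """Sum `value` for each (row label, column label) pair.

    report_filter is an optional (field, wanted_value) pair that restricts
    the input rows without changing the layout of the result.
    """
    cells = defaultdict(int)
    for r in rows:
        if report_filter and r[report_filter[0]] != report_filter[1]:
            continue
        cells[(r[row_label], r[column_label])] += r[value]
    return dict(cells)

table = pivot(orders, "Calendar Year", "State-Province", "Order Count",
              report_filter=("Country", "Australia"))
for (year, state), total in sorted(table.items()):
    print(year, state, total)
```

Changing the row_label or column_label argument regroups the same detail rows, which is what dragging a field to a different area does in the Field List.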
While the Field List panel is basically the same for both relational and Analysis Services data sources,
the Analysis Services version includes additional information. Values are called measures in Analysis
Services. A filter at the top of the panel restricts the Field List to only those items in the selected group of values
(called measure groups in Analysis Services). Category items within the Field List can be organized into
folders by setting an item’s AttributeHierarchyDisplayFolder property in Analysis Services, which causes the
folders to appear next to Contacts and other category groups in Figure 75-1. Finally, Analysis Services
defines hierarchies that allow drill-down paths, such as Calendar and Customer Geography in Figure
75-1, which enable details to be toggled in outline form.
Once fields have been placed in the PivotTable, field-specific settings are available. Right-click on a field
in the PivotTable to access the following:
■ Field settings: These provide control over subtotals, layout, number format, and how values are calculated. Calculation options include basic aggregation functions (SUM, COUNT, AVERAGE, etc.), as well as "% of," "Running total," and several other options.
■ Sort settings: Choose to sort rows or columns based on either headers or values.
■ Filter: Individual header values can be selected, Label filters can be defined (e.g., State-Province does not contain "Wales"), or Value filters can be defined (e.g., show only periods with more than 100 orders).
■ Properties: Analysis Services data sources associate properties with many of the values listed in the header. Some of these values may not be available directly in the Field List. Properties may be exposed either directly as columns in the spreadsheet or as a tooltip when the cursor hovers over a header.
■ Additional Actions: Analysis Services can associate actions, such as running reports, with header values.
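To make the "Running total" and "% of" calculation options concrete, here is a small Python sketch. The order counts are invented, and the "% of" base is assumed to be the grand total; Excel offers several bases (row, column, and so on) that are not modeled here.

```python
from itertools import accumulate

# Hypothetical order counts per calendar year.
order_counts = {2001: 100, 2002: 250, 2003: 150}

years = sorted(order_counts)
values = [order_counts[y] for y in years]

# "Running total": each period shows the sum of all periods up to and including it.
running_total = dict(zip(years, accumulate(values)))

# "% of" (assumed here to mean percent of the grand total).
grand_total = sum(values)
percent_of_total = {y: 100.0 * order_counts[y] / grand_total for y in years}

for y in years:
    print(y, order_counts[y], running_total[y], round(percent_of_total[y], 1))
```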
Because PivotTables display summary data, it is often useful to drill into the details behind a sum or
count. Double-clicking on any value will create a new worksheet with the associated detail rows. By
default, Analysis Services data sources limit the rows returned by a drill-through to 1,000, but this
maximum is configurable via the Connection Properties dialog.
After a bit of practice, generating a desired view in this environment is extremely time efficient,
limited mostly by the speed of the underlying data source. Insights into data can be gained at a
surprising rate.
PivotCharts
PivotCharts (see Figure 75-2 for an example) are bound to a PivotTable, displaying the contents of
the table as it changes. The PivotTable’s row headers appear as axis labels in the chart, and its column
headers appear as entries in the legend. You can create a PivotChart either by choosing the PivotChart
option when the PivotTable is created or by clicking inside of an existing PivotTable and inserting an
Excel chart.
You can control the content of the PivotChart with either the full-featured PivotTable Field List or the
simplified PivotChart Filter pane. The majority of Excel chart functions are available for a PivotChart,
including creating a full-page chart by right-clicking and choosing the Move option.
FIGURE 75-2
Excel PivotChart based on Analysis Services cube
Advanced Data Analysis
The SQL Server Data Mining Add-ins for Office 2007 make a number of additional features available in
Excel for analyzing data. This free download enhances Excel with features that make it easier to explore
and prepare data sets, perform common analyses using data mining, and allow Excel to act as a full data
mining client.
This approach of encapsulating common data mining analyses in Excel is extremely powerful, allowing
a much wider audience to use data mining than would otherwise have access to it. Note that most of these
features require an Analysis Services server to execute the associated data mining processing.
See Chapter 76, "Data Mining with Analysis Services," for more detail on how to approach
data mining projects and available algorithms.
Start by downloading and installing the add-ins. The product page for the add-ins,
www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx, includes pointers to the download,
tutorials, webcasts, and labs. Because executing the data mining algorithms requires access to an Analysis
Services server, setup installs and provides a link to run the Server Configuration Utility.
The configuration wizard will set up a new Analysis Services database in which Excel mining models
can be created, or it enables you to identify an existing database if one has already been created for that
purpose. This process assumes that an Analysis Services server is available and the account used for
installation has adequate permissions. The configuration utility will also suggest enabling the creation of
temporary mining models, which is important to prevent the database becoming filled with junk objects
as a result of the models Excel will create.
Once the install and configuration steps are complete, Excel’s Ribbon will have two new tabs: Data
Mining and Analyze (select some portion of a table to see the Analyze tab, described in "Table Analysis
Tools" later in this chapter).
Exploring and preparing data
Using these advanced functions is easiest when the data set being analyzed is defined as an Excel table.
Data imported from external sources is automatically defined as a table, but other data, such as that
entered into Excel via a copy/paste operation, will not automatically be defined as a table. A simple way
to check whether a data set has been defined as a table is to select a cell in the table, and if it has been
defined as a table, the Table Tools group of tabs will appear in Excel’s Ribbon. Convert a range of cells
into a table by first ensuring that the top row of cells contains column headers for the table, selecting a
cell in the range to be converted, and then choosing Table from the Insert tab. Excel assigns table names
that may be less than intuitive. Table names can be adjusted by selecting a cell in a table, choosing the
"Table Tools" Design tab, and typing over the name that appears on the left-hand side of the Ribbon.
Once the data has been organized as desired, there are three actions in the Data Preparation group of
the Data Mining tab, which are described in this section. While these functions are intended to prepare data for use
by the data mining client, they can be useful for a wide variety of situations. None of these data exploration and
preparation functions rely on data mining algorithms, nor do they communicate with the Analysis
Services server.
Explore Data
Choose Explore Data and the wizard will prompt for a table and column name, and then display a
histogram of rows for each value in that column. For example, Figure 75-3(a) shows the count of rows
for each value in the NumberChildrenAtHome column. For numeric data, an alternate display can be
toggled via the icons at the lower left, allowing the data to be grouped into equally sized buckets of
values, as shown in Figure 75-3(b). This is very useful for columns that contain a large number
of values, such as dates, salaries, and so on.
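The two displays can be approximated in a few lines of Python. This is a sketch under assumptions: "equally sized buckets" is read here as equal-width value ranges, the helper names are made up, and the sample columns are invented. It does not reproduce the wizard's exact bucketing rules.

```python
from collections import Counter

def discrete_histogram(values):
    """Count rows for each distinct value (the Discrete display)."""
    return Counter(values)

def bucket_histogram(values, bucket_count):
    """Count rows per equal-width bucket (one reading of the Numeric display)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bucket_count or 1  # avoid zero width when all values match
    counts = [0] * bucket_count
    for v in values:
        # The maximum value is placed in the last bucket rather than a new one.
        index = min(int((v - lo) / width), bucket_count - 1)
        counts[index] += 1
    return counts

children_at_home = [0, 0, 1, 2, 0, 3, 1, 0]   # few distinct values
incomes = [20_000, 35_000, 40_000, 60_000, 90_000, 100_000]  # many distinct values

print(discrete_histogram(children_at_home))
print(bucket_histogram(incomes, 4))
```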
Displays in numeric mode can also add a new column to the source table to denote into which bucket
each row falls. The copy button will snapshot the histogram chart for pasting into another application.
FIGURE 75-3
Explore Data histograms in (a) Discrete and (b) Numeric displays
Clean Data
Choose Clean Data and two options will appear: Outliers and Re-label. Outliers is very similar to
Explore Data as described above, except that when the histogram displays, sliders appear that allow the
elimination of extreme data values in the table. For numeric values, this includes identifying minimum
and maximum allowable values, with several handling options: replacing an outlier with limit values,
replacing an outlier with a mean value, simply clearing the outlier, or totally removing the offending
row. For text values, infrequently used values can be defined as outliers. This enables, for example, the
top 10 occurring cities to be surfaced in an analysis, with less frequently occurring cities grouped
under an Other category. In addition to replacing values, text values can be cleared or the associated
rows removed from the table.
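The handling options described above can be sketched in Python as follows. The mode names, the helper functions, and the choice to compute the mean from the in-range values only are assumptions made for this illustration; the wizard may calculate its replacement values differently.

```python
from collections import Counter

def handle_outliers(values, minimum, maximum, mode):
    """Apply one outlier-handling option to a list of numeric values.

    mode is "limit", "mean", "clear", or "remove" (illustrative names,
    not the wizard's labels).
    """
    in_range = [v for v in values if minimum <= v <= maximum]
    # Assumption: the mean is taken over the values that are not outliers.
    mean = sum(in_range) / len(in_range) if in_range else None
    result = []
    for v in values:
        if minimum <= v <= maximum:
            result.append(v)
        elif mode == "limit":
            result.append(minimum if v < minimum else maximum)
        elif mode == "mean":
            result.append(mean)
        elif mode == "clear":
            result.append(None)  # stands in for an empty cell
        elif mode == "remove":
            continue  # drop the whole row
        else:
            raise ValueError(f"unknown mode: {mode}")
    return result

def group_rare(values, keep_top, other="Other"):
    """Keep the keep_top most frequent text values; relabel the rest."""
    top = {v for v, _ in Counter(values).most_common(keep_top)}
    return [v if v in top else other for v in values]

ages = [25, 30, 35, 40, 250]
cities = ["Paris", "Paris", "Rome", "Rome", "Rome", "Oslo"]
print(handle_outliers(ages, 0, 120, "limit"))
print(group_rare(cities, 2))
```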
The Re-label variant can be thought of as a structured search and replace. After identifying the table and
column of interest, the wizard presents a list of current values in that column, prompting for the new
values with which they should be replaced. This function is useful for fixing data entry problems,
mapping abbreviations to reporting descriptions, or even grouping data into categories.
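In effect, Re-label applies a lookup table to one column. A minimal Python sketch, with invented sample values and a hypothetical relabel helper:

```python
def relabel(values, replacements):
    """Replace each value that has an entry in `replacements`; keep the rest."""
    return [replacements.get(v, v) for v in values]

# Hypothetical column with inconsistent data entry.
states = ["WA", "Wash.", "Washington", "OR", "Oregon"]
mapping = {"WA": "Washington", "Wash.": "Washington", "OR": "Oregon"}

cleaned = relabel(states, mapping)
print(cleaned)
```

Mapping several old values to one new value is how the same function groups data into categories.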
Partition Data
Choose this function to copy rows from a source table to new tables in useful ways:
■ Split data into training and testing sets: When building data mining models, it is necessary
not only to train a model using part of the available data, but also to reserve a part of that
data for testing the trained model to assess how well it will perform on data it has not yet
seen. This option will split the source table into two separate tables for this purpose based on
a chosen ratio, randomly selecting which rows fall into each set.
■ Random sampling: This option extracts a random sample of the rows based on a
supplied ratio or row count. While very similar in function to the Split option, it is aimed more directly
at assessing the differences that result from training a model on different slices of the data.
■ Oversampling to balance data distribution: Data sets sometimes do not accurately represent the populations they are meant to model. Oversampling is a method to compensate for sampling bias in a data set. Indicate to the wizard the column and associated value to sample, and the resulting new data set will guarantee a representation of rows with a specified ratio.
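The partitioning options listed above can be illustrated with a short Python sketch. The function names and the Buyer column are hypothetical. The oversample function shows only one possible way to reach a target ratio (keep every row with the chosen value and randomly drop others), which is an assumption and not necessarily how the wizard does it.

```python
import random

def split_rows(rows, training_ratio, seed=None):
    """Randomly split rows into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_ratio)
    return shuffled[:cut], shuffled[cut:]

def oversample(rows, column, value, target_ratio, seed=None):
    """Return rows in which row[column] == value makes up about target_ratio.

    Assumed approach: keep all matching rows and randomly discard others.
    Requires 0 < target_ratio < 1.
    """
    rng = random.Random(seed)
    target = [r for r in rows if r[column] == value]
    others = [r for r in rows if r[column] != value]
    # Number of non-matching rows that yields the requested share.
    keep = min(len(others), round(len(target) * (1 - target_ratio) / target_ratio))
    return target + rng.sample(others, keep)

# Hypothetical table: only 10% of customers are buyers.
customers = [{"Buyer": "Yes"} for _ in range(10)] + [{"Buyer": "No"} for _ in range(90)]

training, testing = split_rows(customers, 0.7, seed=1)
balanced = oversample(customers, "Buyer", "Yes", 0.5, seed=1)
print(len(training), len(testing), len(balanced))
```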
Table analysis tools
Select a cell inside of a table and the Table Tools tabs will become available, including the Analyze tab.
The functions on this tab are common data mining operations that have been made nearly single-click
operations. All of these operations use Analysis Services to run the associated data mining algorithms.
The server and database used can be changed by choosing Connection from the Ribbon.
Analyze Key Influencers
Data sets that include predictable outcome(s) often have many attributes, not all of which are important
in determining the outcome. Select a cell in the table, choose the Analyze Key Influencers option, tell
the wizard which column contains the outcome, and Excel will build a Naïve Bayes model to determine
which attributes (columns) are most influential in determining the outcome. Excel will automatically add
a worksheet and report on key influencers. Additional report sections can be generated to contrast
influencers for selected outcomes.
The resulting report provides some initial insight into the data set being analyzed, and suggests
attributes that should definitely be included when developing a predictive analysis. However, it is
important to understand that these are often not the only attributes that influence the outcome. Naïve
Bayes is the simplest of algorithms and will only detect very direct relationships.
Detect Categories
It is often useful to group cases (rows) in a data set into groups to better understand the population. For
example, grouping customers by common traits could yield insights that lead to more targeted marketing
campaigns. Tell the wizard which columns to consider in determining the categories, limit the number
of categories that will be created if desired, and click Run. Excel will build a clustering model to put
similar cases into distinct buckets, add the category names as a new column to the source table, and
then add a worksheet that enables the exploration and naming of the associated categories.
The Categories Report page contains three sections, including notes about how to use each. The
top-level summary shows how many cases fall into each category and allows the categories to be renamed.
The second section shows the characteristics of the selected category; change the filter on the
category column to display other categories. The third section shows how a selected column varies across
all categories; change the column displayed by right-clicking on the x-axis and choosing the "Sort and
Filter" menu item.
Highlight Exceptions
The wizard and algorithm for this analysis are identical to Detect Categories described above, but instead
of presenting a report that enables exploration of the categories, cases that don’t fall inside the categories
are highlighted. The analysis looks not just at a single column, but at how that column’s value fits with
other attributes in that row. The result finds
combinations that, while not impossible, are unlikely, such as managers with entry-level salaries. Basic
outlier detection would not recognize this problem because the salary is in a valid range for the data set
as a whole.
Excel builds the categories and then looks at every table row in turn, using the model to predict the
likelihood of that row given the category definition. When the likelihood falls below the user-defined
threshold, that row in the table is highlighted. In addition, the value in each column is evaluated for
its likelihood as well, and the least likely value is highlighted. Excel automatically adds a report
worksheet that summarizes the exceptions found by the least-likely column. The report also contains the
threshold that determines which likelihoods are considered exceptions; adjust this value to see fewer
or more exceptions.
When reviewing exceptions in a large table, it is helpful to sort by color to put all the
exceptions in one place: Right-click on an exception, and select Sort➪Put selected cell
color on top.
Fill from Example
Excel will build a logistic regression model to detect patterns and estimate data for a column with
missing data. Tell the wizard the column to be filled in and the columns to be used to detect patterns, and
Excel will add a new column to the table with all the values filled. In addition, a report is added
summarizing the patterns used to determine the missing values. This feature is meant to handle a variety
of missing data cases, such as surveys that are missing some responses, assuming that patterns in the
known attributes will be a good predictor of the missing attribute.
The model that Excel builds will always supply a value for the missing attribute even when it is not
correct, so find some way to validate the model before accepting its results. For example, add some
test rows (cases with known values for the attribute in question) to the table without values for that
attribute, and compare the values generated by the model to their actual values.
Forecasting
Forecasting estimates the next steps of a series given its history. For example, what will next quarter’s
sales be? Set up a table with all the related series in columns, with one time column. It is best to
include related series, as the Time Series algorithm finds relationships between series that can help
build better forecasts. For example, last year’s software sales numbers may help predict this year’s
maintenance sales.
Indicate in the wizard the series to be predicted, which column contains time, and the number of
periods to be predicted. If the data has an inherent periodicity, such as a quarterly sales cycle, supplying
that information in the wizard as a hint to the algorithm may improve the forecast. Excel will extend the
source table with new rows at the bottom containing predicted values. In addition, a forecasting report
worksheet is added showing a graph with the existing and predicted values.
A good test for the reliability of the forecast is to copy the source worksheet, remove the last few
periods, and run the forecast to predict known values. Comparing the actual and predicted values will give
you an indication of the reliability of the forecast going forward.
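The hold-out test just described can be expressed as a short Python sketch. The forecaster used here is a deliberately simple stand-in (repeat the value from one season earlier) and is not the Time Series algorithm that Analysis Services runs. The sales figures and function names are likewise invented for illustration.

```python
def seasonal_naive_forecast(history, periods, season_length):
    """Stand-in forecaster: repeat the value from one season earlier."""
    series = list(history)
    for _ in range(periods):
        series.append(series[-season_length])
    return series[len(history):]

def backtest(series, holdout, season_length):
    """Hide the last `holdout` values, forecast them, and return the
    mean absolute percentage error against the known values."""
    train, actual = series[:-holdout], series[-holdout:]
    predicted = seasonal_naive_forecast(train, holdout, season_length)
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical quarterly sales with a yearly cycle (periodicity of 4).
quarterly_sales = [100, 120, 90, 150, 110, 130, 100, 160, 120, 140, 110, 170]

error_percent = backtest(quarterly_sales, holdout=4, season_length=4)
print(f"Mean absolute percentage error: {error_percent:.1f}%")
```

A small error on the held-out periods suggests, but does not guarantee, that forecasts of genuinely unknown periods will be similarly close.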
Scenario Analysis
Scenario Analysis investigates how changes to the source data set affect the outcome. The "What-if"
option enables the user to form an exact question, for example, "How many more customers would
purchase a bike if their income went up 20%?" The "Goal Seek" option asks Excel to find the value at
which the desired outcome occurs, for example, "How much more income would our customers need
before purchasing a bike?" Excel builds a logistic regression model to estimate the impact of changes.
Upon completion of a "What-if" scenario for the entire table, new columns are added to the source table
showing the new value of the outcome column and the confidence in the result. "Goal Seek" for the
entire table adds columns for the new value of the outcome column and the new value for the column
being adjusted.
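The two question types can be illustrated against a toy logistic model. Everything in this Python sketch is an assumption made for illustration: the coefficients are invented rather than trained, only a single input (income) is used, and bisection is simply one way to search for the goal value. It is not how Excel or Analysis Services performs the calculation.

```python
import math

# Hypothetical coefficients of an already-trained logistic regression model.
INTERCEPT = -4.0
INCOME_WEIGHT = 0.00005  # per unit of yearly income

def purchase_probability(income):
    """Probability of a bike purchase under the toy model."""
    z = INTERCEPT + INCOME_WEIGHT * income
    return 1.0 / (1.0 + math.exp(-z))

def what_if(incomes, change):
    """Count predicted buyers (probability >= 0.5) after scaling income."""
    return sum(purchase_probability(i * (1 + change)) >= 0.5 for i in incomes)

def goal_seek(target_probability, low=0.0, high=1_000_000.0, tolerance=1.0):
    """Bisection search for the income at which the target is reached.

    Assumes the answer lies between low and high and that probability
    rises with income.
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        if purchase_probability(mid) < target_probability:
            low = mid
        else:
            high = mid
    return high

customer_incomes = [60_000, 70_000, 78_000, 90_000]
print(what_if(customer_incomes, 0.0), what_if(customer_incomes, 0.20))
print(round(goal_seek(0.5)))
```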
Prediction Calculator
The Prediction Calculator functions by first using the data in an Excel table to train a logistic regression
model to predict an outcome, and then makes the resulting model available as a calculator in Excel to
evaluate individual cases without being connected to the Analysis Services server. For example, a model
could be trained to predict component failure based on measurable attributes, and then made available
to technicians performing preventive maintenance. Inform the wizard of what attribute and attribute
value is to be predicted, and it will create up to three new sheets in the workbook:
■ Prediction Report: Lists all the significant attribute/value combinations found in building the model and their impact on the result. In addition, if the user enters costs associated with correct and incorrect guesses into the interactive profit calculator (e.g., cost of a component failure vs. cost of replacing a component that would not have failed), a threshold will be calculated for how likely an outcome must be before it is predicted. This threshold is then used in the calculator pages.
■ Prediction Calculator (optional): Enter the values for a case and see the predicted outcome.
■ Printable Calculator (optional): This contains a printable form that can be used for data collection and later entry into Excel, or even manual calculation without entry into Excel.
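The idea of deriving a prediction threshold from costs can be shown with a textbook expected-value calculation. This Python sketch is an assumption about the general principle only; the profit calculator in the Prediction Report may arrive at its threshold differently, and the function name and cost figures are invented.

```python
def decision_threshold(false_positive_cost, false_negative_cost,
                       true_positive_profit=0.0, true_negative_profit=0.0):
    """Probability above which predicting the outcome has the higher expected value.

    Predicting "will fail" pays off when
        p * true_positive_profit - (1 - p) * false_positive_cost
    exceeds
        (1 - p) * true_negative_profit - p * false_negative_cost,
    which rearranges to p > threshold as computed below.
    """
    gain_if_positive = true_positive_profit + false_negative_cost
    gain_if_negative = true_negative_profit + false_positive_cost
    return gain_if_negative / (gain_if_positive + gain_if_negative)

# Hypothetical costs: replacing a healthy component costs 100,
# while an unpredicted failure costs 900.
threshold = decision_threshold(false_positive_cost=100, false_negative_cost=900)
print(threshold)
```

With these figures, a failure is worth predicting even when it is fairly unlikely, because a missed failure is so much more expensive than an unnecessary replacement.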
Shopping Basket Analysis
The Shopping Basket Analysis is a quick way to build an association rules model based on the data in
an Excel table. This model will identify groups of items that normally appear together in a transaction,
allowing better product organization and/or suggestions to customers, for example, the famous
"Customers who bought this book also bought ..."
The Excel table must contain certain columns that are indicated in the wizard:
■ Transaction ID: The Order Number, Session ID, or some other identifier that ties multiple rows together into a single transaction.
■ Item: The name or other identifier of the item purchased.
■ Item Value (optional): The price or value of the item included in that transaction. This enables the results to be sorted on the total value that a "basket" represents (average price of the basket * number of sales). As a result, suggestions for higher-value baskets can be given priority.
After the model has been built, two sheets are added to the workbook. The Bundled Items report details
all the bundles (item combinations) found and their associated sales and price information. The
Recommendations report lists recommendation rules by item, the proposed recommendation, and supporting
statistics.
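The role of the three input columns can be seen in a small Python sketch that counts which item pairs share a transaction. This is plain co-occurrence counting limited to pairs, offered as a simplified stand-in for the association rules model that Analysis Services builds. The sample rows and the basket_pairs helper are invented.

```python
from collections import defaultdict
from itertools import combinations

def basket_pairs(rows):
    """rows: (transaction_id, item, item_value) tuples.

    Returns {pair: (number_of_baskets, total_pair_value)} for every pair
    of items that appears together in at least one transaction.
    """
    baskets = defaultdict(dict)
    for transaction_id, item, value in rows:
        baskets[transaction_id][item] = value
    stats = defaultdict(lambda: [0, 0.0])
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            stats[(a, b)][0] += 1
            stats[(a, b)][1] += items[a] + items[b]
    return {pair: tuple(s) for pair, s in stats.items()}

# Hypothetical order lines: (Transaction ID, Item, Item Value).
order_lines = [
    (1, "Bike", 500), (1, "Helmet", 40),
    (2, "Bike", 500), (2, "Helmet", 40), (2, "Gloves", 20),
    (3, "Gloves", 20),
]

bundles = basket_pairs(order_lines)
# Sort by total value so the most valuable bundles come first.
for pair, (count, value) in sorted(bundles.items(), key=lambda kv: -kv[1][1]):
    print(pair, count, value)
```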
Data mining client
The Data Mining tab added by installing the SQL Server Data Mining Add-ins for Office 2007 provides
a full data mining environment, equivalent to the data mining environment provided by Visual Studio
(also known as Business Intelligence Development Studio). Unlike the table analysis tools described
earlier, whereby tables and reports are created directly in Excel, the primary focus here is on creating,
training, browsing, and querying data mining models in an Analysis Services database.
Working from within Excel to develop models can have advantages over the Visual Studio environment,
especially when working with small amounts of data early in the process, when cleaning and exploring
the data set, as the data set can be quickly changed in Excel and used to train and test models in
Analysis Services. However, there are limitations to the Excel environment, such as the inability to show the
accuracy of competing models in the same accuracy chart.
See Chapter 76, "Data Mining with Analysis Services," to learn more about the data
mining features of Analysis Services and the functions detailed here.
Functions exposed on the Data Mining tab include the following:
■ Data Preparation: This is described in the section "Exploring and Preparing Data" earlier in
the chapter.
■ Data Modeling: Allows the creation of mining structures and models. Several of the most
popular models are listed as separate functions, while the Advanced option provides access to
all available algorithms.
■ Accuracy and Validation: Provides different views of model performance on test data.
■ Browse: Enables the examination of model details for any model in the current database.
■ Document Model: Adds a new sheet to the current workbook listing model details.
■ Query: Provides a friendly environment for constructing and executing DMX queries against
mining models.
■ Manage Models: Enables structures and models in the current database to be deleted,
renamed, processed, and so on.
■ Connection: Manages connections to the Analysis Services database.
■ Trace: Provides a history of every command sent to the Analysis Services server. Use of
session models for table analysis functions can also be enabled or disabled here.
Summary
Microsoft Excel has long been the most frequently used tool for analyzing data, and with the advent of
the 2007 version, it is easier than ever to include relational and Analysis Services data in those
analyses. Relational data can be included in data tables that remain linked to the underlying table or query