Microsoft SQL Server 2008 Study Guide, Part 163

Data retrieved into a table takes on all the capabilities of an Excel table:

■ Table formatting and totals: Click inside the table, then choose a style from the Design tab, and the entire table’s formatting will change to that style. The Total Row check box here enables a row at the bottom of the table for summary functions (e.g., SUM, AVERAGE, COUNT) that will apply to all extracted data, regardless of how many rows are returned at the next refresh.

■ Conditional formatting: Select a column in the table, choose a format from the Conditional Formatting menu on the Home tab, and the color, data bars, or icons will overlay the table data to highlight variations in values.

■ Filter and sort: Clicking on the column header menu enables visible rows to be filtered by picking individual values in that column, by defining conditions (e.g., greater or less than a value or the average, top 10, etc.), or based on conditional formatting applied to that column. Similarly, the column can be sorted by either value or conditional formatting.

■ Add/Remove columns: Insert a new column into the table and enter an Excel formula into any cell within that column to create a calculated column. Additionally, entire columns can be eliminated from the table by deleting that column, without the need to change the connection definition. Similarly, you can also remove rows from the table. However, these rows will reappear the next time the table is refreshed.

The latest data from the database can be retrieved at any time by right-clicking on the table and choosing the Refresh item, or by choosing one of the Refresh options from the Data tab. None of the changes made to the Excel table will change data in the source database.

PivotTables

PivotTables and PivotCharts are powerful analysis tools that work for both relational and Analysis Services data. The way Excel interacts with the source data is fundamentally different between these two types of data, however. For relational data sources, Excel reads the entire data set from the database as soon as the PivotTable is created, storing it invisibly within the workbook in a PivotCache object. This enables the PivotTable to respond to changes without querying the underlying data each time, but it can make for a very large workbook when the data set is large.

By contrast, Analysis Services data sources are queried for each update to the PivotTable, keeping the workbook size down and relying on the responsiveness of Analysis Services. PivotTables created on Analysis Services data sources reflect the latest data with every change to the PivotTable, whereas relationally based PivotTables only reflect new data when explicitly refreshed (or refreshed by the connection definition).

Start a PivotTable by either choosing a connection from the Data tab or choosing PivotTable from the Insert tab. The idea of pivoting data is to display summaries based on categories that are placed as row and column headers. As categories are dropped onto the header areas, the table quickly reformats itself to display values grouped by all the currently selected category values, as shown in Figure 75-1.

FIGURE 75-1

Excel PivotTable based on Analysis Services cube

Once the PivotTable is added to a worksheet, available data fields are displayed in the PivotTable Field List, ready for dragging onto one of the four table areas:

■ Values: The center of the table that displays data aggregates, such as the Internet Order Count shown in Figure 75-1.

■ Row Labels: Category data that provides row headers on the left side of the table (e.g., Calendar Year in Figure 75-1).

■ Column Labels: Category data that provides column headers along the top of the table (e.g., State-Province in Figure 75-1).

■ Report Filter: Provides an overall filter for the PivotTable that does not change the layout of the table (e.g., Country in Figure 75-1).
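
The four areas can be pictured as the arguments of a small grouping routine. The following is a minimal Python sketch of that idea, not how Excel implements PivotTables; the function name pivot_count, the sample rows, and the field names are all hypothetical. The Values area is represented here by a simple row count.

```python
from collections import defaultdict

def pivot_count(rows, row_field, col_field, report_filter=None):
    """Count rows per (row label, column label), after an optional filter."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        # Report Filter: drops rows without changing the table layout.
        if report_filter is not None and not report_filter(r):
            continue
        # Row Labels and Column Labels pick the cell; Values is the count.
        table[r[row_field]][r[col_field]] += 1
    return {row: dict(cols) for row, cols in table.items()}

# Hypothetical order rows, loosely modeled on the fields named in the text.
orders = [
    {"year": 2003, "state": "Oregon", "country": "United States"},
    {"year": 2003, "state": "Oregon", "country": "United States"},
    {"year": 2004, "state": "Oregon", "country": "United States"},
    {"year": 2004, "state": "Washington", "country": "United States"},
    {"year": 2004, "state": "Victoria", "country": "Australia"},
]

result = pivot_count(
    orders, "year", "state",
    report_filter=lambda r: r["country"] == "United States",
)
print(result)
```

Swapping the row and column field arguments transposes the summary, which is roughly what dragging a field from one header area to the other does.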

While the Field List panel is basically the same for both relational and Analysis Services data sources, the Analysis Services version includes additional information. Values (called measures in Analysis Services) [...]. The filter at the top of the panel restricts the Field List to only those items in the selected group of values (called measure groups in Analysis Services). Category items within the Field List can be organized into folders by setting an item’s AttributeHierarchyDisplayFolder property in Analysis Services, which causes the folders to appear next to Contacts and other category groups in Figure 75-1. Finally, Analysis Services defines hierarchies that allow drill-down paths, such as Calendar and Customer Geography in Figure 75-1, which enable details to be toggled in outline form.

Once fields have been placed in the PivotTable, field-specific settings are available. Right-click on a field in the PivotTable to access the following:

■ Field settings: These provide control over subtotals, layout, number format, and how values are calculated. Calculation options include basic aggregation functions (SUM, COUNT, AVERAGE, etc.), as well as “% of,” “Running total,” and several other options.

■ Sort settings: Choose to sort rows or columns based on either headers or values.

■ Filter: Individual header values can be selected, Label filters can be defined (e.g., State-Province does not contain “Wales”), or Value filters can be defined (e.g., show only periods with more than 100 orders).

■ Properties: Analysis Services data sources associate properties with many of the values listed in the header. Some of these values may not be available directly in the Field List. Properties may be exposed either directly as columns in the spreadsheet or as a tooltip when the cursor hovers over a header.

■ Additional Actions: Analysis Services can associate actions, such as running reports, with header values.

Because PivotTables display summary data, it is often useful to drill into the details behind a sum or count. Double-clicking on any value will create a new worksheet with the associated detail rows. By default, Analysis Services data sources limit the rows returned by a drill-through to 1,000, but this maximum is configurable via the Connection Properties dialog.

After a bit of practice, generating a desired view in this environment is extremely time efficient, limited mostly by the speed of the underlying data source. Insights into data can be gained at a surprising rate.

PivotCharts

PivotCharts (see Figure 75-2 for an example) are bound to a PivotTable, displaying the contents of the table as it changes. The PivotTable’s row headers appear as axis labels in the chart, and its column headers appear as entries in the legend. You can create a PivotChart either by choosing the PivotChart option when the PivotTable is created or by clicking inside of an existing PivotTable and inserting an Excel chart.

You can control the content of the PivotChart with either the full-featured PivotTable Field List or the simplified PivotChart Filter pane. The majority of Excel chart functions are available for a PivotChart, including creating a full-page chart by right-clicking and choosing the Move option.

FIGURE 75-2

Excel PivotChart based on Analysis Services cube

Advanced Data Analysis

The SQL Server Data Mining Add-ins for Office 2007 make a number of additional features available in Excel for analyzing data. This free download enhances Excel with features that make it easier to explore and prepare data sets, perform common analyses using data mining, and allow Excel to act as a full data mining client.

This approach of encapsulating common data mining analyses in Excel is extremely powerful, allowing a much wider audience to use data mining than would otherwise have access to it. Note that most of these features require an Analysis Services server to execute the associated data mining processing.

See Chapter 76, “Data Mining with Analysis Services,” for more detail on how to approach data mining projects and available algorithms.

Start by downloading and installing the add-ins. The product page for the add-ins, www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx, includes pointers to the download, tutorials, webcasts, and labs. Because executing the data mining algorithms requires access to an Analysis Services server, setup installs and provides a link to run the Server Configuration Utility.

The configuration wizard will set up a new Analysis Services database in which Excel mining models can be created, or it enables you to identify an existing database if one has already been created for that purpose. This process assumes that an Analysis Services server is available and the account used for installation has adequate permissions. The configuration utility will also suggest enabling the creation of temporary mining models, which is important to prevent the database becoming filled with junk objects as a result of the models Excel will create.

Once the install and configuration steps are complete, Excel’s Ribbon will have two new tabs: Data Mining and Analyze (select some portion of a table to see the Analyze tab, described in “Table Analysis Tools” later in this chapter).

Exploring and preparing data

Using these advanced functions is easiest when the data set being analyzed is defined as an Excel table. Data imported from external sources is automatically defined as a table, but other data, such as that entered into Excel via a copy/paste operation, will not automatically be defined as a table. A simple way to check whether a data set has been defined as a table is to select a cell in it: if it has been defined as a table, the Table Tools group of tabs will appear in Excel’s Ribbon. Convert a range of cells into a table by first ensuring that the top row of cells contains column headers for the table, selecting a cell in the range to be converted, and then choosing Table from the Insert tab. Excel assigns table names that may be less than intuitive. Table names can be adjusted by selecting a cell in a table, choosing the “Table Tools” Design tab, and typing over the name that appears on the left-hand side of the Ribbon.

Once the data has been organized as desired, three actions are available in the Data Preparation group of the Data Mining tab, as described in this section. While these functions are intended to prepare data for use by the data mining client, they can be useful in a wide variety of situations. None of these data exploration and preparation functions rely on data mining algorithms, nor do they communicate with the Analysis Services server.

Explore Data

Choose Explore Data and the wizard will prompt for a table and column name, and then display a histogram of rows for each value in that column. For example, Figure 75-3(a) shows the count of rows for each value in the NumberChildrenAtHome column. For numeric data, an alternate display can be toggled via the icons at the lower left, allowing the data to be grouped into equally sized buckets of values, as shown in Figure 75-3(b). This is very useful for columns that contain a large number of values, such as dates, salaries, and so on.
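
The numeric display can be approximated with a few lines of Python. This sketch assumes that "equally sized" means equal-width value ranges, which the text does not state explicitly; the function name equal_width_buckets and the sample values are hypothetical and are not part of the add-in.

```python
def equal_width_buckets(values, bucket_count):
    """Group numeric values into equal-width ranges (an assumed reading)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bucket_count
    counts = [0] * bucket_count
    labels = []  # bucket index per input value, in input order
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in bucket 0
        else:
            # The maximum value is folded into the last bucket.
            idx = min(int((v - lo) / width), bucket_count - 1)
        counts[idx] += 1
        labels.append(idx)
    edges = [lo + i * width for i in range(bucket_count + 1)]
    return edges, counts, labels

incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
edges, counts, labels = equal_width_buckets(incomes, 3)
print(edges, counts)
```

The per-value labels list is the kind of information a bucket column on the source table would hold.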

Displays in numeric mode can also add a new column to the source table to denote into which bucket each row falls. The copy button will snapshot the histogram chart for pasting in any application that [...]

FIGURE 75-3

Explore Data histograms in (a) Discrete and (b) Numeric displays

Clean Data

Choose Clean Data and two options will appear: Outliers and Re-label. Outliers is very similar to Explore Data as described above, except that when the histogram displays, sliders appear that allow the elimination of extreme data values in the table. For numeric values, this includes identifying minimum and maximum allowable values, with several handling options: replacing an outlier with limit values, replacing an outlier with a mean value, simply clearing the outlier, or totally removing the offending row. For text values, infrequently used values can be defined as outliers. This enables, for example, the top 10 occurring cities to be surfaced in an analysis, with less frequently occurring cities grouped under an Other category. In addition to replacing values, text values can be cleared or the associated rows removed from the table.
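
The handling options can be sketched in Python as follows. This is an illustration under assumptions rather than the add-in's code: the function names, the mode strings, and the choice to compute the mean from in-range values only are all hypothetical.

```python
from collections import Counter
from statistics import mean

def handle_outliers(values, low, high, mode):
    """Apply one handling option to numeric values outside [low, high]."""
    # Assumption: the replacement mean is taken over in-range values only.
    in_range_mean = mean(v for v in values if low <= v <= high)
    cleaned = []
    for v in values:
        if low <= v <= high:
            cleaned.append(v)
        elif mode == "limit":      # replace with the nearest limit value
            cleaned.append(low if v < low else high)
        elif mode == "mean":       # replace with a mean value
            cleaned.append(in_range_mean)
        elif mode == "clear":      # blank the cell but keep the row
            cleaned.append(None)
        elif mode == "remove":     # drop the offending row
            continue
        else:
            raise ValueError(f"unknown mode: {mode}")
    return cleaned

def group_rare_values(values, keep_top, other="Other"):
    """Keep the most frequent text values and relabel the rest."""
    top = {v for v, _ in Counter(values).most_common(keep_top)}
    return [v if v in top else other for v in values]

salaries = [1, 50, 52, 48, 500]
print(handle_outliers(salaries, 40, 60, "limit"))
print(group_rare_values(["Paris", "Paris", "Lyon", "Lyon", "Nice"], 2))
```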

The Re-label variant can be thought of as a structured search and replace. After identifying the table and column of interest, the wizard presents a list of current values in that column, prompting for the new values with which they should be replaced. This function is useful for fixing data entry problems, mapping abbreviations to reporting descriptions, or even grouping data into categories.

Partition Data

Choose this function to copy rows from a source table to new tables in useful ways:

■ Split data into training and testing sets: When building data mining models, it is necessary not only to train a model using part of the available data, but also to reserve a part of that data for testing the trained model to assess how well it will perform on data it has not yet seen. This option will split the source table into two separate tables for this purpose based on a chosen ratio, randomly selecting which rows fall into each set.

■ Random sampling: This option extracts a random sample of the rows based on a supplied ratio or row count. While very similar in function to the Split option, it more directly [...] to assess the differences that training a model on different data slices presents.

■ Oversampling to balance data distribution: Data sets sometimes do not accurately represent the populations they are meant to model. Oversampling is a method to compensate for sampling bias in a data set. Indicate to the wizard the column and associated value to sample, and the resulting new data set will guarantee a representation of rows with a specified ratio.
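
The three partitioning options can be sketched in plain Python. The function names and sample rows below are hypothetical, and the rebalancing routine shows only one possible way to reach a target ratio (keeping every row with the chosen value and sampling the rest); the text does not say how the wizard achieves it.

```python
import random

def split_rows(rows, train_ratio, seed=0):
    """Randomly split rows into (training, testing) lists by ratio."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def random_sample(rows, count, seed=0):
    """Extract a random sample of a given row count."""
    return random.Random(seed).sample(list(rows), count)

def rebalance(rows, column, value, target_ratio, seed=0):
    """Return rows in which `value` makes up roughly target_ratio.

    Assumption: all rows with the chosen value are kept and the other
    rows are randomly thinned out to reach the ratio.
    """
    rng = random.Random(seed)
    wanted = [r for r in rows if r[column] == value]
    others = [r for r in rows if r[column] != value]
    needed = round(len(wanted) * (1 - target_ratio) / target_ratio)
    return wanted + rng.sample(others, min(len(others), needed))

# Hypothetical table: 10 buyers among 100 customers.
rows = [{"id": i, "buyer": "yes" if i < 10 else "no"} for i in range(100)]
train, test = split_rows(rows, 0.7)
sample = random_sample(rows, 25)
balanced = rebalance(rows, "buyer", "yes", 0.5)
print(len(train), len(test), len(sample), len(balanced))
```

Fixing the seed makes a split repeatable, which helps when comparing models trained on the same partition.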

Table analysis tools

Select a cell inside of a table and the Table Tools tabs will become available, including the Analyze tab. The functions on this tab are common data mining operations that have been made nearly single-click operations. All of these operations use Analysis Services to run the associated data mining algorithms. The server and database used can be changed by choosing Connection from the Ribbon.

Analyze Key Influencers

Data sets that include predictable outcome(s) often have many attributes, not all of which are important in determining the outcome. Select a cell in the table, choose the Analyze Key Influencers option, tell the wizard which column contains the outcome, and Excel will build a Naïve Bayes model to determine which attributes (columns) are most influential in determining the outcome. Excel will automatically add a worksheet and report on key influencers. Additional report sections can be generated to contrast influencers for selected outcomes.

The resulting report provides some initial insight into the data set being analyzed, and suggests attributes that should definitely be included when developing a predictive analysis. However, it is important to understand that these are often not the only attributes that influence the outcome. Naïve Bayes is the simplest of algorithms and will only detect very direct relationships.
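
To give a feel for what a "very direct relationship" looks like, the sketch below ranks each attribute value by how far it shifts the outcome rate away from the overall rate. It is a deliberately simplified stand-in, not the Naïve Bayes model that Analysis Services builds, and the function name, column names, and rows are hypothetical.

```python
from collections import Counter

def influencer_scores(rows, outcome_col, outcome_value):
    """Rank (column, value) pairs by shift in outcome rate vs. the baseline."""
    baseline = sum(r[outcome_col] == outcome_value for r in rows) / len(rows)
    seen, hits = Counter(), Counter()
    for r in rows:
        for col, val in r.items():
            if col == outcome_col:
                continue
            seen[(col, val)] += 1
            if r[outcome_col] == outcome_value:
                hits[(col, val)] += 1
    # Each attribute is scored on its own, one value at a time, so
    # interactions between columns are invisible to this ranking.
    scores = {key: hits[key] / seen[key] - baseline for key in seen}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

customers = [
    {"commute": "short", "cars": 0, "bought": "yes"},
    {"commute": "short", "cars": 0, "bought": "yes"},
    {"commute": "short", "cars": 2, "bought": "yes"},
    {"commute": "long", "cars": 0, "bought": "no"},
    {"commute": "long", "cars": 2, "bought": "no"},
    {"commute": "long", "cars": 2, "bought": "no"},
]

ranked = influencer_scores(customers, "bought", "yes")
for (column, value), score in ranked:
    print(column, value, round(score, 3))
```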

Detect Categories

It is often useful to group cases (rows) in a data set into groups to better understand the population. For example, grouping customers by common traits could yield insights that lead to more targeted marketing campaigns. Tell the wizard which columns to consider in determining the categories, limit the number of categories that will be created if desired, and click Run. Excel will build a clustering model to put similar cases into distinct buckets, add the category names as a new column to the source table, and then add a worksheet that enables the exploration and naming of the associated categories.

The Categories Report page contains three sections, including notes about how to use each. The top-level summary shows how many cases fall into each category and allows the categories to be renamed. The second section shows the characteristics of the selected category; change the filter on the category column to display other categories. The third section shows how a selected column varies across all categories; change the column displayed by right-clicking on the x-axis and choosing the “Sort and Filter” menu item.

Highlight Exceptions

The wizard and algorithm for this analysis are identical to Detect Categories described above, but instead of presenting a report that enables exploration of the categories, cases that don’t fall inside the categories [...] a single column, but at how that column’s value fits with other attributes in that row. The result finds combinations that, while not impossible, are unlikely, such as managers with entry-level salaries. Basic outlier detection would not recognize this problem because the salary is in a valid range for the data set as a whole.

Excel builds the categories and then looks at every table row in turn, using the model to predict the likelihood of that row given the category definition. When the likelihood falls below the user-defined threshold, that row in the table is highlighted. In addition, the value in each column is evaluated for its likelihood as well, and the least likely value is highlighted. Excel automatically adds a report worksheet that summarizes the exceptions found by the least-likely column. The report also contains the threshold that determines which likelihoods are considered exceptions; adjust this value to see fewer or more exceptions.

When reviewing exceptions in a large table, it is helpful to sort by color to put all the exceptions in one place: Right-click on an exception, and select Sort➪Put selected cell color on top.

Fill from Example

Excel will build a logistic regression model to detect patterns and estimate data for a column with missing data. Tell the wizard the column to be filled in and the columns to be used to detect patterns, and Excel will add a new column to the table with all the values filled. In addition, a report is added summarizing the patterns used to determine the missing values. This feature is meant to handle a variety of missing data cases, such as surveys that are missing some responses, assuming that patterns in the known attributes will be a good predictor of the missing attribute.

The model that Excel builds will always supply a value for the missing attribute even when it is not correct, so find some way to validate the model before accepting its results. For example, add some test rows (cases with known values for the attribute in question) to the table without values for that attribute, and compare the values generated by the model to their actual values.

Forecasting

Forecasting estimates the next steps of a series given its history. For example, what will next quarter’s sales be? Set up a table with all the related series in columns, with one time column. It is best to include related series, as the Time Series algorithm finds relationships between series that can help build better forecasts. For example, last year’s software sales numbers may help predict this year’s maintenance sales.

Indicate in the wizard the series to be predicted, which column contains time, and the number of periods to be predicted. If the data has an inherent periodicity, such as a quarterly sales cycle, supplying that information in the wizard as a hint to the algorithm may improve the forecast. Excel will extend the source table with new rows at the bottom containing predicted values. In addition, a forecasting report worksheet is added showing a graph with the existing and predicted values.

A good test for the reliability of the forecast is to copy the source worksheet, remove the last few periods, and run the forecast to predict known values. Comparing the actual and predicted values will give you an indication of the reliability of the forecast going forward.
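
This hold-back test can be sketched in Python. The forecaster here is a seasonal-naive stand-in (each period repeats the value one season earlier), chosen only to keep the example self-contained; it is not the Time Series algorithm used by Analysis Services, and the function names and sales figures are hypothetical.

```python
def seasonal_naive_forecast(history, periods, season_length):
    """Stand-in forecaster: repeat the value from one season earlier."""
    extended = list(history)
    for _ in range(periods):
        extended.append(extended[-season_length])
    return extended[len(history):]

def backtest(series, holdout, season_length):
    """Hide the last `holdout` periods, forecast them, and measure error."""
    train, actual = series[:-holdout], series[-holdout:]
    predicted = seasonal_naive_forecast(train, holdout, season_length)
    # Mean absolute error between the hidden actuals and the forecast.
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / holdout
    return predicted, mae

# Hypothetical quarterly sales for three years.
quarterly_sales = [100, 120, 90, 150, 110, 130, 95, 160, 115, 140, 100, 170]
predicted, mae = backtest(quarterly_sales, holdout=4, season_length=4)
print(predicted, mae)
```

A small error relative to the size of the values suggests the forecast can be trusted a few periods ahead; a large one suggests adding related series or revisiting the periodicity hint.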

Scenario Analysis

Scenario Analysis investigates how changes to the source data set affect the outcome. The “What-if” option enables the user to form an exact question, for example, “How many more customers would purchase a bike if their income went up 20%?” The “Goal Seek” option asks Excel to find the value at which the desired outcome occurs, for example, “How much more income would our customers need before purchasing a bike?” Excel builds a logistic regression model to estimate the impact of changes.

Upon completion of a “What-if” scenario for the entire table, new columns are added to the source table showing the new value of the outcome column and the confidence in the result. “Goal Seek” for the entire table adds columns for the new value of the outcome column and the new value for the column being adjusted.

Prediction Calculator

The Prediction Calculator first uses the data in an Excel table to train a logistic regression model to predict an outcome, and then makes the resulting model available as a calculator in Excel to evaluate individual cases without being connected to the Analysis Services server. For example, a model could be trained to predict component failure based on measurable attributes, and then made available to technicians performing preventive maintenance. Tell the wizard which attribute and attribute value are to be predicted, and it will create up to three new sheets in the workbook:

■ Prediction Report: Lists all the significant attribute/value combinations found in building the model and their impact on the result. In addition, if the user enters costs associated with correct and incorrect guesses into the interactive profit calculator (e.g., cost of a component failure vs. cost of replacing a component that would not have failed), a threshold will be calculated for how likely an outcome must be before it is predicted. This threshold is then used in the calculator pages.

■ Prediction Calculator (optional): Enter the values for a case and see the predicted outcome.

■ Printable Calculator (optional): This contains a printable form that can be used for data collection and later entry into Excel, or even manual calculation without entry into Excel.
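
The link between costs and a prediction threshold can be illustrated with a textbook break-even rule. This is an assumption for illustration only: the add-in's profit calculator may weigh more inputs (such as the benefit of correct guesses) and derive its threshold differently. The function names and cost figures are hypothetical.

```python
def breakeven_threshold(cost_false_positive, cost_false_negative):
    """Probability above which predicting the outcome is the cheaper bet.

    Predicting the outcome risks (1 - p) * cost_false_positive, while not
    predicting it risks p * cost_false_negative. The two are equal at
    p = cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def predict_outcome(probability, threshold):
    """Predict the outcome once its probability reaches the threshold."""
    return probability >= threshold

# Hypothetical costs: replacing a healthy component costs 200,
# an unplanned failure costs 1800.
threshold = breakeven_threshold(200, 1800)
print(threshold)
print(predict_outcome(0.15, threshold), predict_outcome(0.05, threshold))
```

When a missed failure is far more expensive than an unnecessary replacement, the threshold drops well below 50%, so even fairly unlikely failures are flagged.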

Shopping Basket Analysis

The Shopping Basket Analysis is a quick way to build an association rules model based on the data in an Excel table. This model will identify groups of items that normally appear together in a transaction, allowing better product organization and/or suggestions to customers, such as the famous “Customers who bought this book also bought...”

The Excel table must contain certain columns that are indicated in the wizard:

■ Transaction ID: The Order Number, Session ID, or some other identifier that ties multiple rows together into a single transaction.

■ Item: The name or other identifier of the item purchased.

■ Item Value (optional): The price or value of the item included in that transaction. This enables the results to be sorted on the total value that a “basket” represents (average price of the basket * number of sales). As a result, a priority can be placed on suggestions that will [...]

After the model has been built, two sheets are added to the workbook. The Bundled Items report details all the bundles (item combinations) found and their associated sales and price information. The Recommendations report lists recommendation rules by item, the proposed recommendation, and supporting statistics.
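
The idea behind a bundle listing can be sketched with simple pair counting over the three columns the wizard asks for. This is a simplified illustration, not the association rules algorithm in Analysis Services: it looks only at pairs of items, and the function name, the min_support parameter, and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def bundle_pairs(rows, min_support=2):
    """Find item pairs that share transactions, ranked by total value.

    rows: (transaction_id, item, item_value) tuples.
    """
    # Group rows into baskets keyed by transaction ID.
    baskets = defaultdict(dict)
    for transaction_id, item, value in rows:
        baskets[transaction_id][item] = value

    # For every pair in a basket, count the sale and add up its value.
    stats = defaultdict(lambda: [0, 0.0])
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            stats[(a, b)][0] += 1
            stats[(a, b)][1] += items[a] + items[b]

    bundles = [
        (pair, count, total)
        for pair, (count, total) in stats.items()
        if count >= min_support
    ]
    # Total value equals average bundle price times number of sales.
    return sorted(bundles, key=lambda b: b[2], reverse=True)

sales = [
    (1, "bike", 500), (1, "helmet", 40),
    (2, "bike", 500), (2, "helmet", 40), (2, "bottle", 5),
    (3, "helmet", 40), (3, "bottle", 5),
    (4, "bike", 500), (4, "bottle", 5),
]

bundles = bundle_pairs(sales)
for pair, count, total in bundles:
    print(pair, count, total)
```

Sorting by total value rather than by count alone puts the bike-and-helmet pair ahead of the equally frequent but far cheaper helmet-and-bottle pair.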

Data mining client

The Data Mining tab added by installing the SQL Server Data Mining Add-ins for Office 2007 provides a full data mining environment, equivalent to the data mining environment provided by Visual Studio (also known as Business Intelligence Development Studio). Unlike the table analysis tools described earlier, whereby tables and reports are created directly in Excel, the primary focus here is on creating, training, browsing, and querying data mining models in an Analysis Services database.

Working from within Excel to develop models can have advantages over the Visual Studio environment, especially when working with small amounts of data early in the process, while cleaning and exploring the data set, because the data set can be quickly changed in Excel and used to train and test models in Analysis Services. However, there are limitations to the Excel environment, such as the inability to show the accuracy of competing models in the same accuracy chart.

See Chapter 76, “Data Mining with Analysis Services,” to learn more about the data mining features of Analysis Services and the functions detailed here.

Functions exposed on the Data Mining tab include the following:

■ Data Preparation: This is described in the section “Exploring and Preparing Data” earlier in the chapter.

■ Data Modeling: Allows the creation of mining structures and models. Several of the most popular models are listed as separate functions, while the Advanced option provides access to all available algorithms.

■ Accuracy and Validation: Provides different views of model performance on test data.

■ Browse: Enables the examination of model details for any model in the current database.

■ Document Model: Adds a new sheet to the current workbook listing model details.

■ Query: Provides a friendly environment for constructing and executing DMX queries against mining models.

■ Manage Models: Enables structures and models in the current database to be deleted, renamed, processed, and so on.

■ Connection: Manages connections to the Analysis Services database.

■ Trace: Provides a history of every command sent to the Analysis Services server. Use of session models for table analysis functions can also be enabled or disabled here.

Summary

Microsoft Excel has long been the most frequently used tool for analyzing data, and with the advent of the 2007 version, it is easier than ever to include relational and Analysis Services data in those analyses. Relational data can be included in data tables that remain linked to the underlying table or query.
