Figure 10-13 Configurations for the Unpivot transformation
Aggregate Transformation
The Aggregate transformation is an asynchronous transformation that helps you to perform aggregate operations such as SUM, AVERAGE, and COUNT. To perform these aggregate operations, you need to have a complete data set. The Aggregate transformation therefore consumes all the rows before applying any aggregation and extracting the transformed data. Because it is an asynchronous transformation, the output data, which most likely has a new schema, is populated in new memory buffers.
The Aggregate transformation can perform operations such as AVERAGE, COUNT, COUNT DISTINCT, GROUP BY, selecting a minimum or maximum from a group, and SUM on column values. The aggregated data is then extracted in new output columns. The output columns may also contain the input columns that form part of the groupings or aggregations.
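The blocking behavior described above can be sketched in a few lines of Python. This is a language-neutral analogy, not SSIS code: the point is that an aggregate cannot emit any output row until every input row has been consumed, and its output rows have a new shape.

```python
# Sketch of a blocking (asynchronous) aggregate: it must consume every
# input row before emitting a single output row, and the output has a new
# schema (group key plus aggregate), so it lands in fresh buffers.
def blocking_sum(rows, key, value):
    totals = {}
    for row in rows:  # consume ALL input rows first
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    # only now can any output be produced
    return [{key: k, "Sum": v} for k, v in totals.items()]

orders = [
    {"SalesOrderID": 1, "TotalPrice": 10.0},
    {"SalesOrderID": 1, "TotalPrice": 5.0},
    {"SalesOrderID": 2, "TotalPrice": 7.5},
]
print(blocking_sum(orders, "SalesOrderID", "TotalPrice"))
```

Contrast this with a row-by-row (synchronous) transformation such as Derived Column, which can emit each output row as soon as the corresponding input row arrives.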
When you select a column in the Aggregate Transformation Editor and click in the Operation field, you will see a list of operations that match the data type of the column you selected. This makes sense, as the aggregate operations require appropriate column types; for example, a SUM works on a numeric data type column but not on a string data type column. Following are the operation descriptions in detail:
AVERAGE   This operation is available only for numeric data type columns and returns the average of the column values.

COUNT   Counts the number of rows for the selected column. This operation does not count rows that have null values in the specified column. In the Aggregate Transformation Editor, a special column (*) has been added that allows you to perform the COUNT ALL operation to count all the rows in a data set, including those with null values.

COUNT DISTINCT   Counts the number of rows containing distinct non-null values in a group.

GROUP BY   This operation can be performed on any data type column and returns the data set in groups of row sets.

MAXIMUM   This operation can be performed on numeric, date, and time data type columns and returns the maximum value in a group.

MINIMUM   This operation can be performed on numeric, date, and time data type columns and returns the minimum value in a group.

SUM   This operation is available only for numeric data type columns and returns the sum of values in a column.
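How these operations treat null values can be illustrated with plain Python, using None to stand in for NULL. This is an illustrative analogy only, not SSIS code:

```python
# COUNT ALL counts every row; COUNT and COUNT DISTINCT skip nulls;
# GROUP BY puts null keys into their own group.
rows = [("A", 10), ("A", None), ("B", 10), ("B", 20), (None, 5)]

count_all = len(rows)                                        # COUNT ALL (*)
count_col = sum(1 for _, v in rows if v is not None)         # COUNT on the value column
count_distinct = len({v for _, v in rows if v is not None})  # COUNT DISTINCT

groups = {}
for k, v in rows:
    groups.setdefault(k, []).append(v)   # GROUP BY: None forms its own group

print(count_all, count_col, count_distinct, sorted(groups, key=str))
```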
The Aggregate transformation's user interface provides features and options to configure aggregations that we will cover in the following Hands-On exercise, as they will be easier to understand as you work with them. Before we dive in, consider the following:
You can perform multiple aggregate operations on the same set of data. For example, you can perform SUM and AVERAGE operations on a column in the same transformation. As the results from these two aggregate operations will be different, you must direct the results to different outputs. This is fully supported by the transformation, and you can add multiple outputs to it.

Null values are handled as specified in the SQL-92 standard, that is, in the same way they are handled by T-SQL. The COUNT ALL operation counts all the rows, including those containing null values, whereas the COUNT and COUNT DISTINCT operations for a specific column count only rows with non-null values in the specified column. In addition, the GROUP BY operation puts null values in a separate group.

When an output column requires special handling because it contains an oversized data value greater than four billion, or the data requires precision beyond a float data type, you can set the IsBig property of the output column to 1 so that the transformation uses the correct data type for storing the column value. However, columns involved in a GROUP BY, MINIMUM, or MAXIMUM operation cannot take advantage of this.
Hands-On: Aggregating SalesOrders
The SalesOrders.xls file has been extended with pricing information by adding unit price and total price columns. The extended data has been saved as SalesOrdersExtended.xls. From this extended sales orders data, which lists a price for each product against the SalesOrderID, you are required to aggregate a total price for each order and to calculate the average price per product and the number of each product sold.
Before starting this exercise, open the SalesOrdersExtended.xls Excel file to verify that the file has only one worksheet, named SalesOrders. This exercise adds two more worksheets to it; if the file has additional sheets, delete them before you start. Also, if you are using the provided package code, you may get a validation error, as the Excel destinations used in the package look for the worksheets during this exercise; in this case, leave the worksheets as is.
Method
As with the Pivot transformation Hands-On exercise, you will use an Excel file to access the data from a worksheet and create two new worksheets in the same Excel file to hold the processed data. You will configure a Data Flow task and an Aggregate transformation.
Exercise (Add Connection Manager and Data Flow Task to a New Package)
Like the previous exercise, you will start this exercise by adding a new package to the Data Flow transformations project and then adding an Excel Connection Manager to it.

1. Open the Data Flow transformations project in BIDS. Right-click SSIS Packages in Solution Explorer and choose New SSIS Package. This will add a new SSIS package named Package1.dtsx.
2. Rename Package1.dtsx as Aggregating SalesOrders.dtsx.
3. Add an Excel Connection Manager to connect to the Excel file C:\SSIS\RawFiles\SalesOrdersExtended.xls.
4. Add a Data Flow task from the Toolbox and rename it Aggregating SalesOrders. Double-click it to open the Data Flow tab.
Exercise (Configure Aggregating SalesOrders)
The main focus of this part is to learn to configure an Aggregate transformation. You will also configure the Excel source and Excel destinations to complete the data flow configurations.
5. Add an Excel source to extract data from the SalesOrders$ worksheet in the SalesOrdersExtended.xls file. Rename this Excel source Sales Orders Data Source.
6. Drag and drop an Aggregate transformation from the Toolbox onto the Data Flow surface, just below the Excel source. Connect the two components with a data flow path.
7. Double-click the Aggregate transformation to open the Aggregate Transformation Editor, which displays two tabs: Aggregations and Advanced. In the Aggregations tab, you select columns for aggregations and specify aggregation properties for them. This tab has two display types, basic and advanced. An Advanced button on the Aggregations tab converts the basic display into an advanced display, which allows you to perform multiple groupings or GROUP BY operations. Click Advanced to see the advanced display. Selecting multiple GROUP BY operations adds rows using multiple Aggregation Names in the advanced display section. This also means that you will be generating different types of output data sets that will be sent to multiple outputs. So, when you add an Aggregation Name, you add an additional output to the transformation outputs.
8. Click the Advanced tab to see the properties you can apply at the Aggregate component level. As you can see, there are configurations to be done in more than one place in this editor. In fact, you can configure this transformation at three levels: the component level, the output level, and the column level. The properties you define on the Advanced tab apply at the component level, the properties configured in the advanced display of the Aggregations tab apply at the output level, and the properties configured in the column list at the bottom of the Aggregations tab apply at the column level. The capability to specify properties at different levels enables you to configure the transformation for maximum performance benefit. The following descriptions explain the properties on the Advanced tab:
Key Scale   This optional property helps the transformation decide its initial cache size. By default, this property is not used; when selected, it can have a value of Low, Medium, or High. Using the Low value, the aggregation can write approximately 500,000 keys; Medium enables it to write about 5 million keys; and High enables it to write approximately 25 million keys.

Number Of Keys   This optional setting overrides the Key Scale value by specifying the exact number of keys that you expect this transformation to handle. Specifying the keys up front allows the transformation to manage its cache properly and avoid reorganizing the cache at run time, thus enhancing performance.

Count Distinct Scale   You can specify an approximate number of distinct values that the transformation is expected to handle. This is an optional setting and is unspecified by default. You can select a Low, Medium, or High value. Using the Low value, the aggregation can write approximately 500,000 distinct values; Medium enables it to write about 5 million distinct values; and High enables it to write approximately 25 million distinct values.

Count Distinct Keys   Using this property, you can override the Count Distinct Scale value by specifying the exact number of distinct values that the transformation can write. This avoids reorganizing cached totals at run time and enhances performance.

Auto Extend Factor   Using this property, you can specify a percentage by which this transformation can extend its memory during run time. You can use a value between 1 and 100 percent; the default is 25 percent.
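The reasoning behind these properties can be sketched with a toy cache model. This is an analogy only, not the transformation's actual algorithm, and the numbers are made up; it simply shows why pre-sizing via Number Of Keys avoids the repeated reorganization that growing by Auto Extend Factor incurs:

```python
# Toy model of the aggregation cache: whenever the number of group keys
# exceeds capacity, the cache grows by the extend factor (default 25
# percent) and must be reorganized. Pre-sizing avoids every resize.
def count_resizes(expected_keys, initial_capacity, extend_factor=0.25):
    capacity, resizes = initial_capacity, 0
    for seen in range(1, expected_keys + 1):
        if seen > capacity:  # cache full: extend and reorganize
            capacity = int(capacity * (1 + extend_factor)) + 1
            resizes += 1
    return resizes

print(count_resizes(100, initial_capacity=8))    # small default cache: many resizes
print(count_resizes(100, initial_capacity=100))  # pre-sized for the workload: none
```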
9. Go to the Aggregations tab and select SalesOrderID and TotalPrice from the Available Input Columns. As you select them, they will be added to the columns list below, SalesOrderID with the GROUP BY operation and TotalPrice with the SUM operation.
10. Click Advanced to display the options for configuring aggregations for multiple outputs. You will see Aggregate Output 1 already configured with SalesOrderID as a GROUP BY column. Rename Aggregate Output 1 as Total Per Order. Next, click in the second row in the Aggregation Name field and type Products Sold and Average Price in the cell. Then select the ProductName, OrderQuantity, and UnitPrice columns from the Available Input Columns list. These columns will appear in the columns list with default operations applied to them. Change these operations as follows: the GROUP BY operation for the ProductName column, the SUM operation for the OrderQuantity column, and the AVERAGE operation for the UnitPrice column, as shown in Figure 10-14.
Note that you can specify a key scale and keys in the advanced display for each output, thus enhancing performance by specifying the number of keys the output is expected to contain. Similarly, you can specify Count Distinct Scale and Count Distinct Keys values for each column in the list to specify the number of distinct values the column is expected to contain.
Figure 10-14 Configuring Aggregate transformation for multiple outputs
11. You're done with the configuration of the Aggregate transformation. Click OK to close the editor. Before you start executing the package, check out one more thing. Open the Advanced Editor for the Aggregate transformation and go to the Input and Output Properties tab. You'll see the two outputs you created earlier in the advanced display of the Aggregations tab of the custom editor. Expand and view the different output columns. Also, note that you can specify the IsBig property on the output columns here, as shown in Figure 10-15. Click OK to return to the Designer.
Figure 10-15 Multiple outputs of Aggregate transformation
12. Let's direct the outputs to different worksheets in the same Excel file. Add an Excel destination on the left, just below the Aggregate transformation, and drag the green arrow from the Aggregate transformation to the Excel destination. As you drop it on the Excel destination, an Input Output Selection dialog box will pop up, asking you to specify the output you want to connect to this destination. Select Total Per Order in the Output field and click OK to add the connector.
13. Double-click the Excel destination and click the New button opposite the "Name of the Excel sheet" field to add a new worksheet to the Excel file. In the Create Table dialog box, change the name of the table from Excel Destination to TotalPerOrder and click OK to return to the Excel Destination Editor dialog box. Select the TotalPerOrder sheet in the field. Next, go to the Mappings page, and the mappings between the Available Input Columns and the Available Destination Columns will be created for you by default. Click OK to close the editor. Rename this destination Total Per Order.
14. Much as you did in Steps 12 and 13, add another Excel destination below the Aggregate transformation, on the right side (see Figure 10-16), and rename it Products Sold and Average Price. Connect the Aggregate transformation to the Products Sold and Average Price destination using the second green arrow. Add a new worksheet named ProductsSoldandAveragePrice by clicking the New button next to the "Name of the Excel sheet" field. Go to the Mappings page to create the mappings. Click OK to close the editor.

Figure 10-16 Aggregating the SalesOrders package
Exercise (Run the Aggregations)
In the final part of this Hands-On exercise, you will execute the package and see the results. If you wish, you can add data viewers where you would like to see the data grid.
15. Press F5 to run the package. The package will complete execution almost immediately. Stop debugging by pressing SHIFT-F5. Save all files and close the project.
16. Browse to the C:\SSIS\RawFiles folder and open the SalesOrdersExtended.xls file. You will see two new worksheets created and populated with data. Check the data to validate the Aggregate transformation operations.
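What the package computes can be cross-checked with a minimal pure-Python sketch. The column and output names follow the exercise; the sample rows below are made up for illustration:

```python
# Output 1 (Total Per Order): GROUP BY SalesOrderID, SUM of TotalPrice.
# Output 2 (Products Sold and Average Price): GROUP BY ProductName,
# SUM of OrderQuantity, AVERAGE of UnitPrice.
rows = [
    {"SalesOrderID": 1, "ProductName": "Bolt", "OrderQuantity": 4, "UnitPrice": 2.0, "TotalPrice": 8.0},
    {"SalesOrderID": 1, "ProductName": "Nut",  "OrderQuantity": 2, "UnitPrice": 1.0, "TotalPrice": 2.0},
    {"SalesOrderID": 2, "ProductName": "Bolt", "OrderQuantity": 1, "UnitPrice": 2.0, "TotalPrice": 2.0},
]

total_per_order = {}
per_product = {}
for r in rows:
    oid = r["SalesOrderID"]
    total_per_order[oid] = total_per_order.get(oid, 0.0) + r["TotalPrice"]
    name = r["ProductName"]
    qty, price_sum, n = per_product.get(name, (0, 0.0, 0))
    per_product[name] = (qty + r["OrderQuantity"], price_sum + r["UnitPrice"], n + 1)

products_sold_and_average_price = {
    name: (qty, price_sum / n) for name, (qty, price_sum, n) in per_product.items()
}
print(total_per_order)
print(products_sold_and_average_price)
```

Note that one pass over the input feeds both groupings, just as the single Aggregate transformation feeds both outputs.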
Review
You've done some aggregations in this exercise and have added multiple outputs to the Aggregate transformation. You've also learned that the Aggregate transformation can be configured at the component level by specifying the properties on the Advanced tab, at the output level by specifying keys for each output, and at the column level for the distinct values each column is expected to have. You can achieve high levels of performance using these configurations. However, if you find that the transformation is still suffering from a memory shortage, you can use the Auto Extend Factor to extend the memory usage of this component.
Audit Transformations
Audit transformations are important as well. This category includes only two transformations in this release. The Audit transformation adds environmental data, such as system data or the login name, to the pipeline. The Row Count transformation counts the number of rows in the pipeline and stores the count in a variable.
Audit Transformation
One of the common requirements of data maintenance and data warehousing is to timestamp a record whenever it is added or updated. Generally, in data marts you get a nightly feed of new as well as updated records, and you want to timestamp the records that are inserted or updated to maintain a history, which is also quite helpful in data analysis. By providing the Audit transformation, Integration Services has extended this ability and allows environment variable values to be included in the data flow. With the Audit transformation, not only can you include the package execution start time, but you can include much more information as well, for example, the name of the operator, computer, or package, to indicate who has made changes to the data and the source of the data. This is like using the Derived Column transformation in a special way. To perform its assigned functions, this transformation supports one input and one output. As it does not perform any transformation on the input column data (rather, it just adds known environment values using system variables), an error in this transformation is not expected, and hence it does not support an error output.
The Audit transformation provides access to nine system variables. Following is a brief description of each of these variables:

ExecutionInstanceGUID   Each execution instance of the package is allocated a GUID, which is contained in this variable.

PackageID   Represents the unique identifier of the package.

PackageName   Adds the package name to the data flow.

VersionID   Holds the version identifier of the package.

ExecutionStartTime   Includes the time when the package started to run.

MachineName   Provides the name of the computer on which the package runs.

UserName   Adds the login name of the person who runs the package.

TaskName   Holds the name of the Data Flow task to which this transformation belongs.

TaskID   Holds the unique identifier of the Data Flow task to which this transformation belongs.
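As a rough analogy, an audit step like this can be mimicked in Python by appending environment-derived columns to each row. The column names mirror a subset of the nine system variables; the values here come from the local process, not from SSIS:

```python
import os
import socket
import uuid
from datetime import datetime

# One GUID and one start time per execution instance, shared by all rows.
EXECUTION_INSTANCE_GUID = str(uuid.uuid4())
EXECUTION_START_TIME = datetime.now()

def audit(row):
    """Return a copy of the row with audit columns appended."""
    out = dict(row)
    out["ExecutionInstanceGUID"] = EXECUTION_INSTANCE_GUID
    out["ExecutionStartTime"] = EXECUTION_START_TIME
    out["MachineName"] = socket.gethostname()
    out["UserName"] = os.environ.get("USER", os.environ.get("USERNAME", "unknown"))
    return out

audited = audit({"SalesOrderID": 1})
print(sorted(audited))
```

Like the real transformation, this adds known environment values without altering the input columns, which is why no error output is needed.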
When you open the editor for this transformation after connecting an input, you will be able to select a system variable from a drop-down list by clicking in the Audit Type column, as shown in Figure 10-17. When you select a system variable, the Output Column Name field shows the default name of the variable, which you can change. This output column will be added to the transformation output as a new output column.
Row Count Transformation
Using the Row Count transformation, you can count the rows that pass through the transformation and store the final count in a variable that can be used by other components, such as a Script component and property expressions, or can be useful