Figure 10-13 Configurations for the Unpivot transformation
Aggregate Transformation
The Aggregate transformation is an asynchronous transformation that helps you to perform aggregate operations such as SUM, AVERAGE, and COUNT. To perform these aggregate operations, you need to have a complete data set. The Aggregate transformation therefore consumes all the rows before applying any aggregation and extracting the transformed data. Because it is an asynchronous transformation, the output data, which most likely has a new schema, is populated in new memory buffers.
The Aggregate transformation can perform operations such as AVERAGE, COUNT, COUNT DISTINCT, GROUP BY, selecting a minimum or maximum from a group, and SUM on column values. The aggregated data is then extracted in new output columns. The output columns may also contain the input columns that form part of the groupings or aggregations.
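The blocking behavior described above can be sketched in a few lines of Python. This is a language-neutral analogy, not SSIS code: the point is that an aggregate cannot emit any output row until every input row has been consumed, and its output rows have a new shape.

```python
# Sketch of a blocking (asynchronous) aggregate: it must consume every
# input row before emitting a single output row, and the output has a new
# schema (group key plus aggregate), so it lands in fresh buffers.
def blocking_sum(rows, key, value):
    totals = {}
    for row in rows:  # consume ALL input rows first
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    # only now can any output be produced
    return [{key: k, "Sum": v} for k, v in totals.items()]

orders = [
    {"SalesOrderID": 1, "TotalPrice": 10.0},
    {"SalesOrderID": 1, "TotalPrice": 5.0},
    {"SalesOrderID": 2, "TotalPrice": 7.5},
]
print(blocking_sum(orders, "SalesOrderID", "TotalPrice"))
```

Contrast this with a row-by-row (synchronous) transformation such as Derived Column, which can emit each output row as soon as the corresponding input row arrives.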
When you select a column in the Aggregate Transformation Editor and click in the Operation field, you will see a list of operations that match the data type of the column you selected. This makes sense, as the aggregate operations require appropriate column types; for example, a SUM works on a numeric data type column but not on a string data type column. Following are the operation descriptions in detail:
AVERAGE   This operation is available only for numeric data type columns and returns the average of the column values.

COUNT   Counts the number of rows for the selected column. This operation does not count rows that have null values in the specified column. In the Aggregate Transformation Editor, a special column (*) has been added that allows you to perform the COUNT ALL operation to count all the rows in a data set, including those with null values.

COUNT DISTINCT   Counts the number of rows containing distinct non-null values in a group.

GROUP BY   This operation can be performed on any data type column and returns the data set in groups of row sets.

MAXIMUM   This operation can be performed on numeric, date, and time data type columns and returns the maximum value in a group.

MINIMUM   This operation can be performed on numeric, date, and time data type columns and returns the minimum value in a group.

SUM   This operation is available only for numeric data type columns and returns the sum of values in a column.
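How these operations treat null values can be illustrated with plain Python, using None to stand in for NULL. This is an illustrative analogy only, not SSIS code:

```python
# COUNT ALL counts every row; COUNT and COUNT DISTINCT skip nulls;
# GROUP BY puts null keys into their own group.
rows = [("A", 10), ("A", None), ("B", 10), ("B", 20), (None, 5)]

count_all = len(rows)                                        # COUNT ALL (*)
count_col = sum(1 for _, v in rows if v is not None)         # COUNT on the value column
count_distinct = len({v for _, v in rows if v is not None})  # COUNT DISTINCT

groups = {}
for k, v in rows:
    groups.setdefault(k, []).append(v)   # GROUP BY: None forms its own group

print(count_all, count_col, count_distinct, sorted(groups, key=str))
```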
The Aggregate transformation's user interface provides features and options to configure aggregations that we will cover in the following Hands-On exercise, as they will be easier to understand as you work with them. Before we dive in, consider the following:
You can perform multiple aggregate operations on the same set of data. For example, you can perform SUM and AVERAGE operations on a column in the same transformation. As the results from these two aggregate operations will be different, you must direct the results to different outputs. This is fully supported by the transformation, and you can add multiple outputs to it.

Null values are handled as specified in the SQL-92 standard, that is, in the same way they are handled by T-SQL. The COUNT ALL operation counts all the rows, including those containing null values, whereas the COUNT and COUNT DISTINCT operations for a specific column count only rows with non-null values in the specified column. In addition, the GROUP BY operation puts null values in a separate group.

When an output column requires special handling because it contains an oversized data value greater than four billion, or the data requires precision beyond a float data type, you can set the IsBig property of the output column to 1 so that the transformation uses the correct data type for storing the column value. However, columns involved in a GROUP BY, MINIMUM, or MAXIMUM operation cannot take advantage of this.
Hands-On: Aggregating SalesOrders
The SalesOrders.xls file has been extended with pricing information by adding unit price and total price columns. The extended data has been saved as SalesOrdersExtended.xls. From this extended sales orders data, which lists a price for each product against the SalesOrderID, you are required to aggregate a total price for each order and to calculate the average price per product and the number of each product sold.
Before starting this exercise, open the SalesOrdersExtended.xls Excel file to verify that the file has only one worksheet, named SalesOrders. This exercise adds two more worksheets to it; if the file has additional sheets, delete them before you start. Also, if you are using the provided package code, you may get a validation error, as the Excel destinations used in the package look for the worksheets during this exercise; in this case, leave the worksheets as is.
Method
As with the Pivot transformation Hands-On exercise, you will use an Excel file to access the data from a worksheet and create two new worksheets in the same Excel file to hold the processed data. You will configure a Data Flow task and an Aggregate transformation.
Exercise (Add Connection Manager and Data Flow Task to a New Package)
Like the previous exercise, you will start this exercise by adding a new package to the Data Flow transformations project and then adding an Excel Connection Manager to it.

1. Open the Data Flow transformations project in BIDS. Right-click SSIS Packages in Solution Explorer and choose New SSIS Package. This will add a new SSIS package named Package1.dtsx.
2. Rename Package1.dtsx as Aggregating SalesOrders.dtsx.
3. Add an Excel Connection Manager to connect to the Excel file C:\SSIS\RawFiles\SalesOrdersExtended.xls.
4. Add a Data Flow task from the Toolbox and rename it Aggregating SalesOrders. Double-click it to open the Data Flow tab.
Exercise (Configure Aggregating SalesOrders)
The main focus of this part is to learn to configure an Aggregate transformation. You will also configure the Excel source and Excel destinations to complete the data flow configurations.
5. Add an Excel source to extract data from the SalesOrders$ worksheet in the SalesOrdersExtended.xls file. Rename this Excel source Sales Orders Data Source.
6. Drag and drop an Aggregate transformation from the Toolbox onto the Data Flow surface, just below the Excel source. Connect the two components with a data flow path.
7. Double-click the Aggregate transformation to open the Aggregate Transformation Editor, which displays two tabs: Aggregations and Advanced. In the Aggregations tab, you select columns for aggregations and specify aggregation properties for them. This tab has two display types, basic and advanced. An Advanced button on the Aggregations tab converts the basic display into an advanced display, which allows you to perform multiple groupings or GROUP BY operations. Click Advanced to see the advanced display. Selecting multiple GROUP BY operations adds rows using multiple Aggregation Names in the advanced display section. This also means that you will be generating different types of output data sets that will be sent to multiple outputs. So, when you add an Aggregation Name, you add an additional output to the transformation outputs.
8. Click the Advanced tab to see the properties you can apply at the Aggregate component level. As you can see, there are configurations to be done in more than one place in this editor. In fact, you can configure this transformation at three levels: the component level, the output level, and the column level. The properties you define on the Advanced tab apply at the component level, the properties configured in the advanced display of the Aggregations tab apply at the output level, and the properties configured in the column list at the bottom of the Aggregations tab apply at the column level. The capability to specify properties at different levels enables you to configure the transformation for maximum performance benefit. The following descriptions explain the properties on the Advanced tab:
Key Scale   This optional property helps the transformation decide its initial cache size. By default, this property is not used; when selected, it can have a value of Low, Medium, or High. Using the Low value, the aggregation can write approximately 500,000 keys; Medium enables it to write about 5 million keys; and High enables it to write approximately 25 million keys.

Number Of Keys   This optional setting overrides the Key Scale value by specifying the exact number of keys that you expect this transformation to handle. Specifying the keys up front allows the transformation to manage its cache properly and avoid reorganizing the cache at run time, thus enhancing performance.

Count Distinct Scale   You can specify an approximate number of distinct values that the transformation is expected to handle. This is an optional setting and is unspecified by default. You can select a Low, Medium, or High value. Using the Low value, the aggregation can write approximately 500,000 distinct values; Medium enables it to write about 5 million distinct values; and High enables it to write approximately 25 million distinct values.

Count Distinct Keys   Using this property, you can override the Count Distinct Scale value by specifying the exact number of distinct values that the transformation can write. This avoids reorganizing cached totals at run time and enhances performance.

Auto Extend Factor   Using this property, you can specify a percentage by which this transformation can extend its memory during run time. You can use a value between 1 and 100 percent; the default is 25 percent.
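The reasoning behind these properties can be sketched with a toy cache model. This is an analogy only, not the transformation's actual algorithm, and the numbers are made up; it simply shows why pre-sizing via Number Of Keys avoids the repeated reorganization that growing by Auto Extend Factor incurs:

```python
# Toy model of the aggregation cache: whenever the number of group keys
# exceeds capacity, the cache grows by the extend factor (default 25
# percent) and must be reorganized. Pre-sizing avoids every resize.
def count_resizes(expected_keys, initial_capacity, extend_factor=0.25):
    capacity, resizes = initial_capacity, 0
    for seen in range(1, expected_keys + 1):
        if seen > capacity:  # cache full: extend and reorganize
            capacity = int(capacity * (1 + extend_factor)) + 1
            resizes += 1
    return resizes

print(count_resizes(100, initial_capacity=8))    # small default cache: many resizes
print(count_resizes(100, initial_capacity=100))  # pre-sized for the workload: none
```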
9. Go to the Aggregations tab and select SalesOrderID and TotalPrice from the Available Input Columns. As you select them, they will be added to the columns list below, SalesOrderID with the GROUP BY operation and TotalPrice with the SUM operation.
10. Click Advanced to display the options for configuring aggregations for multiple outputs. You will see Aggregate Output 1 already configured with SalesOrderID as a GROUP BY column. Rename Aggregate Output 1 as Total Per Order. Next, click in the second row in the Aggregation Name field and type Products Sold and Average Price in the cell. Then select the ProductName, OrderQuantity, and UnitPrice columns from the Available Input Columns list. These columns will appear in the columns list with default operations applied to them. Change these operations as follows: the GROUP BY operation for the ProductName column, the SUM operation for the OrderQuantity column, and the AVERAGE operation for the UnitPrice column, as shown in Figure 10-14.
Note that you can specify a key scale and keys in the advanced display for each output, thus enhancing performance by specifying the number of keys the output is expected to contain. Similarly, you can specify Count Distinct Scale and Count Distinct Keys values for each column in the list to specify the number of distinct values the column is expected to contain.
Figure 10-14 Configuring Aggregate transformation for multiple outputs
11. You're done with the configuration of the Aggregate transformation. Click OK to close the editor. Before you start executing the package, check out one more thing. Open the Advanced Editor for the Aggregate transformation and go to the Input and Output Properties tab. You'll see the two outputs you created earlier in the advanced display of the Aggregations tab of the custom editor. Expand and view the different output columns. Also, note that you can specify the IsBig property on the output columns here, as shown in Figure 10-15. Click OK to return to the Designer.
Figure 10-15 Multiple outputs of Aggregate transformation
12. Let's direct the outputs to different worksheets in the same Excel file. Add an Excel destination on the left, just below the Aggregate transformation, and drag the green arrow from the Aggregate transformation to the Excel destination. As you drop it on the Excel destination, an Input Output Selection dialog box will pop up, asking you to specify the output you want to connect to this destination. Select Total Per Order in the Output field and click OK to add the connector.
13. Double-click the Excel destination and click the New button opposite the "Name of the Excel sheet" field to add a new worksheet to the Excel file. In the Create Table dialog box, change the name of the table from Excel Destination to TotalPerOrder and click OK to return to the Excel Destination Editor dialog box. Select the TotalPerOrder sheet in the field. Next, go to the Mappings page, and the mappings between the Available Input Columns and the Available Destination Columns will be created for you by default. Click OK to close the editor. Rename this destination Total Per Order.
14. Much as you did in Steps 12 and 13, add another Excel destination below the Aggregate transformation, on the right side (see Figure 10-16), and rename it Products Sold and Average Price. Connect the Aggregate transformation to the Products Sold and Average Price destination using the second green arrow. Add a new worksheet named ProductsSoldandAveragePrice by clicking the New button next to the "Name of the Excel sheet" field. Go to the Mappings page to create the mappings. Click OK to close the editor.

Figure 10-16 Aggregating the SalesOrders package
Exercise (Run the Aggregations)
In the final part of this Hands-On exercise, you will execute the package and see the results. If you wish, you can add data viewers where you would like to see the data grid.
15. Press F5 to run the package. The package will complete execution almost immediately. Stop debugging by pressing SHIFT-F5. Save all files and close the project.
16. Browse to the C:\SSIS\RawFiles folder and open the SalesOrdersExtended.xls file. You will see two new worksheets created and populated with data. Check the data to validate the Aggregate transformation operations.
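What the package computes can be cross-checked with a minimal pure-Python sketch. The column and output names follow the exercise; the sample rows below are made up for illustration:

```python
# Output 1 (Total Per Order): GROUP BY SalesOrderID, SUM of TotalPrice.
# Output 2 (Products Sold and Average Price): GROUP BY ProductName,
# SUM of OrderQuantity, AVERAGE of UnitPrice.
rows = [
    {"SalesOrderID": 1, "ProductName": "Bolt", "OrderQuantity": 4, "UnitPrice": 2.0, "TotalPrice": 8.0},
    {"SalesOrderID": 1, "ProductName": "Nut",  "OrderQuantity": 2, "UnitPrice": 1.0, "TotalPrice": 2.0},
    {"SalesOrderID": 2, "ProductName": "Bolt", "OrderQuantity": 1, "UnitPrice": 2.0, "TotalPrice": 2.0},
]

total_per_order = {}
per_product = {}
for r in rows:
    oid = r["SalesOrderID"]
    total_per_order[oid] = total_per_order.get(oid, 0.0) + r["TotalPrice"]
    name = r["ProductName"]
    qty, price_sum, n = per_product.get(name, (0, 0.0, 0))
    per_product[name] = (qty + r["OrderQuantity"], price_sum + r["UnitPrice"], n + 1)

products_sold_and_average_price = {
    name: (qty, price_sum / n) for name, (qty, price_sum, n) in per_product.items()
}
print(total_per_order)
print(products_sold_and_average_price)
```

Note that one pass over the input feeds both groupings, just as the single Aggregate transformation feeds both outputs.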
Review
You've done some aggregations in this exercise and have added multiple outputs to the Aggregate transformation. You've also learned that the Aggregate transformation can be configured at the component level by specifying the properties on the Advanced tab, at the output level by specifying keys for each output, and at the column level for the distinct values each column is expected to have. You can achieve high levels of performance using these configurations. However, if you find that the transformation is still suffering from a memory shortage, you can use the Auto Extend Factor to extend the memory usage of this component.
Audit Transformations
Audit transformations are important as well. This category includes only two transformations in this release. The Audit transformation adds environmental data, such as system data or the login name, to the pipeline. The Row Count transformation counts the number of rows in the pipeline and stores the count in a variable.
Audit Transformation
One of the common requirements of data maintenance and data warehousing is to timestamp a record whenever it is added or updated. Generally, in data marts you get a nightly feed of new as well as updated records, and you want to timestamp the records that are inserted or updated to maintain a history, which is also quite helpful in data analysis. By providing the Audit transformation, Integration Services has extended this ability and allows environment variable values to be included in the data flow. With the Audit transformation, not only can you include the package execution start time, but you can include much more information as well, for example, the name of the operator, computer, or package, to indicate who has made changes to the data and the source of the data. This is like using the Derived Column transformation in a special way. To perform its assigned functions, this transformation supports one input and one output. As it does not perform any transformation on the input column data (rather, it just adds known environment values using system variables), an error in this transformation is not expected, and hence it does not support an error output.
The Audit transformation provides access to nine system variables. Following is a brief description of each of these variables:

ExecutionInstanceGUID   Each execution instance of the package is allocated a GUID, which is contained in this variable.

PackageID   Represents the unique identifier of the package.

PackageName   Adds the package name to the data flow.

VersionID   Holds the version identifier of the package.

ExecutionStartTime   Includes the time when the package started to run.

MachineName   Provides the name of the computer on which the package runs.

UserName   Adds the login name of the person who runs the package.

TaskName   Holds the name of the Data Flow task to which this transformation belongs.

TaskID   Holds the unique identifier of the Data Flow task to which this transformation belongs.
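As a rough analogy, an audit step like this can be mimicked in Python by appending environment-derived columns to each row. The column names mirror a subset of the nine system variables; the values here come from the local process, not from SSIS:

```python
import os
import socket
import uuid
from datetime import datetime

# One GUID and one start time per execution instance, shared by all rows.
EXECUTION_INSTANCE_GUID = str(uuid.uuid4())
EXECUTION_START_TIME = datetime.now()

def audit(row):
    """Return a copy of the row with audit columns appended."""
    out = dict(row)
    out["ExecutionInstanceGUID"] = EXECUTION_INSTANCE_GUID
    out["ExecutionStartTime"] = EXECUTION_START_TIME
    out["MachineName"] = socket.gethostname()
    out["UserName"] = os.environ.get("USER", os.environ.get("USERNAME", "unknown"))
    return out

audited = audit({"SalesOrderID": 1})
print(sorted(audited))
```

Like the real transformation, this adds known environment values without altering the input columns, which is why no error output is needed.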
When you open the editor for this transformation after connecting an input, you will be able to select a system variable from a drop-down list by clicking in the Audit Type column, as shown in Figure 10-17. When you select a system variable, the Output Column Name field shows the default name of the variable, which you can change. This output column will be added to the transformation output as a new output column.
Row Count Transformation
Using the Row Count transformation, you can count the rows that pass through the transformation and store the final count in a variable that can be used by other components, such as a Script component and property expressions, or can be useful