Hands-On Microsoft SQL Server 2008 Integration Services part 45 doc


a randomly selected 15 percent of the total records in the segment. Another example: you need to create a training data set and a test data set for your data mining models, and you want to divide your data set based on a defined percentage. In such a case, you can use the Percentage Sampling transformation to create a sample data set from the input rows using the specified percentage. This transformation helps you create a representative data set, much smaller in size, that you can use for a variety of purposes, such as testing your packages in a development environment or using the sample data set for surveys and marketing purposes.

The Percentage Sampling transformation has a simple and intuitive user interface with just four fields. The first field, Percentage Of Rows, allows you to specify the sampling percentage. The transformation uses an algorithm to select rows at random according to the specified percentage. However, the number of rows it selects does not precisely match the calculated percentage; that is, the output may contain slightly more or slightly fewer rows than expected. The next two fields allow you to specify names for the selected sample output and for the remaining unselected output. As this transformation selects rows for sampling, it sends those rows to its first output and sends the remaining unselected rows to its second output. This transformation supports one input and two outputs, one for each of the selected and unselected data sets, and supports no error output.

The last field is a check box paired with a value field: select the check box if you want to specify a sampling seed, and type the seed value in the field provided. If you specify a sampling seed and reuse it in a later run, the transformation produces the same sample output no matter how many times you run the package with the same data set. This is helpful when testing packages. Alternatively, if you don’t specify a sampling seed, this component generates a random number using the tick count of the operating system. Hence, each time you run a package, a different random number is generated and a different data set is sampled.
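The behavior described above can be imitated outside SSIS. The following is a minimal Python sketch, not the algorithm SSIS actually uses: it assumes each row is selected by an independent random draw, which is one simple way to get a sample whose size is close to, but not exactly, the requested percentage. The function name `percentage_sample` and its parameters are hypothetical.

```python
import random


def percentage_sample(rows, percentage, seed=None):
    """Split rows into (selected, unselected) using one random draw per row.

    Assumption: an independent draw per row; SSIS's internal algorithm
    is not documented in this text and may differ.
    """
    # seed=None lets Python seed itself from the OS or clock,
    # so every run produces a different sample.
    rng = random.Random(seed)
    threshold = percentage / 100.0
    selected, unselected = [], []
    for row in rows:
        if rng.random() < threshold:
            selected.append(row)      # first output: sampled rows
        else:
            unselected.append(row)    # second output: remaining rows
    return selected, unselected


rows = list(range(1000))
sample_a, rest_a = percentage_sample(rows, 15, seed=42)
sample_b, rest_b = percentage_sample(rows, 15, seed=42)

# The sample size is near 150 but is not guaranteed to be exactly 150.
print(len(sample_a), len(rest_a))
# Reusing the seed on the same input reproduces the same sample.
print(sample_a == sample_b)
```

Because every row is decided on its own, the two outputs together always contain each input row exactly once, while the size of the selected output varies around the requested percentage.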

Row Sampling Transformation

The Row Sampling transformation works much like the Percentage Sampling transformation to sample a data set. However, the Row Sampling transformation outputs the exact number of rows specified in the transformation. This random selection of a precise number of rows is sometimes very useful. An example of such a scenario is allocating gifts to a random selection of people. Suppose you’re running a campaign to introduce your new product to different segments of your customers and prospects by sending them an e-mail every week. To promote readers’ interest, you decide to award gifts each week to a random selection of 50 persons who show interest in your product by evaluating it. You can easily build this package by bringing into the data flow the records for the persons who evaluated the product in the current week and then applying a Row Sampling transformation to select 50 persons out of these records.

In the Row Sampling Transformation Editor, you specify the number of rows you want to output. This transformation supports two outputs: one for the selected records and the other for the unselected records. You can type in the names for both outputs in the user interface. It is not necessary to configure a downstream data flow to capture the unselected records. You can simply ignore this output, and the records appearing on it will not be included in the data flow.

You can choose to specify a random seed for the selection of records by selecting the “Use the following random seed” check box. This transformation selects random records on the basis of an algorithm that uses the random seed. If you specify the same random seed, the algorithm selects the same random records for the same input data. When you check this option, a message pops up to tell you that using the same random seed on the same input data always generates the same sample, and that specifying a random seed is recommended only during the development and testing of a package. When you don’t specify a random seed, the transformation uses the tick count of the operating system to create the random number, which is different each time you run the package; hence the selected records will be different even for the same input data.
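The gift-allocation scenario can be sketched in Python as follows. This is an illustration only, assuming a simple sample-without-replacement over row positions; it is not the SSIS implementation, and the names `row_sample` and `evaluators` are hypothetical.

```python
import random


def row_sample(rows, number_of_rows, seed=None):
    """Return (selected, unselected) with exactly number_of_rows selected.

    If the input has fewer rows than requested, every row is selected.
    """
    rng = random.Random(seed)  # seed=None gives a different draw on each run
    k = min(number_of_rows, len(rows))
    # Pick k distinct row positions, then split the input in its original order.
    chosen = set(rng.sample(range(len(rows)), k))
    selected = [row for i, row in enumerate(rows) if i in chosen]
    unselected = [row for i, row in enumerate(rows) if i not in chosen]
    return selected, unselected


# Hypothetical input: the people who evaluated the product this week.
evaluators = [f"person_{i}" for i in range(300)]

winners, others = row_sample(evaluators, 50, seed=7)
winners_again, _ = row_sample(evaluators, 50, seed=7)

print(len(winners), len(others))   # exactly 50 selected, 250 unselected
print(winners == winners_again)    # same seed and input: same winners
```

Unlike the percentage-based approach, the selected output here always has exactly the requested number of rows, which is the property that makes Row Sampling suitable for this kind of draw.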

Pivot Transformation

Relational databases are modeled to store normalized data. This normalization changes the data view in a way that sometimes may not be as intuitive as businesses desire. The process used to convert data from a normalized form to a denormalized form is called pivoting. To understand how you can use this transformation, you will work through a Hands-On exercise later; but for now, let’s discuss what normalized data is by looking at the data we are going to use.

The data you will use in the exercise is in an Excel spreadsheet that keeps sales order details in three columns: SalesOrderID, ProductName, and OrderQuantity. To keep the data in a normalized form, the table contains multiple rows for the same SalesOrderID. For example, if three products have been purchased under a single SalesOrderID, the normalized data lists three rows for that SalesOrderID, each with a different ProductName and the purchase quantity for that product. But the sales manager may prefer to see the sales order details with the products and the quantity for each of them listed against the SalesOrderID on the same row. This is when you need to use the pivot function to denormalize the data.

Before SSIS made it available, the pivot function was available in Microsoft Excel or third-party tools, or you had to write custom code to accomplish the task. Integration Services now provides both a Pivot transformation and an Unpivot transformation to provide different data views or forms. The Pivot transformation converts a normalized data set into a less normalized form by pivoting the input data on a column value. The role a column performs in pivoting is defined by its PivotUsage property, which takes a value from 0 to 3. The column around which the data is pivoted forms the set key for pivoting, and this column is assigned a value of 1 for the PivotUsage property. Assigning a value of 1 to the PivotUsage property of an input column indicates that it is part of the set key of a single-row or multirow set. In our example, the SalesOrderID column will act as the set key for pivoting, and the multiple rows with the same SalesOrderID will be combined into one row.

When the data is pivoted, the values in one column, called pivot key values, are pivoted to columns in the output. To specify the column whose values are pivoted to output columns, you assign a value of 2 to the PivotUsage property of that input column. In our example, the ProductName column has 10 different values for the products. When this data is pivoted, 10 columns are created on the basis of the 10 different values in the ProductName column; for example, the value Mountain-100 of the ProductName column becomes the Mountain-100 column in the pivoted output. These newly created columns get their values from a third column, which you identify by assigning a value of 3 to its PivotUsage property. All other input columns that don’t participate in the pivoting process are assigned a PivotUsage value of 0, and for a set of input rows that share the same set key, the first input value for such a column is copied to the output column.

The Pivot transformation pivots the data on the basis of the set key column value. For rows with the same set key value, the Pivot transformation merges multiple rows into a single row and pivots the input rows into columns. This implies that if the data is not sorted so that rows with the same set key value appear together, this transformation will output the same key value multiple times. In our example, to get only one record for a SalesOrderID, the data must be sorted on SalesOrderID. If the data is not sorted on SalesOrderID, this transformation will generate multiple records for the same SalesOrderID, as it pivots the rows to columns each time the value of SalesOrderID changes.
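The effect of sorting can be reproduced in a few lines of Python. The sketch below is an analogy rather than the SSIS implementation: `itertools.groupby` starts a new group whenever the key changes, which mirrors the behavior described above. The function name `pivot`, its parameters, and the sample quantities are hypothetical; the column and product names come from the example data.

```python
from itertools import groupby


def pivot(rows, set_key, pivot_key, pivot_value, pivot_key_values):
    """Emit one output row per run of consecutive rows sharing the set key."""
    output = []
    # groupby only merges ADJACENT rows with equal keys, so unsorted
    # input yields several output rows for the same set key.
    for key, group in groupby(rows, key=lambda r: r[set_key]):
        out_row = {set_key: key}
        # One output column per known pivot key value, empty by default.
        out_row.update({name: None for name in pivot_key_values})
        for r in group:
            out_row[r[pivot_key]] = r[pivot_value]
        output.append(out_row)
    return output


rows = [
    {"SalesOrderID": 1, "ProductName": "Mountain-100", "OrderQuantity": 2},
    {"SalesOrderID": 2, "ProductName": "Road-150", "OrderQuantity": 1},
    {"SalesOrderID": 1, "ProductName": "Road-150", "OrderQuantity": 5},
]
products = ["Mountain-100", "Road-150"]
args = ("SalesOrderID", "ProductName", "OrderQuantity", products)

unsorted_result = pivot(rows, *args)
sorted_result = pivot(sorted(rows, key=lambda r: r["SalesOrderID"]), *args)

print(len(unsorted_result))  # SalesOrderID 1 appears in two output rows
print(sorted_result)         # one output row per SalesOrderID
```

In terms of the PivotUsage roles, `set_key` corresponds to usage 1, `pivot_key` to usage 2, and `pivot_value` to usage 3.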

All this may appear quite complex, but it is not that complicated when it comes to configuring the Pivot transformation. Let’s see how to use this transformation to pivot data from an Excel worksheet.

Hands-On: Pivoting Sales Order Records in an Excel Worksheet

The records exported from the sales order database to an Excel worksheet are in normalized form; that is, one sales order number appears in multiple rows to store details for the products ordered against it. The sales manager wants to see details of all the products ordered against each sales order in a single row.

Before starting this exercise, open the C:\SSIS\RawFiles\SalesOrders.xls file to verify that the file has only one worksheet, labeled Normalized. This exercise adds another worksheet to this file; if it already has two worksheets, delete the second worksheet and then start this exercise. Also, if you are using the provided package code, you may get a validation error, as the Excel destination used in the package looks for the worksheets during this exercise. In this case, leave the worksheets as is.

Method

In this exercise, you will use a Pivot transformation to transform the given data to the required format and put the pivoted data in a new worksheet. As the Pivot transformation pivots the data every time the set key column value changes, you will need to sort the data before the Pivot transformation.

Exercise (Add Connection Manager and Data Flow Task)

You will start this exercise by adding a new package to the Data Flow transformations project, and then adding an Excel Connection Manager to it.

1. Open the Data Flow transformations project in BIDS. Right-click SSIS Packages in the Solution Explorer and choose New SSIS Package. This will add a new SSIS package called Package1.dtsx.

2. Rename the Package1.dtsx package to Pivoting SalesOrders.dtsx.

3. Right-click in the Connection Managers area and choose New Connection from the context menu. Select the Excel Connection Manager type from the list in the Add SSIS Connection Manager dialog box and click Add. Next, type C:\SSIS\RawFiles\SalesOrders.xls in the Excel file path field in the Excel Connection Manager dialog box. Leave the Excel version selected as Microsoft Excel 97-2003 and make sure that the “First row has column names” check box is checked. Click OK to add the Excel Connection Manager in the Connection Managers area. Rename it SalesOrders Connection Manager.

4. Drag the Data Flow Task from the Toolbox and drop it onto the Control Flow Designer surface. Rename this task Transforming SalesOrders. Double-click it to open the Data Flow tab and configure this task.

Exercise (Configure the Data Flow Task)

To configure a data flow for pivoting the SalesOrders data, you will add an Excel source to extract data from the Normalized worksheet of the SalesOrders.xls file and then sort this data on SalesOrderID, as the Pivot transformation requires all the rows having the same set key to be together in a sorted set in order to merge them into a single row. After sorting the data, you will configure the Pivot transformation to pump the pivoted data through to an Excel destination that will write the pivoted output to a new Excel worksheet.

5. From the Toolbox, drag and drop the Excel source onto the Data Flow Designer surface. Rename this adapter Normalized Data Source. Double-click the Normalized Data Source to open the Excel Source Editor. The SalesOrders Connection Manager will be listed in the OLE DB Connection Manager field for you. Select Normalized$ in the “Name of the Excel sheet” field.

6. Go to the Columns page, and verify that all three fields have been selected from the Available External Columns. Click OK to close the editor.

7. Drop the Sort transformation from the Toolbox onto the Data Flow Designer surface just below the Normalized Data Source and join the two components using the green connector. Rename it Sort on SalesOrderID and double-click it to open the Sort Transformation Editor. Select the check box before the SalesOrderID column. This column will appear in the lower half of the dialog box with Sort Type as ascending and Sort Order equal to 1. Leave these settings as is and click OK to close this editor.

8. Drop the Pivot transformation from the Toolbox onto the Designer surface just below Sort on SalesOrderID. Connect the two transformations using the green arrow. Double-click the Pivot transformation to open the Advanced Editor for Pivot, as Pivot doesn’t have a custom UI. In the Component Properties tab, change the Name field to Pivot on ProductName.

9. Go to the Input Columns tab and select all three columns.

10. Move on to the Input and Output Properties tab. Expand the Input Columns under Pivot Default Input on the left side of the dialog box to reveal the three columns you selected in the Input Columns tab. Click the SalesOrderID column to list its properties in the right pane of the dialog box. Scroll down in the properties to locate the PivotUsage property. Assign a value of 1 to PivotUsage, indicating that this field will be treated as a set key. Now, click the ProductName column and assign a value of 2 to its PivotUsage property to indicate that this field is a pivot key field and that the distinct values in this field will create corresponding columns in the output. Next, click the OrderQuantity column and assign a value of 3 to its PivotUsage property to indicate that the values from this field will be populated in the columns generated by the pivot key column.

11. Expand Pivot Default Output and then click Output Columns. You will notice that no output column appears in this transformation yet. Click Add Column and rename the newly added column SalesOrderID. This output column needs to be linked to an input column. In the properties of this column, locate the SourceColumn property. This property holds the lineage identifier of an input column and tells the Pivot transformation to populate the output column using values from the specified input column. To specify the source for this column, click the SalesOrderID column under Input Columns and note the LineageID (not the ID). Specify this LineageID value in the SourceColumn property of the SalesOrderID output column. The value 32 shown in the SourceColumn property in Figure 10-10 is the LineageID of the SalesOrderID input column.

Figure 10-10 Setting LineageID on the SourceColumn property

12. As you did in Step 11, add 10 more columns in the Output Columns and rename them as per the following table. Also, assign values to the PivotKeyValue and SourceColumn properties of the new output columns. Remember that PivotKeyValue is one of the distinct values of the pivot key column, and SourceColumn indicates which input column populates the output column. The new columns added here will be populated by the OrderQuantity input column, so the LineageID of OrderQuantity will be used to populate all of them.

Output Column   Renamed to     PivotKeyValue   SourceColumn
Column          Mountain-100   Mountain-100    LineageID value of OrderQuantity column
Column1         Mountain-200   Mountain-200    LineageID value of OrderQuantity column
Column2         Mountain-300   Mountain-300    LineageID value of OrderQuantity column
Column3         Road-150       Road-150        LineageID value of OrderQuantity column
Column4         Road-250       Road-250        LineageID value of OrderQuantity column
Column5         Road-350       Road-350        LineageID value of OrderQuantity column
Column6         Touring-1000   Touring-1000    LineageID value of OrderQuantity column
Column7         Touring-2000   Touring-2000    LineageID value of OrderQuantity column
Column8         Touring-3000   Touring-3000    LineageID value of OrderQuantity column
Column9         Sport-100      Sport-100       LineageID value of OrderQuantity column

Refer to Figure 10-11 to see how this will look. In the figure, the value 42 in the SourceColumn is the LineageID of the OrderQuantity input column. Click OK to close this transformation.

13. Drop the Excel destination from the Toolbox just below Pivot on ProductName and connect the two components using the green arrow. Rename it Pivoted Data Destination. Double-click the Pivoted Data Destination to open the Excel Destination Editor. You will use the same connection manager and the same Excel workbook, but you will add a new worksheet called Pivoted to the workbook to store the pivoted data. Click the New button opposite the “Name of the Excel sheet” field and you will see a script to create a new worksheet in the Excel workbook. In the script, change the name Pivoted Data Destination, which appears just after the CREATE TABLE keywords, to Pivoted. This will create a new worksheet named Pivoted in the SalesOrders.xls file. Click OK and select the Pivoted sheet in the “Name of the Excel sheet” field.

14. Go to the Mappings page. As you click Mappings, you should see all the mappings created for you automatically. Click OK to close the editor for this component.
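The output-column settings from Steps 11 and 12 can be summarized as plain data. The Python sketch below is only a way to check the configuration on paper; the dictionaries are hypothetical stand-ins, not SSIS objects, and the LineageID values 32 and 42 are the ones shown in Figures 10-10 and 10-11, which will likely differ in your own package.

```python
# Distinct ProductName values, taken from the table in Step 12.
PRODUCTS = [
    "Mountain-100", "Mountain-200", "Mountain-300",
    "Road-150", "Road-250", "Road-350",
    "Touring-1000", "Touring-2000", "Touring-3000",
    "Sport-100",
]

# LineageIDs as shown in the figures; assume yours will be different.
SALES_ORDER_ID_LINEAGE = 32
ORDER_QUANTITY_LINEAGE = 42

# The set key output column has a source column but no pivot key value.
output_columns = [
    {"name": "SalesOrderID", "pivot_key_value": None,
     "source_column": SALES_ORDER_ID_LINEAGE},
]

# Each pivoted column is named after its pivot key value and is
# populated from the OrderQuantity input column.
output_columns += [
    {"name": product, "pivot_key_value": product,
     "source_column": ORDER_QUANTITY_LINEAGE}
    for product in PRODUCTS
]

for column in output_columns:
    print(column["name"], column["pivot_key_value"], column["source_column"])
```

Listing the settings this way makes it easy to confirm that there are 11 output columns in total and that every pivoted column points at the same source column.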

Exercise (Add Data Viewers and Execute the Package)

After having configured all the data flow components, you are ready to execute the package. However, in this exercise, you will also add data viewers before and after the Pivot transformation to see how the data has been pivoted.

Figure 10-11 Configuring output columns for the Pivot transformation

15. Double-click the data flow path connecting Sort on SalesOrderID and Pivot on ProductName. In the Data Flow Path Editor, click the Data Viewers page, and then click Add and add a grid type data viewer. Click OK twice to close the editor window.

16. Similarly, add another grid type data viewer on the data flow path between Pivot on ProductName and Pivoted Data Destination.

17. Press F5 to execute the package. As the Normalized Data Source extracts data and passes it to Sort on SalesOrderID, you can see the Normalized Data Source turn green, indicating a successful extraction of data, with a total of 2,187 records extracted. The next two components will appear in yellow, as the data viewer is holding the execution process for you to check the data. Click the Detach button on the first data viewer to let the execution proceed. As you detach the first data viewer, you will see the second data viewer pop up with populated data and Pivot on ProductName outputting 1,302 rows. As the data passes Sort on SalesOrderID, it will turn green to indicate a successful sort operation (see Figure 10-12). When you are done checking the data, click the Detach button on the second data viewer and let the package complete successfully.

18. Stop debugging by pressing SHIFT-F5. Save and close the project. Open the SalesOrders.xls file and check the Pivoted worksheet to see how the data has been pivoted.

Review

You’ve used a Pivot transformation to convert normalized data to a less normalized form. During this exercise, you sorted the data before sending it to the Pivot transformation and created output columns for the pivoted data. Finally, you created a new Excel worksheet using an Excel destination. Note that you’ve used only one connection manager to connect to the Excel file, both for extracting data using a source adapter and for loading data using a destination adapter.

Figure 10-12 Data viewers showing data before and after Pivot transformation

Unpivot Transformation

This transformation works in the opposite way to the Pivot transformation and converts a denormalized data set into a more normalized version; that is, one row may be broken down into multiple atomic rows so that they can be stored in a relational database. We will use the data you derived in the last exercise, in which each row lists multiple products against a single SalesOrderID. When you run this transformation, the records will be broken up into multiple rows containing the same value for SalesOrderID, but having only one product in each row. To support its functions, this transformation uses one input, one output, and an error output.

This transformation has a custom user interface that is much simpler than that of the Pivot transformation. To get an idea of how to configure this transformation, consider the pivoted data that you created in the preceding exercise. To get the data back into the normalized form from which you started in the last Hands-On, you will configure the Unpivot transformation as shown in Figure 10-13.

Following is the step-by-step method you will use to get the pivoted data back to normalized form:

1. Use the Excel source to bring the Pivoted worksheet data into the data flow.

2. Add an Unpivot transformation and configure it. Let the set key column, which is SalesOrderID in this case, pass through the transformation as is. Then select all the columns that you want to unpivot. As you select the check boxes for Available Input Columns, the Input Column and the Pivot Key Value columns will be filled in using the selected column name. After that, manually fill in the Destination Column name where you want the values of the input columns to be populated, which is OrderQuantity in this example. Last, specify the name of the column in which you would like the pivot key values to be populated. These configurations indicate that the input column names will be converted into the values specified in the Pivot Key Value, which will then be populated in the pivot key value column (ProductName in this case), and that the input column values will be populated in the new column specified in the Destination Column, which is OrderQuantity in this example.

3. Finally, add an Excel destination to collect the normalized data in an Excel worksheet.
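The steps above can be sketched in Python to show what unpivoting does to each row. This is an illustration under stated assumptions, not the SSIS implementation: the function name `unpivot` and its parameters are hypothetical, the sample quantities are made up, and skipping empty cells is a choice made here so that the result matches the original normalized rows.

```python
def unpivot(rows, pass_through, pivot_key_column, destination_column):
    """Turn each pivoted column of a row into its own output row."""
    output = []
    for row in rows:
        for column, value in row.items():
            # The set key passes through unchanged; empty cells are
            # skipped (an assumption made for this sketch).
            if column == pass_through or value is None:
                continue
            output.append({
                pass_through: row[pass_through],
                pivot_key_column: column,      # column name becomes a value
                destination_column: value,     # cell value moves here
            })
    return output


pivoted = [
    {"SalesOrderID": 1, "Mountain-100": 2, "Road-150": 5},
    {"SalesOrderID": 2, "Mountain-100": None, "Road-150": 1},
]

normalized = unpivot(pivoted, "SalesOrderID", "ProductName", "OrderQuantity")
for row in normalized:
    print(row)
```

Here `pass_through` plays the role of the set key column, `pivot_key_column` is the column that receives the pivot key values, and `destination_column` corresponds to the Destination Column in the editor.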
