a randomly selected 15 percent of the total records in the segment. Another example: you need to create a training data set and a test data set for your data mining models, and you want to divide your data set based on a defined percentage. In such a case, you can use the Percentage Sampling transformation to create a sample data set from the input rows using the specified percentage. This transformation helps you create a representative data set, much smaller in size, that you can use for a variety of purposes, such as testing your packages in a development environment or using the sample data set for surveys and marketing purposes.
The Percentage Sampling transformation has a simple and intuitive user interface with just four fields. The first field, Percentage Of Rows, allows you to specify the sampling percentage. The transformation uses an algorithm to select rows at random according to the specified percentage. However, the number of rows it selects does not precisely match the percentage calculation; that is, the output may contain slightly more or slightly fewer rows than expected. The next two fields allow you to specify the names for the selected sample output and the unselected output. As this transformation selects rows for sampling, it sends those rows to its first output and the remaining unselected rows to its second output. This transformation supports one input and two outputs, for the selected and unselected data sets, and supports no error output.
The last field is a check box and value pair: select the check box if you want to specify a sampling seed, and type the seed value in the field provided. If you specify a sampling seed and reuse it in a later run, the transformation produces the same sample output no matter how many times you run the package with the same data set. This is helpful when testing packages. Alternatively, if you don’t specify a sampling seed, this component generates a random number using the tick count of the operating system. Hence, each time you run a package, a different random number is generated and a different data set is sampled.
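The behavior described above can be imitated outside SSIS with a short Python sketch. This is only an illustration, not the algorithm Integration Services uses: it assumes an independent random draw per row, which is one simple way to get a sample whose size is close to, but not exactly, the requested percentage. The function name percentage_sample and the sample data are made up for this example.

```python
import random

def percentage_sample(rows, percentage, seed=None):
    """Split rows into (selected, unselected) using one random draw per row."""
    # A fixed seed makes the split repeatable. With seed=None, Python seeds
    # the generator itself, so each run produces a different sample.
    rng = random.Random(seed)
    selected, unselected = [], []
    for row in rows:
        if rng.random() * 100 < percentage:
            selected.append(row)      # goes to the "selected" output
        else:
            unselected.append(row)    # goes to the "unselected" output
    return selected, unselected

rows = list(range(1, 1001))
sample_a, rest_a = percentage_sample(rows, 15, seed=42)
sample_b, rest_b = percentage_sample(rows, 15, seed=42)

print(len(sample_a), len(rest_a))  # roughly 150 and 850, not exactly
print(sample_a == sample_b)        # same seed, same data: same sample
```

Because every row is decided independently, the selected count only approximates 15 percent of the input, which mirrors the "a little too many or a little too few" behavior, and the two outputs together always account for every input row.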
Row Sampling Transformation
The Row Sampling transformation works much like the Percentage Sampling transformation to sample a data set. However, the Row Sampling transformation outputs exactly the number of rows specified in the transformation. This random selection of a precise number of rows is sometimes very useful. An example of such a scenario is allocating gifts to a random selection of people. Suppose you’re running a campaign to introduce your new product to different segments of your customers and prospects by sending them an e-mail every week. To promote readers’ interest, you decide to award gifts each week to a random selection of 50 people who show interest in your product by evaluating it. You can easily build this package by bringing into the data flow the records for the people who evaluated the product in the current week and then applying a Row Sampling transformation to select 50 of these records.
In the Row Sampling Transformation Editor, you specify the number of rows you want to output. This transformation supports two outputs: one for the selected records and the other for the unselected records. You can type in the names for both outputs in the user interface. It is not necessary to configure a downstream data flow to capture the unselected records. You can simply ignore this output, and the records appearing on it will not be included in the data flow.
You can choose to specify a random seed for selection of records by selecting the “Use the following random seed” check box. This transformation selects random records on the basis of an algorithm that uses the random seed. If you specify the same random seed, the algorithm selects the same records for the same input data. When you check this option, a message pops up to tell you that using the same random seed on the same input data always generates the same sample, and that specifying a random seed is recommended only during the development and testing of a package. When you don’t specify a random seed, the transformation uses the tick count of the operating system to create the random number, which differs each time you run the package, and hence the selected records will be different even for the same input data.
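As a rough illustration of the difference from percentage sampling, the following Python sketch picks an exact number of rows and splits the input into selected and unselected sets. It is not the SSIS implementation; the function name row_sample and the list of evaluators are hypothetical, and the standard library's random.sample stands in for whatever algorithm the transformation uses internally.

```python
import random

def row_sample(rows, count, seed=None):
    """Return (selected, unselected) with exactly `count` selected rows."""
    rng = random.Random(seed)  # seed=None gives a different pick on each run
    # Sample row positions rather than row values, so duplicate rows are
    # handled correctly and the original row order is preserved.
    chosen = set(rng.sample(range(len(rows)), min(count, len(rows))))
    selected = [row for i, row in enumerate(rows) if i in chosen]
    unselected = [row for i, row in enumerate(rows) if i not in chosen]
    return selected, unselected

# Hypothetical list of people who evaluated the product this week.
evaluators = [f"person-{n}" for n in range(1, 301)]
winners, others = row_sample(evaluators, 50, seed=7)

print(len(winners), len(others))  # 50 250
```

Passing the same seed with the same input reproduces the same 50 winners, which is convenient while testing; leaving the seed out gives a fresh selection each week.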
Pivot Transformation
Relational databases are modeled to store normalized data. This normalization changes the view of the data in a way that may not be as intuitive as business users desire. The process of converting data from a normalized form to a denormalized form is called pivoting. To understand how you can use this transformation, you will work through a Hands-On exercise later; for now, let’s discuss what normalized data is by looking at the data we are going to use.
The data you will use in the exercise is in an Excel spreadsheet that keeps sales order details in three columns: SalesOrderID, ProductName, and OrderQuantity. To keep the data in a normalized form, the table contains multiple rows for the same SalesOrderID. For example, if three products have been purchased under a single SalesOrderID, the normalized data lists three rows for that SalesOrderID, each with a different ProductName and the purchase quantity for that product. But the sales manager may prefer to see the sales order details with the products and the quantity for each of them listed against the SalesOrderID on the same row. This is when you need to use the pivot function to denormalize the data.
Before SSIS made it available, the pivot function was available in Microsoft Excel or third-party tools, or you had to write custom code to accomplish the task. Integration Services now provides both a Pivot transformation and an Unpivot transformation to provide different data views or forms. The Pivot transformation converts a normalized data set into a less normalized form by pivoting the input data on a column value. The role a column performs in pivoting is defined by its PivotUsage property, which takes a value from 0 to 3. The column around which the data is pivoted forms the set key, and this column is assigned a value of 1 for the PivotUsage property. Assigning a value of 1 to the PivotUsage property of an input column indicates that it is part of the set key of a single-row or multirow set. In our example, the SalesOrderID column acts as the set key for pivoting, and the multiple rows with the same SalesOrderID are combined into one row.
When the data is pivoted, the values in one column, called pivot key values, become columns in the output. To specify the column whose values are pivoted to output columns, you assign a value of 2 to the PivotUsage property on that input column. In our example, the ProductName column has 10 different values for the products. When this data is pivoted, 10 columns are created on the basis of the 10 different values in the ProductName column; e.g., the value Mountain-100 of the ProductName column becomes the Mountain-100 column in the pivoted output. These newly created columns get their values from the third column, which you designate by assigning a value of 3 to its PivotUsage property. All other input columns that don’t participate in the pivoting process are assigned a PivotUsage value of 0, and for a set of input rows that has the same set key, the first input value for the column is copied to the output column.
The Pivot transformation pivots the data on the basis of the set key column value. For the same value of the set key, the Pivot transformation merges multiple rows into a single row and pivots the input rows into columns. This implies that if the data is not sorted so that rows with the same set key value are adjacent, this transformation will output the same key value multiple times. In our example, to get only one record per SalesOrderID, the data must be sorted on SalesOrderID. If it is not, this transformation will generate multiple records for the same SalesOrderID, as it pivots the rows to columns each time the value of SalesOrderID changes.
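The effect of sorting can be seen in a small Python sketch. This is an analogy rather than the SSIS implementation: itertools.groupby, like the behavior described above, starts a new group every time the key value changes, so unsorted input yields more than one output row for the same key. The pivot function and the three sample rows (including the quantities) are made up for illustration, and unlike the real transformation the sketch simply omits columns that have no value instead of emitting a fixed set of output columns.

```python
from itertools import groupby
from operator import itemgetter

def pivot(rows, set_key, pivot_key, value):
    """Emit one output row per run of consecutive rows sharing set_key."""
    output = []
    for key, run in groupby(rows, key=itemgetter(set_key)):
        out_row = {set_key: key}               # role of PivotUsage 1
        for row in run:
            # The pivot key value (PivotUsage 2) becomes a column name and
            # the pivoted value (PivotUsage 3) fills that column.
            out_row[row[pivot_key]] = row[value]
        output.append(out_row)
    return output

rows = [
    {"SalesOrderID": 1, "ProductName": "Mountain-100", "OrderQuantity": 2},
    {"SalesOrderID": 2, "ProductName": "Road-150", "OrderQuantity": 1},
    {"SalesOrderID": 1, "ProductName": "Sport-100", "OrderQuantity": 5},
]

unsorted_result = pivot(rows, "SalesOrderID", "ProductName", "OrderQuantity")
sorted_rows = sorted(rows, key=itemgetter("SalesOrderID"))
sorted_result = pivot(sorted_rows, "SalesOrderID", "ProductName", "OrderQuantity")

print(len(unsorted_result))  # 3: SalesOrderID 1 appears twice
print(len(sorted_result))    # 2: one row per SalesOrderID
```

With the input sorted on SalesOrderID, the two rows for order 1 are adjacent and are merged into a single output row carrying both product columns.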
All this may appear quite complex, but configuring the Pivot transformation is not that complicated. Let’s see how to use this transformation to pivot data from an Excel worksheet.
Hands-On: Pivoting Sales Order Records in an Excel Worksheet
The records exported from the Sales order database to an Excel worksheet are in normalized form; i.e., one sales order number appears in multiple rows to store details for the products ordered against it. The sales manager wants to see details of all the products ordered against each sales order in a single row.
Before starting this exercise, open the C:\SSIS\RawFiles\SalesOrders.xls file to verify that the file has only one worksheet, labeled Normalized. This exercise adds another worksheet to this file; if it already has two worksheets, delete the second worksheet and then start this exercise. Also, if you are using the provided package code, you may get a validation error, as the Excel Destination used in the package looks for the worksheets during this exercise. In this case, leave the worksheets as is.
Method
In this exercise, you will use a Pivot transformation to transform the given data to the required format and put the pivoted data in a new worksheet. As the Pivot transformation pivots the data every time the set key column value changes, you will need to sort the data before the Pivot transformation.
Exercise (Add Connection Manager and Data Flow Task)
You will start this exercise by adding a new package to the Data Flow transformations project and then adding an Excel Connection Manager to it.
1. Open the Data Flow transformations project in BIDS. Right-click SSIS Packages in the Solution Explorer and choose New SSIS Package. This will add a new SSIS package called Package1.dtsx.
2. Rename the Package1.dtsx package to Pivoting SalesOrders.dtsx.
3. Right-click in the Connection Managers area and choose New Connection from the context menu. Select the Excel Connection Manager type from the list in the Add SSIS Connection Manager dialog box and click Add. Next, type C:\SSIS\RawFiles\SalesOrders.xls in the Excel file path field in the Excel Connection Manager dialog box. Leave the Excel Version selected as Microsoft Excel 97-2003 and make sure that the “First row has column names” check box is selected. Click OK to add the Excel Connection Manager to the Connection Managers area. Rename it SalesOrders Connection Manager.
4. Drag the Data Flow Task from the Toolbox and drop it onto the Control Flow Designer surface. Rename this task Transforming SalesOrders. Double-click it to open the Data Flow tab and configure this task.
Exercise (Configure the Data Flow Task)
To configure a data flow for pivoting SalesOrders data, you will add an Excel source to extract data from the Normalized worksheet of the SalesOrders.xls file and then sort this data on SalesOrderID, as the Pivot transformation requires all rows with the same set key to be together in a sorted set in order to merge them into a single row. After sorting the data, you will configure the Pivot transformation to pump the pivoted data through to an Excel destination that will write the pivoted output to a new Excel worksheet.
5. From the Toolbox, drag and drop the Excel source onto the Data Flow Designer surface. Rename this adapter Normalized Data Source. Double-click the Normalized Data Source to open the Excel Source Editor. The SalesOrders Connection Manager will be listed in the OLE DB Connection Manager field for you. Select Normalized$ in the “Name of the Excel sheet” field.
6. Go to the Columns page and verify that all three fields have been selected from the Available External Columns. Click OK to close the editor.
7. Drop the Sort transformation from the Toolbox onto the Data Flow Designer surface just below the Normalized Data Source and join the two components using the green connector. Rename it Sort on SalesOrderID and double-click it to open the Sort Transformation Editor. Select the check box before the SalesOrderID column. This column will appear in the lower half of the dialog box with Sort Type as ascending and Sort Order equal to 1. Leave these settings as is and click OK to close this editor.
8. Drop the Pivot transformation from the Toolbox onto the Designer surface just below Sort on SalesOrderID. Connect the two transformations using the green arrow. Double-click the Pivot transformation to open the Advanced Editor for Pivot, as Pivot doesn’t have a custom UI. In the Component Properties tab, change the Name field to Pivot on ProductName.
9. Go to the Input Columns tab and select all three columns.
10. Move on to the Input and Output Properties tab. Expand the Input Columns under Pivot Default Input on the left side of the dialog box to reveal the three columns you selected in the Input Columns tab. Click SalesOrderID to list the properties of this column in the right pane of the dialog box. Scroll down in the properties to locate the PivotUsage property. Assign a value of 1 to PivotUsage, indicating that this field will be treated as a set key. Now, click the ProductName column and assign a value of 2 to its PivotUsage property to indicate that this field is the pivot key field and that the distinct values in this field will create corresponding columns in the output. Next, click the OrderQuantity column and assign a value of 3 to its PivotUsage property to indicate that the values from this field will populate the columns generated by the pivot key column.
11. Expand Pivot Default Output and then click Output Columns. You will notice that no output column appears in this transformation yet. Click Add Column and rename the newly added column SalesOrderID. This output column needs to be linked to an input column. In the properties of this column, locate the SourceColumn property. This property holds the lineage identifier of an input column and tells the Pivot transformation to populate the output column using values from the specified input column. To find the value to use, click the SalesOrderID column under Input Columns and note its LineageID (not the ID). Specify this LineageID value in the SourceColumn property of the SalesOrderID output column. The value 32 shown in the SourceColumn property in Figure 10-10 is the LineageID of the SalesOrderID input column.
Figure 10-10 Setting LineageID on the SourceColumn property
12. As you did in Step 11, add 10 more columns in the Output Columns and rename them as per the following table. Also, assign values to the PivotKeyValue and SourceColumn properties of the new output columns. Remember that PivotKeyValue is one of the distinct values of the pivot key column and that SourceColumn indicates from which input column the data will populate the concerned output column. The new columns added here will be populated by the OrderQuantity input column, so the LineageID of OrderQuantity will be used to populate all of them.
Output Column   Renamed to      PivotKeyValue   SourceColumn
Column          Mountain-100    Mountain-100    LineageID value of OrderQuantity column
Column1         Mountain-200    Mountain-200    LineageID value of OrderQuantity column
Column2         Mountain-300    Mountain-300    LineageID value of OrderQuantity column
Column3         Road-150        Road-150        LineageID value of OrderQuantity column
Column4         Road-250        Road-250        LineageID value of OrderQuantity column
Column5         Road-350        Road-350        LineageID value of OrderQuantity column
Column6         Touring-1000    Touring-1000    LineageID value of OrderQuantity column
Column7         Touring-2000    Touring-2000    LineageID value of OrderQuantity column
Column8         Touring-3000    Touring-3000    LineageID value of OrderQuantity column
Column9         Sport-100       Sport-100       LineageID value of OrderQuantity column
Refer to Figure 10-11 to see how this will look. In the figure, the value 42 in the SourceColumn property is the LineageID of the OrderQuantity input column. Click OK to close this transformation.
13. Drop the Excel destination from the Toolbox just below Pivot on ProductName and connect the two components using the green arrow. Rename it Pivoted Data Destination. Double-click the Pivoted Data Destination to open the Excel Destination Editor. You will use the same connection manager and the same Excel workbook, but you will add a new worksheet called Pivoted to the workbook to store the pivoted data. Click the New button opposite the “Name of the Excel sheet” field, and you will see a script to create a new worksheet in the Excel workbook. Change the name Pivoted Data Destination that appears just after the CREATE TABLE statement to Pivoted only. This will create a new worksheet named Pivoted in the SalesOrders.xls file. Click OK and select the Pivoted sheet in the “Name of the Excel sheet” field.
14. Go to the Mappings page. As you click Mappings, you should see all the mappings created for you automatically. Click OK to close the editor for this component.
Exercise (Add Data Viewers and Execute the Package)
Having configured all the data flow components, you are ready to execute the package. However, in this exercise, you will also add data viewers before and after the Pivot transformation to see how the data has been pivoted.
15. Double-click the data flow path connecting Sort on SalesOrderID and Pivot on ProductName. In the Data Flow Path Editor, click the Data Viewers page, and then click Add and add a grid type data viewer. Click OK twice to close the editor window.
16. Similarly, add another grid type data viewer on the data flow path between Pivot on ProductName and Pivoted Data Destination.
Figure 10-11 Configuring output columns for the Pivot transformation
17. Press F5 to execute the package. As the Normalized Data Source extracts data and passes it to Sort on SalesOrderID, you can see the Normalized Data Source turn green, indicating a successful extraction of data, with a total of 2,187 records extracted. The next two components will appear in yellow, as the data viewer is holding the execution process for you to check the data. Click the Detach button on the first data viewer to let the execution proceed. As you detach the first data viewer, you will see the second data viewer pop up with populated data and Pivot on ProductName outputting 1,302 rows. As the data passes Sort on SalesOrderID, that component turns green to indicate a successful sort operation (see Figure 10-12). When you are done checking the data, click the Detach button on the second data viewer and let the package complete successfully.
18. Stop debugging by pressing SHIFT-F5. Save and close the project. Open the SalesOrders.xls file and check the Pivoted worksheet to see how the data has been pivoted.
Review
You’ve used a Pivot transformation to convert normalized data to a less normalized form. During this exercise, you sorted the data before sending it to the Pivot transformation and created output columns for the pivoted data. Finally, you created a new Excel worksheet using an Excel destination. Note that you’ve used only one connection manager to connect to the Excel file, both for extracting data using a source adapter and for loading data using a destination adapter.
Figure 10-12 Data viewers showing data before and after Pivot transformation
Unpivot Transformation
This transformation works in the opposite way to the Pivot transformation and converts a denormalized data set into a more normalized version; i.e., one row may be broken down into multiple atomic rows so that they can be stored in a relational database. We will use the data you derived in the last exercise, in which each row lists multiple products against a single SalesOrderID. When you run this transformation, the records will be broken up into multiple rows containing the same value for SalesOrderID but having only one product in each row. To support its functions, this transformation uses one input, one output, and an error output.
This transformation has a custom user interface that is much simpler than that of the Pivot transformation. To get an idea of how to configure this transformation, consider the pivoted data that you created in the preceding exercise. To get the data back into the normalized form (i.e., where you started in the last Hands-On), you will configure the Unpivot transformation as shown in Figure 10-13.
Following is the step-by-step method you will use to get the pivoted data back to
normalized form:
1. Use the Excel source to bring the Pivoted worksheet data into the data flow.
2. Add an Unpivot transformation and configure it. Let the set key column, which is SalesOrderID in this case, pass through the transformation as is. Then select all the columns that you want to unpivot. As you select the check boxes for Available Input Columns, the Input Column and the Pivot Key Value columns will be filled in using the selected column name. After that, manually fill in the Destination Column name where you want the values of the input columns to be populated, which is OrderQuantity in this example. Last, specify the column name in which you would like the pivot key values to be populated. These configurations indicate that the input column names will be converted into the values specified in Pivot Key Value, which will then be populated in the pivot key value column (ProductName in this case), and that the input column values will be populated in the new column specified in Destination Column, which is OrderQuantity in this example.
3. Finally, add an Excel destination to collect the normalized data in an Excel worksheet.
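The mapping configured in step 2 can be sketched in a few lines of Python. This is an illustration of the idea, not of how SSIS implements it: the unpivot function, its parameter names, and the sample row are hypothetical, and skipping empty cells is an assumption made here so that products that were not ordered do not produce rows.

```python
def unpivot(rows, pass_through, value_columns, key_column, destination_column):
    """Turn each pivoted row into one output row per populated value column."""
    output = []
    for row in rows:
        for column in value_columns:
            cell = row.get(column)
            if cell is None:
                continue  # assumption: empty cells produce no output row
            output.append({
                pass_through: row[pass_through],  # set key passes through as is
                key_column: column,               # column name becomes the pivot key value
                destination_column: cell,         # cell value goes to the destination column
            })
    return output

# Made-up pivoted row with one empty product column.
pivoted = [
    {"SalesOrderID": 1, "Mountain-100": 2, "Road-150": None, "Sport-100": 5},
]

normalized = unpivot(
    pivoted,
    pass_through="SalesOrderID",
    value_columns=["Mountain-100", "Road-150", "Sport-100"],
    key_column="ProductName",
    destination_column="OrderQuantity",
)

for row in normalized:
    print(row)
```

Each printed row carries the same SalesOrderID with exactly one ProductName and its OrderQuantity, which is the normalized shape the Hands-On exercise started from.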