Hands-On Microsoft SQL Server 2008 Integration Services (Part 42)


In the Component Properties tab, you specify common and custom properties for this component. The properties you can specify are:

- The number of seconds before the command times out, in the CommandTimeout field. By default, this field has a value of 0, which indicates an infinite timeout value.
- The code page value, in the DefaultCodePage field, used when the data source is unable to provide the code page information.
- The LocaleID, which can be changed from the default value to any other Windows LocaleID.
- The ValidateExternalMetadata value, which can be changed from the default value of True to False if you do not want your component to be validated during the validation phase.
- The SQLCommand field, where you can type in the SQL statement that this transformation runs for each row of the input data. You can use parameters in your SQL statement and map these parameters to the input columns so that the SQL statement modifies the data, made available by the OLE DB Connection Manager, on the basis of values in the input columns. By default, these parameters are named Param_0, Param_1, and so forth, and you cannot change these names. However, if you use a stored procedure with SQL variables in it, you can use the variable names, which makes the mapping of parameters easier. This option is explained later in the chapter (refer to Figure 10-9).
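As a hedged sketch of what the SQLCommand field might contain, consider the following parameterized statement; the Person.Contact table, its columns, and the stored procedure name are hypothetical. Each ? placeholder surfaces in the Column Mappings tab as Param_0, Param_1, and so on, in the order it appears:

    -- Runs once for every row in the pipeline; ? placeholders map to input columns
    UPDATE Person.Contact
    SET    EmailAddress = ?        -- Param_0
    WHERE  ContactID    = ?        -- Param_1

    -- Alternatively, call a stored procedure; its variable names can then be used
    -- for the parameter mappings
    EXEC dbo.UpdateContactEmail ?, ?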

The mappings between parameters used in SQL statements and input columns are defined in the Column Mappings tab. In this tab, after typing the SQL statement in the SQLCommand field, you will see the Available Destination Columns populated with parameters for you (you may have to click Refresh to see the parameters). The automatic population of parameters depends on the capability of the OLE DB provider you specified earlier. For some third-party OLE DB providers that do not support deriving parameter information from the SQL statement, you will need to manually create parameter columns in the External Columns node of OLE DB Command Input by going to the Input And Output Properties tab, assigning them names such as Param_0, Param_1, and so on, and specifying a value of 1 for the DBParamInfoFlags custom property of each column. You can then manually create mappings between the Available Input Columns and Available Destination Columns by using the drag-and-drop technique.


Split and Join Transformations

These transformations can create multiple copies of input data, split input data into one or more outputs, merge multiple inputs, or add columns to the pipeline by looking up exact matches in a reference table. After reading through the descriptions of these split and join transformations, you will complete your first Hands-On exercise for this chapter, which also covers some of the Row transformations.

Conditional Split Transformation

Sometimes you will need to work with a particular data set (i.e., a subset) separately from the rest of the data coming in the pipeline. For example, you may want to apply different business rules to different types of customers. In such cases, you will need to split the data into multiple data sets that match specific criteria, using conditions defined in the Conditional Split transformation. This transformation allows you to create more than one output and assign a condition (filter) to each output for the type of data that can pass through it. Among the outputs, one has to be a default output for the rows that meet no criteria. When an input data row hits the Conditional Split transformation, the row is evaluated against the set of conditions one by one and is routed to the output whose condition it matches first. Each row can be diverted to only one output, and this output is the first one for which the condition evaluates to True. This transformation has one input, one error output, and, as you can make out, multiple outputs.

The user interface of the Conditional Split Transformation Editor is similar to that of a property expression. You can select variables and columns from the top-left section and functions and operators from the top-right section of the editor window. You build an expression using variables, columns, functions, and operators in the Condition field for an output. As you add a condition, an output is created with an order number specified in the Order column and a name assigned to it in the Output Name field. The order number plays a vital role in routing rows to outputs, as each row is matched against these conditions in ascending order and is diverted to the output for which the condition becomes true first. Once the row has been diverted, the remaining conditions are ignored, as a row is always sent to only one output. If a row doesn't meet any condition, it is sent in the end to the default output of the transformation. The Output Name can be changed to a more appropriate name that suits your data set. This transformation has a mandatory default output built in for you.

The syntax you use to build an expression for a condition follows the expression grammar. For example, if you want to split customers on the basis of country, e.g., to separate out rows for the UK, the USA, and the rest of the world, your Conditional Split transformation will have three outputs with configurations along the following lines (the column name Country is illustrative):

Order   Output Name     Condition
1       UK Customers    Country == "UK"
2       USA Customers   Country == "USA"

The default output will collect all the rows for which Country is neither UK nor USA.
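In T-SQL terms, the routing behaves like a searched CASE expression evaluated once per row, where the first condition that evaluates to true wins; the dbo.Customers table and its columns here are hypothetical:

    -- Each row lands in exactly one bucket, decided by the first matching condition
    SELECT CustomerID,
           Country,
           CASE
               WHEN Country = 'UK'  THEN 'UK Customers'
               WHEN Country = 'USA' THEN 'USA Customers'
               ELSE 'Conditional Split Default Output'
           END AS OutputName
    FROM dbo.Customers;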

Multicast Transformation

The Conditional Split transformation enables you to divert a row to a specific output and split a data set, but it doesn't allow you to divert a row to more than one output and hence does not create a copy of the data set. However, sometimes you may need to create copies of a data set during run time so that you can apply multiple sets of transformations to the same data. For example, working within the same ETL, you may want to treat your sales data differently when preparing data for balance sheets than when using it to calculate commission paid to salespersons. Another example is when you want to load data to your relational database as well as to an analytical reporting data mart, with both having different data models and obviously different loading requirements. In this case, you will create two sets of data, apply different transformations to bring each in line with its data model requirements, and then load the data to the destination databases. This transformation has one input and supports multiple outputs to perform the multicast operation.

The Multicast transformation creates copies of a data set by sending every row to every output. It does not apply any transformation to the data other than diverting a data set to more than one output. The transformation is simple to use and doesn't require much configuration. In fact, when you open the Multicast Transformation Editor, you will see nothing to configure, just two blank panes aligned side by side. This is because you configure this transformation by connecting its outputs to the inputs of multiple components, not by setting properties on attributes. On the Data Flow surface, when you click the Multicast transformation after connecting its first output to a downstream component, you will see that it provides another output (a green arrow emerging from the transformation) for you to connect to another downstream component. If you double-click the Multicast transformation after connecting the second output to another downstream component, you will see two outputs listed in the left pane of the editor window, and the right pane shows the properties of the output selected in the left pane. You can modify the name of the output and write a description for it if you want, and that's it.


Union All Transformation

This component works similar to the UNION ALL command of T-SQL and combines two or more inputs into a single output. To support this functionality, the transformation has multiple inputs and one output and does not support an error output. During run time, the transformation picks up all the rows from one input and sends them to the output, then picks up all the rows from the second input and sends them to the output, combining them with the rows of the first input, and so on until all the inputs have been combined. Note that this transformation can select inputs in any random order, based on when data is available at the inputs. Also, it does not sort the records in any way; all the rows from a particular input are output together.
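For reference, a minimal T-SQL analogue of this behavior is shown below; the two customer tables are hypothetical, and, unlike the transformation, T-SQL matches the SELECT lists by position rather than by column name:

    -- Combines all rows from both inputs without sorting or removing duplicates
    SELECT CustomerID, CustomerName, Country FROM dbo.Customers_UK
    UNION ALL
    SELECT CustomerID, CustomerName, Country FROM dbo.Customers_US;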

On the Data Flow Designer, when you connect an input to this transformation, it copies the metadata from this first input to its output so that the columns of the first input are mapped to the output created by the transformation. Columns with matching metadata in any input you connect after the first are also mapped to the corresponding output columns. Any column that does not exist in the first input but is part of a subsequent input will not be copied to the output columns by the transformation. You must create and map such a column manually in the Output Column Name field of the output with the corresponding input column. If you don't create and map a column that exists in subsequent inputs but not in the first input, that column will be ignored in the output. You can also change mappings in the editor by clicking in the Input field and choosing an option from the drop-down list.

Merge Transformation

As its name suggests, the Merge transformation combines two inputs into a single output, but it requires that the inputs be sorted and that the columns have matching metadata. To perform the merge, this transformation has two inputs and one output; it does not have an error output. The Merge transformation can even provide you sorted output by inserting rows from the inputs based on their key values. This transformation is similar to the Union All transformation, with some key differences:

- The Merge transformation requires sorted inputs, whereas the Union All transformation does not.
- The Merge transformation can combine only two inputs, whereas the Union All transformation can combine more than two inputs.
- The Merge transformation can provide sorted output, whereas Union All cannot.

When you add a Merge transformation into a package's data flow, you won't be able to open its editor until you have attached two sorted inputs that have properly defined sort-key positions for the columns on which they are sorted. The input data sets must be sorted (for example, by using a Sort transformation), which must be indicated by the IsSorted property on the outputs and the SortKeyPosition property of the output columns of the upstream components. These properties are discussed in Chapter 15, where you can refer to Figure 15-4 and Figure 15-5 to see how they are used.

Once the sorted inputs are connected to the inputs of this transformation and you are able to open the editor, you will see the output columns mapped to the input columns of both inputs. This transformation creates the output by copying the metadata of the first input you connect to it. You can make changes to the mappings; however, the metadata of the mapped columns must match. If the second input has a column that is not in the output columns list, you can add and map this column manually.

Merge Join Transformation

The Merge Join transformation joins two sorted inputs into one output using a full, left, or inner join. For merging two inputs, this transformation supports two inputs and one output and does not support an error output. The joins work exactly as they do in T-SQL. For the benefit of those who haven't worked with joins, here is a quick refresher.

Inner Join

An inner join returns the rows that have a join key present in both data sets. For example, suppose you have a table of all your customers, a table of all the products your company has sold in the last ten years, and a date table. An inner join will return a list of customers and the products they purchased in the year 2009.

Left Outer Join

A left outer join returns all the rows from the first (left side of the join) table and the rows from the second table for the matching keys. For the rows that don't have a matching key in the second table, the corresponding output columns are filled with nulls. For the customers and products example, a left outer join will list all the customers, with products listed against the customers who made purchases in 2009 and nulls against the customers who didn't make a purchase in 2009.

Full Outer Join

A full outer join returns all the rows from both tables joined by the key; simply put, it is a list of all the data. If rows from the first table don't have a matching key in the second table, the corresponding second-table columns are filled with nulls. Similarly, if rows from the second table don't have a matching key in the first table, the corresponding first-table columns are filled with nulls.
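The following T-SQL sketch illustrates the three join types with the customers and products example; the dbo.Customers and dbo.Purchases tables and their columns are hypothetical:

    -- Inner join: only customers who made a purchase in 2009
    SELECT c.CustomerName, p.ProductName
    FROM dbo.Customers c
    INNER JOIN dbo.Purchases p
        ON p.CustomerID = c.CustomerID
    WHERE p.PurchaseYear = 2009;

    -- Left outer join: every customer; ProductName is NULL where there was no 2009 purchase
    SELECT c.CustomerName, p.ProductName
    FROM dbo.Customers c
    LEFT OUTER JOIN dbo.Purchases p
        ON p.CustomerID = c.CustomerID
       AND p.PurchaseYear = 2009;

    -- Full outer join: all rows from both tables, with NULLs on whichever side has no match
    SELECT c.CustomerName, p.ProductName
    FROM dbo.Customers c
    FULL OUTER JOIN dbo.Purchases p
        ON p.CustomerID = c.CustomerID;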


When you add a Merge Join transformation into a data flow, you won't be able to open the editor until you have attached two sorted inputs, one to the merge join left input and the other to the merge join right input. The input data sets must be physically sorted (for example, by using a Sort transformation), which must be indicated by the IsSorted property on the outputs and the SortKeyPosition property of the output columns of the upstream components. These properties are discussed in Chapter 15, where you can refer to Figure 15-4 and Figure 15-5 to see how they are used.

Once the sorted inputs have been connected to the left and right inputs of this transformation, you can open the editor. You can choose among the inner join, left outer join, and full outer join types in the Join Type field. For a left outer join, you can also swap your inputs by clicking the Swap Inputs button. After selecting the join type, you can specify the join keys if they have not already been picked up by the transformation. The joined columns must have matching metadata, and the join key must be in the same order as specified by the sort key. Next, you can select the output columns by selecting the check boxes or using the drop-down lists in the Input columns in the lower section of the Transformation Editor. You can also assign an alias to each output column. Once you have selected the required columns for the output, you can close the editor, as there is no other configuration in the editor; however, there are a couple of important properties that you may want to configure for this transformation. You can access these special properties in the Properties window by pressing the F4 key while the transformation is selected.

The first performance-related property is the MaxBuffersPerInput property, which lets you suggest an integer value for the number of buffers to use for each input. Merge Join is a semi-blocking transformation; that is, it needs a set of rows before it can start outputting any records. You can well imagine this, as the transformation needs a good set of rows or buffers for each input to be able to join them and start producing the resultant rows. The second reason for this transformation to collect large sets of data is to avoid deadlocking between the threads that are processing the data. If you are working with really large data sets, the Merge Join transformation can cache huge amounts of data, putting high pressure on the memory available on the computer. This could affect the performance of other applications on the server. To avoid such a situation, you can throttle the memory requirements using the MaxBuffersPerInput property. While this property allows you some control, it is just a suggested value and not a hard value that forces the Merge Join to use only the specified number of buffers. If the transformation senses a risk of thread deadlocking, or if it doesn't have enough data to perform the join, it will increase the number of buffers to some value other than what has been specified. However, it does stay within the specified limits if no such risk exists. The default value for this property is five buffers per input, which works well for most requirements. You can specify a larger value to improve performance if you are working with large data sets and have enough free memory on the server. Alternatively, you can decrease the value from the default if the server is already struggling with memory pressure. But as you can imagine, reducing this value to 0 will disable all throttling and can adversely affect performance. You need to be a bit more cautious when configuring this transformation and choose a realistic setting, as memory usage can go well off the mark when the value is configured too low for a large data set. Another issue has been observed when the two input data streams hit this transformation at wildly different times. Testing your data flow is the best way to find the right balance for this transformation.

Last, you can specify whether to treat null values as equal by using the TreatNullsAsEqual property of the transformation; the default is True for this property. If you decide not to treat nulls as equal, the transformation then treats nulls the same way the database engine does.

Cache Transform

The Lookup transformation in Integration Services 2008 can use an in-memory cache or a cache file to load reference data to perform lookups on the data in the pipeline. The in-memory cache or the cache file needs to be created before the Lookup transformation starts execution. This means you can create a reference data cache either before the package starts running or during the same package execution, but before the Lookup transformation runs. Using a Cache Transform, you can write distinct data rows to a Cache Connection Manager that, depending upon how you have configured it, can hold this data in memory or can also persist the cached data to a file on the hard disk. The user interface of the Cache Transformation Editor is simple and has only two pages. In the Connection Manager page, you specify the Cache Connection Manager, and in the Mappings page, you map input columns to the destination columns being written to the Cache Connection Manager. Here, you have to map all the input columns to destination columns; otherwise, the component will throw an error. If you change the data type of a column later (e.g., you increase the length of an input column), the metadata can be corrected in the Cache Connection Manager, which displays the columns and their metadata in the Columns tab. While configuring the Cache Connection Manager, you will also need to configure an Index Position for the columns. By default, the index position is 0 for all the columns, indicating that they are non-index columns. You can assign positive integer values such as 1, 2, 3, and so on to the columns that participate in indexing. The number assigned to the index position of a column indicates the order in which the Lookup transformation compares rows in the reference data set to rows in the input data source. The selection of columns as index columns depends upon which columns you need to look up while configuring a Lookup transformation. For example, if you are matching car details based on the manufacturer, the model description, and the body style, you will assign the index positions as shown in this table:

Column Name         Index Position   Index Position Description
Manufacturer        1                This column participates in the lookup operation and should be mapped to the input column in a Lookup transformation; it is the first column on which the comparison is made.
Model Description   2                This column participates in the lookup operation and should be mapped to the input column in a Lookup transformation; it is the second column on which the comparison is made.
Body Style          3                This column participates in the lookup operation and should be mapped to the input column in a Lookup transformation; it is the third column on which the comparison is made.
All other columns   0                These columns do not participate in the lookup operation mappings; however, they can be selected in order to add them into the data flow.

As the Cache Connection Manager writes data to memory, it gets tied to that cache. Hence, you cannot use multiple Cache Transforms to write to the same Cache Connection Manager: the Cache Transform that gets called first during the package execution writes data to the Cache Connection Manager, while all subsequent Cache Transforms fail. As mentioned earlier, you can create a cache file using a Cache Transform in a separate package or in the same package, as long as it runs before the Lookup transformation. Once the reference data is persisted to a cache file, which is a raw file, you can use this cache file among multiple Data Flow tasks within the same package, between multiple packages on the same server, or between multiple packages on different servers. The only consideration you have to keep in mind is that the data in the cache file will be current only as of the time it is loaded into the cache file. If your reference data doesn't change that often, you can use this technique to perform quite fast lookup operations.

Lookup Transformation

With the ever-increasing use of the Web to capture data, lookup operations have become quite important. Web page designers tend to ask users to fill in only the most critical data in a web form, and the form fills in the rest of the information for them. For example, you may ask a visitor to your site to fill in a street address and postal code during registration, and based on these two pieces of information, you can fill in the city and state address fields. This is done by looking up a database table that contains all the postal information keyed on postal codes. So, you simply look for the row that contains the particular postcode and you will be able to complete the address fields. You can perform such lookup operations in Integration Services as well, using the Lookup transformation. While loading a data warehouse, you will frequently use a Lookup transformation to look up surrogate keys in a dimension table by matching on the business key.

This transformation lets you perform lookups by joining the input columns in the data flow with columns in a reference data set. The resulting values can be included in the data flow as new columns, or you can replace existing column values. For example, when the value of a record in the postal code column in the data flow is equal to a record in the postal code column in the reference data set, the Lookup transformation will be able to get data for all the other address columns. This equality operation makes this join an equijoin, requiring that all the values in the data flow match at least one value in the reference data set. In a complex lookup operation, you may be joining multiple columns in the input to multiple columns in the reference data set.
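Conceptually, the lookup behaves like the equijoin sketched below; the dbo.InputAddresses and dbo.PostalReference tables and their columns are hypothetical:

    -- Each pipeline row is matched to the reference row with the same postal code;
    -- City and State are the new columns the lookup adds to the data flow
    SELECT i.StreetAddress,
           i.PostalCode,
           r.City,
           r.State
    FROM dbo.InputAddresses i
    INNER JOIN dbo.PostalReference r
        ON r.PostalCode = i.PostalCode;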

This transformation has undergone a major upgrade since its predecessor in Integration Services 2005. The user interface has also changed and provides more flexibility and performance improvements. The Lookup Transformation Editor provides five pages to configure its properties. When you open the editor, you choose options in the General page to decide how you want to configure the transformation. Let's start with the simplest one: the Lookup transformation allows you to configure its outputs on the basis of how you want to handle non-matching rows. The "Specify how to handle rows with no matching entries" option in the General page provides four options to choose from:

- Ignore failure
- Redirect rows to error output
- Fail component (the default)
- Redirect rows to no match output

When the transformation looks up an input data key against the reference data set, you can either have a match or no match. When the input row key is found in the reference data set, the row is called a match row and is sent to the match output. If a key in the input row is not found in the reference data set, the row is called a no-match row, and this option allows you to decide how you want to handle such rows. In Integration Services 2005, no-match rows are treated as errors and sent to the error output, whereas in Integration Services 2008 you can choose to redirect non-matching rows to a no-match output. Choosing any option other than "Redirect rows to no match output" will treat no-match rows as error rows and send them to the error output.


Next, you can select how you would like to cache your reference data set. The Cache Mode section provides three options to choose from:

- Full cache  Selecting this option forces the reference set to be prepared and loaded into cache memory before the lookup operation is performed. For example, if you are looking up postal addresses, all the rows with postal information will be loaded into memory before the actual lookup is performed against this reference data set. As you can imagine, a lookup against a reference set held in memory will be faster than a lookup operation in which the component has to make a connection to an outside data store to get the lookup values. Full cache mode is the best-performing option if you have enough physical memory on the server to hold the complete reference data set. When you choose this option, two things happen in the user interface:
  - The Advanced page is disabled. The options in this page are not used when using full cache mode.
  - The Cache Connection Manager option becomes available. The Lookup transformation can use either an OLE DB Connection Manager or a Cache Connection Manager in the case of full cache mode. This feature is a new addition in this version. The Cache Connection Manager can read the reference data from a cache file that can be prepared separately from the lookup process. You can actually prepare the reference data set separately: in the same Data Flow task before the Lookup transformation, in an earlier Data Flow task in the same package, or even in a separate package using a Cache Transform. When the data is prepared and persisted to a cache file (.caw), the Lookup transformation can load the data from the cache file faster than from relational sources using a Cache Connection Manager. So, by utilizing full cache mode and sharing a cache file between multiple lookups within the same package or in multiple packages, you can achieve high levels of performance, especially when the reference data set is not small. The only caveat here is that your reference data set must not change; if it changes with each data load, you have to be careful when using the Cache Transform and the Cache Connection Manager in your package.
- Partial cache  Selecting this option enables you to apply limits on the use of memory for caching reference data. When you select partial cache mode, the Advanced page options become available, but the Cache Connection Manager option is disabled, as it is used only in full cache mode. So, you can only use the OLE DB Connection Manager to collect reference data from a relational source. In the Advanced page, you can limit the cache size to a value that you can
