During design time, when you drop the data flow source onto the Data Flow Designer surface and configure it to connect to an external data source from which data will be read, the data flow source copies the metadata—i.e., the schema of the data—to its external columns metadata. At run time, when the data is being pulled in, the incoming data columns are parsed into the Integration Services data types using the external columns metadata. In the example, the names Steve and Sarah are parsed into the DT_STR data type; Sex, which is indicated using 1 and 0, has been assigned DT_BOOL; House_Number has been assigned DT_I4; and the Moved_in data has been assigned the DT_DBTIMESTAMP data type. Integration Services provides 29 different data types to cover various types of data, such as character data, numerical data, Boolean values, dates, text, and image fields. Integration Services provides a wide range of data types and is particular about how they are used; hence, if data has a data type that does not match the data types available in Integration Services, an error occurs.
We will explore many data types along the way as we progress with our Hands-On exercises; however, to know more about each data type, refer to Microsoft SQL Server 2008 Books Online.
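Here is a minimal sketch, in Python rather than anything SSIS-specific, of the run-time parsing step just described: each incoming text value is converted according to the SSIS data type recorded for its external column. The DT_* names come from the example above; the day-month-year date format is an assumption about the incoming file.

    from datetime import datetime

    # External columns metadata captured at design time: column -> (source type, SSIS type)
    external_columns = {
        "Name":         ("varchar(50)", "DT_STR"),
        "Sex":          ("bit",         "DT_BOOL"),
        "House_Number": ("int",         "DT_I4"),
        "Moved_in":     ("datetime",    "DT_DBTIMESTAMP"),
    }

    # Run-time parsing routines, keyed by SSIS data type (illustrative only)
    parsers = {
        "DT_STR":         str,
        "DT_BOOL":        lambda v: bool(int(v)),
        "DT_I4":          int,
        "DT_DBTIMESTAMP": lambda v: datetime.strptime(v, "%d-%m-%Y"),  # assumed format
    }

    row = {"Name": "Steve", "Sex": "1", "House_Number": "250", "Moved_in": "07-11-2005"}
    parsed = {col: parsers[external_columns[col][1]](value) for col, value in row.items()}
    print(parsed)
    # Sarah's row, with its invalid date 25-21-2004, would fail this step; what happens
    # to such rows is covered in the error handling discussion later in this section.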
The process that converts the source data into Integration Services data types is
called data parsing. Data flow components can be configured to use either fast parsing
or standard parsing. Fast parsing supports the most commonly used data types with
a simple set of routines and can be configured at the column level. Because it uses
simple routines and does not check for many other data types, it can achieve high levels
of performance. However, it is not available for most of the data flow components
and can parse only a narrow range of data types.

Figure 9-1 A data flow source extracting and parsing data

For example, fast parsing does not
support locale-specific parsing, special currency symbols, and date formats other than
year-month-date. Also, fast parsing can be used only when using a Flat File source,
a Data Conversion transformation or Derived Column transformation, and a Flat File
destination, because these are the only components that convert data between string
and binary data types. You may use fast parsing for performance reasons when your
data meets these requirements, but for all other occasions, you will be using standard
parsing. Before using fast parsing, check out Books Online to find out the data
types supported by fast parsing.
Standard parsing uses a rich set of parsing routines that are equivalent to OLE DB
parsing APIs and supports all the data type conversions provided by the automation
data type conversion APIs available in Oleaut32.dll and Ole2disp.dll. For example,
standard parsing provides support for locale-sensitive parsing and international data
type conversions.
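As a rough analogy in Python (these are not the actual SSIS routines), fast parsing behaves like a single fixed-format conversion, while standard parsing behaves like a richer set of routines that also tries locale-style formats:

    from datetime import datetime

    def fast_parse_date(text):
        # Analogy for fast parsing: one simple, culture-neutral routine (year-month-day only)
        return datetime.strptime(text, "%Y-%m-%d").date()

    def standard_parse_date(text, formats=("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")):
        # Analogy for standard parsing: a richer, locale-aware set of routines
        for fmt in formats:
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue
        raise ValueError("cannot parse date: " + text)

    print(fast_parse_date("2005-11-07"))      # accepted: year-month-day
    print(standard_parse_date("07-11-2005"))  # accepted by a locale-style format
    # fast_parse_date("07-11-2005") would raise ValueError, just as fast parsing
    # rejects date formats other than year-month-date.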
Returning to the data flow, as the data rows arrive and are parsed, they are validated
against the external columns metadata. The data that passes this validation check at
run time is copied to the output columns; the data that doesn't pass is treated slightly
differently. You can define the action the component can take when a failure occurs.
Typical errors you can expect at the validation stage include data type mismatches
or a data length that exceeds the length defined for the column. Integration Services
handles these as two different types of errors—data type errors and data length or data
truncation errors. You can specify the action you want the data flow component to take
for each type of error from the three options—fail the component, ignore the error, or
redirect the failing row to error output fields. You can specify one of these actions on
all columns or different actions for different columns. If you redirect the failing rows to
the error output fields and link the error output to a data flow destination, the failing
rows will be written to the data flow destination you specified.
The error output contains all the output columns and two additional columns for the
failing rows, ErrorCode and ErrorColumn, which indicate the type of error and the failing
column. In Figure 9-1, note that the record holding data for Sarah has a wrong date
specified and hence fails during the extract process. As the source was configured to redirect
rows, the failing row data is sent to the Error Output. Also, note that the two rightmost
columns indicate the type of error and the column number. Every output field is assigned
an ID automatically, and the number shown in the ErrorColumn is the ID number of the
column failing the extract process. If two columns fail on the same row, the column that
fails first is captured and reported. So, in case of multiple failures on a row, you might not
know about them until the package fails again after you have fixed the first error.
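The sketch below (again illustrative Python, not SSIS code) shows the shape of the error output for the Redirect Row case; the column IDs and the error code are placeholders, not real SSIS values, and only the first failing column is reported:

    # Assumed column IDs and a placeholder error code, for illustration only
    COLUMN_IDS = {"Name": 17, "Sex": 18, "House_Number": 19, "Moved_in": 20}
    PLACEHOLDER_CONVERSION_ERROR = -1

    def redirect_row(row, failing_columns):
        # Only the first failing column is captured and reported
        first = failing_columns[0]
        return {**row,
                "ErrorCode": PLACEHOLDER_CONVERSION_ERROR,
                "ErrorColumn": COLUMN_IDS[first]}

    sarah = {"Name": "Sarah", "Sex": "0", "House_Number": "130", "Moved_in": "25-21-2004"}
    print(redirect_row(sarah, failing_columns=["Moved_in"]))
    # {'Name': 'Sarah', 'Sex': '0', 'House_Number': '130', 'Moved_in': '25-21-2004',
    #  'ErrorCode': -1, 'ErrorColumn': 20}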
After pulling in the data rows, a data flow source passes the output rows to the next
data flow component—generally a transformation—and the failing rows are passed to
another data flow component—which could be a data flow destination, if configured
to redirect the failing rows. You can also redirect the failing rows to an alternate branch in the data flow via a transformation in which you can apply logic to correct the failing rows and bring them back into the main data flow after corrections. This is a highly useful capability that, if used properly, can reduce wastage and improve data quality.

When the output of a data flow source is connected to the input of a data flow transformation, the data flows from the input of the transformation, through the processing logic of the transformation, and then to its output. Based on the logic of the transformation, some rows may fail the process and may be sent to the error output. The main difference to note in comparison to a data flow source is that the transformations do not have external columns and have input columns instead. Figure 9-2 shows the functional layout of a data flow transformation.
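As a small sketch of that correction branch (illustrative Python, not SSIS code; the dates and the correction rule of retrying an ambiguous value as month-day-year are made-up examples of such logic), failing rows are fixed up and then merged back with the main rows, much as an error path feeding a corrective transformation and a Union All would do:

    from datetime import datetime

    def try_parse(value, fmt="%d-%m-%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            return None

    main_rows, failing_rows = [], []
    for name, moved_in in [("Steve", "07-11-2005"), ("Sarah", "11-25-2004")]:
        parsed = try_parse(moved_in)
        (main_rows if parsed else failing_rows).append((name, parsed or moved_in))

    # Correction branch: retry the failing rows with an alternative rule, then merge back
    corrected = [(name, try_parse(value, "%m-%d-%Y")) for name, value in failing_rows]
    main_rows.extend(row for row in corrected if row[1] is not None)
    print(main_rows)   # both rows are back in the main flow after the correction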
Finally, after passing through the data flow source and the data flow transformations, the data will reach a data flow destination so that it can be written to an external data store. Like data flow sources, a data flow destination can also read the schema information from
the external data store and copy the metadata to the external columns.

Figure 9-2 Data flow transformation showing Input, Output, and Error Output columns

When the data
flows through the data flow destination, it gets validated against this metadata before being written to the external data store. If configured to redirect the failing rows, the failing rows
may be sent to the error output columns and the rest of the successful data is written to
the external data store. Note that a data flow destination does not have output columns.
Figure 9-3 shows the data flow through a data flow destination.
At this point it is worth mentioning that a pipeline or a data flow path usually
terminates in a data flow destination; however, that is not necessary. Sometimes, you
may prefer to terminate a data flow path in a transformation. For example, while testing
and debugging, you can break and terminate the data flow path by adding a Row Count
transformation to see what happens at a particular point in the data flow path while
ignoring the rest of the pipeline.

With the preceding description of how the data is extracted from a data source, and how it flows through the transformations and destinations
before being written to the external data store, let’s study these components to learn
more about them and to learn how many types of data sources or destinations are
available in the data flow.
Figure 9-3 Data flow destination showing the flow of data to the external data store
Data Flow Sources
While building a data flow for a package using BIDS, your first objective is to bring the data inside the Integration Services data flow so that you can then modify the data using data flow transformations. Data flow sources are designed to bring data from the external sources into the Integration Services data flow. A data flow source reads the external data source, such as a flat file or a table in a relational database, and brings in the data to the SSIS data flow by passing this data through the output columns on to the downstream component, usually a data flow transformation. During design time, the data flow source keeps a snapshot of the metadata of external columns and can refer to it at run time. If the data in certain rows doesn't match this schema at run time, the data sources can be configured to redirect those rows to the error output columns so that they can be dealt with separately. Integration Services provides six preconfigured data flow sources to read data from a variety of data sources, plus a Script Component that can also be scripted as a data flow source. If existing data flow sources do not meet your requirements, you can always build yourself a custom data flow source using the Integration Services object model. Scripting a data flow source is discussed in Chapter 11.
The following two tables list the preconfigured data flow source adapters and the interfaces they have.
Data Flow Source: Description
ADO NET source: Extracts data from .NET Framework data providers, such as SQL Server 2008, using ADO.NET Connection Managers.
Excel source: Extracts data from an Excel worksheet using the Excel Connection Manager.
Flat File source: Reads data from a text file using the Flat File Connection Manager.
OLE DB source: Extracts data from OLE DB–compliant relational databases using the OLE DB Connection Manager.
Raw File source: Extracts data from a raw file using a direct connection.
Script Component: Hosts and runs a script that can be used to extract, transform, or load data. Though not shown under data flow sources in the Toolbox, the Script Component can also be used as a data flow source. This component is covered in Chapter 11.
XML source: Reads data from an XML data source by specifying the location of the XML file or a variable.
Data Flow Source: Input / Output / Error Output / Custom UI / Connection Manager
Flat File source: No / 1 / 1 / Yes / Flat File Connection Manager
From the preceding table, you can make out that not all data flow sources have Error Output and not all data flow sources require a connection manager to connect to the
external data source; rather, some can directly connect to the data source, such as the XML
source. But the important thing to understand here is that data flow sources don't
have an input. They use the external columns interface to get the data and use output
columns to pass the data to downstream components. We will study more about each
of the data flow sources in the following topics.
ADO NET Source
When you need to access data from a .NET provider, you use an ADO.NET Connection
Manager to connect to the data source and then can use the ADO NET source to
bring the data inside the data flow pipeline. You can configure this source to use
either a table or a view or use an SQL command to extract rows from the external
.NET source. The ADO NET source has one regular output and one error output.
The ADO NET source has a custom UI, though you can also use the Advanced Editor to
configure some of the properties that are not exposed in the custom UI. The custom
UI has three pages to configure—Connection Manager, Columns, and Error Output.
Connection Manager
This page lets you specify the connection manager that the ADO NET source uses to connect to the data source. Select one of the
ADO.NET Connection Managers already configured in the package from the
drop-down list provided under the ADO.NET Connection Manager field, or you
can use the New button to add a new ADO.NET Connection Manager.
Columns
This page shows the columns that have been read from the external
data source and cached into the External Columns interface. It also shows you
the corresponding Output Columns that, by default, have the same names as the
cached schema in the External Columns. You can change the Output Column
names here if you want to call them differently inside your pipeline.
Error Output
When the data is extracted from the external data source, some rows may fail due to wrong data coming through. These failures can be categorized as errors or truncations. Errors can be data conversion errors or expression evaluation errors. The data may be of the wrong type—i.e., alphabetical characters arriving in an integer field—causing errors to be generated. Truncation failures may not be as critical as errors—in fact, sometimes they are desirable. Truncation lets the data through, but truncates the data characters for which the length becomes more than the specified length—for example, if you specify the city column as VARCHAR(10), then all the characters after the first ten characters will be truncated when it exceeds the ten-character length. You can configure this component for data type errors or truncations of data in columns to fail, ignore the error, or redirect the failing row to error output. See Figure 9-4 for an explanation.
Figure 9-4 Setting error conditions for data type mismatches and truncation errors
Fail Component
This is the default option and will fail the data flow component, the ADO NET Source in this case, when
an error or a truncation occurs.
Ignore Failure
Choosing this option ignores the error or truncation and continues
outputting the data from this component.
Redirect Row
Choosing this option redirects the
failing row to the error output of the source adapter, which will be handled by
the components capturing the error output rows.
If you have many columns to configure with different settings, you may find it easier to use the
Set This Value To Selected Cells field to apply a value to all of your selected cells.
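To make the difference between a truncation and an error, and the effect of the three options just described, more concrete, here is a minimal sketch in Python (not SSIS code; the VARCHAR(10) column and the values are made up):

    MAX_LEN = 10                                  # e.g., a City column defined as VARCHAR(10)

    def handle_city(value, disposition):
        if len(value) <= MAX_LEN:
            return ("output", value)
        # A truncation failure: the value is valid, just longer than the declared length
        if disposition == "Ignore Failure":
            return ("output", value[:MAX_LEN])    # let the data through, truncated
        if disposition == "Redirect Row":
            return ("error output", value)        # send the whole row to the error output
        raise ValueError("Fail Component: truncation occurred")

    print(handle_city("San Antonio", "Ignore Failure"))  # ('output', 'San Antoni')
    print(handle_city("San Antonio", "Redirect Row"))    # ('error output', 'San Antonio')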
Excel Source
When you need to work with data in Excel files, you can either use the OLE DB
Source with the Microsoft Jet 4.0 OLE DB Provider or simply use an Excel source to
get the data into the pipeline. You will configure the Excel Connection Manager to
connect to an Excel workbook and then use that connection manager inside an Excel
source to extract the data from a worksheet and bring it into the pipeline. You can
treat an Excel workbook as a database and its worksheets as tables while configuring
the Excel source. A range in the Excel workbook can also be treated as a table or a view
in a database. The Excel source adapter has one regular output and one error output.
This component has its own user interface, though you can also use the Advanced
Editor to configure its properties. When you open the Excel Source Editor in the data
flow designer, you will see that this source adapter also has three pages to configure.
Connection Manager
In this page, you can select the connection manager from the drop-down list in the OLE DB
Connection Manager field. The Data Access Mode drop-down list provides four
options. Depending upon your choice in the Data Access Mode field, the interface changes the fields to provide relevant information.
Table or view
Selecting this option extracts data from the specified worksheet in the Name Of The Excel Sheet field.
Table name or view name variable
When you select this option, the
subsequent field changes to the Variable Name field. This option works similarly
to the Table Or View option, except instead of expecting the name of the
worksheet or Excel range to be specified directly, it lets you specify the name
of a variable in the Variable Name field, from which it can read the name of
the Excel range or the worksheet.
SQL Command
Selecting this option changes the interface to let you provide an SQL statement in the SQL Command Text field to access data from an Excel workbook; a worksheet named Sheet1, for example, can be queried as the table [Sheet1$]. You can either type in the SQL directly or use the provided Build Query button to build an SQL query. (I recommend you use this query builder to access data from the Excel sheet, even if you know how to write complex SQL queries for SQL Server, because there are some lesser-known issues with accessing data from Excel workbooks using SQL.)
You can also use a parameterized SQL query, for which you can specify parameter mappings using the Parameters button. When you click Parameters, you get
an interface that lets you map a parameter to a variable.
SQL Command From Variable
This option works the same as the SQL Command option,
except it reads the SQL statement from a variable specified in the Variable Name field.
Columns
In this page, you can map the External Columns to the Output Columns and
change the names of output columns.
Error Output
As with other sources, you can configure the options in this
page to fail the component, ignore the error, or redirect the row in case an error occurs in a data column.
While the Excel Source Editor allows you to configure the properties for the Excel source, you may need to use the Advanced Editor to configure the properties not exposed by the Excel Source Editor. These properties include assigning a name and description to the component, specifying a timeout value for the SQL query, or, most important, changing the data type for a column. While working with the Advanced Editor, get acquainted with the various options available in its interface.
Flat File Source
The Flat File source lets your package read data from a text file and bring that data into the pipeline. You configure a Flat File Connection Manager to connect to a text file and specify how the file is formatted. Also, you will specify the data type and length of each column in the Flat File Connection Manager, which will set guidelines for the Flat File source to handle the data appropriately. The Flat File source can read a delimited, fixed width, or ragged right–formatted flat file. To know more about these file types, refer to Chapter 3.
The Flat File source has a custom user interface that you can use to configure its properties. Also, as with the Excel source adapter, its properties can be configured using
the Advanced Editor. When you open the Flat File Source Editor, the Connection
Manager page opens up by default. You can select the connection manager from
the drop-down list provided in the Flat File Connection Manager field. Flat files
contain nothing for the null values, and if you want to keep these null values, check the
box for the "Retain null values from the source as null values in the data flow" option. By
default, this check box is unchecked, which means the Flat File source will not keep
null values in the data but will replace null values with the appropriate default values for each column type—for example, empty strings for string columns and zero for numeric
columns. Note that the file you are trying to access must be in a delimited format. This
is because the fixed width and/or ragged right–format files do not contain blank spaces;
you need to pad the fields with a padding character to the maximum width so that
the data cannot be treated as null values by the Flat File source adapter. These format
settings for the flat file are done in the Flat File Connection Manager.
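Here is a minimal sketch of that behavior (illustrative Python, not SSIS code; the column list and the default values are assumptions): an empty field in a delimited row either stays null or is replaced with the column type's default, depending on the retain-nulls setting.

    TYPE_DEFAULTS = {"str": "", "int": 0}

    def read_field(raw, col_type, retain_nulls):
        if raw == "":                              # a flat file holds nothing for a null value
            return None if retain_nulls else TYPE_DEFAULTS[col_type]
        return int(raw) if col_type == "int" else raw

    line = "Steve,,250"                            # the Sex field is missing in this row
    columns = [("Name", "str"), ("Sex", "str"), ("House_Number", "int")]

    for retain in (True, False):
        row = {name: read_field(value, col_type, retain)
               for (name, col_type), value in zip(columns, line.split(","))}
        print("retain nulls =", retain, "->", row)
    # retain nulls = True  -> {'Name': 'Steve', 'Sex': None, 'House_Number': 250}
    # retain nulls = False -> {'Name': 'Steve', 'Sex': '', 'House_Number': 250}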
The Columns and Error Output pages can be configured as described in the Excel
source adapter. You can use the Columns page on the Flat File source to map external
columns to the output columns and the Error Output page to configure the error and
truncation behavior when mismatched data comes along, using the three options of
Fail Component, Redirect Row, or Ignore Failure.
This source has two important custom properties—FastParse and UseBinaryFormat—
that are exposed in the Advanced Editor. The configurations for both these properties
are done in the Input and Output Properties tab of the Advanced Editor for the Flat
File source. Depending on the data you are dealing with, you can set the FastParse
option for each of the columns by selecting the column in the Output Columns section
and then going to the Custom Properties category of the column properties. By default,
the FastParse option is set to False (see Figure 9-5), which means the standard parsing
technique will be used. (Remember that standard parsing is a rich set of algorithms that
can provide extensive data type conversions, whereas fast parsing is a relatively simplified
set of parsing routines that supports only the most commonly used date and time formats
without any locale-specific data type conversions.)
The second property, UseBinaryFormat (also shown in Figure 9-5), allows you to let
the binary data in the input column pass through to the output column without parsing.
Sometimes you have to deal with binary data, such as data with the packed decimal
format, especially when you're receiving data from a mainframe system or the data source
is storing data in the COBOL binary format—for example, you might be dealing with
IBM EBCDIC–formatted data. In such cases, you may not want the Flat File source
to parse the data; rather, you would like to parse it separately using special rules that
are built based on how it has been packed into the column in the first place. By default,
UseBinaryFormat is set to false, which means the data in the input column will be parsed
using the Flat File source parsing techniques. To use this property, set UseBinaryFormat to
true and the data type on the output column to DT_BYTES, to let the binary data be
passed on to the output column as is.
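To see why such data needs its own rules rather than string parsing, here is a small sketch (illustrative Python, not SSIS code) that unpacks a COBOL packed-decimal (COMP-3) value, assuming the common layout of two digits per byte with the last nibble carrying the sign (0xC positive, 0xD negative):

    def unpack_comp3(data: bytes, scale: int = 2) -> float:
        digits = []
        for byte in data:
            digits.append(byte >> 4)
            digits.append(byte & 0x0F)
        sign_nibble = digits.pop()              # the last nibble carries the sign
        value = int("".join(str(d) for d in digits))
        if sign_nibble == 0xD:
            value = -value
        return value / (10 ** scale)

    # The three bytes 0x12 0x34 0x5C hold the digits 1 2 3 4 5 and a positive sign,
    # which with two decimal places is 123.45 -- something a string parser could not
    # recover if the bytes had been parsed as DT_STR instead of passed through as DT_BYTES.
    print(unpack_comp3(b"\x12\x34\x5C"))        # 123.45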