During design time, when you drop the data flow source onto the Data Flow Designer surface and configure it to connect to an external data source from which data will be read, the data flow source copies the metadata—i.e., the schema of the data—to its external columns metadata. At run time, when the data is being pulled in, the incoming data columns are parsed into the Integration Services data types using the external columns metadata. In the example, the names Steve and Sarah are parsed into the DT_STR data type; Sex, which is indicated using 1 and 0, has been assigned DT_BOOL; House_Number has been assigned DT_I4; and the Moved_in data has been assigned the DT_DBTIMESTAMP data type. Integration Services provides 29 different data types to cover various types of data, such as character data, numerical data, Boolean values, dates, text, and image fields. Integration Services provides a wide range of data types and is particular about how they are used; hence, if data has a data type that does not match the data types available in Integration Services, an error occurs.
We will explore many data types along the way as we progress with our Hands-On exercises; however, to know more about each data type, refer to Microsoft SQL Server 2008 Books Online.
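Here is a minimal sketch, in Python rather than anything SSIS-specific, of the run-time parsing step just described: each incoming text value is converted according to the SSIS data type recorded for its external column. The DT_* names come from the example above; the day-month-year date format is an assumption about the incoming file.

    from datetime import datetime

    # External columns metadata captured at design time: column -> (source type, SSIS type)
    external_columns = {
        "Name":         ("varchar(50)", "DT_STR"),
        "Sex":          ("bit",         "DT_BOOL"),
        "House_Number": ("int",         "DT_I4"),
        "Moved_in":     ("datetime",    "DT_DBTIMESTAMP"),
    }

    # Run-time parsing routines, keyed by SSIS data type (illustrative only)
    parsers = {
        "DT_STR":         str,
        "DT_BOOL":        lambda v: bool(int(v)),
        "DT_I4":          int,
        "DT_DBTIMESTAMP": lambda v: datetime.strptime(v, "%d-%m-%Y"),  # assumed format
    }

    row = {"Name": "Steve", "Sex": "1", "House_Number": "250", "Moved_in": "07-11-2005"}
    parsed = {col: parsers[external_columns[col][1]](value) for col, value in row.items()}
    print(parsed)
    # Sarah's row, with its invalid date 25-21-2004, would fail this step; what happens
    # to such rows is covered in the error handling discussion later in this section.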
The process that converts the source data into Integration Services data types is
called data parsing. Data flow components can be configured to use either fast parsing
or standard parsing. Fast parsing supports the most commonly used data types with
a simple set of routines and can be configured at the column level. Because it uses
simple routines and does not check for many other data types, it can achieve high levels
of performance. However, it is not available for most of the data flow components
and can parse only a narrow range of data types.

Figure 9-1 A data flow source extracting and parsing data

For example, fast parsing does not
support locale-specific parsing, special currency symbols, and date formats other than
year-month-date. Also, fast parsing can be used only when using a Flat File source,
a Data Conversion transformation or Derived Column transformation, and a Flat File
destination, because these are the only components that convert data between string
and binary data types. You may use fast parsing for performance reasons when your
data meets these requirements, but for all other occasions, you will be using standard
parsing. Before using fast parsing, check out Books Online to find out the data
types supported by fast parsing.
Standard parsing uses a rich set of parsing routines that are equivalent to OLE DB
parsing APIs and supports all the data type conversions provided by the automation
data type conversion APIs available in Oleaut32.dll and Ole2disp.dll. For example,
standard parsing provides support for locale-sensitive parsing and international data
type conversions.
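As a rough analogy in Python (these are not the actual SSIS routines), fast parsing behaves like a single fixed-format conversion, while standard parsing behaves like a richer set of routines that also tries locale-style formats:

    from datetime import datetime

    def fast_parse_date(text):
        # Analogy for fast parsing: one simple, culture-neutral routine (year-month-day only)
        return datetime.strptime(text, "%Y-%m-%d").date()

    def standard_parse_date(text, formats=("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")):
        # Analogy for standard parsing: a richer, locale-aware set of routines
        for fmt in formats:
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue
        raise ValueError("cannot parse date: " + text)

    print(fast_parse_date("2005-11-07"))      # accepted: year-month-day
    print(standard_parse_date("07-11-2005"))  # accepted by a locale-style format
    # fast_parse_date("07-11-2005") would raise ValueError, just as fast parsing
    # rejects date formats other than year-month-date.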
Returning to the data flow, as the data rows arrive and are parsed, they are validated
against the external columns metadata. The data that passes this validation check at
run time is copied to the output columns; the data that doesn't pass is treated slightly
differently. You can define the action the component can take when a failure occurs.
Typical errors you can expect at the validation stage include data type mismatches
or a data length that exceeds the length defined for the column. Integration Services
handles these as two different types of errors—data type errors and data length or data
truncation errors. You can specify the action you want the data flow component to take
for each type of error from the three options—fail the component, ignore the error, or
redirect the failing row to error output fields. You can specify one of these actions on
all columns or different actions for different columns. If you redirect the failing rows to
the error output fields and link the error output to a data flow destination, the failing
rows will be written to the data flow destination you specified.
The error output contains all the output columns and two additional columns for the
failing rows, ErrorCode and ErrorColumn, which indicate the type of error and the failing
column. In Figure 9-1, note that the record holding data for Sarah has a wrong date
specified and hence fails during the extract process. As the source was configured to redirect
rows, the failing row data is sent to the Error Output. Also, note that the two rightmost
columns indicate the type of error and the column number. Every output field is assigned
an ID automatically, and the number shown in the ErrorColumn is the ID number of the
column failing the extract process. If two columns fail on the same row, the column that
fails first is captured and reported. So, in case of multiple failures on a row, you might not
know about them until the package fails again after you have fixed the first error.
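The sketch below (again illustrative Python, not SSIS code) shows the shape of the error output for the Redirect Row case; the column IDs and the error code are placeholders, not real SSIS values, and only the first failing column is reported:

    # Assumed column IDs and a placeholder error code, for illustration only
    COLUMN_IDS = {"Name": 17, "Sex": 18, "House_Number": 19, "Moved_in": 20}
    PLACEHOLDER_CONVERSION_ERROR = -1

    def redirect_row(row, failing_columns):
        # Only the first failing column is captured and reported
        first = failing_columns[0]
        return {**row,
                "ErrorCode": PLACEHOLDER_CONVERSION_ERROR,
                "ErrorColumn": COLUMN_IDS[first]}

    sarah = {"Name": "Sarah", "Sex": "0", "House_Number": "130", "Moved_in": "25-21-2004"}
    print(redirect_row(sarah, failing_columns=["Moved_in"]))
    # {'Name': 'Sarah', 'Sex': '0', 'House_Number': '130', 'Moved_in': '25-21-2004',
    #  'ErrorCode': -1, 'ErrorColumn': 20}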
After pulling in the data rows, a data flow source passes the output rows to the next
data flow component—generally a transformation—and the failing rows are passed to
another data flow component—which could be a data flow destination, if configured
to redirect the failing rows. You can also redirect the failing rows to an alternate branch in the data flow via a transformation in which you can apply logic to correct the failing rows and bring them back into the main data flow after corrections. This is a highly useful capability that, if used properly, can reduce wastage and improve data quality.

When the output of a data flow source is connected to the input of a data flow transformation, the data flows from the input of the transformation, through the processing logic of the transformation, and then to its output. Based on the logic of the transformation, some rows may fail the process and may be sent to the error output. The main difference to note in comparison to a data flow source is that the transformations do not have external columns and have input columns instead. Figure 9-2 shows the functional layout of a data flow transformation.
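As a small sketch of that correction branch (illustrative Python, not SSIS code; the dates and the correction rule of retrying an ambiguous value as month-day-year are made-up examples of such logic), failing rows are fixed up and then merged back with the main rows, much as an error path feeding a corrective transformation and a Union All would do:

    from datetime import datetime

    def try_parse(value, fmt="%d-%m-%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            return None

    main_rows, failing_rows = [], []
    for name, moved_in in [("Steve", "07-11-2005"), ("Sarah", "11-25-2004")]:
        parsed = try_parse(moved_in)
        (main_rows if parsed else failing_rows).append((name, parsed or moved_in))

    # Correction branch: retry the failing rows with an alternative rule, then merge back
    corrected = [(name, try_parse(value, "%m-%d-%Y")) for name, value in failing_rows]
    main_rows.extend(row for row in corrected if row[1] is not None)
    print(main_rows)   # both rows are back in the main flow after the correction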
Finally, after passing through the data flow source and the data flow transformations, the data will reach a data flow destination so that it can be written to an external data store. Like data flow sources, a data flow destination can also read the schema information from
the external data store and copy the metadata to the external columns.

Figure 9-2 Data flow transformation showing Input, Output, and Error Output columns

When the data
flows through the data flow destination, it gets validated against this metadata before being written to the external data store. If configured to redirect the failing rows, the failing rows
may be sent to the error output columns and the rest of the successful data is written to
the external data store. Note that a data flow destination does not have output columns.
Figure 9-3 shows the data flow through a data flow destination.
At this point it is worth mentioning that a pipeline or a data flow path usually
terminates in a data flow destination; however, that is not necessary. Sometimes, you
may prefer to terminate a data flow path in a transformation. For example, while testing
and debugging, you can break and terminate the data flow path by adding a Row Count
transformation to see what happens at a particular point in the data flow path while
ignoring the rest of the pipeline.

With the preceding description of how the data is extracted from a data source, and how it flows through the transformations and destinations
before being written to the external data store, let’s study these components to learn
more about them and to learn how many types of data sources or destinations are
available in the data flow.
Figure 9-3 Data flow destination showing the flow of data to the external data store
Data Flow Sources
While building a data flow for a package using BIDS, your first objective is to bring the data inside the Integration Services data flow so that you can then modify the data using data flow transformations. Data flow sources are designed to bring data from the external sources into the Integration Services data flow. A data flow source reads the external data source, such as a flat file or a table in a relational database, and brings in the data to the SSIS data flow by passing this data through the output columns on to the downstream component, usually a data flow transformation. During design time, the data flow source keeps a snapshot of the metadata of external columns and can refer to it at run time. If the data in certain rows doesn't match this schema at run time, the data sources can be configured to redirect those rows to the error output columns so that they can be dealt with separately. Integration Services provides six preconfigured data flow sources to read data from a variety of data sources, plus a Script Component that can also be scripted as a data flow source. If existing data flow sources do not meet your requirements, you can always build yourself a custom data flow source using the Integration Services object model. Scripting a data flow source is discussed in Chapter 11.
The following two tables list the preconfigured data flow source adapters and the interfaces they have.
Data Flow Source: Description
ADO NET source: Extracts data from .NET Framework data providers, such as SQL Server 2008, using ADO.NET Connection Managers.
Excel source: Extracts data from an Excel worksheet using the Excel Connection Manager.
Flat File source: Reads data from a text file using the Flat File Connection Manager.
OLE DB source: Extracts data from OLE DB–compliant relational databases using the OLE DB Connection Manager.
Raw File source: Extracts data from a raw file using a direct connection.
Script Component: Hosts and runs a script that can be used to extract, transform, or load data. Though not shown under data flow sources in the Toolbox, the Script Component can also be used as a data flow source. This component is covered in Chapter 11.
XML source: Reads data from an XML data source by specifying the location of the XML file or a variable.
Data Flow Source: Input / Output / Error Output / Custom UI / Connection Manager
Flat File source: No / 1 / 1 / Yes / Flat File Connection Manager
From the preceding table, you can make out that not all data flow sources have Error Output and not all data flow sources require a connection manager to connect to the
external data source; rather, some can directly connect to the data source, such as the XML
source. But the important thing to understand here is that data flow sources don't
have an input. They use the external columns interface to get the data and use output
columns to pass the data to downstream components. We will study more about each
of the data flow sources in the following topics.
ADO NET Source
When you need to access data from a .NET provider, you use an ADO.NET Connection
Manager to connect to the data source and then can use the ADO NET source to
bring the data inside the data flow pipeline. You can configure this source to use
either a table or a view or use an SQL command to extract rows from the external
.NET source. The ADO NET source has one regular output and one error output.
The ADO NET source has a custom UI, though you can also use the Advanced Editor to
configure some of the properties that are not exposed in the custom UI. The custom
UI has three pages to configure—Connection Manager, Columns, and Error Output.
Connection Manager
This page lets you specify the connection manager that the ADO NET source uses to connect to the data source. Select one of the
ADO.NET Connection Managers already configured in the package from the
drop-down list provided under the ADO.NET Connection Manager field, or you
can use the New button to add a new ADO.NET Connection Manager.
Columns
This page shows the columns that have been read from the external
data source and cached into the External Columns interface. It also shows you
the corresponding Output Columns that, by default, have the same names as the
cached schema in the External Columns. You can change the Output Column
names here if you want to call them differently inside your pipeline.
Error Output
When the data is extracted from the external data source, some rows may fail due to wrong data coming through. These failures can be categorized as errors or truncations. Errors can be data conversion errors or expression evaluation errors. The data may be of the wrong type—i.e., alphabetical characters arriving in an integer field—causing errors to be generated. Truncation failures may not be as critical as errors—in fact, sometimes they are desirable. Truncation lets the data through, but truncates the data characters for which the length becomes more than the specified length—for example, if you specify the city column as VARCHAR(10), then all the characters after the first ten characters will be truncated when it exceeds the ten-character length. You can configure this component for data type errors or truncations of data in columns to fail, ignore the error, or redirect the failing row to error output. See Figure 9-4 for an explanation.
Figure 9-4 Setting error conditions for data type mismatches and truncation errors
Fail Component
This is the default option and will fail the data flow component, the ADO NET Source in this case, when
an error or a truncation occurs.
Ignore Failure
Choosing this option ignores the error or truncation and continues
outputting the data from this component.
Redirect Row
Choosing this option redirects the
failing row to the error output of the source adapter, which will be handled by
the components capturing the error output rows.
If you have many columns to configure with different settings, you may find it easier to use the
Set This Value To Selected Cells field to apply a value to all of your selected cells.
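To make the difference between a truncation and an error, and the effect of the three options just described, more concrete, here is a minimal sketch in Python (not SSIS code; the VARCHAR(10) column and the values are made up):

    MAX_LEN = 10                                  # e.g., a City column defined as VARCHAR(10)

    def handle_city(value, disposition):
        if len(value) <= MAX_LEN:
            return ("output", value)
        # A truncation failure: the value is valid, just longer than the declared length
        if disposition == "Ignore Failure":
            return ("output", value[:MAX_LEN])    # let the data through, truncated
        if disposition == "Redirect Row":
            return ("error output", value)        # send the whole row to the error output
        raise ValueError("Fail Component: truncation occurred")

    print(handle_city("San Antonio", "Ignore Failure"))  # ('output', 'San Antoni')
    print(handle_city("San Antonio", "Redirect Row"))    # ('error output', 'San Antonio')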
Excel Source
When you need to work with data in Excel files, you can either use the OLE DB
Source with the Microsoft Jet 4.0 OLE DB Provider or simply use an Excel source to
get the data into the pipeline. You will configure the Excel Connection Manager to
connect to an Excel workbook and then use that connection manager inside an Excel
source to extract the data from a worksheet and bring it into the pipeline. You can
treat an Excel workbook as a database and its worksheets as tables while configuring
the Excel source. A range in the Excel workbook can also be treated as a table or a view
in a database. The Excel source adapter has one regular output and one error output.
This component has its own user interface, though you can also use the Advanced
Editor to configure its properties. When you open the Excel Source Editor in the data
flow designer, you will see that this source adapter also has three pages to configure.
Connection Manager
In this page, you can select the connection manager from the drop-down list in the OLE DB
Connection Manager field. The Data Access Mode drop-down list provides four
options. Depending upon your choice in the Data Access Mode field, the interface changes the fields to provide relevant information.
Table or view
Selecting this option extracts data from the specified worksheet in the Name Of The Excel Sheet field.
Table name or view name variable
When you select this option, the
subsequent field changes to the Variable Name field. This option works similarly
to the Table Or View option, except instead of expecting the name of the
worksheet or Excel range to be specified directly, it lets you specify the name
of a variable in the Variable Name field, from which it can read the name of
the Excel range or the worksheet.
SQL Command
Selecting this option changes the interface to let you provide an SQL statement in the SQL Command Text field to access data from an Excel workbook; a worksheet named Sheet1, for example, can be queried as the table [Sheet1$]. You can either type in the SQL directly or use the provided Build Query button to build an SQL query. (I recommend you use this query builder to access data from the Excel sheet, even if you know how to write complex SQL queries for SQL Server, because there are some lesser-known issues with accessing data from Excel workbooks using SQL.)
You can also use a parameterized SQL query, for which you can specify parameter mappings using the Parameters button. When you click Parameters, you get
an interface that lets you map a parameter to a variable.
SQL Command From Variable
This option works the same as the SQL Command option,
except it reads the SQL statement from a variable specified in the Variable Name field.
Columns
In this page, you can map the External Columns to the Output Columns and
change the names of output columns.
Error Output
As with other sources, you can configure the options in this
page to fail the component, ignore the error, or redirect the row in case an error occurs in a data column.
While the Excel Source Editor allows you to configure the properties for the Excel source, you may need to use the Advanced Editor to configure the properties not exposed by the Excel Source Editor. These properties include assigning a name and description to the component, specifying a timeout value for the SQL query, or, most important, changing the data type for a column. While working with the Advanced Editor, get acquainted with the various options available in its interface.
Flat File Source
The Flat File source lets your package read data from a text file and bring that data into the pipeline. You configure a Flat File Connection Manager to connect to a text file and specify how the file is formatted. Also, you will specify the data type and length of each column in the Flat File Connection Manager, which will set guidelines for the Flat File source to handle the data appropriately. The Flat File source can read a delimited, fixed width, or ragged right–formatted flat file. To know more about these file types, refer to Chapter 3.
The Flat File source has a custom user interface that you can use to configure its properties. Also, as with the Excel source adapter, its properties can be configured using
the Advanced Editor. When you open the Flat File Source Editor, the Connection
Manager page opens up by default. You can select the connection manager from
the drop-down list provided in the Flat File Connection Manager field. Flat files
contain nothing for the null values, and if you want to keep these null values, check the
box for the "Retain null values from the source as null values in the data flow" option. By
default, this check box is unchecked, which means the Flat File source will not keep
null values in the data but will replace null values with the appropriate default values for each column type—for example, empty strings for string columns and zero for numeric
columns. Note that the file you are trying to access must be in a delimited format. This
is because the fixed width and/or ragged right–format files do not contain blank spaces;
you need to pad the fields with a padding character to the maximum width so that
the data cannot be treated as null values by the Flat File source adapter. These format
settings for the flat file are done in the Flat File Connection Manager.
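Here is a minimal sketch of that behavior (illustrative Python, not SSIS code; the column list and the default values are assumptions): an empty field in a delimited row either stays null or is replaced with the column type's default, depending on the retain-nulls setting.

    TYPE_DEFAULTS = {"str": "", "int": 0}

    def read_field(raw, col_type, retain_nulls):
        if raw == "":                              # a flat file holds nothing for a null value
            return None if retain_nulls else TYPE_DEFAULTS[col_type]
        return int(raw) if col_type == "int" else raw

    line = "Steve,,250"                            # the Sex field is missing in this row
    columns = [("Name", "str"), ("Sex", "str"), ("House_Number", "int")]

    for retain in (True, False):
        row = {name: read_field(value, col_type, retain)
               for (name, col_type), value in zip(columns, line.split(","))}
        print("retain nulls =", retain, "->", row)
    # retain nulls = True  -> {'Name': 'Steve', 'Sex': None, 'House_Number': 250}
    # retain nulls = False -> {'Name': 'Steve', 'Sex': '', 'House_Number': 250}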
The Columns and Error Output pages can be configured as described in the Excel
source adapter. You can use the Columns page on the Flat File source to map external
columns to the output columns and the Error Output page to configure the error and
truncation behavior when mismatched data comes along, using the three options of
Fail Component, Redirect Row, or Ignore Failure.
This source has two important custom properties—FastParse and UseBinaryFormat—
that are exposed in the Advanced Editor. The configurations for both these properties
are done in the Input and Output Properties tab of the Advanced Editor for the Flat
File source. Depending on the data you are dealing with, you can set the FastParse
option for each of the columns by selecting the column in the Output Columns section
and then going to the Custom Properties category of the column properties. By default,
the FastParse option is set to False (see Figure 9-5), which means the standard parsing
technique will be used. (Remember that standard parsing is a rich set of algorithms that
can provide extensive data type conversions, whereas fast parsing is a relatively simplified
set of parsing routines that supports only the most commonly used date and time formats
without any locale-specific data type conversions.)
The second property, UseBinaryFormat (also shown in Figure 9-5), allows you to let
the binary data in the input column pass through to the output column without parsing.
Sometimes you have to deal with binary data, such as data with the packed decimal
format, especially when you're receiving data from a mainframe system or the data source
is storing data in the COBOL binary format—for example, you might be dealing with
IBM EBCDIC–formatted data. In such cases, you may not want the Flat File source
to parse the data; rather, you would like to parse it separately using special rules that
are built based on how it has been packed into the column in the first place. By default,
UseBinaryFormat is set to false, which means the data in the input column will be parsed
using the Flat File source parsing techniques. To use this property, set UseBinaryFormat to
true and the data type on the output column to DT_BYTES, to let the binary data be
passed on to the output column as is.
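To see why such data needs its own rules rather than string parsing, here is a small sketch (illustrative Python, not SSIS code) that unpacks a COBOL packed-decimal (COMP-3) value, assuming the common layout of two digits per byte with the last nibble carrying the sign (0xC positive, 0xD negative):

    def unpack_comp3(data: bytes, scale: int = 2) -> float:
        digits = []
        for byte in data:
            digits.append(byte >> 4)
            digits.append(byte & 0x0F)
        sign_nibble = digits.pop()              # the last nibble carries the sign
        value = int("".join(str(d) for d in digits))
        if sign_nibble == 0xD:
            value = -value
        return value / (10 ** scale)

    # The three bytes 0x12 0x34 0x5C hold the digits 1 2 3 4 5 and a positive sign,
    # which with two decimal places is 123.45 -- something a string parser could not
    # recover if the bytes had been parsed as DT_STR instead of passed through as DT_BYTES.
    print(unpack_comp3(b"\x12\x34\x5C"))        # 123.45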