Trang 16 8 H a n d s - O n M i c r o s o f t S Q L S e r v e r 2 0 0 8 I n t e g r a t i o n S e r v i c e s
Summary
You created an Integration Services blank project in Chapter 1. In this chapter, you created packages using the SQL Server Import and Export Wizard and then added those packages to your blank project. You also created a package directly in BIDS, again using the SQL Server Import and Export Wizard. Above all, you explored those packages by opening component properties and configurations, and you should now better understand how an Integration Services package is put together. Last, you worked with the Data Profiling Task to identify quality issues with your data. In the next chapter, you will learn about the basic components, the nuts and bolts of Integration Services packages, before jumping in to build complex packages in Chapter 4 using various preconfigured components provided in BIDS.
Figure 2-14 Column Length Distribution Profiles
Nuts and Bolts of the SSIS Workflow
In This Chapter
• Integration Services Objects
• Solutions and Projects
• File Formats
• Connection Managers
• Data Sources and Data Source Views
• SSIS Variables
• Precedence Constraints
• Integration Services Expressions
• Summary
So far, you have moved data using the SQL Server Import and Export Wizard and viewed the packages it created by opening them in the Business Intelligence Development Studio (BIDS). In this chapter, you will extend your learning by understanding the nuts and bolts of Integration Services, such as the use of variables, connection managers, precedence constraints, and SSIS expressions. If you have used Data Transformation Services (DTS 2000), you may grasp these topics quickly; however, much about them is new in Integration Services. Usability and management of variables have been greatly enhanced, connectivity needs for packages are now satisfied by connection managers, enhanced precedence constraints have been included to give you total control over the package workflow, and, above all, the SSIS Expression language offers a powerful programming interface to let you generate values at run time.
Integration Services Objects
Integration Services performs its operations with the help of various objects and components such as connection managers, sources, tasks and transformations, containers, event handlers, and destinations. All these components are threaded together to achieve the desired functionality; that is, they work hand in hand, yet they can be configured separately.
A major enhancement Microsoft made to DTS 2000 in turning it into Integration Services is the separation of workflow from data flow. SSIS provides two different designer surfaces, which are effectively different integrated development environments (IDEs) for developing packages. You design and configure the workflow in the Control Flow Designer surface, and the data movement and transformations in the Data Flow Designer surface. Each designer environment provides its own set of components, and the contents of the Toolbox window are unique to each environment.
The following objects are involved in an Integration Services package:
• Integration Services package: The top-level object in the SSIS component hierarchy. All the work performed by SSIS tasks occurs within the context of a package.
• Control flow: Helps to build the workflow in an ordered sequence using containers, tasks, and precedence constraints. Containers provide structure to the package and a looping facility, tasks provide functionality, and precedence constraints build an ordered workflow by connecting containers, tasks, and other executables.
• Data flow: Helps to build the data movement and transformations in a package using data adapters and transformations in ordered sequential paths.
• Connection managers: Handle all the connectivity needs.
• Integration Services variables: Help to reuse or pass values between objects and provide a facility to derive values dynamically at run time.
• Integration Services event handlers: Respond to events occurring at run time.
• Integration Services log providers: Capture information when log-enabled events occur at run time.
To enhance the learning experience while you are working with the SSIS components, you will first be introduced to the easier and more commonly used objects, and later be presented with the more complex configurations.
Solutions and Projects
Integration Services offers different environments for developing and managing your SSIS packages. SSIS packages are usually designed and developed in the development environment using BIDS, while SQL Server Management Studio can be used to deploy, manage, and run packages, though there are other options to deploy and manage the packages, as you will study in Chapter 13. Both environments have special features and toolsets to help you perform these jobs efficiently.
While BIDS has the whole toolset to develop and deploy SSIS packages, SQL Server Management Studio cannot be used to edit or design Integration Services solutions or projects. However, in both environments, you use solutions and projects to organize and manage your files and code in a logical, hierarchical manner. A solution is a container that allows you to bring together scattered projects so that you can organize and manage them as one unit. In general, you will use a solution to focus on one area of the business, such as one solution for accounts and a separate solution for marketing. However, complex business problems may require multiple solutions to achieve specific objectives. Figure 3-1 shows a solution that not only contains multiple projects but also includes projects of multiple types. This figure shows an Analysis Services project having a Sales cube, an Integration Services project having two SSIS packages, and a Reporting Services project with a Monthly Sales report, all in one solution.
Within a solution, one or more projects, along with related files for databases, connections, scripts, and miscellaneous files, can be saved together. Not only can multiple projects be stored under one solution, but multiple types of projects can be stored under one solution. For example, while working in BIDS, you can store a data transformation project as well as a data-mining project under the same solution. Grouping multiple projects in one solution has several benefits, such as reduced development time, code reusability, interdependencies management, settings management for all the projects at a single location, and the facility to save all the projects to Visual SourceSafe or Team Foundation Server in the same hierarchical manner as you have in the development environment. Both SQL Server Management Studio and BIDS provide templates for working with different types of projects. These templates provide appropriate environments (designer surfaces, scripts, connections, and so on) for each project with which you are working.
When you create a new project, Visual Studio tools automatically generate a solution for you while giving you an option to create a separate folder for the solution. If you don’t choose to create a directory for the solution, the solution file is created along with the other project files in the same folder; if you do choose to create a directory for the solution, a folder is created for the solution with the project folder under it as a subfolder. You thus get a hierarchical structure to which you can add various projects and items (data sources, data source views, SSIS packages, scripts, miscellaneous files) as and when required. Solution Explorer lists the projects and the files contained in them in a tree view that helps you to manage the projects and the files (as shown in Figure 3-1). The logical hierarchy reflected in the tree view of a solution does not necessarily relate to the physical storage of files and folders on the hard disk drive, however. Solution Explorer provides the facility to integrate with Visual SourceSafe or Team Foundation Server for version control, which is a great feature when you want to track changes or roll back code.
Figure 3-1 Solution Explorer showing a solution with different types of projects
File Formats
Whenever an ETL tool has to integrate with legacy systems, mainframes, or any other proprietary database systems, the easiest way to transfer data between the systems is to use flat files. Integration Services can deal with flat files in fixed-width, delimited, and ragged-right formats. For the benefit of users who are new to the ETL world, these formats are explained next.
Fixed Width
If you have been working with mainframes or legacy systems, you may be familiar with this format. Fixed-width files use different widths for columns, but the chosen width per column stays fixed for all the rows, regardless of the contents of those columns. If you open such a file, you will likely see lots of blank spaces between columns. As most of the data in a column with variable data tends to be shorter than the width provided, you’ll see a lot of wasted space. As a result, these files are likely to be larger than the same data stored in the other formats.
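To make the padding concrete, here is a minimal Python sketch of writing and reading one fixed-width record. The column names and widths in `LAYOUT` are hypothetical and chosen only for illustration; in a real exchange they would come from the file specification agreed with the source system.

```python
# Hypothetical layout: (column name, width in characters).
LAYOUT = [("id", 5), ("name", 12), ("city", 10)]


def format_fixed_width(record, layout=LAYOUT):
    """Pad (or truncate) each value to its column width and join them."""
    return "".join(str(record[name])[:width].ljust(width) for name, width in layout)


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line by position and trim the padding."""
    record = {}
    start = 0
    for name, width in layout:
        record[name] = line[start:start + width].strip()
        start += width
    return record


row = format_fixed_width({"id": 42, "name": "Ann", "city": "Oslo"})
print(repr(row))               # the padding shows up as runs of spaces
print(parse_fixed_width(row))  # values recovered by position, not by delimiter
```

Every row has the same length (27 characters with this layout) however short the values are, which is where the wasted space comes from.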
Delimited
The most common format used by systems to exchange data with foreign systems, delimited files separate the columns using a delimiter such as a comma or tab and typically use a character combination (for example, carriage return plus linefeed characters, {CR}{LF}) to delimit rows/records. Generally, importing data using this format is quite easy, unless the delimiter used also appears in the data. For example, if users are allowed to enter free text in a notes field, some of them may type a comma, and this comma will be treated as a column delimiter and will distort the whole row. The free-format data entry conflicts with the delimiter, and data is imported into the wrong columns. Because of such potential conflicts, you need to pay particular attention to the quality of the data you are dealing with while choosing a delimiter. Delimited files are usually smaller than fixed-width files, as the padding is replaced by a delimiter.
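The delimiter conflict is easy to reproduce. The Python sketch below uses made-up sample data to show a naive split on commas breaking a row whose notes field contains a comma. It then shows one common remedy, wrapping the field in a text qualifier (double quotes), which is an assumption added here for illustration and is not discussed in the text above.

```python
import csv
import io

# Made-up sample: rows end with {CR}{LF}; the notes value contains a comma
# and is wrapped in double quotes (a text qualifier).
raw = 'id,name,notes\r\n1,Ann,"Called twice, left a message"\r\n'

# Naive approach: split every line on the delimiter.
naive = [line.split(",") for line in raw.splitlines()]

# Qualifier-aware approach: the csv module treats quoted commas as data.
parsed = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive[1]), naive[1])    # the data row is split into too many columns
print(len(parsed[1]), parsed[1])  # the data row keeps its three columns
```

If the source system cannot emit a text qualifier, choosing a delimiter that never occurs in the data (a tab or a pipe character, for instance) is the usual alternative.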
Ragged Right
If you have a fixed-width file in which the rightmost column is nonuniform and you want to save some space, you can add a delimiter (such as {CR}{LF}) at the end of each row and make it a ragged-right file. Ragged-right files are similar to fixed-width files except that they use a delimiter to mark the end of a row/record; that is, in ragged-right files, the last column is of variable size. This makes the file easier to work with when displayed in Notepad or imported into an application. Also, some vendors use this format when they want the flexibility to change the number of columns in the file. In such situations, they keep all the regular columns (the columns that always exist) in the first part of the file and combine the columns that may or may not exist into a single string of data at the end of the row. The length of the last column varies depending on which columns have been included. Applications generally use substring logic to separate out the columns from this variable-length combined column.
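The substring logic mentioned above might look like the following Python sketch. The fixed columns, the optional columns packed into the tail, and their widths are all hypothetical; a real layout would be defined by the vendor supplying the file.

```python
# Hypothetical layout: columns that always exist, then optional columns
# packed one after another into the variable-length last column.
FIXED = [("id", 4), ("date", 10)]
OPTIONAL = [("code", 3), ("region", 5)]


def parse_ragged_right(line):
    """Slice the fixed columns by position, then peel optional columns off the tail."""
    record = {}
    start = 0
    for name, width in FIXED:
        record[name] = line[start:start + width].strip()
        start += width
    tail = line[start:]            # variable-length last column
    for name, width in OPTIONAL:
        if not tail:               # remaining optional columns are absent
            break
        record[name] = tail[:width].strip()
        tail = tail[width:]
    return record


# Two made-up records, each terminated by {CR}{LF}.
data = "00012008-01-15ABCNORTH\r\n00022008-01-16XYZ\r\n"
records = [parse_ragged_right(line) for line in data.splitlines()]
for record in records:
    print(record)
```

The second record is shorter than the first because it carries only one optional column; the row delimiter, not a fixed row length, marks where each record ends.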
Connection Managers
As data grows in scattered places, it’s the job of the information analyst to bring it all together to draw out pertinent information. The biggest problem in bringing such data sets together and merging them into a single storage location is how to handle the different data sources, such as legacy mainframe systems, Oracle databases, flat files, Excel spreadsheets, Microsoft Access files, and so on. The connection managers provided in Integration Services come to the rescue.
In Chapter 2, you saw how connection managers were used inside the package to import data. The components defined inside an Integration Services package require that physical connections be made to data stores at run time. The source adapter reads data from the data source and then passes it on to the data flow for transformations, while the destination adapter loads the transformed data to the destination store. Not only do the extraction and loading components require connections, but some other components require them as well. For example, the Lookup transformation reads values from a reference table and performs its transformation based on the values in that lookup table. Then there are logging and auditing requirements that also need connections to storage systems such as databases or text files.
A connection manager is a logical representation of a connection. You use a connection manager to describe the connection properties at design time, and Integration Services interprets these to make a physical connection at run time. For example, at design time, you can set a connection string property within a connection manager, which is then read by the Integration Services run-time engine to make a physical connection. A connection manager is stored in the package metadata and cannot be shared with other packages.
Connection managers enhance connection flexibility. Multiple connection managers of the same type can be created to meet the needs of Integration Services packages and enhance performance. For example, a package can use, say, five OLE DB connection managers, all built on the same data connection.
You can add connection managers to your package using one of the following methods in BIDS:
• Choose New Connection from the SSIS menu.
• Choose the New Connection command from the context menu that opens when you right-click the blank surface in the Connection Managers area.
• Add a connection manager from within the editor or advanced editor dialog boxes of some of the tasks, transformations, source adapters, and destination adapters that require connection to a data store.
The connection managers you add to the project at design time appear in the Connection Managers area in the BIDS designer surfaces, but they do not appear in the Connection Managers collection in Package Explorer until you run the package successfully for the first time. At run time, Integration Services resolves the settings of all the added connections, sets the connection manager properties for each of them, and then adds them to the Connection Managers collection in Package Explorer.
You will be using many of the connection managers in Hands-On exercises while you create solutions for business problems later on. For now, open BIDS, create a new blank project, and check out the properties of all the connection managers as you read through the following descriptions. Figure 3-2, which appears in the later section “Microsoft Connector 1.0 for SAP BI,” shows the list of all the connection managers provided in SQL Server 2008 Integration Services.
ADO Connection Manager
The ADO Connection Manager enables a package to connect to an ADO recordset. This connection manager has been provided mainly for legacy support. You will most likely use it when you’re working with a legacy application that uses ActiveX Data Objects (ADO) to connect to the data sources. You might also have to use this connection manager when developing a custom component that works with such a legacy application.
ADO.NET Connection Manager
The current model of software applications is very different from the earlier connected, tightly coupled client/server scenario, where a connection was held open for the lifetime of the application. Now you have varied types of data stores, and these data stores are being hit with several hundred connections every minute. ADO.NET overcomes these shortcomings and provides disconnected data access, integration with XML, optimized interaction with databases, and the ability to combine data from numerous data sources. These features make ADO.NET connection managers quite reliable and flexible, with lots of options; however, they might be a little slower than the customized or dedicated connection managers for a particular source. You can also have consistent access to data sources using ADO.NET providers. The ADO.NET Connection Manager provides access to data sources, such as SQL Server or sources exposed through OLE DB or XML, using a .NET provider. You can choose from the .NET Framework Data Provider for SQL Server (SqlClient), the .NET Framework Data Provider for Oracle Server (OracleClient), the .NET Framework Data Provider for ODBC (Open Database Connectivity), and the .NET Framework Data Provider for OLE DB. The configuration options of the ADO.NET Connection Manager change depending on the choice of .NET provider.
Cache Connection Manager
The Cache Connection Manager is primarily used for creating a cache for the Lookup Transformation. When you have to run a Lookup Transformation repeatedly in a package or have to share the reference (lookup) data set among multiple packages, you might prefer to persist this cache to a file to improve performance. You would then use a cache transformation, which in turn uses the Cache Connection Manager to write the cached information to a cache file (.caw). Later, in Chapter 10, “Data Flow Transformations,” when you work with the Lookup Transformation, you will use this connection manager to cache data to a file.
Excel Connection Manager
This connection manager provides access to a Microsoft Excel workbook file. It is used when you add an Excel Source or Excel Destination to your package. With the launch of Excel 2007, the data provider for Excel changed from the earlier Microsoft Jet OLE DB Provider to the OLE DB provider for the Microsoft Office 12.0 Access Database Engine. If you check the ConnectionString property of the Excel Connection Manager after adding it using the Microsoft Excel 97-2003 version, you will see the provider listed as Microsoft.Jet.OLEDB.4.0, whereas this property shows the provider as Microsoft.ACE.OLEDB.12.0 when you add the Excel Connection Manager using the Microsoft Excel 2007 version. It is important to understand the connection string, as you may need to write it yourself in some packages, for example, if you’re getting the file path at run time and want to create the connection string dynamically. Here is the connection string shown for both versions of the Excel driver:
Provider=Microsoft.Jet.OLEDB.4.0; Data Source=…\RawDataTxt.xls;Extended Properties="Excel 8.0;HDR=YES";
Provider=Microsoft.ACE.OLEDB.12.0; Data Source=C:\SSIS\RawFiles\RawDataTxt.xlsx;Extended Properties="Excel 12.0;HDR=YES";
Note the difference between the providers for the two versions, as explained earlier. There are some additional properties that you need to specify in the extended properties section. First, you use Excel 8.0 in the extended properties for Excel versions 97, 2000, 2002, and 2003, while you use Excel 12.0 for the Excel 2007 version. Second, you use the HDR property to specify whether the first row contains column names. The default value is YES; that is, if you do not specify this property, the first row is deemed to contain column names. Also, the Excel driver sometimes fails to pick up some values in columns where string and numeric values are mixed. By default, the Excel driver samples the first eight rows to determine the data type of the column and returns nulls for values of any other data type in that column. You can override this behavior by importing all the values as strings, using the import mode setting IMEX=1 in the extended properties of the connection string.
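As an illustration of building this connection string dynamically, the Python sketch below assembles it from a file path using only the provider names and extended properties described above. Inside a package you would more likely use an SSIS expression or a Script Task; the function name and its parameters are hypothetical, and Python is used here only to show the string assembly.

```python
import ntpath  # Windows-style path handling that works on any platform


def excel_connection_string(path, has_header=True, import_as_text=False):
    """Build an Excel OLE DB connection string from the workbook's extension."""
    extension = ntpath.splitext(path)[1].lower()
    if extension == ".xls":       # Excel 97-2003 workbook
        provider, version = "Microsoft.Jet.OLEDB.4.0", "Excel 8.0"
    elif extension == ".xlsx":    # Excel 2007 workbook
        provider, version = "Microsoft.ACE.OLEDB.12.0", "Excel 12.0"
    else:
        raise ValueError("Unsupported Excel file extension: " + extension)

    extended = [version, "HDR=YES" if has_header else "HDR=NO"]
    if import_as_text:
        extended.append("IMEX=1")  # read mixed-type columns as strings

    return 'Provider={}; Data Source={};Extended Properties="{}";'.format(
        provider, path, ";".join(extended)
    )


print(excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xlsx"))
print(excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xls", import_as_text=True))
```

Called with the Excel 2007 path, the function reproduces the second connection string shown earlier; passing `import_as_text=True` appends IMEX=1 to the extended properties.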
If you will be deploying this connection manager to a 64-bit server, which is most likely the case these days, you will need to run the package in 32-bit mode, as both of these providers are available in 32-bit versions only. You will need to run the package using the 32-bit version of dtexec.exe from the 32-bit area, which is by default in the C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn folder.
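If you script package execution, the point to get right is that the 32-bit executable is addressed by its full path rather than resolved through the PATH variable. The Python sketch below only assembles such a command line. The helper name and the package path are hypothetical, the folder is the default location quoted above, and /File is the dtexec option for running a package stored in the file system.

```python
import ntpath  # Windows-style path handling that works on any platform

# Default folder of the 32-bit tools, as given in the text above.
DTEXEC_32_DIR = r"C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn"


def dtexec_32bit_command(package_path, binn_dir=DTEXEC_32_DIR):
    """Return the argument list that runs a package file with the 32-bit dtexec."""
    return [ntpath.join(binn_dir, "dtexec.exe"), "/File", package_path]


# Hypothetical package path, for illustration only.
command = dtexec_32bit_command(r"C:\SSIS\Packages\LoadExcel.dtsx")
print(command)

# On the server, the list could be passed to subprocess.run(command, check=True).
```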
File Connection Manager
This connection manager enables you to reference a file or folder that already exists or is created at run time. While executing a package, Integration Services tasks and data flow components need input values for their properties to perform their functions. You can configure these input values directly within the component’s properties, or they can be read from external sources such as files or variables. When you configure a component to get this input from a file, you use the File Connection Manager. For example, the Execute SQL task executes an SQL statement, which you can type directly into the Execute SQL task or which can be read from a file.
You can use an existing file or folder, or you can create a file or a folder by using the File Connection Manager. However, you can reference only one file or folder. If you want to reference multiple files or folders, you must use a Multiple Files Connection Manager, described a bit later.
To configure this connection manager, choose from the four available options in the Usage Type field of the File Connection Manager Editor. Your choice in this field sets