Hands-On Microsoft SQL Server 2008 Integration Services part 10 pps

Summary

You created an Integration Services blank project in Chapter 1. In this chapter, you created packages using the SQL Server Import and Export Wizard and then added those packages to your blank project. You also created a package directly in BIDS, again using the SQL Server Import and Export Wizard. Above all, you explored those packages by opening component properties and configurations, and you should now better understand how an Integration Services package is constituted. Last, you worked with the Data Profiling Task to identify quality issues with your data. In the next chapter, you will learn about the basic components, the nuts and bolts of Integration Services packages, before jumping in to build complex packages in Chapter 4 using various preconfigured components provided in BIDS.

Figure 2-14 Column Length Distribution Profiles

Nuts and Bolts of the SSIS Workflow

In This Chapter

• Integration Services Objects
• Solutions and Projects
• File Formats
• Connection Managers
• Data Sources and Data Source Views
• SSIS Variables
• Precedence Constraints
• Integration Services Expressions
• Summary

So far, you have moved data using the SQL Server Import and Export Wizard and viewed the packages created by opening them in the Business Intelligence Development Studio (BIDS). In this chapter, you will extend your learning by understanding the nuts and bolts of Integration Services, such as the use of variables, connection managers, precedence constraints, and SSIS expressions. If you have used Data Transformation Services (DTS 2000), you may grasp these topics quickly; however, much about them is new in Integration Services. Usability and management of variables have been greatly enhanced, connectivity needs for packages are now satisfied by connection managers, enhanced precedence constraints have been included to give you total control over the package workflow, and, above all, the SSIS Expression language offers a powerful programming interface to let you generate values at run time.

Integration Services Objects

Integration Services performs its operations with the help of various objects and components such as connection managers, sources, tasks and transformations, containers, event handlers, and destinations. All these components are threaded together to achieve the desired functionality; that is, they work hand in hand, yet they can be configured separately.

A major enhancement Microsoft made in evolving DTS 2000 into Integration Services is the separation of workflow from data flow. SSIS provides two different designer surfaces, which are effectively different integrated development environments (IDEs) for developing packages. You design and configure workflow in the Control Flow Designer surface, and data movement and transformations in the Data Flow Designer surface. Each designer environment provides its own components, and the Toolbox window is unique to each environment.

The following objects are involved in an Integration Services package:

• Integration Services package: The top-level object in the SSIS component hierarchy. All the work performed by SSIS tasks occurs within the context of a package.
• Control flow: Helps to build the workflow in an ordered sequence using containers, tasks, and precedence constraints. Containers provide structure to the package and a looping facility, tasks provide functionality, and precedence constraints build an ordered workflow by connecting containers, tasks, and other executables in an orderly fashion.
• Data flow: Helps to build the data movement and transformations in a package using data adapters and transformations in ordered sequential paths.
• Connection managers: Handle all the connectivity needs.
• Integration Services variables: Help to reuse or pass values between objects and provide a facility to derive values dynamically at run time.
• Integration Services event handlers: Respond to events occurring at run time.
• Integration Services log providers: Capture information when log-enabled events occur at run time.

To enhance the learning experience while you are working with the SSIS components, you will first be introduced to the easier and more frequently used objects, and later be presented with the more complex configurations.

Solutions and Projects

Integration Services offers different environments for developing and managing your SSIS packages. SSIS packages are typically designed and developed in a development environment using BIDS, while SQL Server Management Studio can be used to deploy, manage, and run packages, though there are other options for deploying and managing packages, as you will study in Chapter 13. Both environments have special features and toolsets to help you perform these jobs efficiently.

While BIDS has the whole toolset to develop and deploy SSIS packages, SQL Server Management Studio cannot be used to edit or design Integration Services solutions or projects. However, in both environments, you use solutions and projects to organize and manage your files and code in a logical, hierarchical manner. A solution is a container that allows you to bring together scattered projects so that you can organize and manage them as one unit. In general, you will use a solution to focus on one area of the business, such as one solution for accounts and a separate solution for marketing. However, complex business problems may require multiple solutions to achieve specific objectives. Figure 3-1 shows a solution that not only contains multiple projects but also includes projects of multiple types: an Analysis Services project with a Sales cube, an Integration Services project with two SSIS packages, and a Reporting Services project with a Monthly Sales report, all in one solution.

Within a solution, one or more projects, along with related files for databases, connections, scripts, and miscellaneous files, can be saved together. Not only can multiple projects be stored under one solution, but so can multiple types of projects. For example, while working in BIDS, you can store a data transformation project as well as a data-mining project under the same solution. Grouping multiple projects in one solution has several benefits, such as reduced development time, code reusability, interdependency management, settings management for all the projects at a single location, and the facility to save all the projects to Visual SourceSafe or Team Foundation Server in the same hierarchical manner as in your development environment. Both SQL Server Management Studio and BIDS provide templates for working with different types of projects. These templates provide appropriate environments (designer surfaces, scripts, connections, and so on) for each project with which you are working.

When you create a new project, Visual Studio tools automatically generate a solution for you while giving you the option to create a separate folder for the solution. If you don’t choose to create a directory for the solution, the solution file is created in the same folder as the other project files; if you do, a solution folder is created with the project folder beneath it as a subfolder. Either way, you get a hierarchical structure to which you can then add various projects and items (data sources, data source views, SSIS packages, scripts, miscellaneous files) as and when required. Solution Explorer lists the projects and the files contained in them in a tree view that helps you to manage the projects and the files (as shown in Figure 3-1). The logical hierarchy reflected in the tree view of a solution does not necessarily relate to the physical storage of files and folders on the hard disk drive, however. Solution Explorer provides the facility to integrate with Visual SourceSafe or Team Foundation Server for version control, which is a great feature when you want to track changes or roll back code.

Figure 3-1 Solution Explorer showing a solution with different types of projects

File Formats

Whenever an ETL tool has to integrate with legacy systems, mainframes, or any other proprietary database systems, the easiest way to transfer data between the systems is to use flat files. Integration Services can deal with flat files in fixed-width, delimited, and ragged-right formats. For the benefit of users who are new to the ETL world, these formats are explained next.

Fixed Width

If you have been working with mainframes or legacy systems, you may be familiar with this format. Fixed-width files use different widths for columns, but the chosen width per column stays fixed for all the rows, regardless of the contents of those columns. If you open such a file, you will likely see lots of blank spaces between columns. As most of the data in a column with variable data tends to be shorter than the width provided, you’ll see a lot of wasted space. As a result, these files are likely to be larger than files in the other formats.
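To make the format concrete, here is a minimal Python sketch of reading fixed-width records by position. The column names, offsets, and sample rows are hypothetical and chosen only for illustration; they do not come from any file used in this book.

```python
# Hypothetical layout: (column name, start offset, width) for each column.
LAYOUT = [("id", 0, 5), ("name", 5, 12), ("city", 17, 10)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one record by position and strip the padding spaces."""
    return {name: line[start:start + width].strip()
            for name, start, width in layout}


# Every row is padded to the same total width, whatever the values are.
rows = [
    "00001" + "Ann".ljust(12) + "Leeds".ljust(10),
    "00002" + "Bartholomew".ljust(12) + "Cardiff".ljust(10),
]

records = [parse_fixed_width(row) for row in rows]
for record in records:
    print(record)
```

Both rows occupy the same number of characters even though "Ann" is much shorter than "Bartholomew", which is the wasted space described above.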

Delimited

Delimited files are the most common format used by systems to exchange data with foreign systems. They separate the columns using a delimiter such as a comma or tab and typically use a character combination (for example, a carriage return plus linefeed pair, {CR}{LF}) to delimit rows or records. Generally, importing data in this format is quite easy, unless the delimiter also appears in the data. For example, if users are allowed to enter free-form data in a field, some users may type a comma while entering notes, and this comma will be treated as a column delimiter and will distort the whole row. The free-format entry conflicts with the delimiter, and data is imported into the wrong columns. Because of such potential conflicts, you need to pay particular attention to the quality of the data you are dealing with when choosing a delimiter. Delimited files are usually smaller than fixed-width files, as the padding space is replaced by a delimiter.
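The delimiter conflict can be illustrated with a short Python sketch. The column names and the sample note are hypothetical; the point is only that the same record splits into the wrong number of columns when the delimiter also occurs in the data, and splits correctly when a delimiter absent from the data (here, a tab) is chosen.

```python
COLUMNS = ["id", "customer", "notes"]  # hypothetical column list


def split_record(line, delimiter):
    """Drop the {CR}{LF} row delimiter, then split on the column delimiter."""
    return line.rstrip("\r\n").split(delimiter)


# A free-form note in which the user typed a comma.
note = "Call back Monday, before noon"

comma_line = ",".join(["17", "Ann", note]) + "\r\n"
tab_line = "\t".join(["17", "Ann", note]) + "\r\n"

comma_fields = split_record(comma_line, ",")
tab_fields = split_record(tab_line, "\t")

print(len(comma_fields), comma_fields)  # one field too many
print(len(tab_fields), tab_fields)      # matches the column list
```

With the comma delimiter, the note is cut in two and the row no longer lines up with the expected columns.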

Ragged Right

If you have a fixed-width file in which the rightmost column is nonuniform, and you want to save some space, you can add a delimiter (such as {CR}{LF}) at the end of the row and make it a ragged-right file. Ragged-right files are similar to fixed-width files except that they use a delimiter to mark the end of a row or record; that is, in ragged-right files, the last column is of variable size. This makes the file easier to work with when displayed in Notepad or imported into an application. Also, some vendors use this format when they want the flexibility to change the number of columns in the file. In such situations, they keep all the regular columns (the columns that always exist) in the first part of the file and combine the columns that may or may not exist into a single string of data at the end of the row. The length of this last column varies with the columns that have been included. Applications generally use substring logic to separate the individual columns out of the variable-length combined column.
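As a rough illustration, the following Python sketch reads ragged-right records: the leading columns are sliced by position, and everything from a fixed offset up to the row delimiter is taken as the variable-length last column. The layout and sample rows are hypothetical.

```python
# Hypothetical layout: fixed-width leading columns, then a variable-length tail.
FIXED_COLUMNS = [("id", 0, 5), ("name", 5, 10)]
TAIL_START = 15  # offset at which the variable-length last column begins


def parse_ragged_right(line):
    """Slice the fixed columns; the rest of the row is the last column."""
    line = line.rstrip("\r\n")  # remove the {CR}{LF} row delimiter
    record = {name: line[start:start + width].strip()
              for name, start, width in FIXED_COLUMNS}
    record["tail"] = line[TAIL_START:]
    return record


lines = [
    "00001" + "Ann".ljust(10) + "short\r\n",
    "00002" + "Bob".ljust(10) + "a much longer trailing value\r\n",
]

parsed = [parse_ragged_right(line) for line in lines]
for record in parsed:
    print(record)
```

When the last column packs several optional columns together, the same slicing approach can be applied to the `tail` value once its internal layout is known.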

Connection Managers

As data grows in scattered places, it is the job of the information analyst to bring it all together to draw out pertinent information. The biggest problem in bringing such data sets together and merging them into a single storage location is how to handle different data sources, such as legacy mainframe systems, Oracle databases, flat files, Excel spreadsheets, Microsoft Access files, and so on. The connection managers provided in Integration Services come to the rescue.

In Chapter 2, you saw how connection managers were used inside a package to import data. The components defined inside an Integration Services package require that physical connections be made to data stores at run time. The source adapter reads data from the data source and passes it on to the data flow for transformations, while the destination adapter loads the transformed data into the destination store. Extraction and loading components are not the only ones that require connections. For example, the Lookup transformation reads values from a reference table to perform transformations based on the values in the lookup table. Logging and auditing requirements also need connections to storage systems such as databases or text files.

A connection manager is a logical representation of a connection. You use a connection manager to describe the connection properties at design time, and Integration Services interprets these to make a physical connection at run time. For example, at design time you can set a connection string property within a connection manager, which the Integration Services run-time engine then reads to make a physical connection. A connection manager is stored in the package metadata and cannot be shared with other packages.

Connection managers enhance connection flexibility. Multiple connection managers of the same type can be created to meet the needs of Integration Services packages and enhance performance. For example, a package can use, say, five OLE DB connection managers, all built on the same data connection.

You can add connection managers to your package using one of the following methods in BIDS:

• Choose New Connection from the SSIS menu.
• Choose the New Connection command from the context menu that opens when you right-click the blank surface in the Connection Managers area.
• Add a connection manager from within the editor or advanced editor dialog boxes of some of the tasks, transformations, source adapters, and destination adapters that require a connection to a data store.

The connection managers you add to the package at design time appear in the Connection Managers area in the BIDS designer surfaces, but they do not appear in the Connection Managers collection in Package Explorer until you run the package successfully for the first time. At run time, Integration Services resolves the settings of all the added connections, sets the connection manager properties for each of them, and then adds them to the Connection Managers collection in Package Explorer.

You will be using many of the connection managers in Hands-On exercises while you create solutions for business problems later on. For now, open BIDS, create a new blank project, and check out the properties of all the connection managers as you read through the following descriptions. Figure 3-2, which appears in the later section “Microsoft Connector 1.0 for SAP BI,” shows the list of all the connection managers provided in SQL Server 2008 Integration Services.

ADO Connection Manager

The ADO Connection Manager enables a package to connect to an ADO recordset. This connection manager has been provided mainly for legacy support. You will most likely use it when you’re working with a legacy application that uses ActiveX Data Objects (ADO) to connect to its data sources. You might also have to use this connection manager when developing a custom component for an environment where such a legacy application is used.

ADO.NET Connection Manager

The current model of software applications is very different from the earlier connected, tightly coupled client/server scenario, where a connection was held open for the lifetime of the application. Now you have varied types of data stores, and these data stores are being hit with several hundred connections every minute. ADO.NET overcomes these shortcomings and provides disconnected data access, integration with XML, optimized interaction with databases, and the ability to combine data from numerous data sources. These features make ADO.NET connection managers quite reliable and flexible, with lots of options; however, they might be a little slower than the customized or dedicated connection managers for a particular source. You can also have consistent access to data sources using ADO.NET providers. The ADO.NET Connection Manager provides access to data sources, such as SQL Server or sources exposed through OLE DB or XML, using a .NET provider. You can choose from the .NET Framework Data Provider for SQL Server (SqlClient), the .NET Framework Data Provider for Oracle (OracleClient), the .NET Framework Data Provider for ODBC (Open Database Connectivity), and the .NET Framework Data Provider for OLE DB. The configuration options of the ADO.NET Connection Manager change depending on the choice of .NET provider.

Cache Connection Manager

The Cache Connection Manager is primarily used for creating a cache for the Lookup Transformation. When you have to run a Lookup Transformation repeatedly in a package, or have to share the reference (lookup) data set among multiple packages, you might prefer to persist this cache to a file to improve performance. You would then use a Cache Transformation, which in turn uses the Cache Connection Manager to write the cached information to a cache file (.caw). Later, in Chapter 10, “Data Flow Transformations,” when you work with the Lookup Transformation, you will use this connection manager to cache data to a file.

Excel Connection Manager

This connection manager provides access to a Microsoft Excel workbook file. It is used when you add an Excel Source or Excel Destination to your package. With the launch of Excel 2007, the data provider for Excel changed from the Microsoft Jet OLE DB Provider to the OLE DB provider for the Microsoft Office 12.0 Access Database Engine. If you check the ConnectionString property of the Excel Connection Manager after adding it using the Microsoft Excel 97-2003 version, you will see the provider listed as Microsoft.Jet.OLEDB.4.0, whereas the property shows the provider as Microsoft.ACE.OLEDB.12.0 when you add the Excel Connection Manager using the Microsoft Excel 2007 version. It is important to understand the connection string, as you may need to write it yourself in some packages, for example, if you’re getting the file path at run time and want to create the connection string dynamically. Here is the connection string shown for both versions of the Excel driver:

RawDataTxt.xls;Extended Properties="Excel 8.0;HDR=YES";

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\SSIS\RawFiles\RawDataTxt.xlsx;Extended Properties="Excel 12.0;HDR=YES";

Note the differences between the providers for the two versions, as explained earlier. You also need to specify some additional properties in the extended properties section. First, you use Excel 8.0 for Excel versions 97, 2000, 2002, and 2003, and Excel 12.0 for Excel 2007. Second, you use the HDR property to specify whether the first row has column names. The default value is YES; that is, if you do not specify this property, the first row is deemed to contain column names. Also, the Excel driver sometimes fails to pick up values in columns where string and numeric values are mixed. By default, the driver samples the first eight rows to determine the data type of a column and returns nulls for values of any other data type in that column. You can override this behavior by importing all the values as strings using the import mode setting IMEX=1 in the extended properties of the connection string.
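When the file path is known only at run time, the connection string has to be assembled from its parts. The following Python sketch shows one way to do that outside of SSIS, using the provider names and Excel version values given in the text. The function name and its parameters are hypothetical, and the sketch assumes the file extension alone identifies the Excel version.

```python
import ntpath  # Windows path rules, usable on any platform

# Provider and Excel version per file extension, as described in the text.
EXCEL_SETTINGS = {
    ".xls": ("Microsoft.Jet.OLEDB.4.0", "Excel 8.0"),
    ".xlsx": ("Microsoft.ACE.OLEDB.12.0", "Excel 12.0"),
}


def excel_connection_string(path, first_row_has_names=True, import_as_text=False):
    """Build an Excel OLE DB connection string for a workbook path."""
    extension = ntpath.splitext(path)[1].lower()
    if extension not in EXCEL_SETTINGS:
        raise ValueError("Unsupported workbook extension: " + repr(extension))
    provider, version = EXCEL_SETTINGS[extension]

    properties = [version, "HDR=YES" if first_row_has_names else "HDR=NO"]
    if import_as_text:
        properties.append("IMEX=1")  # import mixed-type columns as strings

    return 'Provider={};Data Source={};Extended Properties="{}";'.format(
        provider, path, ";".join(properties))


print(excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xls"))
print(excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xlsx",
                              import_as_text=True))
```

Inside a package, the same string would more typically be produced by a property expression on the connection manager; the sketch only shows how the pieces fit together.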

If you will be deploying this connection manager to a 64-bit server, which is most likely the case these days, you will need to run the package in 32-bit mode, because both providers are available only in 32-bit versions. You will need to run the package using the 32-bit version of dtexec.exe from the 32-bit area, which is by default in the C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn folder.
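As a small illustration, the Python sketch below assembles the command line for running a package file with the 32-bit dtexec.exe. It assumes the default installation folder named above and uses the dtexec /F (file) option; the helper name and the package path are hypothetical.

```python
import ntpath  # Windows path rules, usable on any platform

# Default 32-bit DTS folder; adjust for a non-default installation.
DTEXEC_32_DIR = r"C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn"


def dtexec_32bit_command(package_path, binn_dir=DTEXEC_32_DIR):
    """Return the argument list that runs a package file with 32-bit dtexec."""
    return [ntpath.join(binn_dir, "dtexec.exe"), "/F", package_path]


# Hypothetical package location, for illustration only.
command = dtexec_32bit_command(r"C:\SSIS\Packages\LoadExcel.dtsx")
print(command)
```

On the server, the resulting list could be passed to `subprocess.run(command, check=True)` to start the package.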

File Connection Manager

This connection manager enables you to reference a file or folder that already exists or is created at run time. While executing a package, Integration Services tasks and data flow components need input values for their properties to perform their functions. You can configure these input values directly within the component’s properties, or they can be read from external sources such as files or variables. When you configure a component to get this input from a file, you use the File Connection Manager. For example, the Execute SQL task executes an SQL statement, which you can enter directly in the Execute SQL task or which can be read from a file.

You can use an existing file or folder, or you can create a file or a folder by using the File Connection Manager. However, you can reference only one file or folder. If you want to reference multiple files or folders, you must use a Multiple Files Connection Manager, described a bit later.

To configure this connection manager, choose from the four available options in the Usage Type field of the File Connection Manager Editor. Your choice in this field sets
