In this Hands-On exercise, you've seen how the Progress tab shows the number of executions of a task along with other information. This tab also displays alert, warning, and error messages while the package executes. These messages remain available after the package completes execution, when the Progress tab becomes the Execution Results tab. You've also used the Locals window to see the values of variables that get updated with each iteration of the Foreach Loop container. Then you changed the breakpoint setting and jumped four steps at a time; this feature helps a lot when you are debugging packages with large data sets and need to check multiple points of failure. The most interesting thing you've learned is how to change a variable value on the fly and inject that value into the package execution. Using this feature, you can easily simulate various values and see how the package responds to them.
Figure 15-2 Changing variable value at run time using the Watch 1 window
Logging Feature
The logging feature of Integration Services can be used to debug task failures and monitor performance issues at run time. A properly configured log schema for run-time events can be an effective tool in your debugging toolkit. Logging has been discussed in detail in Chapter 8, along with a discussion of the various log providers. Refer to that chapter for more details on how to configure logging in your package.
Precedence Constraints and Data Flow Paths
You've used these two connecting elements in package development, so you now understand that precedence constraints connect control flow tasks, whereas data flow paths connect data flow components. You can use precedence constraints to control the workflow in a package by setting the evaluation operation to a constraint, an expression, or a combination of both. Using a constraint as the evaluation operation, you can configure the connected task to run on the basis of the success, failure, or completion of the initial task. If you use an expression as the evaluation operation, you can specify a Boolean expression whose evaluation to True lets the connected task execute. You can also choose to use both a constraint and an expression, combined with an AND or an OR operator, to specify when the next task in the workflow can execute. When multiple tasks are connected to one task, that task is governed by multiple precedence constraints; you can set it to run when any one of the constraints evaluates to True, or leave the default setting, which executes the connected task only when all the constraints evaluate to True. By configuring precedence constraints properly, you can control the order of execution of tasks in the workflow at a granular level.
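As a minimal sketch, assuming a hypothetical user variable named User::RowCount that the initial task populates, the Precedence Constraint Editor settings might look like this:

    Evaluation operation:  Expression and Constraint
    Value:                 Success
    Expression:            @[User::RowCount] > 0

With these settings, the connected task runs only when the initial task succeeds and the expression evaluates to True, that is, when at least one row was processed.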
As with precedence constraints, you use data flow paths in the pipeline to connect components. However, data flow paths do not apply any constraints to the data flow the way precedence constraints do to control flow tasks. Several of the data flow components support error outputs, which are represented by red-colored data flow paths. You can configure the error outputs of components to handle the rows that do not pass through the main output successfully and could cause truncation of data or failure of the component. These failing rows can be passed to error outputs to be treated separately from the main pipeline. You can then log these rows to be dealt with later, or you can be more productive by configuring an alternate data flow that fixes the errors in the data for such failing rows and puts them back into the main data flow. This ability to fix errors in the data while processing the bulk of it is extremely powerful and easy to use. So, in the simplest case, you can use error outputs to handle errors or failing rows in a data flow component, and in a slightly modified way, you can use them to deploy alternate pipeline paths to create more productive, more resilient, and more reliable data manipulation packages.
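The following layout is only an illustrative sketch, with hypothetical component names, of the second, more productive pattern: failing rows are redirected to the error output, repaired, and merged back into the main flow with a Union All transformation:

    OLE DB Source
      main output  -> Union All -> OLE DB Destination
      error output -> Derived Column (repair the offending columns) -> Union All

Rows that cannot be repaired can still be written to a separate error table or file for later investigation.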
Data Viewers
Data viewers are excellent debugging tools when used in a pipeline, much like the oscilloscopes used in electric circuits. With an oscilloscope you see traces of the waves and pulses of current flowing in the circuit, whereas with data viewers you see the data flowing from one component to the other in the pipeline. There is a difference between the two, though. Oscilloscopes do not tend to affect the flow of current, whereas data viewers stop the execution of the data flow engine and require you to click Continue to proceed. Data viewers are attached to the path connecting two data flow components, and at run time these data viewers pop open to show you the data in one of four formats: grid, histogram, scatter plot, or chart.

You were introduced to data viewers in Chapter 9 and used them extensively in Chapter 10. While working with them, you attached data viewers to a path and saw the data pop up in grid format on the Designer surface. If you run a package that has data viewers attached in the data flow using any method other than running it inside BIDS, the data viewers do not show up; that is, data viewers work only when the package is run inside the BIDS environment.
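For example, running the package from the command prompt with dtexec (the package path here is just a placeholder) executes the data flow without ever pausing at a data viewer:

    dtexec /FILE "C:\SSIS\MyPackage.dtsx"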
Performance Enhancements
The next step to improve your package development skills is to consider the proper use of resources. Many scenarios occur in development, staging, and production environments that can affect how packages run. For example, your server may be running other processes in parallel that are affected when an Integration Services package starts execution and puts heavy demands on server resources; your package may use sorts and aggregations that require more memory than is available; or the tables required by an Integration Services package may be locked by SQL Server to serve queries from other users.

In this section, you will learn skills to design your package and its deployment so that you can manage resources on the server for optimal utilization without affecting other services provided by the server. You will study optimization techniques to keep your packages running at peak performance levels. As with most database applications, you can enhance the performance of Integration Services by properly managing the memory allocated to various components. Integration Services packages can also be configured to work in parallel, as discussed later in the chapter.
It’s All About Memory
One common misconception is to assume that the memory management of Integration Services either is handled by the SQL Server 2005 database engine or can be managed in a similar fashion. However, Integration Services is a totally independent application that is packaged with SQL Server but runs exactly the way any other application would run. It has nothing to do with the memory management of the SQL Server database engine. It is particularly important for you to understand that Integration Services works like any other Windows application that is not aware of Advanced Windowing Extensions (AWE) memory. Hence, on 32-bit systems SSIS can use only 2GB of virtual address space per process, or 3GB if the /3GB switch is used. Here a process means a package; that is, if you have spare memory installed on the server that you want to use, you have to distribute your work among multiple packages and run them in parallel to enable these packages to use more memory at run time. This goes back to the best practice of package design that advocates a modular design for the development of more complex requirements. The package modules can be combined using the Execute Package task to form a more complex parent package.

Also, you will need to run child packages out of process from the parent package to enable them to reserve their own memory pool during execution. You can do this by setting the ExecuteOutOfProcess property to True on the Execute Package task. Refer to Chapter 5 for more details on how this property affects running packages. If a package cannot be divided into child packages and requires more memory to run, you need to consider moving to 64-bit systems, which can allocate large virtual memory to each process. Because SSIS has no AWE memory support, adding more AWE memory on a 32-bit system is irrelevant when you are executing a large SSIS package.
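For example, a parent package might be organized along these lines; the child package names are purely illustrative, and the point is that each child reserves its own memory pool when ExecuteOutOfProcess is set to True:

    ParentPackage.dtsx
      Execute Package Task -> LoadCustomers.dtsx  (ExecuteOutOfProcess = True)
      Execute Package Task -> LoadOrders.dtsx     (ExecuteOutOfProcess = True)
      Execute Package Task -> LoadProducts.dtsx   (ExecuteOutOfProcess = True)

On a 32-bit server, each child process then gets its own 2GB (or 3GB) virtual address space instead of sharing the parent's.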
64-Bit Is Here
With 64-bit chips becoming more affordable and multicore CPUs readily available, performance issues are beginning to disappear, at least for the time being. Utilizing 64-bit computer systems for online data processing, analytical systems, and reporting systems makes sense. If you are up against a small time window and your data processing needs are still growing, you need to consider moving to 64-bit technology. The benefits of this environment not only outweigh the cost of moving to 64-bit, but it may actually turn out to be the cheaper option when you're dealing with millions of transactions on several 32-bit systems and need to scale up.
The earlier 64-bit versions in SQL Server 2000 provided limited options, whereas in SQL Server 2005 and later, the 64-bit edition provides the same feature set as the 32-bit edition with enhanced performance. All the components of SQL Server 2005 and later versions can run in native 64-bit mode, thus eliminating the earlier requirement to have a separate 32-bit computer to run the tools. In addition, the WOW64 mode of Microsoft Windows allows 32-bit applications to run on a 64-bit operating system. This is a cool feature of Microsoft Windows, as it lets third-party software without 64-bit support coexist with SQL Server on the 64-bit server.
While discussing the advantages of 64-bit technology, let's quickly go through the following architectural and technical advantages of moving to a 64-bit environment:

- Compared to 32-bit systems, which are limited to 4GB of address space, 64-bit systems can support up to 1,024 gigabytes of both physical and addressable memory. 32-bit systems must use Address Windowing Extensions (AWE) to access memory beyond the 4GB limit, which has its own limitations. The increase in directly addressable memory in the 64-bit architecture enables it to perform more complex and resource-intensive queries easily without swapping out to disk. Also, 64-bit processors have larger on-die caches, enabling them to use processor time more efficiently. The transformations that deal with row sets instead of row-by-row operations, such as the Aggregate transformation, and the transformations that cache data to memory for lookup operations, such as the Lookup and Fuzzy Lookup transformations, benefit from the increased availability of memory.

- The improved bus architecture and parallel processing abilities provide almost linear scalability with each additional processor, yielding higher returns per processor when compared to 32-bit systems.

- The wider bus architecture of 64-bit processors enables them to move data more quickly between the cache and the processor, which results in improved performance.
With more benefits and increased affordability, deployment of 64-bit servers is growing and will eventually replace 32-bit servers. The question you may ask is, what are the optimal criteria for making the switch? As an Integration Services developer, you need to determine when your packages start suffering from performance problems and start requiring more resources. The following scenarios may help you identify such situations:

- With the increase in the volume of data to be processed, especially where Sort and Aggregate transformations are involved, the pressure on memory also increases, causing large data sets that cannot fit in the memory space to be swapped out to hard disks. Whenever this swapping of data to hard disks starts occurring, you will see massive performance degradation. You can use Windows performance counters to capture this situation (see the example after this list) so that when it happens, you can add more memory to the system. However, if you've already run out of full capacity and there is no more room to grow, or if you are experiencing other performance issues with the system, your best bet is to replace it with a newer, faster, and beefier system. Think about 64-bit seriously, analyze cost versus performance, and go for it.

- If you are running SSIS packages on a database system that is in use at the time when these packages run, you may encounter performance issues and your packages may take much longer to finish than you would expect. This may be due to data sets being swapped out of memory and also due to processor resource allocation. Such systems will benefit most from the improved parallelization of processes within 64-bit systems.
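As a hedged illustration, the typeperf command below samples two counters that are commonly watched for this kind of memory pressure; the exact names of the SSIS counter objects vary by SQL Server version and instance, so treat them as examples rather than a definitive list:

    typeperf "\Memory\Pages/sec" "\SQLServer:SSIS Pipeline\Buffers spooled"

A Buffers spooled value that keeps climbing during package execution indicates that the data flow has started writing buffers to disk.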
The next step is to understand the requirements, limitations, and implications associated with the use of 64-bit systems. Review the following list before making a final decision:
- Not all utilities that are available in 32-bit editions are available in 64-bit editions. The only utilities available in the 64-bit edition are dtutil.exe, dtexec.exe, and DTSWizard.exe (SQL Server Import and Export Wizard).

- You might have issues connecting to data sources in a 64-bit environment due to a lack of providers. For example, when you're populating a database in a 64-bit environment, you must have 64-bit OLE DB providers available for all your data sources.

- As DTS 2000 components are not available for 64-bit editions of SQL Server 2000, SSIS has no 64-bit design-time or run-time support for DTS packages. Because of this, you also cannot use the Execute DTS 2000 Package task in packages you intend to run on a 64-bit edition of Integration Services.

- By default, SQL Server 64-bit editions run jobs configured in SQL Server Agent in 64-bit mode. If you want to run an SSIS package in 32-bit mode on a 64-bit edition, you can do so by creating a job with the Operating System job step type and invoking the 32-bit version of dtexec.exe from the command line or a batch file (see the example after this list). The SQL Server Agent uses the Registry to identify the correct version of the utility. However, on a 64-bit version of SQL Server Agent, if you choose Use 32-Bit Runtime on the Execution Options tab of the New Job Step dialog box, the package will run in 32-bit mode.

- Not all .NET Framework data providers and native OLE DB providers are available in 64-bit editions. You need to check the availability of all the providers you've used in your package before deploying your packages to a 64-bit edition.

- If you need to connect to a 64-bit system from the 32-bit computer where you're building your Integration Services package, you must have a 32-bit provider installed on the local computer along with the 64-bit version, because the 32-bit SSIS Designer displays only 32-bit providers. To use the 64-bit provider at run time, you simply make sure that the Run64BitRuntime property of the project is set to True, which is the default.
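For instance, an Operating System (CmdExec) job step could call the 32-bit utility explicitly; the path below assumes a default SQL Server 2005 installation on a 64-bit server and a placeholder package location, so adjust both for your environment:

    "C:\Program Files (x86)\Microsoft SQL Server\90\DTS\Binn\DTExec.exe" /FILE "D:\SSIS\MyPackage.dtsx"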
After you have sorted out the infrastructure requirements of having a 32-bit or a 64-bit system, your next step toward performance improvement is to design and develop packages that use the available resources optimally, without causing issues for other applications running on the same server. In the following sections, you will learn some of the design concepts before learning to monitor performance.
Architecture of the Data Flow
The Data Flow task is a special task in your package design that handles data movement, data transformation, and data loading into the destination store. This is the main component of an Integration Services package and determines how your package is going to perform when deployed on production systems. Typically, your package can have one or more Data Flow tasks, and each Data Flow task can have one or more Data Flow sources to extract data; none, one, or more Data Flow transformations to manipulate data; and one or more Data Flow destinations. All your optimization research revolves around these components; here you will discover the values of the properties you need to modify to enhance performance. To be able to decipher the code and the logs, you need to understand how the data flow engine works and what key terms are used in its design.
When a data flow source extracts data from the data source, it places that data in chunks in memory. The memory allocated to these chunks of data is called a buffer. A memory buffer is nothing more than an area in memory that holds rows and columns of data. You've used data viewers earlier in various Hands-On exercises; these data viewers show the data stored in one buffer at a time. If the data is spread over more than one buffer, you click Continue on the data viewer to see the data buffer by buffer. In Chapter 9, you saw how data viewers show data in the buffers, and you also explored two other properties of the Data Flow task, which are discussed here as well.
The Data Flow task has a property called DefaultBufferSize, which is set to 10MB by default (see Figure 15-3). Based on the number of columns, that is, the row size of the pipeline data, and keeping some contingency for performance optimizations (if a column is derived, it can be accommodated in the same buffer), Integration Services calculates the number of rows that can fit in a buffer. However, if your row width is small, that doesn't mean Integration Services will fit as many rows as could be accommodated in the buffer. Another property of the Data Flow task, DefaultBufferMaxRows, restricts a buffer from including more than a specified number of rows, in case there are too few columns in the data set or the columns are too narrow. This design is meant to maximize memory utilization while still keeping the memory requirements within a package predictable. You can also see another property, EngineThreads, in the same figure; it is discussed a bit later in this chapter.

Figure 15-3 Miscellaneous properties of the Data Flow task
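As a rough worked example (the row width is assumed for illustration and buffer overhead is ignored), suppose each pipeline row is about 500 bytes wide and the defaults of DefaultBufferSize = 10,485,760 bytes and DefaultBufferMaxRows = 10,000 are in effect:

    rows that fit by size    = 10,485,760 / 500 bytes per row ≈ 20,971
    rows allowed per buffer  = min(20,971, DefaultBufferMaxRows) = 10,000

Here the 10,000-row cap is the limiting factor; with a 2,000-byte row, the size limit (roughly 5,242 rows) would apply instead, so tuning either property changes how many rows travel in each buffer.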
While the data flow is executing, you see that the data is extracted by Data Flow sources, passed on to downstream transformations, and finally lands in the Data Flow destination adapter. By looking at the execution view of a package, you might think that the data buffers move from component to component and that data flows from one buffer to another. This is in fact not entirely correct. Moving data from one buffer to another is quite an expensive process and is avoided by several components of the Integration Services data flow engine. Data Flow transformations are classified into different categories based on their methods of handling and processing data in buffers. Some components actually traverse the memory buffers and make changes only in the data columns that need to be changed, while most other data in the memory buffer remains as is, thus saving the costly data movement operation. SSIS has other components that do require actual movement of data from one buffer to another to perform the required operation. So, it depends on what type of operation is required and what type of transformation is used to achieve the objective. Based on the functions they perform, Data Flow transformations may use different types of outputs that may or may not be synchronous with the input. This has quite an impact on whether changes can be made in place or data must be moved to new memory buffers. Let's discuss the synchronous and asynchronous nature of Data Flow transformations before discussing their classifications.
Synchronous and Asynchronous Transformations
Data Flow transformations can have either synchronous or asynchronous outputs. Components whose outputs are synchronous to their inputs are called synchronous components. These transformations make changes to the incoming rows by adding or modifying columns, but they do not add any rows to the data flow; for example, the Derived Column transformation modifies existing rows by adding a new column. If you go to the Input And Output Properties tab in the Advanced Editor for these transformations, you may notice that they do not add all the columns to the Output Columns section, but add only the newly created columns, because all other columns that are available at the input are, by default, available at the output also. This is because these transformations do not move data from one buffer to another; instead, they make changes to the required columns while keeping the data in the same buffer. So the operation of these transformations is quick, and they do not place a heavy load on the server. Hence, a transformation with synchronous outputs processes and passes the input rows immediately to the downstream components.

One point to note here is that, because these transformations do not move data between buffers, you cannot develop a synchronous transformation for operations that do need to move data between buffers, such as sorting and aggregating. Synchronous transformations do not block data buffers, and their outputs are readily available to the downstream components. Because the data stays in the same buffer within an execution tree (you will learn more about execution trees later in the chapter) where only synchronous transformations are used, any change you make to the metadata, for instance the addition of a column, gets applied at the beginning of the execution tree and not at the point where the synchronous component runs in the data flow. For example, if a Derived Column transformation adds a column to the data flow, the column is added at the beginning of the execution tree in which this transformation lies. This helps in calculating the correct memory allocation for the buffer that is created the first time for this execution tree. The column is not visible to you before the Derived Column transformation, but it exists in the buffer.
While synchronous transformations are fast and lightweight when you are performing derivations and simple calculations on columns, you will need to use transformations that have asynchronous outputs in other situations, such as when you are performing operations on multiple buffers of data, for example aggregating or sorting data. Asynchronous transformations move data from input buffers to new output buffers so that they can perform transformations that are otherwise not possible by processing individual rows. Typically, these transformations add new rows to the data flow; for example, an Aggregate transformation adds new rows, most likely with new metadata, containing aggregations of the data columns. The buffers that arrive at the input of an asynchronous transformation are not the buffers that leave at the asynchronous output. For example, a Sort transformation collects the rows, decides the order in which to place the rows, and then writes them to new output buffers. These components act as a data source and a data destination at the same time. These transformations provide new rows and columns for the downstream components and hence have to work harder. While developing custom components, you can develop a transformation that has both synchronous and asynchronous outputs, though no such component has been provided in the collection of prebuilt transformations.

If you go to the Input And Output Properties tab in the Advanced Editor for these transformations, you may notice that they define new columns in the Output Columns section. While waiting for the complete data set, these transformations block the data buffers and slow down the processing. These transformations also put a heavy load, or huge memory demands, on the server to accommodate the complete data set in memory. If the server doesn't have enough memory, the data will be cached to disk, further degrading the performance of the package. For these reasons, you need to make sure that you use only the minimum required number of asynchronous transformations in your package and that the server has sufficient memory to support the data set that will be fed through the asynchronous transformation. You have used both of these types of transformations, which are readily available in the Toolbox in BIDS, in Chapter 10 and have also built custom transformations in Chapter 11.
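As a quick, non-exhaustive reference, some commonly used built-in transformations fall into these groups:

    Synchronous outputs:   Derived Column, Data Conversion, Copy Column, Character Map, Row Count, Conditional Split
    Asynchronous outputs:  Sort, Aggregate, Merge, Merge Join, Union All, Fuzzy Grouping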
Classifying Data Flow Transformations
Earlier, you've seen the transformations grouped together in Chapter 9. Here, you will study the three classifications of Data Flow transformations that are based mainly on performance considerations.
Nonblocking Synchronous Row-Based Transformations
These transformations work on a row-by-row basis and modify the data by changing the data in columns, adding new columns, or removing columns from the data rows, but these components do not add new rows to the data flow. The rows that arrive at the input are the rows that leave at the output of these transformations. These transformations have synchronous outputs and make data rows available to the downstream components straightaway. In fact, these transformations do not cause data to move from one buffer to another; instead, they traverse the data buffer and make the changes to the data columns. So these transformations are efficient and lightweight from a processing point of view. Their operation is quite visible in BIDS, and you can quickly make out the differences by looking at the execution of the components in run-time mode. However, bear in mind that the run-time display in BIDS lags behind the actual execution of the package. This delay becomes quite noticeable, especially when you are running a heavy processing package with a large data set and complex transformations. The run-time engine gets so busy with the actual processing, which it prioritizes, that reporting back to BIDS falls behind, so you can't rely totally on what you see in BIDS. However, for our test cases, you don't need to worry about these delays.

When you execute a package in BIDS, look at the data flow execution of the package. You may see certain components processing together, with the annotations for row numbers ticking along, even though they may be placed one after another in the pipeline; you may also notice that certain transformations do not pass data to the downstream components for some time. The transformations that readily pass the data to the downstream components are classified as Row transformations. These transformations