In this Hands-On exercise, you've seen how the Progress tab shows the number of executions of a task along with other information. This tab also displays alert, warning, and error messages while the package executes. These messages remain available after the package completes execution, when the Progress tab becomes the Execution Results tab. You've also used the Locals window to see the values of variables that get updated with each iteration of the Foreach Loop container. Then you changed the breakpoint setting and jumped four steps at a time; this feature helps a lot when you are debugging packages with large data sets and need to check multiple points of failure. The most interesting thing you've learned is how to change a variable value on the fly and inject that value into the package execution. Using this feature, you can easily simulate various values and see how the package responds to them.
Figure 15-2 Changing variable value at run time using the Watch 1 window
Logging Feature
The logging feature of Integration Services can be used to debug task failures and monitor performance issues at run time. A properly configured log schema for run-time events can be an effective tool in your debugging toolkit. Logging has been discussed in detail in Chapter 8, along with a discussion of the various log providers. Refer to that chapter for more details on how to configure logging in your package.
Precedence Constraints and Data Flow Paths
You've used these two connecting elements in package development, so you now understand that precedence constraints connect control flow tasks, whereas data flow paths connect data flow components. You can use precedence constraints to control the workflow in a package by setting the evaluation operation to a constraint, an expression, or a combination of both. Using a constraint as the evaluation operation, you can configure the connected task to run on the basis of the success, failure, or completion of the initial task. If you use an expression as the evaluation operation, you can specify a Boolean expression whose evaluation to True lets the connected task execute. You can also choose to use both a constraint and an expression, combined with an AND or an OR operator, to specify when the next task in the workflow can execute. When multiple tasks are connected to one task, that task is governed by multiple precedence constraints; you can set it to run when any one of the constraints evaluates to True, or leave the default setting, which executes the connected task only when all the constraints evaluate to True. By configuring precedence constraints properly, you can control the order of execution of tasks in the workflow at a granular level.
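As a minimal sketch, assuming a hypothetical user variable named User::RowCount that the initial task populates, the Precedence Constraint Editor settings might look like this:

    Evaluation operation:  Expression and Constraint
    Value:                 Success
    Expression:            @[User::RowCount] > 0

With these settings, the connected task runs only when the initial task succeeds and the expression evaluates to True, that is, when at least one row was processed.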
As with precedence constraints, you use data flow paths in the pipeline to connect components. However, data flow paths do not apply any constraints to the data flow the way precedence constraints do to control flow tasks. Several of the data flow components support error outputs, which are represented by red-colored data flow paths. You can configure the error outputs of components to handle the rows that do not pass through the main output successfully and could cause truncation of data or failure of the component. These failing rows can be passed to error outputs to be treated separately from the main pipeline. You can then log these rows to be dealt with later, or you can be more productive by configuring an alternate data flow that fixes the errors in the data for such failing rows and puts them back into the main data flow. This ability to fix errors in the data while processing the bulk of it is extremely powerful and easy to use. So, in the simplest case, you can use error outputs to handle errors or failing rows in a data flow component, and in a slightly modified way, you can use them to deploy alternate pipeline paths to create more productive, more resilient, and more reliable data manipulation packages.
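The following layout is only an illustrative sketch, with hypothetical component names, of the second, more productive pattern: failing rows are redirected to the error output, repaired, and merged back into the main flow with a Union All transformation:

    OLE DB Source
      main output  -> Union All -> OLE DB Destination
      error output -> Derived Column (repair the offending columns) -> Union All

Rows that cannot be repaired can still be written to a separate error table or file for later investigation.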
Data Viewers
Data viewers are excellent debugging tools when used in a pipeline, much like the oscilloscopes used in electric circuits. With an oscilloscope you see traces of the waves and pulses of current flowing in the circuit, whereas with data viewers you see the data flowing from one component to the other in the pipeline. There is a difference between the two, though. Oscilloscopes do not tend to affect the flow of current, whereas data viewers stop the execution of the data flow engine and require you to click Continue to proceed. Data viewers are attached to the path connecting two data flow components, and at run time these data viewers pop open to show you the data in one of four formats: grid, histogram, scatter plot, or chart.

You were introduced to data viewers in Chapter 9 and used them extensively in Chapter 10. While working with them, you attached data viewers to a path and saw the data pop up in grid format on the Designer surface. If you run a package that has data viewers attached in the data flow using any method other than running it inside BIDS, the data viewers do not show up; that is, data viewers work only when the package is run inside the BIDS environment.
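For example, running the package from the command prompt with dtexec (the package path here is just a placeholder) executes the data flow without ever pausing at a data viewer:

    dtexec /FILE "C:\SSIS\MyPackage.dtsx"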
Performance Enhancements
The next step to improve your package development skills is to consider the proper use of resources. Many scenarios occur in development, staging, and production environments that can affect how packages run. For example, your server may be running other processes in parallel that are affected when an Integration Services package starts execution and puts heavy demands on server resources; your package may use sorts and aggregations that require more memory than is available; or the tables required by an Integration Services package may be locked by SQL Server to serve queries from other users.

In this section, you will learn skills to design your package and its deployment so that you can manage resources on the server for optimal utilization without affecting other services provided by the server. You will study optimization techniques to keep your packages running at peak performance levels. As with most database applications, you can enhance the performance of Integration Services by properly managing the memory allocated to various components. Integration Services packages can also be configured to work in parallel, as discussed later in the chapter.
It’s All About Memory
One common misconception is to assume that the memory management of Integration Services either is handled by the SQL Server 2005 database engine or can be managed in a similar fashion. However, Integration Services is a totally independent application that is packaged with SQL Server but runs exactly the way any other application would run. It has nothing to do with the memory management of the SQL Server database engine. It is particularly important for you to understand that Integration Services works like any other Windows application that is not aware of Advanced Windowing Extensions (AWE) memory. Hence, on 32-bit systems SSIS can use only 2GB of virtual address space per process, or 3GB if the /3GB switch is used. Here a process means a package; that is, if you have spare memory installed on the server that you want to use, you have to distribute your work among multiple packages and run them in parallel to enable these packages to use more memory at run time. This goes back to the best practice of package design that advocates a modular design for the development of more complex requirements. The package modules can be combined using the Execute Package task to form a more complex parent package.

Also, you will need to run child packages out of process from the parent package to enable them to reserve their own memory pool during execution. You can do this by setting the ExecuteOutOfProcess property to True on the Execute Package task. Refer to Chapter 5 for more details on how this property affects running packages. If a package cannot be divided into child packages and requires more memory to run, you need to consider moving to 64-bit systems, which can allocate large virtual memory to each process. Because SSIS has no AWE memory support, adding more AWE memory on a 32-bit system is irrelevant when you are executing a large SSIS package.
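For example, a parent package might be organized along these lines; the child package names are purely illustrative, and the point is that each child reserves its own memory pool when ExecuteOutOfProcess is set to True:

    ParentPackage.dtsx
      Execute Package Task -> LoadCustomers.dtsx  (ExecuteOutOfProcess = True)
      Execute Package Task -> LoadOrders.dtsx     (ExecuteOutOfProcess = True)
      Execute Package Task -> LoadProducts.dtsx   (ExecuteOutOfProcess = True)

On a 32-bit server, each child process then gets its own 2GB (or 3GB) virtual address space instead of sharing the parent's.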
64-Bit Is Here
With 64-bit chips becoming more affordable and multicore CPUs readily available, performance issues are beginning to disappear, at least for the time being. Utilizing 64-bit computer systems for online data processing, analytical systems, and reporting systems makes sense. If you are up against a small time window and your data processing needs are still growing, you need to consider moving to 64-bit technology. The benefits of this environment not only outweigh the cost of moving to 64-bit, but it may actually turn out to be the cheaper option when you're dealing with millions of transactions on several 32-bit systems and need to scale up.
The earlier 64-bit versions in SQL Server 2000 provided limited options, whereas in SQL Server 2005 and later, the 64-bit edition provides the same feature set as the 32-bit edition with enhanced performance. All the components of SQL Server 2005 and later versions can run in native 64-bit mode, thus eliminating the earlier requirement to have a separate 32-bit computer to run the tools. In addition, the WOW64 mode of Microsoft Windows allows 32-bit applications to run on a 64-bit operating system. This is a cool feature of Microsoft Windows, as it lets third-party software without 64-bit support coexist with SQL Server on the 64-bit server.
While discussing the advantages of 64-bit technology, let's quickly go through the following architectural and technical advantages of moving to a 64-bit environment:

- Compared to 32-bit systems, which are limited to 4GB of address space, 64-bit systems can support up to 1,024 gigabytes of both physical and addressable memory. 32-bit systems must use Address Windowing Extensions (AWE) to access memory beyond the 4GB limit, which has its own limitations. The increase in directly addressable memory in the 64-bit architecture enables it to perform more complex and resource-intensive queries easily without swapping out to disk. Also, 64-bit processors have larger on-die caches, enabling them to use processor time more efficiently. The transformations that deal with row sets instead of row-by-row operations, such as the Aggregate transformation, and the transformations that cache data to memory for lookup operations, such as the Lookup and Fuzzy Lookup transformations, benefit from the increased availability of memory.

- The improved bus architecture and parallel processing abilities provide almost linear scalability with each additional processor, yielding higher returns per processor when compared to 32-bit systems.

- The wider bus architecture of 64-bit processors enables them to move data more quickly between the cache and the processor, which results in improved performance.
With more benefits and increased affordability, deployment of 64-bit servers is growing and will eventually replace 32-bit servers. The question you may ask is, what are the optimal criteria for making the switch? As an Integration Services developer, you need to determine when your packages start suffering from performance problems and start requiring more resources. The following scenarios may help you identify such situations:

- With the increase in the volume of data to be processed, especially where Sort and Aggregate transformations are involved, the pressure on memory also increases, causing large data sets that cannot fit in the memory space to be swapped out to hard disks. Whenever this swapping of data to hard disks starts occurring, you will see massive performance degradation. You can use Windows performance counters to capture this situation (see the example after this list) so that when it happens, you can add more memory to the system. However, if you've already run out of full capacity and there is no more room to grow, or if you are experiencing other performance issues with the system, your best bet is to replace it with a newer, faster, and beefier system. Think about 64-bit seriously, analyze cost versus performance, and go for it.

- If you are running SSIS packages on a database system that is in use at the time when these packages run, you may encounter performance issues and your packages may take much longer to finish than you would expect. This may be due to data sets being swapped out of memory and also due to processor resource allocation. Such systems will benefit most from the improved parallelization of processes within 64-bit systems.
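As a hedged illustration, the typeperf command below samples two counters that are commonly watched for this kind of memory pressure; the exact names of the SSIS counter objects vary by SQL Server version and instance, so treat them as examples rather than a definitive list:

    typeperf "\Memory\Pages/sec" "\SQLServer:SSIS Pipeline\Buffers spooled"

A Buffers spooled value that keeps climbing during package execution indicates that the data flow has started writing buffers to disk.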
The next step is to understand the requirements, limitations, and implications associated with the use of 64-bit systems. Review the following list before making a final decision:
- Not all utilities that are available in 32-bit editions are available in 64-bit editions. The only utilities available in the 64-bit edition are dtutil.exe, dtexec.exe, and DTSWizard.exe (SQL Server Import and Export Wizard).

- You might have issues connecting to data sources in a 64-bit environment due to a lack of providers. For example, when you're populating a database in a 64-bit environment, you must have 64-bit OLE DB providers available for all your data sources.

- As DTS 2000 components are not available for 64-bit editions of SQL Server 2000, SSIS has no 64-bit design-time or run-time support for DTS packages. Because of this, you also cannot use the Execute DTS 2000 Package task in packages you intend to run on a 64-bit edition of Integration Services.

- By default, SQL Server 64-bit editions run jobs configured in SQL Server Agent in 64-bit mode. If you want to run an SSIS package in 32-bit mode on a 64-bit edition, you can do so by creating a job with the Operating System job step type and invoking the 32-bit version of dtexec.exe from the command line or a batch file (see the example after this list). The SQL Server Agent uses the Registry to identify the correct version of the utility. However, on a 64-bit version of SQL Server Agent, if you choose Use 32-Bit Runtime on the Execution Options tab of the New Job Step dialog box, the package will run in 32-bit mode.

- Not all .NET Framework data providers and native OLE DB providers are available in 64-bit editions. You need to check the availability of all the providers you've used in your package before deploying your packages to a 64-bit edition.

- If you need to connect to a 64-bit system from the 32-bit computer where you're building your Integration Services package, you must have a 32-bit provider installed on the local computer along with the 64-bit version, because the 32-bit SSIS Designer displays only 32-bit providers. To use the 64-bit provider at run time, you simply make sure that the Run64BitRuntime property of the project is set to True, which is the default.
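For instance, an Operating System (CmdExec) job step could call the 32-bit utility explicitly; the path below assumes a default SQL Server 2005 installation on a 64-bit server and a placeholder package location, so adjust both for your environment:

    "C:\Program Files (x86)\Microsoft SQL Server\90\DTS\Binn\DTExec.exe" /FILE "D:\SSIS\MyPackage.dtsx"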
After you have sorted out the infrastructure requirements of having a 32-bit or a 64-bit system, your next step toward performance improvement is to design and develop packages that use the available resources optimally, without causing issues for other applications running on the same server. In the following sections, you will learn some of the design concepts before learning to monitor performance.
Architecture of the Data Flow
The Data Flow task is a special task in your package design that handles data movement, data transformation, and data loading into the destination store. This is the main component of an Integration Services package and determines how your package is going to perform when deployed on production systems. Typically, your package can have one or more Data Flow tasks, and each Data Flow task can have one or more Data Flow sources to extract data; none, one, or more Data Flow transformations to manipulate data; and one or more Data Flow destinations. All your optimization research revolves around these components; here you will discover the values of the properties you need to modify to enhance performance. To be able to decipher the code and the logs, you need to understand how the data flow engine works and what key terms are used in its design.
When a data flow source extracts data from the data source, it places that data in chunks in memory. The memory allocated to these chunks of data is called a buffer. A memory buffer is nothing more than an area in memory that holds rows and columns of data. You've used data viewers earlier in various Hands-On exercises; these data viewers show the data stored in one buffer at a time. If the data is spread over more than one buffer, you click Continue on the data viewer to see the data buffer by buffer. In Chapter 9, you saw how data viewers show data in the buffers, and you also explored two other properties of the Data Flow task, which are discussed here as well.
The Data Flow task has a property called DefaultBufferSize, which is set to 10MB by default (see Figure 15-3). Based on the number of columns, that is, the row size of the pipeline data, and keeping some contingency for performance optimizations (if a column is derived, it can be accommodated in the same buffer), Integration Services calculates the number of rows that can fit in a buffer. However, if your row width is small, that doesn't mean Integration Services will fit as many rows as could be accommodated in the buffer. Another property of the Data Flow task, DefaultBufferMaxRows, restricts a buffer from including more than a specified number of rows, in case there are too few columns in the data set or the columns are too narrow. This design is meant to maximize memory utilization while still keeping the memory requirements within a package predictable. You can also see another property, EngineThreads, in the same figure; it is discussed a bit later in this chapter.

Figure 15-3 Miscellaneous properties of the Data Flow task
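As a rough worked example (the row width is assumed for illustration and buffer overhead is ignored), suppose each pipeline row is about 500 bytes wide and the defaults of DefaultBufferSize = 10,485,760 bytes and DefaultBufferMaxRows = 10,000 are in effect:

    rows that fit by size    = 10,485,760 / 500 bytes per row ≈ 20,971
    rows allowed per buffer  = min(20,971, DefaultBufferMaxRows) = 10,000

Here the 10,000-row cap is the limiting factor; with a 2,000-byte row, the size limit (roughly 5,242 rows) would apply instead, so tuning either property changes how many rows travel in each buffer.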
While the data flow is executing, you see that the data is extracted by Data Flow sources, passed on to downstream transformations, and finally lands in the Data Flow destination adapter. By looking at the execution view of a package, you might think that the data buffers move from component to component and that data flows from one buffer to another. This is in fact not entirely correct. Moving data from one buffer to another is quite an expensive process and is avoided by several components of the Integration Services data flow engine. Data Flow transformations are classified into different categories based on their methods of handling and processing data in buffers. Some components actually traverse the memory buffers and make changes only in the data columns that need to be changed, while most other data in the memory buffer remains as is, thus saving the costly data movement operation. SSIS has other components that do require actual movement of data from one buffer to another to perform the required operation. So, it depends on what type of operation is required and what type of transformation is used to achieve the objective. Based on the functions they perform, Data Flow transformations may use different types of outputs that may or may not be synchronous with the input. This has quite an impact on whether changes can be made in place or data must be moved to new memory buffers. Let's discuss the synchronous and asynchronous nature of Data Flow transformations before discussing their classifications.
Synchronous and Asynchronous Transformations
Data Flow transformations can have either synchronous or asynchronous outputs. Components whose outputs are synchronous to their inputs are called synchronous components. These transformations make changes to the incoming rows by adding or modifying columns, but they do not add any rows to the data flow; for example, the Derived Column transformation modifies existing rows by adding a new column. If you go to the Input And Output Properties tab in the Advanced Editor for these transformations, you may notice that they do not add all the columns to the Output Columns section, but add only the newly created columns, because all other columns that are available at the input are, by default, available at the output also. This is because these transformations do not move data from one buffer to another; instead, they make changes to the required columns while keeping the data in the same buffer. So the operation of these transformations is quick, and they do not place a heavy load on the server. Hence, a transformation with synchronous outputs processes and passes the input rows immediately to the downstream components.

One point to note here is that, because these transformations do not move data between buffers, you cannot develop a synchronous transformation for operations that do need to move data between buffers, such as sorting and aggregating. Synchronous transformations do not block data buffers, and their outputs are readily available to the downstream components. Because the data stays in the same buffer within an execution tree (you will learn more about execution trees later in the chapter) where only synchronous transformations are used, any change you make to the metadata, for instance the addition of a column, gets applied at the beginning of the execution tree and not at the point where the synchronous component runs in the data flow. For example, if a Derived Column transformation adds a column to the data flow, the column is added at the beginning of the execution tree in which this transformation lies. This helps in calculating the correct memory allocation for the buffer that is created the first time for this execution tree. The column is not visible to you before the Derived Column transformation, but it exists in the buffer.
While synchronous transformations are fast and lightweight when you are performing derivations and simple calculations on columns, you will need to use transformations that have asynchronous outputs in other situations, such as when you are performing operations on multiple buffers of data, for example aggregating or sorting data. Asynchronous transformations move data from input buffers to new output buffers so that they can perform transformations that are otherwise not possible by processing individual rows. Typically, these transformations add new rows to the data flow; for example, an Aggregate transformation adds new rows, most likely with new metadata, containing aggregations of the data columns. The buffers that arrive at the input of an asynchronous transformation are not the buffers that leave at the asynchronous output. For example, a Sort transformation collects the rows, decides the order in which to place the rows, and then writes them to new output buffers. These components act as a data source and a data destination at the same time. These transformations provide new rows and columns for the downstream components and hence have to work harder. While developing custom components, you can develop a transformation that has both synchronous and asynchronous outputs, though no such component has been provided in the collection of prebuilt transformations.

If you go to the Input And Output Properties tab in the Advanced Editor for these transformations, you may notice that they define new columns in the Output Columns section. While waiting for the complete data set, these transformations block the data buffers and slow down the processing. These transformations also put a heavy load, or huge memory demands, on the server to accommodate the complete data set in memory. If the server doesn't have enough memory, the data will be cached to disk, further degrading the performance of the package. For these reasons, you need to make sure that you use only the minimum required number of asynchronous transformations in your package and that the server has sufficient memory to support the data set that will be fed through the asynchronous transformation. You have used both of these types of transformations, which are readily available in the Toolbox in BIDS, in Chapter 10 and have also built custom transformations in Chapter 11.
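As a quick, non-exhaustive reference, some commonly used built-in transformations fall into these groups:

    Synchronous outputs:   Derived Column, Data Conversion, Copy Column, Character Map, Row Count, Conditional Split
    Asynchronous outputs:  Sort, Aggregate, Merge, Merge Join, Union All, Fuzzy Grouping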
Classifying Data Flow Transformations
Earlier, you've seen the transformations grouped together in Chapter 9. Here, you will study the three classifications of Data Flow transformations that are based mainly on performance considerations.
Nonblocking Synchronous Row-Based Transformations
These transformations work on a row-by-row basis and modify the data by changing the data in columns, adding new columns, or removing columns from the data rows, but these components do not add new rows to the data flow. The rows that arrive at the input are the rows that leave at the output of these transformations. These transformations have synchronous outputs and make data rows available to the downstream components straightaway. In fact, these transformations do not cause data to move from one buffer to another; instead, they traverse the data buffer and make the changes to the data columns. So these transformations are efficient and lightweight from a processing point of view. Their operation is quite visible in BIDS, and you can quickly make out the differences by looking at the execution of the components in run-time mode. However, bear in mind that the run-time display in BIDS lags behind the actual execution of the package. This delay becomes quite noticeable, especially when you are running a heavy processing package with a large data set and complex transformations. The run-time engine gets so busy with the actual processing, which it prioritizes, that reporting back to BIDS falls behind, so you can't rely totally on what you see in BIDS. However, for our test cases, you don't need to worry about these delays.

When you execute a package in BIDS, look at the data flow execution of the package. You may see certain components processing together, with the annotations for row numbers ticking along, even though they may be placed one after another in the pipeline; you may also notice that certain transformations do not pass data to the downstream components for some time. The transformations that readily pass the data to the downstream components are classified as Row transformations. These transformations