SQL Server MVP Deep Dives - Part 20





... inside the package. Typically, you will store the XML in a file if you are profiling data to be reviewed by a person at a later time, and plan on using the Data Profile Viewer application to review it. Storing the XML output in a variable is most often done when you want to use the profile information later in the same package, perhaps to make an automated decision about data quality.

The XML output includes both the profile requests (the input to the task) and the output from each profile requested. The format of the output varies depending on which profile generated it, so you will see different elements in the XML for a Column Null Ratio profile than you will for a Column Length Distribution profile. The XML contains a lot of information, and it can be difficult to sort through to find the information you are looking for. Fortunately, there is an easier user interface to use. The Data Profile Viewer, shown in figure 2, provides a graphical interface to the data profile information. You can open XML files generated by the Data Profiling task in it and find specific information much more easily. In addition, the viewer represents some of the profile information graphically, which is useful when you are looking at large quantities of data. For example, the Column Length Distribution profile displays the count associated with specific lengths as a stacked bar chart, which means you can easily locate the most frequently used lengths.

Figure 2 Data Profile Viewer



The Data Profile Viewer lets you sort most columns in the tables that it displays, which can aid you in exploring the data. It also allows you to drill down into the detail data in the source system. This is particularly useful when you have located some bad data in the profile, because you can see the source rows that contain the data. This can be valuable if, for example, the profile shows that several customer names are unusually long. You can drill into the detail data to see all the data associated with these outlier rows. This feature does require a live connection to the source database, though, because the source data is not directly included in the data profile output.

One thing to be aware of with the Data Profile Viewer: not all values it shows are directly included in the XML. It does some additional work on the data profiles before presenting them to you. For example, in many cases it calculates the percentage of rows that a specific value in the profile applies to. The raw XML for the data profile only stores the row counts, not the percentages. This means that if you want to use the XML directly, perhaps to display the information on a web page, you may need to calculate some values manually. This is usually a straightforward task.

Constraints of the Data Profiling task

As useful as the Data Profiling task is, there are still some constraints that you need to keep in mind when using it. The first one most people encounter is in the types of data sources it will work with. The Data Profiling task requires that the data to be profiled be in SQL Server 2000 or later. This means you can't use it to directly profile data in Oracle tables, Access databases, Excel spreadsheets, or flat files. You can work around this by importing the data you need into SQL Server prior to profiling it. In fact, there are other reasons why you may want the data in SQL Server in advance, which will be touched on in this section.

The Data Profiling task also requires that you use an ADO.NET connection manager. Typically, in SSIS, OLE DB connection managers are used, as they tend to perform better. This may mean creating two connection managers to the same database, if you need to both profile data and import it in the same package.

Using the Data Profile Viewer does require a SQL Server installation, because the viewer is not packaged or licensed as a redistributable component. It is possible to transform the XML output into a more user-friendly format by using XSL Transformations (XSLT) to translate it into HTML, or to write your own viewer for the information.

The task's performance can vary greatly, depending both on the volume of data you are profiling and on the types of profiles you have requested. Some profiles, such as the Column Pattern profile, are resource intensive and can take quite a while on a large table. One way to address this is to work with a subset of the data, rather than the entire table. It's important to get a representative sample of the data for these purposes, so that the data profile results aren't skewed. This is another reason that having the data in SQL Server can be valuable. You can copy a subset of the data to another table for profiling, using a SELECT that returns a random sampling of rows (as discussed in "Selecting Rows Randomly from a Large Table" from MSDN: http://msdn.microsoft.com/en-us/library/cc441928.aspx). If the data is coming from another source, you can use the Row Sampling or Percentage Sampling components in an SSIS data flow to create a representative sample of the data to profile. Note that when sampling data, care must be taken to ensure the data is truly representative, or the results can be misleading. Generally it's better to profile the entire data set.
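If you do decide to profile a sample, a minimal T-SQL sketch of the copy step might look like the following (the table and staging names are illustrative, not from the chapter). It is simple but still scans the whole table, so for very large tables the techniques in the MSDN article above are cheaper.

SELECT TOP (10) PERCENT p.*
INTO   dbo.Product_ProfileSample
FROM   Production.Product AS p
ORDER BY NEWID();   -- random ordering, so TOP picks a random 10 percent of the rows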

Making the Data Profiling task dynamic

Why would you want to make the Data Profiling task dynamic? Well, as an example, think about profiling a new database. You could create a new SSIS package, add a Data Profiling task, and use the Quick Profile option to create profile requests for all the tables in the database. You'd then have to repeat these steps for the next new database that you want to profile. Or what if you don't want to profile all the tables, but only a subset of them? To do this through the task's editor, you would need to add each table individually. Wouldn't it be easier to be able to dynamically update the task to profile different tables in your database?

Most tasks in SSIS can be made dynamic by using configurations and expressions. Configurations are used for settings that you wish to update each time a package is loaded, and expressions are used for settings that you want to update during the package execution. Both expressions and configurations operate on the properties of tasks in the package, but depending on what aspect of the Data Profiling task you want to change, it may require special handling to behave in a dynamic manner.

Changing the database

Because the Data Profiling task uses connection managers to control the connection to the database, it is relatively easy to change the database it points to. You update the connection manager using one of the standard approaches in SSIS, such as an expression that sets the ConnectionString property, or a configuration that sets the same property. You can also accomplish this by overriding the connection manager's setting at runtime using the /Connection switch of DTEXEC.
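For example, a run of the package could override the connection from the command line roughly as follows (the package and connection manager names are made up for illustration; check dtexec /? for the exact quoting rules on your installation):

dtexec /F ProfilePackage.dtsx /Conn SourceDB;"Data Source=OTHERSERVER;Initial Catalog=Staging;Integrated Security=SSPI;"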

Bear in mind that although you can switch databases this way, the task will only work if it is pointing to a SQL Server database. Also, connection managers only control the database that you are connecting to, and not the specific tables. The profile requests in the task will still be referencing the original tables, so if the new database does not contain tables with the same names, the task will fail. What is needed is a way to change the profile requests to reference new tables.

Altering the profile requests

As noted earlier, you can configure the Data Profiling task through the Data Profiling Task Editor, which configures and stores the profile requests in the task's ProfileRequests property. But this property is a collection object, and collection objects can't be set through expressions or configurations, so, at first glance, it appears that you can't update the profile requests.

Fortunately, there is an additional property that can be used for this on the Data Profiling task. This is the ProfileInputXml property, which stores the XML representation of the profile requests. The ProfileInputXml property is not visible in the Properties window in BIDS, but you can see it in the Property Expressions Editor dialog box, or in the Package Configuration Wizard's property browser. You can set an XML string into this property using either an expression or a configuration. For it to work properly, the XML must conform to the DataProfile.xsd schema mentioned earlier.

Setting the ProfileInputXml property

So how can you go about altering the ProfileInputXml property to profile a different table? One way that works well is to create a string variable in the SSIS package to hold the table name (named TableName) and a second variable to hold the schema name (named SchemaName). Create a third variable that will hold the XML for the profile requests (named ProfileXML), and set the EvaluateAsExpression property of the ProfileXML variable to True. In the Expression property, you'll need to enter the XML string for the profile, and concatenate in the table and schema variables.
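One way to wire this up, consistent with the property-expression approach described above, is to add a property expression on the Data Profiling task that assigns the ProfileInputXml property the value of the variable; that expression is then simply:

@[User::ProfileXML]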

To get the XML to use as a starting point, you can configure and run the Data Profiling task with its output directed to a file. You'll then need to remove the output information from the file, which can be done by removing all of the elements between the <DataProfileOutput> and <Profiles> tags, so that the XML looks similar to listing 1. You may have more or less XML, depending on how many profiles you configured the task for initially.

<?xml version="1.0" encoding="utf-16"?>

<DataProfile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xmlns:xsd="http://www.w3.org/2001/XMLSchema"

xmlns="http://schemas.microsoft.com/sqlserver/2008/DataDebugger/"> <DataSources />

<Column IsWildCard="true" />

</ColumnNullRatioProfileRequest>

<ColumnStatisticsProfileRequest ID="StatisticsReq">

<DataSourceID>{8D7CF241-6773-464A-87C8-60E95F386FB2}</DataSourceID> <Table Schema="Production" Table="Product" />



Once you have the XML, you need to change a few things to use it in an expression. First, the entire string needs to be put inside double quotes ("). Second, any existing double quotes need to be escaped using a backslash (\). For example, the ID attribute ID="StatisticsReq" needs to be formatted as ID=\"StatisticsReq\". Finally, the profile requests need to be altered to include the table name variable created previously. These modifications are shown in listing 2.

"<?xml version=\"1.0\" encoding=\"utf-16\"?>

<DataProfile xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"

xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"

xmlns=\"http://schemas.microsoft.com/sqlserver/2008/DataDebugger/\"> <DataSources />

"\" Table=\"" + @[User::TableName] + "\" />

<Column IsWildCard=\"true\" />

</ColumnNullRatioProfileRequest>

<ColumnStatisticsProfileRequest ID=\"StatisticsReq\">

<DataSourceID>{8D7CF241-6773-464A-87C8-60E95F386FB2}</DataSourceID> <Table Schema=\"" + @[User::SchemaName] +

"\" Table=\"" + @[User::TableName] + "\"/>

Now that we’ve made the task dynamic, let’s move on to making decisions based onthe output of the task

Listing 2 Data profiling XML after converting to an expression

Use variables for schema and table name



Making data-quality decisions in the ETL

The Data Profiling task output can be used to make decisions about the quality of your data, and by incorporating the task output into your ETL process, you can automate these decisions. By taking things a little further, you can make these decisions self-adjusting as your data changes over time. We'll take a look at both scenarios in the following sections.

Excluding data based on quality

Most commonly, the output of the Data Profiling task will change the flow of your ETL depending on the quality of the data being processed in your ETL. A simple example of this might be using the Column Null Ratio profile to evaluate a Customer table prior to extracting it from the source system. If the null ratio is greater than 30 percent for the Customer Name column, you might have your SSIS package set up to abort the processing and log an error message. This is an example of using data profiling information to prevent bad data from entering your data warehouse.

In situations like the preceding, though, a large percentage of rows that may have had acceptable data quality would also be excluded. For many data warehouses, that's not acceptable. It's more likely that these "hard" rules, such as not allowing null values in certain columns, will be implemented on a row-by-row basis, so that all acceptable data will be loaded into the warehouse, and only bad data will be excluded. In SSIS, this is often accomplished in the data flow by using Conditional Split transformations to send invalid data to error tables.
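As a tiny sketch of such a hard rule (the column name is just an example), the Conditional Split condition that routes rows with a missing customer name to an error output could be as simple as:

ISNULL(CustomerName)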

Adjusting rules dynamically

A more complex example involves using data profiling to establish what good data looks like, and then using this information to identify data of questionable quality. For example, if you are a retailer of products from multiple manufacturers, your Product table will likely have the manufacturer's original part number, and each manufacturer may have its own format for part numbers. In this scenario, you might use the Column Pattern profile against a known good source of part numbers, such as your Product table or your Product master, to identify the regular expressions that match the part numbers. During the execution of your ETL process, you could compare new incoming part numbers with these regular expressions to determine if they match the known formats for part numbers. As new products are added to the known good source of part numbers, new patterns will be included in the profile, and the rule will be adjusted dynamically.

Expressions in SSIS

Expressions in SSIS are limited to producing output no longer than 4,000 characters. Although that is enough for the example in this chapter, you may need to take it into account when working with multiple profiles. You can work around the limitation by executing the Data Profiling task multiple times, with a subset of the profiles in each execution to keep the expression under the 4,000-character limit.

It's worth noting that this type of data-quality check is often implemented as a "soft" rule, so the row is not prohibited from entering the data warehouse. After all, the manufacturer may have implemented a new part-numbering scheme, or the part number could have come from a new manufacturer that is not in the Product dimension yet. Instead of redirecting the row to an error table, you might set a flag on the row indicating that there is a question as to the quality of the information, but allow it to enter the data warehouse anyway. This would allow the part number to be used for recording sales of that product, while still identifying a need for someone to follow up and verify that the part number is correct. Once they have validated the part number, and corrected it if necessary, the questionable data flag would be removed, and that product could become part of the known good set of products. The next time that you generate a Column Pattern profile against the part numbers, the new pattern will be included, and new rows that conform to it will no longer be flagged as questionable.

As mentioned earlier, implementing this type of logic in your ETL process can allow it to dynamically adjust data-quality rules over time, and as your data quality gets better, the ETL process will get better at flagging questionable data.

Now let's take a look at how to use the task output in the package.

Consuming the task output

As mentioned earlier, the Data Profiling task produces its output as XML, which can be stored in a variable or a file. This XML output will include both the profile requests and the output profiles for each request.

Capturing the output

If you are planning to use the output in the same package that the profiling task is in, you will usually want to store the output XML in a package variable. If the output will be used in another package, how you store it will depend on how the other package will be executed. If the second package will be executed directly from the package performing the profiling through an Execute Package task, you can store the output in a variable and use a Parent Package Variable configuration to pass it between the packages. On the other hand, if the second package will be executed in a separate process or at a different time, storing the output in a file is the best option.

Regardless of whether the output is stored in a variable or a file, it can be accessed in a few different ways. Because the output is stored as XML, you can make use of the XML task to use it in the control flow, or the XML source to use it in the data flow. You can also use the Script task or the Script component to manipulate the XML output directly using .NET code.


The XSLT operation can be used to transform the output into a format that's easier to use, such as filtering the profile output down to specific profiles that you are interested in, which is useful if you want to use the XML source to process it. The XSLT operation can also be used to remove the default namespace from the XML document, which makes using XPath against it much easier.

XPath operations can be used to retrieve a specific value or set of nodes from the profile. This option is illustrated by the Trim Namespaces XML task in the sample package that accompanies this chapter, showing how to retrieve the null count for a particular column using XPath.
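As a rough sketch of the XPath approach in .NET code (not the chapter's sample package), the following reads a null count from a saved profile file. The element and attribute names for the Column Null Ratio output (ColumnNullRatioProfile, Column/@Name, NullCount) are assumptions based on the DataProfile schema, so verify them against a profile you have generated; the file path and column name are also just placeholders.

using System;
using System.Xml;

public class NullCountReader {
    public static void Main() {
        XmlDocument doc = new XmlDocument();
        doc.Load(@"C:\Temp\CustomerProfile.xml");   // placeholder path

        // The profile XML uses a default namespace, so register it under a prefix for XPath.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("dp", "http://schemas.microsoft.com/sqlserver/2008/DataDebugger/");

        // Assumed output structure: a ColumnNullRatioProfile element carrying a NullCount.
        XmlNode nullCount = doc.SelectSingleNode(
            "//dp:ColumnNullRatioProfile[dp:Column/@Name='CustomerName']/dp:NullCount", ns);

        if (nullCount != null) {
            Console.WriteLine("Null count for CustomerName: " + nullCount.InnerText);
        }
    }
}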

NOTE The sample package for this chapter can be found on the book's website at http://www.manning.com/SQLServerMVPDeepDives.

In the data flow, the XML source component can be used to get information from the Data Profiling task output. You can do this in two ways, one of which is relatively straightforward if you are familiar with XSLT. The other is more complex to implement but has the benefit of not requiring in-depth XSLT knowledge.

If you know XSLT, you can use an XML task to transform and simplify the Data Profiling task output prior to using it in the XML source, as mentioned previously. This can help avoid having to join multiple outputs from the XML source, which is discussed shortly.

If you don't know XSLT, you can take a few additional steps and use the XML source directly against the Data Profiling task output. First, you must provide an XSD file for the XML source, but the XSD published by Microsoft at http://schemas.microsoft.com/sqlserver/2008/DataDebugger/DataProfile.xsd is too complex for the XML source. Instead, you will need to generate a schema using an existing data profile that you have saved to a file. Second, you have to identify the correct outputs from the XML source. The XML source creates a separate output for each distinct element type in the XML: the output from the Data Profiling task includes at least three distinct elements for each profile you include, and for most profiles it will have four or more. This can lead to some challenges in finding the appropriate output information from the XML source. Third, because the XML source does not flatten the XML output, you have to join the multiple outputs together to assemble meaningful information. The sample package on the book's website (http://www.manning.com/SQLServerMVPDeepDives) has an example of doing this for the Column Pattern profile. The data flow is shown in figure 3.

New to XML?

If you are new to XML, the preceding discussion may be a bit confusing, and the reasons for taking these steps may not be obvious. If you'd like to learn more about working with XML in SSIS, please review these online resources:

• General XML information: http://msdn.microsoft.com/en-us/xml/default.aspx
• Working with XML in SSIS: http://blogs.msdn.com/mattm/archive/tags/XML/default.aspx

In the data flow shown in figure 3, the results of the Column Pattern profile are being transformed from a hierarchical structure (typical for XML) to a flattened structure suitable for saving into a database table. The hierarchy for a Column Pattern profile has five levels that need to be used for the information we are interested in, and each output from the XML source includes one of these levels. Each level contains a column that ties it to the levels below it. In the data flow, each output from the XML source is sorted, so that consistent ordering is ensured. Then, each output, which represents one level in the hierarchical structure, is joined to the output representing the next level down in the hierarchy. Most of the levels have a ColumnPatternProfile_ID, which can be used in the Merge Join transformation to join the levels, but there is some special handling required for the level representing the patterns, as they need to be joined on the TopRegexPatterns_ID instead of the ColumnPatternProfile_ID. This data flow is included in the sample package for this chapter, so you can review the logic if you wish.

Figure 3 Data flow to reassemble a Column Pattern profile


Another approach that requires scripting is the use of the classes in the DataProfiler.dll assembly. These classes facilitate loading and interacting with the data profile through a custom API, and the approach works well, but this is an undocumented and unsupported API, so there are no guarantees when using it. If this doesn't scare you off, and you are comfortable working with unsupported features (that have a good chance of changing in new releases), take a look at "Accessing a data profile programmatically" on the SSIS Team Blog (http://blogs.msdn.com/mattm/archive/2008/03/12/accessing-a-data-profile-programmatically.aspx) for an example of using the API to load and retrieve information from a data profile.

Incorporating the values in the package

Once you have retrieved values from the data profile output, using one of the methods discussed in the previous sections, you need to incorporate it into the package logic. This is fairly standard SSIS work.

Most often, you will want to store specific values retrieved from the profile in a package variable, and use those variables to make dynamic decisions. For example, consider the Column Null Ratio profiling we discussed earlier. After retrieving the null count from the profile output, you could use an expression on a precedence constraint to have the package stop processing if the null count is too high.
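For instance (the variable names here are illustrative, not from the chapter), the precedence constraint leading into the load steps could carry an expression such as the following, so that the rest of the package only runs when the count is acceptable:

@[User::NullCount] <= @[User::MaxAllowedNulls]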

In the data flow, you will often use Conditional Split or Derived Column transformations to implement the decision-making logic. For example, you might use the Data Profiling task to run a Column Length Distribution profile against the product description column in your Product table. You could use a Script task to process the profile output and determine that 95 percent of your product descriptions fall between 50 and 200 characters. By storing those boundary values in variables, you could check for new product descriptions that fall outside of this range in your ETL. You could use the Conditional Split transformation to redirect these rows to an error table, or the Derived Column transformation to set a flag on the row indicating that there might be a data-quality issue.
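A sketch of that Conditional Split condition, using the boundary variables just described (the column name is illustrative), might look like:

LEN(ProductDescription) < @[User::MinLength] || LEN(ProductDescription) > @[User::MaxLength]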

Some data-quality checking is going to require more sophisticated processing. For the Column Pattern checking scenario discussed earlier, you would need to implement a Script component in the data flow that can take a list of regular expressions and apply them against the column that you wanted to check. If the column value matched one or more of the regular expressions, it would be flagged as OK. If the column value didn't match any of the regular expressions, it would be flagged as questionable, or redirected to an error table. Listing 3 shows an example of the code that can perform this check. It takes in a delimited list of regular expression patterns, and then compares each of them to a specified column.

Listing 3 Script component to check column values against a list of patterns

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class ScriptMain : UserComponent {
    List<Regex> regex = new List<Regex>();

    public override void PreExecute() {
        base.PreExecute();
        string[] regExPatterns;
        IDTSVariables100 vars = null;
        // Read the "~"-delimited pattern list from the RegExPatterns package variable.
        this.VariableDispenser.LockOneForRead("RegExPatterns", ref vars);
        regExPatterns = vars["RegExPatterns"].Value.ToString().Split("~".ToCharArray());
        vars.Unlock();
        // Compile each pattern once, before any rows are processed.
        foreach (string pattern in regExPatterns) {
            regex.Add(new Regex(pattern, RegexOptions.Compiled));
        }
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row) {
        // Flag the row as good if the column matches any of the known patterns.
        // The matching loop and the column name (PartNumber) are reconstructed here for
        // illustration; the extracted listing omitted them. Use the input column you are checking.
        bool matched = false;
        foreach (Regex r in regex) {
            if (r.IsMatch(Row.PartNumber)) {
                matched = true;
                break;
            }
        }
        if (matched) {
            Row.GoodRow = true;
        } else {
            Row.GoodRow = false;
        }
    }
}

Summary

Over the course of this chapter, we've looked at a number of ways that the Data Profiling task can be used in SSIS, from using it to get a better initial understanding of your data to incorporating it into your ongoing ETL processes. Being able to make your ETL process dynamic and more resilient to change is important for ongoing maintenance and usability of the ETL system. As data volumes continue to grow, and more data is integrated into data warehouses, the importance of data quality increases as well. Establishing ETL processes that can adjust to new data and still provide valid feedback about the quality of that data is vital to keeping up with the volume of information we deal with today.

About the author

John Welch is Chief Architect with Mariner, a consulting firm specializing in enterprise reporting and analytics, data warehousing, and performance management solutions. John has been working with business intelligence and data warehousing technologies for seven years, with a focus on Microsoft products in heterogeneous environments. He is an MVP and has presented at Professional Association for SQL Server (PASS) conferences, the Microsoft Business Intelligence conference, Software Development West (SD West), Software Management Conference (ASM/SM), and others. He has also contributed to two recent books on SQL Server 2008: Microsoft SQL Server 2008 Management and Administration (Sams, 2009) and Smart Business Intelligence Solutions with Microsoft SQL Server 2008 (Microsoft Press, 2009).

Expressions in SQL Server Integration Services

SSIS packages: a brief review

Before we can dive into the deep end with expressions, we need to look at SSIS packages—the context in which expressions are used. Packages in SSIS are the units of development and deployment; they're what you build and execute, and have a few common components, including

• Control flow—The execution logic of the package, which is made up of tasks, containers, and precedence constraints. Each package has a single control flow.

• Data flow—The high-performance data pipeline that powers the core ETL functionality in SSIS, and is made up of sources, transformations, and destinations. The SSIS data flow is implemented as a task, which allows multiple data flow tasks to be added to a package's control flow.

• Connection managers—Shared components that allow the control flow and data flow to connect to databases, files, and other resources outside of the package.



• Variables—The sole mechanism for sharing information between components in an SSIS package; variables have deep integration with expressions as well.

SSIS packages include more than just these elements, but for the purposes of this chapter, that's enough review. Let's move on to the good stuff: expressions!

Expressions: a quick tour

Expressions add dynamic functionality to SSIS packages using a simple syntax based on a subset of the C language. Expression syntax does not include any control of flow (looping, branching, and so on) or data modification capabilities. Each expression evaluates to a single scalar value, and although this can often seem restrictive to developers who are new to SSIS, it allows expressions to be used in a variety of places within a package.

How can we use expressions in a package? The simplest way is to use property expressions. All containers in SSIS, including tasks and the package itself, have an Expressions property, which is a collection of expressions and the properties to which their values will be assigned. This allows SSIS package developers to specify their own code—the expression—that is evaluated whenever a property of a built-in or third-party component is accessed. How many other development tools let you do that?

Let's look at an example. Figure 1 shows the properties for an Execute SQL Task configured to execute a DELETE statement.

Figure 1 Static task properties

Although this Execute SQL Task is functional, it isn't particularly useful unless the package always needs to delete the order details for [OrderID]=5. This task would be much more useful if it instead deleted whatever order number was current for the package execution. To implement this dynamic behavior, we're going to take two steps. First, we're going to add a new variable, named OrderID, to the package. (If you don't know how to do this already, consider it an exercise—we won't walk through adding a variable step by step.) Second, we're going to add a property expression to the SqlStatementSource property of the Execute SQL Task. To do this, we'll follow the steps illustrated in figure 2.

1. In the Properties window, select the Execute SQL Task and then click on the ellipsis (...) button next to the Expressions property. This will cause the Property Expressions Editor dialog box to be displayed.

2. In the Property Expressions Editor dialog box, select the SqlStatementSource property from the drop-down list in the Property column.

3. Click on the ellipsis button in the Expression column. This will cause the Expression Builder dialog box to be displayed. (Please note that figure 2 shows only a subset of the Expression Builder dialog box to better fit on the printed page.)

4. Enter the following expression in the Expression text box:

"DELETE FROM [dbo].[Order Details] WHERE [OrderID] = " + (DT_WSTR, 50) @[User::OrderID]

5. Click on the Evaluate Expression button to display the output of the expression in the Evaluated Value text box. (At this point it may be useful to copy and paste this value into a SQL Server Management Studio query window to ensure that the expression was constructed correctly.)

Figure 2 Adding a property expression



6. Click on the OK buttons to close the Expression Builder and Property Expressions Editor windows and save all changes.

7. Execute the package to ensure that the functionality added through the expression behaves as required.

Several important techniques are demonstrated in these steps:

• We started with a valid static value before we added the expression. Instead of starting off with a dynamic SQL statement, we started with a static statement, which we tested to ensure that we had a known good starting point.

• We added a single piece of dynamic functionality at a time. Because our example was simple, we only added a single piece of dynamic functionality in total; but if we were adding both a dynamic WHERE clause and a dynamic table name, we would've added each dynamic expression element to the static SQL statement individually.

• We tested the expression after each change. This basic technique is often overlooked, but it's a vital timesaver. The Expression Editor has limited debugging capabilities, and locating errors in a complex expression can be painfully difficult. By testing the expression after each change, the scope of debugging can be significantly reduced.

With this example setting the stage, let's dive deeper into SSIS expressions by illustrating how they can be used to add dynamic functionality to our packages, and solve real-world problems.

Expressions in the control flow

We'll continue by looking at expressions in the SSIS control flow. Although the example in the previous section is technically a control flow example (because we applied a property expression to a property of a task, and tasks are control flow components), there are more interesting examples and techniques we can explore. One of the most important—and overlooked—techniques is using expressions with precedence constraints to conditionally execute tasks.

Consider the following requirements:

• If a specific table exists in the target database, execute a data flow task.

• If the table does not exist, execute an Execute SQL Task to create the table, and then execute the data flow task.

If this problem needed to be solved using a traditional programming language, the developer would add an if statement and that would be that. But SSIS does not include an if statement, a branching task, or the like, so the solution, although simple, is not always obvious.

An often-attempted approach to solve this problem is to add a property expression to the Disabled property of the Execute SQL Task. The rationale here is that if the Execute SQL Task is disabled then it won't execute, and only the data flow task will run. The main problem with this approach is that the Disabled property is designed to be used only at design time; setting Disabled to True is similar to commenting out a task so that it remains part of the control flow—but as far as the SSIS runtime is concerned, the task doesn't exist.

The preferred way to achieve this goal is to use expressions on the precedence constraints that connect the various tasks in the control flow. In addition to the three different constraints that can be used (success, failure, and completion), each precedence constraint can be edited to include an expression that determines whether or not this particular branch of the control flow logic will execute. The expression must have a Boolean return value—it must evaluate to true or false—and this value controls the conditional execution. Figure 3 illustrates the control flow configuration necessary to implement the required behavior using expressions.

Implementing this solution has three primary steps:

1. The results of the SELECT statement run by the Execute SQL Task are stored in a Boolean package variable named TableExists. To map the value into a Boolean variable, CAST the data type to BIT in the SELECT statement, returning 1 if the table exists, and 0 if not (a sketch of such a statement follows this list).

2. Each precedence constraint has been edited to apply the Expression and Constraint evaluation operation option, with the appropriate expression (for one, @TableExists; for the other, !@TableExists) specified to enforce the required logic. Note that the two expressions are both mutually exclusive (they cannot both be true at the same time) and also inclusive—there is no condition that's not represented by one of the two expressions.

3. The @TableExists precedence constraint has been edited to specify the Logical OR option—this is why the constraints that reference the data flow task are displayed with dotted lines. This is required because, as you'll recall, the two paths from the first Execute SQL Task are mutually exclusive, but both paths end at the data flow task. Unless one of the two precedence constraints that end at the data flow task is so edited (you only need to edit one, because the change in operation will apply to all precedence constraints that end at the same task)—the data flow task will never execute.

Figure 3 Conditional execution with expressions (two constraints, "Success and @TableExists" and "Success and !@TableExists", both ending at the Load Data Into Table data flow task)
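A sketch of the kind of SELECT described in step 1 follows (the table name is a placeholder); its single-row result would be mapped to the TableExists variable on the Execute SQL Task's Result Set page:

SELECT CAST(CASE WHEN OBJECT_ID(N'dbo.OrderStaging', N'U') IS NOT NULL
                 THEN 1 ELSE 0 END AS bit) AS TableExists;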

The final settings for the @TableExists precedence constraint can be seen in figure 4.

Figure 4 Precedence constraint with expression

One additional requirement for this approach is the need for a task from which the precedence constraints can originate. In this example, the need for the Execute SQL Task is obvious—the package uses this task to check to see if the target table exists. But there are other common scenarios, such as when the state upon which the conditional logic must be based is set via a package configuration, and the first task in the package must be executed or skipped based on this state, where the natural package logic does not include a task from which the expression-based precedence constraints should originate. This can pose a predicament, because precedence constraints must originate from a task or container, and this type of conditional logic is implemented in SSIS by using precedence constraints and expressions.

Self-documenting precedence constraints

If you would like your precedence constraints to include the constraint options and expressions shown in figure 3, all you need to do is set the ShowAnnotation property for each precedence constraint. The default value for this property is AsNeeded, and does not cause this information to be displayed; but setting this property to ConstraintOptions for each precedence constraint will cause these annotations to be displayed. Unfortunately, SSIS does not support setting a default value for this property, but it is easy to select multiple precedence constraints and set this property for all of them at one time. Taking this step will make your packages self-documenting, and easier to debug and maintain.

In situations such as these, a useful technique is to add a placeholder task—one that serves as the starting point for precedence constraints—to the control flow. Two obvious candidates for this placeholder role are the Script Task and the Sequence Container; each of these components will work without any configuration required, and won't alter the package logic.

Expressions and variables

In addition to using property expressions, you can also use expressions with SSIS variables. In fact, variables in SSIS have a special ability related to expressions: they not only have a Value property, but also an EvaluateAsExpression property. If this Boolean property is set to true, when the variable's value is accessed, instead of returning the value of the Value property, the variable will evaluate the expression that's stored in its Expression property.

Configuring variables to evaluate as expressions is a powerful technique. Instead of always returning a hard-coded value—or relying on the Script Task or other components to update the variable's value—the variable can return a dynamic value that reflects the current state of the executing package. For developers with object-oriented programming experience, this is analogous to using a property get accessor instead of a field; it provides a mechanism by which you can add custom code that is run whenever the variable's value is read. This technique allows you to use variables as containers for expressions, so that they can be used in multiple places throughout the package.

Additional reading online

For more detailed examples on how to use expressions in the control flow, see these online resources:

• Expressions and precedence constraints: http://bi-polar23.blogspot.com/2008/02/expressions-and-precedence-constraints.html

• Using placeholder tasks: http://bi-polar23.blogspot.com/2007/05/conditional-task-execution.html

• Expressions and the Foreach Loop Container: http://bi-polar23.blogspot.com/2007/08/loading-multiple-excel-files-with-ssis.html



One real-life example of this technique is managing the locations of filesystem resources. Consider a package that works with files in a folder structure, like the one shown in figure 5.

As you can see, there is a DeploymentRoot folder that contains subfolders for the different types of files with which the package interacts. In the real world, the root folder could exist on different drives and in different locations in the filesystem structure, on the different machines to which the package may be deployed. To handle this eventuality, you'd use package configurations—or a similar mechanism—to inform the package where the files are located, probably by using the configuration to set the value of a @DeploymentRootPath variable. You could then use multiple configurations to set the values of multiple variables, one for each folder, but there is a better way. And as you have likely guessed—this better way uses expressions.

For the folder structure shown in figure 5, you could create four additional variables, one for each subfolder, configured to evaluate as the following expressions:

• @ErrorFilePath - @DeploymentRootPath + "\\ErrorFiles"

• @ImportFilePath - @DeploymentRootPath + "\\ImportFiles"

• @LogFilePath - @DeploymentRootPath + "\\LogFiles"

• @OutputFilePath - @DeploymentRootPath + "\\OutputFiles"

And it doesn't stop there. It's not uncommon to see packages where a different subfolder must be used per client, or per year, or per day—and having a set of variables based on expressions that can in turn be used as the basis for more granular expressions is a great way to achieve reuse within a package. And, in this scenario, only one configuration is required—the value for the @DeploymentRootPath variable can be set via a configuration, and all other filesystem paths will be automatically updated because they're based on expressions that use this variable as their source.

Figure 5 Deployment folders
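Building on the variables above, for example, a per-year import folder could itself be another expression-based variable; a sketch (the folder naming convention is assumed, not from the chapter) might be:

@[User::ImportFilePath] + "\\" + (DT_WSTR, 4) YEAR(GETDATE())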

Additional reading online

For more detailed examples on how to use expressions with package variables, see these online resources:

• Filesystem deployment: http://bi-polar23.blogspot.com/2007/05/flexible-file-system-deployment.html

• Dynamic filename expressions: http://bi-polar23.blogspot.com/2008/06/file-name-expressions.html

• Dynamic filenames and dates: http://bi-polar23.blogspot.com/2008/06/looking-for-date-what-in-name.html
