Hands-On Microsoft SQL Server 2008 Integration Services, Part 49


The Term Extraction transformation reads text from an input column and uses its internal algorithms and statistical models, in line with the options you've selected, to generate the output results. By default, the output of this transformation contains only two columns, Term and Score. The Term column contains the extracted term, while the Score column contains the number of times that term is found in the input column. To meet its objective of reading from an input column and writing to an output, this transformation supports one input and one output, along with one error output. This transformation can extract terms from an input column that is of either the DT_WSTR or DT_NTEXT data type.
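To make the default output concrete, consider a hypothetical illustration (the input text and the resulting rows are invented for this example, and the exact rows you get will depend on the options you choose). Given an input row containing the text "The mountain bike and the road bike are both bicycles", a noun-only extraction with the Frequency score type would produce output along these lines:

Term        Score
----        -----
bike        2
mountain    1
road        1
bicycle     1

Note that bicycles would be stemmed to its dictionary form, bicycle, as described under Stemming later in this section.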

Generally, you will use this transformation as the last transformation in that branch of the data flow, as it doesn't let the input columns pass through. However, it does provide an output that contains only the two resulting output columns. When you open the Term Extraction Transformation Editor, you will see the following three tabs:

Term Extraction

In this tab, you can select an input column from the list of Available Input Columns from which you want this transformation to extract terms. You can also specify the names for the two output columns in this tab, which by default are Term and Score.

Exclusion

You can choose to use exclusion terms by clicking the Use Exclusion Terms check box, which tells the transformation to exclude some of the terms from extraction. While you are trying to build a meaningful list of terms that you can use for data mining purposes, you may want to exclude certain terms because they appear everywhere and draw focus away from the key terms. You specify the exclusion terms in a lookup table, which must reside in SQL Server 2000 or later or in Microsoft Access. This lookup operation works in a fully precached mode, which means that the transformation loads the exclusion terms from the lookup table into its private memory before it starts extracting terms from the input column. You specify an OLE DB connection manager to let the transformation connect to the data source. Then you choose a table or view from the drop-down list, which this transformation accesses using the OLE DB connection manager specified previously; finally, you select, from the drop-down list of columns, the column that contains the exclusion terms in the specified table or view.
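As a minimal sketch of such a lookup table (the table, column, and term names here are illustrative; the transformation only requires a table or view with a single column of terms), the exclusion terms might be stored like this:

-- Hypothetical exclusion-terms table for the Exclusion tab.
-- NVARCHAR keeps the column compatible with DT_WSTR input data.
CREATE TABLE dbo.TermExclusions
(
    ExclusionTerm NVARCHAR(128) NOT NULL PRIMARY KEY
);

INSERT INTO dbo.TermExclusions (ExclusionTerm)
VALUES (N'product'), (N'customer'), (N'company');

In the Exclusion tab, you would then point the OLE DB connection manager at this database, select dbo.TermExclusions as the table, and select ExclusionTerm as the column.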

Advanced

This tab has four sections. In the Term Type section, select one of the radio buttons to specify whether the term is a noun, a noun phrase, or a combination of noun and noun phrase. For this transformation, a noun is a single noun; a noun phrase is at least two words, of which one is a noun and the other is a noun or an adjective. For example, car is a noun and red car is a noun phrase.

In the Score Type section, you choose either Frequency or TFIDF (Term Frequency-Inverse Document Frequency) by selecting the corresponding radio button. While Frequency represents the number of times the normalized term appears in the input, TFIDF is a statistical technique used to evaluate the importance of a term in a document. This importance weight increases with the number of times the term appears in the document, but is also offset by the commonality of the word across all the documents. It is an important measure used in text mining and is often used by search engines.
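A standard formulation of this score (this is the textbook TFIDF definition; consult Books Online for the exact formula the transformation implements) is:

TFIDF(T) = frequency(T) * log(N / n)

where N is the total number of rows in the input and n is the number of rows in which term T appears. For example, if bicycle occurs 50 times in a 1,000-row input but appears in only 100 of those rows, then with a base-10 logarithm TFIDF(bicycle) = 50 * log(1000 / 100) = 50 * 1 = 50. A term that appears in every row would score 0, reflecting that it carries no discriminating power.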

In the Parameters section, you specify values for the frequency threshold and the maximum length of term. Frequency Threshold is the minimum number of times a term must occur for it to be extracted, and Maximum Length Of Term applies to noun phrases only and specifies the maximum number of words in a noun phrase. For example, the noun phrase "top-of-the-line competition mountain bike" contains seven words.

In the Options section, you can select a case-sensitive match for extracting terms from the input column. When you select this option, you tell the transformation to treat uppercase words as different from lowercase words. In that case, bicycle and Bicycle will be treated as two separate terms. However, if Bicycle is the first word in a sentence, it will still be treated the same as bicycle.

As you can see, configuring this transformation for use in your package is not difficult; however, you need to work with it a few times to get the results you want to see, because this transformation behaves quite differently with different types of terms in the data, so you need to work out exactly what you are going to get for a given set of data. The term extraction process is based on an internal English-language dictionary and statistical model that may not be 100 percent accurate against the data set it is working on; however, understanding the process that this transformation uses to extract terms from the input column will help you get going. Following are the steps this transformation uses in the term extraction process:

Tokenizing

The Term Extraction transformation identifies words by first breaking down the text into sentences and then separating the words from the sentences. To break the text into sentences, this transformation reads ASCII line-break characters such as a carriage return (0x0d) or a line feed (0x0a). It is clever enough to recognize other characters as a sentence boundary, such as a hyphen (-) or an underscore (_), when neither the character to the left nor the character to the right of the hyphen or underscore is a letter. It can recognize an acronym separated by one or more periods (.) and does not break it into multiple sentences; for example, it does not convert G.T.I into multiple sentences. After separating the sentences, it breaks the sentences down further into separate words using spaces, tab characters, line breaks, and other word terminators, but preserves the words that are connected by hyphens or underscores. This transformation takes care of other special characters and is intelligent enough to extract words properly, sometimes by separating out the special characters and sometimes not. For example, (bicycle) is extracted as bicycle, whereas the term you're generates only you. Refer to Books Online for more details on how this transformation handles tokenization.

Tagging

After the words have been separated out by the tokenizing process, they are tagged as a singular noun, a plural noun, a singular or plural proper noun, an adjective, a comparative or superlative adjective, or a number; all other words are discarded. Depending upon your choice of term type in the Advanced tab, the Term Extraction transformation then keeps and tags only the words that match your selection: if you've selected Noun only, it tags only the singular and plural nouns and rejects all others; if you've selected Noun phrase only, it tags only the terms that have two or more words containing at least one noun.

Stemming

After tagging the words, especially plural nouns that are not lemmatized, this transformation stems those words to their dictionary form by using its internal dictionary. For example, it converts cars to car and lorries to lorry.

Normalizing

Once the words have been separated and tagged, they are normalized so that capitalized and non-capitalized words are treated alike. This process converts capitalized letters in a word to lowercase (for example, the first letter of a word may be capitalized because it is the first word in the sentence); so, for example, Cars becomes car. Note, however, that capitalized words that are not the first word in a sentence are marked as proper nouns, which are not included in the internal dictionary and hence are not normalized.

Fuzzy Grouping Transformation

This transformation is part of the Enterprise Edition of SQL Server and is designed to help in the data-cleaning process by grouping records that are likely to be duplicates and selecting a canonical record from each group to standardize the group. At run time, this transformation first groups together all the likely duplicate rows and then identifies a canonical row of data for each group. The identified canonical row and the other likely duplicate rows are output after being marked with proper tokens in additional output columns. The duplicate rows are not deleted from the data; instead, they are output but marked, so that you can identify and remove them from the data flow using downstream components (such as a Conditional Split) if you want. This transformation uses a comparison algorithm to compare rows in the transformation input. You can customize this algorithm to be exhaustive if you want to compare every row in the input to every other row in the input. Though this is quite an expensive method from the performance point of view, it can yield more accurate results. To compare rows against each other in the input, this transformation creates temporary tables in an SQL Server database. To perform groupings of likely duplicate rows on the input data, this transformation supports one input and one output only.

When you open the editor of the Fuzzy Grouping transformation, you will see the following three tabs.

Connection Manager Tab

You can create a new connection manager in this tab by clicking New, or you can select an already configured OLE DB connection manager. While specifying the connection manager here, you need to think seriously about the operations that this transformation will perform in the database using this connection. First of all, this transformation will create temporary tables and their indexes in the database, which requires that the user account you use in the connection manager setup have the necessary permissions to create tables and indexes in the database. Second, to compare rows against each other in the input data set, the algorithm that this transformation uses will create temporary tables much larger than the input data set. The sizes of the tables and indexes are proportional to the number of rows flowing through the transformation and the number of tokens you select to tokenize the data elements; this gets further aggravated if you choose to perform an exhaustive comparison. This may put quite stringent space requirements on the database to which this transformation is connecting. You must ensure that the reference database has enough free space to perform a fuzzy comparison, given the data set and the selections you make while configuring this transformation.
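A quick way to sanity-check that free space before running the package is the standard sp_spaceused procedure (a minimal sketch; the database name FuzzyWorkDb is hypothetical):

USE FuzzyWorkDb;
EXEC sp_spaceused;

With no arguments, sp_spaceused reports the database size and its unallocated space, which you can compare against the expected size of the temporary tables and indexes.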

Some performance controls are provided in this transformation. If you go to the Component Properties tab in the Advanced Editor, you will see four properties under the Custom Properties section (Delimiters, Exhaustive, MaxMemoryUsage, and MinSimilarity) that help you fine-tune the balance between accuracy and performance. The Delimiters property lets you specify additional characters that can be used to separate strings into multiple words; use only the ones required to generate an acceptable level of results. The Exhaustive property, discussed earlier, lets you specify whether you want to compare each input row with every other row in the input. If your data set is a few thousand rows long, you can use this option without any major impact; however, if you are dealing with a data set that has millions of rows, consider using it only during the design and debugging phase, or on a subset of data, to fine-tune the similarity threshold requirements. You can also control memory requirements by using the MaxMemoryUsage property, for which you can specify a value in megabytes to limit its usage, or specify 0 to enable dynamic memory usage based on requirements and available system memory. Finally, using the MinSimilarity property, you can specify the minimum similarity threshold as a value between 0 and 1. The closer the value is to 1, the fewer rows will be selected as likely duplicates, which means less processing is required by the transformation.

Columns Tab

In the Columns tab, you select the columns that you want to pass through as is and the columns you want to compare with other rows in the input. When you select the check box in front of a column, that column is selected for comparison and a row is added in the lower grid section of the editor window. In the grid, you can further specify the criteria by which you want this column to be compared against other columns. Using the Match Type field, you can specify whether the column is to be exactly matched or fuzzy-matched. For the columns for which you select a Match Type of Exact, the Minimum Similarity is automatically set to 1, indicating that the column has to be matched 100 percent; however, for the columns for which you set a Match Type of Fuzzy, you can specify a Minimum Similarity value between 0 and 1. A value closer to 1 indicates that a closer match is required, so you can use this setting to match rows that are only approximately the same. The transformation has to work harder to identify duplicates when Minimum Similarity values are closer to 1. While you specify minimum similarity values for each of the selected columns in this tab, you can also specify a minimum similarity threshold at the component level in the Advanced tab. At run time, the Fuzzy Grouping transformation measures the similarity and groups the rows together on the basis of the similarity score. Rows that fall below the specified minimum similarity score are not grouped together. In real life, you may not know the minimum similarity score that works for your data. You can determine it by running the Fuzzy Grouping transformation several times, using different minimum similarity threshold values, against subset or sample data that you can prepare using the Row Sampling or Percentage Sampling transformations. Once you find the right minimum similarity value, you can deploy it to production to group similar rows together.

So, in the output of the transformation, you will get all the input columns that you've selected to pass through or compare, the columns with standardized data (taken from the canonical row), and a column containing the similarity score for each column that you selected to participate in the fuzzy grouping. The aliases of these columns are specified in the Similarity Output Alias fields in this tab. Finally, you can also use Comparison Flags, such as Ignore Case or Ignore Character Width, to specify how this transformation should handle the string data in a column while doing comparisons.

Advanced Tab

At run time, the transformation tokenizes each of the columns selected for comparison and then compares them against the columns of other rows. Based on the algorithm and the settings, it then produces output, which is basically one output row for each input row with three additional columns that are specified in the Advanced tab. The first, the Input Key Column Name field, contains the value _key_in by default, which is the name of a new column the transformation adds to the output. You can change this name if you want. This new column, _key_in, contains a value that uniquely identifies each row. The second, the Output Key Column Name field, specifies the name assigned to another new column added to the output. By default, this name is _key_out, and it too can be changed. This new column, _key_out, contains a value that is the same for all the rows that have been identified as likely duplicates and grouped together. During run time, after having identified and grouped together the likely duplicate rows, the transformation selects a canonical row and copies its _key_in value into the _key_out column of all the rows in the group, making the two values the same for the canonical row. This makes it possible for you to identify the rows in a group, because they all have the same _key_out value, and the row whose _key_in value equals its _key_out value is the canonical row.

The third column added to the output is shown by the Similarity Score Column Name field, for which the default name is _score. This column holds values between 0 and 1, indicating the similarity of the input row to the canonical row. When the input row is itself selected as the canonical row, the value in the _score column is 1. For the other rows in the group, the value of _score varies depending on how closely they match the canonical row; the greater the similarity, the closer the value of _score is to 1. Exact duplicates of the canonical row are also included in the output and have a _score of 1.
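This key scheme makes de-duplication downstream straightforward. For example, to keep only the canonical rows, you could follow Fuzzy Grouping with a Conditional Split whose condition is the following expression (a sketch using the default column names; adjust it if you renamed them):

[_key_in] == [_key_out]

Rows satisfying this condition are the canonical rows; the rows that fail it are the likely duplicates, which you can route to a separate output to discard or log.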

You can specify a value for the Similarity Threshold attribute using the slider. You've used a similar attribute in the Columns tab, called Minimum Similarity; that value is applied against each column, whereas the Similarity Threshold is applied at the component level. Rows that have _score values smaller than the value you set for the Similarity Threshold will not be considered duplicates and hence will not be grouped together. As explained earlier, you may have to run the package containing a Fuzzy Grouping transformation several times to find the value that works with your data. Finally, you can select the token delimiters from Space, Tab, Carriage Return, and Line Feed by clicking the appropriate check boxes to tokenize data. You can also specify additional tokens in the Additional Delimiters field; this is the Delimiters property in the Advanced Editor discussed earlier in this transformation.

Fuzzy Lookup Transformation

Earlier, you used the Lookup transformation in a Hands-On exercise to look for exact matches of postcodes in a database table and to add a City column to the output; where there was no match for a postcode, the transformation extracted those rows into a flat file for your review. The Fuzzy Lookup transformation does more for you than the Lookup transformation, as it can do a fuzzy match and return one or more similar matches from the lookup table. The Lookup transformation's strength is that it can enhance data quality and standardize data by looking up matches from the reference table; however, this strength is limited by the fact that the lookup has to be an exact lookup, so, for example, a postcode has to match exactly. If you need to match the first name Stephen, which its owner may also write as Steve, you can't use an exact match. Your matching criterion has to pick up similarity in the text of a matching column to locate matching data, and this is precisely why the Fuzzy Lookup transformation was designed. This transformation enables you to correct, standardize, and enrich data by providing missing information using a fuzzy matching technique. It is available in the Enterprise Edition of SQL Server and requires a connection to an SQL Server 2005 or newer database to create temporary tables. As you would expect, the Fuzzy Lookup transformation has one input and one output to support its operation.

The Fuzzy Lookup transformation creates tokens of the data to be fuzzy-matched and uses a lookup technique to fuzzy-match these tokens. To create tokenized data, this transformation needs a connection to an SQL Server where it can create a temporary table and index to store the tokenized information. As it matches tokenized data, this transformation may return more than one match for a row. These matches carry different confidence levels that indicate how close each match is. This transformation also takes the minimum similarity value into consideration before outputting a row as a possible match.

The custom UI for the Fuzzy Lookup transformation is similar to that of the Lookup transformation and provides three tabs, as described next.

Reference Table Tab

In the Reference Table tab, you specify the connection manager that this transformation uses to connect to the reference table, along with the options that govern how the match index used for fuzzy lookup matches is created, used, and maintained. Once you specify a connection manager in the OLE DB Connection Manager field, you can then choose whether to use an existing index or create a new one.

The Generate New Index radio button specifies that a new match index is created each time the Fuzzy Lookup transformation runs. When you select this option, you can then select the reference table from the drop-down list in the Reference Table Name field. At run time, the Fuzzy Lookup transformation connects to the reference table using the specified OLE DB connection manager, creates a copy of the reference table, adds an integer key column to the copied table, and builds an index on the key column. Then this transformation tokenizes the data in the columns that you want to reference and stores the tokens in an index table called the match index.

You can also select the Store New Index option, which allows you to save the match index for use in subsequent runs of this transformation. On selecting this option, you can assign a name to the newly created index, which by default is FuzzyLookupMatchIndex. If you prefer to save the match index so that you can reuse it and avoid high processing costs at package run time, you may want to keep this index fresh and up to date at all times; that is, you may want to update the match index whenever the reference table is updated with new records. For this, select the Maintain Stored Index check box. The transformation then creates triggers on the reference table to keep the match index table synchronized with the reference table. You may prefer to use the Maintain Stored Index option to keep the match index updated; however, before using this option, understand the effect of triggers on the performance and maintainability of the reference table. Refer to Microsoft SQL Server Books Online for more details on how to manage triggers when using this option with the Fuzzy Lookup transformation.
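If you want to see which triggers the Maintain Stored Index option has placed on your reference table, a simple catalog query works (a sketch; the reference table name dbo.Customers is hypothetical):

SELECT name, is_disabled
FROM sys.triggers
WHERE parent_id = OBJECT_ID(N'dbo.Customers');

This lists the triggers attached to the table so you can review them before deciding whether the maintenance overhead is acceptable.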

The process of creating a match index can be expensive, depending upon the size of the data you're dealing with, so this transformation provides a facility with which you can reuse an existing match index if the reference data is fairly static. When you select the Use Existing Index radio button, you can then choose a match index table from the drop-down list, which this transformation can use for repeated operations.

If you are dealing with millions of rows in the reference table, the recommended way to implement the Fuzzy Lookup transformation is to generate and save the match index the first time you run the package containing this transformation, and then reuse it via the Use Existing Index option in subsequent executions of the package.
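One way to make that pattern explicit is to check whether the stored match index already exists before deciding how to configure the package run (a sketch; dbo.FuzzyLookupMatchIndex is the editor's default name, so adjust it if you stored the index under another name):

IF OBJECT_ID(N'dbo.FuzzyLookupMatchIndex', N'U') IS NOT NULL
    PRINT 'Match index found; run the package configured with Use Existing Index.';
ELSE
    PRINT 'No match index yet; run the package with Generate New Index and Store New Index.';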

Columns Tab

In the Available Input Columns list, you can select the Pass Through check boxes for the columns that you want passed through to the output as is, and in the Available Lookup Columns list you can select the check boxes for the columns that you want added to the output for the fuzzy lookup matching rows. You can create mappings in this tab between the Available Input Columns and the Available Lookup Columns on which you want to perform lookup operations. The mappings you create here look different from those you create in the Lookup transformation because they are displayed as dotted lines; the dotted line represents a fuzzy match. You can perform exact matches for some of the columns in the Fuzzy Lookup transformation as long as you keep at least one column using a fuzzy match. To change the match type from fuzzy to exact, you have to open the Advanced Editor and change the JoinType property of the input column in the Input And Output Properties tab.


Advanced Tab

Here, in the first option, you can set the "Maximum number of matches to output per lookup" by specifying an integer value. At run time, the transformation identifies matches considering the similarity thresholds and can return matches up to the number you have specified in this option; these matches may contain duplicates if you're looking for more than one output per lookup. Next, you can specify the Similarity Threshold value using the slider. This value can be a floating-point value from 0 to 1. When you specify a similarity threshold here, you apply it at the component level. You can also apply a similarity threshold at the column level, also known as the join level, using the MinSimilarity property of the input columns, which is accessible in the Input and Output Properties tab of the Advanced Editor. The closer its value is to 1, the closer a row or column must match against the reference table to qualify as a duplicate. As mentioned, the output also contains a column for a confidence score. The combination of similarity score and confidence determines how close the input row or column is to the reference table row or column. The similarity score describes the closeness, or textual similarity, between the input columns and the reference table columns, whereas the confidence describes the quality of the fuzzy match. Columns having a high similarity score and a high confidence score are the most likely candidates for duplicates; however, not all columns having a high similarity score will also have a high confidence score. There is a subtle difference between the two terms: for example, if you are looking for a match on a series of cars, then 3 series, 5 series, or 7 series returns a high similarity score, but the confidence score will be poor. Similarly, if you are looking for PC and the only term used in the reference table is Personal Computer, then the confidence for this match will be high, whereas the similarity, as you can see, is low.
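In practice, you often gate the output on both measures with a downstream Conditional Split. A sketch of such a condition, using the default output column names and illustrative threshold values that you would tune for your own data:

[_Similarity] >= 0.80 && [_Confidence] >= 0.70

Rows passing this test go to your auto-match path, while the rest can be routed to a file or table for manual review.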

At run time, the transformation creates a match index or uses an existing one to perform the fuzzy lookup, and it outputs the pass-through columns, the columns added from the lookup table, the additional columns carrying the component-level similarity score and confidence level, and a column-level similarity score column for each column that participates in performing the fuzzy lookup.

Finally, you can select the token delimiters by clicking the check boxes provided for the space, tab, carriage return, and line feed default delimiters. You can also specify Additional Delimiters in the provided field. Delimiters are the characters used to tokenize and separate fuzzy-match fields into the words used for matching and scoring.


Other Considerations

Having configured all the options, you are ready to run the transformation. However, consider the following performance issues before you begin:

- As this transformation needs a connection to an SQL Server 2005 or later database to create and maintain a match index table, connecting to a database server that has lots of free space is advisable. At index creation time, the reference table is locked by this transformation, so consider using another machine for the reference table if multiple users access this table. Also, it is a good idea to copy the reference table to a non-production server if the data changes regularly, especially during package execution, in which case results may be inconsistent.

- The Exhaustive property, which is available in the Custom Properties section of the Component Properties tab in the Advanced Editor, is a Boolean field and can be set to True or False. This property yields more accurate results if set to True. However, setting the Exhaustive property to True should be done with care, because it means that each row in the input will be matched against every row in the reference table. Also, to perform this match, the entire reference table will be loaded into main memory, which puts high pressure on memory requirements. If your reference table is extremely large and you have little free memory available, avoid using this option; however, for a smaller reference table and with lots of free memory on the system, setting this option to True will yield better results.

- You can specify the maximum amount of memory, in megabytes, that this transformation is allowed to use with the MaxMemoryUsage option. Specifying a maximum amount of memory that matches its requirements will greatly improve its performance. However, if enough free memory is not available on the system, or you do not know how much memory the transformation will require, you can specify a value of 0, which indicates that the transformation will manage memory dynamically based on its requirements and the available free memory.

- You can manage memory on the basis of input rows as well. If you have many input rows to process, you can set WarmCaches to True to indicate that the match index and the reference table are to be loaded into memory before the transformation starts processing input rows. This can greatly enhance performance by caching the reference data and index in main memory. Be aware that after tokenizing, only the tokenized tables are used, not the original reference data set.
