1. Trang chủ
  2. » Công Nghệ Thông Tin

Hands-On Microsoft SQL Server 2008 Integration Services part 50 docx

10 247 0
Tài liệu đã được kiểm tra trùng lặp

Đang tải... (xem toàn văn)

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang 10
Dung lượng 712,53 KB

Các công cụ chuyển đổi và chỉnh sửa cho tài liệu này

Nội dung

Hands-On: Removing Duplicates from Owners DataThe Fuzzy Lookup and Fuzzy Grouping transformations are the data flow components that you can use for data cleaning purposes.. You will remo

Trang 1

Hands-On: Removing Duplicates from Owners Data

The Fuzzy Lookup and Fuzzy Grouping transformations are the data flow components that you can use for data cleaning purposes One of the main issues with data quality is

to remove duplication in data, whether it occurs at loading time or it already exists in the data that you want to cleanse Until now, you’ve worked on a couple of instances to remove duplicates in the earlier Hands-On exercises, but those instances were dealing with exact duplicates In this exercise, you will deal with fuzzy duplication of data You will work with these components to remove exact as well as fuzzy duplicates from the input data

The scenario is that you are maintaining an Owner table that contains contact details for the owners of your products You regularly receive an Owner’s data feed that sometimes contains duplicate data This duplicate data is not consistent, as users tend

to provide their details differently at different occasions You need to make sure no duplicate record is added to the Owner table

Method

You have the Owner table in the Campaign database and receive OwnersFeed.xls files regularly that contain owner records This Excel file can contain duplicate records for the same person The complication, however, is that these records may not be exact duplicates, as persons provide their contact details differently at different occasions Our sample OwnersFeed.xls file contains 13 records, of which 3 are unique records and the other 2 records have five variants each, with different name spellings and address details; one of these 2 records already exists in the Owner table Open the OwnersFeed .xls file to have a look at the incoming data (see Figure 10-25)

Figure 10-25 Incoming data contains variants of duplicate records

Trang 2

The owner with first name Johnathon already exists in the table and has five

different variants of contact details The other owner with first name Kathrine is a

new owner, but has five different variants of contact details in the feed You want to

load only four records into the Owner table consisting of three unique records plus

an owner record for Kathrine, while all the duplicate variants and the owner records

for Johnathon should be removed You will remove the exact duplicates from the

OwnersFeed file using the Sort and Lookup transformation After removing exact and

matched duplicates, you will use the Fuzzy Lookup transformation to remove fuzzy

matched duplicates At this stage only the duplicate rows that exist in variant forms in

the pipeline would be left You will remove these duplicate variants using the Fuzzy

Grouping transformation

Exercise (Create Removing Duplicate Owners Package)

Start by creating a new package, adding the Data Flow task into this package, and then

adding an Excel source to get the data in the data flow

1 Open the Data Flow transformations project in BIDS Add a New SSIS Package

in the project and then rename it as Removing Duplicates.dtsx.

2 Drop a Data Flow task from the Toolbox onto the Control Flow surface and

rename it Removing Duplicates from OwnersFeed.

3 Double-click the Removing Duplicates from OwnersFeed task to go to the Data

Flow panel Drop an Excel source from the Toolbox onto the Data Flow surface

Rename the Excel source OwnersFeed.

4 Double-click OwnersFeed to open the Excel Source Editor Click the New

button next to the OLE DB Connection Manager field Type C:\SSIS\RawFiles\

OwnersFeed.xls in the Excel File Path field Leave the “First row has column

names” option checked Click OK to add an Excel Connection Manager and

return to the Excel Source Editor Leave Table Or View selected in the Data

Access Mode field Click in the Name of the Excel sheet field and select Owners$ from the drop-down list

5 Go to the Columns page and verify that all the columns have been selected Click

OK to close the Excel Source Editor

6 Take an opportunity to rename the Excel Connection Manager as OwnersFeed

Connection.

7 Right-click anywhere on a blank surface in the Connection Managers area and

choose New Connection from the context menu Select Excel from the Connection

Manager Type list and click Add This will open the Excel Connection Manager

dialog box Type C:\SSIS\RawFiles\DuplicateOwners.xls in the Excel File Path

field Leave the First Row Has Column Names check box selected and click OK

to add this connection manager Once it is added, rename this connection manager

DuplicateOwners Connection.

Trang 3

Exercise (Remove Exact Duplicates)

If you look at the data in the Figure 10-25, you will see that Johnathon Skinner and Kathrine Morris have duplicate records in the OwnersFeed You will use a Sort transformation to remove these duplicates Then you will check the OwnersFeed against the Owner table in the database to see whether any of the record exists in the Owner table Johnathon Skinner also exists in the Owner table and will be removed from the input data flow using a Lookup transformation The other variants of these owners will remain unaffected and will have to be dealt with separately

8 Drop the Sort transformation onto the Data Flow surface and join OwnersFeed with it using the green path

9 Double-click the Sort transformation to open the editor Select all the Available Input Columns starting with FirstName, as shown in Figure 10-26 Select the check box for “Remove rows with duplicate sort values” and click OK to close the

editor Rename the Sort transformation Removing Exact Duplicates by Sort Key.

10. Add a Lookup transformation to the data flow and join the Sort transformation

by dragging the green arrow onto it

11. Double-click the Lookup transformation to open the Lookup Transformation Editor Select “Redirect rows to no match output” option in the “Specify how to handle rows with no matching entries” field Click the Connection page link in the pane

12. Add the localhost.Campaign connection manager using the New button next to the OLE DB connection manager field Then select [dbo].[owner] table from the drop-down list in the “Use a table or a view” field

13. Go to the Columns tab, and map all the Available Input Columns to the Available Lookup Columns for the matching names as shown in Figure 10-27 Click to select the OwnerID column in the Available Lookup Columns to add it as a new column in the output In this transformation, you are matching input records with the records in the Owner table, and for the matching record, you are adding OwnerID in the output to capture the ID of the record with which a match has

been found Click OK to close this editor Rename the Lookup transformation

Removing Exact Duplicates by Lookup.

14. Drop an Excel destination just below Removing Exact Duplicates By Lookup and join it to the Excel Destination You will be asked to select an Output in the Input Output Selection dialog box Select Lookup Match Output to connect to

Excel Destination Rename this Excel Destination Exact Duplicates.

15. Double-click the Excel Destination to configure it Choose DuplicateOwners Connection in the OLE DB Connection Manager field if it’s not already selected

Trang 4

16. Click the New button next to the Name of the Excel sheet field and verify the

CREATE TABLE statement that it is creating an Exact Duplicates table—i.e.,

a worksheet in Excel Click OK to accept Select Exact_Duplicates in this field

Go to the Mappings page to create the necessary column mappings that happen

automatically Click OK to close this editor

Figure 10-26 Removing exact duplicates using a Sort transformation

Trang 5

Exercise (Remove Fuzzy Duplicates)

You’ve extracted exact duplicates from the Lookup transformation to an Excel destination Now you will redirect the remaining rows to a Fuzzy Lookup and extract the duplicates using fuzzy match logic You will be removing all the variants of Johnathon Skinner

17. Drop a Fuzzy Lookup transformation onto the Data Flow surface to the right

of the Removing Exact Duplicates by Lookup component Click Removing Exact Duplicates by Lookup and drag and drop the second green arrow, which is

Figure 10-27 Removing exact duplicates using the Lookup transformation

Trang 6

actually the Lookup No Match Output, onto the Fuzzy Lookup transformation

You have redirected all the remaining rows that did not match in the exact lookup operation to the Fuzzy Lookup transformation

18. Double-click the Fuzzy Lookup transformation and make sure the Generate New Index radio button is selected Select the [dbo].[Owner] table from the drop-down

list in the Reference Table Name field Go to the Columns tab

19. Select the check box of the OwnerID column in the Available Lookup Columns

to add it in the fuzzy matched records, as shown in Figure 10-28

20. Go to the Advanced page and set the Maximum number of matches to output

per lookup equal to 3 Slide the Similarity Threshold slider to 0.30 Click OK

Figure 10-28 Identifying fuzzy matched records with Fuzzy Lookup

Trang 7

to close the Transformation Editor Rename this transformation as Identifying

Duplicates by Fuzzy Lookup.

21. Drop the Conditional Split transformation just below the Identifying Duplicates

by Fuzzy Lookup transformation and connect both the components using a green data flow path Double-click the Conditional Split to edit it You need to identify the duplicates using _Similarity and _Confidence columns added by the Fuzzy Lookup transformation in the pipeline You may have to run your package several times before you can get workable values for this identification Here, I’ve already done the work for this data to identify these values; add the following in the Condition field:

_Similarity >= 0.60 && _Confidence >= 0.50

T ype Fuzzy Lookup Matches in the Output Name field and Remaining Owners

in the Default Output Name field, as shown in Figure 10-29 Click OK to close

the editor Rename this transformation Splitting Fuzzy Matched Duplicates.

22. Drop an Excel Destination below the Conditional Split transformation and join both of these components using a green arrow Select Fuzzy Lookup Matches in the Output field of the Input Output Selection dialog box Rename this Excel

Destination Fuzzy Matched Duplicates.

23. Double-click the Excel Destination to configure it Select DuplicateOwners Connection in the OLE DB Connection Manager field if it is not already selected

24. Click the New button next to the Name of the Excel sheet field, verify in the CREATE TABLE statement that is creating the Fuzzy Matched Duplicates table—i.e., the worksheet in Excel—click OK to accept, and then select the Fuzzy_Matched_Duplicates in this field

25. Go to the Mappings page to create necessary column mappings automatically Note that in addition to the _Similarity and _Confidence columns, the Fuzzy Lookup transformation has added one _Similarity_ColumnName column for each column that participated in the fuzzy match Click OK to close this editor

Exercise (Remove Duplicates by Fuzzy Grouping)

Here you will remove duplicate variants of the Kathrine Morris These records do not exist in the reference table but appear multiple times in the input file in variant forms You will capture the Remaining Owners from Split transformation and use a fuzzy grouping to identify canonical row and the likely duplicates

26. Drop Fuzzy Grouping in the pipeline to the right of the Splitting Fuzzy Duplicates component Click Splitting Fuzzy Duplicates and then drag the available green arrow from the Splitting Fuzzy Duplicates and drop it on the Fuzzy Grouping transformation

Trang 8

27. Double-click the Fuzzy Grouping transformation and verify that the localhost.

Campaign is selected in the OLE DB Connection Manager field Go to the

Columns tab Note that all the Available Input Columns have Pass Through check boxes selected Uncheck all of them Select only the columns that are coming from

OwnersFeed—i.e., from FirstName to Country, as shown in Figure 10-30 As you

select these columns, a line corresponding to each selected column will be added

in the grid Type 0.25 in the Minimum Similarity column for the FirstName and

LastName rows

Figure 10-29 Splitting fuzzy matched records from the Data Flow.

Trang 9

28. Go to the Advanced tab and set the Similarity threshold to 0.50 using the slider Click OK to close this editor Rename the Fuzzy Grouping transformation

Identifying Duplicates by Fuzzy Grouping.

29. Drop the Conditional Split transformation just below the Identifying Duplicates

by Fuzzy Grouping transformation and connect both the components using a green data flow path Double-click the Conditional Split to edit it You need

to split the duplicates using _key_in and _key_out columns added by the Fuzzy

Figure 10-30 Identifying and grouping duplicates using the Fuzzy Grouping transformation

Trang 10

Grouping transformation in the pipeline Add the two rows in the grid by adding

the following in the Condition field:

_key_in == _key_out

_key_in != _key_out

T ype Canonical Row in the first row and Fuzzy Grouped Matches in the second

row of Output Name field, as shown in Figure 10-31 Click OK to close the

editor Rename this transformation Splitting Fuzzy Grouped Duplicates.

Figure 10-31 Splitting fuzzy grouped and unique records in the data flow

Ngày đăng: 04/07/2014, 15:21