FACULTY OF COMPUTER SCIENCE AND ENGINEERING
GRADUATION THESIS
CLUSTERING-BASED APPROACH FOR
DETECTING DATA ANOMALIES
Ho Chi Minh City, August 2021
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering — GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Computer Science (Note: the student must attach this sheet to the first page of the report)
STUDENT'S FULL NAME: NGUYỄN ĐÌNH KHƯƠNG — Student ID (MSSV): 1752306
1. Thesis title:
DETECTING DATA ANOMALIES
2. Tasks (requirements on content and initial data):
- Learn the requirements, analysis, design, and implementation of a data cleansing system running on a web app platform. Research and apply edit-based similarity algorithms, using knowledge and methodologies from Algorithm Design and Analysis, Database Management Systems, Clustering Methods, and Web Development to provide a reasonable and optimized approach to detecting and clustering anomalous data, ready for further processing.
- Read scientific papers and propose a solution to prevent inconsistent and duplicate data based on clustering methods.
- Research related work - other data cleansing systems such as GoogleRefine, BigDansing, and NADEEF - thereby making reasonable assessments of and comparisons with the advantages and disadvantages of the current system. After that, develop further functions, performance improvements, and system optimizations.
- Apply k-NN methods (Levenshtein Distance, Damerau-Levenshtein Distance, Hamming Distance), Similarity methods (Jaro, Jaro-Winkler), and Key Collision methods (Fingerprint, N-gram Fingerprint) for detecting and clustering anomalies.
- Test and evaluate the proposed system.
August 10, 2021
THESIS DEFENSE EVALUATION SHEET
(For the supervisor)
1. Student's full name: Nguyễn Đình Khương
2. Topic: OPENK: DATA CLEANSING SYSTEM – A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES
3. Supervisor's full name: PGS.TS Đặng Trần Khánh
4. Overview of the report:
6. Main strengths of the thesis:
Developed a cleansing tool for improving (big) data quality in order to achieve high utility in businesses.
Moreover, the student completed the following:
- Studied Pandas, NumPy, the JSON Python library, and other relevant programming tools
- Investigated algorithms for measuring text similarity using different methods
- Studied data cleansing and validation tools such as OpenRefine and Cerberus
- Read scientific papers and proposed a solution to prevent inconsistent and duplicate data based on the clustering method
- Built a visualization method giving users a better view of the collected data
- Built an API-based library for the developer community
7. Main shortcomings of the thesis:
The thesis presentation can be improved
8. Recommendation: Approved for defense □  Needs additions before defense □  Not approved for defense □
9. Three questions the student must answer before the Committee:
a. Point out a better functionality of OpenK compared with the known existing work/systems?
10. Overall assessment (in words: excellent/very good/good/average): Excellent  Score: 10/10
Signature (full name)
PGS.TS Đặng Trần Khánh
August 3, 2021
THESIS DEFENSE EVALUATION SHEET
(For the reviewer)
1. Student's full name: Nguyễn Đình Khương
2. Topic: OpenK: Data Cleansing System - A Clustering-based Approach for Detecting Data Anomalies
a. Would you please show a use case in which a user can benefit from your system?
b. Any comparison with some related work (e.g., OpenRefine)?
c.
10. Overall assessment (in words: very good/good/average): Good  Score: 9/10
Signature (full name)
First and foremost, I would like to thank my supervisor, Dr. Dang Tran Khanh, not only for his academic guidance and assistance, but also for his patience and personal support, for which I am truly grateful.
I guarantee that this research is my own, conducted under the supervision and guidance of Dr. Dang Tran Khanh. The results of my research are legitimate and have not been published in any form prior to this. All materials used within this research were collected by myself from various sources and are appropriately listed in the references section. In addition, within this research I also used the results of several other authors and organizations; they have all been aptly referenced. In any case of plagiarism, I stand by my actions and am responsible for them. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringements conducted within my research.
At the moment, massive amounts of data are created every second over the net, so making the most efficient decisions has become a critical goal. Assume that we had all of the information, but that extracting the valuable knowledge would be extremely difficult. The reason for this assumption is that data is not always clean or at least correct, since data obtained from many sources may be redundant, and some of it may be duplicated. These data must be cleaned before they can be utilized for further processing.

Any inconsistencies or duplication in the datasets should be detected using a detection procedure. Windowing, blocking, and machine learning are among the methods utilized to identify anomalous data. The goal of this thesis is to offer OpenK, a simple yet efficient data cleansing system based on clustering approaches. In this scenario, a cluster will comprise all data that are, under similarity-based assumptions, detected by several techniques: Nearest Neighbor (Levenshtein Distance, Damerau-Levenshtein Distance, Hamming Distance), Similarity Measurement (Jaro Similarity, Jaro-Winkler Similarity) and Key Collision (Fingerprints, N-gram Fingerprints). The tool is evaluated to see how efficient it is and is compared to other tools for a better assessment. We used the airlines dataset from https://assets.datacamp.com/production/repositories/5737/datasets and a special case study, a Real Estate dataset crawled from https://batdongsan.com.vn. OpenK also aids the user in loading and viewing data. Besides that, CRUD procedures, Pagination, Toggle column ON/OFF, Sort column, and Search keywords are provided for analyzing and wrangling input data.

Keywords: Data Cleansing, Levenshtein Distance, Jaro-Winkler Similarity, Fingerprints, Anomaly Detection
1.1 Problem Statement 10
1.2 Objective 10
1.3 Scope 10
1.4 Thesis Structure 11
2 Theoretical Background 13
2.1 Related Works 13
2.2 Data Anomalies Detection 13
2.2.1 Conception 13
2.2.2 Existing methods 14
2.3 Clustering Methods 15
2.3.1 Key Collision 15
2.3.1.a Fingerprint 16
2.3.1.b N-gram Fingerprint 17
2.3.2 Nearest neighbors 18
2.3.2.a Hamming distance 19
2.3.2.b Levenshtein distance 20
2.3.2.c Damerau-Levenshtein distance 22
2.3.2.d Jaro Distance - Jaro-Winkler Distance 23
3.1 General Architecture 26
3.1.1 Main components 26
3.1.2 Detecting anomaly execution flow 28
3.1.3 Use-case of clustering data site 29
3.1.3.a Actor determination and following use-case 29
3.1.3.b Use-case diagram and specification 30
3.2 Existing System and Design 36
4 System Implementation 39
4.1 Technologies and Framework 39
4.2 Function implementation 40
5 System Evaluation 50
6 Thesis Denouement 54
6.1 Achievements 54
6.2 Assessment of Thesis Connotation 55
6.3 Future Advancement 55
List of Figures
2.1 Example of applying fingerprint algorithm for name 16
2.2 Formula of Hamming distance calculation 19
2.3 3-bit binary cube for finding Hamming distance 20
2.4 Levenshtein Distance calculation formula 21
2.5 Example of Levenshtein distance calculation table 22
2.6 Example of Damerau-Levenshtein distance calculation table 22
2.7 Formula of Jaro similarity calculation 23
2.8 Jaro-Winkler similarity calculation example 24
2.9 Comparison of barcode correction using different techniques 25
3.1 Overall architecture of OpenK system 26
3.2 Data type format converter 27
3.3 Data cleansing component illustration 27
3.4 Clustering Operations illustration 28
3.5 Activity diagram of OpenK system 28
3.6 Use case diagram of Data site of OpenK system 31
3.7 Use case specification of viewing data 32
3.8 Use case specification of paging data 32
3.9 Use case specification of searching data keywords 32
3.10 Use case specification of sorting data column 33
3.11 Use case specification of export data 33
3.13 Use case specification of Manage data cluster 34
3.14 Use case specification of cluster data using knn method 35
3.15 Use case specification of cluster data using similarity method 35
3.16 Use case specification of cluster data using key collision method 36
3.17 Overall architecture of BigDansing 37
3.18 Overall architecture of NADEEF 38
4.1 Relation diagram of OpenK routing system 42
4.2 Flow chart diagram of Upload function implementation 43
4.3 Flow chart diagram of Data function implementation 44
4.4 Class diagram of clustering method 45
4.5 Flow of clustering data with KNN class 46
4.6 Flow of clustering data with Similarity class 47
4.7 Implementation code of clustering data with Fingerprint algorithm 48
4.8 Flow of clustering data with Fingerprint algorithm 49
5.1 Time performance for loading & visualizing input dataset of OpenK and OpenRefine 51
5.2 Time performance for detecting & clustering input dataset of OpenK and OpenRefine 52
5.3 Error percentage for detecting & clustering input dataset of OpenK and OpenRefine 52
1 Introduction
“Most of the world will make decisions by either guessing or using their gut.
They will be either lucky or wrong.”
– Suhail Doshi, CEO of Mixpanel
In the scope of this thesis, we will carry out the following tasks:
• Study data analysis tools such as Pandas, NumPy, the JSON Python library, and more
• Study algorithms for measuring text similarity using different methods such as edit distance, similarity-based metrics, and the fingerprint method
• Study about cleansing and validating data tools such as OpenRefine, Cerberus
• Propose a solution to prevent inconsistent and duplicate data
• Build a visualization method for users to have a better view about the collected data
• Build an API library for the developer community
• Chapter 1: Introduction - This chapter covers (i) the problem statement - the problem this thesis aims to solve; (ii) the list of missions that need to be accomplished; (iii) the scope of the thesis - the list of tasks to carry out; and (iv) the structure of the thesis.
• Chapter 2: Theoretical Background - This chapter (i) first mentions the related works of other authors, so that we have an overview of what has been done previously; (ii) covers data anomaly detection - its conception and existing methods; and (iii) acknowledges the clustering methods - Key Collision and Nearest Neighbors - describing their concepts and how to work with them.
• Chapter 3: Methodologies and Design - This chapter consists of (i) the system objectives, clearly showing the target of the cleansing system; (ii) the general architecture, giving an overall view of the system design; (iii) the execution flow, showing how individual statements, components, or function calls of an imperative program are executed or evaluated; (iv) existing designs; and lastly (v) achievements and discussion.
• Chapter 4: System Implementation - This chapter covers (i) the technologies and frameworks used, namely Pandas, NumPy, JSON web libraries for Python, the fingerprints library, and more; and (ii) the function implementation - the list of functions, classes, and parameters being adopted.
• Chapter 6: Thesis Denouement - This chapter summarizes the thesis, including (i) the final culmination; (ii) the thesis evaluation, to ensure the contribution of the thesis; and finally (iii) the future advancement for later development of the cleansing system.
• Chapter 7: References - All referred results and references to other papers are denoted here.
2 Theoretical Background
For the time being, there are scads of systems and research on data cleansing. In [1], BIGDANSING - the Big Data Cleansing System - was presented to tackle the problem of cleaning big data, detecting anomalies, and generating possible fixes for them. Moreover, with [2], NADEEF - an open-source single-node platform supporting both declarative and user-defined quality rules - research efforts have targeted the usability of a data cleansing system, but at the expense of performance and scalability. Google Refine, later known as OpenRefine [3], is a powerful Java-based tool that allows users to load data, understand it, clean it up, reconcile it, and augment it with data coming from the web - all from a web browser and the comfort and privacy of the user's computer. OpenRefine notes that its clustering works only at the syntactic level (the character composition of the cell value) and, while very useful to spot errors, typos, and inconsistencies, is by no means enough to perform effective semantically-aware reconciliation - but one cannot deny that it is very effective in treating duplicate and inconsistent data points. Finally, [4] provides powerful yet simple and lightweight data validation functionality out of the box and is designed to be easily extensible, allowing for custom validation.

In this thesis, we approach the problem of detecting data anomalies using clustering-based techniques, like what OpenRefine did previously.
2.2.1 Conception
In data analysis, anomaly detection (also outlier detection) is the identification of rare items, events, or observations which raise suspicions by differing significantly from the majority of the data.
Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.
In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.

Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the model.
2.2.2 Existing methods
There are heretofore several designs [10] for detecting data anomalies:
1 One-class support vector machines [11]
2 Fuzzy logic-based outlier detection
3 Deviations from association rules and frequent itemsets
4 Bayesian networks [12]
5 Hidden Markov models (HMMs) [12]
6 Replicator neural networks [12], autoencoders - variational autoencoders [13]
7 Long Short term memory neuron network [14]
The performance of different methods depends a lot on the data set and parameters, and methods have little systematic advantage over one another when compared across many data sets and parameters [15].
In this context, we use clustering-based techniques for detecting anomalies - Key Collision & K-Nearest Neighbors.
In OpenK, clustering on a column is a great way to look for inconsistencies in your data and fix them. Clustering uses a variety of comparison methods to find text entries that are similar but not exact, then shares those results with you so that you can merge the cells that should match. Where editing a single cell or text facet at a time can be time-consuming and difficult, clustering is quick and streamlined.
In OpenK, clustering always requires the user to confirm the merge and edit the name of the cluster - the value chosen becomes the cluster's name and is applied to all the elements of the cluster.
In order to do its analysis, clustering performs a lot of cleaning actions behind the scenes, but only the merges that you accept affect your data. Understanding the various behind-the-scenes cleanups can assist you in determining which clustering approach is the most accurate and successful.
2.3.1 Key Collision
Key Collision methods are based on the idea of creating an alternative representation of a value (a "key") that contains only the most valuable or meaningful part of the string, and binning together different values based on the fact that their key is the same (hence the name "key collision").

2.3.1.a Fingerprint
Fingerprint is the least likely to produce false positives, so it's a good place to start. It does the same kind of data cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split up all strings (words) and sort them alphabetically (so "Khương, Nguyễn Đình" becomes "dinh khuong nguyen").

The process that generates the key from a string value is the following (note that the order of these operations is significant):
+ Remove leading and trailing whitespace
+ Change all characters to their lowercase representation
+ Remove all punctuation and control characters
+ Normalize extended Western characters to their ASCII representation (for example "gödel" → "godel")
+ Split the string into whitespace-separated tokens
+ Sort the tokens and remove duplicates
+ Join the tokens back together
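As a concrete illustration, the steps above can be sketched in Python. This is a minimal sketch, not OpenK's actual implementation; the function name and the exact handling of control characters are assumptions made here for clarity.

```python
import re
import string
import unicodedata

def fingerprint(value):
    """Key Collision fingerprint key, following the steps listed above (a sketch)."""
    key = value.strip()                        # remove leading/trailing whitespace
    key = key.lower()                          # lowercase everything
    # remove punctuation (control characters omitted in this sketch)
    key = re.sub(r"[{}]".format(re.escape(string.punctuation)), "", key)
    # normalize extended Western characters to ASCII (e.g. "gödel" -> "godel")
    key = unicodedata.normalize("NFKD", key).encode("ascii", "ignore").decode("ascii")
    tokens = key.split()                       # split into whitespace-separated tokens
    tokens = sorted(set(tokens))               # sort and remove duplicates
    return " ".join(tokens)                    # join the tokens back together
```

With this sketch, "Tom Cruise" and "Cruise, Tom" collide on the same key, which is exactly the behavior the method relies on.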
Figure 2.1: Example of applying fingerprint algorithm for name
Due to the normalization of whitespace, the lowercasing of characters, and the deletion of punctuation, the fingerprint does not distinguish these portions of the strings. Because these string characteristics are the least significant in meaning distinction, they are the most varying sections of the strings, and their removal has a considerable advantage in developing clusters.
Since the tokens of the string are sorted, the order of the provided tokens doesn't matter ("Tom Cruise" and "Cruise, Tom" finish with the same fingerprint and end up in the same cluster).
Normalizing extended Western characters plays the role of reproducing data entry mistakes made when entering extended characters with an ASCII-only keyboard. Note that this procedure can also lead to false positives. For example, "gödel" and "godél" would both end up with "godel" as their fingerprint, but they are likely to be different names, so this might work less effectively for data sets where extended characters play a substantial differentiation role.

2.3.1.b N-gram Fingerprint
N-gram Fingerprint allows us to change the n value to anything we want, and it will generate n-grams of size n (after cleaning), alphabetize them, and then reassemble them into a fingerprint. A 1-gram fingerprint, for example, will simply sort all of the letters in the cell into alphabetical order by dividing them into segments of one character. A 2-gram fingerprint will locate all two-character segments, eliminate duplicates, alphabetize them, and reassemble them (for example, "banana" yields "ba an na an na", which becomes "anbana").
alpha-This can aid in matching cells with typos and spaces (for example, matching
"lookout" and "look out," which are not identified because fingerprinting separates words).This can assist The greater the n number, the lower the clusters Keep a watch on mis-spelled values that are nearly one another (for example, the ’wellington’ and the ’ElginTown’) with 1 gram
+ Change all characters to their lowercase representation
+ Remove all punctuation, whitespace, and control characters
+ Obtain all the string n-grams
+ Sort the n-grams and remove duplicates
+ Join the sorted n-grams back together
+ Normalize extended Western characters to their ASCII representation (for example "gödel" → "godel")
So, for example, the 2-gram fingerprint of "Paris" is "arispari" and the 1-gramfingerprint is "aiprs"
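The "Paris" and "banana" examples above can be reproduced with a short sketch (again an illustrative assumption, not OpenK's production code):

```python
import re
import string
import unicodedata

def ngram_fingerprint(value, n=2):
    """N-gram fingerprint key, following the steps listed above (a sketch)."""
    key = value.lower()
    # remove punctuation AND whitespace (unlike the plain fingerprint)
    key = re.sub(r"[{}\s]".format(re.escape(string.punctuation)), "", key)
    # normalize extended Western characters to ASCII
    key = unicodedata.normalize("NFKD", key).encode("ascii", "ignore").decode("ascii")
    # collect all n-grams, deduplicate, sort, and rejoin
    grams = [key[i:i + n] for i in range(len(key) - n + 1)]
    return "".join(sorted(set(grams)))
```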
Why is this useful? In practice, using big values for n-grams doesn't yield any advantage over the previous fingerprint method, but using 2-grams and 1-grams, while yielding many false positives, can find clusters that the previous method didn't find, even with strings that have small differences, at a very small performance price.
For example "Krzysztof", "Kryzysztof", and "Krzystof" have different lengthsand different regular fingerprints, but share the same 1-gram fingerprint because theyuse the same letters
2.3.2 Nearest neighbors
Nearest neighbors - while key collision methods are very fast, they tend to be either too strict or too lax, with no way to fine-tune how much difference between strings we are willing to tolerate.
The Nearest Neighbor methods (also known as kNN), on the other hand, provide a parameter (the radius, or k) which represents a distance threshold: any pair of strings closer than that value will be binned together.
2.3.2.a Hamming distance

Definition

The Hamming distance between two equal-length strings of symbols is the number of positions at which the corresponding symbols differ.

Figure 2.2: Formula of Hamming distance calculation
Hamming Distance application
Coding theory, especially block codes, in which the equal-length strings are vectors over a finite field, is a key application.

The minimal Hamming distance is used to establish several basic coding theory ideas, for example error detection and error correction codes. Specifically, a code C is considered to be k-error detecting if and only if the lowest Hamming distance between any two of its codewords is at least k+1. Consider the code consisting of the two codewords "000" and "111". Because the Hamming distance between these two words is 3, the code is k=2 error detecting: the error can be identified if one or two bits are flipped. However, "000" becomes "111" when three bits are inverted, and the error then cannot be detected. More generally, a code with minimum Hamming distance d between its codewords can detect at most d-1 errors and can correct ⌊(d-1)/2⌋ errors. The latter number is also known as the code's packing radius or error-correcting capability.
Complexity and example of Hamming Distance
Complexity of Hamming distance is:
+ Worst-case performance : O(n)
+ Best-case performance : O(1)
+ Average performance : O(n)
+ Worst-case space complexity : O(n)
Below is an example of Hamming distance between bits in 3-bit binary cube:
Figure 2.3: 3-bit binary cube for finding Hamming distance
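The codeword examples above ("000" vs. "111") follow directly from a one-line implementation of the definition (a sketch; OpenK's internal version may differ):

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    # compare position by position and count mismatches: O(n) time
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
```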
2.3.2.b Levenshtein distance
Definition
Levenshtein distance is the edit distance proposed by the Russian scientist Vladimir Levenshtein in 1965. With the character as the editing unit, it is the minimum number of operations (insert, delete, replace) needed to turn one string into another. It is often used in the similarity calculation of strings. Given two strings S and T with lengths m and n respectively, construct a matrix Lev[n + 1, m + 1] and iteratively calculate the value of each cell Lev(i, j) in the matrix; the calculation formula is as follows:

Calculation formula
The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by:

Figure 2.4: Levenshtein Distance calculation formula
where the tail of some string x is the string of all but the first character of x, and x[n] is the nth character of the string x, starting with character 0.

Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion, and the third to replacement.
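For readability, the recursive formula shown in Figure 2.4 can be written out as follows (the standard Levenshtein recurrence, matching the deletion/insertion/replacement reading above):

```latex
\operatorname{lev}(a,b) =
\begin{cases}
|a| & \text{if } |b| = 0,\\
|b| & \text{if } |a| = 0,\\
\operatorname{lev}\bigl(\operatorname{tail}(a),\operatorname{tail}(b)\bigr) & \text{if } a[0] = b[0],\\
1 + \min
\begin{cases}
\operatorname{lev}\bigl(\operatorname{tail}(a), b\bigr)\\
\operatorname{lev}\bigl(a, \operatorname{tail}(b)\bigr)\\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr)
\end{cases} & \text{otherwise.}
\end{cases}
```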
Complexity and example of Levenshtein distance
The standard dynamic-programming algorithm computes the Levenshtein distance of two strings of length n in time O(n²); it has been shown that it cannot be computed in time O(n^(2-ε)) for any ε > 0 unless the Strong Exponential Time Hypothesis is false.
Below is an example of Levenshtein distance between 2 words “Saturday” and
“Sunday”:
Figure 2.5: Example of Levenshtein distance calculation table.
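The table in Figure 2.5 is produced by the standard dynamic-programming algorithm, sketched here for illustration (not OpenK's exact code):

```python
def levenshtein(s, t):
    """Edit distance between s and t via the classic DP table."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # replacement
    return dp[m][n]
```

Running it on the example above gives a distance of 3 between "Saturday" and "Sunday".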
2.3.2.d Jaro Distance - Jaro-Winkler Distance
The Jaro–Winkler distance is a string metric used in computer science and statistics to measure the edit distance between two sequences. It is a variation of the Jaro distance metric (1989, Matthew A. Jaro) introduced by William E. Winkler in 1990.

The Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.
The shorter the Jaro–Winkler distance between two strings, the closer they are. The score is adjusted so that a distance of 0 indicates an exact match and 1 indicates no similarity. Because the metric was described in terms of similarity in the original study, the distance is defined as the inversion of that value (distance = 1 − similarity).
Figure 2.7: Formula of Jaro similarity calculation
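The Jaro similarity formula of Figure 2.7 can be written out as:

```latex
\operatorname{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[6pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
```

where m is the number of matching characters and t is half the number of transpositions; two characters are considered matching when they are equal and not farther apart than ⌊max(|s₁|, |s₂|)/2⌋ − 1 positions.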
Jaro–Winkler similarity uses a prefix scale p which gives more favorable ratings to strings that match from the beginning for a set prefix length l. Given two strings s1 and s2, their Jaro–Winkler similarity simw is:
Below is a calculation example of the Jaro–Winkler similarity:
Figure 2.8: Jaro-Winkler similarity calculation example
simw = simj + lp(1 - simj) where:
• simj is the Jaro similarity for strings s1 and s2
• l is the length of the common prefix at the start of the string, up to a maximum of 4 characters
• p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix being considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.
The Jaro–Winkler distance dw is defined as dw = 1 − simw.
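Putting the pieces together, Winkler's classic worked example MARTHA/MARHTA (m = 6, t = 1, l = 3) can be reproduced with a compact sketch. Edge-case handling here is an illustrative assumption and may differ from reference implementations:

```python
def jaro(s1, s2):
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)   # match window
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                  # find matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, j = 0, 0                                 # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2                                     # half the out-of-order matches
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity: boost the Jaro score for a common prefix."""
    sim_j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):            # common prefix, at most 4 chars
        if a != b:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)
```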
Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom d(x, y) = 0 ⟺ x = y.
Figure 2.9: Comparison of barcode correction using different techniques
3 Methodologies and Design
Figure 3.2: Data type format converter
DATA CLEANSER - this component corresponds to the data manipulation process and contains many smaller parts. The dataset is loaded into the built-in table viewer, which reads the whole CSV file and transforms all of the data into the table view. The project creator creates a replica of the dataset, after which the project process begins, allowing many additional activities. The Create/Read/Update/Delete (CRUD) operators, which are similar to those used in other database systems, play a key role. The text filter performs a "trim table" action: a keyword is typed, and only the rows containing that term are shown. With the Column Sorter we may sort a column alphabetically (with words) or in ascending/descending order (with numbers). Faceting operations enable us to search for patterns and trends, so that we can make the best decision possible afterwards.
Figure 3.3: Data cleansing component illustration
Its clustering operations comprise several sub-methods, which will be mentioned below.
Figure 3.4: Clustering Operations illustration
3.1.2 Detecting anomaly execution flow
Execution flow of the OpenK system:
4) From here, we can create a project from the input dataset.
5) Once the project is created, we can choose between CRUD operations, which include creating / reading / updating / deleting records. The next function is the text filter, which will search for all records that include the text. The following is column sorting, where we can sort the column alphabetically or in ascending/descending order. Last but not least is the Clustering operation, which will cluster all anomalies in the data.
6) The Merge node is the following step; clicking save changes makes all the data become cleaned.
3.1.3 Use-case of clustering data site
3.1.3.a Actor determination and following use-case
- Toggling hide column.
- Managing cluster
Cluster data agent
+ Description: The cluster data agent is an actor that will do the job of clustering all the anomalous data that seem to represent the same thing. This job will be done in the back-end.
Figure 3.6: Use case diagram of Data site of OpenK system
Use-case specification
Figure 3.7: Use case specification of viewing data
Figure 3.8: Use case specification of paging data
Figure 3.10: Use case specification of sorting data column
Figure 3.11: Use case specification of export data
Figure 3.12: Use case specification of Hiding column data
Figure 3.13: Use case specification of Manage data cluster
Figure 3.14: Use case specification of cluster data using knn method
Figure 3.15: Use case specification of cluster data using similarity method
Figure 3.16: Use case specification of cluster data using key collision method
Heretofore there have been many systems and designs for data cleansing and anomaly detection; by comparing those systems with ours, we can take a good look at them and acknowledge what has been done beforehand:
In BigDansing [1], the input data is defined as a set of data units, where each
data unit is the smallest unit of the input datasets. Each unit can have multiple associated elements that are identified by model-specific functions. Also, BigDansing provides a set of parsers for producing such data units and elements from input datasets. BigDansing adopts UDFs as the basis to define quality rules. Each rule has two fundamental abstract functions, namely Detect and GenFix. Detect takes one or multiple data units as input and outputs a violation, i.e., elements in the input units that together are considered erroneous w.r.t. the rule:
GenFix takes a violation as input and computes alternative, possible updates to resolvethis violation:
In order to have a formal description and representation of the system, organized in a way that supports reasoning about its structures and behaviors, an overall system architecture is shown:
Trang 38Figure 3.17: Overall architecture of BigDansing
BigDansing receives a data quality rule together with a dirty dataset from users (1) and outputs a clean dataset (7). BigDansing consists of two main components: the RuleEngine and the RepairAlgorithm. The RuleEngine receives a quality rule either in a UDF-based form (a BigDansing job) or in a declarative form (a declarative rule). A job (i.e., a script) defines user operations as well as the sequence in which users want to run their operations (see Appendix A). A declarative rule is written using traditional integrity constraints such as FDs and DCs. In the latter case, the RuleEngine automatically translates the declarative rule into a job to be executed on a parallel data processing framework. This job outputs the set of violations and possible fixes for each violation. The RuleEngine has three layers: the logical, physical, and execution layers. This architecture allows BigDansing to (i) support a large variety of data quality rules by abstracting the rule specification process, and (ii) achieve high efficiency when cleansing datasets.
Unlike a DBMS, the RuleEngine also has an execution abstraction, which allows BigDansing to run on top of general-purpose data processing frameworks ranging from MapReduce-like systems to databases.
NADEEF [2] works as follows. It first collects data and heterogeneous rules from the users. The rule compiler then compiles these heterogeneous rules into homogeneous constructs. Next, the violation detection module identifies what data is erroneous and possible ways to repair it, based on user-provided rules. The data repairing module fixes the detected errors by treating all the provided rules holistically. Note that data updates may trigger new errors to be detected. As the iterative detection and repairing process progresses, NADEEF collects and manages metadata related to its different modules. A live data quality dashboard exploits this metadata and visualizes information so that users can interact with the system.
Overall, NADEEF consists of the following components: data loader, rule collector, detection and cleaning core, metadata management, and data quality dashboard.

Figure 3.18: Overall architecture of NADEEF
Rule collector: it collects user-specified rules such as ETL rules, CFDs (FDs), MDs, and other customized rules. Detection and cleaning core: it contains three components: rule compiler, violation detection, and data repairing. (i) Rule compiler: this module compiles all heterogeneous rules and manages them in a unified format. (ii) Violation detection: this module takes the data and the compiled rules as input, and computes a set of data errors. (iii) Data repairing: this module encapsulates holistic repairing algorithms that take violations as input, and compute a set of data repairs. By default, we use the algorithm of [1], which can also be overwritten by e.g., [2]. Moreover,
achieve higher quality repairs.
Metadata management and data quality dashboard: metadata management is to keep full lineage information about data changes and the order of changes, as well as maintaining indices to support efficient metadata operations. The data quality dashboard helps the users understand the health of the data through data quality indicators, as well as summarized, sampled, or ranked data errors presented in live graphs. This facilitates the solicitation of users' feedback for data repairing. Rule specification: we use the term cell to denote a combination of a tuple and an attribute of a table. The unified programming interface to define the semantics of data errors and possible ways to fix them is as follows.
To implement the system, the following technologies and frameworks were used:
FLASK - a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries: it has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions. OpenK is a web-based system, and Flask manages the role of running the server locally - by default on localhost:5000 (127.0.0.1:5000). Flask depends on the Jinja template engine and the Werkzeug WSGI toolkit. All the data and parameters passed from Flask were written in Jinja format for web application development. Also, the API library is exposed with Flask.
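As a minimal sketch of how such a Flask server is wired up (the route names, template, and handler functions below are illustrative assumptions, not OpenK's actual code):

```python
from flask import Flask, jsonify, render_template_string

app = Flask(__name__)

# Inline Jinja template standing in for a real data-viewer page
PAGE = "<h1>{{ title }}</h1>"

@app.route("/")
def index():
    # Jinja rendering: parameters passed from Flask into the template
    return render_template_string(PAGE, title="OpenK")

@app.route("/api/health")
def health():
    # Example of an API-style JSON endpoint exposed via Flask
    return jsonify(status="ok")

# app.run(debug=True)  # serves on 127.0.0.1:5000 by default
```

Calling `app.run()` starts the Werkzeug development server on localhost:5000, matching the default address mentioned above.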