Summary of mathematics doctoral thesis: Discovering functional dependencies and relaxed functional dependencies in databases

The thesis addresses topical problems that have recently been revisited in a series of works by foreign authors, whereas in Vietnam many published works concern methods and algorithms for finding reducts of a decision table by different approaches. The objective of the thesis is to study some of these problems within the scope of relational databases.


MINISTRY OF EDUCATION AND TRAINING

VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY


This work is completed at:

Graduate University of Science and Technology

Vietnam Academy of Science and Technology

Supervisor 1: Assoc. Prof. Dr. Ho Thuan

Supervisor 2: Assoc. Prof. Dr. Nguyen Thanh Tung

Graduate University of Science and Technology

Vietnam Academy of Science and Technology

At hrs day month year

This thesis is available at:

1. Library of Graduate University of Science and Technology

2. National Library of Vietnam


INTRODUCTION

Data dependencies play important roles in database design, data quality management and knowledge representation. In knowledge discovery, dependencies are extracted from the existing data of the database. This extraction process is called dependency discovery.

The aim of dependency discovery is to find the important dependencies that hold on the data of the database. These discovered dependencies represent domain knowledge and can be used to verify database design and assess data quality.

Dependency discovery has attracted a lot of research interest since the early 1980s. At present, the problem of discovering data dependencies in big data sets is becoming more important because these data sets contain a lot of valuable knowledge.

Currently, with the development of digital devices, and especially of social networks and smartphone applications, the amount of data in applications increases very quickly. This raises problems in data storage and data management, and especially the problem of knowledge discovery from those big data sets. Discovering functional dependencies (FDs) and relaxed functional dependencies (RFDs) in databases is one of the important problems of knowledge discovery. Three typical types of data dependencies of interest for discovery are the FD, the approximate functional dependency (AFD) and the conditional functional dependency (CFD). AFD is an extension of FD in which the "approximation" is based on a degree of satisfaction or an error measure; CFD is an extension of FD that aims to capture inconsistencies in data.

The research directions that address RFD discovery in databases focus first on FD discovery, because FD is a special case of every type of RFD and the research results on FD discovery can be adapted to discover other types of data dependencies (such as AFD). The general model of the FD discovery problem includes the following steps: generating a search space of FDs, verifying the satisfaction of each FD, pruning the search space, outputting the set of satisfied FDs, and reducing redundancies in this set. Within the FD discovery problem, key discovery is a special case and is also an important problem in normalizing relational databases.

The time complexity of the FD discovery problem is polynomial in the number of tuples of the relation but exponential in the number of its attributes. Therefore, to reduce the processing time, effective pruning rules should be developed. Among the proposed pruning rules, key pruning is important: once a key is discovered, all sets containing it can be pruned (deleted) from the search space. However, the disadvantage of existing key pruning rules is that they look for keys over the entire attribute set $\Omega$ of the database (a very difficult problem, because the time complexity can be exponential in the number of attributes of $\Omega$). Is there a way to find keys within a proper subset of $\Omega$? This question is one of the basic motivations of this thesis.

Once the set of data dependencies has been discovered, it can be very large and difficult to use because it contains unnecessary redundancies. The important problem is how to eliminate as much of the redundancy in the set of discovered data dependencies as possible. This problem is also addressed in the thesis.

Another research direction in the thesis focuses on discovering two typical types of RFD, namely AFD and CFD. Both have many applications and occurrences in relational databases; CFD in particular is also a powerful tool for data cleaning problems. For AFD, the most important problem is to improve and develop techniques for computing approximation measures. For CFD, in addition to discovery, research on a unified hierarchy between CFD and other types of data dependencies is also a very interesting problem.

The thesis addresses topical problems that have recently been revisited in a series of works by foreign authors, whereas in Vietnam many published works concern methods and algorithms for finding reducts of a decision table by different approaches.

The objective of the thesis is to study some of the problems analysed above within the scope of relational databases. The main contents of the thesis are as follows:

Chapter 1. An overview of the relational data model and of the concepts of functional dependency, closure of a set of attributes, key of a relation schema, etc. This chapter also covers RFDs and gives a general view of the methods used for discovering FDs and RFDs.

Chapter 2. A presentation of AFD and CFD (two typical types of RFD) and some related results.

Chapter 3. A presentation of algorithms for computing the closure of a set of attributes under a set of FDs, of a reduction of the key finding problem for a relation schema, and of some related results.

Chapter 4. A presentation of an effective preprocessing transformation for sets of FDs (to reduce redundancies in a given set of FDs) and some related results.


Chapter 1

FUNCTIONAL DEPENDENCIES AND RELAXED FUNCTIONAL DEPENDENCIES IN THE RELATIONAL DATA MODEL

1.1 Recalling some basic notions

A relation $r$ on the set of attributes $\Omega = \{A_1, A_2, \ldots, A_n\}$ is a set of tuples

$r \subseteq \{(a_1, a_2, \ldots, a_n) \mid a_i \in \mathrm{Dom}(A_i),\ i = 1, 2, \ldots, n\}$,

where $\mathrm{Dom}(A_i)$ is the domain of $A_i$, $i = 1, 2, \ldots, n$.

A relation schema $S$ is an ordered pair $S = \langle \Omega, F \rangle$, where $\Omega$ is a finite set of attributes and $F$ is a set of FDs. $S$ can also be denoted by $S(\Omega)$.

1.2 Functional dependency

Functional dependency. Let $X, Y \subseteq \Omega$. Then $X \to Y$ if, for all relations $r$ over the relation schema $S(\Omega)$ and all $t_1, t_2 \in r$, $t_1[X] = t_2[X]$ implies $t_1[Y] = t_2[Y]$.

Armstrong's axioms. For all $X, Y, Z \subseteq \Omega$, we have:

Q1 (Reflexivity): If $Y \subseteq X$ then $X \to Y$.

Q2 (Augmentation): If $X \to Y$ then $XZ \to YZ$.

Q3 (Transitivity): If $X \to Y$ and $Y \to Z$ then $X \to Z$.

The closure of $X \subseteq \Omega$ under a set of FDs $F$ is the set $X_F^+$:

$X_F^+ = \{A \in \Omega \mid (X \to A) \in F^+\}$

Keys for a relation schema. Let $S = \langle \Omega, F \rangle$ be a relation schema and $K \subseteq \Omega$. We say that $K$ is a key of $S$ if the following two conditions are simultaneously satisfied:

(i) $(K \to \Omega) \in F^+$

(ii) If $K' \subset K$ then $(K' \to \Omega) \notin F^+$

If $K$ only satisfies (i) then $K$ is called a superkey.
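Conditions (i) and (ii) can be checked with attribute closures. The sketch below is only an illustration under the assumption that FDs are given as pairs of sets; the helper names `closure`, `is_superkey` and `is_key` are hypothetical. For condition (ii) it is enough to test the subsets obtained by removing a single attribute, because every subset of a non-superkey is again a non-superkey.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(k, omega, fds):
    # Condition (i): K determines every attribute of the schema.
    return closure(k, fds) >= set(omega)


def is_key(k, omega, fds):
    # Condition (ii): no subset with one attribute removed is still a superkey.
    k = set(k)
    return is_superkey(k, omega, fds) and all(
        not is_superkey(k - {a}, omega, fds) for a in k
    )


# Hypothetical schema: attributes {A, B, C, D}, F = {A -> B, B -> C}
omega = {"A", "B", "C", "D"}
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(is_key({"A", "D"}, omega, fds))            # True
print(is_superkey({"A", "B", "D"}, omega, fds))  # True
print(is_key({"A", "B", "D"}, omega, fds))       # False
```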


1.3 Relaxed functional dependency (RFD)

1.3.1 Approximate functional dependency (AFD)

An AFD is an FD that almost holds. To determine the degree of violation of $X \to Y$ in a given relation $r$, an error measure, denoted $e(X \to Y, r)$, is used. Given an error threshold $\varepsilon$, $0 \le \varepsilon \le 1$, we say that $X \to Y$ is an AFD if and only if $e(X \to Y, r) \le \varepsilon$.
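A minimal sketch of such a threshold test follows, assuming the widely used $g_3$ measure as the error measure $e$ (the minimum fraction of tuples that must be removed for the FD to hold). The function names, the row format (a list of dictionaries) and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """g3 error of lhs -> rhs on a non-empty list of rows (dicts)."""
    groups = defaultdict(Counter)
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        groups[x][y] += 1
    # In each group of equal lhs values keep the most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (len(rows) - kept) / len(rows)


def is_afd(rows, lhs, rhs, eps):
    """True if lhs -> rhs holds approximately with error threshold eps."""
    return g3_error(rows, lhs, rhs) <= eps


# Hypothetical data: one of four tuples violates zip -> city
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
print(g3_error(rows, ["zip"], ["city"]))     # 0.25
print(is_afd(rows, ["zip"], ["city"], 0.3))  # True
print(is_afd(rows, ["zip"], ["city"], 0.1))  # False
```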

1.3.2 Metric functional dependency (MFD)

Consider $X \to Y$ in a given relation $r$. An MFD extends the functional dependency by replacing the condition $t_1[Y] = t_2[Y]$ with $d(t_1[Y], t_2[Y]) \le \delta$, where $d$ is a metric on $Y$, $d: \mathrm{dom}(Y) \times \mathrm{dom}(Y) \to \mathbb{R}$, and $\delta \ge 0$ is a parameter.
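The definition can be illustrated by a direct check: group the tuples by their X-values and require every pair inside a group to be within distance $\delta$ on Y. This is only a sketch; the function name `satisfies_mfd`, the row format and the absolute-difference metric are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def satisfies_mfd(rows, lhs, rhs, dist, delta):
    """True if tuples equal on lhs are within `delta` of each other on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(tuple(row[a] for a in rhs))
    return all(
        dist(y1, y2) <= delta
        for ys in groups.values()
        for y1, y2 in combinations(ys, 2)
    )


def abs_diff(y1, y2):
    # Hypothetical metric for a single numeric attribute.
    return abs(y1[0] - y2[0])


rows = [
    {"movie": "Alien", "minutes": 117},
    {"movie": "Alien", "minutes": 116},
    {"movie": "Heat", "minutes": 170},
]
print(satisfies_mfd(rows, ["movie"], ["minutes"], abs_diff, 1))  # True
print(satisfies_mfd(rows, ["movie"], ["minutes"], abs_diff, 0))  # False
```

With `delta = 0` the test coincides with the ordinary FD test, which matches the view of FD as a special case.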

1.3.3 Conditional functional dependency (CFD)

A CFD is a pair $\varphi = (X \to Y, T_p)$, where $X \to Y$ is an FD and $T_p$ is a pattern tableau with all attributes in $X$ and $Y$. Intuitively, the pattern tableau $T_p$ of $\varphi$ refines the FD embedded in $\varphi$ by enforcing the binding of semantically related data values.

1.3.4 Fuzzy functional dependency (FFD)

Let $r$ be a relation on $\Omega = \{A_1, A_2, \ldots, A_n\}$ and $X, Y \subseteq \Omega$. For each $A_i \in \Omega$, the degree of equality of data values in $\mathrm{Dom}(A_i)$ is defined by the fuzzy tolerance relation $R_i$.

Given a parameter $\alpha$ ($0 \le \alpha \le 1$), we say that two tuples $t_1[X]$ and $t_2[X]$ are equal with the degree $\alpha$, denoted $t_1[X]\, E(\alpha)\, t_2[X]$, if $R_k(t_1[A_k], t_2[A_k]) \ge \alpha$ for all $A_k \in X$. Then $X \to Y$ is called an FFD with the degree $\alpha$ if $\forall t_1, t_2 \in r$, $t_1[X]\, E(\alpha)\, t_2[X] \Rightarrow t_1[Y]\, E(\alpha)\, t_2[Y]$.
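The following sketch checks this condition directly on pairs of tuples. It assumes that each tolerance relation is reflexive and symmetric, so that unordered pairs of distinct tuples suffice; the names `equal_to_degree`, `satisfies_ffd`, the dictionary `tol` and the linear tolerance functions are hypothetical choices for the example.

```python
from itertools import combinations


def equal_to_degree(t1, t2, attrs, tol, alpha):
    """t1[attrs] E(alpha) t2[attrs]: every attribute is similar to degree >= alpha."""
    return all(tol[a](t1[a], t2[a]) >= alpha for a in attrs)


def satisfies_ffd(rows, lhs, rhs, tol, alpha):
    """True if equality to degree alpha on lhs implies the same on rhs."""
    return all(
        not equal_to_degree(t1, t2, lhs, tol, alpha)
        or equal_to_degree(t1, t2, rhs, tol, alpha)
        for t1, t2 in combinations(rows, 2)
    )


# Hypothetical tolerance relations: similarity decreases linearly with distance.
tol = {
    "age": lambda a, b: max(0.0, 1 - abs(a - b) / 10),
    "salary": lambda a, b: max(0.0, 1 - abs(a - b) / 1000),
}
rows_ok = [
    {"age": 30, "salary": 5000},
    {"age": 31, "salary": 5100},
    {"age": 50, "salary": 9000},
]
rows_bad = [
    {"age": 30, "salary": 5000},
    {"age": 31, "salary": 5500},
]
print(satisfies_ffd(rows_ok, ["age"], ["salary"], tol, 0.8))   # True
print(satisfies_ffd(rows_bad, ["age"], ["salary"], tol, 0.8))  # False
```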

1.3.5 Differential dependency (DD)

A DD extends the equality relation (=) in the FD $X \to Y$. The conditions $t_1[X] = t_2[X]$ and $t_1[Y] = t_2[Y]$ are replaced, respectively, by the conditions that $t_1, t_2$ satisfy the differential functions $L$ and $R$.


In fact, the differential functions use metric distances to extend the equality relation used in FD.

FD is a special case of DD in which $L(t_1[X], t_2[X]) = 0$ and $R(t_1[Y], t_2[Y]) = 0$. In addition, DD is also an extension of MFD, which corresponds to $L(t_1[X], t_2[X]) = 0$ and $R(t_1[Y], t_2[Y]) \le \delta$.

1.3.6 Other types of RFDs

There are many other types of RFDs. Motivated by real-world applications, each type of RFD results from extending (relaxing) the equality relation in the traditional FD concept in a particular way.

1.4 FD Discovery

Top-down methods. These methods generate candidate FDs following an attribute lattice, test their satisfaction, and then use the satisfied FDs to prune candidate FDs at lower levels of the lattice, which reduces the search space. An important problem is how to check whether a candidate FD is satisfied. Two specific methods are used: the partition method (algorithms: TANE, FD_Mine) and the free-set method (algorithm: FUN).
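The level-wise idea can be sketched as follows. This is a deliberately naive illustration, not TANE, FD_Mine or FUN: it enumerates non-empty left-hand sides by increasing size, checks each candidate with a hash table instead of partitions, and prunes a candidate whenever an FD already found has a smaller left-hand side and the same right-hand side. All names and the sample relation are hypothetical.

```python
from itertools import combinations


def holds(rows, lhs, a):
    """Check lhs -> a by remembering the a-value seen for each lhs-value."""
    seen = {}
    for row in rows:
        x = tuple(row[b] for b in lhs)
        if seen.setdefault(x, row[a]) != row[a]:
            return False
    return True


def discover_minimal_fds(rows, attrs):
    """Return minimal FDs (lhs, a) with non-empty lhs and a single attribute a."""
    found = []
    for size in range(1, len(attrs)):          # one lattice level per size
        for lhs in combinations(attrs, size):
            lhs_set = frozenset(lhs)
            for a in attrs:
                if a in lhs_set:
                    continue                   # skip trivial FDs
                # Pruning: a subset of lhs already determines a.
                if any(r == a and l <= lhs_set for l, r in found):
                    continue
                if holds(rows, lhs, a):
                    found.append((lhs_set, a))
    return found


rows = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 2, "B": "x", "C": 10},
    {"A": 3, "B": "y", "C": 20},
    {"A": 3, "B": "y", "C": 20},
]
for lhs, a in discover_minimal_fds(rows, ["A", "B", "C"]):
    print(sorted(lhs), "->", a)
```

On this sample relation the search reports A -> B, A -> C, B -> C and C -> B; every two-attribute candidate is either pruned or fails.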

Bottom-up methods. Unlike the top-down methods above, bottom-up methods compare the tuples of the relation to find agree-sets or difference-sets. These sets are then used to derive the FDs satisfied by the relation. The distinguishing feature of these methods is that they do not check candidate FDs against the relation for satisfaction, but against the computed agree-sets and difference-sets. Two typical algorithms using these methods are Dep-Miner and FastFDs.

The worst-case time complexity of the FD discovery problem is exponential in the number of attributes of $\Omega$.
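To make the bottom-up idea concrete, the sketch below computes the agree-sets of a small relation and tests a candidate FD against them only: $X \to A$ is violated exactly when some agree-set contains $X$ but not $A$. It is an illustration, not Dep-Miner or FastFDs; the function names and data are hypothetical.

```python
from itertools import combinations


def agree_sets(rows, attrs):
    """Set of agree-sets: for each pair of tuples, the attributes they share."""
    return {
        frozenset(a for a in attrs if t[a] == u[a])
        for t, u in combinations(rows, 2)
    }


def holds_by_agree_sets(ag_sets, lhs, a):
    """Test lhs -> a using only the agree-sets, without rescanning the relation."""
    lhs = frozenset(lhs)
    return not any(lhs <= ag and a not in ag for ag in ag_sets)


rows = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 2, "B": "x", "C": 10},
    {"A": 3, "B": "y", "C": 20},
    {"A": 3, "B": "y", "C": 20},
]
ag = agree_sets(rows, ["A", "B", "C"])
print(holds_by_agree_sets(ag, ["A"], "B"))  # True
print(holds_by_agree_sets(ag, ["B"], "A"))  # False
```

Computing the agree-sets compares all pairs of tuples, so this step is quadratic in the number of tuples.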

There are some further topics relating to FD discovery, such as sampling, maintenance of discovered FDs, and key discovery.

Three typical algorithms for CFD discovery are CFDMiner, CTANE and FastCFD.

1.6 Summary of chapter 1

This chapter presents an overview of FD and RFD in the relational data model. The search space of the dependency discovery problem is exponential in the number of attributes involved in the data.

FD discovery methods can be adapted to discover RFDs. For example, an error measure can be used in an FD discovery algorithm to find AFDs.

Several algorithms have been proposed for discovering FDs and RFDs.


Chapter 2

APPROXIMATE FUNCTIONAL DEPENDENCIES AND CONDITIONAL FUNCTIONAL DEPENDENCIES

2.1 About some results relating to FD and AFD

This section relates the results of two works by two groups of authors ([Y Huhtala et al., 1999] and [S King et al., 2003]) and proves in detail some important lemmas that form the foundation for discovering FD and AFD (these lemmas have not been proven).

Theorem 2.1 The FD $X \to A$ holds if and only if $\pi_X$ refines $\pi_A$.

Theorem 2.2 The FD $X \to A$ holds if and only if $|\pi_X| = |\pi_{X \cup \{A\}}|$.

Theorem 2.3 The FD $X \to A$ holds if and only if $g_3(X) = g_3(X \cup \{A\})$.

Theorem 2.4 We have $\pi_X \cdot \pi_Y = \pi_{X \cup Y}$.

Theorem 2.5 Let $B \in X$ and $X - \{B\} \to B$. Then, if $X \to A$ then $X - \{B\} \to A$. If $X$ is a superkey then $X - \{B\}$ is also a superkey.

Theorem 2.6 $C^+(X) = \{A \in R \mid \forall B \in X,\ X - \{A, B\} \to B \text{ does not hold}\}$.

Theorem 2.7 Let $A \in X$ and $X - \{A\} \to A$. The FD $X - \{A\} \to A$ is minimal iff for all $B \in X$, we have $A \in C^+(X - \{B\})$.
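The partition-based statements (refinement, equality of partition sizes, and the partition product) can be tried out on a small relation. The sketch assumes that a partition groups the indices of tuples with equal values on the given attributes; the function names and the sample data are hypothetical, and no stripped-partition optimisation is attempted.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Partition of row indices into classes of equal values on `attrs`."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return {frozenset(c) for c in classes.values()}


def refines(p, q):
    """p refines q if every class of p lies inside some class of q."""
    return all(any(c <= d for d in q) for c in p)


def product(p, q):
    """Partition product: all non-empty pairwise intersections of classes."""
    return {c & d for c in p for d in q if c & d}


def fd_holds(rows, lhs, a):
    """Size test: lhs -> a holds iff adding a does not split any class."""
    return len(partition(rows, lhs)) == len(partition(rows, list(lhs) + [a]))


rows = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 2, "B": "x", "C": 10},
    {"A": 3, "B": "y", "C": 20},
    {"A": 3, "B": "y", "C": 20},
]
pi_a, pi_b = partition(rows, ["A"]), partition(rows, ["B"])
print(refines(pi_a, pi_b))                             # True
print(fd_holds(rows, ["A"], "B"))                      # True
print(fd_holds(rows, ["B"], "A"))                      # False
print(product(pi_a, pi_b) == partition(rows, ["A", "B"]))  # True
```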


2.2 FD and AFD discovery

Some approximation measures that have been proposed and are commonly used for discovering AFDs are $\mathrm{TRUTH}_r(X \to Y)$, $g_1(X \to Y, r)$, $g_2(X \to Y, r)$ and $g_3(X \to Y, r)$.
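For reference, the three $g$ measures are usually defined as below. These are the standard definitions from the literature and are given here as an assumption; the thesis may normalise them differently, and the definition of $\mathrm{TRUTH}_r$ is not reproduced.

```latex
\begin{align*}
g_1(X \to Y, r) &= \frac{\bigl|\{(u, v) \mid u, v \in r,\ u[X] = v[X],\ u[Y] \neq v[Y]\}\bigr|}{|r|^2} \\
g_2(X \to Y, r) &= \frac{\bigl|\{u \in r \mid \exists v \in r:\ u[X] = v[X],\ u[Y] \neq v[Y]\}\bigr|}{|r|} \\
g_3(X \to Y, r) &= \frac{|r| - \max\{|s| \mid s \subseteq r,\ X \to Y \text{ holds in } s\}}{|r|}
\end{align*}
```

Under these definitions $g_1$ counts violating pairs, $g_2$ counts tuples involved in a violation, and $g_3$ counts the tuples that must be removed for the FD to hold.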

Choosing a certain approximate measure for discovering AFDs affects the output results In the thesis, we establish some new relationships between the measures:

Let $r$ be a relation on a schema $S(\Omega)$. For each $X \subseteq \Omega$, we define an equivalence relation $\equiv_X$ on $r$ as follows:

$t \equiv_X u$ if and only if $t[X] = u[X]$, for all $t, u \in r$.

Suppose $r = \{t_1, t_2, \ldots, t_m\}$. Each equivalence relation $\equiv_X$ on $r$ can be expressed as a binary matrix with elements 1 or 0 (called an equivalence matrix), where $a_{ij} = 1$ if $t_i \equiv_X t_j$ and $a_{ij} = 0$ otherwise.

Using equivalence matrices (attribute matrices), we give algorithms whose time complexity is only $O(m^2)$ for discovering FD (testing satisfaction) and AFD (computing the measures $\mathrm{TRUTH}_r(X \to Y)$, $g_1(X \to Y, r)$, $g_2(X \to Y, r)$).
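One way such a matrix-based computation could look is sketched below. It is not the algorithm of the thesis: it assumes the usual pair-counting definitions of $g_1$ and $g_2$, builds the equivalence matrices naively, and then makes a single $O(m^2)$ pass over them. Function names and data are hypothetical.

```python
def equivalence_matrix(rows, attrs):
    """m x m matrix with entry 1 where two tuples agree on all of `attrs`."""
    m = len(rows)
    return [
        [1 if all(rows[i][a] == rows[j][a] for a in attrs) else 0 for j in range(m)]
        for i in range(m)
    ]


def fd_and_errors(mx, my):
    """Return (fd_holds, g1, g2) for X -> Y from the matrices of X and Y."""
    m = len(mx)
    violating_pairs = 0
    violating_rows = 0
    for i in range(m):
        # Tuples j that agree with tuple i on X but differ from it on Y.
        bad = sum(1 for j in range(m) if mx[i][j] == 1 and my[i][j] == 0)
        violating_pairs += bad
        violating_rows += 1 if bad else 0
    return violating_pairs == 0, violating_pairs / m**2, violating_rows / m


rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
mx = equivalence_matrix(rows, ["zip"])
my = equivalence_matrix(rows, ["city"])
print(fd_and_errors(mx, my))  # (False, 0.25, 0.75)
```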

2.3 Conditional Functional Dependencies (CFD)

Definition. A CFD $\varphi$ on a relation schema $R$ is a pair $\varphi = (X \to Y, T_p)$, where $X \to Y$ is a standard FD (referred to as the FD embedded in $\varphi$) and $T_p$ is a tableau with all attributes in $X \cup Y$ (referred to as the pattern tableau of $\varphi$), where for each $A$ in $X$ or $Y$ and each tuple $t \in T_p$, $t[A]$ is either a constant "a" in the domain $\mathrm{Dom}(A)$ of $A$ or an unnamed variable "_".

Semantics. The pattern tableau $T_p$ of a CFD $\varphi = (X \to Y, T_p)$ defines the tuples (in the relation) on which the FD $X \to Y$ must hold. Intuitively, the pattern tableau $T_p$ of $\varphi$ refines the standard FD embedded in $\varphi$ by enforcing the binding of semantically related data values.
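A minimal sketch of a satisfaction check follows, assuming the pair-based semantics commonly used for CFDs: for every pattern tuple, any two tuples (possibly the same one) that agree on X and match the pattern on X must agree on Y and match the pattern on Y. The string "_" stands for the unnamed variable; all function names and the sample data are hypothetical.

```python
from itertools import product

WILDCARD = "_"  # stands for the unnamed variable of a pattern tuple


def matches(row, pattern, attrs):
    """A value matches a pattern entry if the entry is the wildcard or equal."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def satisfies_cfd(rows, lhs, rhs, tableau):
    """True if the relation satisfies the CFD (lhs -> rhs, tableau)."""
    for tp in tableau:
        selected = [t for t in rows if matches(t, tp, lhs)]
        # Pairs include (t, t), so constant patterns on rhs are enforced too.
        for t1, t2 in product(selected, repeat=2):
            if all(t1[a] == t2[a] for a in lhs):
                same_rhs = all(t1[a] == t2[a] for a in rhs)
                if not (same_rhs and matches(t1, tp, rhs)):
                    return False
    return True


rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "US", "zip": "10001", "city": "NYC"},
    {"country": "US", "zip": "10001", "city": "New York"},
]
lhs, rhs = ["country", "zip"], ["city"]
uk_only = [{"country": "UK", "zip": "_", "city": "_"}]
plain_fd = [{"country": "_", "zip": "_", "city": "_"}]
print(satisfies_cfd(rows, lhs, rhs, uk_only))   # True
print(satisfies_cfd(rows, lhs, rhs, plain_fd))  # False
```

A tableau consisting of a single all-wildcard pattern tuple reduces the check to the ordinary FD test, which is why the second call fails on the US tuples.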

The consistency problem for CFDs is NP-complete. The inference system $\mathcal{I}$ is sound and complete for implication of CFDs. The algorithms proposed for discovering CFDs are CFDMiner, CTANE and FastCFD.

2.4 About a unified hierarchy for FDs, CFDs and ARs

The work of [R.Medina et al., 2009] is interesting and original. The authors show a hierarchy among FDs, CFDs and association rules (ARs): FDs are the union of CFDs while CFDs are the union of ARs. This hierarchy has many benefits: algorithms for discovering ARs can be adapted to discover many other types of data dependencies and, moreover, to generate a reduced set of dependencies.

The contents below are some remarks and preliminary results after researching the work of [R.Medina et al., 2009]:


Remark 2.1 Unlike most authors working on CFDs, [R.Medina et al., 2009] extend every pattern tuple $t_p \in T_p$: these pattern tuples are now defined on the whole set $\mathrm{Attr}(R)$, with a special symbol assigned to $t_p[A]$ for every $A \notin X \cup Y$.

Remark 2.2 Instead of matching a tuple $t \in r$ with a tuple $t_p \in T_p$ ($t_p$ is now defined on $\mathrm{Attr}(R)$), we match $t(X)$ with $t_p(X)$ and $t(Y)$ with $t_p(Y)$. More formally, $t(X)$ and $t_p(X)$ (respectively $t(Y)$ and $t_p(Y)$) match if

$\forall A \in X$: $t(X)[A] = t_p(X)[A] = a \in \mathrm{Dom}(A)$

or $t(X)[A] = a$ and $t_p(X)[A]$ is the unnamed variable "_".

Remark 2.3 Consider a pattern tuple t p defining a fragment relation

of [R.Medina et al., 2009] as follows:

p

t

r = {t  r | t p  t} (*)

It is clear that formula (*) is incorrect, because in almost all cases (*) returns the empty set. Indeed, if $t_p$ contains at least one unnamed-variable component, then there is no $t \in r$ satisfying the condition in (*). In the opposite case ($t_p$ contains no unnamed variable) with $X \cup Y \subset \mathrm{Attr}(R)$, for every $A \notin X \cup Y$ the component $t_p[A]$ is the special symbol of Remark 2.1 while $t[A] = a$ is a constant, so again no $t \in r$ satisfies the condition in (*). Hence $r_{t_p}$, defined by (*), is non-empty only when $X \cup Y = \mathrm{Attr}(R)$ and $t_p$ coincides with a certain tuple $t$ in $r$. The expression (*) must therefore be changed to

p

t

r = {t  r | t(X  Y)  t p (X  Y)}

[R.Medina et al., 2009] used the following definitions:

X-complete property: a relation $r$ is said to be X-complete if and only if $\forall t_1, t_2 \in r$ we have $t_1[X] = t_2[X]$.

 X-complete pattern: (X, r) =  {t  r}

X-complete horizontal decomposition:

$R_X(r) = \{r' \subseteq r \mid r' \text{ is X-complete}\}$
