On the Discovery of Semantically Meaningful SQL Constraints from Armstrong Samples: Foundations, Implementation, and Evaluation

Challenges with constraints

Constraints play a crucial role in database modeling and design, serving as a foundation for key data management services such as updates, queries, security, sampling, data cleaning, exchange, and integration. Identifying valuable classes of constraints is a significant challenge, as these constraints must express critical properties of application domains while still allowing efficient maintenance by database management systems. Efficient maintenance requires that the implication problem for these constraint classes can be decided efficiently.

The implication problem for a class C of constraints is to decide, for an arbitrary set Σ ∪ {ϕ} of constraints in C, whether Σ implies ϕ, meaning that any database satisfying all constraints in Σ must also satisfy ϕ. For highly expressive classes, however, deciding implication becomes impractical.

A notable example is the class of first-order formulae, which includes important classes of constraints such as uniqueness constraints and functional dependencies. Once valuable constraint classes have been identified, the next challenge is to uncover the semantically meaningful constraints of the specific application domain. If semantically meaningful constraints are not identified, the database management system may admit instances that do not accurately reflect the real world. While this thesis primarily tackles the second challenge, it must first address the first.

State of database practice

The database industry has surpassed a valuation of 32 billion US dollars and continues to experience double-digit growth. Major vendors like Oracle, IBM, and Microsoft offer database systems, alongside open-source options such as MySQL, PostgreSQL, and Ingres. These systems adhere to the ISO and ANSI standards for data definition and querying, primarily the Structured Query Language (SQL), which was developed by IBM based on Edgar Codd's relational model. After over 40 years, SQL remains the dominant force in the market and shapes new paradigms as the industry evolves. Web models like XML and RDF are mainly used for deploying, exchanging, and integrating SQL-based data. High scalability is essential for many websites and distributed applications, such as e-commerce platforms, yet their core data stores and services continue to rely on SQL.

State of database theory

SQL organizes data in tables and offers additional features that facilitate data processing. Tables may contain duplicate rows, which avoids the cost of removing them, and null markers, which represent incomplete information in designated columns. In contrast, relations are tables that contain neither duplicate rows nor null markers.

Codd introduced the relational model of data in his seminal paper in 1970.

In 1981, Codd was awarded the Turing Award for his significant contributions to database management systems, which fundamentally transformed data management into a science. While mainstream research has explored various challenges in the relational model of data, only a limited number of constraints have been practically motivated. Studies indicate that, in theory, certain classes of constraints can enhance the quality of data management. However, research on constraints over general SQL tables is scarce, with much of the community's effort directed towards constraints on Web data. Investigating constraints on SQL data would not only address the data format that prevails in practice but also deepen our understanding of more complex data formats, including those found on the Web.

State of disparity

In practice, SQL data is the leading data format, yet the theory of constraints on SQL data has not been adequately explored. Additionally, implication problems for classes C of constraints over partial information are significantly more complex than those over relations. Consequently, the existing foundation for constraints covers only very specific cases of SQL data, creating a notable gap between theory and practice that severely hampers the quality of data management.

Objective and approach of the thesis

The thesis aims to tackle challenges related to SQL database constraints by offering tools for effectively discovering semantically meaningful SQL constraints. It explores the concept of Armstrong databases, which are sample databases that accurately represent a specific set of database constraints. While previous research has primarily focused on Armstrong databases for idealized SQL data, this study investigates their structural and computational properties for SQL data. Prototypes developed from these properties provide empirical evidence supporting the utility of Armstrong databases in identifying semantically meaningful constraints.

The thesis aims to create a toolbox of Armstrong databases focused on uniqueness constraints and functional dependencies in SQL data. Given the popularity of these constraints and SQL's prevalence in real-world applications, this contribution can influence database theory, practice, and education. Theoretically, it enhances our understanding of SQL constraints by moving beyond the idealized view of strictly relational constraints. Practically, database designers can use the toolbox to uncover a more comprehensive array of semantically meaningful SQL constraints.

Organization

Improving database designs and enhancing data processing capabilities can be achieved through the evaluation of Armstrong databases. In an educational context, these assessments can serve as a tool for grading non-multiple-choice questions in database assignments and exams. Additionally, the computation of Armstrong databases can facilitate automated feedback for students, promoting a deeper understanding of database concepts.

Chapter 2 presents a comprehensive literature review and formally defines key terms used in the introduction. It reviews prior research on data dependencies and Armstrong databases, emphasizing uniqueness constraints and functional dependencies. The discussion extends to two prominent methods for incorporating partial information in databases, alongside an examination of the implication problem for uniqueness constraints and functional dependencies. The chapter concludes with a summary that identifies a research gap and outlines the objectives of the thesis.

Chapters 3 and 4 focus on foundations. Chapter 3 addresses the implication problem for the combined class of uniqueness constraints, NOT NULL constraints, and functional dependencies, and characterizes it axiomatically, algorithmically, and logically under Codd's interpretation of null markers as values that are currently unknown. Chapter 4 establishes the structural and computational properties of Armstrong tables for this combined class of constraints.

Chapter 5 introduces SQL-Sampler, a toolbox created during this research. SQL-Sampler features both web-based and desktop graphical user interfaces, enabling users to compute Armstrong tables for a range of SQL constraints, including those from recent studies and the classes explored in the previous chapters.

Chapter 5 contains a description of the system requirements, the design of SQL-Sampler, implementation details, as well as a use case example with screenshots.

Chapters 6 and 7 focus on the empirical evaluation of the toolbox and the effectiveness of Armstrong tables in identifying meaningful SQL constraints. Chapter 6 outlines the experimental design and introduces various metrics to define "usefulness." In Chapter 7, both quantitative and qualitative analyses of the experimental data reveal that Armstrong tables generated by SQL-Sampler effectively uncover meaningful SQL uniqueness constraints and functional dependencies that may initially seem insignificant. Conversely, the analysis shows that these tables do not help to identify meaningless SQL uniqueness constraints and functional dependencies that were mistakenly viewed as meaningful before the Armstrong tables were inspected.

The conclusion of the thesis is presented in Chapter 8, which contains a summary of the main results as well as an outlook on possible future work.

Publications

Several of the results that will be presented in this thesis have been announced in international conferences and journals. These are:

V. B. T. Le, S. Link, and F. Ferrarotti present "SQL-Sampler," a tool designed to visualize and consolidate domain semantics through the generation of perfect SQL sample data. This work was featured in the proceedings of the 10th Asia-Pacific Conference on Conceptual Modeling (APCCM), edited by G. Grossmann and M. Saeki, vol. 154 of Conferences in Research and Practice in Information Technology, 10 pages, Australian Computer Society, 2014.

This paper received the Best Student Paper Award.

In their 2013 paper presented at the 32nd International Conference on Conceptual Modeling (ER), V. B. T. Le, S. Link, and F. Ferrarotti explore the effective recognition and visualization of semantic requirements through the use of perfect SQL samples. Their findings, published in Lecture Notes in Computer Science, highlight the importance of aligning SQL samples with semantic needs to enhance understanding and application in conceptual modeling.

This paper received the Best Student Paper Award.

• V. B. T. Le, S. Link, and M. Memari, “Schema- and data-driven discovery of SQL keys”, Journal of Computing Science and Engineering, vol. 6, no. 3, pp. 193-206, 2012.

• V. B. T. Le, S. Link, and M. Memari, “Discovery of keys from SQL tables”, in Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA) (S Goo Lee, Z. Peng, X. Zhou, Y.-S. Moon, R. Unland, and J. Yoo, eds.), vol. 7238 of Lecture Notes in Computer Science, pp. 48-62, Springer, 2012.

In their 2011 paper presented at the 22nd International Conference on Database and Expert Systems Applications, F. Ferrarotti, S. Hartmann, V. B. T. Le, and S. Link explore Codd table representations through the lens of weak possible world semantics. The paper appears in Lecture Notes in Computer Science (vol. 6860), pages 125 to 139, edited by A. Hameurlain, S. W. Liddle, K.-D. Schewe, and X. Zhou, and published by Springer.

This chapter presents a focused literature review on key constraints and functional dependencies, addressing the two challenges identified in Chapter 1. Section 2.1 summarizes key results regarding the implication problem for these constraints in pure relations. In Section 2.2, the discussion shifts to Armstrong relations for these classes. Section 2.3 reviews two interpretations of null markers in partial relations: "no information" and "value unknown at present." Section 2.4 reviews previous research on keys and functional dependencies in partial relations under both interpretations. Results concerning Armstrong tables for these constraints in partial relations are presented in Section 2.5. Finally, Section 2.6 summarizes the literature review, highlights existing research gaps, and outlines the objectives of the thesis.

Data Dependencies over Relations

The Relational Data Model

The relational model of data [25], or simply the relational model, refers to a database model with at least the following three components:

• Structural component: a set of relations, each of which can be represented in the form of a table containing values.

• Integrity component: integrity constraints which identify the relations that are semantically meaningful in the given application domain.

• Manipulative component: languages to define data structures and semantics, and to perform updates and queries.

For the purpose of this thesis, only the structural and the integrity components are of interest. Excellent resources for a broader introduction to the relational model of data include [1, 3, 19].

Understanding the distinction between syntax and semantics in databases is crucial. At the syntactic level, relation schemata represent the characteristics that describe all entities of interest. A relation schema is a finite, non-empty set R of elements known as attributes, where each attribute A ∈ R denotes a property of the entities. The potential values for an attribute A are drawn from its domain, denoted by dom(A).

On the semantic level we have relations, which are finite sets of tuples. Formally, a tuple t over the relation schema R is a function \( t : R \to \bigcup_{A \in R} dom(A) \) that assigns to every attribute A ∈ R a value t(A) ∈ dom(A).

Relations over a relation schema are commonly illustrated in the form of tables. As an example, consider a simple database about employees in Table 2.1.

Emp      Dept                 Mgr
Dilbert  Information Systems  Gates
Alice    Information Systems  Gates

Table 2.1: The relation r_total over WORK

Table 2.1 presents a relation over the schema WORK, which consists of the attributes Emp, Dept, and Mgr. Each tuple over this schema represents an employee who works in a department that has a designated manager. The domain of each attribute can be an appropriately restricted set of strings.

Data Dependencies

Data dependencies are sentences in predicate calculus that every valid relation must satisfy. They compensate for the limited semantics of pure relations by expressing relationships among attributes, such as one-to-one or one-to-many connections, that would otherwise be difficult to represent.

A dependency over a relation schema R is a function d that assigns a value d(r) ∈ {0,1} to every relation r over R. If d(r) = 1, the relation r is said to satisfy the dependency d; otherwise, r is said to violate d. A relation r satisfies a set Σ of data dependencies if it satisfies every data dependency σ ∈ Σ. The terms constraint and data dependency are used interchangeably in this thesis: strictly speaking, data dependencies form a subclass of integrity constraints, but the distinction is not crucial here.

In general, we will use the term that is most common in the literature; for example, we speak of uniqueness constraints and of functional dependencies.

Central to the study of data dependencies is their implication problem. The following subsections review principal concepts of data dependencies.

The implication problem for a fixed class C of data dependencies is to decide, for a finite set Σ ∪ {ϕ} of dependencies in C, whether Σ implies ϕ. Here, Σ implies ϕ if every relation that satisfies Σ also satisfies ϕ, that is, if no relation satisfies Σ and violates ϕ. In this case we write Σ ⊨ ϕ; otherwise we write Σ ⊭ ϕ.

The set Σ* = {ϕ ∈ C | Σ ⊨ ϕ}, which consists of all data dependencies in C implied by Σ, is called the semantic closure of Σ. The implication problem for the class C is to decide whether, for an arbitrarily given relation schema R and an arbitrarily given set Σ ∪ {ϕ} of data dependencies in C over R, Σ ⊨ ϕ holds.

For two sets \( \Sigma_1 \) and \( \Sigma_2 \) of data dependencies in C, we call \( \Sigma_2 \) a cover of \( \Sigma_1 \) if and only if \( \Sigma_1^* = \Sigma_2^* \). Covers are just different representations of the same semantics.

The implication problem is crucial for managing data dependencies within a database. When the elements of Σ ∪ {ϕ} represent important semantic properties, Σ ⊨ ϕ indicates that verifying a relation against Σ is sufficient to confirm that it also satisfies ϕ, which eliminates redundant checks and conserves system resources. Conversely, if Σ ⊭ ϕ, the database management system must validate directly whether the relation satisfies ϕ, as this cannot be inferred from the satisfaction of Σ alone.

This thesis focuses exclusively on finite relations and therefore on the finite implication problem. For the data dependencies addressed in this thesis, the finite implication problem coincides with the unrestricted implication problem, in which one also permits infinite relations. In general, the finite and unrestricted implication problems are different [1].

To decide the implication problem for a specific class C of data dependencies, one can take a syntactic approach based on the application of inference rules. These rules typically have the form of a premise leading to a conclusion; rules without any premise are called axioms.

An inference rule is sound for the implication of data dependencies if, whenever the rule's conditions are met, the elements of the premise imply the conclusion. For a finite set Σ ∪ {ϕ} of data dependencies and a set \( \mathfrak{R} \) of inference rules, we write \( \Sigma \vdash_{\mathfrak{R}} \phi \) to denote the inference of ϕ from Σ by \( \mathfrak{R} \): there is a sequence of data dependencies ending in ϕ in which each element is either a member of Σ or results from applying an inference rule in \( \mathfrak{R} \) to elements that occur earlier in the sequence. The syntactic closure of Σ under inferences by \( \mathfrak{R} \) is \( \Sigma^+_{\mathfrak{R}} = \{\phi \mid \Sigma \vdash_{\mathfrak{R}} \phi\} \). The set \( \mathfrak{R} \) is sound (complete) for the implication of data dependencies in a class C if, for every relation schema R and every set Σ of data dependencies in C over R, \( \Sigma^+_{\mathfrak{R}} \subseteq \Sigma^* \) (respectively \( \Sigma^* \subseteq \Sigma^+_{\mathfrak{R}} \)) holds. A finite set \( \mathfrak{R} \) that is both sound and complete is called a finite axiomatization for the implication of data dependencies in C.

This thesis examines two prominent classes of data dependencies in relational databases: keys and functional dependencies. The relational model literature has proposed around 100 distinct classes of data dependencies, which illustrates the diversity of this area. A comprehensive survey and classification of these classes can be found in the referenced book.

The Class of Keys

Keys are arguably the most important class of data dependencies [25, 42–44]. Formally, a key over a relation schema R is an expression key(K) where K ⊆ R. A relation r over R is said to satisfy the key key(K) over R if, for any two distinct tuples t and t′ in r, there exists an attribute A ∈ K such that t(A) ≠ t′(A). In other words, r violates key(K) if it contains two distinct tuples t and t′ with t(K) = t′(K).
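The satisfaction condition for keys can be checked directly by comparing all pairs of tuples. The following is a minimal sketch; the list-of-dictionaries encoding of a relation and the function name `satisfies_key` are assumptions made for illustration, and the data is the relation of Table 2.1.

```python
from itertools import combinations

def satisfies_key(relation, key):
    """Return True if no two distinct tuples agree on every attribute in `key`."""
    return all(
        any(t[a] != u[a] for a in key)
        for t, u in combinations(relation, 2)
    )

# The relation of Table 2.1, one dictionary per tuple (tuples assumed distinct).
r_total = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
]

print(satisfies_key(r_total, {"Emp"}))          # True
print(satisfies_key(r_total, {"Dept", "Mgr"}))  # False
```

The pairwise check is quadratic in the number of tuples; it is meant to mirror the definition rather than to be efficient.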

The implication problem of keys is characterized axiomatically by the set K of inference rules stated in Theorem 2.1. Following the database literature, we write XY for the union X ∪ Y of two attribute sets X and Y.

Theorem 2.1 The following set K of inference rules forms a finite axiomatization for the class of keys over relations.

\[ \frac{}{key(R)}\ \text{(relation axiom)} \qquad \frac{key(K)}{key(KK')}\ \text{(superkey)} \]

The relation axiom reflects that a relation cannot contain duplicate tuples: no two distinct tuples share identical values on all attributes of the relation schema. The superkey rule states that any superset of a key is itself a key.

In the context of the set Σ over WORK, which includes the key key(Emp), the superkey rule enables us to derive additional keys such as key(Emp,Dept), key(Emp,Mgr), and key(Emp,Dept,Mgr).

The axiomatization provides important hints on how to solve the implication problem of keys algorithmically. That is, to decide whether Σ ⊨ key(K) holds for a given relation schema R and a given set Σ ∪ {key(K)} of keys over R, it is sufficient to check whether K = R or there is some key(K′) ∈ Σ such that K′ ⊆ K. If |X| denotes the cardinality of a set X, and ||Σ|| denotes the total number of attributes that occur in Σ, then the worst-case time complexity of deciding the implication problem of keys over relations is O(max{|R|, ||Σ||}).

Over the relation schema WORK, the set Σ = {key(Emp, Dept), key(Emp, Mgr)} of keys does not imply the key ϕ = key(Emp).

Indeed, the singleton {Emp} does not equal WORK, and is not a superset of any of the keys in Σ.
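The decision procedure for key implication described above fits in a few lines. This is a sketch under the assumption that keys are encoded as Python sets of attribute names; the function name `implies_key` is hypothetical.

```python
def implies_key(schema, sigma, k):
    """Decide whether sigma implies key(k): k equals R, or k contains a key of sigma."""
    k = frozenset(k)
    return k == frozenset(schema) or any(frozenset(k0) <= k for k0 in sigma)

WORK = {"Emp", "Dept", "Mgr"}
sigma = [{"Emp", "Dept"}, {"Emp", "Mgr"}]

print(implies_key(WORK, sigma, {"Emp"}))              # False
print(implies_key(WORK, [{"Emp"}], {"Emp", "Dept"}))  # True, by the superkey rule
print(implies_key(WORK, [], WORK))                    # True, by the relation axiom
```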

The axiomatization K indicates that only minimal keys need to be specified for a relation schema R. A key key(K) is minimal for Σ if Σ ⊨ key(K) and there is no key(K′) ∈ Σ* such that K′ is a proper subset of K. In other words, a minimal key contains no attribute that could be dropped while still distinguishing any two distinct tuples, which makes minimal keys important in database practice. Furthermore, keys are a class of data dependencies that database management systems support natively. Notably, the number of minimal keys can grow exponentially with the number of attributes in the relation schema.

The Class of Functional Dependencies

Keys often fail to convey the rich semantic properties that data engineers seek, which creates a need for more expressive classes of data dependencies in database practice. Among these, functional dependencies are a widely recognized and essential class.

A functional dependency (FD) over a relation schema R is an expression X → Y, where X and Y are subsets of R. A relation r over R satisfies the functional dependency X → Y if, for any tuples t and t′ in r, t(X) = t′(X) implies t(Y) = t′(Y). That is, the values on X uniquely determine the corresponding values on Y.

Over the relation schema WORK, the FD Emp → Dept expresses that each employee belongs to at most one department, while the FD Dept → Mgr expresses that each department has at most one manager. The relation r_total from Table 2.1 satisfies both dependencies; however, it violates the FD Dept, Mgr → Emp, as it contains two employees who work under the same manager in the same department.

Over relations, the concept of a functional dependency subsumes that of a key. Specifically, for any relation schema R and any attribute set X ⊆ R, a relation r over R satisfies the key key(X) if and only if it satisfies the FD X → R.

As an example, the relation r_total of Table 2.1 satisfies the key key(Emp) and the FD Emp → Dept, Mgr.
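The satisfaction condition for functional dependencies can be checked in the same pairwise style as for keys. A minimal sketch, assuming the same hypothetical list-of-dictionaries encoding of the relation of Table 2.1:

```python
from itertools import combinations

def satisfies_fd(relation, lhs, rhs):
    """Return True if tuples that agree on all of `lhs` also agree on all of `rhs`."""
    for t, u in combinations(relation, 2):
        if all(t[a] == u[a] for a in lhs) and any(t[a] != u[a] for a in rhs):
            return False
    return True

r_total = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
]

print(satisfies_fd(r_total, {"Emp"}, {"Dept"}))         # True
print(satisfies_fd(r_total, {"Dept"}, {"Mgr"}))         # True
print(satisfies_fd(r_total, {"Dept", "Mgr"}, {"Emp"}))  # False
print(satisfies_fd(r_total, {"Emp"}, {"Dept", "Mgr"}))  # True
```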

Armstrong [52] proposed the axiomatization F in Theorem 2.2 for the implication of functional dependencies over relations.

Theorem 2.2 (Armstrong, 1974) The following set F forms a finite axiomatization for the implication of functional dependencies over relations.

As an example, consider the set Σ of FDs over the relation schema WORK that consists of Emp → Dept and Dept → Mgr. The FD Emp, Dept → Dept can be derived using the reflexivity axiom. Applying the transitivity rule to the two FDs in Σ yields Emp → Mgr. Finally, applying the extension rule to Emp → Mgr yields Emp → Emp, Mgr.
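The rule names used in this example suggest the following shape for the three rules of F. This rendering is an assumption based on how these rules are commonly stated, not a verbatim copy of Theorem 2.2:

\[
\frac{}{XY \to Y}\ \text{(reflexivity axiom)} \qquad
\frac{X \to Y}{X \to XY}\ \text{(extension rule)} \qquad
\frac{X \to Y \quad Y \to Z}{X \to Z}\ \text{(transitivity rule)}
\]

Under this reading, Emp, Dept → Dept is an instance of the reflexivity axiom with X = {Emp} and Y = {Dept}, Emp → Mgr follows from Emp → Dept and Dept → Mgr by the transitivity rule, and Emp → Emp, Mgr follows from Emp → Mgr by the extension rule.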

An important notion is that of the closure \( X^*_\Sigma \) of an attribute set X ⊆ R with respect to a set Σ of FDs over R. The attribute set closure \( X^*_\Sigma = \{A \in R \mid \Sigma \models X \to A\} \) consists of all attributes in R that are functionally determined by X under Σ. This concept is crucial because, for any relation schema R and any set Σ ∪ {X → Y} of FDs over R, Σ ⊨ X → Y holds if and only if \( Y \subseteq X^*_\Sigma \). Deciding the implication problem for functional dependencies therefore reduces to computing an attribute set closure. A straightforward and efficient method for doing so is Algorithm 2.1, originally presented in [47].

If ||Σ|| denotes the total number of attributes that occur in Σ, the algorithm can be implemented such that the implication problem with input Σ ∪ {X → Y} can be decided in time O(||Σ ∪ {X → Y}||) [46, 53].

As an example, consider the FD set Σ over the relation schema WORK that consists of Emp → Dept and Dept → Mgr. To decide whether Σ implies the FD ϕ = Emp → Mgr, we set X = {Emp} and apply Algorithm 2.1 to compute \( X^*_\Sigma \) = {Emp, Dept, Mgr}. Since {Mgr} is a subset of \( X^*_\Sigma \), Σ indeed implies ϕ.

Input: attribute set X, FD set Σ over relation schema R

Output: attribute set closure \( X^*_\Sigma \) of X with respect to Σ
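A computation with this input and output can be sketched as a simple fixpoint iteration. The code below is an illustration, not a transcription of Algorithm 2.1: it rescans Σ until nothing changes, so it runs in quadratic rather than linear time, and the encoding of FDs as `(lhs, rhs)` pairs of sets and the function names are assumptions.

```python
def closure(x, sigma):
    """Compute the attribute set closure of `x` under `sigma`.

    `sigma` is a list of (lhs, rhs) pairs of attribute sets.
    """
    closed = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            # Fire an FD whose left-hand side is already covered.
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def implies_fd(sigma, lhs, rhs):
    """Decide whether sigma implies lhs -> rhs via the closure of lhs."""
    return set(rhs) <= closure(lhs, sigma)

sigma = [({"Emp"}, {"Dept"}), ({"Dept"}, {"Mgr"})]
print(sorted(closure({"Emp"}, sigma)))               # ['Dept', 'Emp', 'Mgr']
print(implies_fd(sigma, {"Emp"}, {"Mgr"}))           # True
print(implies_fd(sigma, {"Dept", "Mgr"}, {"Emp"}))   # False
```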

Fagin demonstrated a connection between the implication problem for functional dependencies over relations and the implication problem for Horn clauses in propositional logic. We summarize this correspondence here.

For a finite set \( L \) of propositional variables, the propositional language \( L^* \) is generated from \( L \) using the unary connective \( \neg \) (negation) and the binary connective \( \lor \) (disjunction). That is, \( L^* \) is the smallest set that contains \( L \) and that contains \( \neg\psi \) and \( (\psi \lor \chi) \) whenever it contains \( \psi \) and \( \chi \).

Negation binds more strongly than disjunction, which allows us to omit parentheses when there is no risk of ambiguity. A literal is either a propositional variable or the negation of a propositional variable. A clause is an element of L* that is a disjunction of literals. The occurrence of a variable in a clause is negative if it is preceded by a negation; otherwise, it is positive. A clause over L is a Horn clause if it contains at most one positive occurrence of a variable.

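An interpretation of L is a total function ω : L → {F, T} that maps every variable A′ ∈ L to its truth value ω(A′). An interpretation ω of L can be lifted to a total function Ω : L* → {F, T} by means of simple rules: Ω(A′) = ω(A′) for every variable A′ ∈ L; Ω(¬ψ) = T if and only if Ω(ψ) = F; and Ω(ψ ∨ χ) = T if and only if Ω(ψ) = T or Ω(χ) = T.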

An interpretation ω is a model of a set Σ′ of formulae in L* if and only if Ω(σ′) = T holds for every σ′ ∈ Σ′. We say that Σ′ logically implies an L-formula ϕ′, denoted by \( \Sigma' \models_L \phi' \), if and only if every interpretation that is a model of Σ′ is also a model of ϕ′.

As an example, let L = {Emp′, Dept′, Mgr′}. The set Σ′, which consists of the Horn clauses ¬Emp′ ∨ Dept′ and ¬Dept′ ∨ Mgr′, does not logically imply the Horn clause ϕ′ = ¬Dept′ ∨ ¬Mgr′ ∨ Emp′. Specifically, the interpretation ω that assigns F to Emp′ and T to both Dept′ and Mgr′ is a model of Σ′ but not a model of ϕ′.
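Logical implication between clauses over a small variable set can be verified by enumerating all interpretations. The sketch below is a brute-force check, exponential in the number of variables and intended only to reproduce the example; the encoding of a clause as a set of `(variable, is_positive)` literals is an assumption.

```python
from itertools import product

def holds(clause, omega):
    """A clause holds under omega if at least one of its literals is true."""
    return any(omega[v] == positive for v, positive in clause)

def logically_implies(variables, sigma, phi):
    """Check that every model of `sigma` is also a model of `phi`."""
    for values in product([False, True], repeat=len(variables)):
        omega = dict(zip(variables, values))
        if all(holds(c, omega) for c in sigma) and not holds(phi, omega):
            return False  # omega is a counterexample interpretation
    return True

L = ["Emp'", "Dept'", "Mgr'"]
sigma = [
    {("Emp'", False), ("Dept'", True)},   # not Emp' or Dept'
    {("Dept'", False), ("Mgr'", True)},   # not Dept' or Mgr'
]
phi = {("Dept'", False), ("Mgr'", False), ("Emp'", True)}

print(logically_implies(L, sigma, phi))                                # False
print(logically_implies(L, sigma, {("Emp'", False), ("Mgr'", True)}))  # True
```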

We assume that every functional dependency over a relation schema R has the form X → A, where A is an attribute in R. This is no restriction, since any FD X → Y can be replaced by the FDs X → A for each attribute A ∈ Y. Let φ be a bijection that maps the attributes of R to a set L of propositional variables. This mapping is extended to a mapping Φ from functional dependencies over R to Horn clauses over L. Specifically, for a functional dependency X → A over R with X = {A_1, …, A_n}, we define \( \Phi(X \to A) = \neg\varphi(A_1) \lor \cdots \lor \neg\varphi(A_n) \lor \varphi(A) \). In what follows, for a functional dependency σ we write σ′ instead of Φ(σ), and for a finite set Σ of FDs we write Σ′ instead of {σ′ | σ ∈ Σ}.

Theorem 2.3 Let R be a relation schema, and let Σ ∪ {ϕ} denote a set of FDs over R. Then Σ implies ϕ if and only if Σ′ logically implies ϕ′.

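For example, the set Σ, consisting of the FDs Emp → Dept and Dept → Mgr, does not imply the FD ϕ = {Dept, Mgr} → Emp. This is witnessed by any two-tuple relation r whose tuples agree on Dept and Mgr but differ on Emp, such as the relation r_total of Table 2.1.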

Indeed, we have already seen above that the associated set Σ′ of Horn clauses does not logically imply the associated Horn clause ϕ′.

The example shows that two-tuple counterexample relations for the implication of FDs correspond directly to counterexample truth assignments for the logical implication of the associated Horn clauses. Specifically, the special truth assignment \( \omega_r \) assigns T to the variable A′ exactly when the two tuples of the relation r have matching values on attribute A. Consequently, Theorem 2.3 follows easily from the fact that for any FD ϕ not implied by a set Σ of FDs, there exists a two-tuple relation that satisfies all FDs in Σ and violates ϕ.
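One way to obtain such a two-tuple relation is to let the two tuples agree exactly on the closure of the left-hand side of ϕ. The sketch below follows this idea and then reads off the truth assignment from the two tuples; the integer placeholder values, the encoding of FDs as pairs of sets, and the function names are assumptions made for illustration.

```python
def closure(x, sigma):
    """Attribute set closure of x under sigma, a list of (lhs, rhs) set pairs."""
    closed = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def two_tuple_counterexample(schema, sigma, lhs, rhs):
    """Return two tuples satisfying sigma and violating lhs -> rhs, or None if implied."""
    agree = closure(lhs, sigma)
    if rhs <= agree:
        return None  # sigma implies lhs -> rhs, so no counterexample exists
    t = {a: 0 for a in schema}
    u = {a: 0 if a in agree else 1 for a in schema}  # differ outside the closure
    return t, u

schema = ["Emp", "Dept", "Mgr"]
sigma = [({"Emp"}, {"Dept"}), ({"Dept"}, {"Mgr"})]
t, u = two_tuple_counterexample(schema, sigma, {"Dept", "Mgr"}, {"Emp"})

# The truth assignment: a variable is true iff the two tuples agree on its attribute.
omega_r = {a: t[a] == u[a] for a in schema}
print(omega_r)  # {'Emp': False, 'Dept': True, 'Mgr': True}
```

The resulting assignment is the interpretation ω used in the Horn clause example above.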

Functional dependencies are essential for database design and normalization, and they significantly impact areas such as query optimization, database maintenance, security, data cleaning [70], data entry [5], data exchange [71], and data integration [15, 72, 73]. Fundamental to all these applications is the assumption that the set of functional dependencies that are semantically meaningful for a given relation schema R has been correctly identified.

Armstrong Relations

General Definition of an Armstrong Relation

For future reference we repeat the formal definition of an Armstrong relation here [39, 40].

Let C denote a class of data dependencies, let R denote a relation schema, and let Σ be a set of data dependencies in C over R. We say that a relation r over R is an Armstrong relation for Σ if r satisfies every data dependency in Σ and violates every data dependency in C over R that is not implied by Σ. We say that the class C enjoys Armstrong relations if, for every relation schema R and every set Σ of data dependencies in C over R, there exists a relation over R that is an Armstrong relation for Σ.

Armstrong Relations for Functional Dependencies

We now review the existing theory of Armstrong relations for the class of functional dependencies [41, 82].

Armstrong showed that the class of functional dependencies over relations does enjoy Armstrong relations [52]. Characterizations of these Armstrong relations rely on three notions: closed attribute sets, maximal sets, and agree sets.

X                 Closure of X
∅                 ∅
Emp               Emp, Dept, Mgr
Dept              Dept, Mgr
Mgr               Mgr
Emp, Dept         Emp, Dept, Mgr
Emp, Mgr          Emp, Dept, Mgr
Dept, Mgr         Dept, Mgr
Emp, Dept, Mgr    Emp, Dept, Mgr

Table 2.3: The closure operation (·)*_Σ for Σ = {Emp → Dept, Dept → Mgr}. The closed attribute sets of WORK are the rows in which both columns coincide.

The first notion is that of a closed attribute set. Let R be a relation schema and Σ a set of functional dependencies over R.

An attribute set \( X \subseteq R \) is closed with respect to \( \Sigma \) if \( X^*_\Sigma = X \). The notation \( cl_\Sigma(R) \) denotes the collection of all attribute subsets of \( R \) that are closed with respect to \( \Sigma \). The closure operation \( (\cdot)^*_\Sigma : P(R) \rightarrow P(R) \) maps any attribute set \( X \subseteq R \) to its closure \( X^*_\Sigma \) with respect to \( \Sigma \). This operation is extensive, meaning \( X \subseteq X^*_\Sigma \); it is increasing, as \( X \subseteq Y \) implies \( X^*_\Sigma \subseteq Y^*_\Sigma \); and it is idempotent, that is, \( (X^*_\Sigma)^*_\Sigma = X^*_\Sigma \).

As an example, consider Table 2.3, which shows the closure operation \( (\cdot)^*_\Sigma \) for Σ = {Emp → Dept, Dept → Mgr}. The closed attribute sets of WORK are therefore ∅, {Mgr}, {Dept, Mgr}, and {Emp, Dept, Mgr}.
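The closed attribute sets of a small schema can be enumerated by computing the closure of every subset. This brute-force sketch is exponential in the number of attributes and serves only to reproduce the example; the function names and the set-based encoding of FDs are assumptions.

```python
from itertools import combinations

def closure(x, sigma):
    """Attribute set closure of x under sigma, a list of (lhs, rhs) set pairs."""
    closed = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def closed_sets(schema, sigma):
    """Return all subsets X of the schema whose closure equals X."""
    attrs = sorted(schema)
    result = set()
    for n in range(len(attrs) + 1):
        for combo in combinations(attrs, n):
            x = frozenset(combo)
            if closure(x, sigma) == x:
                result.add(x)
    return result

WORK = {"Emp", "Dept", "Mgr"}
sigma = [({"Emp"}, {"Dept"}), ({"Dept"}, {"Mgr"})]
for x in sorted(closed_sets(WORK, sigma), key=len):
    print(sorted(x))
```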

The second notion is that of a maximal set. An attribute set X ⊆ R is maximal for an attribute A ∈ R with respect to a set Σ of FDs over R if Σ does not imply X → A and, for every attribute B ∈ R − X, Σ implies XB → A. That is, X is maximal among all attribute sets that do not functionally determine A given Σ. We denote by maxΣ(A) the family of all attribute sets that are maximal for A with respect to Σ. Further, let \( max_\Sigma(R) = \bigcup_{A \in R} max_\Sigma(A) \) denote the maximal sets of R.

Table 2.4: Families of maximal sets of WORK for Σ = {Emp → Dept, Dept → Mgr}

The third notion is that of an agree set. Given two tuples t and t′ over R, the agree set ag(t, t′) = {A ∈ R | t(A) = t′(A)} consists of all those attributes of R on which t and t′ agree, i.e., have the same value [41, 82]. Now, the agree set ag(r) of a relation r is the set of the agree sets of all pairs of distinct tuples in r. That is, ag(r) = {ag(t, t′) | t, t′ ∈ r and t ≠ t′}.
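Both definitions translate directly into code. The sketch below computes the family of maximal sets for each attribute of WORK by brute force and the agree sets of the relation of Table 2.1; the encodings and function names are assumptions, and the enumeration of all subsets is only practical for small schemata.

```python
from itertools import combinations

def closure(x, sigma):
    """Attribute set closure of x under sigma, a list of (lhs, rhs) set pairs."""
    closed = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def subsets(schema):
    """Yield every subset of the schema as a frozenset."""
    attrs = sorted(schema)
    for n in range(len(attrs) + 1):
        for combo in combinations(attrs, n):
            yield frozenset(combo)

def is_maximal(w, a, schema, sigma):
    """w does not determine a, but w plus any attribute outside w does."""
    if a in closure(w, sigma):
        return False
    return all(a in closure(w | {b}, sigma) for b in set(schema) - w)

def max_sets(a, schema, sigma):
    """The family of all attribute sets that are maximal for attribute a."""
    return {w for w in subsets(schema) if is_maximal(w, a, schema, sigma)}

def agree_sets(relation, schema):
    """The agree sets of all pairs of distinct tuples in the relation."""
    return {
        frozenset(a for a in schema if t[a] == u[a])
        for t, u in combinations(relation, 2)
    }

WORK = ["Emp", "Dept", "Mgr"]
sigma = [({"Emp"}, {"Dept"}), ({"Dept"}, {"Mgr"})]
for a in WORK:
    print(a, [sorted(w) for w in max_sets(a, WORK, sigma)])

r_total = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
]
print([sorted(x) for x in agree_sets(r_total, WORK)])  # [['Dept', 'Mgr']]
```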

As an example, consider the Armstrong relation r_Arm for Σ = {Emp → Dept, Dept → Mgr} from Table 2.2. The agree sets of this relation are {Dept, Mgr}, {Mgr}, and ∅.

We can now state the anticipated structural characterization of Armstrong relations [41, 82].

Theorem 2.4 Let R be a relation schema, Σ a set of FDs, and r a relation over R. Then r is an Armstrong relation for Σ if and only if the following condition is satisfied:

maxΣ(R) ⊆ ag(r) ⊆ clΣ(R)    (2.1)

The condition maxΣ(R) ⊆ ag(r) guarantees that every maximal set occurs as an agree set of the relation, so that the relation violates all functional dependencies (FDs) not implied by Σ. The condition ag(r) ⊆ clΣ(R) ensures that every agree set is closed, so that r satisfies Σ.

For the relation schema WORK with the FD set Σ = {Emp → Dept, Dept → Mgr}, we obtain maxΣ(WORK) = {{Dept, Mgr}, {Mgr}, ∅}, ag(r_Arm) = {{Dept, Mgr}, {Mgr}, ∅}, and clΣ(WORK) = {{Dept, Mgr}, {Mgr}, ∅, {Emp, Dept, Mgr}}.

We conclude that r_Arm is indeed an Armstrong relation for Σ.
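The condition of Theorem 2.4 can be checked mechanically on small examples. The sketch below is an illustration under stated assumptions: maximal sets and closed sets are computed by brute force, the function names are invented for this example, and the rows in `ROWS` are a hypothetical instance over WORK chosen to have the agree sets listed above (they are not claimed to be the rows of Table 2.2).

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under FDs given as (lhs, rhs) pairs of frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def powerset(schema):
    attrs = sorted(schema)
    return [frozenset(c) for k in range(len(attrs) + 1)
            for c in combinations(attrs, k)]

def maximal_sets(schema, fds):
    """Union over all attributes A of the sets that are maximal for A."""
    found = set()
    for a in schema:
        for x in powerset(schema):
            if a in closure(x, fds):
                continue  # x already determines a
            if all(a in closure(x | {b}, fds) for b in schema - x):
                found.add(x)
    return found

def agree_sets(rows):
    """Agree sets of all pairs of distinct tuples (rows are dicts)."""
    return {frozenset(a for a in t if t[a] == u[a])
            for t, u in combinations(rows, 2)}

def is_armstrong(schema, fds, rows):
    """Check max(R) <= ag(r) <= cl(R), the condition (2.1)."""
    closed = {x for x in powerset(schema) if closure(x, fds) == x}
    return maximal_sets(schema, fds) <= agree_sets(rows) <= closed

WORK = frozenset({"Emp", "Dept", "Mgr"})
SIGMA = [(frozenset({"Emp"}), frozenset({"Dept"})),
         (frozenset({"Dept"}), frozenset({"Mgr"}))]
# Hypothetical instance, for illustration only.
ROWS = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Bob", "Dept": "Sales", "Mgr": "Gates"},
    {"Emp": "Carol", "Dept": "E-commerce", "Mgr": "Jobs"},
]
print(is_armstrong(WORK, SIGMA, ROWS))
```

Dropping the last row removes the agree set ∅, so the remaining three rows satisfy the non-implied FD ∅ → Mgr and no longer form an Armstrong relation.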

Theorem 2.4 provides a strategy for computing an Armstrong relation for a given set Σ of FDs over a given relation schema R: compute the set of maximal sets with respect to Σ, and generate pairs of tuples whose agree sets realize these maximal sets. As every maximal set is closed, this strategy produces an Armstrong relation. The remainder of this section develops this strategy in detail.

Let R be a relation schema and Σ a set of FDs over R. For W ⊆ R and C ∈ R, it takes O(|R| × ||Σ||) time to test whether W ∈ maxΣ(C). We write mtest(W, C, R, Σ) for this test.

Computing the maximal sets of R with respect to Σ by testing all subsets of R is inefficient. Instead, they can be computed iteratively. Let R be a relation schema and Σ = Σ' ∪ {X → A} a set of FDs over R. For every C ∈ R, if V ∈ maxΣ(C), then V ∈ maxΣ'(C), or there are some B ∈ X, Z ∈ maxΣ'(B) and W ∈ maxΣ'(C) such that V = W ∩ Z [41].

For the relation schema WORK and the FD set Σ = {Emp → Dept, Dept → Mgr}, the following table shows the families of maximal sets for each attribute of WORK with respect to the FD sets ∅, {Emp → Dept}, and Σ, in the order in which the FDs are added. Note, in particular, that ∅ ∈ maxΣ(Mgr) satisfies ∅ = V = W ∩ Z for W = {Emp, Dept} ∈ max{Emp → Dept}(Mgr) and Z = {Mgr} ∈ max{Emp → Dept}(Dept).

| A ∈ R | W ∈ max∅(A) | W ∈ max{Emp → Dept}(A) | W ∈ maxΣ(A) |
| --- | --- | --- | --- |
| Emp | {Dept, Mgr} | {Dept, Mgr} | {Dept, Mgr} |
| Dept | {Emp, Mgr} | {Mgr} | {Mgr} |
| Mgr | {Emp, Dept} | {Emp, Dept} | ∅ |

Algorithm 2.2 computes the families of maximal sets [4]. It starts with the maximal sets for R with respect to the empty FD set, and then adds the FDs of Σ one by one while tracking the resulting changes to the families of maximal sets.
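The iterative idea can be sketched in Python as follows. This is not a transcription of Algorithm 2.2: it is a hedged illustration that generates candidates according to the intersection property stated above (old maximal sets plus intersections W ∩ Z) and then filters them with a direct maximality test in the spirit of mtest. The function names and the encoding of FDs as pairs of frozensets are assumptions, and FDs are split into singleton right-hand sides before being added.

```python
def closure(attrs, fds):
    """Closure of attrs under FDs given as (lhs, rhs) pairs of frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def is_maximal(w, c, schema, fds):
    """W does not determine C, but adding any further attribute does."""
    if c in closure(w, fds):
        return False
    return all(c in closure(w | {b}, fds) for b in schema - w)

def maximal_set_families(schema, fds):
    schema = frozenset(schema)
    # With no FDs, the only maximal set for C is R - {C}.
    families = {c: {schema - {c}} for c in schema}
    added = []
    for lhs, rhs in fds:
        for a in sorted(rhs):
            added.append((lhs, frozenset({a})))
            updated = {}
            for c in schema:
                # Candidates: old maximal sets and intersections W & Z.
                candidates = set(families[c])
                for b in lhs:
                    for z in families[b]:
                        candidates |= {w & z for w in families[c]}
                updated[c] = {v for v in candidates
                              if is_maximal(v, c, schema, added)}
            families = updated
    return families

SIGMA = [(frozenset({"Emp"}), frozenset({"Dept"})),
         (frozenset({"Dept"}), frozenset({"Mgr"}))]
families = maximal_set_families({"Emp", "Dept", "Mgr"}, SIGMA)
for attr in ("Emp", "Dept", "Mgr"):
    print(attr, [sorted(w) for w in families[attr]])
```

On the WORK example, the intermediate and final families agree with the table of maximal set families for WORK.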

Algorithm 2.3 computes an Armstrong relation r for a set Σ of FDs [41]. First, it computes the families maxΣ(R) of maximal sets using Algorithm 2.2. Subsequently, it produces a relation r such that maxΣ(R) ⊆ ag(r) holds.

As an example, Algorithm 2.3 is applied to compute an Armstrong relation for Σ = {Emp → Dept, Dept → Mgr} over the relation schema WORK. Table 2.5 shows this relation. Under a suitable substitution of its symbols, this relation yields the one in Table 2.2.

| Emp | Dept | Mgr |
| --- | --- | --- |
| c_{Emp,0} | c_{Dept,0} | c_{Mgr,0} |
| c_{Emp,1} | c_{Dept,0} | c_{Mgr,0} |
| c_{Emp,2} | c_{Dept,2} | c_{Mgr,0} |
| c_{Emp,3} | c_{Dept,3} | c_{Mgr,3} |

Table 2.5: An Armstrong relation for WORK computed by Algorithm 2.3
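The pattern visible in Table 2.5 suggests a simple construction: one base tuple with index 0 everywhere, and one further tuple per maximal set that reuses index 0 exactly on the attributes of that set. The sketch below follows this reading of the table. It is an assumption about the shape of the output, not a reproduction of the pseudocode of Algorithm 2.3, and the symbol c_{A,i} is encoded as the pair `(A, i)`.

```python
from itertools import combinations

def armstrong_rows(attributes, maximal_sets):
    """Base tuple plus one tuple per maximal set W, agreeing with the base on W."""
    rows = [tuple((a, 0) for a in attributes)]
    for i, w in enumerate(maximal_sets, start=1):
        rows.append(tuple((a, 0) if a in w else (a, i) for a in attributes))
    return rows

def agree_sets(rows):
    """Agree sets of all pairs of distinct tuples, read off the (A, i) symbols."""
    return {frozenset(x[0] for x, y in zip(t, u) if x == y)
            for t, u in combinations(rows, 2)}

ATTRS = ("Emp", "Dept", "Mgr")
# Maximal sets of WORK for {Emp -> Dept, Dept -> Mgr}, as listed in the text.
MAX_SETS = [frozenset({"Dept", "Mgr"}), frozenset({"Mgr"}), frozenset()]

rows = armstrong_rows(ATTRS, MAX_SETS)
for row in rows:
    print(row)
```

Each maximal set is realized as the agree set of the base tuple with one other tuple. Any two non-base tuples agree exactly on the intersection of their maximal sets, which is again closed, so no unwanted agree set is introduced.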

The complexity of finding an Armstrong relation, given a set of functional dependencies over R, is precisely exponential in the size |R| of R.

Here, "precisely exponential" means two things. First, there is an algorithm that computes an Armstrong relation for a given set Σ of functional dependencies (FDs) in time exponential in the number of attributes. Second, there are sets Σ of FDs for which the number of tuples in a minimum-sized Armstrong relation is itself exponential in the number of attributes, so that an exponential amount of time is required in this case simply to write down the relation [41, 82].

Algorithm 2.2

Input: relation schema R, a set Σ of FDs over R

Output: sets maxΣ(C) of maximal sets for all C ∈ R

Algorithm 2.3

Input: relation schema R, a set Σ of FDs over R

Algorithm 2.3 is a straightforward method for generating Armstrong relations and is fairly conservative in its use of time, considering that the problem is in general exponential in the number of attributes.

The size of an Armstrong relation is the number of tuples it contains. The most desirable Armstrong relation for an FD set Σ is one of minimum size, since fewer tuples are easier for humans to comprehend. It is therefore important to know how many tuples a minimum-sized Armstrong relation requires. An Armstrong relation r for Σ is minimum-sized if there is no Armstrong relation r' for Σ with fewer tuples.

Theorem 2.5 Let Σ be a set of FDs over relation schema R, and let r be a minimum-sized Armstrong relation for Σ. Then \( \sqrt{1 + 8 \cdot |max_\Sigma(R)|} \) …

Algorithm 2.3 always generates an Armstrong relation of reasonably small size. On input (R, Σ), it produces an Armstrong relation for Σ whose size is at most quadratic in the size of a minimum-sized Armstrong relation for Σ.
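A square-root term of the kind appearing in Theorem 2.5 arises from a simple counting argument, sketched here as an illustration and not necessarily the argument used in the cited proofs. A relation with n tuples has at most n(n − 1)/2 pairs of distinct tuples and hence at most that many agree sets, while Theorem 2.4 requires every maximal set to occur as an agree set:

```latex
\[
|max_\Sigma(R)| \;\le\; |ag(r)| \;\le\; \binom{n}{2} = \frac{n(n-1)}{2}
\quad\Longrightarrow\quad
n \;\ge\; \frac{1 + \sqrt{1 + 8 \cdot |max_\Sigma(R)|}}{2}.
\]
```

For WORK with |maxΣ(WORK)| = 3 this gives n ≥ 3, while the relation in Table 2.5 has four tuples, one more than the number of maximal sets.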

Silva and Melkanoff were the first to recognize the practical potential of Armstrong relations for identifying semantically meaningful dependencies. They developed a prototype that assists design teams by providing an Armstrong relation for given functional and multivalued dependencies. This allows teams to focus on whether a dependency is satisfied or violated by the relation, rather than on determining whether it is implied by the input. The Database Design Expert System (DBE) includes this capability for functional dependencies (FDs), while the DBA companion offers similar functionality for standard FDs and inclusion dependencies. Algorithms for computing Armstrong relations have also been integrated into the Design-by-example tool.

Evidence For the Usefulness of Armstrong Relations

Armstrong relations have been described as "user-friendly representations" of data dependencies and are widely regarded as useful for database design. In the literature, this usefulness is mostly justified by structural and algorithmic properties of Armstrong relations. For instance, it may refer to the fact that FDs enjoy Armstrong relations. Other interpretations of the term "useful" refer to the size of an Armstrong relation for a given FD set Σ, or to the efficiency of algorithms that compute such relations. These interpretations have received considerable attention from the research community.

Langeveldt and Link conducted an empirical investigation into the usefulness of Armstrong relations for identifying semantically meaningful functional dependencies. They introduced three measures: soundness, completeness, and proximity. Their findings show that Armstrong relations do not help with soundness, that is, they do not help designers recognize meaningless functional dependencies that are incorrectly perceived as meaningful. Conversely, Armstrong relations do help with completeness, as they aid in recognizing meaningful functional dependencies that would otherwise be overlooked. This suggests that, with Armstrong relations, it is hard to recognize that a dependency is semantically meaningless, but easier to notice that a meaningful dependency is violated.

The study by Langeveldt and Link is significant as it is the first to demonstrate how Armstrong databases can be used effectively by humans. Their findings show that these databases help designers identify a more complete set of requirements for the target database, and design quality depends heavily on completeness. The effectiveness of normalization algorithms and de-normalization strategies relies on the successful acquisition of the business rules of the application domain. Consequently, design quality directly affects the performance of the associated business. Furthermore, this research establishes a first framework for empirically evaluating the usefulness of Armstrong databases under various measures. This thesis aims to extend that framework to uniqueness constraints and functional dependencies over partial databases, whose interaction is more complex than over total relations.

Constraint Acquisition by Sample Data and Natural

The article proposes an informal method for identifying semantic constraints using natural language processing and sample data, focusing on keys, functional dependencies, cardinality constraints, and inclusion and exclusion dependencies. Using targeted dialogues in natural language, the authors derive the structural and semantic components of the future database. The output is assessed for ambiguity and fuzziness; the structural part is validated through a graphical editor, and the semantics are discussed using sample relations. The authors introduce heuristics that reduce the search space for potential semantic constraints. The approach has been integrated into a larger database design system known as Rapid Application and Database Development (RADD). The discovery of semantic constraints, the heuristics, and the informal validation methods are also useful for reverse engineering, where existing databases are translated into different data models.

Partial Information in Databases

No Information

SQL supports a single null marker that represents various forms of partial information about domain values. This null marker indicates either that a domain value does not exist, or that a domain value exists but is unknown, an interpretation referred to as "no information" in the literature. The null marker, denoted by ni, is treated as a distinguished element of the domain of each attribute. Relations that may contain null markers are called partial relations. A tuple t over a relation schema R is X-total for X ⊆ R if t(A) ≠ ni for all attributes A ∈ X. A partial relation is X-total if all its tuples are X-total.

Table 2.6 presents a partial relation over WORK in which the last tuple carries the ni marker on the Emp attribute. This indicates that either the employee for this tuple does not exist, or the employee exists but is currently unknown.

It is possible for a partial relation to have a tuple whose information content subsumes that of another tuple in the same partial relation.

| Emp | Dept | Mgr |
| --- | --- | --- |
| Dilbert | Information Systems | Gates |
| Alice | Information Systems | Gates |
| ni | E-commerce | Jobs |

Table 2.6: The partial relation r_ni over WORK

For example, we may extend the partial relation r_ni of Table 2.6 by another tuple whose information content subsumes that of the third tuple (ni, E-commerce, Jobs) in r_ni. While one may argue that a partial relation should only contain tuples with maximal information content, removing subsumed tuples from a partial relation could be prohibitively expensive in practical database systems.

More formally, a tuple t over R subsumes a tuple t' over R, denoted by t' ⊑ t, if and only if for all A ∈ R it is true that t'(A) = ni or t'(A) = t(A) [30]. A partial relation r is said to be subsumption-free if and only if there are no two tuples t, t' ∈ r such that t' ≠ t and t' ⊑ t hold.
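The two definitions translate directly into code. In this minimal sketch the null marker ni is modelled by Python's `None`, tuples are dictionaries, and the function names are assumptions. The first three rows are those of Table 2.6. The fourth tuple, with the employee name Dogbert, is purely hypothetical and only serves to create a subsumed tuple.

```python
NI = None  # stands for the null marker ni

def subsumes(t, u):
    """True if tuple t subsumes tuple u: u is ni or equal to t on every attribute."""
    return all(u[a] is NI or u[a] == t[a] for a in u)

def is_subsumption_free(rows):
    """No tuple is subsumed by a different tuple of the same partial relation."""
    return not any(t != u and subsumes(t, u) for t in rows for u in rows)

r_ni = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": NI, "Dept": "E-commerce", "Mgr": "Jobs"},
]
# Hypothetical extra tuple that subsumes the third tuple of r_ni.
extra = {"Emp": "Dogbert", "Dept": "E-commerce", "Mgr": "Jobs"}

print(is_subsumption_free(r_ni))
print(is_subsumption_free(r_ni + [extra]))
```

The pairwise check is quadratic in the number of tuples, which hints at why enforcing subsumption-freeness can be costly on large instances.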

Value Unknown at Present

Codd proposed the use of a single null marker, unk, to handle incomplete information in attribute domains; it indicates that the value exists but is currently unknown. Relations that may contain this null marker are referred to as Codd relations, while those that use the ni marker are referred to as SQL relations.

In the following we use the term partial relations to express that a certain fact holds for Codd and for SQL relations.

The null marker unk is quite different from the null marker ni. As an illustration, consider Table 2.7, which shows a Codd relation over WORK.

| Emp | Dept | Mgr |
| --- | --- | --- |
| Dilbert | Information Systems | Gates |
| Alice | Information Systems | Gates |
| unk | E-commerce | Jobs |

Table 2.7: The Codd relation r_unk over WORK

Table 2.8: A possible world relation for r_unk

The last tuple features the unk marker on the Emp attribute. This means that the employee for this tuple does exist but is unknown at present.

The semantics of Codd relations has been investigated in the literature. A widely accepted approach is the possible world semantics, in which a possible world of a Codd relation is obtained by substituting each occurrence of the unk marker with a value from the domain of the corresponding attribute. Formally, the set of all possible worlds of a Codd relation r over R is denoted by Poss(r).

Poss(r) := {r' | r' is a relation over R and there is a bijection b : r → r' such that for all t ∈ r, t is subsumed by b(t) and b(t) is R-total}.

The definition of possible worlds reflects the closed world assumption (CWA): a possible world in Poss(r) contains only R-total tuples that are obtained from the tuples of r. For example, Table 2.8 shows a relation that is a possible world of the Codd relation r_unk from Table 2.7.
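For finite attribute domains, Poss(r) can be enumerated directly. The sketch below is an illustration under assumptions: unk is modelled by `None`, the domain of Emp is a small hypothetical set, and the bijection requirement is enforced by discarding substitutions that map two tuples of r to the same total tuple.

```python
from itertools import product

UNK = None  # stands for the null marker unk

def possible_worlds(rows, attrs, domains):
    """All possible worlds of a Codd relation, each as a frozenset of total tuples."""
    per_row = []
    for t in rows:
        # Every unk may become any domain value; known values stay fixed.
        choices = [domains[a] if t[a] is UNK else [t[a]] for a in attrs]
        per_row.append(list(product(*choices)))
    worlds = set()
    for picked in product(*per_row):
        world = frozenset(picked)
        if len(world) == len(rows):  # images of distinct tuples stay distinct
            worlds.add(world)
    return worlds

ATTRS = ("Emp", "Dept", "Mgr")
r_unk = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": UNK, "Dept": "E-commerce", "Mgr": "Jobs"},
]
# Hypothetical finite domain for Emp, for illustration only.
DOMAINS = {"Emp": ["Dilbert", "Alice", "Wally"]}

worlds = possible_worlds(r_unk, ATTRS, DOMAINS)
print(len(worlds))
```

With real, possibly infinite domains such an enumeration is not feasible. The sketch is only meant to make the definition concrete.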

Data Dependencies over Partial Relations

Functional Dependencies over SQL Relations

Lien [30] pioneered research on the class of functional dependencies over subsumption-free SQL relations. We summarize his results on the implication problem associated with this class.

A functional dependency with nulls (NFD) over a relation schema R is an expression X → Y, where X and Y are subsets of R. An SQL relation r over R satisfies the NFD X → Y if, for all tuples t and t' in r, the following holds: if t(X) = t'(X) and both tuples are X-total, then t(Y) = t'(Y).

Thus, whenever two X-total tuples agree on X, they must also agree on Y, where their projections onto Y may be partial. An NFD X → Y is violated by an SQL relation r if there are two distinct X-total tuples t and t' in r with t(X) = t'(X) that differ on at least one attribute A ∈ Y, i.e., t(A) ≠ t'(A). This can occur if t(A) and t'(A) are distinct non-null values, or if one of them is the null marker and the other is not.

As examples, the SQL relation r_SQL in Table 2.9 satisfies the NFDs Emp → Dept and Dept → Mgr, and it violates the NFDs Mgr → Dept, Mgr → Emp, and Emp → Mgr.
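The satisfaction test for NFDs is easy to state in code. In this sketch ni is modelled by `None` and the function names are assumptions. The relation `r_hyp` below is a hypothetical instance constructed so that it shows the same pattern of satisfied and violated NFDs as described above. It is not claimed to be the content of Table 2.9.

```python
from itertools import combinations

NI = None  # stands for the null marker ni

def is_total(t, attrs):
    return all(t[a] is not NI for a in attrs)

def satisfies_nfd(rows, lhs, rhs):
    """X-total tuples that agree on X must carry the same information on Y."""
    for t, u in combinations(rows, 2):
        if (is_total(t, lhs) and is_total(u, lhs)
                and all(t[a] == u[a] for a in lhs)):
            # ni compared with ni counts as agreement; ni against a value does not.
            if any(t[a] != u[a] for a in rhs):
                return False
    return True

# Hypothetical SQL relation over WORK, for illustration only.
r_hyp = [
    {"Emp": "Dilbert", "Dept": NI, "Mgr": "Gates"},
    {"Emp": "Dilbert", "Dept": NI, "Mgr": "Jobs"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": "Gates"},
]

for lhs, rhs in [("Emp", "Dept"), ("Dept", "Mgr"), ("Emp", "Mgr"),
                 ("Mgr", "Dept"), ("Mgr", "Emp")]:
    print(lhs, "->", rhs, satisfies_nfd(r_hyp, [lhs], [rhs]))
```

Note that Dept → Mgr holds here only because at most one tuple is Dept-total, which is exactly the effect that breaks transitivity over SQL relations.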

For total SQL relations the definition of an NFD reduces to that of an FD. The definition is well aligned with the no information interpretation. Firstly, tuples with null markers on attributes of X cannot violate the functional dependency X → Y: the nulls mean that no information is available about those attributes. Secondly, the functional dependence of Y on X forces any two X-total tuples t, t' where t(X) = t'(X) to have the same information on all the attributes in Y. That is, for all A ∈ Y we have either t(A) = ni = t'(A) or ni ≠ t(A) = t'(A) ≠ ni.

Table 2.9: The SQL relation r_SQL over WORK

Next, Lien's axiomatization and algorithmic solution of the implication problem for the class of NFDs are revisited.

The transitivity rule for functional dependencies (FDs) is not sound for functional dependencies with nulls (NFDs), as Table 2.9 demonstrates: the SQL relation r_SQL satisfies the NFDs Emp → Dept and Dept → Mgr, yet it violates the NFD Emp → Mgr. The implication problem for NFDs therefore differs from that of FDs.

Theorem 2.6 (Lien, 1982) The set L, consisting of the reflexivity axiom, augmentation rule, decomposition rule, and union rule below, forms a finite axiomatization for the class of NFDs over SQL relations.

For the relation schema WORK, let the NFD set Σ consist of Emp → Dept and Dept → Mgr. Using the reflexivity axiom, we can derive the NFD Emp → Emp. Applying the augmentation rule to Dept → Mgr yields Emp, Dept → Mgr. Finally, applying the union rule to Emp → Dept and Emp → Emp allows us to infer Emp → Emp, Dept.

The notion of the closure \( X^*_\Sigma \) of an attribute set X with respect to an NFD set Σ is again central. As for FDs, an NFD set Σ implies an NFD X → Y if and only if \( Y \subseteq X^*_\Sigma \). However, computing the closure with respect to NFDs differs from the FD case: over SQL relations, the operation \( (\cdot)^*_\Sigma \) is not idempotent. For instance, for R = WORK, X = {Emp} and Σ = {Emp → Dept, Dept → Mgr}, we find \( X^*_\Sigma \) = {Emp, Dept}, but \( \{Emp, Dept\}^*_\Sigma \) = {Emp, Dept, Mgr}. Thus, \( X^*_\Sigma \neq (X^*_\Sigma)^*_\Sigma \).

Algorithm 2.4 computes the attribute set closure with respect to a set of NFDs. It can be used to decide the implication problem for an input Σ ∪ {X → Y} in time O(||Σ ∪ {X → Y}||).

Input: attribute set X, NFD set Σ over relation schema R

Output: attribute set closure \( X^*_\Sigma \) of X with respect to Σ

On input (X = {Emp}, Σ = {Emp → Dept, Dept → Mgr}, R = WORK), Algorithm 2.4 computes \( X^*_\Sigma \) = {Emp, Dept}. Indeed, the only NFD V → …
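Because transitivity is not available for NFDs, a single pass over Σ suffices to compute the closure: only NFDs whose left-hand side lies inside the original set X contribute. The following sketch illustrates this reading. It is not a transcription of Algorithm 2.4, the function names are assumptions, and it makes no attempt at the linear-time bookkeeping.

```python
def nfd_closure(attrs, nfds):
    """Single-pass closure: add right-hand sides of NFDs whose lhs lies in X itself."""
    x = frozenset(attrs)
    result = set(x)
    for lhs, rhs in nfds:
        if lhs <= x:  # compare against X, not against the growing result
            result |= rhs
    return frozenset(result)

def implies_nfd(nfds, lhs, rhs):
    """Sigma implies X -> Y iff Y is contained in the closure of X."""
    return frozenset(rhs) <= nfd_closure(lhs, nfds)

SIGMA = [(frozenset({"Emp"}), frozenset({"Dept"})),
         (frozenset({"Dept"}), frozenset({"Mgr"}))]

once = nfd_closure({"Emp"}, SIGMA)
twice = nfd_closure(once, SIGMA)
print(sorted(once), sorted(twice))
```

Applying the operation twice yields a strictly larger set on this example, which is the failure of idempotence noted in the text.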

FDs and NOT NULL constraints over SQL Relations

SQL allows attributes to be declared NOT NULL, which ensures that all SQL relations are total on these attributes. This gives data designers and engineers considerable flexibility in deciding which information is essential and which may be left incomplete.

Atzeni and Morfuni [28] studied the implication problem for the combined class of NFDs and NOT NULL constraints over subsumption-free SQL relations.

A null-free subschema (NFS) over the relation schema R is an expression nfs(R_s), where R_s is a subset of R. The NFS nfs(R_s) is satisfied by a partial relation r over R, denoted by ⊨_r nfs(R_s), if and only if r is R_s-total.

The SQL relation r_SQL of Table 2.9 satisfies the NFS nfs(Emp, Mgr) and violates the NFS nfs(Emp, Dept), for example.

To specify a null-free subschema for a relation schema, it suffices to collect the attributes that are declared NOT NULL. An NFS nfs(R_s) is implied by a set Σ = {nfs(R_s^1), …, nfs(R_s^n)} of NFSs if and only if R_s is a subset of the union of R_s^1 through R_s^n.

The NFS nfs(R_s) influences which NFDs are implied. Consider the relation schema R = WORK and the NFD set Σ = {Emp → Dept, Dept → Mgr}, and let ϕ denote the NFD Emp → Mgr. If R_s = {Emp, Mgr}, then Σ ∪ {nfs(R_s)} does not imply ϕ, as demonstrated by the subsumption-free SQL relation in Table 2.9. If R_s = {Dept}, then Σ ∪ {nfs(R_s)} ⊨ ϕ holds indeed.

The presence of an NFS nfs(R_s) subsumes both the case of total relations (R_s = R) and the case where null markers can occur on each attribute (R_s = ∅).

We adopt the following notational convention for NFDs and NFSs. A single NFS nfs(R_s) suffices for each relation schema R, since nfs(R'_s) is implied by nfs(R_s) whenever R'_s is a subset of R_s. We can therefore focus on the implication problem of NFDs in the presence of an NFS: given a relation schema R, an NFS nfs(R_s), and a set Σ ∪ {ϕ} of NFDs over R, decide whether Σ ∪ {nfs(R_s)} ⊨ ϕ holds. For brevity, we write Σ ⊨_{R_s} ϕ instead of Σ ∪ {nfs(R_s)} ⊨ ϕ.

Atzeni and Morfuni established an axiomatization for the implication problem of NFDs in the presence of an NFS [28].

Theorem 2.7 The following set A, consisting of the reflexivity axiom, union rule, decomposition rule, and null transitivity rule, forms a finite axiomatization for the class of NFDs and NOT NULL constraints over subsumption-free SQL relations.

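(reflexivity) XY → X

(union) from X → Y and X → Z infer X → YZ

(decomposition) from X → YZ infer X → Y

(null transitivity) from X → Y and Y → Z infer X → Z, provided Y − X ⊆ R_s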

The null transitivity rule allows the inference of the NFD X → Z from the NFDs X → Y and Y → Z, provided that all attributes in Y − X are declared NOT NULL, that is, belong to the NFS nfs(R_s). Note that the augmentation rule, which infers XZ → Y from X → Y, follows from the reflexivity axiom and the null transitivity rule [28].

For the relation schema WORK, consider the NFD set Σ consisting of Emp → Dept and Dept → Mgr, together with the NFS nfs(Dept). Applying the null transitivity rule, we can infer the NFD Emp → Mgr.

Atzeni and Morfuni also established a linear-time algorithm for deciding the implication problem for NFDs in the presence of an NFS [28]. As Beeri and Bernstein did for total relations [46], Atzeni and Morfuni utilized the notion of an attribute set closure \( X^*_{\Sigma,R_s} = \{A \in R \mid \Sigma \models_{R_s} X \rightarrow A\} \) of an attribute set X with respect to an NFD set Σ and an NFS nfs(R_s) over the relation schema R. An NFD X → Y over R is implied by Σ in the presence of nfs(R_s) if and only if \( Y \subseteq X^*_{\Sigma,R_s} \) holds. Algorithm 2.5 computes the attribute set closure \( X^*_{\Sigma,R_s} \) of X with respect to Σ and nfs(R_s) over R [28].

Input: attribute set X, NFS nfs(R_s), and NFD set Σ over R

Output: attribute set closure \( X^*_{\Sigma,R_s} \) of X with respect to Σ and nfs(R_s)

On input (X = {Emp}, Σ = {Emp → Dept, Dept → Mgr}, R_s = {Dept}, R = WORK), Algorithm 2.5 computes \( X^*_{\Sigma,R_s} \) = {Emp, Dept, Mgr}.
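The effect of the null-free subschema on the closure can be sketched as a fixpoint computation in which an NFD V → W may only fire if V lies in the current closure and every attribute of V outside X is NOT NULL. This mirrors the side condition Y − X ⊆ R_s of the null transitivity rule. The code is a hedged illustration with assumed names, not a transcription of Algorithm 2.5, and it is not linear-time.

```python
def nfd_closure_with_nfs(attrs, nfds, null_free):
    """Closure of X under NFDs in the presence of nfs(null_free)."""
    x = frozenset(attrs)
    allowed = x | frozenset(null_free)  # attributes usable on a left-hand side
    result = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in nfds:
            if lhs <= (result & allowed) and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

SIGMA = [(frozenset({"Emp"}), frozenset({"Dept"})),
         (frozenset({"Dept"}), frozenset({"Mgr"}))]

for rs in [set(), {"Dept"}, {"Emp", "Mgr"}, {"Emp", "Dept", "Mgr"}]:
    print(sorted(rs), sorted(nfd_closure_with_nfs({"Emp"}, SIGMA, rs)))
```

With an empty null-free subschema the result coincides with the single-pass NFD closure, and with R_s = R it coincides with the classical FD closure, matching the two extreme cases that an NFS subsumes.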

FDs over Codd Relations

Levene and Loizou introduced and axiomatized the classes of weak and strong functional dependencies with respect to a possible world semantics [49].

A weak functional dependency (WFD) over a relation schema R is an expression ♦(X → Y) where X, Y ⊆ R. A Codd relation r over R satisfies the WFD ♦(X → Y) if and only if there is a possible world p ∈ Poss(r) such that for all tuples t and t' in p, t(X) = t'(X) implies t(Y) = t'(Y).

A strong functional dependency (SFD) over a relation schema R is an expression □(X → Y), where X and Y are subsets of R. A Codd relation r over R satisfies the SFD □(X → Y) if and only if, for every possible world p ∈ Poss(r) and for all tuples t and t' in p, t(X) = t'(X) implies t(Y) = t'(Y).

Both WFDs and SFDs occur in the real world. As an illustration, consider the Codd relation r_Codd of Table 2.10 over WORK. The relation r_Codd satisfies:

• □(Emp → Dept), since every substitution of the unk occurrence in the Emp column results in a relation that satisfies the FD Emp → Dept.

• ♦(Dept → Mgr), since there is a substitution of the unk occurrence in the Mgr column such that the FD Dept → Mgr is satisfied by the resulting possible world.

Weak functional dependencies (WFDs) and strong functional dependencies (SFDs) have complementary properties. SFDs can be maintained efficiently, since updating an occurrence of unk to a non-null value cannot violate the corresponding functional dependency. WFDs, in contrast, tolerate a greater degree of uncertainty in the database, as they do not require the functional dependency to hold in every possible world.
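Over small finite domains, weak and strong satisfaction can be checked by enumerating possible worlds. The sketch below is an illustration only: unk is modelled by `None`, the domains are hypothetical, and the Codd relation `r` is an invented example with one unk in the Emp column and one in the Mgr column (it is not claimed to be Table 2.10).

```python
from itertools import combinations, product

UNK = None  # stands for the null marker unk

def possible_worlds(rows, attrs, domains):
    """Possible worlds as frozensets of total tuples (values in attrs order)."""
    per_row = []
    for t in rows:
        choices = [domains[a] if t[a] is UNK else [t[a]] for a in attrs]
        per_row.append(list(product(*choices)))
    worlds = set()
    for picked in product(*per_row):
        world = frozenset(picked)
        if len(world) == len(rows):  # keep the substitution one-to-one
            worlds.add(world)
    return worlds

def fd_holds(world, attrs, lhs, rhs):
    """Classical FD satisfaction in a total relation."""
    pos = {a: i for i, a in enumerate(attrs)}
    return all(any(t[pos[a]] != u[pos[a]] for a in lhs)
               or all(t[pos[a]] == u[pos[a]] for a in rhs)
               for t, u in combinations(world, 2))

def weakly_satisfies(rows, attrs, domains, lhs, rhs):
    return any(fd_holds(w, attrs, lhs, rhs)
               for w in possible_worlds(rows, attrs, domains))

def strongly_satisfies(rows, attrs, domains, lhs, rhs):
    return all(fd_holds(w, attrs, lhs, rhs)
               for w in possible_worlds(rows, attrs, domains))

ATTRS = ("Emp", "Dept", "Mgr")
# Hypothetical Codd relation and domains, for illustration only.
r = [
    {"Emp": UNK, "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Alice", "Dept": "Information Systems", "Mgr": UNK},
]
DOMAINS = {"Emp": ["Alice", "Dilbert"], "Mgr": ["Gates", "Jobs"]}

print(strongly_satisfies(r, ATTRS, DOMAINS, ["Emp"], ["Dept"]))
print(weakly_satisfies(r, ATTRS, DOMAINS, ["Dept"], ["Mgr"]))
print(strongly_satisfies(r, ATTRS, DOMAINS, ["Dept"], ["Mgr"]))
```

In this example Dept → Mgr holds only in the world where the unknown manager is Gates, so it is weakly but not strongly satisfied.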

Implication Problem and Armstrong Relations

The axiomatization of SFDs is given by the Armstrong axioms in Theorem 2.2, while WFDs have the same axiomatization as the NFDs of Lien. Algorithmic solutions to the implication problems for the classes of SFDs and NFDs have therefore been established (cf. Theorem 2.7). Levene and Loizou provided an axiomatization for the combined class of WFDs and SFDs over Codd relations, which is not covered in this thesis. They also showed that this combined class enjoys Armstrong relations. Further research could improve the understanding of the combinatorial properties of these Armstrong relations, which are typically of exponential size, and develop algorithms for generating them. This thesis concentrates on WFDs and leaves SFDs for future work.

Uniqueness Constraints and Keys

This section is devoted to a brief review of previous research work on keys and uniqueness constraints in the presence of null markers.

Codd's principle of entity integrity asserts that all attributes of a primary key must be NOT NULL, so that no key attribute can have an undefined or unknown value. This fundamental principle guarantees that each tuple can be identified, as undefined values would obscure which entity the tuple represents.

Thalheim explores alternatives to Codd's principle of entity integrity by introducing the notion of a key set and examining its combinatorial properties. A partial relation satisfies a key set if, for every pair of distinct tuples, there is a key in the set on which the two tuples are total and distinct.

Table 2.11 shows a partial relation that satisfies the key set {{Emp}, {Dept}, {Mgr}}. However, it violates the key sets {{Emp}, {Dept}}, {{Emp}, {Mgr}}, and {{Dept}, {Mgr}}.

Candidate keys that adhere to Codd's principle of entity integrity have been proposed and analyzed in previous studies. The attributes of a candidate key, which is a potential primary key, must be declared NOT NULL. Consequently, an SQL relation r is defined to satisfy a Codd key Codd(X) if and only if r is X-total and no two distinct tuples t and t' in r have identical values on all attributes in X.

The axiomatization of keys over total relations, as presented in Theorem 2.1, does not carry over to Codd keys, as neither the relation axiom nor the superkey rule is sound for Codd keys. For instance, the SQL relation in Table 2.9 satisfies the Codd key Codd(Emp, Mgr) but violates the Codd key Codd(Emp, Dept, Mgr).

Hartmann et al. [107] established the following simple axiomatization for Codd keys over SQL relations.

Theorem 2.8 (Hartmann et al., 2010) The following inference rule forms a finite axiomatization for the implication of Codd keys over SQL relations.

For example, consider the set Σ comprising the two Codd keys Codd(Emp) and Codd(Dept, Mgr). An application of the key extension rule allows us to infer Codd(Emp, Mgr).

The implication problem of Codd keys can be decided in time linear in the size of the input. Specifically, for a set Σ ∪ {Codd(K)} of Codd keys over a relation schema R, Algorithm 2.6 decides in O(||Σ ∪ {Codd(K)}||) time whether Σ ⊨ Codd(K) holds.

Input: set Σ ∪ {Codd(K)} of Codd keys over relation schema R

For the relation schema WORK, consider the set Σ = {Codd(Emp), Codd(Dept, Mgr)} of Codd keys. Algorithm 2.6 outputs NO on input (Codd(Dept), Σ, R) and YES on input (Codd(Emp, Mgr), Σ, R).
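A decision procedure consistent with these examples can be read off the definition of Codd keys: Codd(K) follows from Σ exactly when some key of Σ is contained in K (which gives uniqueness) and every attribute of K occurs in some key of Σ (which gives totality). The sketch below implements this semantic characterization under that assumption. It is not a transcription of Algorithm 2.6, and the function name is invented.

```python
def implies_codd_key(keys, k):
    """Decide whether the Codd keys in `keys` imply Codd(k)."""
    k = frozenset(k)
    # Attributes forced to be NOT NULL: those occurring in any given key.
    not_null = frozenset().union(*keys)
    has_unique_part = any(frozenset(x) <= k for x in keys)
    return k <= not_null and has_unique_part

SIGMA = [{"Emp"}, {"Dept", "Mgr"}]

print("YES" if implies_codd_key(SIGMA, {"Dept"}) else "NO")
print("YES" if implies_codd_key(SIGMA, {"Emp", "Mgr"}) else "NO")
```

The check makes one pass over Σ to collect the NOT NULL attributes and one pass to look for a contained key, so it stays linear in the size of the input when sets are represented suitably.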

Uniqueness Constraints over SQL Relations

Surprisingly, the work on uniqueness constraints and keys over partial database instances has received little attention in the research literature.

We provide some reason for why this might be the case, starting with SQL relations Here, one could define a uniqueness constraint as follows.

A uniqueness constraint with nulls (NUC) over a relation schema R is an expression u(X), where X is a subset of R. An SQL relation r satisfies the NUC u(X) if, for all tuples t and t' in r, the following holds: if t(X) = t'(X) and both tuples are X-total, then t = t'.

Research on data dependencies typically assumes that database instances are free of duplicates, meaning that no two distinct tuples have identical values on all attributes. Under this assumption, an SQL relation r over a relation schema R satisfies the NUC u(X) if and only if it satisfies the NFD X → R. Consequently, uniqueness constraints are special cases of functional dependencies, just like their counterparts in the relational model of data.

In practice, eliminating duplicates can be costly and may not always be desirable. SQL, the industry standard, permits duplicate tuples by default and returns them in query results unless the DISTINCT keyword is used. It is therefore of interest to study multisets of tuples rather than just sets. In what follows, an SQL table is a multiset of tuples that may contain occurrences of null markers.

Recent findings show that over SQL tables, satisfaction of the NUC u(X) over schema R is no longer equivalent to satisfaction of the NFD X → R. Specifically, there are SQL tables that satisfy the NFD X → R while violating the NUC u(X).

Table 2.12: SQL table that violates u(Emp) and satisfies Emp → Dept, Mgr

As an example, consider the SQL table in Table 2.12 over the relation schema WORK. It satisfies the NFD Emp → WORK, but violates the NUC u(Emp).
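The gap between NUCs and NFDs over multisets shows up as soon as a table contains a duplicate row. In the sketch below an SQL table is a Python list, so duplicates are preserved, ni is modelled by `None`, and the function names are assumptions. The table `t_dup` is a hypothetical two-row example and is not claimed to be the content of Table 2.12.

```python
from itertools import combinations

NI = None  # stands for the null marker ni

def is_total(t, attrs):
    return all(t[a] is not NI for a in attrs)

def satisfies_nfd(table, lhs, rhs):
    """NFD X -> Y: X-total occurrences that agree on X agree on Y."""
    for t, u in combinations(table, 2):
        if (is_total(t, lhs) and is_total(u, lhs)
                and all(t[a] == u[a] for a in lhs)
                and any(t[a] != u[a] for a in rhs)):
            return False
    return True

def satisfies_nuc(table, x):
    """NUC u(X): no two distinct occurrences are X-total and agree on X."""
    for t, u in combinations(table, 2):  # pairs of distinct row occurrences
        if (is_total(t, x) and is_total(u, x)
                and all(t[a] == u[a] for a in x)):
            return False
    return True

WORK = ["Emp", "Dept", "Mgr"]
# Hypothetical SQL table with a duplicate row, for illustration only.
t_dup = [
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
    {"Emp": "Dilbert", "Dept": "Information Systems", "Mgr": "Gates"},
]

print(satisfies_nfd(t_dup, ["Emp"], WORK))
print(satisfies_nuc(t_dup, ["Emp"]))
```

The two occurrences agree on every attribute, so no NFD can tell them apart, while the uniqueness constraint counts them as two rows with the same Emp value.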

These explanations and example warrant a study of the implication problem for at least the combined class of NUCs and NFDs, preferably in the presence of an NFS.

Uniqueness Constraints over Codd Relations

Uniqueness constraints can be defined similarly over Codd relations. We will focus here on the weak approach.

A weak uniqueness constraint (WUC) over relation schema R is an expression ♦u(X) where X ⊆ R. A Codd relation r satisfies the WUC ♦u(X) if and only if there is a possible world p ∈ Poss(r) that satisfies the key key(X).

A Codd relation r over a relation schema R satisfies the WUC ♦u(X) if and only if it satisfies the WFD ♦(X → R). Possible worlds can also be defined for multisets of tuples. We use the term table for a multiset of total tuples, and the term Codd table for a multiset r of tuples that may contain occurrences of the null marker unk. The possible worlds of a Codd table r over R are given by

Poss(r) := {r' | r' is a table over R and there is a bijection b : r → r' such that for all t ∈ r, t is subsumed by b(t) and b(t) is R-total}.

As for NUCs and NFDs, there are Codd tables over R that satisfy the WFD ♦(X → R) but violate the WUC ♦u(X).

Table 2.13: Codd table that violates ♦u(Emp) and satisfies ♦(Emp → Dept, Mgr)

As an example, consider the Codd table in Table 2.13 over the relation schema WORK. It satisfies the WFD ♦(Emp → WORK), but violates the WUC ♦u(Emp). This warrants a study of the implication problem for the combined class of WUCs and WFDs, preferably in the presence of a null-free subschema (NFS).

SQL Armstrong Tables

NFDs Do Not Enjoy SQL Armstrong Tables

In the relational model, functional dependencies enjoy Armstrong relations, even when non-standard functional dependencies are present, that is, functional dependencies of the form ∅ → A whose left-hand side is the empty attribute set.

In [108] it was shown that the class of NFDs does not enjoy Armstrong tables in the presence of a null-free subschema. For the relation schema WORK, the NFD set Σ = {∅ → Emp; Emp → Dept}, and the NFS nfs(Dept, Mgr) there is no Armstrong table [108].

Theorem 2.9 The class of functional dependencies with nulls in the presence of a null-free subschema does not enjoy Armstrong tables.

Non-standard functional dependencies rarely occur in practice, and the class of standard functional dependencies with nulls does enjoy Armstrong tables in the presence of a null-free subschema. This section reviews the structural and computational properties of Armstrong tables for the combined class of NUCs, NFDs, and null-free subschemata over SQL tables. From here on, we assume that sets of NUCs and NFDs over SQL tables contain neither non-standard NFDs nor the non-standard NUC u(∅).

Structural Properties of SQL Armstrong Tables

To characterize SQL Armstrong tables structurally, we first generalize the notions of maximal sets and agree sets. In addition, the possibility of duplicate tuples requires a new concept.

Let Σ be a set of NFDs and nfs($R_s$) an NFS over relation schema R. For an attribute C ∈ R, the maximal sets $\max_{\Sigma,R_s}(C)$ of C with respect to Σ and nfs($R_s$) consist of all non-empty subsets X ⊆ R that are maximal, with respect to set inclusion, among the attribute sets for which Σ does not imply the NFD X → C in the presence of nfs($R_s$).

The maximal sets of R with respect to Σ and nfs($R_s$) are defined as

$$\max_{\Sigma,R_s}(R) = \bigcup_{C \in R} \max_{\Sigma,R_s}(C).$$

If Σ and $R_s$ are clear from the context we may simply write max(C) and max(R), respectively.

Thus, the maximal sets of an attribute C with respect to Σ and nfs($R_s$) are the maximal attribute subsets of R that do not functionally determine C. Note that the empty set ∅ is not considered as maximal due to the exclusion of non-standard functional dependencies ∅ → C from NFD sets. Therefore, an Armstrong table does not need to violate any non-standard NFDs.

Consider the relation schema WORK = {Emp, Dept, Mgr} with the NFS nfs($\mathrm{WORK}_s$), where $\mathrm{WORK}_s$ = {Emp, Mgr}, and let Σ consist of the two NFDs Emp → Dept and Dept → Mgr. Then:

• $\max_{\Sigma,\mathrm{WORK}_s}(\mathrm{Emp})$ = {{Dept, Mgr}},

• $\max_{\Sigma,\mathrm{WORK}_s}(\mathrm{Dept})$ = {{Mgr}}, and

Specifically, Σ does not imply the NFD Emp → Mgr in the presence of nfs($\mathrm{WORK}_s$).
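The maximal sets of this example can be recomputed by brute force. The following is a minimal sketch, not the algorithm of [108]: it assumes that implication of NFDs in the presence of an NFS can be decided by a closure in which an NFD may only fire when its left-hand side lies within the original set X or within null-free attributes that have already been derived. The function names `closure` and `maximal_sets` are hypothetical.

```python
from itertools import combinations

def closure(X, sigma, Rs):
    """Attributes determined by X under the NFDs in sigma, given null-free attributes Rs.

    Assumption: an NFD lhs -> rhs may fire only if lhs is contained in
    X together with the null-free attributes derived so far.
    """
    X = frozenset(X)
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            if lhs <= X | (result & Rs) and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def maximal_sets(R, sigma, Rs, C):
    """Non-empty subsets of R that are maximal among those not determining C."""
    candidates = [frozenset(c)
                  for k in range(1, len(R) + 1)
                  for c in combinations(sorted(R), k)
                  if C not in closure(c, sigma, Rs)]
    return {X for X in candidates if not any(X < Y for Y in candidates)}

# Running example: WORK = {Emp, Dept, Mgr}, WORK_s = {Emp, Mgr}
R = frozenset({"Emp", "Dept", "Mgr"})
Rs = frozenset({"Emp", "Mgr"})
sigma = [(frozenset({"Emp"}), frozenset({"Dept"})),
         (frozenset({"Dept"}), frozenset({"Mgr"}))]

for C in sorted(R):
    print(C, [sorted(X) for X in maximal_sets(R, sigma, Rs, C)])
```

Under this assumption the sketch reproduces the two maximal set families listed above. Because Dept is not null-free, Emp → Dept and Dept → Mgr cannot be chained, so Mgr is not in the closure of {Emp}.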

According to Equation (2.1) of Theorem 2.4, a crucial requirement for an Armstrong relation is that each maximal set is realized by distinct rows of the relation whose agree set is that maximal set. For tables, the analogous requirement, stated in terms of strong agree sets, is to guarantee that all the NFDs not implied by the set of NUCs and NFDs in the presence of an NFS can be violated. Over tables, however, it is still possible that there are NUCs u(X) not implied by Σ in the presence of nfs($R_s$) over R, even if the NFD X → R is implied. For this reason, it is required for the computed Armstrong table that for all attribute sets X that are maximal with this property, there must be distinct rows in the table whose strong agree set is X. This motivates the following definition [108].

Let Σ be a set of standard NUCs and NFDs and let nfs($R_s$) be an NFS over relation schema R. We define the duplicate sets $\mathrm{dup}_{\Sigma,R_s}(R)$ of R with respect to Σ and nfs($R_s$) as follows:

$$\mathrm{dup}_{\Sigma,R_s}(R) := \{X \subseteq R \mid \Sigma \models_{R_s} X \to R \,\wedge\, \Sigma \not\models_{R_s} u(X) \,\wedge\, X \text{ is maximal with this property}\}.$$

If Σ and nfs($R_s$) are clear from the context we may simply write dup(R).

Consider again the relation schema WORK = {Emp, Dept, Mgr} with the NFS nfs($\mathrm{WORK}_s$), where $\mathrm{WORK}_s$ = {Emp, Mgr}, and the set Σ consisting of the two NFDs Emp → Dept and Dept → Mgr.

We obtain $\mathrm{dup}_{\Sigma,\mathrm{WORK}_s}(\mathrm{WORK})$ = {{Emp, Dept, Mgr}}.

Since Σ consists of NFDs only, it does not imply any NUC. Consequently, an Armstrong table for Σ must violate every NUC. This is achieved by duplicate tuples that carry identical non-null values on all attributes in {Emp, Dept, Mgr}.

Strong and weak agree sets

Equation (2.1) of Theorem 2.4 shows that the notion of agree sets is useful in characterizing Armstrong relations. For tables, which can feature occurrences of null markers, this notion requires further refinement. Following previous work, we distinguish strong and weak notions of agree sets.

Table 2.14: SQL table r over WORK

Let R be a relation schema, r be a table over R, and t_1, t_2 be two tuples of r. The agree set of two tuples t_1, t_2 is defined as ag(t_1, t_2) := {(X, Y) | ∀A ∈ R(

The agree set of r is defined as $ag(r) = \{ag(t_1, t_2) \mid t_1, t_2 \in r \wedge t_1 \neq t_2\}$. The strong agree set of r is defined as $ag^s(r) = \{X \mid (X, Y) \in ag(r)\}$, and the weak agree set of r is $ag^w(r) = \{Y \mid (X, Y) \in ag(r)\}$. Finally, for $X \in ag^s(r)$, we have $w(X) = \bigcap \{Y \mid (X, Y) \in ag(r)\}$.

As an example, consider the SQL table r over WORK in Table 2.14. Using the first letters of Emp, Dept, and Mgr, we have:

• ag(r) = {(DM,DM),(M,M),(E,ED),(EDM,EDM),(∅,∅)}

• $ag^w(r)$ = {DM, M, ED, EDM, ∅}, and
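To make the agree set notions concrete, here is a minimal sketch on a small hypothetical table over WORK (it is not Table 2.14). It assumes the usual reading of the two notions: two tuples strongly agree on an attribute if they carry the same non-null value there, and weakly agree if they carry the same value or at least one of them carries the null marker. The names `agree_pair`, `agree_sets`, and `w` are illustrative.

```python
from itertools import combinations

NI = None  # stand-in for the null marker

def agree_pair(t1, t2, R):
    """Return (strong, weak) agreement of two rows, under the assumed reading above."""
    strong = frozenset(A for A in R if t1[A] is not NI and t1[A] == t2[A])
    weak = frozenset(A for A in R
                     if t1[A] is NI or t2[A] is NI or t1[A] == t2[A])
    return strong, weak

def agree_sets(rows, R):
    """ag(r): agreement pairs over all pairs of distinct row occurrences."""
    return {agree_pair(t1, t2, R) for t1, t2 in combinations(rows, 2)}

def w(X, ag):
    """Intersection of all weak components paired with the strong component X."""
    return frozenset.intersection(*[Y for (X2, Y) in ag if X2 == X])

R = ("Emp", "Dept", "Mgr")
rows = [  # hypothetical data, chosen only for illustration
    {"Emp": 1, "Dept": "A", "Mgr": "x"},
    {"Emp": 1, "Dept": NI,  "Mgr": "y"},
    {"Emp": 2, "Dept": "A", "Mgr": "x"},
]

ag = agree_sets(rows, R)
strong_sets = {X for X, _ in ag}
weak_sets = {Y for _, Y in ag}
```

For this table, the first two rows contribute the pair (E, ED): they strongly agree on Emp only, and weakly agree on Dept because one of them is null there. Rows are paired by position, so duplicate rows in a multiset are treated as distinct occurrences.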

Structural Characterization of SQL Armstrong Tables

As shown in Theorem 2.10 [108], the notions of maximal sets, duplicate sets, strong and weak agree sets allow us to characterize Armstrong tables structurally.

Theorem 2.10 Let R be a relation schema, Σ a set of standard NUCs and NFDs, and nfs($R_s$) an NFS over R. For all tables r over R it holds that r is an Armstrong table for Σ and nfs($R_s$) if and only if all of the following conditions are satisfied:

In Theorem 2.10, the first and third conditions guarantee that all NFDs and NUCs not implied by Σ in the presence of nfs($R_s$) are violated by r. The second and fourth conditions ensure that all NFDs and NUCs in Σ are satisfied by r. Finally, the last condition ensures that r is total on the null-free subschema, and on no attribute outside it.

Computational Properties

Theorem 2.10 has been exploited to devise an algorithm for computing tables that are Armstrong for any given set of standard NUCs and NFDs on any given relation schema [108]. Given the input, the idea is to compute the maximal set families and duplicate sets first, and then to construct a table whose agree sets realize the maximal and duplicate sets. The algorithms for these computations are summarized in the following.

Computing the maximal set families

The computation of the maximal set families for a relation schema R, an NFD set Σ, and an NFS nfs($R_s$) over R generalizes Algorithm 2.2, which computes the maximal set families for a given FD set.

To compute $\mathrm{dup}_{\Sigma,R_s}(R)$ we generate the hypergraph H = (V, E) with vertex set V = R and the set

$$E = \{K - R_s \mid u(K) \in \Sigma\}$$

of hyperedges. Then

$$\mathrm{dup}_{\Sigma,R_s}(R) = \{R - X \mid X \in Tr(H) \,\wedge\, \forall M \in \max_{\Sigma,R_s}(R)\,(R - X \not\subseteq M)\},$$

where Tr(H) denotes the set of minimal transversals of the hypergraph H.

As an example, recall the relation schema WORK and the NFS nfs($\mathrm{WORK}_s$) with $\mathrm{WORK}_s$ = {Emp, Mgr}, and let Σ consist of the two NFDs Emp → Dept and Dept → Mgr. Since Σ contains no NUC, the hyperedge set E of H is empty, and the only minimal transversal of H is the empty set X = ∅. Its complement {Emp, Dept, Mgr} is not contained in any of the maximal sets with respect to Σ and nfs($\mathrm{WORK}_s$). Therefore, $\mathrm{dup}_{\Sigma,\mathrm{WORK}_s}(\mathrm{WORK})$ = {{Emp, Dept, Mgr}}.
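The transversal-based computation of the duplicate sets can be sketched as follows. This is a brute-force illustration for small schemata, not the procedure of [108]: minimal transversals are found by enumerating all vertex subsets, and the maximal sets are passed in as input. The function names are hypothetical, and the value used for max(Mgr) is an assumption that is not stated in the text.

```python
from itertools import combinations

def minimal_transversals(V, edges):
    """All inclusion-minimal subsets of V that intersect every hyperedge."""
    hitting = [frozenset(c)
               for k in range(len(V) + 1)
               for c in combinations(sorted(V), k)
               if all(frozenset(c) & e for e in edges)]
    return {T for T in hitting if not any(S < T for S in hitting)}

def duplicate_sets(R, nucs, Rs, max_sets):
    """dup(R) = { R - X : X in Tr(H), R - X not contained in any maximal set }."""
    edges = [frozenset(K) - Rs for K in nucs]  # E = { K - R_s | u(K) in Sigma }
    return {R - X
            for X in minimal_transversals(R, edges)
            if not any(R - X <= M for M in max_sets)}

R = frozenset({"Emp", "Dept", "Mgr"})
Rs = frozenset({"Emp", "Mgr"})
max_sets = [
    frozenset({"Dept", "Mgr"}),  # max(Emp), as listed in the example
    frozenset({"Mgr"}),          # max(Dept), as listed in the example
    frozenset({"Emp"}),          # max(Mgr): assumed value, not stated in the text
]

# Sigma contains no NUC, so the hyperedge set is empty.
dup = duplicate_sets(R, [], Rs, max_sets)
print([sorted(X) for X in dup])
```

With an empty hyperedge set every vertex subset is a transversal, so the empty set is the only minimal one and the sketch returns {{Emp, Dept, Mgr}}, in line with the example. If some NUC u(K) has K ⊆ $R_s$, the corresponding hyperedge is empty, no transversal exists, and the sketch returns no duplicate sets.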

Algorithm 2.8 computes an Armstrong table for any given set Σ of NUCs and NFDs and NFS nfs($R_s$) over any given relation schema R [108, 110]. The idea of the algorithm is to compute the maximal sets for all attributes with respect to the standard NFD set Σ[NFD] = {X → R | u(X) ∈ Σ} ∪ {X → Y | X → Y ∈ Σ} in lines (2-4), and to compute the duplicate sets in line (5). Subsequently, all these sets are realized in lines (8-14), and null marker occurrences are added for attributes that do not belong to the NFS in lines (15-22).

Input: relation schema R, set Σ of standard NFDs, and NFS nfs($R_s$) over R

Output: maximal sets $\max_{\Sigma,R_s}(C)$ for all C ∈ R

7: for all C ∈ R where (C = A or A ∈ $R_s$) do

Table 2.15: Armstrong table over WORK (columns Emp, Dept, Mgr)

Specifically, when X ∈ $\mathrm{dup}_{\Sigma,R_s}(R)$, then i) X ∉ $\max_{\Sigma[NFD],R_s}(R)$, ii) Z = {A ∈ R | X ∈ $\max_{\Sigma[NFD],R_s}(A)$} = ∅, and iii) $R_s$ ⊆ X. Therefore, lines (9-11) will add to the table tuples that strongly agree on the elements of $\mathrm{dup}_{\Sigma,R_s}(R)$.

In our running example, we have the relation schema WORK = {Emp, Dept, Mgr}, the NFS nfs($\mathrm{WORK}_s$) with $\mathrm{WORK}_s$ = {Emp, Mgr}, and the NFD set Σ consisting of Emp → Dept and Dept → Mgr. The output is the Armstrong table shown in Table 2.15, and suitable substitutions yield the SQL table shown in Table 2.14.

The general worst-case exponential time complexity and rather conservative use of time carry over from Algorithm 2.3 that computes Armstrong relations [108, 110].

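Input: relation schema R, set Σ of standard NUCs and NFDs, and NFS nfs($R_s$) over R

Output: Armstrong table r for Σ and nfs($R_s$)

8: for all X ∈ $\max_{\Sigma[NFD],R_s}(R) \cup \mathrm{dup}_{\Sigma[NFD],R_s}(R)$ do

$$\begin{cases} c_{A,i}, & \text{if } A \in X \cup Z \cup R_s \\ \text{ni}, & \text{else} \end{cases} \qquad\text{and}\qquad \begin{cases} c_{A,i}, & \text{if } A \in X \\ c_{A,i+1}, & \text{if } A \in Z \cup (R_s - X) \\ \text{ni}, & \text{else} \end{cases}$$

21: end if

Theorem 2.11 Let R be a relation schema, Σ be a set of standard NUCs and NFDs and nfs($R_s$) an NFS over R. Let r be a minimum-sized Armstrong table for Σ and nfs($R_s$). Then $\sqrt{1 + 8 \cdot |\max_{\Sigma[NFD],R_s}(R) \cup \mathrm{dup}_{\Sigma,R_s}(R)|}$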

For input (R, Σ, $R_s$), Algorithm 2.8 computes an Armstrong table for Σ and nfs($R_s$) whose size is at most quadratic in the size of a minimum-sized Armstrong table for Σ and nfs($R_s$).
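The step in which Algorithm 2.8 realizes a maximal or duplicate set X by a pair of rows can be illustrated as follows. This is a sketch of that single step only, under two assumptions: Z is the set of attributes A for which X is a maximal set, and the case distinctions of the algorithm are read as "X ∪ Z ∪ $R_s$" for the first row and "X", "Z ∪ ($R_s$ − X)" for the second. The function name `realize_pair` and the encoding of the constant $c_{A,i}$ as the pair `(A, i)` are illustrative choices.

```python
NI = None  # stand-in for the null marker ni

def realize_pair(X, Z, Rs, R, i):
    """Return the two rows that realize the attribute set X in round i."""
    first, second = {}, {}
    for A in R:
        # First row: a constant on X, Z and all null-free attributes, ni elsewhere.
        first[A] = (A, i) if A in X | Z | Rs else NI
        # Second row: the same constant on X, a different one on Z and on Rs - X.
        if A in X:
            second[A] = (A, i)
        elif A in Z | (Rs - X):
            second[A] = (A, i + 1)
        else:
            second[A] = NI
    return first, second

R = ["Emp", "Dept", "Mgr"]
Rs = {"Emp", "Mgr"}

# X = {Emp} with Z = {Mgr}: assumes that {Emp} is a maximal set for Mgr.
first, second = realize_pair({"Emp"}, {"Mgr"}, Rs, R, 1)

# A duplicate set: Z is empty and Rs is contained in X, so both rows coincide.
dup_first, dup_second = realize_pair(set(R), set(), Rs, R, 3)
```

In the first call the two rows share a non-null value on Emp, carry the null marker on Dept, and differ on Mgr, which corresponds to the pair (E, ED) in ag(r) above. In the second call the two rows are identical, which is how duplicate sets give rise to duplicate tuples.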

There are standard NFD sets Σ of size O(n) whose minimum-sized Armstrong tables have size O(2^n), and there are Armstrong tables with O(n) tuples whose smallest representing constraint sets Σ have size O(2^n). It is therefore beneficial to use both representations: the abstract constraint set Σ and an Armstrong table for it. Constraint sets help identify constraints that are incorrectly perceived as meaningful, while Armstrong tables help identify constraints that are incorrectly perceived as meaningless. This thesis aims to provide evidence for this intuition.

Further Remarks

Similar results have recently been established for the combined class of cardinality constraints with upper bounds and NFDs in the presence of an NFS. However, this class is outside the scope of this thesis.

The structural and computational properties of Armstrong tables for the individual classes of NUCs and NFDs in the presence of an NFS have been studied in [108, 111].

Summary and Research Gap

Logical Characterization

Computational Properties

Complexity Considerations

Design

Implementation Details

Use Case Example

Quality measures

Example

Quantitative data analysis

Qualitative data analysis
