Best probability queries on probabilistic databases



BEST PROBABILITY QUERIES ON PROBABILISTIC DATABASES

Trieu Minh Nhut Le

A thesis submitted in total fulfilment

of the requirements for the degree of

Doctor of Philosophy

School of Engineering and Mathematical Sciences

Faculty of Science, Technology and Engineering

La Trobe University

Bundoora, Victoria 3086

Australia

January 2014


Second, I would like to thank Dr Zhen He at La Trobe University for providing insightful ideas, guidance, and comments on my research. I have been fortunate to collaborate with him on various work and have learnt precious skills in research paper writing. He has treated me more like a friend than a student and has always offered sound advice whenever I needed it.

Third, I would like to thank my co-supervisor Prof Wenny Rahayu for her role as a member of my Research Committee and for providing helpful feedback at every stage of my Ph.D. I thank Ms Michele Mooney for her careful proofreading of my research papers and the final draft of this thesis.

Last but not least, I would like to express my gratitude and love to my family for always being there when I needed them most, and for supporting me throughout my life. Especially, I would like to thank my mum, Tiet Thi Phan, for her continuous love, care, and support. I would like to thank my lover, Tuyen Mong Do, for her love, care and encouragement.


This thesis focuses on answering probabilistic top-k and skyline queries on probabilistic data using the possible worlds semantics model. These are two of the most important queries for decision support systems. Almost all other existing methods for answering queries on probabilistic data require the user to set a probability threshold. However, it is difficult to set a threshold because if it is set too high, important results may be lost, but if it is set too low, a lot of low quality results may be returned.

In this thesis, novel approaches for answering probabilistic top-k and skyline queries are proposed using the dominance principle as natural and effective methods to select results of queries with an acceptable number of answers, ensuring all important answers are captured without the need to set a threshold.

There are three challenges to answering both probabilistic top-k and skyline queries. The first challenge is to develop novel probabilistic top-k and skyline queries using the dominance principle to return only the most interesting results. The second challenge is to develop formulas based on probabilistic theory to directly calculate the probabilities of the results without considering any possible worlds and to also develop algorithms to effectively prune the search space. The third challenge is to ensure that all the semantic properties of the probabilistic queries are covered.

The evaluations of the performance of the proposed approaches show that, firstly, the results of the queries are not only very reasonable in size but also capture all the important answers. Secondly, the proposed algorithms outperform the current algorithms by accelerating the pruning of the search space, thereby reducing execution time. Lastly, all the semantic properties of probabilistic queries are covered.


Statement of Authorship

Except where reference is made in the text of this thesis, this thesis contains no material published elsewhere or extracted in whole or in part from a thesis submitted for the award of any other degree or diploma.

No other person’s work has been used without due acknowledgment in the main text of the thesis.

This thesis has not been submitted for the award of any degree or diploma in any other tertiary institution.


External Refereed Publications

Trieu Minh Nhut Le, Jinli Cao, Top-k best probability queries on probabilistic data, Proceedings of the 17th International Conference on Database Systems for Advanced Applications, Volume Part II, DASFAA’12, Springer-Verlag, Berlin, Heidelberg, 2012, pp 116.

Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Top-k best probability queries and semantics ranking properties on probabilistic databases, Data & Knowledge Engineering, 2013.

Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Answering skyline queries on probabilistic data using the dominance of probabilistic skyline tuples, The ACM Transactions on Database Journal, under review, August 2013.


1 Introduction
1.1 Uncertain data
1.2 Querying probabilistic data
1.2.1 Answering probabilistic top-k queries
1.2.2 Answering probabilistic skyline queries
1.3 Contributions of this thesis
1.3.1 Contribution on answering probabilistic top-k queries
1.3.2 Contribution on answering probabilistic skyline queries
1.4 Thesis organization

2 Background
2.1 Probabilistic database models
2.1.1 The uncertain object model
2.1.2 The possible worlds semantics model
2.2 Queries on databases
2.2.1 The top-k queries on data
2.2.2 Skyline queries on data
2.3 Summary of chapter

3.1 Answering top-k queries on probabilistic data
3.1.1 The uncertain-top-k queries
3.1.2 The uncertain-k-rank
3.1.3 The probabilistic threshold top-k query
3.1.4 The global-top-k query
3.1.5 The expected-score query
3.1.6 The expected-rank
3.1.7 The robust rank
3.2 Evaluating probabilistic top-k queries on semantic properties
3.2.1 Semantic properties for top-k probabilistic queries
3.2.2 Analysing the answers to probabilistic top-k queries
3.3 Answering skyline queries on uncertain data
3.3.1 Answering skyline queries on incomplete data
3.3.2 Answering probabilistic skyline queries using the uncertain object model
3.3.3 Computing all skyline probabilities for uncertain data

4 Answering top-k best probability queries
4.1 Motivation and our proposal
4.1.1 Problem definition
4.1.2 Contributions
4.1.3 Calculation of top-k probability
4.1.4 Calculation of top-k probability with a generation rule
4.2 The top-k best probability query
4.2.1 Definition of the top-k best probability query
4.2.2 Significance of top-k best probability query
4.2.3 Finding top-k best probability and pruning rules
4.2.4 The top-k best probability algorithm
4.3 Semantics of top-k best probability query and other top-k queries
4.3.1 Semantics of ranking properties
4.3.2 Top-k queries satisfying semantic properties
4.4 Experimental study
4.4.1 Real data
4.4.2 Synthetic data

5 Answering the best probabilistic skyline queries
5.1 Motivation and our proposal
5.1.1 Motivation
5.1.2 Our proposal
5.2 Answering the bestpro-skyline query
5.2.1 Naïve definition of the probabilistic skyline query
5.2.2 The bestpro-skyline query
5.3 Overview of a Naïve solution compared to our solution
5.3.1 Naïve solution
5.3.2 Our solution
5.4 Calculating skyline probability without enumerating all possible worlds
5.4.1 Formula for calculating the skyline probability of a tuple
5.4.2 Handling generation rules
5.5 Bestpro-skyline Algorithms
5.5.1 Pruning the search using the Guard value of pivot tuples
5.5.2 Nearest Neighbour-based bestpro-skyline (NN-BPS) algorithm
5.5.3 The skyline result-based bestpro-skyline (SKY-BPS) algorithm
5.5.4 Analysis of pruning performance of NN-BPS algorithm against SKY-BPS
5.6 Experimental setup
5.6.1 Data sets
5.6.2 Algorithm setup
5.6.3 Measurement metric
5.6.4 Hardware setup
5.7 Experiment results
5.7.1 The results of bestpro-skyline query on real data
5.7.2 Comparing Naïve-Threshold and NN-BPS using real data
5.7.3 The pruning effectiveness of NN-BPS and SKY-BPS using the synthetic data set
5.7.4 Measuring pruning effectiveness as a function of time

6.1 Thesis summary
6.2 Conclusion and Key Findings
6.3 Future work


List of Figures

2.1 Probabilistic database
2.2 Skyline tuples t1 and t3 of Table 1.4
2.3 Skyline process using the nearest neighbour algorithm
3.1 A set of uncertain objects
4.1 The answer to the PT-10 vs thresholds
4.2 Accessed tuples vs k
4.3 Tuples in answer vs k
4.4 Accessed tuples vs data sets
4.5 Tuples in answer vs data sets
5.1 Probabilistic data of Table 5.1 graphed in attribute space
5.2 Nearest neighbour ordered pivot tuple selection
5.3 Skyline ordered pivot tuple selection
5.4 Two heuristic algorithms based on our approach of using the Guard value of pivot tuples to prune the search space
5.5 Probabilistic data of Table 5.4 graphed in attribute space
5.6 Replacement of generation rules with compound tuples
5.7 An example of a probabilistic data set with generation rules
5.8 The dominating tuple sets DOR_{t_p} and DOR_{t_e} of t_p and t_e, respectively
5.9 The region scanned by NN-BPS when processing the DO_{t_p} set
5.10 The region scanned by SKY-BPS when processing the DO_{t_p} set
5.11 Example of search order for NN-BPS versus SKY-BPS
5.12 Total execution time of varying data set size of real data
5.13 Size of result set of varying data set size of real data
5.14 Maximum tuple probability of 0.5 of varying data set size of synthetic data
5.15 Maximum tuple probability of 1.0 of varying data set size of synthetic data
5.16 Maximum tuple probability of 0.5 of varying data set size of synthetic data
5.17 Maximum tuple probability of 1.0 of varying data set size of synthetic data
5.18 Maximum tuple probability of 0.5 of varying number of dimensions
5.19 Maximum tuple probability of 1.0 of varying number of dimensions
5.20 Maximum tuple probability of 0.5 of varying number of dimensions
5.21 Maximum tuple probability of 1.0 of varying number of dimensions
5.22 Maximum tuple probability of 0.5 of pruning effectiveness as a function of time
5.23 Maximum tuple probability of 1.0 of pruning effectiveness as a function of time
5.24 Results of varying the maximum tuple probability


Chapter 1

Introduction

Many business intelligence and decision making applications derive reliable results from querying uncertain data. A popular way of modeling uncertain data is to use the possible worlds semantics model. The purpose of this thesis is to propose two novel approaches to efficiently process and answer probabilistic top-k and skyline queries on probabilistic data using the possible worlds semantics model.

1.1 Uncertain data

Uncertainty in data occurs in many real world sources, such as sensor data, network data, survey data, etc. Uncertain data can arise due to the limited accuracy of the equipment, noise, a lack of information, and errors or delays in data transmission. These data can be incomplete and/or contain imprecise information. For example, to control moving objects, sensors are used to detect the direction, speed, and/or position of these objects. The data collected by such a system is usually uncertain, because there may be errors in the transmission of the data, which can be caused by a high voltage environment, noise, or bad weather. Probability information can be used to indicate the confidence in measurements taken from devices such as cameras, radars, satellites, etc.

In data integration, data from different sources are combined to provide a unified view of data to users. The result of the data integration process is usually uncertain data due to missing functional annotations, non-uniform conceptualization, or an unstructured data repository of data mappings [16] [40] [21]. For example, integrating life science databases (biology, enterprises, government, libraries databases) is difficult due to the large number of diverse, interrelated data sources, which are usually represented using uncertain data. Therefore, processing uncertain data is the largest obstacle faced by these applications [18] [16].

Data may be incomplete, may contain errors, or may be represented using probabilistic information. We call these types of data uncertain data. A number of important applications on decision making and market surveillance need to process uncertain data [4] [40] [34] [17] [42]. There has been much work on modeling uncertain data [45], [15], managing uncertain data [63], [4], ranking queries [24], [51], [59], [37], [4], [57], [64], nearest neighbour queries [10], [29], [11], skyline queries on uncertain data [6], [46], etc. It is challenging to obtain accurate, reliable, and meaningful results from uncertain data.

1.2 Querying probabilistic data

The main requirement of all practical applications is that probabilistic queries on probabilistic data must be processed efficiently and return accurate and reliable results [24] [46] [11]. Top-k and skyline queries are important tools for data exploration in data mining, decision making, and market analysis applications [27] [53] [9] [46] [24] [35]. In a relational database system, probabilistic queries such as skyline or top-k queries select interesting and reliable results from the various alternatives within the probabilistic data. Therefore, this thesis focuses on answering probabilistic top-k queries and probabilistic skyline queries on probabilistic data using the possible worlds semantics model.

1.2.1 Answering probabilistic top-k queries

Top-k queries can be used to help investors make business decisions such as choosing projects which have the top-k highest profits. On probabilistic databases, top-k queries can be answered by listing all possible worlds [24] [23] [50] [51] [2] [57]. A possible world contains a number of tuples in the probabilistic data set. Each possible world is associated with a non-zero probability of existence and can contain k tuples with the highest scores. Different possible worlds can contain different sets of k highest tuples. Therefore, it is necessary to list all possible worlds to select the top-k answer.

For example, in business, investors often make decisions about various products based on analysis, statistical data and mining data [42], which provide predictions relating to potentially successful and unsuccessful projects. To analyse the market, investors firstly collect the historical statistical data, and then use the data to predict future market trends with probabilistic predictions. This is known as probabilistic data. For example, assume that the data in Table 1.1 have been collected and analysed statistically, according to historical data resources [41]. Each tuple represents an investment project of USD $100 to produce a specific product (Product ID). Investing in products based on their probabilities (Probability) will result in an estimated amount of profit. In tuple t1, a businessman invests USD $100 on product A, and it has a 0.29 chance of obtaining a profit of USD $25.

Tuple Product ID Profit of USD $100 investment Probability

Table 1.1: Predicted Profits of USD $100 investment on Products

In the real world, when analysing historical data, predictions on future market trends return two or more values per product with probabilities that the predictions are correct. Therefore, some tuples in Table 1.1 have the same product ID with different profits. In the probabilistic data model, these tuples are mutually exclusive, and controlled by a set of rules (generation rules) [24] [23] [25] [48]. For example, tuples t2 and t4 are projects that invest in product B with a 0.3 probability of producing a USD $18 profit and a 0.4 probability of producing a USD $13 profit, respectively. In this case, if the prediction for tuple t2 is true, then the prediction for tuple t4 will not be true. It is impossible for both profits to be true for the same product ID. They are mutually exclusive predictions. In Table 1.1, the probabilistic data are restricted by the exclusive rules R1 = t2 ⊕ t4 and R2 = t3 ⊕ t6.

Top-k queries can be used to help investors make business decisions such as choosing projects which have the top-2 highest profits. On probabilistic databases, top-k queries can be answered by using the probability space that enumerates the list of all possible worlds [24] [23] [50] [51] [2] [57]. A possible world contains a number of tuples in the probabilistic data set. Each possible world has a non-zero probability of existence and can contain k tuples with the highest profits. Different possible worlds can contain different sets of k tuple answers. Therefore, it is necessary to list all possible worlds of Table 1.1 to find the top-2 answers for the top-2 query on the probabilistic database. Thus, Table 1.2 has three columns: the possible world, the probability of existence, and the top-2 tuples in each possible world.
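The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm developed in this thesis. The profits and probabilities of t1, t2 and t4 and the profits of t3 and t5 are the ones quoted in the text; the probabilities of t3 and t5 and both values of t6 are assumptions introduced here for illustration. Ties in profit are broken by tuple name, which is also an assumption, and `possible_worlds` and `top_k_probabilities` are hypothetical helper names.

```python
from itertools import product

# name -> (profit, membership probability).
# ASSUMED for illustration: the probabilities of t3 and t5, and all of t6.
TUPLES = {
    "t1": (25, 0.29),
    "t2": (18, 0.30),
    "t3": (18, 0.80),
    "t4": (13, 0.40),
    "t5": (12, 1.00),
    "t6": (11, 0.20),
}
# Exclusive generation rules: R1 = t2 XOR t4, R2 = t3 XOR t6.
RULES = [["t2", "t4"], ["t3", "t6"]]


def possible_worlds(tuples, rules):
    """Yield (world, probability) for every world allowed by the rules."""
    in_rule = {name for rule in rules for name in rule}
    # Independent tuples are treated as rules with a single member.
    groups = [list(r) for r in rules] + [[n] for n in tuples if n not in in_rule]
    options = []
    for group in groups:
        choices = [(name, tuples[name][1]) for name in group]
        none_prob = 1.0 - sum(p for _, p in choices)
        if none_prob > 1e-12:  # the group may contribute no tuple at all
            choices.append((None, none_prob))
        options.append(choices)
    for combo in product(*options):
        world = [name for name, _ in combo if name is not None]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def top_k_probabilities(tuples, rules, k):
    """Sum, per tuple, the probabilities of the worlds where it ranks in the top k."""
    result = {name: 0.0 for name in tuples}
    for world, prob in possible_worlds(tuples, rules):
        ranked = sorted(world, key=lambda n: (-tuples[n][0], n))
        for name in ranked[:k]:
            result[name] += prob
    return result
```

Calling `top_k_probabilities(TUPLES, RULES, k=2)` enumerates 12 worlds for this small input. The number of worlds grows exponentially with the number of independent tuples and rules, which is why enumeration is only practical for toy examples.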

Tuple Profit Probability Top-2 probability

Problems with existing approaches. In previous research [24] [23], the top-k answers are found using the probability threshold approach called PT-k. The PT-k query returns a set of tuples with top-k probabilities greater than or equal to the user's threshold value. For example, the answer to the PT-2 query with threshold 0.3 in the example listed in Table 1.3 is the set containing 4 tuples {t2, t3, t4, t5}. We have identified three drawbacks with the PT-k query. These are listed below:

• The PT-k query may lose some important results. According to the PT-2 query, tuple t1(25, 0.29) is eliminated by the PT-2 algorithm because its top-2 probability is less than the threshold 0.3. In this case, we recommend that tuple t1 should be in the result, the reason being that tuple t1(25, 0.29) is not worse than tuple t4(13, 0.3072) when comparing both attributes of profit and top-2 probability. That is, t1.profit (25) is significantly greater than t4.profit (13) and t1.top-2 probability (0.29) is slightly less than t4.top-2 probability (0.3072). In business, investors may like to choose project t1 because they can earn nearly double the profit compared to project t4, while project t1 is only slightly riskier than project t4, with a top-2 probability difference of 0.0172. Therefore, t1 should be included in the top-k answers.

• PT-k answers may contain some redundant tuples which should be eliminated from the top-k results earlier in the query process. Referring to Table 1.3 again, tuples t4 and t5 should be eliminated immediately from the answer because the values of both attributes, profit and top-2 probability, in t4(13, 0.3072) and t5(12, 0.3298) are less than those in t3(18, 0.7304). It is obvious that investors will choose project t3, which dominates projects t4 and t5 in both profit and top-k probability.

• It is difficult for users to choose a threshold for the PT-k method. The threshold is a crucial factor in enhancing the efficiency and effectiveness of the PT-k query [24] [23]. Users may not know much about probabilistic values and probabilistic databases. Therefore, it may be difficult for users to find the most suitable threshold initially. This, in turn, means they may need to waste time using trial and error to find the most suitable threshold value.

Our Approach. In this thesis, we define a new probabilistic top-k query called the “top-k best probability query” which overcomes the three drawbacks mentioned above. The top-k best probability query takes both top-k profit and top-k probability of all possible worlds into account when selecting the best top-k answers. Below, we list the desirable properties of the top-k best probability queries, and how we achieve these properties:

• Inclusion of important tuples and elimination of redundant tuples: users are usually interested in projects with the highest profits. Therefore, the k tuples with the highest profits (scores) should be in the answer set of the top-k queries on probabilistic data. For example, in Table 1.1, tuples t1 and t2 have the top-2 best profits and therefore should be in the top-2 answer. In addition, we use the dominance principle to address both selecting the important tuples (non-dominated tuples) and eliminating the redundant tuples (dominated tuples). The tuples returned are the ones that are not dominated by any other tuples in terms of both score and probability. The non-dominated tuples will have “the best top-k probabilities”. For example, tuple t3 is a non-dominated tuple which is added into the result because it has a greater top-2 probability than all other tuples in the top-2 highest scores. Then, the rest of the tuples t4, t5, t6 in the data set are eliminated, because they are dominated by tuple t3 in both profit and top-2 probability.

• Removal of the unclear threshold: The new method, which combines the two previous techniques, removes the need for a threshold when processing top-k queries on probabilistic databases.

Therefore, the set {t1, t2, t3} is the answer to the top-2 query on Table 1.1. The top-2 best probability query returns to users not only the tuples with the best top-2 ranking scores but also the tuples with the best top-2 probabilities.
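The contrast between the PT-2 answer and the top-2 best probability answer can be illustrated with a short Python sketch. It is one plausible reading of the informal description above, not the definition or algorithm given later in the thesis. The (profit, top-2 probability) pairs are those quoted in the text, except that the top-2 probability of 0.30 for t2 is an assumption (it is consistent with the PT-2 answer quoted earlier); t6 is omitted because its top-2 probability is not quoted. The function names are hypothetical.

```python
# name -> (profit, top-2 probability); the 0.30 for t2 is an ASSUMED value.
CANDIDATES = {
    "t1": (25, 0.29),
    "t2": (18, 0.30),
    "t3": (18, 0.7304),
    "t4": (13, 0.3072),
    "t5": (12, 0.3298),
}


def pt_k(candidates, threshold):
    """Threshold query: keep tuples whose top-k probability reaches the threshold."""
    return sorted(n for n, (_, p) in candidates.items() if p >= threshold)


def dominates(a, b):
    """a dominates b if it is no worse in both attributes and better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def top_k_best_probability(candidates, k):
    """Union of the k highest-profit tuples and the non-dominated tuples."""
    by_profit = sorted(candidates, key=lambda n: (-candidates[n][0], n))
    answer = set(by_profit[:k])
    for name, value in candidates.items():
        if not any(dominates(other, value) for other in candidates.values()):
            answer.add(name)
    return sorted(answer)
```

With a threshold of 0.3, `pt_k` keeps t2, t3, t4 and t5 and drops t1, whereas `top_k_best_probability(CANDIDATES, 2)` keeps t1, t2 and t3 without any threshold being supplied.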

1.2.2 Answering probabilistic skyline queries

In many increasingly important fields such as decision-making, market analysis, and personalized services, querying large sets of uncertain data is becoming popular. This gives rise to two important challenges: firstly, how to find the important data of interest from a multi-dimensional data set; and secondly, how to deal with the uncertainty in the data. The skyline query is one way of addressing the first challenge. It works by only returning tuples of data that are not dominated by any other tuples. For example, when looking for the best hotels based on distance and price, a non-dominated tuple is a hotel for which no other hotel is both cheaper and closer. This allows many tuples (hotels) which are less interesting (hotels which are further away and higher priced) to be pruned from the result set. In addition, uncertainty in data can occur in many real world sources, for example, network data, sensor data, survey data, etc. There has been much work on answering top-k queries [24] [51] [59] [37] [4] [57] [64], nearest neighbour queries [10] [29] [11], skyline queries on uncertain data [6] [46], etc. It is challenging to obtain accurate, reliable, and meaningful results from this data.
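As a small illustration of the skyline query on certain (non-probabilistic) data, the following Python sketch computes the non-dominated hotels by pairwise comparison. The hotel names and values are hypothetical, and the quadratic scan is chosen only for clarity; it is not one of the skyline algorithms discussed in this thesis.

```python
# Hypothetical hotels: name -> (distance in km, price); smaller is better for both.
HOTELS = {
    "A": (1.0, 200.0),
    "B": (2.5, 120.0),
    "C": (3.0, 150.0),
    "D": (0.8, 260.0),
    "E": (4.0, 90.0),
}


def dominates(a, b):
    """a dominates b if a is no worse in every dimension and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def skyline(points):
    """Return the names of the points that no other point dominates."""
    return sorted(
        name
        for name, value in points.items()
        if not any(dominates(other, value) for other in points.values())
    )
```

In this data, hotel C is pruned because hotel B is both closer and cheaper, while A, B, D and E each remain because no single hotel beats them in both distance and price.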

The above discussion motivates the need for answering skyline queries on uncertain data. Currently, existing work on answering probabilistic skyline queries using the uncertain object model either requires the user to set a threshold [46] or returns the skyline probabilities of every instance [6]. Neither approach is desirable. In the case of the former, it is hard for users to set a meaningful threshold: a high threshold may not return any results, while a low threshold may return too many results. The latter approach of returning the skyline probability of all instances is also undesirable since it returns too many results. Therefore, this thesis proposes the novel bestpro-skyline query using the possible worlds semantics model to obtain accurate, reliable, and meaningful results from uncertain data without the need to set any threshold.

The possible worlds semantics model is widely used in many applications such as searching [3] [51], data cleaning [5] [19], decision making [46] [57], and querying [24] [51] [37] [10] [11]. The possible worlds semantics model consists of probabilistic data and generation rules. The probabilistic data contain a set of probabilistic tuples, each of which contains a set of attribute values and a probability for the existence of the tuple. Generation rules are existence constraints on the tuples. An example of a generation rule is an exclusive rule, under which only one of the tuples in the rule can exist in any possible world. Generation rules allow relationships between the existences of tuples to be expressed, a key feature lacking in the uncertain object model.

We provide an example of a probabilistic skyline query within the context of the possible worlds semantics model. Suppose the government or a not-for-profit organization needs to choose a set of projects to fund, and each tuple represents a proposed project and its associated probability of success. Table 1.4 shows the probabilistic data of five different project proposals with different costs and time frames. Some of the projects are mutually exclusive, meaning only one of the set of mutually exclusive projects can be chosen, due to manufacturing constraints, availability of resources, etc. In this example, we assume the following exclusive rule exists: R1 = t1 ⊕ t2, meaning only one of t1 or t2, but not both, can be chosen. The other projects are independent of each other.

Tuple Project ID Time Cost Probability

Table 1.4: Probabilistic data, where each tuple represents a different project proposal

For this example, we may want to find the projects which have both low cost and short time (2 dimensions). However, this may be impossible since there may not be any project that meets both criteria simultaneously. Therefore, a skyline query which finds the set of projects that are not dominated in both cost and time would be very useful in this situation. We call such queries probabilistic skyline queries.

Although top-k queries have been studied extensively in the context of the possible worlds semantics model, that body of work is not directly applicable to answering probabilistic skyline queries. The reason is that skyline queries are fundamentally different to top-k queries: skyline queries need to consider multiple attributes separately when applying the dominance principle, whereas top-k queries can first combine all the attributes into a single score and then only consider that single score attribute. However, for skyline queries it is nearly impossible to combine all the attributes into a single score. Therefore, the formulas developed for probabilistic top-k queries cannot be used for probabilistic skyline queries.

In existing work on answering skyline queries on uncertain data, Pei et al. [46] proposed the p-skyline query using the uncertain object model to obtain the objects with skyline probabilities above threshold p. Atallah and Qi [6] argued that using a threshold can result in the loss of important answers for users, and a high threshold value could lead to an empty result. Therefore, they proposed a new method to compute the skyline probabilities of all instances of objects. The answer set contains all objects with skyline probabilities above zero. However, firstly, neither method for answering probabilistic skyline queries using the uncertain object model can support the inclusive rules in the possible worlds semantics model. Secondly, the threshold technique [46] and computing the skyline probabilities of all instances of objects [6] return too many results which are not necessary for users. Therefore, to answer probabilistic skyline queries using the possible worlds semantics model, the Naïve solution is applied to obtain probabilistic skyline answers, which are then analysed to select the probabilistic skyline tuples that are most interesting to users.

Using a Naïve solution for probabilistic skyline queries on probabilistic data would first enumerate all possible worlds. Each possible world is a set of tuples which satisfies the generation rule constraints. Table 1.5 shows all the possible worlds for the example in Table 1.4. The first column of Table 1.5 shows the set of tuples in each possible world, the second column shows the probability of the existence of each possible world, and the third column shows the skyline tuple(s) for each possible world. The Naïve method will then, for each tuple, sum the probabilities of the possible worlds in which that tuple is a skyline tuple. In our example, the skyline probabilities (ps_i) for the five tuples are shown in Table 1.6. If the number of tuples is large, the user can instruct the Naïve algorithm to only return the tuples whose skyline probabilities are above a certain threshold, like many of the existing studies [46] [24] [23].
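The Naïve procedure described above can be sketched in Python as follows. The five projects below use hypothetical time, cost and probability values, not the rows of Table 1.4; only the exclusive rule R1 = t1 ⊕ t2 is taken from the text. The helper names `possible_worlds` and `naive_skyline_probabilities` are illustrative assumptions.

```python
from itertools import product

# HYPOTHETICAL projects: name -> (time, cost, probability); smaller time/cost is better.
PROJECTS = {
    "t1": (2, 4, 0.3),
    "t2": (3, 5, 0.2),
    "t3": (4, 2, 0.6),
    "t4": (5, 3, 0.5),
    "t5": (3, 6, 0.4),
}
RULES = [["t1", "t2"]]  # R1 = t1 XOR t2; all other tuples are independent


def dominates(a, b):
    """a dominates b in (time, cost): no worse in both, strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def possible_worlds(projects, rules):
    """Yield (world, probability) for every world allowed by the exclusive rules."""
    in_rule = {n for rule in rules for n in rule}
    groups = [list(r) for r in rules] + [[n] for n in projects if n not in in_rule]
    options = []
    for group in groups:
        choices = [(n, projects[n][2]) for n in group]
        none_prob = 1.0 - sum(p for _, p in choices)
        if none_prob > 1e-12:  # no tuple of this group appears
            choices.append((None, none_prob))
        options.append(choices)
    for combo in product(*options):
        world = [n for n, _ in combo if n is not None]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def naive_skyline_probabilities(projects, rules):
    """For each tuple, add up the probabilities of the worlds where it is in the skyline."""
    ps = {n: 0.0 for n in projects}
    for world, prob in possible_worlds(projects, rules):
        for n in world:
            if not any(dominates(projects[o], projects[n]) for o in world):
                ps[n] += prob
    return ps
```

Even this five-tuple input produces 24 possible worlds, and every additional independent tuple doubles that count, which illustrates why the Naïve method does not scale.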

Possible world (w) Probability of w (p(w)) Skyline tuple

Problems with the existing approaches. There are two main problems with the existing approaches.

• It is difficult for users to specify a threshold because if it is set too high, few or no results may be returned, and if it is set too low, too many results may be returned. It is difficult for users to know what threshold to set before executing the query.

• The Naïve method takes too long to compute the skyline probabilities because there is an exponential number of possible worlds that need to be considered.

Tuple Project ID Time Cost Skyline probability Interesting probabilistic skyline tuple

Our Approach. In this thesis, a new approach called the “bestpro-skyline query” is proposed to overcome the above problems.

• Selecting interesting probabilistic skyline tuples: the results of bestpro-skyline queries must contain the interesting probabilistic skyline tuples using the extended dominance concept, because users generally are not interested in tuples which are dominated by another tuple in terms of both skyline probabilities and attribute values. We define a probabilistic skyline tuple (t_i, ps_i) as a tuple t_i with its associated skyline probability ps_i. We use the term non-dominated probabilistic skyline tuples to refer to the interesting probabilistic skyline tuples. In our example, in Table 1.7 we have summarized the information needed to identify the interesting probabilistic skyline tuples. We can see that the probabilistic skyline tuple (t1, ps1) dominates probabilistic skyline tuple (t2, ps2) because t1 dominates t2 in the attribute space and the skyline probability of t1 (ps1 = 0.3) is greater than the skyline probability of t2 (ps2 = 0.2). Similar analysis can be applied to derive the fact that probabilistic skyline tuple (t1, ps1) also dominates probabilistic skyline tuple (t5, ps5) and probabilistic skyline tuple (t3, ps3) dominates probabilistic skyline tuple (t4, ps4). Therefore, the interesting probabilistic skyline tuples (t1, ps1), (t3, ps3) should be added to the result set, and the non-interesting tuples (t2, ps2), (t4, ps4), (t5, ps5) should be removed from the result set.

• Removal of the threshold: The bestpro-skyline query does not need a threshold value to be defined because the non-dominated skyline tuples are selected automatically using the extended dominance concept.

• Computing the skyline probability effectively: The new approach uses a new formula based on probability theory to directly calculate the skyline probability without listing any possible worlds.
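The last two points can be illustrated with a short Python sketch. It uses the standard product form for independent tuples grouped by exclusive rules: a tuple is in the skyline if it exists and, for every other rule, none of that rule's members that dominate it exists. Whether this coincides term by term with the formula derived later in the thesis is an assumption. The project values are hypothetical, the strict comparison `ps[o] > ps[n]` follows the wording "greater than" in the example above, and the function names are illustrative.

```python
# HYPOTHETICAL projects: name -> (time, cost, probability); smaller time/cost is better.
PROJECTS = {
    "t1": (2, 4, 0.3),
    "t2": (3, 5, 0.2),
    "t3": (4, 2, 0.6),
    "t4": (5, 3, 0.5),
    "t5": (3, 6, 0.4),
}
RULES = [["t1", "t2"]]  # R1 = t1 XOR t2


def dominates(a, b):
    """a dominates b in (time, cost): no worse in both, strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def skyline_probability(name, projects, rules):
    """Closed-form skyline probability; no possible world is enumerated."""
    in_rule = {n for rule in rules for n in rule}
    groups = [list(r) for r in rules] + [[n] for n in projects if n not in in_rule]
    result = projects[name][2]  # the tuple itself must exist
    for group in groups:
        if name in group:
            continue  # members of its own rule never coexist with it
        # Members of one rule are mutually exclusive, so their probabilities add.
        mass = sum(
            projects[o][2] for o in group if dominates(projects[o], projects[name])
        )
        result *= 1.0 - mass
    return result


def bestpro_skyline(projects, rules):
    """Keep tuples not dominated in both attribute space and skyline probability."""
    ps = {n: skyline_probability(n, projects, rules) for n in projects}
    answer = [
        n
        for n in projects
        if not any(
            dominates(projects[o], projects[n]) and ps[o] > ps[n] for o in projects
        )
    ]
    return sorted(answer), ps
```

On this hypothetical input, `bestpro_skyline(PROJECTS, RULES)` keeps t1 and t3 and discards t2, t4 and t5, with no threshold supplied, and the cost of computing each skyline probability is linear in the number of tuples rather than exponential in it.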

1.3 Contributions of this thesis

The main contribution of this thesis is to propose novel approaches for defining and answering probabilistic top-k and probabilistic skyline queries. We first developed the top-k best probability query on probabilistic databases and proposed the semantic ranking property for probabilistic top-k queries. Secondly, we introduced a novel approach which answers probabilistic skyline queries on probabilistic data using the dominance of probabilistic skyline tuples concept.

1.3.1 Contribution on answering probabilistic top-k queries

Our work on answering probabilistic top-k queries [31] [33] makes the following contributions:

• We develop formulas to calculate the top-k probability and handle the inclusive and exclusive rules (generation rules) on probabilistic databases.

• We introduce some pruning rules using probability theory to improve the effectiveness of the algorithm of the top-k best probability query. These rules are mathematically proven for correctness and are used to reduce the computation cost of calculating the top-k probabilities of tuples, which allows early stopping conditions.

• We prove that our top-k best probability query covers the existing

seman-tic properties from [15] [65] and also covers a newly proposed property called

Trang 26

isting semantic properties from [15] [65] or not It can be clearly seen that

the results of the top-k best probability queries satisfy more semantic ranking properties than other existing top-k queries.

• Both real and synthetic data sets are used to verify the effectiveness of the proposed approach in our extensive experimental study. The results show that the top-k best probability method outperforms the PT-k method in terms of effectiveness and efficiency, and that the top-k best probability answer is more stable in the number of top-k results (tuples) than the PT-k answer.

The second approach focuses on answering probabilistic skyline queries on probabilistic data using the possible worlds semantics model [32]. We make the following contributions:

• We define a new query called the bestpro-skyline query, which returns only the interesting probabilistic skyline tuples. The new approach does not need any thresholds to be set by users, yet it is very effective at pruning uninteresting probabilistic skyline tuples from the result set.

• We propose the bestpro-skyline query, which uses formulas based on probability theory to directly calculate skyline probabilities without considering any possible world.

• We use a pruning technique to reduce the number of tuples considered during the computation. Our idea is to incrementally use a set of pivot tuples for pruning. Any tuple can be a pivot tuple. A pivot tuple \(t_p\) has an associated upper bound on the skyline probability of all tuples that are dominated by \(t_p\) in terms of the attribute space. Using this upper bound, we can prune the search space.

• We propose two algorithms that answer the bestpro-skyline query by using different heuristics for selecting the pivot tuples. One is a nearest neighbour-based algorithm and the other is a skyline result-based algorithm. We extensively compare the two algorithms in terms of their effectiveness in pruning the search space, both experimentally and analytically.

• We conduct experiments comparing the Naïve algorithm against our bestpro-skyline algorithms. The results show that the bestpro-skyline algorithms are able to prune the number of returned probabilistic skyline tuples to only 22 interesting ones among 10,000 tuples in a real data set. The bestpro-skyline algorithms are up to three orders of magnitude faster than the Naïve algorithm. In addition, the Naïve method cannot handle data sets of 10,000 tuples or more within a reasonable time frame (less than one day). In contrast, the bestpro-skyline algorithms can compute results for data sets of 10,000 tuples within 41 seconds.

1.4 Thesis organization

The organization of this thesis is outlined as follows:

Chapter 2 provides the background on modelling uncertain data, and discusses the uncertain object model and the possible worlds semantics model. Then, several basic definitions and concepts related to answering top-k queries and skyline queries on certain data are given. These definitions form a foundation for defining the concepts used in answering queries on uncertain data.

Chapter 3 reviews the current research on answering probabilistic top-k queries and probabilistic skyline queries. Firstly, various definitions which have been introduced for answering top-k queries are presented and discussed to evaluate the effectiveness of these methods. The semantic properties of probabilistic top-k queries are presented, which are used to evaluate the correct semantics of probabilistic top-k queries. Then, some existing definitions for answering probabilistic skyline queries are reviewed. All use the uncertain object model to process probabilistic skyline queries.

Chapter 4 presents the problems with existing techniques for answering probabilistic top-k queries. Then, we introduce the dominance concept on the score and top-k probability of tuples and use this to define a new type of query called the top-k probability query. We then define the pruning rules, an efficient algorithm for answering the query, and a new property called k-best ranking scores. We also evaluate whether the semantics of ranking queries on probabilistic data have been satisfied or not compared to all the existing approaches. Lastly, extensive experiments using real and synthetic data are conducted to show the effectiveness and efficiency of the top-k bestpro method.

In Chapter 5, we first define a Naïve probabilistic skyline query and expose its flaws. We then define our bestpro-skyline query, and overview our solution for answering the bestpro-skyline query. We describe our formula for efficiently calculating the skyline probability of tuples directly without the need to enumerate all possible worlds. Both the nearest neighbour and skyline result-based heuristic algorithms for the bestpro-skyline query are presented.

Chapter 6 concludes the thesis with its contributions and future work. The contributions of this thesis are answering the top-k bestpro query and answering the bestpro-skyline query on probabilistic data using the possible worlds semantics model. Finally, we suggest future work in the area of processing uncertain data.

Chapter 2

Background

This chapter first describes two probabilistic database models, namely the probabilistic database model and the possible worlds semantics model. Secondly, the basic concepts of top-k queries are reviewed, and skyline queries with a nearest neighbour algorithm are presented and examined.

Managing, modelling, and processing uncertain data has been studied in relational databases. Probabilistic data is data associated with probabilities representing the likelihood of information being true. Probability can usually be represented by a value from zero to one or a probability density function (PDF). There are two common models of probabilistic databases for these representations, namely the uncertain object model and the possible worlds semantics model.

Uncertain object data consists of a set of uncertain objects; each uncertain object usually describes a moving object or an object's appearance in a region. Therefore, probabilistic information on an object is represented by a probabilistic interval or probabilistic region. The following definitions of the uncertain object model are found in papers [13] [1] [55] [12] [46] [6].

Definition 1 An uncertain object has multiple instances which are mutually exclusive for the same object.

– An object (\(U\)) contains \(l\) instances \(U(u_1, u_2, \ldots, u_l)\), \(f(u_i) > 0\) (\(1 \le i \le l\)).

– Each instance (\(u_i\)) has \(m\) dimensions (attributes) \(u_i(d_{i1}, d_{i2}, \ldots, d_{im})\).

Generally, the probability of instance \(u\) in uncertain object \(U\) is described by a value in \((0, 1]\), or by \(f(u)\), a probability density function (PDF) or a probability mass function (PMF).

Table 2.1 shows an example of the probabilities of four uncertain objects (A, B, C, E).

Objects Product ID Proﬁt of USD $100 investment Probability

Table 2.1: List of companies

Definition 2 Uncertain object data (\(D\)) consists of a finite number (\(n\)) of uncertain objects (\(U\)).

This section provides all the definitions and concepts of the possible worlds semantics model based on papers [45] [63] [15] [25] [57] [65] [50] [51]. The possible worlds semantics model consists of probabilistic data and a set of generation rules. The probabilistic data contains a set of probabilistic tuples, each probabilistic tuple having one or multiple attributes and a probability, as shown in Figure 2.1.

Definition 3 Tuple (\(t\)) contains a number of attributes which are specific values within the range.

Definition 4 A probabilistic tuple (\(t_i\)) is a tuple (\(t\)) associated with a probability (\(p_i\)). The probability (\(p_i\)) of tuple (\(t_i\)) is the likelihood of this tuple appearing in the data set (\(D\)).

2.1 Probabilistic database models

Figure 2.1: Probabilistic database

For example, the probabilistic tuples are listed in Table 1.1.

Definition 5 Probabilistic data (\(D\)) is a finite set of probabilistic tuples, where each probabilistic tuple is a tuple associated with a probability to denote the uncertainty of the data, and almost all probabilistic tuples in probabilistic databases are independent.

\[ D = \{t_1, t_2, \ldots, t_n\} \]

Table 1.1 shows an example of probabilistic data, which represents the predicted profit of an investment of USD $100 (Profit of USD $100 investment) on Products (Product ID).

Definition 6 A possible world (\(w_i\)) contains a number of tuples in \(D\), which satisfies the generation rules for the appearance of tuples in the data (see Definition 7). The probability of the existence of a possible world (\(p(w_i)\)) is calculated by the [...]

Definition 7 Generation rules (\(R\)) contain exclusive rules and inclusive rules in the possible worlds semantics model.

• If a rule is exclusive, it is in the form \(R_k^+ = t_{k1} \oplus t_{k2} \oplus \cdots \oplus t_{kq}\) (\(R_k^+ \in R\)), where ⊕ denotes the exclusive operation of tuples, and the sum of the probabilities of the tuples is always greater than zero and less than or equal to one (\(0 < \sum_{j=1}^{q} p_{t_{kj}} \le 1\)).

• [...] \(p_{t_{k'1}} = \cdots = p_{t_{k'q}} \le 1\)).

For example, the exclusive rule \(R_1 = t_2 \oplus t_4\) on probabilistic data means that tuple t2 and tuple t4 cannot both exist in the same possible world, and \(R_2 = t_3 \oplus t_6\) means that tuple t3 and tuple t6 cannot both exist in the same possible world. Table 1.2 lists all of the possible worlds of Table 1.1 which are restricted by rules \(R_1 = t_2 \oplus t_4\) and \(R_2 = t_3 \oplus t_6\). For the case of possible world \(w_7 = \{t_2, t_3, t_5\}\), its probability of existence is calculated as follows:

\[ p(w_7) = p_2 \times p_3 \times p_5 \times (1 - p_1) = 0.3 \times 0.8 \times 1 \times (1 - 0.29) = 0.1704 \]
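The calculation of p(w7) can be reproduced with a short sketch. It assumes the usual reading of exclusive rules, in which at most one tuple of a rule appears in a world and tuples outside any rule are independent. The values p4 = 0.4 and p6 = 0.2 are assumptions chosen to agree with the probabilities quoted in this thesis, because Table 1.1 is not reproduced in this section, and the function name `world_probability` is illustrative only.

```python
# Existence probabilities. p1, p2, p3 and p5 come from the worked example;
# p4 and p6 are ASSUMED values (Table 1.1 is not reproduced in this section).
prob = {"t1": 0.29, "t2": 0.3, "t3": 0.8, "t4": 0.4, "t5": 1.0, "t6": 0.2}

# Exclusive generation rules R1 = t2 (+) t4 and R2 = t3 (+) t6.
rules = [("t2", "t4"), ("t3", "t6")]


def world_probability(world, prob, rules):
    """Return the existence probability of one possible world (a set of tuple ids)."""
    ruled = {t for rule in rules for t in rule}
    p = 1.0
    # Independent tuples: multiply by p if present, by (1 - p) if absent.
    for t, pt in prob.items():
        if t not in ruled:
            p *= pt if t in world else 1.0 - pt
    # Exclusive rules: at most one member may appear in the world.
    for rule in rules:
        present = [t for t in rule if t in world]
        if len(present) > 1:
            return 0.0  # violates the rule, so this is not a possible world
        if present:
            p *= prob[present[0]]
        else:
            p *= 1.0 - sum(prob[t] for t in rule)
    return p


w7 = {"t2", "t3", "t5"}
print(round(world_probability(w7, prob, rules), 4))  # 0.1704
```

A set that contains both t2 and t4 violates R1 and therefore receives probability zero under this sketch.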

This possible worlds semantics model has been applied to many real applications on sensors, networks, and biological information [57] [7] [47] [61]. Specifically, the techniques for ranking queries and top-k queries have been developed using the possible worlds semantics model for uncertain data. The semantics of the answers to these queries differ from those of previous work, such as the most probable top-k answers (U-top-k) [51], top-k answers with a probabilistic threshold (PT-k) [24], score distribution and typical answers to top-k queries (typical-top-k) [36], and Expected rank (E-rank) [15]. Therefore, the semantic answers to probabilistic top-k queries have been studied. Our research aims to answer skyline queries using the possible worlds semantics model. The possible worlds semantics model is created based on the tuple level of probabilistic relational databases [6] [15].

The answer to a query using the possible worlds semantics model is a set of probabilistic tuples. Examples of these answers will be discussed in more detail in the following chapters.

2.2 Queries on databases

This section studies some basic concepts on queries relating to top-k and skyline queries. These two queries are formally defined in this section, which is the foundation for this research. In later chapters, we will extend the definitions to probabilistic data.

The traditional top-k queries are useful in data exploration and decision making [51] [52] [39]. In relational databases, answering top-k queries is defined as follows:

Definition 8 Top-k queries: a top-k query returns the k tuples (Ls) with the k best scores. The score is the outcome of a score function (\(f_{score}\)) defined by the user.

\[ Q^{k}_{top}(D) = Ls = \{t_{i_1}, \ldots, t_{i_k}\} : \forall t_{i'} \in D \setminus Ls,\; f_{score}(t_{i'}) < f_{score}(t_i) \]

For example, in relation to the certain data in Table 1.1, the top-2 query on Table 1.1 returns 2 tuples {t1, t2}, because tuples t1 and t2 have the highest profits.

This definition of top-k queries on data is the foundation from which several definitions for answering probabilistic top-k queries on probabilistic data are developed in Chapter 3 and Chapter 4.
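Definition 8 can be illustrated with a minimal sketch on certain data. The profit values below are hypothetical, since only the fact that t1 and t2 have the highest profits is taken from the text, and `top_k` is an illustrative helper rather than a function defined in this thesis.

```python
import heapq

# HYPOTHETICAL profits per tuple; only the ranking of t1 and t2 at the top
# is taken from the text.
profit = {"t1": 25, "t2": 22, "t3": 18, "t4": 15, "t5": 12, "t6": 9}


def top_k(scores, k):
    """Return the k tuple ids with the highest score, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


print(top_k(profit, 2))  # ['t1', 't2']
```

Using a heap-based selection avoids sorting the whole data set when k is much smaller than the number of tuples.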

This subsection revisits several basic concepts of skyline queries on certain data, in which key definitions and algorithms in the existing literature on skyline queries on certain data are presented [9] [6] [43] [27] [46] [28] [8] [60].

a Skyline answers on data

Definition 9 Dominance of tuples: Given two v-dimensional tuples \(t_a = (a_1, a_2, \ldots, a_v)\) and \(t_b = (b_1, b_2, \ldots, b_v)\), \(t_a \neq t_b\), tuple \(t_a\) dominates tuple \(t_b\) (\(t_a \prec t_b\)) if and only if the values of all dimensions of tuple \(t_a\) are more desirable than or equal to the corresponding dimensions of tuple \(t_b\). The dominance of tuples can be represented as follows:

\[ t_a \prec t_b \Leftrightarrow \forall a_k \in \{a_1, a_2, \ldots, a_v\}, \forall b_k \in \{b_1, b_2, \ldots, b_v\} : (a_k \preceq b_k) \]

For example, in relation to the certain data in Table 1.4, tuple t3 dominates tuple t4 according to Definition 9, because all attribute values of t3(2, 10) are lower than the corresponding attribute values of t4(6, 14) in terms of cost and time.

Definition 10 Skyline queries on data: Given data set \(D = \{t_1, t_2, \ldots, t_n\}\), a skyline query returns the list of all tuples (Ls) in which every tuple \(t_i\) is not dominated by any other tuple in \(D\), that is, \(\forall t_i \in Ls, \forall t_j \in D \Rightarrow t_j \nprec t_i\). These tuples (non-dominated tuples or skyline tuples) in Ls are called skyline [...] other tuples in the data set (refer to Figure 2.2). Therefore, projects t1 and t3 are the answers on which a government or a not-for-profit organization needs to make a decision.
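Definitions 9 and 10 translate directly into a naive pairwise check, sketched below under the assumption that lower values are more desirable in every dimension (as for cost and time). The coordinates of t3 and t4 are taken from the example above; those of t1 and t2 are hypothetical, chosen only so that the skyline is {t1, t3} as in Figure 2.2.

```python
def dominates(a, b):
    """True if a dominates b: a is no worse in every dimension and a != b.

    Assumes lower values are more desirable in every dimension.
    """
    return a != b and all(x <= y for x, y in zip(a, b))


def skyline(data):
    """Return the ids of all tuples not dominated by any other tuple."""
    return {
        name
        for name, t in data.items()
        if not any(dominates(other, t) for other in data.values())
    }


# (cost, time) per project. t3 and t4 come from the text;
# t1 and t2 are HYPOTHETICAL coordinates.
data = {"t1": (4, 4), "t2": (8, 6), "t3": (2, 10), "t4": (6, 14)}

print(dominates(data["t3"], data["t4"]))  # True
print(sorted(skyline(data)))              # ['t1', 't3']
```

This check compares every pair of tuples, so it is quadratic in the number of tuples; it is meant only to make the definitions concrete.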

Figure 2.2: Skyline tuples t1 and t3 of Table 1.4

b The nearest neighbour algorithm for skyline queries

We present the nearest neighbour algorithm here because it is used in Chapter 5 for answering our bestpro-skyline query.

The nearest neighbour (NN) algorithm [43] [9] is based on the idea that the NN tuple from the origin is a member of the skyline set, because no tuple in the data set can dominate the NN tuple. Therefore, the NN technique iteratively locates the NN tuples using a distance function from the origin to the tuples (e.g. Euclidean distance \(\sqrt{\sum_{i=1}^{n} |x_i|^2}\)). When each NN tuple is located, it divides the search space of the NN algorithm into \(2^d\) regions. The region (region IV in Figure 2.3(a)) which contains all the tuples dominated by the NN tuple is pruned. Then, the NN tuple is added to the skyline set. Similarly, the next NN tuple is selected as the next skyline tuple. This process is repeated to select all of the skyline tuples, during which the search space of the NN algorithm shrinks until it is empty. Then, the NN algorithm for finding skyline tuples terminates.

As an example, the NN algorithm for skyline queries is executed on the data set in Figure 2.3. Firstly, the NN algorithm locates tuple t1, which has the minimum distance from the origin among all tuples in the data set. Tuple t1 divides the space into four (\(2^d\)) non-disjoint regions (I, II, III, IV) as shown in Figure 2.3(a). The nearest tuple t1 is a member of the skyline set. All tuples {t2, t4} in region IV are dominated by tuple t1 and are pruned from the data set. The search space for the NN algorithm shrinks to a smaller space which contains only tuple t3. The next NN tuple is t3, which again divides the space, as shown in Figure 2.3(b). The skyline tuple t3 is selected for the skyline set. The search space of the NN algorithm is now empty, so the skyline technique terminates and returns the skyline set {t1, t3}.
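The iteration described in this example can be sketched as follows. This is a simplified in-memory version and not the region-based implementation of [43] [9]: instead of partitioning the space into regions, it repeatedly takes the remaining tuple nearest to the origin and discards the tuples it dominates. It assumes non-negative attribute values where lower is better; the coordinates of t1 and t2 are hypothetical, and `nn_skyline` is an illustrative name.

```python
import math


def dominates(a, b):
    """True if a dominates b (lower is better in every dimension, a != b)."""
    return a != b and all(x <= y for x, y in zip(a, b))


def nn_skyline(data):
    """Return skyline tuple ids in the order the NN iteration finds them.

    Simplified sketch: assumes non-negative attribute values, so the tuple
    nearest to the origin cannot be dominated by any remaining tuple.
    """
    remaining = dict(data)
    result = []
    while remaining:
        # Locate the nearest neighbour of the origin (Euclidean distance).
        nn = min(remaining, key=lambda name: math.hypot(*remaining[name]))
        nn_point = remaining.pop(nn)
        result.append(nn)
        # Prune every tuple dominated by the new skyline tuple.
        remaining = {
            name: t for name, t in remaining.items()
            if not dominates(nn_point, t)
        }
    return result


# t3 and t4 come from the text; t1 and t2 are HYPOTHETICAL coordinates.
data = {"t1": (4, 4), "t2": (8, 6), "t3": (2, 10), "t4": (6, 14)}
print(nn_skyline(data))  # ['t1', 't3']
```

With these coordinates, t1 is found first and prunes t2 and t4, after which t3 is the only remaining tuple, mirroring the two steps of the example.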

2.3 Summary of chapter

This chapter described the background of uncertain data and formally defined several concepts of top-k and skyline queries.

There are two models for uncertain data: the uncertain object model for the attribute level and the possible worlds semantics model for the tuple level.

The uncertain object model contains data on a set of uncertain objects. Uncertain object data contain a finite number of uncertain objects. Each object has a set of instances. Every instance usually has the same probability. The second model is the possible worlds semantics model, consisting of probabilistic data and generation rules. The probabilistic data contain a list of probabilistic tuples, each of which contains a set of attribute values and a probability. A possible world is viewed as a set of possible instances. Each possible world is a list of tuples associated with existence probabilities. The generation rules state the tuple dependencies, which imply the appearance of a tuple in relation to the appearance of other tuples in possible worlds.

Several basic concepts of top-k and skyline queries on certain data were provided. The top-k queries on certain data return k answers with the k best function scores. This definition of top-k queries will be studied and extended for answering probabilistic top-k queries in Chapters 3 and 4. The skyline queries on certain data return a set of skyline tuples, which are non-dominated tuples. A non-dominated tuple is a tuple that is not dominated by any other tuple (Definition 9). In addition, the nearest neighbour algorithm for selecting a skyline set on certain data was presented.

The purpose of this chapter is to provide the foundation for answering top-k and skyline queries on certain data and for the modelling of uncertain data. These concepts and models will be extended, modified, and used for answering the probabilistic top-k and the probabilistic skyline queries on uncertain data.

Table 2.2 is a summary of the notations used in this thesis.

Notation                                      Definition
\(U = \{u_1, u_2, \ldots, u_l\}\)             an uncertain object
\(u_i\)                                       an instance of an uncertain object
\(u_i = \{d_{i1}, d_{i2}, \ldots, d_{im}\}\)  instance \(u_i\) has m dimensions
\(D = \{U_1, U_2, \ldots, U_n\}\)             uncertain object data
PDF                                           probability density function
PMF                                           probability mass function
\(t_a = (a_1, a_2, \ldots, a_v)\)             tuple \(t_a\) has v attributes
\(p_a\)                                       the probability of tuple \(t_a\)
\(D = \{t_1, t_2, \ldots, t_n\}\)             probabilistic data
\(w_i\)                                       a possible world
\(p(w_i)\)                                    probability of existence of possible world \(w_i\)
\(W\)                                         the possible world space (all possible worlds)
\(|W|\)                                       the number of possible worlds
\(R\)                                         generation rule
\(R^+\)                                       exclusive rule
\(R^*\)                                       inclusive rule
⊕                                             the exclusive operation
⊗                                             the inclusive operation
\(t_a \prec t_b\)                             tuple \(t_a\) dominates tuple \(t_b\)
Ls                                            answer set of queries
\(pr^{k}_{name}\)                             top-k probability of the 'name' method
\(Q^{k}_{top}\)                               top-k queries
\(pr_{name\ sky}\)                            skyline probability of the 'name' method
\(Q_{sky}\)                                   skyline queries

Table 2.2: The summary of notations

Chapter 3

Existing work on probabilistic queries

In this chapter, the previous work on answering probabilistic top-k queries and probabilistic skyline queries is analysed and studied. Firstly, all definitions of probabilistic top-k queries on probabilistic data are reviewed, and the answers to the queries are evaluated using the semantic properties of probabilistic top-k queries, after which the problems relating to answering probabilistic top-k queries are presented. Secondly, the process of answering the probabilistic skyline queries on uncertain data is analysed based on uncertain data models, and the possible worlds semantics model is studied, after which the problem of probabilistic skyline queries is presented.

There are several research studies which have introduced new definitions in relation to answering probabilistic top-k queries on probabilistic data.

Top-k queries on probabilistic data (\(D\)) are required to return a set of probabilistic top-k tuples which are a subset of \(D\). A probabilistic top-k tuple is determined based on its score and its probabilities across all possible worlds. Both score and probabilities across all possible worlds are considered when interpreting semantically the answers to probabilistic top-k queries [25] [6] [46] [51] [23] [65].

There are a number of definitions of probabilistic top-k queries based on the semantics of possible worlds. Tables 1.1 and 1.2 in the Introduction chapter gave an example of the possible worlds model. The following subsection presents current approaches for answering probabilistic top-k queries on probabilistic data.

The uncertain top-k method [51] is the first study which attempted to answer top-k queries on probabilistic data. The interpretation of this approach involved both the order of ranking tuples (“top-k”) and the aggregation of their probabilities across possible worlds (“most probable”). These concepts give different possible interpretations of uncertain top-k queries. Firstly, the set of the “top-k tuples” appears together in the “most probable possible worlds”. Secondly, the most probable top-k tuples belong to valid possible worlds. Thirdly, the answer set is the most probable top-k tuples across all possible worlds. To cover these three interpretations, they define uncertain-top-k queries (U-top-k query) as follows:

Definition 11 The uncertain-top-k query (U-Top-k) returns a set of k tuples which is a k-length tuple vector. This k-length tuple vector has the highest aggregated probability across all possible worlds. The k tuples in the k-length tuple vector appear restrictively together in order in the same possible worlds.

For example, given the probabilistic data in Table 1.1, selecting the answer to the U-top-2 query on the profit attribute is demonstrated based on the possible worlds in Table 1.2. The U-top-2 probability of the 2-length tuple vector \(\overrightarrow{(t_1, t_2)}\) is calculated as follows:

\[ pr^{k}_{U\text{-}top}\,\overrightarrow{(t_1, t_2)} = p(w_1) + p(w_2) = 0.0696 + 0.0174 = 0.087 \]

For a similar aggregation of possible worlds, U-top-2 probabilities of the rest of the 2-length tuple vectors [...] \(\overrightarrow{(t_5, t_6)}\) are 0.1624, 0.0232, 0.0174, 0.1704, 0.0426, 0.2272, 0.1704, 0.0568, 0.0426, respectively. Together with 0.087 for \(\overrightarrow{(t_1, t_2)}\), these probabilities sum to one.
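The aggregation in Definition 11 can be sketched by enumerating the possible worlds, taking the k highest-scored tuples of each world as its tuple vector, and summing world probabilities per vector. The sketch rests on several assumptions that are not stated in this section: p4 = 0.4 and p6 = 0.2 are inferred values, the scores are hypothetical and only encode the ranking t1 > t2 > ... > t6, and the helper names `possible_worlds` and `u_topk` are illustrative. It enumerates all worlds, so it is exponential and only suitable for tiny examples.

```python
from collections import defaultdict
from itertools import product

# p1, p2, p3, p5 appear in the text; p4 and p6 are ASSUMED values.
prob = {"t1": 0.29, "t2": 0.3, "t3": 0.8, "t4": 0.4, "t5": 1.0, "t6": 0.2}
rules = [("t2", "t4"), ("t3", "t6")]  # exclusive rules R1 and R2
# HYPOTHETICAL scores: only the ranking order matters here.
score = {"t1": 6, "t2": 5, "t3": 4, "t4": 3, "t5": 2, "t6": 1}


def possible_worlds(prob, rules):
    """Yield (world, probability) for every world with non-zero probability."""
    ruled = {t for rule in rules for t in rule}
    # An independent tuple is either present (p) or absent (1 - p).
    choices = [
        [(t, prob[t]), (None, 1.0 - prob[t])] for t in prob if t not in ruled
    ]
    # An exclusive rule contributes exactly one member, or none of them.
    for rule in rules:
        none_p = 1.0 - sum(prob[t] for t in rule)
        choices.append([(t, prob[t]) for t in rule] + [(None, none_p)])
    for combo in product(*choices):
        p = 1.0
        for _, pt in combo:
            p *= pt
        if p > 1e-12:  # skip impossible worlds (tolerates rounding error)
            yield frozenset(t for t, _ in combo if t is not None), p


def u_topk(prob, rules, score, k):
    """Return the most probable k-length tuple vector and all aggregates."""
    agg = defaultdict(float)
    for world, p in possible_worlds(prob, rules):
        ranked = sorted(world, key=score.get, reverse=True)
        if len(ranked) >= k:
            agg[tuple(ranked[:k])] += p
    return max(agg, key=agg.get), dict(agg)


best, agg = u_topk(prob, rules, score, 2)
print(round(agg[("t1", "t2")], 4))  # 0.087
```

Under these assumptions the aggregate for the vector (t1, t2) matches the value 0.087 computed above.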
