TECHNIQUES FOR INCORPORATING DATA QUALITY
ASSESSMENTS INTO LEARNING ALGORITHMS FOR
BAYESIAN NETWORKS
By
Valerie Kay Sessions
Bachelor of Science, College of Charleston, 2001
Master of Science, University of Charleston, 2002
Submitted in Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy in the
Department of Computer Science and Engineering
College of Engineering and Information Technology
University of South Carolina
2006
UMI Number: 3245436
UMI Microform 3245436. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346
Acknowledgements
I would like to first thank my husband Kip for his love and support, for listening to me and supporting me during both the highs and lows of the past years. I would like to thank my parents, Tommy and Debbie Sessions, for the drive they have instilled in me to begin this research and degree, and for instilling my faith in God, which has allowed me to complete it; my in-laws, Paul and Ann Hooker, for countless hours of babysitting; my grandmother for many dinner meals in Blythewood; my son Austin and new daughter Remley for sleeping every now and then so that I could finish one last page of writing.

I would like to thank my entire committee for their time and patience with this process, especially Marco Valtorta for allowing me to spend countless hours in his office learning and asking endless questions. I would also like to thank Jewel Rogers for all of her help administratively; she made commuting two hours possible. Thanks to all my fellow classmates for your inspiration and code snippets.

Thanks to my girlfriends Chris and Amy for their support and love, and to my Women's Bible Study and church for all of their prayers during this endeavor.

Thank you to my former supervisor Richard Baker for funding this work and allowing me ample time to commute to Columbia, and many thanks to all my former co-workers at SPAWAR for their support (and covering for me) during this process.
Abstract
The field of Bayesian networks (BNs) has had much success in developing structure learning algorithms to learn BNs directly from data. However, research has normally started with the assumption that the data given to the learning algorithm is accurate. This assumption is a naive one and can lead to very biased and unrealistic decision-making frameworks. If we are to use decision-making algorithms to their full potential, we must design them with the capability to account for data quality. Our research lays the foundation for the development of new algorithms that incorporate data quality assessments into traditional BN learning algorithms, specifically the PC algorithm. We begin by reviewing Bayesian networks, learning algorithms, and data quality measures. We then quantify the effect of inaccurate data on the PC algorithm and develop three techniques for incorporating data quality assessments into the algorithm. Results indicate that a technique which modifies the significance level used by the PC algorithm is promising. We show and explain these results and offer guidelines for future research in this area.
Contents

Acknowledgements ...... iii
Abstract ...... iv
Chapter 1 Introduction ...... 1
1.1 Research Motivation ...... 1
Chapter 2 Bayesian Network Basics ...... 3
2.1 Bayesian Network Basics ...... 3
2.2 Learning Algorithms for Bayesian Networks ...... 7
2.2.1 Parameter Estimation ...... 8
2.2.2 Structure Learning ...... 14
2.3 Learning with Uncertain Evidence ...... 21
Chapter 3 Data Quality Measurements ...... 24
3.1 Overview of Data Quality ...... 24
… ...... 33
3.3.1 Inaccuracy Tests ...... 34
3.3.2 Parameter Estimation Results ...... 37
3.3.3 Structure Learning Results ...... 42
3.3.4 Conclusions Regarding Inaccurate Data ...... 45
3.4 Development of Experimental Data Sets ...... 47
Chapter 4 Methods and Results ...... 61
4.1.1 Do Nothing Method ...... 61
4.1.2 Threshold Method ...... 61
4.1.3 The DQ Algorithm ...... 62
4.2 Method Pseudo Code ...... 65
… ...... 67
5.1 PC Algorithm Conclusions ...... 100
5.2 Conclusions Regarding Methods ...... 102
5.2.1 Visit to Asia Conclusions ...... 102
5.2.2 Studfarm Conclusions ...... 104
5.2.3 ALARM Conclusions ...... 104
5.3 Modified DQ Algorithm ...... 105
5.4 Future Research ......
List of Figures

2.3: Convergence of Beta Distribution ...... 11
2.4: Visit to Asia, Two Configurations ...... 16
2.5: Relationship of Virtual Evidence to Prior Probability ...... 22
3.1: Two Sources Report on Event A ...... 28
3.2: T1 Probability Table, 80% Sensitivity, 95% Specificity ...... 28
… ...... 35
3.5: Average Variance From Baseline ...... 37
3.6: Comparison of Learned Potentials ...... 38
3.7: Stud Farm Average Variance from Baseline ...... 41
3.8: Stud Farm Learned Potentials Graph ...... 42
3.9: NPC and CB Structure Learning Results ...... 42
3.10: Incorrect Links Using the NPC Algorithm ...... 43
3.11: Incorrect Links Using the CB Algorithm ...... 44
3.12: Visit to Asia "True" Probability Tables ...... 49
3.13: Studfarm "True" Probability Tables ...... 51
3.14: ALARM "True" Probability Tables ...... 59
3.15: Breakdown of Data Sets ...... 60
4.1: Cow Genetics ...... 62
4.2: Average Degree of Nodes ...... 65
4.3: Visit to Asia, One Data Set, Raw Results ...... 69
4.4: Visit to Asia, One Data Set, Combined Results ...... 70
4.5: Visit to Asia, One Data Set, Zeroed Results ...... 70
4.6: Visit to Asia, Two Data Sets, Raw Results ...... 72
4.7: Visit to Asia, Two Data Sets, Combined Results ...... 74
4.8: Visit to Asia, Two Data Sets, Zeroed Results ...... 76
4.9: Visit to Asia, Three Data Sets, Raw Results ...... 80
4.10: Visit to Asia, Three Data Sets, Combined Results ...... 82
4.11: Visit to Asia, Three Data Sets, Zeroed Results ...... 84
4.12: Studfarm, One Data Set, Raw Results ...... 85
4.13: Studfarm, One Data Set, Combined Results ...... 86
4.14: Studfarm, One Data Set, Zeroed Results ...... 86
4.15: Studfarm, Two Data Sets, Raw Results ...... 87
4.16: Studfarm, Two Data Sets, Combined Results ...... 88
4.17: Studfarm, Two Data Sets, Zeroed Results ...... 88
4.18: Studfarm, Three Data Sets, Raw Results ...... 90
4.19: Studfarm, Three Data Sets, Combined Results ...... 91
4.20: Studfarm, Three Data Sets, Zeroed Results ...... 92
4.21: Visit to Asia, Extra Test, Raw Results ...... 94
4.22: Visit to Asia, Extra Test, Combined and Zeroed Results ...... 94
4.23: Studfarm, Extra Test, Raw Results ...... 95
4.24: Studfarm, Extra Test, Combined and Zeroed Results ...... 95
4.25: ALARM, Extra Test, Raw Results ...... 96
4.26: ALARM, Extra Test, Combined and Zeroed Results ...... 96
4.27: Visit to Asia, Significance Test, One Data Set, Raw Results ...... 97
4.28: Visit to Asia, Significance Test, One Data Set, Combined Results ...... 98
4.29: Visit to Asia, Significance Test, One Data Set, Zeroed Results ...... 98
5.1: Commutativity of Methods ...... 107
Chapter 1 Introduction
1.1 Research Motivation
The field of Bayesian networks (BNs) has had much success in creating algorithms to learn directly from data. There are both parameter estimation and structure learning algorithms that have a high rate of success in creating useful BNs. However, research has normally started with the assumption that the data given to the learning algorithm is accurate and complete. This assumption is a naive one and can lead to very biased and unrealistic decision-making frameworks.

There are numerous examples of faulty data collections, from those as technical as the Hubble Telescope's calibration problems to the more human-centered, such as lying on a credit card information form. If we are to use decision-making algorithms to their full potential, we must design them with the capability to account for data quality. Our research lays the foundation for the development of new algorithms that incorporate data quality assessments into traditional BN learning algorithms.
Chapter 2 Bayesian Network Basics
In order to examine the learning of BNs from data sets of low quality, we must first review the background of the algorithms and processes of BNs. First, we will introduce the basics of probability and statistics, Bayes' law, and evidence propagation. Then we will examine learning algorithms for BNs and discuss those employed for this research, as well as fading and adaptation methods. We will also examine research in data quality and how it can be incorporated into our learning algorithms. Finally, we will review complementary work conducted in the area of revision of parameters based on uncertain evidence.
2.1 Bayesian Network Basics
We will define a BN using graph theory, following [18]. A BN is a Directed Acyclic Graph (DAG) G = (V, E), where V is a set of variables and E is a set of directed edges between these variables, together with a probability distribution, P, over the variables. The pair (G, P) satisfies the Markov condition, which states that each variable is independent of its nondescendants given the set of its parents.

Therefore, following [14], for each variable A with parents B1, ..., Bn, there is a potential table P(A | B1, ..., Bn). If A has no parents, this becomes the unconditional probability P(A).
Each variable's probability is represented by a probability table, which shows its probability based on the states of its parents (or its unconditional probability if it is without parents). All of the basics of probability are relevant to these tables and will be reviewed here.

Andrey Kolmogorov [14] formulated the probability axioms in the 1930s. He proposed that, given an event E in a probability space S, there is a probability P(E) such that
1. 0 ≤ P(E) ≤ 1, and P(E) = 1 if E is a certain event.
2. The sum of the probabilities of all of the events E_i (i = 1, 2, ...) in the sample space S is 1: P(S) = P(E1) + P(E2) + ... = 1.
3. For mutually exclusive events E1 and E2, P(E1 ∪ E2) = P(E1) + P(E2).
Two events may be independent or dependent. Two events are considered conditionally dependent if changing the probability of one also changes the probability of the other. For two events E1 and E2, we say that E1 is conditionally dependent on E2 if the probability of E1 given E2 does not equal the probability of E1, that is, P(E1|E2) ≠ P(E1). Conditional probability requires that

P(E1 ∩ E2) = P(E1) P(E2|E1).

Using this rule and the commutative property of sets, P(E1 ∩ E2) = P(E2 ∩ E1), we have
P(E1|E2) = P(E1) P(E2|E1) / P(E2).
From this formula we can understand the statistical procedure used in BNs. We begin with a prior probability of an event occurring, P(E1) and P(E2). If these events are dependent upon each other, we can determine their conditional probabilities P(E1|E2) or P(E2|E1) in the following way. When evidence is collected about one of the variables, say E1, we determine the likelihood of the other event E2 given E1, that is P(E2|E1), multiply this by the prior probabilities, and obtain a posterior probability based on the evidence.
We will use the canonical Visit to Asia example to illustrate these ideas. In this example we are trying to determine if a person has visited Asia based on some visible signs of illness. The structure of the BN is shown in Figure 2.1.

Figure 2.1: Visit to Asia
In order to understand Bayes' rule, we will review one example of the propagation of evidence. We will label the node Has Tuberculosis "H" and the node Visit to Asia "A." Because H has parent A, we have the prior probability table P(H|A):

Visit to Asia?        yes     no
P(H = yes | A)        0.05    0.01

This means that in the general population (according to a Subject Matter Expert) about 1 in 100 people have visited Asia, and if a person has visited Asia this raises his chances of having tuberculosis by 4 percentage points, from 1% to 5%. Now, if we obtain evidence that the person has visited Asia, P(A = yes) = 1, we can look up P(H|A = yes) in the table and see that the chances of contraction of tuberculosis are now 5%. The more interesting case, however, is if we have evidence that the person has tuberculosis. We can propagate this to the parent A in the following manner.
There are two probabilities that will change: P(A = yes) and P(A = no). Using Bayes' rule we have

P(H | A) = P(A | H) P(H) / P(A).

Solving for P(A | H = yes), we obtain

P(A | H = yes) = P(H = yes | A) P(A) / P(H = yes).

To obtain P(H) we must solve for the joint distribution table. Note that we do not need to solve for P(H = no): because we have evidence that H = yes, P(H = no) = 0.
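The propagation just described can be checked numerically. The following sketch uses plain Python (no BN library) and the probabilities quoted in the example: P(A = yes) = 0.01, a 5% chance of tuberculosis after a visit to Asia, and a 1% chance otherwise.

```python
# Propagating the evidence H = yes back to the parent A.
p_a = {"yes": 0.01, "no": 0.99}            # prior P(A): about 1 in 100 visited Asia
p_h_given_a = {"yes": 0.05, "no": 0.01}    # P(H = yes | A)

# Marginal P(H = yes) from the joint distribution (law of total probability)
p_h_yes = sum(p_h_given_a[a] * p_a[a] for a in p_a)

# Bayes' rule: P(A = yes | H = yes)
p_a_given_h = p_h_given_a["yes"] * p_a["yes"] / p_h_yes
print(round(p_h_yes, 4))      # 0.0104
print(round(p_a_given_h, 4))  # 0.0481
```

So evidence of tuberculosis raises the belief that the person visited Asia from 1% to about 4.8%.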
2.2 Learning Algorithms for Bayesian Networks
There are many different ways to create BNs. One method is to create them subjectively, by interviewing Subject Matter Experts (SMEs) for their opinions regarding the probability of outcomes. We then use their opinions and previous work in the field to create a network to represent the relationships between variables. This is often done in sample BNs such as Visit to Asia, or in larger networks where the domain is well known.

In other cases we can use domain data to learn the networks. There are two main categories for this type of learning: parameter estimation and structure learning. For parameter estimation, we must begin with a known BN structure. That is, we know what nodes are of interest and we know how these nodes are connected to one another by a set of directed edges. We also know the states that are possible for each node. For example, in our Visit to Asia example, we would have the structure of the BN as represented in Figure 2.1, and we would know the possible states of the nodes, namely that Visit to Asia has states 'yes' and 'no'. What we would not know is the prior probabilities of these states, for instance that about 1 in 100 people have visited Asia. This we would seek to discover from our data.
For structure learning, even less is known about the Bayesian network. We know what nodes we can work with and their states, but we do not know how they are related to one another or their prior probabilities. Therefore we learn both the structure and the probabilities from our data. We may also learn in a hybrid of these methods: we may know some of the interconnections between nodes, or some of their probabilities, but may need to learn other structures and parameters. Each of these methods is discussed in further detail below.
2.2.1 Parameter Estimation
When computing the probability of a state of a node, we think in terms of a relative frequency of the event occurring. If all frequencies are equally likely, then the relative frequency follows a uniform distribution function. However, in most cases we do not believe that all relative frequencies are equally probable; this is how we can actually predict things, when one thing is more probable than another, or occurs more frequently than another. In order to describe this we will use the usual beta density function to describe the probability.

Neapolitan [18] describes this function:
The beta density function with parameters a, b, N = a + b, where a and b are real numbers > 0, is

ρ(f) = [Γ(N) / (Γ(a)Γ(b))] f^(a−1) (1 − f)^(b−1),   0 ≤ f ≤ 1,

where Γ is the gamma function, with Γ(x) = (x − 1)! if x is an integer ≥ 1.

A random variable F that has this density function is said to have a beta distribution. We refer to the beta density function as beta(f; a, b).
We see that this function becomes more concentrated around the relative frequency as the values of a and b become larger; that is, when we have more evidence supporting the relative frequency, our density becomes more fixed at that point. At the point where this relative frequency is most pronounced, we use the expected value of F, E(F) = a/N. This is the expected relative frequency that we use as our estimate of a prior probability for a state of the node.
We can use this beta distribution to continually revise our estimate of the relative frequency. If we have a new set of data and wish to add it to our prior beta distribution, we can do so in the following manner. Suppose we have a data set, d, with two counts, s and t, that correspond to the a and b states of our prior beta distribution. If M = s + t, then the updated density is beta(f; a + s, b + t).
Further conditions must be satisfied in order to accurately update the distributions, but as this will not be a concern in our research, we will not explain them here. These issues are discussed in more detail in [18].

Not only do we wish to compute a beta distribution and expected value, but we also wish to know with what certainty we are assigning this expected value. These calculations are especially useful in establishing thresholds for what will be considered a "correct" or useful expected value. We can solve for a normalized percent probability interval for E(F) using the normal approximation. This is obtained by the following range:
(E(F) − z_perc σ(F), E(F) + z_perc σ(F)),
where σ(F) is the standard deviation of F, and z_perc is the z-score obtained from the normal table. Using this formula we can obtain a 95% (or any percentage we choose) probability interval for the expected value, or estimate of the relative frequency.
We will use a small example to illustrate the parameter estimation techniques mentioned here. First, we will use the small data set shown in Figure 2.2 to learn the prior probability of having tuberculosis, node H from our previous example.

Figure 2.2: A data set of ten records with columns "Travel to Asia" and "Positive for TB" (individual rows not reproduced here)

We will find the probability of A = yes and H = yes, P(H = yes | A = yes), by our expected value calculations. We will let a be the number of instances where both A and H are yes, and b the number of instances where they are not: N = 10, a = 2, b = 8.
Our beta distribution is therefore beta(f; 2, 8).

Figure 2.3: Convergence of Beta Distribution

As is visible from the graph, the density function begins to converge around 0.15 to 0.2. With more data, if indeed the frequency is 0.2, the mass would concentrate more around 0.2. Because we must use a single number as our prior probability, we will use the expected value, E(F), as our estimate of the relative frequency. In this example the expected value is calculated as

E(F) = a/N = 2/10 = 0.2,

and the variance as ab / (N²(N + 1)) = 16/1100 ≈ 0.0145. The standard deviation is the square root of this, 0.1206. To obtain the normal approximation to the 95% probability interval we use the z-score 1.96 and calculate the interval

(0.20 − (1.96 × 0.1206), 0.20 + (1.96 × 0.1206)) = (0, 0.436).

This is a large variance; however, we have only ten data records, so this is to be expected. As we add data to this probability density, the standard deviation should grow smaller if we encounter a greater probability that the two nodes are related.
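The expected value, standard deviation, and normal-approximation interval above can be computed with a small helper. The function name `beta_estimate` is a hypothetical choice for this sketch, and truncating the interval to [0, 1] mirrors the example's lower bound of 0.

```python
import math

def beta_estimate(a, b, z=1.96):
    """Expected value, standard deviation, and normal-approximation
    probability interval for a beta(f; a, b) density."""
    n = a + b
    e = a / n                                    # E(F) = a / N
    sd = math.sqrt(a * b / (n ** 2 * (n + 1)))   # sqrt of beta variance
    lo, hi = e - z * sd, e + z * sd
    return e, sd, (max(lo, 0.0), min(hi, 1.0))   # truncate to [0, 1]

e, sd, interval = beta_estimate(2, 8)
print(e)                                         # 0.2
print(round(sd, 4))                              # 0.1206
print(round(interval[0], 3), round(interval[1], 3))  # 0.0 0.436
```

This reproduces the (0, 0.436) interval of the ten-record example.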
Note that these algorithms assume that all of the data is present for the calculations and do not take into account missing data. In large and disparate datasets it is almost inevitable that there will be missing data; therefore we need to account for this problem. There are many ways to handle missing data. One is to ignore those records that do not have complete data. In large data sets with only a few missing data pieces this may be a sufficient method; however, if there is a small data set, we may need to account for this missing data. One method for doing this is to simply guess what the missing data piece is. A more mathematical and less subjective approach is to use Expectation-Maximization (EM) algorithms, in which we use the maximum likelihood value of the missing data item. To achieve this we first estimate the probabilities of the missing data. We can simply divide the likelihood out evenly to begin: for a node with two states, each state would have equal probability of occurrence, (0.5, 0.5). Then we use this initial probability in the likelihood equation
L(p = 0.5 | D) = [N! / (α!(N − α)!)] p^α (1 − p)^(N−α),
where N is the total number of cases, α is the number of cases in the particular state we are investigating, and D is the data set we are looking at. After we determine an initial likelihood, we choose a new probability, say 0.52, and determine the likelihood of this as the probability. We would continue to run this algorithm until we determined a most likely probability, and then use this as our missing value. This and other algorithms for dealing with missing data are explained in detail in [8].
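The likelihood-scanning procedure just described can be sketched as a simple grid search. The function names are illustrative, and a full EM implementation would alternate expectation and maximization steps rather than scan a grid; this sketch only shows the "try candidate probabilities, keep the most likely" idea.

```python
import math

def log_likelihood(p, alpha, n):
    """log L(p | D) for alpha cases in the state of interest out of n cases,
    i.e. the binomial likelihood from the text, taken in log form."""
    return (math.lgamma(n + 1) - math.lgamma(alpha + 1) - math.lgamma(n - alpha + 1)
            + alpha * math.log(p) + (n - alpha) * math.log(1 - p))

def most_likely_p(alpha, n, steps=1000):
    """Scan candidate probabilities (0.5, then 0.52, ... in the text; a uniform
    grid over (0, 1) here) and keep the one with the highest likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda p: log_likelihood(p, alpha, n))

print(round(most_likely_p(2, 10), 2))  # 0.2, the maximum-likelihood estimate
```

As expected, the scan recovers the maximum-likelihood estimate α/N.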
There is also the possibility that the data is not missing at random, but that the absence of data depends on the state of other variables. In this case we cannot use the EM algorithm but must take into account the missing data's non-random nature. There are a variety of methods that can be used to solve this problem. One involves deleting the rows of missing data, as in the missing-at-random case. This may work when small amounts of data are missing; however, with large percentages of missing data this will incur biases in the models. Other methods involve adding a new state for the variable, "unknown." While this may work well in cases where having missing data is part of the data, such as when last known addresses are missing in a police criminal record, we may be counting a non-existent data type, and this may also lead to biases in our data. A third method involves replacing the missing values with an estimate of their value, known as imputation. Two such methods are explained in [25]: mean imputation and hot deck imputation. Mean imputation involves taking the mean of the non-missing variables and using this value as a replacement for the missing variable. In hot deck imputation we find similar cases and substitute the missing values with the corresponding values in the similar cases. All of these methods can incur biases, and we must be diligent in deciding what method works best for our domain. In our research, it will not prove necessary to deal with missing data, as the crux of the research is in determining how to incorporate inaccurate data into our models. It is important, however, to understand that missing data does affect the learning algorithms, and that there are methods available for correcting for it.
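Mean and hot deck imputation can be sketched as follows. The record layout and the equality-based notion of "similar cases" are simplifying assumptions made for illustration; real hot deck schemes use richer similarity measures.

```python
def mean_impute(rows, col):
    """Replace missing (None) numeric values in `col` with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{col: mean if r[col] is None else r[col]}) for r in rows]

def hot_deck_impute(rows, col, match_cols):
    """Replace a missing value with the value from a 'similar' complete case,
    similarity here meaning equality on the match_cols fields."""
    donors = [r for r in rows if r[col] is not None]
    out = []
    for r in rows:
        if r[col] is None:
            match = next((d for d in donors
                          if all(d[c] == r[c] for c in match_cols)), None)
            r = dict(r, **{col: match[col] if match else None})
        out.append(r)
    return out

rows = [{"age": 30, "city": "A"}, {"age": None, "city": "A"}, {"age": 40, "city": "B"}]
print(mean_impute(rows, "age")[1]["age"])                # 35.0 (column mean)
print(hot_deck_impute(rows, "age", ["city"])[1]["age"])  # 30 (donor from city A)
```

Note how the two methods fill the same gap differently, which is exactly the source of the biases discussed above.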
2.2.2 Structure Learning
When the structure over the nodes is not known, we must learn both the interdependencies among the data as well as the prior probabilities; this is known as structure learning. There are two main approaches to this learning: search-and-score methods and constraint-based learning. First we will explain the basics of the common search-and-score algorithms. In theory, learning can be achieved by creating all of the possible DAGs for the nodes, and then scoring them to determine which DAG best fits the data. For a large number of nodes this is infeasible; Chickering [3] proved that finding an optimal solution is NP-hard, and therefore we must use heuristic searches to find suitable DAGs. First we use algorithms to generate a class of DAGs, and then we score those DAGs and determine the most probable structure. We will present the K2 algorithm, developed by Cooper and Herskovits [5], which is one popular search-and-score algorithm.
To score a DAG structure S given data D, we use the Cooper-Herskovits scoring function:

P(D | S) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ (r_i − 1)! / (N_ij + r_i − 1)! ] ∏_{k=1}^{r_i} α_ijk!,

where n is the number of nodes, q_i is the number of configurations of the parents of node i, r_i is the number of states for node i, α_ijk is the number of cases in the data set D in which node i is in state k with its parents in configuration j, and N_ij = Σ_k α_ijk.
In the K2 algorithm, we wish to maximize this score. We assume a set of nodes X1 to Xn ordered such that if a node Xj appears before another node Xi in the ordering, that is j < i, there is no arc from Xi to Xj. We start with a node, say Xi, and set its parent set to empty. Then we visit all of the predecessors of Xi and determine the predecessor that most increases the score P(D|S). We greedily add this node to the parent set of Xi and continue until the addition of a parent does not increase the score. Then we proceed through the node list.
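The greedy procedure just described can be sketched as follows. The score is the Cooper-Herskovits function in log form (to avoid overflowing the factorials); the tuple-based data representation and the `max_parents` bound are choices of this sketch, not part of the original description.

```python
import math
from itertools import product

def log_ch_score_node(data, i, parents, states):
    """Log of the Cooper-Herskovits term for node i with the given parent set.
    data: list of tuples of state indices; states[i] = number of states of node i."""
    r = states[i]
    parent_configs = list(product(*[range(states[p]) for p in parents]))
    score = 0.0
    for cfg in parent_configs:
        counts = [0] * r
        for row in data:
            if all(row[p] == v for p, v in zip(parents, cfg)):
                counts[row[i]] += 1
        n_ij = sum(counts)
        score += math.lgamma(r) - math.lgamma(n_ij + r)   # (r-1)! / (N_ij + r - 1)!
        score += sum(math.lgamma(c + 1) for c in counts)  # product of alpha_ijk!
    return score

def k2(data, order, states, max_parents=2):
    """Greedy K2: for each node, repeatedly add the predecessor in the ordering
    that most increases the score, stopping when no addition helps."""
    dag = {}
    for pos, i in enumerate(order):
        parents = []
        best = log_ch_score_node(data, i, parents, states)
        while len(parents) < max_parents:
            scored = [(log_ch_score_node(data, i, parents + [p], states), p)
                      for p in order[:pos] if p not in parents]
            if not scored or max(scored)[0] <= best:
                break
            best, p = max(scored)
            parents = parents + [p]
        dag[i] = parents
    return dag

data = [(0, 0)] * 5 + [(1, 1)] * 5   # node 1 always copies node 0
print(k2(data, order=[0, 1], states=[2, 2]))  # {0: [], 1: [0]}
```

On this strongly dependent toy data, K2 correctly adds node 0 as the parent of node 1.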
While K2 is a simple algorithm, it is powerful and can be modified to require no initial ordering of nodes. We can also use other algorithms to create an initial ordering of the nodes, as was done by Singh and Valtorta [26]. That research used the CB algorithm to create the initial ordering. The CB algorithm starts with a set of nodes and creates a network where all nodes are connected to one another with undirected edges. The algorithm then determines the edges that are independent and deletes those edges. Once the final set of edges is determined, the algorithm orients the edges and creates a final ordering of the nodes. After orientation of the edges, the ordering is passed to the K2 algorithm for a final determination of a structure.
To illustrate the K2 algorithm, we will use a small part of our Visit to Asia example: the two nodes Visit to Asia, A, and Has Tuberculosis, H. We will use the small record set in Figure 2.2 and illustrate the structure learning process. In the true K2 algorithm we would need an ordering of the nodes, {A, H}, such that A can be the parent of H, but H cannot be the parent of A. We can use other constraint-based algorithms to attain an ordering of nodes prior to running the K2 algorithm, as was done in [26] with the CB algorithm; however, in this example we will simply use {A, H}, as the ordering is not important for our illustration.
With this ordering of the nodes, we have two possible configurations for the BN structure; both are shown in Figure 2.4.

Figure 2.4: Visit to Asia, Two Configurations

Node A can be the parent of H, or the two nodes can be independent. We use the Cooper-Herskovits function to score the configurations.
With no prior knowledge of the data we will consider the initial probability tables as A = (0.5, 0.5) and H = (0.5, 0.5). Because of the requirement that inputs to the gamma function be integers, we will consider the joint distribution as

P(A = Y, H = Y) = 0.25
P(A = Y, H = N) = 0.25
P(A = N, H = Y) = 0.25
P(A = N, H = N) = 0.25

and determine that each line in the distribution represents a prior data set. We will use the data set in Figure 2.2 as our example set. This will allow us to score our two configurations: that the nodes are independent, or that node A is the parent of H. Using the Cooper-Herskovits score for the independent configuration we obtain
P(G | D) = …
Based on the data, having A as the parent of H does not increase the score; therefore we would conclude that the nodes are independent based on this data. With all of the data we would conclude that the two nodes are indeed related, as in the second configuration; however, from this data set we conclude that the two are independent.
There are a variety of other methods for learning structure, which we will mention here. First, there are several scoring metrics that are similar to the K2 score, including the BDe metric of Heckerman, Geiger, and Chickering [11], which operate in the same way as other search-and-score algorithms. There are also other useful methods for learning structure with missing data, including a Monte Carlo method called Gibbs sampling. Consider a Markov chain with states X = {x1, x2, ..., xN} and transitions S = {1, 2, ..., N} between these states. We will have a set of transition probabilities p_ij of moving from one state to the next, which sum to 1. We must also have a chain that is ergodic, that is, every state is reachable from every other state. We have an initial configuration of the states, say (x1^0, x2^0, ..., xN^0). Gibbs sampling works in the following way. We sample each variable in X using the current state of the other variables as constraints for the value of that variable. We do this continually, each iteration representing a transition in the Markov chain. The premise of Gibbs sampling is that if this process is repeated a large enough number of times, we obtain a probability close to the true distribution P. If we are confident that this is the case, then we can estimate a missing value by taking the mean of the sampled data. This is a common technique for estimating with missing data, but it can become computationally intensive for a large sample. Techniques for large samples include the Laplace approximation, the Bayesian Information Criterion (BIC) score, and the Cheeseman-Stutz approximation (CS score). These are given an overview in [18] and will not be explained further.
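A toy version of the sampler for two binary variables follows, estimating a probability as the mean of the retained samples as described above. The conditional distributions, burn-in length, and seed are illustrative choices for this sketch.

```python
import random

def gibbs_mean(p_x_given_y, p_y_given_x, iters=20000, burn_in=2000, seed=0):
    """Two-variable Gibbs sampler over binary X and Y: repeatedly resample each
    variable from its conditional given the current value of the other, then
    estimate P(X = 1) by the mean of the retained samples."""
    rng = random.Random(seed)
    x, y = 0, 0                  # initial configuration (x^0, y^0)
    total, kept = 0, 0
    for t in range(iters):
        x = 1 if rng.random() < p_x_given_y[y] else 0  # sample X | Y = y
        y = 1 if rng.random() < p_y_given_x[x] else 0  # sample Y | X = x
        if t >= burn_in:         # discard early, pre-convergence samples
            total += x
            kept += 1
    return total / kept

# With P(X=1 | Y) fixed at 0.3, the chain's mean should settle near 0.3
estimate = gibbs_mean({0: 0.3, 1: 0.3}, {0: 0.5, 1: 0.5})
print(round(estimate, 2))  # close to 0.3
```

The burn-in discard and long run length correspond to the "repeated a large enough number of times" condition in the text.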
Constraint-based learning algorithms work differently and use conditional independence (CI) tests to create a useful DAG. These types of algorithms will be used in our research. They work by creating a fully connected DAG and then deleting links that are conditionally independent. One way to determine conditional independence is through d-separation tests. Because d-separated events are normally also conditionally independent events, these tests can be useful for CI algorithms. Two nodes A and B are d-separated if changes in the certainty of A have no impact on the certainty of B. More formally, nodes A and B are d-separated given C if P(A|B,C) = P(A|C). There are several efficient algorithms for determining d-separation of nodes, including Bayes-Ball, developed by Ross Shachter [24]. This algorithm explores a network and marks those nodes that are d-connected to the starting variable. When the algorithm terminates, those nodes that are not marked are d-separated from the starting variable. The algorithm works as follows. There is a "ball" that passes through the structure. Three moves can be made as the ball traverses the structure: it may pass through, bounce back, or be blocked. Whether a node has evidence, and whether the ball is passed from a parent to a child or from a child to a parent, affects the move that is made. Below is a list of these moves.
1. From a parent to a child with no evidence: the child is marked and the ball passes through to all of the child's children.
2. From a child to a parent with no evidence: the parent is marked and the ball bounces back.

The algorithm terminates when the ball stops moving through the structure. All marked nodes are d-connected to the source node, and all unmarked nodes are d-separated and are therefore probabilistically independent of it given the evidence.
The NPC and PC algorithms are two CI algorithms used by the Hugin™ Decision Engine, which will be used in our research. These algorithms use a form of the cross-entropy score called G² to determine independence of nodes [28]. If X and Y are random variables with joint probability distribution P and sample size m, then the G² score is defined as:

G² = 2m Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ].
The NPC and PC algorithms work as follows:

1. Find a graph pattern (most often a complete graph is constructed over the nodes).
2. For every adjacency set of a node, test for independences (using the G² score) and delete those links which are independent.
3. Orient the remaining links.

Once all of the independent edges are removed, the algorithm orients the edges in a series of steps:
A Head to Head links: If there exists three nodes _X, Y, Z, such that
X —Z-Y are connected, and X and Y are independent given a set not containing Z, then orient the nodes as ¥ > Z<Y
B. Remaining links: Three more rules govern the remaining links; each uses the assumption that all head-to-head links have already been discovered.
i. If there exist three nodes X, Y, Z that are connected as X → Z − Y, and X and Y are not connected, then orient Z − Y as Z → Y.
ii. If there exist two nodes X, Y that are connected X − Y, and there exists a directed path from X to Y, then orient X − Y as X → Y.
iii. If there exist four nodes X, Y, Z, W that are connected X − Z − Y, X → W, Y → W, and Z − W, and X and Y are not connected, then orient Z − W as Z → W.
C. If any links remain, these are oriented arbitrarily, avoiding cycles and head-to-head links.
After orienting the links, the algorithm is complete.
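Step A can be sketched as follows, assuming the separating sets found during the independence tests were recorded. This is a hypothetical helper for illustration, not Hugin's code.

```python
def orient_colliders(adj, sepset):
    """Head-to-head rule: for each non-adjacent pair (X, Y) with a common
    neighbour Z, orient X -> Z <- Y whenever Z is absent from the set
    that rendered X and Y independent.

    adj: dict node -> set of undirected neighbours.
    sepset: dict frozenset({X, Y}) -> separating set found for that pair.
    Returns the set of directed edges (tail, head).
    """
    directed = set()
    nodes = sorted(adj)
    for x in nodes:
        for y in nodes:
            if x >= y or y in adj[x]:
                continue  # skip duplicates and adjacent pairs
            for z in adj[x] & adj[y]:  # common neighbour of X and Y
                if z not in sepset.get(frozenset({x, y}), set()):
                    directed.add((x, z))
                    directed.add((y, z))
    return directed
```

For A − B − C with A and C independent given the empty set (which excludes B), the rule orients A → B ← C.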
There are a few problems with the PC algorithm that NPC attempts to correct, mainly those caused by small data sets, which can indicate that two variables are independent even when this is not theoretically possible. In the Hugin™ implementation of NPC, this is handled simply by allowing the user to orient any ambiguous cases herself. For another implementation of NPC, see [29].
2.3 Learning with Uncertain Evidence
A topic related to our research involves the revision of belief models using uncertain evidence. Chan and Darwiche [2] have presented a summary of the two main schools of thought regarding probability updating with uncertain evidence: Jeffrey's Rule and the Virtual Evidence Method. Informally, these can be classified as the "All things considered" and "Nothing else considered" methods. Jeffrey's Rule uses the construct of probability kinematics to minimize belief change upon new evidence. Suppose we have two distributions that disagree on a particular event, for example the color of a piece of cloth, P(c) and P′(c), but agree on the relationship of that event to some other event, such as the piece of cloth being sold, P(s|c). We can then apply Jeffrey's Rule and obtain a new distribution:
Pr′(s) = Σ_c Pr(s | c) · Pr′(c), where Pr(s | c) = Pr(s, c) / Pr(c).
Jeffrey's Rule is considered an "All things considered" method because we take the new probability distribution as the new "truth" about the event and update accordingly.
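As a minimal sketch of Jeffrey's Rule (the cloth colors and probabilities below are hypothetical illustration values, not data from this work):

```python
def jeffrey_update(p_s_given_c, p_new_c):
    """Jeffrey's rule: Pr'(s) = sum over c of Pr(s | c) * Pr'(c).

    p_s_given_c: dict colour -> Pr(sold | colour), unchanged by the update.
    p_new_c:     dict colour -> Pr'(colour), the new evidence distribution.
    """
    return sum(p_s_given_c[c] * p_new_c[c] for c in p_new_c)

# Hypothetical numbers: chance the cloth sells given its colour,
# and the revised belief about the colour after seeing the cloth.
p_sold_given_colour = {'green': 0.4, 'blue': 0.4, 'violet': 0.8}
p_new_colour = {'green': 0.7, 'blue': 0.25, 'violet': 0.05}

p_sold = jeffrey_update(p_sold_given_colour, p_new_colour)  # 0.42
```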
Conversely, the Virtual Evidence Method defines a relationship between uncertain evidence on an event and the prior probability distribution for that event. If we have a node C upon which we receive new evidence η, we define a new distribution

Pr(c | η) = λ_c · Pr(c) / Σ_{c′} λ_{c′} · Pr(c′),

where λ_c ∝ Pr(η | c). This recasts the uncertainty in the evidence as a likelihood function. Graphically, this would create a belief network with the structure of Figure 2.5.
Figure 2.5 Relationship of Virtual Evidence to Prior Probability
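The Virtual Evidence update can be sketched directly from the formula above; the prior and likelihood ratios below are made-up illustration values.

```python
def virtual_evidence_update(prior, likelihood):
    """Pr(c | eta) = lambda_c * Pr(c) / sum_c' lambda_c' * Pr(c'),
    where lambda_c is proportional to Pr(eta | c).

    prior:      dict state -> Pr(state).
    likelihood: dict state -> lambda for that state (any common scale).
    """
    z = sum(likelihood[c] * prior[c] for c in prior)  # normalizing constant
    return {c: likelihood[c] * prior[c] / z for c in prior}

# Hypothetical: the virtual evidence is three times as likely if C holds.
posterior = virtual_evidence_update(
    {'c': 0.4, 'not_c': 0.6},
    {'c': 3.0, 'not_c': 1.0},
)  # posterior['c'] = 2/3
```

Note that only the ratio of the lambdas matters ("Nothing else considered"): scaling both likelihood entries by a constant leaves the posterior unchanged.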
While Chan and Darwiche [2] show that it is possible to convert between Jeffrey's Rule and the Virtual Evidence Method, the two schools of thought highlight the importance of understanding how evidence is being presented: whether in an "All things considered" methodology or in a Virtual Evidence, "Nothing else considered," methodology. Vomlel [35] reiterates that both methods are useful, but that we must determine which method best fits our evidence and tailor our revisions and our evidence collection methods to that methodology.
Now that we have reviewed the basics of BNs, we will explore data quality in Chapter 3 and begin the baseline work for our experiments.
Chapter 3: Data Quality Measurements
3.1 Overview of Data Quality
Much research has been conducted in the field of data quality over the last 30-40 years of data collection. This is largely a result of business and industry's continued reliance on their collected data to influence business decisions. Decision makers want to be assured that their decisions are based on sound and accurate data. There has also been much concern in recent years about the accuracy of the data stored in our criminal records systems. A study conducted in the 1980s for the Office of Technology Assessment discovered vast problems with the Federal Bureau of Investigation (FBI)'s databases [15]. This study analyzed the National Crime Information Center's Computerized Criminal History (NCIC-CCH) database, an online file of approximately 2 million records of arrests, court dispositions, and sentencing. The study found that approximately half of those data records contained some data quality problem, ranging from incompleteness to inaccuracy. It is apparent from this and other studies that our data quality is in question and that our machine learning algorithms must take these discrepancies into account.
Assessing the quality of our data is a difficult process, rife with subjectivity. Researchers have been attempting for some time to develop quantitative metrics that accurately judge the quality of data. The metrics they have developed range from the subjective (value-added and understandability) to those that are more easily quantifiable (completeness and timeliness). While some researchers have defined up to sixteen different metrics [22], we will concentrate on a core set of four in our discussion, distilled from [37]: accuracy/precision, completeness, consistency/believability, and timeliness.
Accuracy and precision are taken from their common scientific meanings. Accuracy represents how close a measurement, or data record, is to the real-world situation that it measures. Precision can refer to two similar ideas. First, it can refer to the standard deviation or variation in a numerical data record with multiple readings. As an example, if a weather sensor is calibrated before its use to have a variation of ±0.01°C, we would use this calibrated range as the precision of the instrument. The second use of precision is to quantify the degree to which a sensor, or other data input, gives the same data result for a given real-world situation. To explain the difference between accuracy and precision, we can think of a shooting range with two gunmen. If one gunman shoots three shots into the same hole in the target, he may be very precise, yet inaccurate if that hole is not the bull's-eye. In contrast, the second gunman may hit the bull's-eye two out of three times, but on the third shot he shoots into the woods. In this case the gunman is accurate on two of the three shots, yet imprecise, as he cannot repeat his shot each time. Of course, we would wish our data to be both accurate and precise. While accuracy is a metric that must be calculated over time, and with much diligence to discover discrepancies between what is contained in the real world and what is represented in the data record, precision can more easily be calculated from the data at hand.
Completeness refers to the amount of missing data. It can be computed as the total number of missing data points, or as the total number of rows containing missing data; each can be useful in research. This metric is easily calculated from the data at hand and requires no subjective input from the user or data manager. Completeness has been well researched, and methods such as Expectation Maximization (EM) and Gibbs Sampling handle the problem effectively (see Section 2.2.1). Large amounts of missing data can also point to problems in the data collection method and the data collection tools.
Timeliness is also important in the context of our data. Outdated data is often useless in many fields of study, such as weather data records, or stock market data that must be used to make buying/selling decisions in seconds. In other fields, however, the timeliness of data is not as crucial: if historical data is being used to track consumer spending over the last decade, having data within minutes of the real-world event is not as important. This metric is therefore at once both subjective and objective. If the data is time-stamped, it is not difficult to determine the time difference between the entry and the present time. However, we must allow a Subject Matter Expert to guide the program as to how current the data should be for a particular purpose.
Lastly, data can be judged on its consistency or believability. These metrics are similar to precision, except that they apply to data from different data sets. Where precision applies to one particular data source (a particular sensor or manually entered data source), consistency/believability applies to multiple data sets reporting on the same real-world situation. If we have three sensors reporting weather information from one location, we can measure the consistency of the data for that region as the deviation between the data records from each sensor. Similarly, if we have eyewitness accounts at the scene of a crime, we can judge the consistency of the data set by the similarity of the accounts. This metric is computed separately from the accuracy and precision calculations and gives no weight to one data source over another. If there are three data sources, we will calculate one consistency metric for that data record which takes into account differences among all the data sets, without declaring which data source is most accurate. This allows us to account for differences in data without knowing the accuracy of the data sets. This may seem faulty, and indeed, if we have the resources to judge the accuracy of the data sets, then we can assign data quality based on those findings. However, consistency gives us a metric for determining that there are indeed differences among data sources, without tracing the data sets back to their sources and having to exhaustively determine which data set is most accurate.
While all of the elements of data quality are important, studying all of them and their combined effects would be time prohibitive. Therefore, after studying the literature and conducting our own experiments (see Section 3.3), we have determined to limit our study to the element of data inaccuracy. This element was chosen because it lends itself well to measurement and experimentation and has not yet been as thoroughly researched and reported in the literature.
While the effects of inaccuracy on learning algorithms have not been well studied or documented in the literature, complementary work done by Vomlel [35] in the area of evidential update with uncertain evidence lends itself well to use in our research. Vomlel defines accuracy as
P(A = T) = (tp + tn) / (tp + tn + fp + fn),
where T is a data source, A is the event T is reporting on, tp is the number of true positive data points, tn is the number of true negative data points, fp is the number of false positives, and fn is the number of false negatives. He further defines two criteria, sensitivity and specificity, that are important for determining how data sources should update the probability table of A. Sensitivity is the test's true positive rate, tp / (tp + fn), and specificity is its true negative rate, tn / (tn + fp). Figure 3.1 gives a graphical example of how the two data sources report evidence for event A.
Figure 3.1: Two Sources Report on Event A
Figure 3.2 shows the relationship between T1 and A when T1's sensitivity is 80% and specificity is 95%.
Figure 3.2: T1 Probability Table — 80% Sensitivity, 95% Specificity
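Vomlel's accuracy, sensitivity, and specificity, and the conditional probability table of Figure 3.2, can be sketched from confusion counts as follows; the counts and helper names are our own illustration, not Vomlel's code.

```python
def source_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

def cpt_from_rates(sensitivity, specificity):
    """Conditional probability table P(T | A), keyed as cpt[A][T].

    An 80%/95% source reports T=True with probability 0.80 when A is
    true and with probability 0.05 (a false positive) when A is false.
    """
    return {
        True:  {True: sensitivity, False: 1 - sensitivity},
        False: {True: 1 - specificity, False: specificity},
    }
```

For hypothetical counts tp=80, tn=95, fp=5, fn=20, this gives accuracy 0.875, sensitivity 0.80, and specificity 0.95, matching the Figure 3.2 source.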
While Vomlel uses these metrics for evidential update, we are evaluating the BNs generated using data sources of low accuracy. So while evidential update uses this idea when updating a belief based on uncertain evidence:

P(A = yes | T = yes) = c · P(T = yes | A = yes) · P(A = yes),

we are exploring the probability of learning a BN, G, with data of a certain accuracy, or

P(G | A = T).
This method can also be extended to our other data quality measures. In order to clarify our definitions of completeness, consistency/believability, and timeliness, we will formalize them as Vomlel has done with accuracy.
We shall define completeness, the percentage of data that is not missing, as

PercentComplete = C% = n_c / (n_c + n_m),

where n_c is the number of complete data points and n_m is the number of missing data points. Similarly, consistency/believability shall be the square root of the sample variance of the data sources reporting:

variance = (1 / (N − 1)) · Σ_{i=1}^{N} (x_i − x̄)²,

where N is the number of data sets reporting, x_i is the data record from the i-th data source reporting on the event, and x̄ is the mean of the data points.
Timeliness can also be represented more formally using this method. Assuming a time standard t, where n_t counts the data points whose age is at most t and n_u counts those whose age exceeds t, timeliness can be represented as

PercentTimely = T% = n_t / (n_t + n_u).
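The three formal definitions above can be sketched directly; the helper names are our own.

```python
import math

def percent_complete(n_complete, n_missing):
    """C% = n_c / (n_c + n_m)."""
    return n_complete / (n_complete + n_missing)

def consistency(readings):
    """Square root of the sample variance of the sources' readings
    for one real-world event (lower means more consistent)."""
    n = len(readings)
    mean = sum(readings) / n
    variance = sum((x - mean) ** 2 for x in readings) / (n - 1)
    return math.sqrt(variance)

def percent_timely(n_timely, n_untimely):
    """T% = n_t / (n_t + n_u) for a chosen time standard t."""
    return n_timely / (n_timely + n_untimely)
```

For example, three sensors reading 20.0, 21.0, and 22.0 for the same event have a consistency score of 1.0, computed with no reference to which sensor is most accurate.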