TECHNIQUES FOR INCORPORATING DATA QUALITY
ASSESSMENTS INTO LEARNING ALGORITHMS FOR
BAYESIAN NETWORKS
By
Valerie Kay Sessions
Bachelor of Science, College of Charleston, 2001
Master of Science, University of Charleston, 2002
Submitted in Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy in the
Department of Computer Science and Engineering
College of Engineering and Information Technology
University of South Carolina
2006
UMI Number: 3245436
UMI Microform 3245436. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346
Acknowledgements
I would like to first thank my husband Kip for his love and support, for listening to me and supporting me during both the highs and lows of the past years. I would like to thank my parents, Tommy and Debbie Sessions, for the drive they have instilled in me to begin this research and degree, and for instilling my faith in God, which has allowed me to complete it; my in-laws, Paul and Ann Hooker, for countless hours of babysitting; my grandmother for many dinner meals in Blythewood; my son Austin and new daughter Remley for sleeping every now and then so that I could finish one last page of writing.

I would like to thank my entire committee for their time and patience with this process, especially Marco Valtorta for allowing me to spend countless hours in his office learning and asking endless questions. I would also like to thank Jewel Rogers for all of her help administratively; she made commuting two hours possible. Thanks to all my fellow classmates for your inspiration and code snippets.

Thanks to my girlfriends Chris and Amy for their support and love, and to my Women's Bible Study and church for all of their prayers during this endeavor.

Thank you to my former supervisor Richard Baker for funding this work and allowing me ample time to commute to Columbia, and many thanks to all my former co-workers at SPAWAR for their support (and covering for me) during this process.
Abstract
The field of Bayesian networks (BNs) has had much success in developing structure learning algorithms to learn BNs directly from data. However, research has normally started with the assumption that the data given to the learning algorithm is accurate. This assumption is a naive one and can lead to very biased and unrealistic decision-making frameworks. If we are to use decision-making algorithms to their full potential, we must design them with the capability to account for data quality. Our research lays the foundation for the development of new algorithms that incorporate data quality assessments into traditional BN learning algorithms, specifically the PC algorithm. We begin by reviewing Bayesian networks, learning algorithms, and data quality measures. We then quantify the effect of inaccurate data on the PC algorithm and develop three techniques for incorporating data quality assessments into the algorithm. Results indicate that a technique which modifies the significance level used by the PC algorithm is promising. We show and explain these results and offer guidelines for future research in this area.
Contents

Acknowledgements ...... iii
Abstract ...... iv
Chapter 1 Introduction ...... 1
1.1 Research Motivation ...... 1
Chapter 2 Bayesian Network Basics ...... 3
2.1 Bayesian Network Basics ...... 3
2.2 Learning Algorithms for Bayesian Networks ...... 7
2.2.1 Parameter Estimation ...... 8
2.2.2 Structure Learning ...... 14
2.3 Learning with Uncertain Evidence ...... 21
Chapter 3 Data Quality Measurements ...... 24
3.1 Overview of Data Quality ...... 24
… ...... 33
3.3.1 Inaccuracy Tests ...... 34
3.3.2 Parameter Estimation Results ...... 37
3.3.3 Structure Learning Results ...... 42
3.3.4 Conclusions Regarding Inaccurate Data ...... 45
3.4 Development of Experimental Data Sets ...... 47
Chapter 4 Methods and Results ...... 61
4.1.1 Do Nothing Method ...... 61
4.1.2 Threshold Method ...... 61
4.1.3 The DQ Algorithm ...... 62
4.2 Method Pseudo Code ...... 65
… ...... 67
5.1 PC Algorithm Conclusions ...... 100
5.2 Conclusions Regarding Methods ...... 102
5.2.1 Visit to Asia Conclusions ...... 102
5.2.2 Studfarm Conclusions ...... 104
5.2.3 ALARM Conclusions ...... 104
5.3 Modified DQ Algorithm ...... 105
5.4 Future Research ......
List of Figures

2.3: Convergence of Beta Distribution ...... 11
2.4: Visit to Asia, Two Configurations ...... 16
2.5: Relationship of Virtual Evidence to Prior Probability ...... 22
3.1: Two Sources Report on Event A ...... 28
3.2: T1 Probability Table, 80% Sensitivity, 95% Specificity ...... 28
… ...... 35
3.5: Average Variance From Baseline ...... 37
3.6: Comparison of Learned Potentials ...... 38
3.7: Stud Farm Average Variance from Baseline ...... 41
3.8: Stud Farm Learned Potentials Graph ...... 42
3.9: NPC and CB Structure Learning Results ...... 42
3.10: Incorrect Links Using the NPC Algorithm ...... 43
3.11: Incorrect Links Using the CB Algorithm ...... 44
3.12: Visit to Asia "True" Probability Tables ...... 49
3.13: Studfarm "True" Probability Tables ...... 51
3.14: ALARM "True" Probability Tables ...... 59
3.15: Breakdown of Data Sets ...... 60
4.1: Cow Genetics ...... 62
4.2: Average Degree of Nodes ...... 65
4.3: Visit to Asia, One Data Set, Raw Results ...... 69
4.4: Visit to Asia, One Data Set, Combined Results ...... 70
4.5: Visit to Asia, One Data Set, Zeroed Results ...... 70
4.6: Visit to Asia, Two Data Sets, Raw Results ...... 72
4.7: Visit to Asia, Two Data Sets, Combined Results ...... 74
4.8: Visit to Asia, Two Data Sets, Zeroed Results ...... 76
4.9: Visit to Asia, Three Data Sets, Raw Results ...... 80
4.10: Visit to Asia, Three Data Sets, Combined Results ...... 82
4.11: Visit to Asia, Three Data Sets, Zeroed Results ...... 84
4.12: Studfarm, One Data Set, Raw Results ...... 85
4.13: Studfarm, One Data Set, Combined Results ...... 86
4.14: Studfarm, One Data Set, Zeroed Results ...... 86
4.15: Studfarm, Two Data Sets, Raw Results ...... 87
4.16: Studfarm, Two Data Sets, Combined Results ...... 88
4.17: Studfarm, Two Data Sets, Zeroed Results ...... 88
4.18: Studfarm, Three Data Sets, Raw Results ...... 90
4.19: Studfarm, Three Data Sets, Combined Results ...... 91
4.20: Studfarm, Three Data Sets, Zeroed Results ...... 92
4.21: Visit to Asia, Extra Test, Raw Results ...... 94
4.22: Visit to Asia, Extra Test, Combined and Zeroed Results ...... 94
4.23: Studfarm, Extra Test, Raw Results ...... 95
4.24: Studfarm, Extra Test, Combined and Zeroed Results ...... 95
4.25: ALARM, Extra Test, Raw Results ...... 96
4.26: ALARM, Extra Test, Combined and Zeroed Results ...... 96
4.27: Visit to Asia, Significance Test, One Data Set, Raw Results ...... 97
4.28: Visit to Asia, Significance Test, One Data Set, Combined Results ...... 98
4.29: Visit to Asia, Significance Test, One Data Set, Zeroed Results ...... 98
5.1: Commutativity of Methods ...... 107
Chapter 1 Introduction
1.1 Research Motivation
The field of Bayesian networks (BNs) has had much success in creating algorithms to learn directly from data. There are both parameter estimation and structure learning algorithms that have a high rate of success in creating useful BNs. However, research has normally started with the assumption that the data given to the learning algorithm is accurate and complete. This assumption is a naive one and can lead to very biased and unrealistic decision-making frameworks.

There are numerous examples of faulty data collections, from those as technical as the Hubble Telescope's calibration problems to the more human-centered, such as lying on a credit card information form. If we are to use decision-making algorithms to their full potential, we must design them with the capability to account for data quality. Our research lays the foundation for the development of new algorithms that incorporate data quality assessments into traditional BN learning algorithms.
Chapter 2 Bayesian Network Basics
In order to examine the learning of BNs from data sets of low quality, we must first review the background of the algorithms and processes of BNs. First, we will introduce the basics of probability and statistics, Bayes' law, and evidence propagation. Then we will examine learning algorithms for BNs and discuss those employed for this research, as well as fading and adaptation methods. We will also examine research in data quality and how it can be incorporated into our learning algorithms. Finally, we will review complementary work conducted in the area of revision of parameters based on uncertain evidence.
2.1 Bayesian Network Basics
We will define a BN using graph theory, following [18]. A BN is a Directed Acyclic Graph (DAG) G = (V, E), where V is a set of variables and E is a set of directed edges between these variables, together with a probability distribution, P, over the variables. The pair (G, P) satisfies the Markov condition, which states that each variable is independent of its nondescendants given the set of its parents.

Therefore, following [14], for each variable A with parents B1, ..., Bn, there is a potential table P(A | B1, ..., Bn). If A has no parents, this becomes the unconditional probability P(A).
Each variable's probability is represented by a probability table, which shows its probability based on the states of its parents (or its unconditional probability if it is without parents). All of the basics of probability are relevant to these tables and will be reviewed here.

Andrey Kolmogorov [14] formulated the probability axioms in the 1930s. He proposed that, given an event E in a probability space S, there is a probability P(E) such that
1. 0 ≤ P(E) ≤ 1, and P(E) = 1 if E is a certain event.
2. The sum of the probabilities of all of the events E_i (i = 1, 2, ...) in the sample space S is 1: P(S) = P(E1) + P(E2) + ... = 1.
3. For mutually exclusive events E1 and E2, P(E1 ∪ E2) = P(E1) + P(E2).
Two events may be independent or dependent. Two events are considered conditionally dependent if changing the probability of one also changes the probability of the other. For two events E1 and E2, we say that E1 is conditionally dependent on E2 if the probability of E1 given E2 does not equal the probability of E1, that is, P(E1|E2) ≠ P(E1). Conditional probability requires that

P(E1 ∩ E2) = P(E1) P(E2|E1).

Using this rule and the commutative property of sets, P(E1 ∩ E2) = P(E2 ∩ E1), we have
P(E1|E2) = P(E1) P(E2|E1) / P(E2).
From this formula we can understand the statistical procedure used in BNs. We begin with a prior probability of an event occurring, P(E1) and P(E2). If these events are dependent upon each other, we can determine their conditional probabilities P(E1|E2) or P(E2|E1) in the following way. When evidence is collected about one of the variables, say E1, we determine the likelihood of the other event E2 given E1, that is P(E2|E1), multiply this by the prior probabilities, and obtain a posterior probability based on the evidence.
We will use the canonical Visit to Asia example to illustrate these ideas. In this example we are trying to determine if a person has visited Asia based on some visible signs of illness. The structure of the BN is shown in Figure 2.1.

Figure 2.1: Visit to Asia
In order to understand Bayes' rule, we will review one example of the propagation of evidence. We will label the node Has Tuberculosis "H" and the node Visit to Asia "A." Because H has parent A, we have the prior probability table P(H|A):

Visit to Asia?        yes     no
P(H = yes | A)        0.05    0.01

This means that in the general population (according to a Subject Matter Expert) about 1 in 100 people have visited Asia, and if a person has visited Asia this raises his chances of having tuberculosis by 4 percentage points, from 1% to 5%. Now, if we obtain evidence that the person has visited Asia, P(A = yes) = 1, we can look up P(H|A = yes) in the table and see that the chances of contraction of tuberculosis are now 5%. The more interesting case, however, is if we have evidence that the person has tuberculosis. We can propagate this to the parent A in the following manner.
There are two probabilities that will change: P(A = yes) and P(A = no). Using Bayes' rule we have

P(H | A) = P(A | H) P(H) / P(A).

Solving for P(A | H = yes), we obtain

P(A | H = yes) = P(H = yes | A) P(A) / P(H = yes).

To obtain P(H) we must solve for the joint distribution table. Note that we do not need to solve for P(H = no): because we have evidence that H = yes, P(H = no) = 0.
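The propagation just described can be checked numerically. The following sketch uses plain Python (no BN library) and the probabilities quoted in the example: P(A = yes) = 0.01, a 5% chance of tuberculosis after a visit to Asia, and a 1% chance otherwise.

```python
# Propagating the evidence H = yes back to the parent A.
p_a = {"yes": 0.01, "no": 0.99}            # prior P(A): about 1 in 100 visited Asia
p_h_given_a = {"yes": 0.05, "no": 0.01}    # P(H = yes | A)

# Marginal P(H = yes) from the joint distribution (law of total probability)
p_h_yes = sum(p_h_given_a[a] * p_a[a] for a in p_a)

# Bayes' rule: P(A = yes | H = yes)
p_a_given_h = p_h_given_a["yes"] * p_a["yes"] / p_h_yes
print(round(p_h_yes, 4))      # 0.0104
print(round(p_a_given_h, 4))  # 0.0481
```

So evidence of tuberculosis raises the belief that the person visited Asia from 1% to about 4.8%.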
2.2 Learning Algorithms for Bayesian Networks
There are many different ways to create BNs. One method is to create them subjectively, by interviewing Subject Matter Experts (SMEs) for their opinions regarding the probability of outcomes. We then use their opinions and previous work in the field to create a network to represent the relationships between variables. This is often done in sample BNs such as Visit to Asia, or in larger networks where the domain is well known.

In other cases we can use domain data to learn the networks. There are two main categories for this type of learning: parameter estimation and structure learning. For parameter estimation, we must begin with a known BN structure. That is, we know what nodes are of interest and we know how these nodes are connected to one another by a set of directed edges. We also know the states that are possible for each node. For example, in our Visit to Asia example, we would have the structure of the BN as represented in Figure 2.1, and we would know the possible states of the nodes, namely that Visit to Asia has states 'yes' and 'no'. What we would not know is the prior probabilities of these states, for instance that about 1 in 100 people have visited Asia. This we would seek to discover from our data.
For structure learning, even less is known about the Bayesian network. We know what nodes we can work with and their states, but we do not know how they are related to one another or their prior probabilities. Therefore we learn both the structure and the probabilities from our data. We may also learn in a hybrid of these methods: we may know some of the interconnections between nodes, or some of their probabilities, but may need to learn other structures and parameters. Each of these methods is discussed in further detail below.
2.2.1 Parameter Estimation
When computing the probability of a state of a node, we think in terms of a relative frequency of the event occurring. If all frequencies are equally likely, then the relative frequency follows a uniform distribution function. However, in most cases we do not believe that all relative frequencies are equally probable; this is how we can actually predict things, when one thing is more probable than another, or occurs more frequently than another. In order to describe this we will use the usual beta density function to describe the probability.

Neapolitan [18] describes this function:
The beta density function with parameters a, b, N = a + b, where a and b are real numbers > 0, is

ρ(f) = [Γ(N) / (Γ(a)Γ(b))] f^(a−1) (1 − f)^(b−1),   0 ≤ f ≤ 1,

where Γ is the gamma function, with Γ(x) = (x − 1)! if x is an integer ≥ 1.

A random variable F that has this density function is said to have a beta distribution. We refer to the beta density function as beta(f; a, b).
We see that this function becomes more concentrated around the relative frequency as the values of a and b become larger; that is, when we have more evidence supporting the relative frequency, our density becomes more fixed at that point. At the point where this relative frequency is most pronounced, we use the expected value of F, E(F) = a/N. This is the expected relative frequency that we use as our estimate of a prior probability for a state of the node.
We can use this beta distribution to continually revise our estimate of the relative frequency. If we have a new set of data and wish to add it to our prior beta distribution, we can do so in the following manner. Suppose we have a data set, d, with two counts, s and t, that correspond to the a and b states of our prior beta distribution. If M = s + t, then the updated density is beta(f; a + s, b + t).
Further conditions must be satisfied in order to accurately update the distributions, but as this will not be a concern in our research, we will not explain them here. These issues are discussed in more detail in [18].

Not only do we wish to compute a beta distribution and expected value, but we also wish to know with what certainty we are assigning this expected value. These calculations are especially useful in establishing thresholds for what will be considered a "correct" or useful expected value. We can solve for a normalized percent probability interval for E(F) using the normal approximation. This is obtained by the following range:
(E(F) − z_perc σ(F), E(F) + z_perc σ(F)),
where σ(F) is the standard deviation of F, and z_perc is the z-score obtained from the normal table. Using this formula we can obtain a 95% (or any percentage we choose) probability interval for the expected value, or estimate of the relative frequency.
We will use a small example to illustrate the parameter estimation techniques mentioned here. First, we will use the small data set shown in Figure 2.2 to learn the prior probability of having tuberculosis, node H from our previous example.

Figure 2.2: A data set of ten records with columns "Travel to Asia" and "Positive for TB" (individual rows not reproduced here)

We will find the probability of A = yes and H = yes, P(H = yes | A = yes), by our expected value calculations. We will let a be the number of instances where both A and H are yes, and b the number of instances where they are not: N = 10, a = 2, b = 8.
Our beta distribution is therefore beta(f; 2, 8).

Figure 2.3: Convergence of Beta Distribution

As is visible from the graph, the density function begins to converge around 0.15 to 0.2. With more data, if indeed the frequency is 0.2, the mass would concentrate more around 0.2. Because we must use a single number as our prior probability, we will use the expected value, E(F), as our estimate of the relative frequency. In this example the expected value is calculated as

E(F) = a/N = 2/10 = 0.2,

and the variance as ab / (N²(N + 1)) = 16/1100 ≈ 0.0145. The standard deviation is the square root of this, 0.1206. To obtain the normal approximation to the 95% probability interval we use the z-score 1.96 and calculate the interval

(0.20 − (1.96 × 0.1206), 0.20 + (1.96 × 0.1206)) = (0, 0.436).

This is a large variance; however, we have only ten data records, so this is to be expected. As we add data to this probability density, the standard deviation should grow smaller if we encounter a greater probability that the two nodes are related.
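The expected value, standard deviation, and normal-approximation interval above can be computed with a small helper. The function name `beta_estimate` is a hypothetical choice for this sketch, and truncating the interval to [0, 1] mirrors the example's lower bound of 0.

```python
import math

def beta_estimate(a, b, z=1.96):
    """Expected value, standard deviation, and normal-approximation
    probability interval for a beta(f; a, b) density."""
    n = a + b
    e = a / n                                    # E(F) = a / N
    sd = math.sqrt(a * b / (n ** 2 * (n + 1)))   # sqrt of beta variance
    lo, hi = e - z * sd, e + z * sd
    return e, sd, (max(lo, 0.0), min(hi, 1.0))   # truncate to [0, 1]

e, sd, interval = beta_estimate(2, 8)
print(e)                                         # 0.2
print(round(sd, 4))                              # 0.1206
print(round(interval[0], 3), round(interval[1], 3))  # 0.0 0.436
```

This reproduces the (0, 0.436) interval of the ten-record example.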
Note that these algorithms assume that all of the data is present for the calculations and do not take into account missing data. In large and disparate datasets it is almost inevitable that there will be missing data; therefore we need to account for this problem. There are many ways to handle missing data. One is to ignore those records that do not have complete data. In large data sets with only a few missing data pieces this may be a sufficient method; however, if there is a small data set, we may need to account for this missing data. One method for doing this is to simply guess what the missing data piece is. A more mathematical and less subjective approach is to use Expectation-Maximization (EM) algorithms, in which we use the maximum likelihood value of the missing data item. To achieve this we first estimate the probabilities of the missing data. We can simply divide the likelihood out evenly to begin: for a node with two states, each state would have equal probability of occurrence, (0.5, 0.5). Then we use this initial probability in the likelihood equation
L(p = 0.5 | D) = [N! / (α!(N − α)!)] p^α (1 − p)^(N−α),
where N is the total number of cases, α is the number of cases in the particular state we are investigating, and D is the data set we are looking at. After we determine an initial likelihood, we choose a new probability, say 0.52, and determine the likelihood of this as the probability. We would continue to run this algorithm until we determined a most likely probability, and then use this as our missing value. This and other algorithms for dealing with missing data are explained in detail in [8].
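The likelihood-scanning procedure just described can be sketched as a simple grid search. The function names are illustrative, and a full EM implementation would alternate expectation and maximization steps rather than scan a grid; this sketch only shows the "try candidate probabilities, keep the most likely" idea.

```python
import math

def log_likelihood(p, alpha, n):
    """log L(p | D) for alpha cases in the state of interest out of n cases,
    i.e. the binomial likelihood from the text, taken in log form."""
    return (math.lgamma(n + 1) - math.lgamma(alpha + 1) - math.lgamma(n - alpha + 1)
            + alpha * math.log(p) + (n - alpha) * math.log(1 - p))

def most_likely_p(alpha, n, steps=1000):
    """Scan candidate probabilities (0.5, then 0.52, ... in the text; a uniform
    grid over (0, 1) here) and keep the one with the highest likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda p: log_likelihood(p, alpha, n))

print(round(most_likely_p(2, 10), 2))  # 0.2, the maximum-likelihood estimate
```

As expected, the scan recovers the maximum-likelihood estimate α/N.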
There is also the possibility that the data is not missing at random, but that the absence of data depends on the state of other variables. In this case we cannot use the EM algorithm but must take into account the missing data's non-random nature. There are a variety of methods that can be used to solve this problem. One involves deleting the rows of missing data, as in the missing-at-random case. This may work when small amounts of data are missing; however, with large percentages of missing data this will incur biases in the models. Other methods involve adding a new state for the variable, "unknown." While this may work well in cases where having missing data is part of the data, such as when last known addresses are missing in a police criminal record, we may be counting a non-existent data type, and this may also lead to biases in our data. A third method involves replacing the missing values with an estimate of their value, known as imputation. Two such methods are explained in [25]: mean imputation and hot deck imputation. Mean imputation involves taking the mean of the non-missing variables and using this value as a replacement for the missing variable. In hot deck imputation we find similar cases and substitute the missing values with the corresponding values in the similar cases. All of these methods can incur biases, and we must be diligent in deciding what method works best for our domain. In our research, it will not prove necessary to deal with missing data, as the crux of the research is in determining how to incorporate inaccurate data into our models. It is important, however, to understand that missing data does affect the learning algorithms, and that there are methods available for correcting for it.
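Mean and hot deck imputation can be sketched as follows. The record layout and the equality-based notion of "similar cases" are simplifying assumptions made for illustration; real hot deck schemes use richer similarity measures.

```python
def mean_impute(rows, col):
    """Replace missing (None) numeric values in `col` with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{col: mean if r[col] is None else r[col]}) for r in rows]

def hot_deck_impute(rows, col, match_cols):
    """Replace a missing value with the value from a 'similar' complete case,
    similarity here meaning equality on the match_cols fields."""
    donors = [r for r in rows if r[col] is not None]
    out = []
    for r in rows:
        if r[col] is None:
            match = next((d for d in donors
                          if all(d[c] == r[c] for c in match_cols)), None)
            r = dict(r, **{col: match[col] if match else None})
        out.append(r)
    return out

rows = [{"age": 30, "city": "A"}, {"age": None, "city": "A"}, {"age": 40, "city": "B"}]
print(mean_impute(rows, "age")[1]["age"])                # 35.0 (column mean)
print(hot_deck_impute(rows, "age", ["city"])[1]["age"])  # 30 (donor from city A)
```

Note how the two methods fill the same gap differently, which is exactly the source of the biases discussed above.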
2.2.2 Structure Learning
When the structure over the nodes is not known, we must learn both the interdependencies among the data as well as the prior probabilities; this is known as structure learning. There are two main approaches to this learning: search-and-score methods and constraint-based learning. First we will explain the basics of the common search-and-score algorithms. In theory, learning can be achieved by creating all of the possible DAGs for the nodes, and then scoring them to determine which DAG best fits the data. For a large number of nodes this is infeasible; Chickering [3] proved that finding an optimal solution is NP-hard, and therefore we must use heuristic searches to find suitable DAGs. First we use algorithms to generate a class of DAGs, and then we score those DAGs and determine the most probable structure. We will present the K2 algorithm, developed by Cooper and Herskovits [5], which is one popular search-and-score algorithm.
To score a DAG structure S given data D, we use the Cooper-Herskovits scoring function:

P(D | S) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ (r_i − 1)! / (N_ij + r_i − 1)! ] ∏_{k=1}^{r_i} α_ijk!,

where n is the number of nodes, q_i is the number of configurations of the parents of node i, r_i is the number of states for node i, α_ijk is the number of cases in the data set D in which node i is in state k with its parents in configuration j, and N_ij = Σ_k α_ijk.
In the K2 algorithm, we wish to maximize this score. We assume a set of nodes X1 to Xn ordered such that if a node Xj appears before another node Xi in the ordering, that is j < i, there is no arc from Xi to Xj. We start with a node, say Xi, and set its parent set to empty. Then we visit all of the predecessors of Xi and determine the predecessor that most increases the score P(D|S). We greedily add this node to the parent set of Xi and continue until the addition of a parent does not increase the score. Then we proceed through the node list.
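The greedy procedure just described can be sketched as follows. The score is the Cooper-Herskovits function in log form (to avoid overflowing the factorials); the tuple-based data representation and the `max_parents` bound are choices of this sketch, not part of the original description.

```python
import math
from itertools import product

def log_ch_score_node(data, i, parents, states):
    """Log of the Cooper-Herskovits term for node i with the given parent set.
    data: list of tuples of state indices; states[i] = number of states of node i."""
    r = states[i]
    parent_configs = list(product(*[range(states[p]) for p in parents]))
    score = 0.0
    for cfg in parent_configs:
        counts = [0] * r
        for row in data:
            if all(row[p] == v for p, v in zip(parents, cfg)):
                counts[row[i]] += 1
        n_ij = sum(counts)
        score += math.lgamma(r) - math.lgamma(n_ij + r)   # (r-1)! / (N_ij + r - 1)!
        score += sum(math.lgamma(c + 1) for c in counts)  # product of alpha_ijk!
    return score

def k2(data, order, states, max_parents=2):
    """Greedy K2: for each node, repeatedly add the predecessor in the ordering
    that most increases the score, stopping when no addition helps."""
    dag = {}
    for pos, i in enumerate(order):
        parents = []
        best = log_ch_score_node(data, i, parents, states)
        while len(parents) < max_parents:
            scored = [(log_ch_score_node(data, i, parents + [p], states), p)
                      for p in order[:pos] if p not in parents]
            if not scored or max(scored)[0] <= best:
                break
            best, p = max(scored)
            parents = parents + [p]
        dag[i] = parents
    return dag

data = [(0, 0)] * 5 + [(1, 1)] * 5   # node 1 always copies node 0
print(k2(data, order=[0, 1], states=[2, 2]))  # {0: [], 1: [0]}
```

On this strongly dependent toy data, K2 correctly adds node 0 as the parent of node 1.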
While K2 is a simple algorithm, it is powerful and can be modified to require no initial ordering of nodes. We can also use other algorithms to create an initial ordering of the nodes, as was done by Singh and Valtorta [26]. That research used the CB algorithm to create the initial ordering. The CB algorithm starts with a set of nodes and creates a network where all nodes are connected to one another with undirected edges. The algorithm then determines the edges that are independent and deletes those edges. Once the final set of edges is determined, the algorithm orients the edges and creates a final ordering of the nodes. After orientation of the edges, the ordering is passed to the K2 algorithm for a final determination of a structure.
To illustrate the K2 algorithm, we will use a small part of our Visit to Asia example: the two nodes Visit to Asia, A, and Has Tuberculosis, H. We will use the small record set in Figure 2.2 and illustrate the structure learning process. In the true K2 algorithm we would need an ordering of the nodes, {A, H}, such that A can be the parent of H, but H cannot be the parent of A. We can use other constraint-based algorithms to attain an ordering of nodes prior to running the K2 algorithm, as was done in [26] with the CB algorithm; however, in this example we will simply use {A, H}, as the ordering is not important for our illustration.
With this ordering of the nodes, we have two possible configurations for the BN structure; both are shown in Figure 2.4.

Figure 2.4: Visit to Asia, Two Configurations

Node A can be the parent of H, or the two nodes can be independent. We use the Cooper-Herskovits function to score the configurations.
With no prior knowledge of the data we will consider the initial probability tables as A = (0.5, 0.5) and H = (0.5, 0.5). Because of the requirement that inputs to the gamma function be integers, we will consider the joint distribution as

P(A = Y, H = Y) = 0.25
P(A = Y, H = N) = 0.25
P(A = N, H = Y) = 0.25
P(A = N, H = N) = 0.25

and determine that each line in the distribution represents a prior data set. We will use the data set in Figure 2.2 as our example set. This will allow us to score our two configurations: that the nodes are independent, or that node A is the parent of H. Using the Cooper-Herskovits score for the independent configuration we obtain
P(G | D) = …
Based on the data, having A as the parent of H does not increase the score; therefore we would conclude that the nodes are independent based on this data. With all of the data we would conclude that the two nodes are indeed related, as in the second configuration; however, from this data set we conclude that the two are independent.
There are a variety of other methods for learning structure, which we will mention here. First, there are several scoring metrics that are similar to the K2 score, including the BDe metric of Heckerman, Geiger, and Chickering [11], which operate in the same way as other search-and-score algorithms. There are also other useful methods for learning structure with missing data, including a Monte Carlo method called Gibbs sampling. Consider a Markov chain with states X = {x1, x2, ..., xN} and transitions S = {1, 2, ..., N} between these states. We will have a set of transition probabilities p_ij of moving from one state to the next, which sum to 1. We must also have a chain that is ergodic, that is, every state is reachable from every other state. We have an initial configuration of the states, say (x1^0, x2^0, ..., xN^0). Gibbs sampling works in the following way. We sample each variable in X using the current state of the other variables as constraints for the value of that variable. We do this continually, each iteration representing a transition in the Markov chain. The premise of Gibbs sampling is that if this process is repeated a large enough number of times, we obtain a probability close to the true distribution P. If we are confident that this is the case, then we can estimate a missing value by taking the mean of the sampled data. This is a common technique for estimating with missing data, but it can become computationally intensive for a large sample. Techniques for large samples include the Laplace approximation, the Bayesian Information Criterion (BIC) score, and the Cheeseman-Stutz approximation (CS score). These are given an overview in [18] and will not be explained further.
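A toy version of the sampler for two binary variables follows, estimating a probability as the mean of the retained samples as described above. The conditional distributions, burn-in length, and seed are illustrative choices for this sketch.

```python
import random

def gibbs_mean(p_x_given_y, p_y_given_x, iters=20000, burn_in=2000, seed=0):
    """Two-variable Gibbs sampler over binary X and Y: repeatedly resample each
    variable from its conditional given the current value of the other, then
    estimate P(X = 1) by the mean of the retained samples."""
    rng = random.Random(seed)
    x, y = 0, 0                  # initial configuration (x^0, y^0)
    total, kept = 0, 0
    for t in range(iters):
        x = 1 if rng.random() < p_x_given_y[y] else 0  # sample X | Y = y
        y = 1 if rng.random() < p_y_given_x[x] else 0  # sample Y | X = x
        if t >= burn_in:         # discard early, pre-convergence samples
            total += x
            kept += 1
    return total / kept

# With P(X=1 | Y) fixed at 0.3, the chain's mean should settle near 0.3
estimate = gibbs_mean({0: 0.3, 1: 0.3}, {0: 0.5, 1: 0.5})
print(round(estimate, 2))  # close to 0.3
```

The burn-in discard and long run length correspond to the "repeated a large enough number of times" condition in the text.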
Constraint-based learning algorithms work differently and use conditional independence (CI) tests to create a useful DAG. These types of algorithms will be used in our research. They work by creating a fully connected DAG and then deleting links that are conditionally independent. One way to determine conditional independence is through d-separation tests. Because d-separated events are normally also conditionally independent events, these tests can be useful for CI algorithms. Two nodes A and B are d-separated if changes in the certainty of A have no impact on the certainty of B. More formally, nodes A and B are d-separated given C if P(A|B,C) = P(A|C). There are several efficient algorithms for determining d-separation of nodes, including Bayes-Ball, developed by Ross Shachter [24]. This algorithm explores a network and marks those nodes that are d-connected to the starting variable. When the algorithm terminates, those nodes that are not marked are d-separated from the starting variable. The algorithm works as follows. There is a "ball" that passes through the structure. Three moves can be made as the ball traverses the structure: it may pass through, bounce back, or be blocked. Whether a node has evidence, and whether the ball is passed from a parent to a child or from a child to a parent, affects the move that is made. Below is a list of these moves.
1. From a parent to a child with no evidence: the child is marked and the ball passes through to all of the child's children.
2. From a child to a parent with no evidence: the parent is marked and the ball bounces back.

The algorithm terminates when the ball stops moving through the structure. All marked nodes are d-connected to the source node, and all unmarked nodes are d-separated and are therefore probabilistically independent of it given the evidence.
The NPC and PC algorithms are two CI algorithms used by the Hugin™ Decision Engine, which will be used in our research. These algorithms use a form of the cross-entropy score called G² to determine independence of nodes [28]. If X and Y are random variables with joint probability distribution P and sample size m, then the G² score is defined as:

G² = 2m Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ].
The NPC and PC algorithms work as follows:

1. Find a graph pattern (most often a complete graph is constructed over the nodes).
2. For every adjacency set of a node, test for independences (using the G² score) and delete those links which are independent.
3. Orient the remaining links.

Once all of the independent edges are removed, the algorithm orients the edges in a series of steps:
A Head to Head links: If there exists three nodes _X, Y, Z, such that
X —Z-Y are connected, and X and Y are independent given a set not containing Z, then orient the nodes as ¥ > Z<Y
B. Remaining links: Three more rules govern the remaining links; each uses the assumption that all head-to-head links have already been discovered.
i. If there exist three nodes X, Y, Z that are connected as X → Z − Y, and X and Y are not connected, then orient Z − Y as Z → Y.
ii. If there exist two nodes X, Y that are connected X − Y, and there exists a directed path from X to Y, then orient X − Y as X → Y.
iii. If there exist four nodes X, Y, Z, W that are connected X − Z − Y, X → W, Y → W, and Z − W, and X and Y are not connected, then orient Z − W as Z → W.
C. If any links remain, these are oriented arbitrarily, avoiding cycles and head-to-head links.
After orienting the links, the algorithm is complete.
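Step A can be sketched as follows, assuming the separating sets found during the independence tests were recorded. This is a hypothetical helper for illustration, not Hugin's code.

```python
def orient_colliders(adj, sepset):
    """Head-to-head rule: for each non-adjacent pair (X, Y) with a common
    neighbour Z, orient X -> Z <- Y whenever Z is absent from the set
    that rendered X and Y independent.

    adj: dict node -> set of undirected neighbours.
    sepset: dict frozenset({X, Y}) -> separating set found for that pair.
    Returns the set of directed edges (tail, head).
    """
    directed = set()
    nodes = sorted(adj)
    for x in nodes:
        for y in nodes:
            if x >= y or y in adj[x]:
                continue  # skip duplicates and adjacent pairs
            for z in adj[x] & adj[y]:  # common neighbour of X and Y
                if z not in sepset.get(frozenset({x, y}), set()):
                    directed.add((x, z))
                    directed.add((y, z))
    return directed
```

For A − B − C with A and C independent given the empty set (which excludes B), the rule orients A → B ← C.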
There are a few problems with the PC algorithm that NPC attempts to correct, mainly those caused by small data sets, which can indicate that two variables are independent even when this is not theoretically possible. In the Hugin™ implementation of NPC, this is handled simply by allowing the user to orient any ambiguous cases herself. For another implementation of NPC, see [29].
2.3 Learning with Uncertain Evidence
A topic related to our research involves the revision of belief models using uncertain evidence. Chan and Darwiche [2] have presented a summary of the two main schools of thought regarding probability updating with uncertain evidence: Jeffrey's Rule and the Virtual Evidence Method. Informally, these can be classified as the "All things considered" and "Nothing else considered" methods. Jeffrey's Rule uses the construct of probability kinematics to minimize belief change upon new evidence. Suppose we have two distributions that disagree on a particular event, for example the color of a piece of cloth, P(c) and P′(c), but agree on the relationship of that event to some other event, such as the piece of cloth being sold, P(s|c). We can then apply Jeffrey's Rule and obtain a new distribution:
Pr′(s) = Σ_c Pr(s | c) · Pr′(c), where Pr(s | c) = Pr(s, c) / Pr(c).
Jeffrey's Rule is considered an "All things considered" method because we take the new probability distribution as the new "truth" about the event and update accordingly.
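As a minimal sketch of Jeffrey's Rule (the cloth colors and probabilities below are hypothetical illustration values, not data from this work):

```python
def jeffrey_update(p_s_given_c, p_new_c):
    """Jeffrey's rule: Pr'(s) = sum over c of Pr(s | c) * Pr'(c).

    p_s_given_c: dict colour -> Pr(sold | colour), unchanged by the update.
    p_new_c:     dict colour -> Pr'(colour), the new evidence distribution.
    """
    return sum(p_s_given_c[c] * p_new_c[c] for c in p_new_c)

# Hypothetical numbers: chance the cloth sells given its colour,
# and the revised belief about the colour after seeing the cloth.
p_sold_given_colour = {'green': 0.4, 'blue': 0.4, 'violet': 0.8}
p_new_colour = {'green': 0.7, 'blue': 0.25, 'violet': 0.05}

p_sold = jeffrey_update(p_sold_given_colour, p_new_colour)  # 0.42
```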
Conversely, the Virtual Evidence Method defines a relationship between uncertain evidence on an event and the prior probability distribution for that event. If we have a node C upon which we receive new evidence η, we define a new distribution

Pr(c | η) = λ_c · Pr(c) / Σ_{c′} λ_{c′} · Pr(c′),

where λ_c ∝ Pr(η | c). This recasts the uncertainty in the evidence as a likelihood function. Graphically, this would create a belief network with the structure of Figure 2.5.
Figure 2.5 Relationship of Virtual Evidence to Prior Probability
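The Virtual Evidence update can be sketched directly from the formula above; the prior and likelihood ratios below are made-up illustration values.

```python
def virtual_evidence_update(prior, likelihood):
    """Pr(c | eta) = lambda_c * Pr(c) / sum_c' lambda_c' * Pr(c'),
    where lambda_c is proportional to Pr(eta | c).

    prior:      dict state -> Pr(state).
    likelihood: dict state -> lambda for that state (any common scale).
    """
    z = sum(likelihood[c] * prior[c] for c in prior)  # normalizing constant
    return {c: likelihood[c] * prior[c] / z for c in prior}

# Hypothetical: the virtual evidence is three times as likely if C holds.
posterior = virtual_evidence_update(
    {'c': 0.4, 'not_c': 0.6},
    {'c': 3.0, 'not_c': 1.0},
)  # posterior['c'] = 2/3
```

Note that only the ratio of the lambdas matters ("Nothing else considered"): scaling both likelihood entries by a constant leaves the posterior unchanged.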
While Chan and Darwiche [2] show that it is possible to convert between Jeffrey's Rule and the Virtual Evidence Method, the two schools of thought highlight the importance of understanding how evidence is being presented: whether in an "All things considered" methodology or in a Virtual Evidence, "Nothing else considered," methodology. Vomlel [35] reiterates that both methods are useful, but that we must determine which method best fits our evidence and tailor our revisions and our evidence collection methods to that methodology.
Now that we have reviewed the basics of BNs, we will explore data quality in Chapter 3 and begin the baseline work for our experiments.
Chapter 3: Data Quality Measurements
3.1 Overview of Data Quality
Much research has been conducted in the field of data quality over the last 30-40 years of data collection. This is largely a result of business and industry's continued reliance on their collected data to influence business decisions. Decision makers want to be assured that their decisions are based on sound and accurate data. There has also been much concern in recent years about the accuracy of the data stored in our criminal records systems. A study conducted in the 1980s for the Office of Technology Assessment discovered vast problems with the Federal Bureau of Investigation (FBI)'s databases [15]. This study analyzed the National Crime Information Center's Computerized Criminal History (NCIC-CCH) database, an online file of approximately 2 million records of arrests, court dispositions, and sentencing. The study found that approximately half of those data records contained some data quality problem, ranging from incompleteness to inaccuracy. It is apparent from this and other studies that our data quality is in question and that our machine learning algorithms must take these discrepancies into account.
Assessing the quality of our data is a difficult process, rife with subjectivity. Researchers have been attempting for some time to develop quantitative metrics that accurately judge the quality of data. The metrics they have developed range from the subjective (value-added and understandability) to those that are more easily quantifiable (completeness and timeliness). While some researchers have defined up to sixteen different metrics [22], we will concentrate on a core set of four in our discussion, distilled from [37]: accuracy/precision, completeness, consistency/believability, and timeliness.
Accuracy and precision are taken from their common scientific meanings. Accuracy represents how close a measurement, or data record, is to the real-world situation that it measures. Precision can refer to two similar ideas. First, it can refer to the standard deviation or variation in a numerical data record with multiple readings. As an example, if a weather sensor is calibrated before its use to have a variation of ±0.01°C, we would use this calibrated range as the precision of the instrument. The second use of precision is to quantify the degree to which a sensor, or other data input, gives the same data result for a given real-world situation. To explain the difference between accuracy and precision, we can think of a shooting range with two gunmen. If one gunman shoots three shots into the same hole in the target, he may be very precise, yet inaccurate if that hole is not the bull's-eye. In contrast, the second gunman may hit the bull's-eye two out of three times, but on the third shot he shoots into the woods. In this case the gunman is accurate on two of the three shots, yet imprecise, as he cannot repeat his shot each time. Of course, we would wish our data to be both accurate and precise. While accuracy is a metric that must be calculated over time, and with much diligence to discover discrepancies between what is contained in the real world and what is represented in the data record, precision can more easily be calculated from the data at hand.
Completeness refers to the amount of missing data. It can be computed as the total number of missing data points, or as the total number of rows containing missing data; each can be useful in research. This metric is easily calculated from the data at hand and requires no subjective input from the user or data manager. Completeness has been well researched, and methods such as Expectation Maximization (EM) and Gibbs Sampling handle the problem effectively (see Section 2.2.1). Large amounts of missing data can also point to problems in the data collection method and the data collection tools.
Timeliness is also important in the context of our data. Outdated data is often useless in many fields of study, such as weather data records, or stock market data that must be used to make buying/selling decisions in seconds. In other fields, however, the timeliness of data is not as crucial: if historical data is being used to track consumer spending over the last decade, having data within minutes of the real-world event is not as important. This metric is therefore at once both subjective and objective. If the data is time-stamped, it is not difficult to determine the time difference between the entry and the present time. However, we must allow a Subject Matter Expert to guide the program as to how current the data should be for a particular purpose.
Lastly, data can be judged on its consistency or believability. These metrics are similar to precision, except that they apply to data from different data sets. Where precision applies to one particular data source (a particular sensor or manually entered data source), consistency/believability applies to multiple data sets reporting on the same real-world situation. If we have three sensors reporting weather information from one location, we can measure the consistency of the data for that region as the deviation between the data records from each sensor. Similarly, if we have eyewitness accounts at the scene of a crime, we can judge the consistency of the data set by the similarity of the accounts. This metric is computed separately from the accuracy and precision calculations and gives no weight to one data source over another. If there are three data sources, we will calculate one consistency metric for that data record which takes into account differences among all the data sets, without declaring which data source is most accurate. This allows us to account for differences in data without knowing the accuracy of the data sets. This may seem faulty, and indeed, if we have the resources to judge the accuracy of the data sets, then we can assign data quality based on those findings. However, consistency gives us a metric for determining that there are indeed differences among data sources, without tracing the data sets back to their sources and having to exhaustively determine which data set is most accurate.
While all of the elements of data quality are important, studying all of them and their combined effects would be time prohibitive. Therefore, after studying the literature and conducting our own experiments (see Section 3.3), we have determined to limit our study to the element of data inaccuracy. This element was chosen because it lends itself well to measurement and experimentation and has not yet been as thoroughly researched and reported in the literature.
While the effects of inaccuracy on learning algorithms have not been well studied or documented in the literature, complementary work done by Vomlel [35] in the area of evidential update with uncertain evidence lends itself well to use in our research. Vomlel defines accuracy as
P(A = T) = (tp + tn) / (tp + tn + fp + fn),
where T is a data source, A is the event T is reporting on, tp is the number of true positive data points, tn is the number of true negative data points, fp is the number of false positives, and fn is the number of false negatives. He further defines two criteria, sensitivity and specificity, that are important for determining how data sources should update the probability table of A. Sensitivity is the test's true positive rate, tp / (tp + fn), and specificity is its true negative rate, tn / (tn + fp). Figure 3.1 gives a graphical example of how the two data sources report evidence for event A.
Figure 3.1: Two Sources Report on Event A
Figure 3.2 shows the relationship between T1 and A when T1's sensitivity is 80% and specificity is 95%.
Figure 3.2: T1 Probability Table — 80% Sensitivity, 95% Specificity
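Vomlel's accuracy, sensitivity, and specificity, and the conditional probability table of Figure 3.2, can be sketched from confusion counts as follows; the counts and helper names are our own illustration, not Vomlel's code.

```python
def source_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

def cpt_from_rates(sensitivity, specificity):
    """Conditional probability table P(T | A), keyed as cpt[A][T].

    An 80%/95% source reports T=True with probability 0.80 when A is
    true and with probability 0.05 (a false positive) when A is false.
    """
    return {
        True:  {True: sensitivity, False: 1 - sensitivity},
        False: {True: 1 - specificity, False: specificity},
    }
```

For hypothetical counts tp=80, tn=95, fp=5, fn=20, this gives accuracy 0.875, sensitivity 0.80, and specificity 0.95, matching the Figure 3.2 source.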
While Vomlel uses these metrics for evidential update, we are evaluating the BNs generated using data sources of low accuracy. So while evidential update uses this idea when updating a belief based on uncertain evidence:

P(A = yes | T = yes) = c · P(T = yes | A = yes) · P(A = yes),

we are exploring the probability of learning a BN, G, with data of a certain accuracy, or

P(G | A = T).
This method can also be extended to our other data quality measures. In order to clarify our definitions of completeness, consistency/believability, and timeliness, we will formalize them as Vomlel has done with accuracy.
We shall define completeness, the percentage of data that is not missing, as

PercentComplete = C% = n_c / (n_c + n_m),

where n_c is the number of complete data points and n_m is the number of missing data points. Similarly, consistency/believability shall be the square root of the sample variance of the data sources reporting:

variance = (1 / (N − 1)) · Σ_{i=1}^{N} (x_i − x̄)²,

where N is the number of data sets reporting, x_i is the data record from the i-th data source reporting on the event, and x̄ is the mean of the data points.
Timeliness can also be represented more formally using this method. Assuming a time standard t, where n_t counts the data points whose age is at most t and n_u counts those whose age exceeds t, timeliness can be represented as

PercentTimely = T% = n_t / (n_t + n_u).
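The three formal definitions above can be sketched directly; the helper names are our own.

```python
import math

def percent_complete(n_complete, n_missing):
    """C% = n_c / (n_c + n_m)."""
    return n_complete / (n_complete + n_missing)

def consistency(readings):
    """Square root of the sample variance of the sources' readings
    for one real-world event (lower means more consistent)."""
    n = len(readings)
    mean = sum(readings) / n
    variance = sum((x - mean) ** 2 for x in readings) / (n - 1)
    return math.sqrt(variance)

def percent_timely(n_timely, n_untimely):
    """T% = n_t / (n_t + n_u) for a chosen time standard t."""
    return n_timely / (n_timely + n_untimely)
```

For example, three sensors reading 20.0, 21.0, and 22.0 for the same event have a consistency score of 1.0, computed with no reference to which sensor is most accurate.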