It has been shown that FD logically implies CI (Butz et al., 1999). We show how to combine the obtained FDs with the chain rule of probability to construct a DAG of a CN. Given a set of FDs obtained from data, an ordering of variables is obtained such that the Markov boundaries of some variables are determined. Representing the joint probability distribution of the variables in the resulting ordering by the chain rule, the Markov boundaries of the other variables are determined. A DAG of a CN is constructed by designating each variable's Markov boundary as the parent set of that variable. During this process, we take full advantage of known results in both CNs (Pearl, 1988) and relational databases (Maier, 1983). We demonstrate the effectiveness of our approach using fifteen real-world datasets. The DAG constructed in our approach can also be used as an initial DAG for previous approaches. The work here further illustrates the intrinsic relationship between CNs and relational databases (Wong et al., 2000, Wong and Butz, 2001).
The remainder of this chapter is organized as follows. Background knowledge is given in Section 49.2. In Section 49.3, the theoretical foundation of our approach is provided. The algorithm to construct a CN is developed in Section 49.4. In Section 49.5, the experimental results are presented. Conclusions are drawn in Section 49.6.
49.2 Background Knowledge
Let U be a finite set of discrete variables, each with a finite domain. Let V be the Cartesian product of the variable domains. A joint probability distribution (Pearl, 1988) p(U) is a function p on V such that 0 ≤ p(v) ≤ 1 for each configuration v ∈ V and ∑v∈V p(v) = 1.0. The marginal distribution p(X) for X ⊆ U is defined as ∑U−X p(U). If p(X) > 0, then the conditional probability distribution p(Y|X) for X, Y ⊆ U is defined as p(XY)/p(X). In this chapter, we may write ai for the singleton set {ai}, and we use the terms attribute and variable interchangeably; similarly for the terms tuple and configuration.
Definition 1. A causal network (CN) is a directed acyclic graph (DAG) D together with a conditional probability distribution (CPD) p(ai|Pi) for each variable ai in D, where Pi denotes the parent set of ai in D.

The DAG D graphically encodes CIs regarding the variables in U.
Definition 2 (Wong et al., 2000). Let X, Y, and Z be three disjoint sets of variables. X is said to be conditionally independent of Y given Z, denoted I(X,Z,Y), if p(X|Y,Z) = p(X|Z).
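Definition 2 can be checked directly on a tabulated joint distribution. The sketch below is our own illustration (the dictionary-based representation and function names are assumptions, not from the chapter); it tests I(X,Z,Y) via the factorized form p(X,Y|Z) = p(X|Z)p(Y|Z), which is equivalent to p(X|Y,Z) = p(X|Z) wherever the conditionals are defined:

```python
from itertools import product
from collections import defaultdict

def marginal(p, vars_, on):
    """Marginalize the joint p (dict: full config tuple -> prob) onto the
    variables listed in `on`; vars_ is the ordered list of all variables."""
    idx = [vars_.index(v) for v in on]
    m = defaultdict(float)
    for config, prob in p.items():
        m[tuple(config[i] for i in idx)] += prob
    return m

def is_ci(p, vars_, X, Z, Y, tol=1e-9):
    """Check I(X, Z, Y): p(X,Y|Z) == p(X|Z) * p(Y|Z) wherever p(Z) > 0."""
    pz   = marginal(p, vars_, Z)
    pxz  = marginal(p, vars_, X + Z)
    pyz  = marginal(p, vars_, Y + Z)
    pxyz = marginal(p, vars_, X + Y + Z)
    for xyz, prob in pxyz.items():
        x = xyz[:len(X)]
        y = xyz[len(X):len(X) + len(Y)]
        z = xyz[len(X) + len(Y):]
        if pz[z] > 0:
            lhs = prob / pz[z]
            rhs = (pxz[x + z] / pz[z]) * (pyz[y + z] / pz[z])
            if abs(lhs - rhs) > tol:
                return False
    return True

# A toy joint over (a1, a2, a3) in which a2 and a3 are independent given a1:
vars_ = ['a1', 'a2', 'a3']
p_a1 = {0: 0.4, 1: 0.6}
p_a2_given_a1 = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}
p_a3_given_a1 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}
p = {}
for v1, v2, v3 in product([0, 1], repeat=3):
    p[(v1, v2, v3)] = p_a1[v1] * p_a2_given_a1[v1][v2] * p_a3_given_a1[v1][v3]

print(is_ci(p, vars_, ['a2'], ['a1'], ['a3']))  # True: I(a2, a1, a3) holds
print(is_ci(p, vars_, ['a2'], [], ['a3']))      # False: a2, a3 marginally dependent
```

The exhaustive enumeration is exponential in |U|, which is acceptable only for small toy distributions such as the ones used in the examples of this chapter.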
As previously mentioned, the CIs encoded in the DAG D indicate that the product of the given CPDs is a joint probability distribution p(U).
Example 1. One CN on the set U = {a1,a2,a3,a4,a5,a6} is the DAG in Figure 49.1(i) together with the CPDs p(a1), p(a2|a1), p(a3|a1), p(a4|a2), p(a5|a3), and p(a6|a4,a5). This DAG encodes, in particular, I(a3,a1,a2), I(a4,a2,a1a3), I(a5,a3,a1a2a4), and I(a6,a4a5,a1a2a3).
By the chain rule, the joint probability distribution p(U) can be expressed as:
p(U) = p(a1)p(a2|a1)p(a3|a1,a2)p(a4|a1,a2,a3)p(a5|a1,a2,a3,a4)
p(a6|a1,a2,a3,a4,a5).
The above CIs can be used to rewrite p(U) as:

p(U) = p(a1)p(a2|a1)p(a3|a1)p(a4|a2)p(a5|a3)p(a6|a4,a5).
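To make the reduction concrete, one can build p(U) from the six CPDs of Example 1 and verify both that the product is a valid joint distribution and that an encoded CI, e.g. I(a4, a2, a1a3), holds. The numeric CPD tables below are made up purely for illustration:

```python
from itertools import product

# Illustrative binary CPDs for the DAG of Figure 49.1(i); the values are invented.
p_a1 = 0.6                       # p(a1 = 1)
p_a2 = {0: 0.3, 1: 0.7}          # p(a2 = 1 | a1)
p_a3 = {0: 0.5, 1: 0.2}          # p(a3 = 1 | a1)
p_a4 = {0: 0.4, 1: 0.9}          # p(a4 = 1 | a2)
p_a5 = {0: 0.1, 1: 0.8}          # p(a5 = 1 | a3)
p_a6 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(a6 = 1 | a4, a5)

def bern(p1, v):
    """Probability of value v under a Bernoulli with success probability p1."""
    return p1 if v == 1 else 1.0 - p1

# Joint via the reduced factorization p(a1)p(a2|a1)p(a3|a1)p(a4|a2)p(a5|a3)p(a6|a4,a5):
joint = {}
for v in product((0, 1), repeat=6):
    a1, a2, a3, a4, a5, a6 = v
    joint[v] = (bern(p_a1, a1) * bern(p_a2[a1], a2) * bern(p_a3[a1], a3)
                * bern(p_a4[a2], a4) * bern(p_a5[a3], a5) * bern(p_a6[(a4, a5)], a6))

print(round(sum(joint.values()), 10))   # 1.0: a valid joint distribution

# One encoded CI: p(a4 | a1, a2, a3) should equal p(a4 | a2).
def cond_a4(fix):
    """p(a4 = 1 | fixed variables), where fix maps variable index -> value."""
    num = sum(pr for cfg, pr in joint.items()
              if cfg[3] == 1 and all(cfg[i] == val for i, val in fix.items()))
    den = sum(pr for cfg, pr in joint.items()
              if all(cfg[i] == val for i, val in fix.items()))
    return num / den

print(round(cond_a4({0: 1, 1: 0, 2: 1}), 10))  # p(a4=1 | a1=1, a2=0, a3=1)
print(round(cond_a4({1: 0}), 10))              # p(a4=1 | a2=0): same value, 0.4
```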
[Figure: (i) a DAG with edges a1 → a2, a1 → a3, a2 → a4, a3 → a5, a4 → a6, a5 → a6; (ii) the same DAG with the edge between a1 and a2 reversed.]

Fig. 49.1. Two causal networks.
The term CN is somewhat misleading because it may be possible to reverse a directed edge without disturbing the encoded CI information. For example, the two CNs in Figure 49.1 encode the same independency information. Thus, it is perhaps better to view a CN as encoding independency information rather than causal relationships.

Using the CI information encoded in the DAG, the Markov boundary of a variable can be defined.
Definition 3. Let U = {a1, ..., an}, and let O be an ordering a1, ..., an of the variables of U. Let Ui = {a1, ..., ai−1} be a subset of U with respect to the ordering O. A Markov boundary of a variable ai over Ui, denoted Bi, is any subset X of Ui such that p(Ui) satisfies I(ai, X, Ui − X − ai) and does not satisfy I(ai, X′, Ui − X′ − ai) for any X′ ⊂ X.
Example 2. Recall the DAG in Figure 49.1(i), and let the ordering O be a1, a2, a3, a4, a5, a6. Then U1 = {}, U2 = {a1}, U3 = {a1, a2}, U4 = {a1, a2, a3}, U5 = {a1, a2, a3, a4}, and U6 = {a1, a2, a3, a4, a5}. The Markov boundary of variable a3 over U3 is B3 = {a1}, since p(U3) satisfies I(a3, a1, a2) but does not satisfy I(a3, ∅, a1a2).
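Definition 3 suggests a brute-force (exponential) procedure: enumerate the subsets X of Ui in order of increasing size and return the first one satisfying I(ai, X, Ui − X − ai). Any proper subset of the returned set was checked at a smaller size and failed, so the result is minimal. The sketch below is our own illustration (helper names and the toy distribution are assumptions), mirroring Example 2:

```python
from itertools import combinations, product
from collections import defaultdict

def marg(p, vars_, keep):
    """Marginalize joint p (dict: config tuple -> prob) onto `keep`."""
    idx = [vars_.index(v) for v in keep]
    m = defaultdict(float)
    for cfg, pr in p.items():
        m[tuple(cfg[i] for i in idx)] += pr
    return dict(m)

def holds_ci(p, vars_, a, X, rest, tol=1e-9):
    """Does p satisfy I(a, X, rest), i.e. p(a, rest | X) = p(a | X) p(rest | X)?"""
    pX, paX, prX, parX = (marg(p, vars_, s) for s in
                          (X, [a] + X, rest + X, [a] + rest + X))
    for cfg, pr in parX.items():
        av, rv, xv = cfg[:1], cfg[1:1 + len(rest)], cfg[1 + len(rest):]
        if pX.get(xv, 0) > 0:
            lhs = pr / pX[xv]
            rhs = (paX[av + xv] / pX[xv]) * (prX[rv + xv] / pX[xv])
            if abs(lhs - rhs) > tol:
                return False
    return True

def markov_boundary(p, vars_, a, Ui):
    """A minimal X subset of Ui with I(a, X, Ui - X - a), by increasing size."""
    for k in range(len(Ui) + 1):
        for X in combinations(Ui, k):
            rest = [v for v in Ui if v not in X]
            if holds_ci(p, vars_, a, list(X), rest):
                return set(X)
    return set(Ui)  # unreachable: X = Ui always satisfies the CI trivially

# Toy distribution over (a1, a2, a3) where a3 depends only on a1 (cf. Example 2):
vars_ = ['a1', 'a2', 'a3']
p = {}
for a1, a2, a3 in product((0, 1), repeat=3):
    pa2 = 0.7 if a2 == a1 else 0.3          # a2 correlated with a1
    pa3 = 0.9 if a3 == a1 else 0.1          # a3 depends on a1 only
    p[(a1, a2, a3)] = 0.5 * pa2 * pa3

print(markov_boundary(p, vars_, 'a3', ['a1', 'a2']))  # {'a1'}, as in Example 2
```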
The Markov boundary Bi of a variable ai over Ui encodes the CI I(ai, Bi, Ui − Bi − ai) over p(Ui). Using the Markov boundary of each variable, a boundary DAG is defined as follows.
Definition 4. Let p(U) be a joint probability distribution over U, O be an ordering a1, ..., an of the variables of U, and Ui = {a1, ..., ai−1} be a subset of U with respect to the ordering O. Let {B1, ..., Bn} be an ordered set of subsets of U such that each Bi is a Markov boundary of ai over Ui. The DAG created by designating each Bi as the parent set of variable ai is called a boundary DAG of p(U) relative to O.
The next theorem (Pearl, 1988) indicates that a boundary DAG of p(U) relative to an ordering O is a DAG of a CN of p(U).

Theorem 1. Let p(U) be a joint probability distribution and O be an ordering of the variables of U. If D is a boundary DAG of p(U) relative to O, then D is a DAG of a CN of p(U).
It is important to realize that, according to Definition 4 and Theorem 1, we can construct a DAG of a CN once the Markov boundary of each variable has been obtained.
Example 3. Let U = {a1, ..., a6} and O = a2, a1, a3, a4, a5, a6 be an ordering of the variables of U. With respect to the ordering O, U1 = {a2}, U2 = {}, U3 = {a1, a2}, U4 = {a1, a2, a3}, U5 = {a1, a2, a3, a4}, and U6 = {a1, a2, a3, a4, a5}. Supposing we assign B1 = {a2}, B2 = {}, B3 = {a1}, B4 = {a2}, B5 = {a3}, and B6 = {a4, a5}, then the DAG shown in Figure 49.1(ii) is the learned DAG of the CN of p(U).
49.3 Theoretical Foundation
In this section, several theorems relevant to our approach are provided. We define a relation r(U) as a finite set of tuples over U. We begin with functional dependency (Maier, 1983).

Definition 5. Let r(U) be a relation over U and X, Y ⊆ U. The functional dependency (FD) X → Y is satisfied by r(U) if every two tuples t1 and t2 of r(U) that agree on X also agree on Y.

If a relation r(U) satisfies the FD X → Y, but not X′ → Y for every X′ ⊂ X, then X → Y is called left-reduced (Maier, 1983).
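Definition 5 and the left-reduced condition translate directly into code. The sketch below is our own illustration (the relation is encoded as a list of dicts and the helper names are assumptions); it checks an FD and its left-reducedness on a toy relation:

```python
from itertools import combinations

def satisfies_fd(r, X, Y):
    """True if relation r (list of dicts) satisfies the FD X -> Y:
    any two tuples agreeing on X must also agree on Y."""
    seen = {}
    for t in r:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

def is_left_reduced(r, X, Y):
    """True if X -> Y holds but no proper subset of X determines Y."""
    if not satisfies_fd(r, X, Y):
        return False
    return all(not satisfies_fd(r, list(Xp), Y)
               for k in range(len(X))
               for Xp in combinations(X, k))

# A toy relation over attributes a1, a2, a3 (a3 is the XOR of a1 and a2):
r = [
    {'a1': 0, 'a2': 0, 'a3': 0},
    {'a1': 0, 'a2': 1, 'a3': 1},
    {'a1': 1, 'a2': 0, 'a3': 1},
    {'a1': 1, 'a2': 1, 'a3': 0},
]
print(satisfies_fd(r, ['a1', 'a2'], ['a3']))    # True: a1a2 -> a3
print(is_left_reduced(r, ['a1', 'a2'], ['a3'])) # True: neither a1 nor a2 alone suffices
```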
The next theorem shows that FD logically implies CI.

Theorem 2 (Butz et al., 1999). Let r(U) be a relation over U, let p(U) be a joint distribution over r(U), and let X, Y ⊆ U and Z = U − XY. The FD X → Y being satisfied by r(U) is a sufficient condition for the CI I(Y, X, Z) to be satisfied by p(U).
By exploiting the implication relationship between functional dependency and conditional independency, we can relate a left-reduced FD X → ai to the Markov boundary of variable ai.

Theorem 3. Let U = {a1, ..., an}, Ui be a subset of U, and X ⊆ Ui. If the FD X → ai is left-reduced, then X is the Markov boundary of variable ai over Ui.

Proof: Since X → ai is a FD and X ⊆ Ui, according to Definition 5, X → ai holds over Ui ∪ {ai}. Since the FD X → ai is left-reduced, by Theorem 2, p(Ui) satisfies I(ai, X, Ui − X − ai) but not I(ai, X′, Ui − X′ − ai) for any X′ ⊂ X.
Theorem 3 indicates that the Markov boundary of variable ai over Ui can be learned from a left-reduced FD X → ai, provided X is a subset of Ui. We define those variables whose boundaries can be learned from a set of left-reduced FDs as follows.

Definition 6. Let U = {a1, ..., an}, X ⊂ U, ai ∉ X, and F be a set of left-reduced FDs over U. If there exists X → ai ∈ F, then ai is a decided variable. Otherwise, ai is an undecided variable.

Considering a variable as decided indicates that its Markov boundary can be learned from the FDs implied in the data.
49.4 Learning a DAG of CN by FDs
In this section, we use learned FDs to construct a CN. We illustrate our algorithm using the heart disease dataset, which contains 13 attributes and 230 rows, from the UCI Machine Learning Repository (Blake and Merz, 1998).

Example 4. The heart disease dataset has U = {a1, ..., a13}. Using FD_Mine, the discovered set of left-reduced FDs is F = {a1a5 → a3, a1a5 → a6, a1a5 → a11, a1a5 → a13, a1a8 → a7, a4a5a9 → a2, a1a5a10 → a4, a1a2a5 → a8, a1a5a10 → a9, a1a5a10 → a12}.
As indicated by Theorem 1, a DAG of a CN is constructed once the Markov boundary of each variable relative to an ordering O is obtained. We obtain the Markov boundary of each variable in two steps. First, in Section 49.4.1, we show how to obtain an ordering O such that the Markov boundary of each decided variable with respect to O can be obtained from the given FDs. Second, we determine the Markov boundary of each undecided variable by the chain rule in Section 49.4.2.
49.4.1 Learning an Ordering of Variables from FDs
Given a set F of left-reduced FDs, Algorithm 1 in Figure 49.2 determines an ordering O of the variables of U such that the Markov boundary of each decided variable with respect to O can be obtained from F. We use the FDs in Example 4 to demonstrate how Algorithm 1 works.
Example 5. First, the FD a1a5 → a3 is selected in line 2 by Algorithm 1. In line 3, since a3 is not in Y for any Y → ai (ai ≠ a3) in F, variable a3 is removed from U in line 4; thus U = {a1, a2, a4, ..., a13}. We obtain O = a3 in line 5. In line 6, the FD a1a5 → a3 is removed from F. Because F is not empty, Algorithm 1 continues performing lines 2−8. This time, the FD a1a5 → a6 is selected in line 2. Variable a6 is removed from U in line 4, giving U = {a1, a2, a4, a5, a7, ..., a13}. We obtain O = a6, a3 in line 5. By repeatedly performing lines 2 to 8 until F is empty, we obtain an ordering O = a12, a9, a4, a2, a8, a7, a13, a11, a6, a3 and U = {a1, a5, a10}. In line 9, we prepend U to the head of O. Thus, for the variables in U, O = a1, a5, a10, a12, a9, a4, a2, a8, a7, a13, a11, a6, a3.
The next theorem guarantees that the Markov boundary of each decided variable is determined by the ordering O obtained by Algorithm 1.

Theorem 4. Let U = {a1, ..., an}, O = a1, ..., an be an ordering obtained by Algorithm 1, and Ui = {a1, ..., ai−1} be a subset of U with respect to O. Let Bi be the Markov boundary of ai over Ui. If ai is a decided variable, then Bi ⊆ Ui.

Proof: Since ai is a decided variable, there exists a FD X → ai. When Algorithm 1 performs lines 2-8, ai is always deleted from U before any variable of X, according to lines 3-5 of Algorithm 1. Thus, for any aj ∈ X, aj always appears before ai in O, so we have X ⊆ Ui. Since Bi = X, according to Theorem 3, Bi ⊆ Ui.
Algorithm 1.
Input: U = {a1, ..., an}, and a set F of left-reduced FDs.
Output: an ordered list O of the variables of U.
Begin
1. O = ⟨⟩.
2. for each X → ai ∈ F
3.    if ai does not appear in Y for any Y → aj ∈ F then
4.       U = U − {ai}.
5.       prepend ai to the head of O.
6.       F = F − {Y → ai | Y → ai ∈ F}.
7.    end if
8. end for
9. prepend U to the head of O.
10. return(O).
End

Fig. 49.2. An algorithm to obtain an ordering O of the variables of U.
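Algorithm 1 can be transcribed into Python as follows. This is our own sketch: FDs are represented as (frozenset, variable) pairs, the outer while-loop makes the repetition of lines 2-8 described in Example 5 explicit, and both the FD processing order and the order of the undecided variables are resolved arbitrarily, so the resulting ordering may differ from the one in Example 5 while still satisfying Theorem 4:

```python
def fd_ordering(U, F):
    """Algorithm 1 sketch: order the variables of U so that every decided
    variable appears after all variables in the left side of its FD.
    U: iterable of variable names; F: list of (frozenset lhs, rhs) pairs."""
    U, F = set(U), list(F)
    O = []                                  # line 1: O starts empty
    while F:                                # lines 2-8 repeat until F is empty
        progressed = False
        for X, a in list(F):                # line 2
            if (X, a) not in F:             # already removed earlier in this pass
                continue
            # line 3: a must not appear in the left side of any remaining FD
            if all(a not in Y for Y, _ in F):
                U.discard(a)                            # line 4
                O.insert(0, a)                          # line 5: prepend a to O
                F = [(Y, b) for Y, b in F if b != a]    # line 6
                progressed = True
        if not progressed:                  # guard against cyclic FD sets
            break
    return sorted(U) + O                    # line 9: undecided variables first

# The ten left-reduced FDs of Example 4 (heart disease dataset):
F = [(frozenset(lhs), rhs) for lhs, rhs in [
    ({'a1', 'a5'}, 'a3'), ({'a1', 'a5'}, 'a6'), ({'a1', 'a5'}, 'a11'),
    ({'a1', 'a5'}, 'a13'), ({'a1', 'a8'}, 'a7'), ({'a4', 'a5', 'a9'}, 'a2'),
    ({'a1', 'a5', 'a10'}, 'a4'), ({'a1', 'a2', 'a5'}, 'a8'),
    ({'a1', 'a5', 'a10'}, 'a9'), ({'a1', 'a5', 'a10'}, 'a12'),
]]
U = {f'a{i}' for i in range(1, 14)}
O = fd_ordering(U, F)
print(O[:3])  # the undecided variables a1, a10, a5 come first
```

Whatever tie-breaking is used, the property that matters is the one in Theorem 4: every decided variable follows all of its FD's left-side variables in O.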
49.4.2 Learning the Markov Boundaries of Undecided Variables
Once an ordering O = a1, ..., an of the variables of U is obtained by Algorithm 1, the joint probability distribution p(U) can be expressed using the chain rule as follows:

p(U) = p(a1) ··· p(aj | a1, ..., aj−1) ··· p(an | a1, ..., an−1).   (49.1)
If ai is an undecided variable, then there is no FD X → ai ∈ F. According to Algorithm 1, ai is not deleted from U and is prepended to the head of O in line 9. This indicates that all undecided variables appear before all decided variables in O.

Suppose a1, ..., aj are all the undecided variables, and Bj+1, ..., Bn are the Markov boundaries of all the decided variables. By Definition 3, the CI I(ai, Bi, Ui − Bi − ai) holds for each variable ai, j+1 ≤ i ≤ n. Thus, each p(ai | a1, ..., ai−1) = p(ai | Bi), and Equation 49.1 can be rewritten as:
p(U) = p(a1) ··· p(aj | a1, ..., aj−1) p(aj+1 | Bj+1) ··· p(an | Bn).   (49.2)
By assigning Bk = {a1, ..., ak−1} as the Markov boundary of each undecided variable ak, 1 ≤ k ≤ j, Equation 49.2 can be expressed as:
p(U) = p(a1 | B1) ··· p(aj | Bj) ··· p(an | Bn) = ∏ai∈U p(ai | Bi).   (49.3)
Equation 49.3 indicates that a joint probability distribution p(U) can be represented by the Markov boundaries of all variables of U. Thus, a boundary DAG relative to O can be constructed. Based on the above analysis, we developed the algorithm shown in Figure 49.3, called FD2CN, which learns a DAG D of a CN from a dataset r(U).
Algorithm 2. FD2CN
Input: A dataset r(U) over variable set U.
Output: A DAG D = ⟨U, E⟩ of a CN learned from r(U).
Begin
1. F = FD_Mine(r(U)). // returns a set of left-reduced FDs
2. Obtain an ordering O using Algorithm 1.
3. Ui = {}.
4. while O is not empty
5.    ai = pophead(O).
6.    if there exists a FD X → ai ∈ F then
7.       Bi = X.
8.    else
9.       Bi = Ui.
10.   Ui = Ui ∪ {ai}.
11. end while
12. Construct a DAG D = ⟨U, E⟩ such that E = {(b, ai) | b ∈ Bi, ai, b ∈ U}.
End

Fig. 49.3. The algorithm, FD2CN, to learn a DAG of a CN from data.
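FD2CN is then a thin layer over Algorithm 1: pop each variable from O, take Bi from an FD when the variable is decided, and fall back to Bi = Ui otherwise. The sketch below is our own (the FD-discovery step FD_Mine is not reproduced; we start from the ordering of Example 5 and the FDs of Example 4):

```python
def fd2cn(O, F):
    """Algorithm 2 (FD2CN) sketch, given an ordering O (list) and a set F of
    left-reduced FDs (list of (frozenset lhs, rhs) pairs).
    Returns the edge set of the DAG and the Markov boundary of each variable."""
    Ui = set()                              # line 3
    boundaries = {}
    for a in O:                             # lines 4-5: consume O head to tail
        # lines 6-9: use the FD's left side if a is decided, else the prefix Ui
        fd = next((X for X, b in F if b == a and X <= Ui), None)
        boundaries[a] = set(fd) if fd is not None else set(Ui)
        Ui.add(a)                           # line 10
    # line 12: each member of Bi becomes a parent of ai
    edges = {(b, a) for a, B in boundaries.items() for b in B}
    return edges, boundaries

# The ordering of Example 5 and the FDs of Example 4:
O = ['a1', 'a5', 'a10', 'a12', 'a9', 'a4', 'a2', 'a8',
     'a7', 'a13', 'a11', 'a6', 'a3']
F = [(frozenset(lhs), rhs) for lhs, rhs in [
    ({'a1', 'a5'}, 'a3'), ({'a1', 'a5'}, 'a6'), ({'a1', 'a5'}, 'a11'),
    ({'a1', 'a5'}, 'a13'), ({'a1', 'a8'}, 'a7'), ({'a4', 'a5', 'a9'}, 'a2'),
    ({'a1', 'a5', 'a10'}, 'a4'), ({'a1', 'a2', 'a5'}, 'a8'),
    ({'a1', 'a5', 'a10'}, 'a9'), ({'a1', 'a5', 'a10'}, 'a12'),
]]
edges, B = fd2cn(O, F)
print(sorted(B['a2']))   # ['a4', 'a5', 'a9'], as in Example 6
print(sorted(B['a10']))  # ['a1', 'a5']: a10 is undecided, so B10 = U10
```

The `X <= Ui` guard is strictly redundant here: Theorem 4 guarantees that a decided variable's determinants precede it in O.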
In line 6 of Algorithm FD2CN, we determine whether or not a variable is decided. If so, in line 7, we obtain its Markov boundary from the FD; if not, in line 9, we obtain its Markov boundary from Ui. In line 12, a DAG of a CN is constructed by making each Bi the parent set of ai. In other words, if b ∈ Bi, then we add an edge b → ai to the DAG D. According to Definition 4 and Theorem 1, we know the constructed DAG D is a DAG of a CN.
Example 6. Applying Algorithm FD2CN to the heart disease dataset, F in line 1 is obtained as in Example 4, and O in line 2 is obtained as in Example 5. According to the obtained O, we obtain B1 = {}, B5 = {a1}, B10 = {a1, a5}, B12 = {a1, a5, a10}, B9 = {a1, a5, a10}, B4 = {a1, a5, a10}, B2 = {a4, a5, a9}, B8 = {a1, a2, a5}, B7 = {a1, a8}, B13 = {a1, a5}, B11 = {a1, a5}, B6 = {a1, a5}, and B3 = {a1, a5}. By making each Bi the parent set of ai, a DAG D is constructed. For example, since B2 = {a4, a5, a9}, D has the edges a4 → a2, a5 → a2, and a9 → a2. The DAG of a CN learned from the heart disease dataset is depicted in Figure 49.4.
49.5 Experimental Results
Experiments were carried out on fifteen real-world datasets obtained from the UCI Machine Learning Repository (Blake and Merz, 1998). The results are shown in Figure 49.5. The last column gives the elapsed time to construct a CN, measured on a 1 GHz Pentium III PC with 256 MB RAM. The results show that the processing time is mainly determined by the number of attributes. Since the results also indicate that many FDs hold in some datasets, our proposed approach is a feasible way to learn a CN.

[Figure: the DAG over a1, ..., a13 whose edges follow from the Markov boundaries listed in Example 6.]

Fig. 49.4. The learned DAG of a CN from the heart disease dataset.
[Table columns: Dataset Name, # of attributes, # of rows, # of FDs, Time (seconds).]

Fig. 49.5. Experimental results using fifteen real-world datasets.
49.6 Conclusion
In this chapter, we presented a novel method for learning a CN. Although a CN encodes probabilistic conditional independencies, our method is based on learning FDs (Yao et al., 2002). Since functional dependency logically implies conditional independency (Butz et al., 1999), we described how to construct a CN from data dependencies. We implemented our approach, and encouraging experimental results have been obtained.

Since functional dependency is a special case of conditional independency, we acknowledge that our approach may not utilize all the independency information encoded in the sample data. However, previous methods also suffer from this disadvantage, as learning all CIs from sample data is an NP-hard problem (Bouckaert, 1994).
References

Blake, C. L. and Merz, C. J. (1998). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
Bouckaert, R. (1994). Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 102-109.
Butz, C. J., Wong, S. K. M., and Yao, Y. Y. (1999). On data and probabilistic dependencies. In IEEE Canadian Conference on Electrical and Computer Engineering, 1692-1697.
Maier, D. (1983). The Theory of Relational Databases, Computer Science Press.
Neapolitan, R. E. (2003). Learning Bayesian Networks, Prentice Hall.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers.
Wong, S. K. M., Butz, C. J., and Wu, D. (2000). On the implication problem for probabilistic conditional independency. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 30(6), 785-805.
Wong, S. K. M., and Butz, C. J. (2001). Constructing the dependency structure of a multi-agent probabilistic network. IEEE Transactions on Knowledge and Data Engineering, 13(3), 395-415.
Yao, H., Hamilton, H. J., and Butz, C. J. (2002). FD_Mine: Discovering functional dependencies in a database using equivalences. In Proceedings of the Second IEEE International Conference on Data Mining, 729-732.
Ensemble Methods in Supervised Learning
Lior Rokach
Department of Information Systems Engineering
Ben-Gurion University of the Negev
liorrk@bgu.ac.il
Summary. The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well known that ensemble methods can be used for improving prediction performance. In this chapter we provide an overview of ensemble methods in classification tasks. We present all important types of ensemble methods, including boosting and bagging. Combining methods and modeling issues such as ensemble diversity and ensemble size are discussed.
Key words: Ensemble, Boosting, AdaBoost, Windowing, Bagging, Grading, Arbiter Tree, Combiner Tree
50.1 Introduction
The main idea of ensemble methodology is to combine a set of models, each of which solves the same original task, in order to obtain a better composite global model, with more accurate and reliable estimates or decisions than can be obtained from using a single model. The idea of building a predictive model by integrating multiple models has been under investigation for a long time. Bühlmann and Yu (2003) pointed out that the history of ensemble methods starts as early as 1977 with Tukey's twicing, an ensemble of two linear regression models. Ensemble methods can also be used for improving the quality and robustness of clustering algorithms (Dimitriadou et al., 2003). Nevertheless, in this chapter we focus on classifier ensembles.
In the past few years, experimental studies conducted by the machine-learning community have shown that combining the outputs of multiple classifiers reduces the generalization error (Domingos, 1996, Quinlan, 1996, Bauer and Kohavi, 1999, Opitz and Maclin, 1999). Ensemble methods are very effective, mainly due to the phenomenon that various types of classifiers have different "inductive biases" (Geman et al., 1995, Mitchell, 1997). Indeed, ensemble methods can effectively make use of such diversity to reduce the variance error (Tumer and Ghosh, 1999, Ali and Pazzani, 1996) without increasing the bias error. In certain situations, an ensemble can also reduce the bias error, as shown by the theory of large margin classifiers (Bartlett and Shawe-Taylor, 1998).
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_50, © Springer Science+Business Media, LLC 2010