It has been shown that FD logically implies CI (Butz et al., 1999). We show how to combine the obtained FDs with the chain rule of probability to construct a DAG of a CN. Given a set of FDs obtained from data, an ordering of variables is obtained such that the Markov boundaries of some variables are determined. Representing the joint probability distribution of the variables in the resulting ordering by the chain rule, the Markov boundaries of the other variables are determined. A DAG of a CN is constructed by designating each variable's Markov boundary as the parent set of that variable. During this process, we take full advantage of known results in both CNs (Pearl, 1988) and relational databases (Maier, 1983). We demonstrate the effectiveness of our approach using fifteen real-world datasets. The DAG constructed in our approach can also be used as an initial DAG for previous approaches. The work here further illustrates the intrinsic relationship between CNs and relational databases (Wong et al., 2000, Wong and Butz, 2001).
The remainder of this chapter is organized as follows. Background knowledge is given in Section 49.2. In Section 49.3, the theoretical foundation of our approach is provided. The algorithm to construct a CN is developed in Section 49.4. In Section 49.5, the experimental results are presented. Conclusions are drawn in Section 49.6.
49.2 Background Knowledge
Let U be a finite set of discrete variables, each with a finite domain. Let V be the Cartesian product of the variable domains. A joint probability distribution (Pearl, 1988) p(U) is a function p on V such that 0 ≤ p(v) ≤ 1 for each configuration v ∈ V and ∑v∈V p(v) = 1.0. The marginal distribution p(X) for X ⊆ U is defined as ∑U−X p(U). If p(X) > 0, then the conditional probability distribution p(Y|X) for X, Y ⊆ U is defined as p(XY)/p(X). In this chapter, we may write ai for the singleton set {ai}, and we use the terms attribute and variable interchangeably; similarly for the terms tuple and configuration.
Definition 1. A causal network (CN) is a directed acyclic graph (DAG) D together with a conditional probability distribution (CPD) p(ai|Pi) for each variable ai in D, where Pi denotes the parent set of ai in D.

The DAG D graphically encodes CIs regarding the variables in U.
Definition 2 (Wong et al., 2000). Let X, Y, and Z be three disjoint sets of variables. X is said to be conditionally independent of Y given Z, denoted I(X,Z,Y), if p(X|Y,Z) = p(X|Z).
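Definition 2 can be checked directly on a tabulated joint distribution. The sketch below is our own illustration (the dictionary-based representation and function names are assumptions, not from the chapter); it tests I(X,Z,Y) via the factorized form p(X,Y|Z) = p(X|Z)p(Y|Z), which is equivalent to p(X|Y,Z) = p(X|Z) wherever the conditionals are defined:

```python
from itertools import product
from collections import defaultdict

def marginal(p, vars_, on):
    """Marginalize the joint p (dict: full config tuple -> prob) onto the
    variables listed in `on`; vars_ is the ordered list of all variables."""
    idx = [vars_.index(v) for v in on]
    m = defaultdict(float)
    for config, prob in p.items():
        m[tuple(config[i] for i in idx)] += prob
    return m

def is_ci(p, vars_, X, Z, Y, tol=1e-9):
    """Check I(X, Z, Y): p(X,Y|Z) == p(X|Z) * p(Y|Z) wherever p(Z) > 0."""
    pz   = marginal(p, vars_, Z)
    pxz  = marginal(p, vars_, X + Z)
    pyz  = marginal(p, vars_, Y + Z)
    pxyz = marginal(p, vars_, X + Y + Z)
    for xyz, prob in pxyz.items():
        x = xyz[:len(X)]
        y = xyz[len(X):len(X) + len(Y)]
        z = xyz[len(X) + len(Y):]
        if pz[z] > 0:
            lhs = prob / pz[z]
            rhs = (pxz[x + z] / pz[z]) * (pyz[y + z] / pz[z])
            if abs(lhs - rhs) > tol:
                return False
    return True

# A toy joint over (a1, a2, a3) in which a2 and a3 are independent given a1:
vars_ = ['a1', 'a2', 'a3']
p_a1 = {0: 0.4, 1: 0.6}
p_a2_given_a1 = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}
p_a3_given_a1 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}
p = {}
for v1, v2, v3 in product([0, 1], repeat=3):
    p[(v1, v2, v3)] = p_a1[v1] * p_a2_given_a1[v1][v2] * p_a3_given_a1[v1][v3]

print(is_ci(p, vars_, ['a2'], ['a1'], ['a3']))  # True: I(a2, a1, a3) holds
print(is_ci(p, vars_, ['a2'], [], ['a3']))      # False: a2, a3 marginally dependent
```

The exhaustive enumeration is exponential in |U|, which is acceptable only for small toy distributions such as the ones used in the examples of this chapter.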
As previously mentioned, the CIs encoded in the DAG D indicate that the product of the given CPDs is a joint probability distribution p(U).
Example 1. One CN on the set U = {a1,a2,a3,a4,a5,a6} is the DAG in Figure 49.1(i) together with the CPDs p(a1), p(a2|a1), p(a3|a1), p(a4|a2), p(a5|a3), and p(a6|a4,a5). This DAG encodes, in particular, I(a3,a1,a2), I(a4,a2,a1a3), I(a5,a3,a1a2a4), and I(a6,a4a5,a1a2a3).
By the chain rule, the joint probability distribution p(U) can be expressed as:
p(U) = p(a1)p(a2|a1)p(a3|a1,a2)p(a4|a1,a2,a3)p(a5|a1,a2,a3,a4)
p(a6|a1,a2,a3,a4,a5).
The above CIs can be used to rewrite p(U) as:

p(U) = p(a1)p(a2|a1)p(a3|a1)p(a4|a2)p(a5|a3)p(a6|a4,a5).
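To make the reduction concrete, one can build p(U) from the six CPDs of Example 1 and verify both that the product is a valid joint distribution and that an encoded CI, e.g. I(a4, a2, a1a3), holds. The numeric CPD tables below are made up purely for illustration:

```python
from itertools import product

# Illustrative binary CPDs for the DAG of Figure 49.1(i); the values are invented.
p_a1 = 0.6                       # p(a1 = 1)
p_a2 = {0: 0.3, 1: 0.7}          # p(a2 = 1 | a1)
p_a3 = {0: 0.5, 1: 0.2}          # p(a3 = 1 | a1)
p_a4 = {0: 0.4, 1: 0.9}          # p(a4 = 1 | a2)
p_a5 = {0: 0.1, 1: 0.8}          # p(a5 = 1 | a3)
p_a6 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(a6 = 1 | a4, a5)

def bern(p1, v):
    """Probability of value v under a Bernoulli with success probability p1."""
    return p1 if v == 1 else 1.0 - p1

# Joint via the reduced factorization p(a1)p(a2|a1)p(a3|a1)p(a4|a2)p(a5|a3)p(a6|a4,a5):
joint = {}
for v in product((0, 1), repeat=6):
    a1, a2, a3, a4, a5, a6 = v
    joint[v] = (bern(p_a1, a1) * bern(p_a2[a1], a2) * bern(p_a3[a1], a3)
                * bern(p_a4[a2], a4) * bern(p_a5[a3], a5) * bern(p_a6[(a4, a5)], a6))

print(round(sum(joint.values()), 10))   # 1.0: a valid joint distribution

# One encoded CI: p(a4 | a1, a2, a3) should equal p(a4 | a2).
def cond_a4(fix):
    """p(a4 = 1 | fixed variables), where fix maps variable index -> value."""
    num = sum(pr for cfg, pr in joint.items()
              if cfg[3] == 1 and all(cfg[i] == val for i, val in fix.items()))
    den = sum(pr for cfg, pr in joint.items()
              if all(cfg[i] == val for i, val in fix.items()))
    return num / den

print(round(cond_a4({0: 1, 1: 0, 2: 1}), 10))  # p(a4=1 | a1=1, a2=0, a3=1)
print(round(cond_a4({1: 0}), 10))              # p(a4=1 | a2=0): same value, 0.4
```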
[Figure: (i) a DAG with edges a1 → a2, a1 → a3, a2 → a4, a3 → a5, a4 → a6, a5 → a6; (ii) the same DAG with the edge between a1 and a2 reversed.]

Fig. 49.1. Two causal networks.
The term CN is somewhat misleading because it may be possible to reverse a directed edge without disturbing the encoded CI information. For example, the two CNs in Figure 49.1 encode the same independency information. Thus, it is perhaps better to view a CN as encoding independency information rather than causal relationships.

Using the CI information encoded in the DAG, the Markov boundary of a variable can be defined.
Definition 3. Let U = {a1, ..., an}, and let O be an ordering a1, ..., an of the variables of U. Let Ui = {a1, ..., ai−1} be a subset of U with respect to the ordering O. A Markov boundary of a variable ai over Ui, denoted Bi, is any subset X of Ui such that p(Ui) satisfies I(ai, X, Ui − X − ai) and does not satisfy I(ai, X′, Ui − X′ − ai) for any X′ ⊂ X.
Example 2. Recall the DAG in Figure 49.1(i), and let the ordering O be a1, a2, a3, a4, a5, a6. Then U1 = {}, U2 = {a1}, U3 = {a1, a2}, U4 = {a1, a2, a3}, U5 = {a1, a2, a3, a4}, and U6 = {a1, a2, a3, a4, a5}. The Markov boundary of variable a3 over U3 is B3 = {a1}, since p(U3) satisfies I(a3, a1, a2) but does not satisfy I(a3, ∅, a1a2).
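Definition 3 suggests a brute-force (exponential) procedure: enumerate the subsets X of Ui in order of increasing size and return the first one satisfying I(ai, X, Ui − X − ai). Any proper subset of the returned set was checked at a smaller size and failed, so the result is minimal. The sketch below is our own illustration (helper names and the toy distribution are assumptions), mirroring Example 2:

```python
from itertools import combinations, product
from collections import defaultdict

def marg(p, vars_, keep):
    """Marginalize joint p (dict: config tuple -> prob) onto `keep`."""
    idx = [vars_.index(v) for v in keep]
    m = defaultdict(float)
    for cfg, pr in p.items():
        m[tuple(cfg[i] for i in idx)] += pr
    return dict(m)

def holds_ci(p, vars_, a, X, rest, tol=1e-9):
    """Does p satisfy I(a, X, rest), i.e. p(a, rest | X) = p(a | X) p(rest | X)?"""
    pX, paX, prX, parX = (marg(p, vars_, s) for s in
                          (X, [a] + X, rest + X, [a] + rest + X))
    for cfg, pr in parX.items():
        av, rv, xv = cfg[:1], cfg[1:1 + len(rest)], cfg[1 + len(rest):]
        if pX.get(xv, 0) > 0:
            lhs = pr / pX[xv]
            rhs = (paX[av + xv] / pX[xv]) * (prX[rv + xv] / pX[xv])
            if abs(lhs - rhs) > tol:
                return False
    return True

def markov_boundary(p, vars_, a, Ui):
    """A minimal X subset of Ui with I(a, X, Ui - X - a), by increasing size."""
    for k in range(len(Ui) + 1):
        for X in combinations(Ui, k):
            rest = [v for v in Ui if v not in X]
            if holds_ci(p, vars_, a, list(X), rest):
                return set(X)
    return set(Ui)  # unreachable: X = Ui always satisfies the CI trivially

# Toy distribution over (a1, a2, a3) where a3 depends only on a1 (cf. Example 2):
vars_ = ['a1', 'a2', 'a3']
p = {}
for a1, a2, a3 in product((0, 1), repeat=3):
    pa2 = 0.7 if a2 == a1 else 0.3          # a2 correlated with a1
    pa3 = 0.9 if a3 == a1 else 0.1          # a3 depends on a1 only
    p[(a1, a2, a3)] = 0.5 * pa2 * pa3

print(markov_boundary(p, vars_, 'a3', ['a1', 'a2']))  # {'a1'}, as in Example 2
```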
The Markov boundary Bi of a variable ai over Ui encodes the CI I(ai, Bi, Ui − Bi − ai) over p(Ui). Using the Markov boundary of each variable, a boundary DAG is defined as follows.
Definition 4. Let p(U) be a joint probability distribution over U, O be an ordering a1, ..., an of the variables of U, and Ui = {a1, ..., ai−1} be a subset of U with respect to the ordering O. Let {B1, ..., Bn} be an ordered set of subsets of U such that each Bi is a Markov boundary of ai over Ui. The DAG created by designating each Bi as the parent set of variable ai is called a boundary DAG of p(U) relative to O.
The next theorem (Pearl, 1988) indicates that a boundary DAG of p(U) relative to an ordering O is a DAG of a CN of p(U).

Theorem 1. Let p(U) be a joint probability distribution and O be an ordering of the variables of U. If D is a boundary DAG of p(U) relative to O, then D is a DAG of a CN of p(U).
It is important to realize that, according to Definition 4 and Theorem 1, we can construct a DAG of a CN once the Markov boundary of each variable has been obtained.
Example 3. Let U = {a1, ..., a6} and O = a2, a1, a3, a4, a5, a6 be an ordering of the variables of U. With respect to the ordering O, U1 = {a2}, U2 = {}, U3 = {a1, a2}, U4 = {a1, a2, a3}, U5 = {a1, a2, a3, a4}, and U6 = {a1, a2, a3, a4, a5}. Supposing we assign B1 = {a2}, B2 = {}, B3 = {a1}, B4 = {a2}, B5 = {a3}, and B6 = {a4, a5}, then the DAG shown in Figure 49.1(ii) is the learned DAG of the CN of p(U).
49.3 Theoretical Foundation
In this section, several theorems relevant to our approach are provided. We define a relation r(U) as a finite set of tuples over U. We begin with functional dependency (Maier, 1983).

Definition 5. Let r(U) be a relation over U and X, Y ⊆ U. The functional dependency (FD) X → Y is satisfied by r(U) if every two tuples t1 and t2 of r(U) that agree on X also agree on Y.

If a relation r(U) satisfies the FD X → Y, but not X′ → Y for every X′ ⊂ X, then X → Y is called left-reduced (Maier, 1983).
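Definition 5 and the left-reduced condition translate directly into code. The sketch below is our own illustration (the relation is encoded as a list of dicts and the helper names are assumptions); it checks an FD and its left-reducedness on a toy relation:

```python
from itertools import combinations

def satisfies_fd(r, X, Y):
    """True if relation r (list of dicts) satisfies the FD X -> Y:
    any two tuples agreeing on X must also agree on Y."""
    seen = {}
    for t in r:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

def is_left_reduced(r, X, Y):
    """True if X -> Y holds but no proper subset of X determines Y."""
    if not satisfies_fd(r, X, Y):
        return False
    return all(not satisfies_fd(r, list(Xp), Y)
               for k in range(len(X))
               for Xp in combinations(X, k))

# A toy relation over attributes a1, a2, a3 (a3 is the XOR of a1 and a2):
r = [
    {'a1': 0, 'a2': 0, 'a3': 0},
    {'a1': 0, 'a2': 1, 'a3': 1},
    {'a1': 1, 'a2': 0, 'a3': 1},
    {'a1': 1, 'a2': 1, 'a3': 0},
]
print(satisfies_fd(r, ['a1', 'a2'], ['a3']))    # True: a1a2 -> a3
print(is_left_reduced(r, ['a1', 'a2'], ['a3'])) # True: neither a1 nor a2 alone suffices
```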
The next theorem shows that FD logically implies CI.

Theorem 2 (Butz et al., 1999). Let r(U) be a relation over U, let p(U) be a joint distribution over r(U), and let X, Y ⊆ U and Z = U − XY. The FD X → Y being satisfied by r(U) is a sufficient condition for the CI I(Y, X, Z) to be satisfied by p(U).
By exploiting the implication relationship between functional dependency and conditional independency, we can relate a left-reduced FD X → ai to the Markov boundary of variable ai.

Theorem 3. Let U = {a1, ..., an}, Ui be a subset of U, and X ⊆ Ui. If the FD X → ai is left-reduced, then X is the Markov boundary of variable ai over Ui.

Proof: Since X → ai is a FD and X ⊆ Ui, according to Definition 5, X → ai holds over Ui ∪ {ai}. Since the FD X → ai is left-reduced, by Theorem 2, p(Ui) satisfies I(ai, X, Ui − X − ai) but not I(ai, X′, Ui − X′ − ai) for any X′ ⊂ X.
Theorem 3 indicates that the Markov boundary of variable ai over Ui can be learned from a left-reduced FD X → ai, provided X is a subset of Ui. We define those variables whose boundaries can be learned from a set of left-reduced FDs as follows.

Definition 6. Let U = {a1, ..., an}, X ⊂ U, ai ∉ X, and F be a set of left-reduced FDs over U. If there exists X → ai ∈ F, then ai is a decided variable. Otherwise, ai is an undecided variable.

Considering a variable as decided indicates that its Markov boundary can be learned from the FDs implied in the data.
49.4 Learning a DAG of CN by FDs
In this section, we use learned FDs to construct a CN. We illustrate our algorithm using the heart disease dataset, which contains 13 attributes and 230 rows, from the UCI Machine Learning Repository (Blake and Merz, 1998).

Example 4. The heart disease dataset has U = {a1, ..., a13}. Using FD_Mine, the discovered set of left-reduced FDs is F = {a1a5 → a3, a1a5 → a6, a1a5 → a11, a1a5 → a13, a1a8 → a7, a4a5a9 → a2, a1a5a10 → a4, a1a2a5 → a8, a1a5a10 → a9, a1a5a10 → a12}.
As indicated by Theorem 1, a DAG of a CN is constructed once the Markov boundary of each variable relative to an ordering O is obtained. We obtain the Markov boundary of each variable in two steps. First, in Section 49.4.1, we show how to obtain an ordering O such that the Markov boundary of each decided variable with respect to O can be obtained from the given FDs. Second, we determine the Markov boundary of each undecided variable by the chain rule in Section 49.4.2.
49.4.1 Learning an Ordering of Variables from FDs
Given a set F of left-reduced FDs, Algorithm 1 in Figure 49.2 determines an ordering O of the variables of U such that the Markov boundary of each decided variable with respect to O can be obtained from F. We use the FDs in Example 4 to demonstrate how Algorithm 1 works.
Example 5. First, the FD a1a5 → a3 is selected in line 2 by Algorithm 1. In line 3, since a3 is not in Y for any Y → ai (ai ≠ a3) in F, variable a3 is removed from U in line 4; thus U = {a1, a2, a4, ..., a13}. We obtain O = a3 in line 5. In line 6, the FD a1a5 → a3 is removed from F. Because F is not empty, Algorithm 1 continues performing lines 2−8. This time, the FD a1a5 → a6 is selected in line 2. Variable a6 is removed from U in line 4, giving U = {a1, a2, a4, a5, a7, ..., a13}. We obtain O = a6, a3 in line 5. By repeatedly performing lines 2 to 8 until F is empty, we obtain an ordering O = a12, a9, a4, a2, a8, a7, a13, a11, a6, a3 and U = {a1, a5, a10}. In line 9, we prepend U to the head of O. Thus, for the variables in U, O = a1, a5, a10, a12, a9, a4, a2, a8, a7, a13, a11, a6, a3.
The next theorem guarantees that the Markov boundary of each decided variable is determined by the ordering O obtained by Algorithm 1.

Theorem 4. Let U = {a1, ..., an}, O = a1, ..., an be an ordering obtained by Algorithm 1, and Ui = {a1, ..., ai−1} be a subset of U with respect to O. Let Bi be the Markov boundary of ai over Ui. If ai is a decided variable, then Bi ⊆ Ui.

Proof: Since ai is a decided variable, there exists a FD X → ai. When Algorithm 1 performs lines 2-8, ai is always deleted from U before any variable of X, according to lines 3-5 of Algorithm 1. Thus, for any aj ∈ X, aj always appears before ai in O, so we have X ⊆ Ui. Since Bi = X, according to Theorem 3, Bi ⊆ Ui.
Algorithm 1.
Input: U = {a1, ..., an}, and a set F of left-reduced FDs.
Output: an ordered list O of the variables of U.
Begin
1. O = ⟨⟩.
2. for each X → ai ∈ F
3.    if ai does not appear in Y for any Y → aj ∈ F then
4.       U = U − {ai}.
5.       prepend ai to the head of O.
6.       F = F − {Y → ai | Y → ai ∈ F}.
7.    end if
8. end for
9. prepend U to the head of O.
10. return(O).
End

Fig. 49.2. An algorithm to obtain an ordering O of the variables of U.
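Algorithm 1 can be transcribed into Python as follows. This is our own sketch: FDs are represented as (frozenset, variable) pairs, the outer while-loop makes the repetition of lines 2-8 described in Example 5 explicit, and both the FD processing order and the order of the undecided variables are resolved arbitrarily, so the resulting ordering may differ from the one in Example 5 while still satisfying Theorem 4:

```python
def fd_ordering(U, F):
    """Algorithm 1 sketch: order the variables of U so that every decided
    variable appears after all variables in the left side of its FD.
    U: iterable of variable names; F: list of (frozenset lhs, rhs) pairs."""
    U, F = set(U), list(F)
    O = []                                  # line 1: O starts empty
    while F:                                # lines 2-8 repeat until F is empty
        progressed = False
        for X, a in list(F):                # line 2
            if (X, a) not in F:             # already removed earlier in this pass
                continue
            # line 3: a must not appear in the left side of any remaining FD
            if all(a not in Y for Y, _ in F):
                U.discard(a)                            # line 4
                O.insert(0, a)                          # line 5: prepend a to O
                F = [(Y, b) for Y, b in F if b != a]    # line 6
                progressed = True
        if not progressed:                  # guard against cyclic FD sets
            break
    return sorted(U) + O                    # line 9: undecided variables first

# The ten left-reduced FDs of Example 4 (heart disease dataset):
F = [(frozenset(lhs), rhs) for lhs, rhs in [
    ({'a1', 'a5'}, 'a3'), ({'a1', 'a5'}, 'a6'), ({'a1', 'a5'}, 'a11'),
    ({'a1', 'a5'}, 'a13'), ({'a1', 'a8'}, 'a7'), ({'a4', 'a5', 'a9'}, 'a2'),
    ({'a1', 'a5', 'a10'}, 'a4'), ({'a1', 'a2', 'a5'}, 'a8'),
    ({'a1', 'a5', 'a10'}, 'a9'), ({'a1', 'a5', 'a10'}, 'a12'),
]]
U = {f'a{i}' for i in range(1, 14)}
O = fd_ordering(U, F)
print(O[:3])  # the undecided variables a1, a10, a5 come first
```

Whatever tie-breaking is used, the property that matters is the one in Theorem 4: every decided variable follows all of its FD's left-side variables in O.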
49.4.2 Learning the Markov Boundaries of Undecided Variables
Once an ordering O = a1, ..., an of the variables of U is obtained by Algorithm 1, the joint probability distribution p(U) can be expressed using the chain rule as follows:

p(U) = p(a1) ··· p(aj | a1, ..., aj−1) ··· p(an | a1, ..., an−1).   (49.1)
If ai is an undecided variable, then there is no FD X → ai ∈ F. According to Algorithm 1, ai is not deleted from U and is prepended to the head of O in line 9. This indicates that all undecided variables appear before all decided variables in O.

Suppose a1, ..., aj are all the undecided variables, and Bj+1, ..., Bn are the Markov boundaries of all the decided variables. By Definition 3, the CI I(ai, Bi, Ui − Bi − ai) holds for each variable ai, j+1 ≤ i ≤ n. Thus, each p(ai | a1, ..., ai−1) = p(ai | Bi), and Equation 49.1 can be rewritten as:
p(U) = p(a1) ··· p(aj | a1, ..., aj−1) p(aj+1 | Bj+1) ··· p(an | Bn).   (49.2)
By assigning Bk = {a1, ..., ak−1} as the Markov boundary of each undecided variable ak, 1 ≤ k ≤ j, Equation 49.2 can be expressed as:
p(U) = p(a1 | B1) ··· p(aj | Bj) ··· p(an | Bn) = ∏ai∈U p(ai | Bi).   (49.3)
Equation 49.3 indicates that a joint probability distribution p(U) can be represented by the Markov boundaries of all variables of U. Thus, a boundary DAG relative to O can be constructed. Based on the above analysis, we developed the algorithm shown in Figure 49.3, called FD2CN, which learns a DAG D of a CN from a dataset r(U).
Algorithm 2. FD2CN
Input: A dataset r(U) over variable set U.
Output: A DAG D = ⟨U, E⟩ of a CN learned from r(U).
Begin
1. F = FD_Mine(r(U)). // returns a set of left-reduced FDs
2. Obtain an ordering O using Algorithm 1.
3. Ui = {}.
4. while O is not empty
5.    ai = pophead(O).
6.    if there exists a FD X → ai ∈ F then
7.       Bi = X.
8.    else
9.       Bi = Ui.
10.   Ui = Ui ∪ {ai}.
11. end while
12. Construct a DAG D = ⟨U, E⟩ such that E = {(b, ai) | b ∈ Bi, ai, b ∈ U}.
End

Fig. 49.3. The algorithm, FD2CN, to learn a DAG of a CN from data.
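FD2CN is then a thin layer over Algorithm 1: pop each variable from O, take Bi from an FD when the variable is decided, and fall back to Bi = Ui otherwise. The sketch below is our own (the FD-discovery step FD_Mine is not reproduced; we start from the ordering of Example 5 and the FDs of Example 4):

```python
def fd2cn(O, F):
    """Algorithm 2 (FD2CN) sketch, given an ordering O (list) and a set F of
    left-reduced FDs (list of (frozenset lhs, rhs) pairs).
    Returns the edge set of the DAG and the Markov boundary of each variable."""
    Ui = set()                              # line 3
    boundaries = {}
    for a in O:                             # lines 4-5: consume O head to tail
        # lines 6-9: use the FD's left side if a is decided, else the prefix Ui
        fd = next((X for X, b in F if b == a and X <= Ui), None)
        boundaries[a] = set(fd) if fd is not None else set(Ui)
        Ui.add(a)                           # line 10
    # line 12: each member of Bi becomes a parent of ai
    edges = {(b, a) for a, B in boundaries.items() for b in B}
    return edges, boundaries

# The ordering of Example 5 and the FDs of Example 4:
O = ['a1', 'a5', 'a10', 'a12', 'a9', 'a4', 'a2', 'a8',
     'a7', 'a13', 'a11', 'a6', 'a3']
F = [(frozenset(lhs), rhs) for lhs, rhs in [
    ({'a1', 'a5'}, 'a3'), ({'a1', 'a5'}, 'a6'), ({'a1', 'a5'}, 'a11'),
    ({'a1', 'a5'}, 'a13'), ({'a1', 'a8'}, 'a7'), ({'a4', 'a5', 'a9'}, 'a2'),
    ({'a1', 'a5', 'a10'}, 'a4'), ({'a1', 'a2', 'a5'}, 'a8'),
    ({'a1', 'a5', 'a10'}, 'a9'), ({'a1', 'a5', 'a10'}, 'a12'),
]]
edges, B = fd2cn(O, F)
print(sorted(B['a2']))   # ['a4', 'a5', 'a9'], as in Example 6
print(sorted(B['a10']))  # ['a1', 'a5']: a10 is undecided, so B10 = U10
```

The `X <= Ui` guard is strictly redundant here: Theorem 4 guarantees that a decided variable's determinants precede it in O.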
In line 6 of Algorithm FD2CN, we determine whether or not a variable is decided. If so, in line 7, we obtain its Markov boundary from the FD; if not, in line 9, we obtain its Markov boundary from Ui. In line 12, a DAG of a CN is constructed by making each Bi the parent set of ai. In other words, if b ∈ Bi, then we add an edge b → ai to the DAG D. According to Definition 4 and Theorem 1, we know the constructed DAG D is a DAG of a CN.
Example 6. Applying Algorithm FD2CN to the heart disease dataset, F in line 1 is obtained as in Example 4, and O in line 2 is obtained as in Example 5. According to the obtained O, we obtain B1 = {}, B5 = {a1}, B10 = {a1, a5}, B12 = {a1, a5, a10}, B9 = {a1, a5, a10}, B4 = {a1, a5, a10}, B2 = {a4, a5, a9}, B8 = {a1, a2, a5}, B7 = {a1, a8}, B13 = {a1, a5}, B11 = {a1, a5}, B6 = {a1, a5}, and B3 = {a1, a5}. By making each Bi the parent set of ai, a DAG D is constructed. For example, since B2 = {a4, a5, a9}, D has the edges a4 → a2, a5 → a2, and a9 → a2. The DAG of a CN learned from the heart disease dataset is depicted in Figure 49.4.
49.5 Experimental Results
Experiments were carried out on fifteen real-world datasets obtained from the UCI Machine Learning Repository (Blake and Merz, 1998). The results are shown in Figure 49.5. The last column gives the elapsed time to construct a CN, measured on a 1 GHz Pentium III PC with 256 MB RAM. The results show that the processing time is mainly determined by the number of attributes. Since the results also indicate that many FDs hold in some datasets, our proposed approach is a feasible way to learn a CN.

[Figure: the DAG over a1, ..., a13 whose edges follow from the Markov boundaries listed in Example 6.]

Fig. 49.4. The learned DAG of a CN from the heart disease dataset.
[Table columns: Dataset Name, # of attributes, # of rows, # of FDs, Time (seconds).]

Fig. 49.5. Experimental results using fifteen real-world datasets.
49.6 Conclusion
In this chapter, we presented a novel method for learning a CN. Although a CN encodes probabilistic conditional independencies, our method is based on learning FDs (Yao et al., 2002). Since functional dependency logically implies conditional independency (Butz et al., 1999), we described how to construct a CN from data dependencies. We implemented our approach, and encouraging experimental results have been obtained.

Since functional dependency is a special case of conditional independency, we acknowledge that our approach may not utilize all the independency information encoded in the sample data. However, previous methods also suffer from this disadvantage, as learning all CIs from sample data is an NP-hard problem (Bouckaert, 1994).
References

Blake, C. L. and Merz, C. J. (1998). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
Bouckaert, R. (1994). Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 102-109.
Butz, C. J., Wong, S. K. M., and Yao, Y. Y. (1999). On data and probabilistic dependencies. In IEEE Canadian Conference on Electrical and Computer Engineering, 1692-1697.
Maier, D. (1983). The Theory of Relational Databases, Computer Science Press.
Neapolitan, R. E. (2003). Learning Bayesian Networks, Prentice Hall.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers.
Wong, S. K. M., Butz, C. J., and Wu, D. (2000). On the implication problem for probabilistic conditional independency. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 30(6), 785-805.
Wong, S. K. M., and Butz, C. J. (2001). Constructing the dependency structure of a multi-agent probabilistic network. IEEE Transactions on Knowledge and Data Engineering, 13(3), 395-415.
Yao, H., Hamilton, H. J., and Butz, C. J. (2002). FD_Mine: Discovering functional dependencies in a database using equivalences. In Proceedings of the Second IEEE International Conference on Data Mining, 729-732.
Ensemble Methods in Supervised Learning
Lior Rokach
Department of Information Systems Engineering
Ben-Gurion University of the Negev
liorrk@bgu.ac.il
Summary. The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well known that ensemble methods can be used for improving prediction performance. In this chapter we provide an overview of ensemble methods in classification tasks. We present all important types of ensemble methods, including boosting and bagging. Combining methods and modeling issues such as ensemble diversity and ensemble size are discussed.
Key words: Ensemble, Boosting, AdaBoost, Windowing, Bagging, Grading, Arbiter Tree, Combiner Tree
50.1 Introduction
The main idea of ensemble methodology is to combine a set of models, each of which solves the same original task, in order to obtain a better composite global model, with more accurate and reliable estimates or decisions than can be obtained from using a single model. The idea of building a predictive model by integrating multiple models has been under investigation for a long time. Bühlmann and Yu (2003) pointed out that the history of ensemble methods starts as early as 1977 with Tukey's twicing, an ensemble of two linear regression models. Ensemble methods can also be used for improving the quality and robustness of clustering algorithms (Dimitriadou et al., 2003). Nevertheless, in this chapter we focus on classifier ensembles.
In the past few years, experimental studies conducted by the machine-learning community have shown that combining the outputs of multiple classifiers reduces the generalization error (Domingos, 1996, Quinlan, 1996, Bauer and Kohavi, 1999, Opitz and Maclin, 1999). Ensemble methods are very effective, mainly due to the phenomenon that various types of classifiers have different "inductive biases" (Geman et al., 1995, Mitchell, 1997). Indeed, ensemble methods can effectively make use of such diversity to reduce the variance error (Tumer and Ghosh, 1999, Ali and Pazzani, 1996) without increasing the bias error. In certain situations, an ensemble can also reduce the bias error, as shown by the theory of large margin classifiers (Bartlett and Shawe-Taylor, 1998).
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_50, © Springer Science+Business Media, LLC 2010