
540 Yoav Benjamini and Moshe Leshno

Storey J.D., Taylor J.E. and Siegmund D. (2004) Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society Series B, 66:187–205.

Therneau T.M. and Grambsch P.M. (2000) Modeling Survival Data, Extending the Cox Model. Springer.

Tibshirani R. and Knight K. (1999) The covariance inflation criterion for adaptive model selection. Journal of the Royal Statistical Society Series B, 61, Part 3:529–546.

Zembowicz R. and Zytkow J.M. (1996) From contingency tables to various forms of knowledge in databases. In U.M. Fayyad, R. Uthurusamy, G. Piatetsky-Shapiro and P. Smyth (editors), Advances in Knowledge Discovery and Data Mining (pp. 329–349). MIT Press.

Zytkow J.M. and Zembowicz R. (1997) Contingency tables as the foundation for concepts, concept hierarchies and rules: The 49er system approach. Fundamenta Informaticae, 30:383–399.


Logics for Data Mining

Petr Hájek

Institute of Computer Science

Academy of Sciences of the Czech Republic

182 07 Prague, Czech Republic

hajek@cs.cas.cz

Summary. Systems of formal (symbolic) logic suitable for Data Mining are presented, with the main emphasis placed on various kinds of generalized quantifiers.

Key words: logic, Data Mining, generalized quantifiers, GUHA method

Introduction

Data Mining, as presently understood, is a broad term that includes the search for “association rules”, classification, regression, clustering and similar tasks. Here we restrict ourselves to the search for “rules” in a rather general sense, namely general dependencies valid in given data and expressed by formulas of a formal logical language. The present theoretical approach is the result of a long development of the GUHA method of automated generation of hypotheses (General Unary Hypotheses Automaton, see the paragraph on the GUHA method in Section 26.3), but it is believed to be fully relevant for contemporary mining of association rules and its possible generalizations. See (Agrawal et al., 1996, Hoppner, 2005, Adamo, 2001) for association rules.

Data are assumed to have the form of one or more tables, matrices or relations. A rectangular matrix may be understood as giving data on objects (corresponding to rows of the matrix) and their attributes (columns). Alternatively, rows may correspond to objects from one set and columns to objects of the same or a different set, and the whole matrix is understood as one binary attribute. In the former case we have variables x, y for objects and a unary predicate for each column (P_i for the i-th column, say); P_i(x) denotes the value of P_i for the object x. In the latter case we have variables for objects from the first set (x, say), other variables for objects from the other set (y, say) and one binary predicate P; then P(x,y) denotes the value of the attribute for the pair (x,y) of objects.

For example, let rows correspond to patients and let the first column record having a fever (yes: value 1, no: value 0). Then patient Novák satisfies P_1(x) if he has a fever. As a second example, let the matrix describe the relation “being a married couple” and take MC for the predicate. Then the couple (Novák, Nováková) satisfies MC(x,y) if they are a married couple, so that the corresponding field in the matrix has value 1. These were examples of Boolean (0-1-valued) data; more generally, attributes may take values from some set of values (reals, colours, ...). It should be clear what the value of P(x) is for an object in the former case and what the value of P(x,y) is for a pair of objects in the latter.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_26, © Springer Science+Business Media, LLC 2010

Logic enables us to construct composed formulas from atomic formulas as above using connectives (such as conjunction, disjunction, implication and negation in Boolean logic, notation: ∧, ∨, →, ¬) and quantifiers (universal ∀ and existential ∃ in classical Boolean logic). Our main message is that in Data Mining we have to deal with generalized quantifiers of a particular kind and that their logical properties are very important. For simplicity we restrict ourselves to two-valued (0-1-valued) data. Our approach generalizes easily to categorical (finitely valued) data when we work with atomic formulas of the form (X)P(x) (or (X)P(x,y) etc.), where X is a subset of the domain of values of the attribute P and an object o satisfies (X)P(x) iff the P-value of o is in X; similarly for (X)P(x,y). (For example, let P denote age in years 0, ..., 100 and let X be the set of numbers n with 0 ≤ n ≤ 30; then an object satisfies (X)P(x) iff its age is at most 30.)

The reader with some knowledge of classical propositional and predicate calculus will have no problems with these notions; the reader having difficulties is advised to consult a textbook of mathematical logic, e.g. (Ebbinghaus et al., 1984).

26.1 Generalized quantifiers

We shall present the notion of a generalized quantifier and supply several examples. (In the next section we shall study various classes of quantifiers.) For simplicity we shall work with data having the form of a rectangular Boolean matrix, rows corresponding to objects and columns to (yes-no) attributes. Predicates P_1, ..., P_n are names of the attributes; we have an object variable x, and P_1(x), ..., P_n(x) are atomic formulas. For each formula ϕ we have its negation ¬ϕ; an object satisfies ¬ϕ if it does not satisfy ϕ. For each pair ϕ, ψ of formulas we have their conjunction ϕ ∧ ψ and disjunction ϕ ∨ ψ. An object satisfies ϕ ∧ ψ if it satisfies both ϕ and ψ; it satisfies ϕ ∨ ψ if it satisfies at least one of ϕ, ψ. Similarly for conjunctions and disjunctions of three, four, ... formulas. An object satisfies the implication ϕ → ψ if it satisfies ψ or does not satisfy ϕ.

Formulas built from atomic formulas using the connectives ¬, ∧, ∨, → are open. Each open formula ϕ defines an attribute: each object of our data either satisfies ϕ or does not satisfy it, and this is uniquely determined by the data. Thus ϕ defines two numbers: r, the number of objects satisfying ϕ, and s, the number of objects satisfying ¬ϕ. The pair (r,s) may be called the two-fold table of ϕ (given by the data); r + s = m is the number of objects in the data (rows of the matrix).

A (one-dimensional) quantifier Q applied to an open formula ϕ describes the behaviour of the attribute defined by ϕ in the data as a whole, i.e. gives a global characterization of the attribute (in the data). The reader surely knows the classical quantifiers ∀ (universal) and ∃ (existential). The formula (∀x)ϕ is true in the data if each object satisfies ϕ (thus s = 0); the formula (∃x)ϕ is true in the data if at least one object satisfies ϕ (thus r ≥ 1). Note that the truth of such a quantified formula does not depend on any ordering of the rows of the data matrix; it is given by the two-fold table of ϕ alone.

A (one-dimensional) quantifier Q is determined by its truth function Tr_Q assigning to each two-fold table (r,s) either 1 (true) or 0 (false). For each open formula ϕ, the closed formula (Qx)ϕ is true in the data iff the two-fold table (r,s) of ϕ (given by the data) satisfies Tr_Q(r,s) = 1. Clearly, Tr_∀(r,s) = 1 iff s = 0 and Tr_∃(r,s) = 1 iff r > 0.

Some other examples:

Majority: Tr_Maj(r,s) = 1 iff r > s; (Maj x)ϕ says that the majority of objects satisfy ϕ.

Many: Let 0 < p < 1; Tr_Many_p(r,s) = 1 iff r/(r + s) ≥ p; (Many_p x)ϕ says that the relative frequency of objects satisfying ϕ is at least p.

At least: Tr_∃≥n(r,s) = 1 iff r ≥ n (at least n objects satisfy ϕ).

Odd: Tr_Odd(r,s) = 1 iff r is an odd number.
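The one-dimensional quantifiers listed so far can be sketched as plain functions on two-fold tables. The following Python fragment is only an illustration: the function names, the sample rows and the convention that Many_p is false on empty data are assumptions made here, not part of the chapter.

```python
def two_fold_table(rows, phi):
    """Return (r, s): how many rows satisfy phi and how many do not."""
    r = sum(1 for row in rows if phi(row))
    return r, len(rows) - r

def tr_forall(r, s):
    return int(s == 0)

def tr_exists(r, s):
    return int(r > 0)

def tr_majority(r, s):
    return int(r > s)

def tr_many(p):
    # Assumption: on empty data (r + s == 0) the quantifier is taken to be false.
    return lambda r, s: int(r + s > 0 and r >= p * (r + s))

def tr_at_least(n):
    return lambda r, s: int(r >= n)

def tr_odd(r, s):
    return int(r % 2 == 1)

def holds(tr, rows, phi):
    """Truth of the closed formula (Qx)phi in the data."""
    return tr(*two_fold_table(rows, phi)) == 1

# Hypothetical data: five objects, two yes/no attributes P1, P2.
rows = [(1, 0), (1, 1), (0, 1), (1, 1), (0, 0)]
p1 = lambda row: row[0] == 1          # the open formula P1(x)

print(two_fold_table(rows, p1))       # (3, 2)
print(holds(tr_majority, rows, p1))   # True
```

Every truth function receives only the pair (r, s), which mirrors the remark that the ordering of rows plays no role.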

The reader may produce many more examples. We shall be more general.

A two-dimensional quantifier Q, applied to a pair ϕ, ψ of open formulas, describes the behaviour of the pair of attributes defined by ϕ and ψ in the data as a whole, and thus gives a global characterization of the mutual relation of ϕ, ψ (in the data). The closed formula given by Q, ϕ, ψ is written as (Qx)(ϕ,ψ). Its truth or falsity in the data is determined by the four-fold table (a,b,c,d), where a, b, c, d denote the numbers of objects in the data satisfying ϕ ∧ ψ, ϕ ∧ ¬ψ, ¬ϕ ∧ ψ, ¬ϕ ∧ ¬ψ respectively. This is often displayed as

         ψ    ¬ψ
  ϕ      a    b    r
  ¬ϕ     c    d    s
         k    l    m

where r = a + b (the number of objects satisfying ϕ), s = c + d, k = a + c, l = b + d (marginal sums) and m = a + b + c + d = r + s = k + l. Thus the truth function of a two-dimensional quantifier Q assigns to each four-fold table (a,b,c,d) the value Tr_Q(a,b,c,d) ∈ {0,1}.
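As a minimal sketch of how a four-fold table and its marginal sums are obtained from a Boolean data matrix, consider the following Python fragment. The helper names and the sample rows (read here as hypothetical "smoker" and "ill" columns) are illustrative assumptions.

```python
def four_fold_table(rows, phi, psi):
    """Count objects satisfying phi&psi, phi&~psi, ~phi&psi, ~phi&~psi."""
    a = b = c = d = 0
    for row in rows:
        p, q = bool(phi(row)), bool(psi(row))
        if p and q:
            a += 1
        elif p:
            b += 1
        elif q:
            c += 1
        else:
            d += 1
    return a, b, c, d

def marginals(a, b, c, d):
    """Return (r, s, k, l, m) in the notation of the text."""
    return a + b, c + d, a + c, b + d, a + b + c + d

# Hypothetical data: each row is (smoker, ill).
rows = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
smoker = lambda row: row[0] == 1
ill = lambda row: row[1] == 1

table = four_fold_table(rows, smoker, ill)
print(table)              # (2, 1, 1, 2)
print(marginals(*table))  # (3, 3, 3, 3, 6)
```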

All (two-dimensional): ϕ ⇒ ψ says “all ϕ’s are ψ’s”; Tr_⇒(a,b,c,d) = 1 iff b = 0. This is definable by the one-dimensional ∀ and the connective →, namely ϕ ⇒ ψ says the same as (∀x)(ϕ → ψ).

Many: for 0 < p ≤ 1, ϕ ⇒_p ψ says “p-many ϕ’s are ψ’s”, i.e. the relative frequency of objects satisfying ψ among those satisfying ϕ is at least p; thus Tr_⇒p(a,b,c,d) = 1 iff a/(a + b) ≥ p. Caution: this is not the same as (Many_p x)(ϕ → ψ). For example, if (a,b,c,d) is (2,2,5,5) then a/(a + b) = 1/2, but the number of objects satisfying ϕ → ψ is a + c + d = 12, thus (a + c + d)/m = 12/14 = 6/7. Thus if p = 0.8 then ϕ ⇒_p ψ is false but (Many_p x)(ϕ → ψ) is true. However, ϕ ⇒_p ψ can also be written as (Many_p x)ψ/ϕ, understood as saying that the formula (Many_p x)ψ is true in the subtable consisting of the rows satisfying ϕ.
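The caution can be checked numerically. The sketch below uses exact fractions; the function names and the convention that ⇒_p is false when no object satisfies ϕ are assumptions made for illustration.

```python
from fractions import Fraction

def many_implies(p, a, b, c, d):
    """phi =>_p psi: relative frequency of psi among phi is at least p."""
    return a + b > 0 and Fraction(a, a + b) >= p

def many_of_implication(p, a, b, c, d):
    """(Many_p x)(phi -> psi): phi -> psi fails only on the b objects."""
    m = a + b + c + d
    return m > 0 and Fraction(a + c + d, m) >= p

table = (2, 2, 5, 5)
p = Fraction(4, 5)

print(many_implies(p, *table))         # False, since 2/4 < 4/5
print(many_of_implication(p, *table))  # True, since 12/14 >= 4/5
```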

p-equivalence: ϕ ⇔_p ψ (ϕ is p-equivalent to ψ) is true in the data if both ϕ ⇒_p ψ and ¬ϕ ⇒_p ¬ψ are true, thus a/(a + b) ≥ p and d/(c + d) ≥ p.

Foundedness: Let t be a natural number. (Fdd_t x)(ϕ,ψ) says that at least t objects satisfy ϕ ∧ ψ, i.e. a ≥ t. Similarly:

Support: Let 0 < σ < 1. (Supp_σ x)(ϕ,ψ) is (Many_σ x)(ϕ ∧ ψ), thus says that the relative frequency of ϕ ∧ ψ in the data is at least σ.

Founded implication: (FIMPL_{p,t} x)(ϕ,ψ) (or just ϕ ⇒*_{p,t} ψ) is the conjunction of ϕ ⇒_p ψ and (Fdd_t x)(ϕ,ψ), hence Tr_⇒*_{p,t}(a,b,c,d) = 1 iff a/(a + b) ≥ p and a ≥ t.

Agrawal: ϕ ⇒^Agr_{p,σ} ψ is the conjunction of ϕ ⇒_p ψ and (Supp_σ x)(ϕ,ψ), hence Tr_⇒^Agr_{p,σ}(a,b,c,d) = 1 iff a/(a + b) ≥ p and a/(a + b + c + d) ≥ σ.

Clearly the last two quantifiers differ only very little: ϕ ⇒*_{p,t} ψ is equivalent to ϕ ⇒^Agr_{p,σ} ψ for σ = t/(a + b + c + d). Now ⇒^Agr is the quantifier of the association rules of Agrawal; it is little known, and has to be stressed, that the “almost the same” quantifier of founded implication was used in GUHA to generate “association rules” in the presently common sense as early as the mid-sixties of the past century (Hájek et al., 1966).
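The closeness of founded implication and the Agrawal quantifier can be illustrated with a short Python sketch. The function names and the sample table are hypothetical; exact fractions are used so that the comparison with σ = t/m is not affected by rounding.

```python
from fractions import Fraction

def founded_implication(p, t, a, b, c, d):
    """a/(a+b) >= p and a >= t (confidence plus absolute support)."""
    return a + b > 0 and Fraction(a, a + b) >= p and a >= t

def agrawal(p, sigma, a, b, c, d):
    """a/(a+b) >= p and a/m >= sigma (confidence plus relative support)."""
    m = a + b + c + d
    return a + b > 0 and Fraction(a, a + b) >= p and Fraction(a, m) >= sigma

table = (90, 10, 40, 860)   # hypothetical four-fold table, m = 1000
m = sum(table)
p = Fraction(4, 5)

# For a fixed table, threshold t corresponds to sigma = t/m.
same = all(
    founded_implication(p, t, *table) == agrawal(p, Fraction(t, m), *table)
    for t in range(0, 200)
)
print(same)  # True
```

The correspondence depends on m, which is why the two quantifiers are only "almost the same": a fixed t and a fixed σ agree on tables of one size but not across data of different sizes.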

The reader may play by defining more and more two-dimensional quantifiers; clearly not all of them are relevant for Data Mining. We close this section with two important remarks.

Closed formulas. Each formula (of the present formalism, with unary predicates and just one object variable) beginning with a quantifier is closed, i.e. it does not refer to any particular object but expresses some global pattern found in the data. Further closed formulas result from those beginning with a quantifier by using connectives (e.g. we have seen that ϕ ⇔_p ψ is equivalent to (ϕ ⇒_p ψ) ∧ (¬ϕ ⇒_p ¬ψ), etc.). A closed formula is a tautology (logical truth) if it is true in all data. To give a trivial example, observe that if p_1 ≤ p_2 then the formula (ϕ ⇒_{p_2} ψ) → (ϕ ⇒_{p_1} ψ) is a tautology.

Predicates of higher arity. If our data contain information on relations of higher arity (binary, ternary, ...), we have to use predicates of higher arity and more than one object variable. A quantifier always binds (quantifies) a variable. This leads to the logical notion of free and bound variables of a formula, free variables varying over arbitrary objects of the data. For example, take a binary predicate P; P(x,y) is a formula in which x, y are free, and (Many_p y)P(x,y) is a formula in which only x is free. An object o satisfies the last formula iff it is P-related to p-many objects. (Let P(x,y) say “x knows y” and let p be 0.8. An object o satisfies (Many_0.8 y)P(x,y) if it knows at least 80% of the objects in the data.) We may form composed formulas using several quantifiers binding different variables, e.g. (∀x)((Many_0.8 y)P(x,y) → R(x)) (saying “each object knowing at least 80% of the objects has the property R”), etc. Formulas of this sort are used in relational Data Mining (Džeroski and Lavrač, 2001). We shall not go into details.
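A nested formula of this kind can be evaluated directly on a square Boolean matrix. The following Python sketch is an illustration only: the matrix `knows`, the list `has_r` and the function names are hypothetical.

```python
from fractions import Fraction

def many_related(p, relation, x):
    """(Many_p y)P(x,y): object x is P-related to at least p of all objects."""
    row = relation[x]
    return sum(row) >= p * len(row)

def all_well_connected_have_r(p, relation, has_r):
    """(forall x)((Many_p y)P(x,y) -> R(x))."""
    return all(
        (not many_related(p, relation, x)) or has_r[x]
        for x in range(len(relation))
    )

# Hypothetical data: knows[x][y] == 1 means "x knows y".
knows = [
    [1, 1, 1, 1, 1],   # knows 5 of 5
    [1, 1, 1, 1, 0],   # knows 4 of 5
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
has_r = [True, True, False, False, True]
p = Fraction(4, 5)

print([many_related(p, knows, x) for x in range(5)])
# [True, True, False, False, False]
print(all_well_connected_have_r(p, knows, has_r))  # True
```

Here x stays free in `many_related` and is bound by the outer `all`, matching the distinction between free and bound variables.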

26.2 Some important classes of quantifiers

26.2.1 One-dimensional

Let us call a one-dimensional quantifier Q multitudinal if its truth function Tr_Q is non-decreasing in its first argument and non-increasing in the second, i.e. for any two-fold tables (r_1,s_1), (r_2,s_2), whenever r_1 ≤ r_2, s_1 ≥ s_2 and Tr_Q(r_1,s_1) = 1 then Tr_Q(r_2,s_2) = 1. This means that the formula (Qx)ϕ says, in some sense given by Tr_Q, that sufficiently many objects satisfy ϕ. “Sufficiently many” may mean “all” (∀), “at least one” (∃), “at least 7” (∃≥7), “at least 100p%” (Many_p), etc.

Very important: the quantifier may correspond to a statistical test of high probability. Telegraphically: our hypothesis is that the probability of the attribute ϕ is bigger than p (under frame assumptions saying that all objects have the same probability of having ϕ and are mutually independent). Take a small α (e.g. 0.05, the significance level). The number

\[ \sum_{i=r}^{r+s} \binom{r+s}{i}\, p^{i} (1-p)^{r+s-i} \]

is the probability that at least r objects (from our r + s objects) will have ϕ, assuming that the probability of ϕ is p. If this sum is ≤ α then we can reject the (null) hypothesis saying that the probability of ϕ is p or less (since if it were, what we have observed would be improbable). This is the (simplified) idea of statistical hypothesis testing. We get a quantifier of testing high probability, HProb_{p,α}:

\[ \mathrm{Tr}_{\mathrm{HProb}_{p,\alpha}}(r,s) = 1 \quad\text{iff}\quad \sum_{i=r}^{r+s} \binom{r+s}{i}\, p^{i} (1-p)^{r+s-i} \le \alpha. \]

This is an example of a statistically motivated one-dimensional quantifier; it can be proved to be multitudinal. See Chapter 31.4.6 for statistical hypothesis testing and (Hájek and Havránek, 1978) for its logical foundations.
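The truth function of HProb_{p,α} is a binomial upper tail compared with α. A direct, unoptimized Python sketch follows; the function names are assumptions made here, and the sum is evaluated naively in floating point.

```python
from math import comb

def binomial_upper_tail(r, n, p):
    """Probability of at least r successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(r, n + 1))

def tr_hprob(p, alpha):
    """Truth function of HProb_{p,alpha} on two-fold tables (r, s)."""
    def tr(r, s):
        return int(binomial_upper_tail(r, r + s, p) <= alpha)
    return tr

q = tr_hprob(0.5, 0.05)
print(q(9, 1))  # 1: tail is 11/1024, about 0.011
print(q(7, 3))  # 0: tail is 176/1024, about 0.172
```

With p = 0.5 and ten objects, nine positive cases reject the null hypothesis at level 0.05, while seven do not.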

26.2.2 Two-dimensional

Recall that a two-dimensional quantifier is given by its truth function assigning to each four-fold table (a,b,c,d) a truth value (1 or 0). In Data Mining we are especially interested in two-dimensional quantifiers expressing, in some sense, a kind of association of two attributes (described by two open formulas). In some sense, the formula (Qx)(ϕ,ψ) should say that there are sufficiently many coincidences in (the truth values of) ϕ, ψ and not too many differences. This leads to the following definition (Hájek and Havránek, 1978):

A two-dimensional quantifier is associational if it satisfies the following for each pair (a_1,b_1,c_1,d_1), (a_2,b_2,c_2,d_2) of four-fold tables: a_2 ≥ a_1, b_2 ≤ b_1, c_2 ≤ c_1, d_2 ≥ d_1 and Tr_Q(a_1,b_1,c_1,d_1) = 1 imply Tr_Q(a_2,b_2,c_2,d_2) = 1.

In other words: if

          ψ_1   ¬ψ_1                ψ_2   ¬ψ_2
  ϕ_1     a_1   b_1         ϕ_2     a_2   b_2
  ¬ϕ_1    c_1   d_1         ¬ϕ_2    c_2   d_2

are the four-fold tables of the pairs (ϕ_1,ψ_1), (ϕ_2,ψ_2) of open formulas in given data, if (Qx)(ϕ_1,ψ_1) is true in the data and the above inequalities hold (a_2 ≥ a_1, b_2 ≤ b_1, c_2 ≤ c_1, d_2 ≥ d_1), then (Qx)(ϕ_2,ψ_2) is also true. The second table has more coincidences (a_2 ≥ a_1, d_2 ≥ d_1) and fewer differences (b_2 ≤ b_1, c_2 ≤ c_1).

A quantifier Q is locally associational if the above condition holds for all (a_1,b_1,c_1,d_1), (a_2,b_2,c_2,d_2) satisfying the additional assumption a_1 + b_1 + c_1 + d_1 = a_2 + b_2 + c_2 + d_2 (i.e. the tables correspond to two data matrices of the same cardinality; in particular, think of ϕ_1, ψ_1, ϕ_2, ψ_2 evaluated in the same data matrix).

We shall deal with (locally) associational quantifiers of two important kinds: implicational and comparative. We give examples and state general (deductive) properties of quantifiers in these classes.

Implicational quantifiers formalize the association formulated as “many ϕ’s are ψ’s”. (They could also be called two-dimensional multitudinal quantifiers.) The definition reads as follows:

A two-dimensional quantifier Q is implicational if each pair (a_1,b_1,c_1,d_1), (a_2,b_2,c_2,d_2) of four-fold tables satisfies the following condition: if a_2 ≥ a_1, b_2 ≤ b_1 and Tr_Q(a_1,b_1,c_1,d_1) = 1 then Tr_Q(a_2,b_2,c_2,d_2) = 1. Q is locally implicational if this condition is satisfied for each pair of four-fold tables with the same sum (a_1 + b_1 + c_1 + d_1 = a_2 + b_2 + c_2 + d_2).

Clearly, the quantifier ⇒_p (p-many) is implicational: a_2 ≥ a_1 and b_2 ≤ b_1 imply a_2/(a_2 + b_2) ≥ a_1/(a_1 + b_1). The quantifier ⇒*_{p,t} of founded implication is also implicational: if a_2 ≥ a_1 and a_1 ≥ t then trivially a_2 ≥ t. The “almost the same” Agrawal quantifier ⇒^Agr_{p,σ} is locally implicational: if the tables have equal sums and a_2 ≥ a_1 then trivially a_2/(a_2 + b_2 + c_2 + d_2) ≥ a_1/(a_1 + b_1 + c_1 + d_1).

Note the statistical parallel of ⇒_p: the hypothesis P(ψ|ϕ) ≥ p (conditional probability of ψ, given ϕ) is tested using the statistic

\[ \sum_{i=a}^{a+b} \binom{a+b}{i}\, p^{i} (1-p)^{a+b-i}. \]

The corresponding quantifier ⇒!_{p,α} of likely p-implication (with significance level α) is

\[ \mathrm{Tr}_{\Rightarrow^{!}_{p,\alpha}}(a,b,c,d) = 1 \quad\text{iff}\quad \sum_{i=a}^{a+b} \binom{a+b}{i}\, p^{i} (1-p)^{a+b-i} \le \alpha. \]

This is also an implicational quantifier (see (Hájek and Havránek, 1978), where another statistically motivated implicational quantifier is also discussed). For each (locally) implicational quantifier (denote it ⇒#) the following two deduction rules are sound (in the sense that whenever the assumption is true in your data, the consequence is also true):

\[ \frac{(\varphi_1 \wedge \varphi_2) \Rightarrow^{\#} \psi}{\varphi_1 \Rightarrow^{\#} (\psi \vee \neg\varphi_2)}, \qquad \frac{\varphi \Rightarrow^{\#} \psi_1}{\varphi \Rightarrow^{\#} (\psi_1 \vee \psi_2)} \]

For example, if the formula “p-many probands being smokers and older than 50 have cancer” is true in your data then the following is true too: “p-many probands being smokers have cancer or are not older than 50”. Second: if the “association rule” “x buys Lidové noviny ⇒ x is Czech” is 90%-true with support 1000 then also

x buys Lidové noviny ⇒ (x is Czech or x is Slovak)

is 90%-true with the same support. These deduction rules are extremely useful for optimizing the search for formulas (“rules”) of the form ϕ ⇒# ψ where ⇒# is an implicational quantifier, ϕ is an elementary conjunction (a conjunction of atomic open formulas and negated atomic formulas containing each predicate at most once, e.g. P_1(x) ∧ ¬P_3(x) ∧ P_7(x)) and ψ is an elementary disjunction (similar definition, e.g. ¬P_2(x) ∨ ¬P_10(x)).
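The soundness of the first deduction rule can be spot-checked empirically for the quantifier ⇒_p. The Python sketch below draws random Boolean data and confirms that the conclusion never fails when the premise holds. This is an illustration, not a proof; the helper names, the seed and the data sizes are arbitrary assumptions.

```python
import random
from fractions import Fraction

def four_fold(rows, phi, psi):
    a = sum(1 for r in rows if phi(r) and psi(r))
    b = sum(1 for r in rows if phi(r) and not psi(r))
    c = sum(1 for r in rows if not phi(r) and psi(r))
    d = len(rows) - a - b - c
    return a, b, c, d

def many_implies(p, a, b, c, d):
    # Assumption: false when no object satisfies the antecedent.
    return a + b > 0 and Fraction(a, a + b) >= p

def rule_respected(rows, p):
    """Rows are triples (phi1, phi2, psi) of 0/1 values."""
    premise = many_implies(
        p, *four_fold(rows, lambda r: r[0] and r[1], lambda r: r[2]))
    conclusion = many_implies(
        p, *four_fold(rows, lambda r: r[0], lambda r: r[2] or not r[1]))
    return (not premise) or conclusion

rng = random.Random(0)
p = Fraction(7, 10)
datasets = [
    [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(20)]
    for _ in range(1000)
]
print(all(rule_respected(rows, p) for rows in datasets))  # True
```

The check succeeds because passing from (ϕ_1 ∧ ϕ_2, ψ) to (ϕ_1, ψ ∨ ¬ϕ_2) leaves b unchanged and can only increase a.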

Caution: for the “classical” quantifier ⇒_1 (ϕ ⇒_1 ψ saying “all ϕ’s are ψ’s”) the first rule can be converted: truth of ϕ_1 ⇒_1 (ψ ∨ ¬ϕ_2) implies truth of (ϕ_1 ∧ ϕ_2) ⇒_1 ψ. But this is not true for ⇒_p (with p < 1) and the other implicational (locally implicational) quantifiers mentioned. Let us stress once more that implicational quantifiers formalize, in various possible ways, what we mean by saying “many ϕ’s are ψ’s”. Agrawal’s association rules are a particular case, with one particular implicational quantifier and also with specific open formulas (no negation allowed, just conjunctions of atoms). Even if this may be the most used case, the reader is invited to consider broader, more general and more powerful possibilities.

We now turn our attention to a very important class of quantifiers that we shall call comparative. The intuitive meaning of association expressed by a comparative quantifier is that the formula (Qx)(ϕ,ψ) should say that the presence of ϕ positively contributes to the presence of ψ. This does not mean that many ϕ’s are ψ’s, i.e. that the relative frequency of ψ among ϕ (denoted Freq(ψ|ϕ)) is big, but that Freq(ψ|ϕ) is (sufficiently) bigger than Freq(ψ|¬ϕ). For example, imagine that 30% of smokers have an illness and only 5% of non-smokers have the same illness. The simplest quantifier of this kind is called the simple associational quantifier, denoted SIMPLE or ∼_0 (see (Hájek and Havránek, 1978, Hájek et al., 1995) or other GUHA papers); the truth function is Tr_∼0(a,b,c,d) = 1 iff ad > bc. A trivial computation shows that ad > bc is equivalent both to a/(a + b) > (a + c)/(a + b + c + d) (if (a,b,c,d) is the four-fold table of ϕ, ψ then this says that, in the data, Freq(ψ|ϕ) > Freq(ψ)) and to a/(a + b) > c/(c + d) (Freq(ψ|ϕ) > Freq(ψ|¬ϕ)). You may make this quantifier parametric, demanding ad > h·bc for some h ≥ 1.
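The "trivial computation" can be written out as follows, assuming a + b > 0 and c + d > 0 so that all fractions are defined. Both chains multiply out the denominators and cancel the common terms.

```latex
\begin{align*}
\frac{a}{a+b} > \frac{c}{c+d}
  &\iff a(c+d) > c(a+b) \\
  &\iff ac + ad > ac + bc \\
  &\iff ad > bc, \\[4pt]
\frac{a}{a+b} > \frac{a+c}{a+b+c+d}
  &\iff a(a+b+c+d) > (a+c)(a+b) \\
  &\iff a^2 + ab + ac + ad > a^2 + ab + ac + bc \\
  &\iff ad > bc.
\end{align*}
```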

Thus let us accept the following definition: a two-dimensional quantifier Q is comparative if Tr_Q(a,b,c,d) = 1 implies ad > bc. The statistical counterpart is the Fisher quantifier ∼^F_α, based on the test of the hypothesis P(ψ|ϕ) > P(ψ) (against the null hypothesis P(ψ|ϕ) ≤ P(ψ)) with significance level α. The formula is: Tr_∼^F_α(a,b,c,d) = 1 iff ad > bc and

\[ \sum_{i=a}^{\min(a+b,\,a+c)} \frac{\binom{a+c}{i}\binom{b+d}{a+b-i}}{\binom{a+b+c+d}{a+b}} \le \alpha. \]

If we adopt the usual notation a + b = r, a + c = k, b + d = l, a + b + c + d = m then the last formula becomes

\[ \sum_{i=a}^{\min(r,k)} \frac{\binom{k}{i}\binom{l}{r-i}}{\binom{m}{r}} \le \alpha. \]

This is a rather complicated formula; there are non-trivial algorithms for computing the sum in question. The Fisher quantifier can be proved to be associational (Hájek and Havránek, 1978). Let us mention that another comparative associational statistically motivated quantifier is based on the statistical chi-square test. Indeed, Tr_∼^CHISQ_α(a,b,c,d) = 1 iff ad > bc and

\[ \frac{m\,(ad-bc)^2}{r\,s\,k\,l} \ge \chi^2_{\alpha}, \]

where χ²_α is a constant (the (1 − α)-quantile of the χ² distribution function).
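Both statistically motivated comparative quantifiers can be sketched naively in Python. This is not one of the non-trivial algorithms mentioned above: the hypergeometric sum is simply evaluated term by term with exact integers. The function names are assumptions, and the critical value 3.841 used in the example is the 0.95-quantile of the χ² distribution with one degree of freedom, which is an assumption about the intended distribution.

```python
from fractions import Fraction
from math import comb

def fisher_tail(a, b, c, d):
    """Sum over i = a..min(r,k) of C(k,i)*C(l,r-i) / C(m,r)."""
    r, k, l, m = a + b, a + c, b + d, a + b + c + d
    total = sum(comb(k, i) * comb(l, r - i) for i in range(a, min(r, k) + 1))
    return Fraction(total, comb(m, r))

def tr_fisher(alpha):
    return lambda a, b, c, d: int(
        a * d > b * c and fisher_tail(a, b, c, d) <= alpha)

def chi_square_statistic(a, b, c, d):
    r, s, k, l = a + b, c + d, a + c, b + d
    return (r + s) * (a * d - b * c) ** 2 / (r * s * k * l)

def tr_chisq(critical_value):
    def tr(a, b, c, d):
        if min(a + b, c + d, a + c, b + d) == 0:
            return 0  # assumption: undefined statistic counts as false
        return int(a * d > b * c
                   and chi_square_statistic(a, b, c, d) >= critical_value)
    return tr

print(fisher_tail(3, 0, 0, 3))           # 1/20
print(fisher_tail(2, 1, 1, 2))           # 1/2
print(chi_square_statistic(3, 0, 0, 3))  # 6.0
```

For the table (3,0,0,3) the Fisher tail equals 1/20, so the formula is true at α = 1/20, and the chi-square statistic 6.0 exceeds 3.841; for (2,1,1,2) both quantifiers are false.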

Now let us present three deduction rules and ask whether our quantifiers obey them. Once more, a rule is obeyed if, whenever the assumption (above the line) is true in the data, the conclusion (below the line) is true. Here ∼ stands for a quantifier; we write ϕ ∼ ψ instead of (∼ x)(ϕ,ψ).

\[ \text{Rule of symmetry (SYM):}\ \frac{\varphi \sim \psi}{\psi \sim \varphi} \qquad \text{Rule of negation (NEG):}\ \frac{\varphi \sim \psi}{\neg\varphi \sim \neg\psi} \qquad \text{Rule of conversion (CNVS):}\ \frac{\varphi \sim \psi}{\neg\psi \sim \neg\varphi} \]

Fact. The simple quantifier ∼_0, the Fisher quantifier ∼^F_α and the chi-square quantifier ∼^CHISQ_α obey all the rules (SYM), (NEG), (CNVS).

For a proof see again (Hájek and Havránek, 1978), observing that if any quantifier obeys (SYM) and (NEG) then it automatically obeys (CNVS). Now we present three more quantifiers occurring in the literature, each obeying just one of our present rules.

The quantifier of pure p-equivalence ≡_p (Rauch, see e.g. (Rauch, 1998A)). The formula ϕ ≡_p ψ is true if both ϕ ⇒_p ψ and ¬ϕ ⇒_p ¬ψ are true, thus Tr_≡p(a,b,c,d) = 1 iff a/(a + b) ≥ p and d/(c + d) ≥ p. For p > 1/2 this quantifier is comparative. (Indeed, if a/(a + b) > 1/2 and d/(c + d) > 1/2 then c/(c + d) < 1/2 < a/(a + b), which gives bc < ad.)

The quantifier of conviction (Adamo, 2001). ϕ ∼^conv_h ψ is true if (a + b)(b + d) > h·b(a + b + c + d), or equivalently (rl)/(bm) > h, where h is a parameter, h ≥ 1. An elementary computation shows that the last inequality for h = 1 (and hence for each h ≥ 1) implies ad > bc; the quantifier is comparative.

The quantifier “above average” is a variant of SIMPLE (used in the program 4FT-miner (lispminer)). ϕ ∼^AA_h ψ is true if a/(a + b) > h·(a + c)/(a + b + c + d) (thus a/r > h·k/m), which means that Freq(ψ|ϕ) > h·Freq(ψ). For h = 1 this is equivalent to the simple quantifier (with h = 1); evidently, for each h ≥ 1, the AA quantifier is comparative.

But these last three quantifiers differ as far as our deduction rules are concerned:

Fact.
(1) The quantifier AA obeys symmetry but, for h > 1, neither negation nor conversion.
(2) The quantifier of pure p-equivalence obeys negation but, for p < 1, neither symmetry nor conversion.
(3) The quantifier of conviction obeys conversion but, for h > 1, neither symmetry nor negation.

The positive claims are verified by easy computations; the negative claims can all be witnessed e.g. by the table (9,1,10,80).
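The witness table can be checked mechanically. In the Python sketch below, each rule is expressed as a rearrangement of the four-fold table, and the three quantifiers are evaluated on the original and on the rearranged tables. The parameter values h = 2 and p = 4/5 are choices made here for illustration; the text does not fix them.

```python
from fractions import Fraction

# Each rule maps the table of (phi, psi) to the table of the conclusion.
RULES = {
    "SYM":  lambda a, b, c, d: (a, c, b, d),   # psi ~ phi
    "NEG":  lambda a, b, c, d: (d, c, b, a),   # not phi ~ not psi
    "CNVS": lambda a, b, c, d: (d, b, c, a),   # not psi ~ not phi
}

def above_average(h):
    return lambda a, b, c, d: a * (a + b + c + d) > h * (a + c) * (a + b)

def pure_equivalence(p):
    return lambda a, b, c, d: a >= p * (a + b) and d >= p * (c + d)

def conviction(h):
    return lambda a, b, c, d: (a + b) * (b + d) > h * b * (a + b + c + d)

def preserved(quantifier, table):
    """For each rule: does the conclusion hold on this table?"""
    assert quantifier(*table), "the premise must hold on the witness table"
    return {name: quantifier(*rule(*table)) for name, rule in RULES.items()}

witness = (9, 1, 10, 80)
print(preserved(above_average(2), witness))
# {'SYM': True, 'NEG': False, 'CNVS': False}
print(preserved(pure_equivalence(Fraction(4, 5)), witness))
# {'SYM': False, 'NEG': True, 'CNVS': False}
print(preserved(conviction(2), witness))
# {'SYM': False, 'NEG': False, 'CNVS': True}
```

A single table can only refute a rule; the `True` entries are consistent with the positive claims but do not prove them.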

Let us also mention the quantifier of double p-implication ⇔_p (Rauch): ϕ ⇔_p ψ is true if both ϕ ⇒_p ψ and ψ ⇒_p ϕ are true. The reader may show that this quantifier is not comparative (consider e.g. (9,1,1,0)); it obeys symmetry but (for p < 1) neither negation nor conversion.

The study of deductive rules is important for the interpretation of results of Data Mining as well as for the optimization of mining algorithms.

To close this section let us mention that each two-dimensional quantifier ∼ can be used to define a three-dimensional quantifier by partializing: the formula (ϕ ∼ ψ)/χ is true in the data matrix in question iff ϕ ∼ ψ is true in the submatrix of objects satisfying χ. Cf. (Hájek, 2003).

26.3 Some comments and conclusion

Using four-fold tables

Even though we have dealt with logical aspects of Data Mining, we feel obliged to stress once more the importance of the statistical side of the game. We already referred to (Giudici, 2003); let us make some further references. (Glymour et al., 1996) is good reading on the prehistory of Data Mining, namely exploratory data analysis, and on some dangers of using statistically motivated notions in Data Mining. (Zytkow and Zembowicz, 1997) deal with generating knowledge from four-fold tables (and describe their database discovery system “49er”). Recently, the chi-square statistic was used to define “generalized association rules” (we would say: using a comparative quantifier) by Hegland (Hegland, 2001) and Brin (Brin et al., 1998). Papers by Rauch (et al.) discuss several further classes of quantifiers; see the references.

Two generalizations

First, the logical approach to mining generalized association rules can be and has been generalized to fuzzy logic. We refer to Holeňa’s papers ((Holeňa, 1996) – (Holeňa, 1996)) and also to Chen et al. (Chen et al., 2003). Second, we have only mentioned relational Data Mining and its techniques of inductive logic programming. Besides Džeroski and Lavrač (Džeroski and Lavrač, 2001) the reader may consult e.g. Dehaspe and Toivonen (Dehaspe and Toivonen, 1999).

The GUHA method

The reader should be informed of the more than 30 years old story of the GUHA method of automated generation of hypotheses (General Unary Hypotheses Automaton), which is undoubtedly one of the oldest methods of computerized exploratory data analysis (or, if you want, mining of association rules), starting with (Hájek et al., 1966). We already mentioned above that formulas almost identical with Agrawal’s association rules were considered, and algorithms for their generation were discussed, in that paper. This was followed by a long period of research culminating in 1978 in the monograph (Hájek and Havránek, 1978), presenting logical and statistical foundations that are still relevant for contemporary Data Mining. (Note that the book is presently available on the web; see the references.) The research has continued; see (Hájek et al., 2003) and (Hájek and Holeňa, 2003, Hájek, 2001) for a survey of the present state and relation to other Data Mining methods. The GUHA approach offers observational logical calculi (based on generalized quantifiers as presented here), logical foundations of statistical inference (theoretical logical calculi), a theory of some auxiliary (helpful) quantifiers good for compression of results, three semantics of missing information and several other facts, notions and techniques. There have been several implementations; for two presently available see (GUHA+-, lispminer).

It is regrettable that the mainstream of Data Mining has neglected the GUHA approach ((Liu et al., 2000) being one of few exceptions); this subsection is a small attempt to change this.

Conclusion

The study of logical aspects of Data Mining is interesting and useful: it gives an exact abstract approach to “association rules” based on the notion of (generalized) quantifiers, important classes of quantifiers, and deductive properties of associations expressed using such quantifiers, as well as other results not mentioned here (e.g. results on computational complexity). Hopefully the present chapter will help the reader to enjoy this.

Acknowledgments

Partial support of the COST Action 274 (TARSKI) is acknowledged.

References

Adamo, J. M. Data Mining for association rules and sequential patterns. Springer, 2001.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. “Fast discovery of association rules.” In: Advances in knowledge discovery and Data Mining, Fayyad, U. M. et al., eds., AAAI Press/MIT Press, 1996.
