Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 38


describe a mining algorithm but rather a pruning technique for constraints that are neither anti-monotonic nor monotonic. Considering a sub-lattice A of 2^I, the problem is to decide whether this sub-lattice can be pruned. A sub-lattice is characterized by its maximal element M and its minimal element m, i.e., the sub-lattice is the collection of all itemsets S such that m ⊆ S ⊆ M. To prune this sub-lattice, one must prove that none of its elements can satisfy the constraint C. To check this, the authors introduce the concept of negative witness: a negative witness for C in the sub-lattice A is an itemset W such that ¬C(W) ⇒ ∀X ∈ A, ¬C(X). Therefore, if the constraint is not satisfied by the negative witness, then the whole sub-lattice can be pruned. Finding witnesses for anti-monotonic or monotonic constraints is easy: m is the witness for all anti-monotonic constraints and M for all monotonic ones. The authors then show how to efficiently compute witnesses for various tough constraints. For instance, for AVG(S) > σ, a witness is the set m ∪ {i ∈ M | i.v > σ}. The authors also give an algorithm (linear in the size of I) to compute a witness for the difficult constraint VAR(S) > σ, where VAR denotes the variance.
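The witness for AVG(S) > σ can be illustrated with a minimal Python sketch. The function names (avg_witness, can_prune, exists_satisfying) and the value dictionary mapping each item i to i.v are assumptions made for this illustration, not part of the cited work. The brute-force enumeration is only there to cross-check the pruning decision on a tiny sub-lattice.

```python
from itertools import combinations


def avg(itemset, value):
    """Average of the item values of a non-empty itemset."""
    return sum(value[i] for i in itemset) / len(itemset)


def avg_witness(m, M, value, sigma):
    """Witness for AVG(S) > sigma in the sub-lattice m <= S <= M:
    keep m and add every item of M whose value exceeds sigma."""
    return set(m) | {i for i in M if value[i] > sigma}


def can_prune(m, M, value, sigma):
    """The sub-lattice can be pruned if the witness violates the constraint.
    An empty witness is treated as violating it (assumption of this sketch)."""
    w = avg_witness(m, M, value, sigma)
    return not w or avg(w, value) <= sigma


def exists_satisfying(m, M, value, sigma):
    """Brute-force check: does any S with m <= S <= M satisfy AVG(S) > sigma?"""
    extra = sorted(set(M) - set(m))
    for r in range(len(extra) + 1):
        for combo in combinations(extra, r):
            s = set(m) | set(combo)
            if s and avg(s, value) > sigma:
                return True
    return False


# Hypothetical item values and sub-lattice bounds.
value = {"a": 1, "b": 2, "c": 10, "d": 4}
m, M = {"a", "b"}, {"a", "b", "c", "d"}
for sigma in (4, 5):
    # Pruning is allowed exactly when no itemset of the sub-lattice qualifies.
    assert can_prune(m, M, value, sigma) == (not exists_satisfying(m, M, value, sigma))
```

The witness works here because AVG(S) > σ holds exactly when the sum of (i.v − σ) over S is positive, and the witness keeps every optional item with a positive contribution.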

17.4.3 Ad-hoc Strategies

Apart from generic algorithms, many algorithms have been designed to cope with specific classes of constraints. We select only two examples.

The FIC algorithm (Pei et al., 2001) does a depth-first exploration of the itemset lattice. It is very efficient due to its clever data structure, a prefix-tree used to store the database. This algorithm can compute the extended theory for a conjunction C_am ∧ C_m ∧ C, where C is convertible anti-monotonic or monotonic. A constraint C is convertible anti-monotonic if there exists an order on the items such that, if itemsets are written using this order, every prefix of an itemset satisfying C also satisfies C. For instance, AVG(S) > σ is convertible anti-monotonic if the items i are ordered by decreasing value i.v. The main problem with convertible constraints is that a conjunction of convertible constraints is generally not convertible.
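The prefix property behind convertible anti-monotonicity can be checked on a small example. The sketch below is only an illustration of the definition, not the FIC algorithm itself. The helper names and the value dictionary (item to i.v) are assumed for the example.

```python
def in_order(itemset, value, decreasing=True):
    """Write an itemset as a list ordered by item value."""
    return sorted(itemset, key=lambda i: value[i], reverse=decreasing)


def avg_exceeds(items, value, sigma):
    """The constraint AVG(S) > sigma on a non-empty collection of items."""
    return sum(value[i] for i in items) / len(items) > sigma


def all_prefixes_satisfy(itemset, value, sigma, decreasing=True):
    """True if every non-empty prefix of the ordered itemset satisfies AVG > sigma."""
    seq = in_order(itemset, value, decreasing)
    return all(avg_exceeds(seq[:k], value, sigma) for k in range(1, len(seq) + 1))


# Hypothetical item values.
value = {"a": 9, "b": 6, "c": 3}
itemset = {"a", "b", "c"}

# The itemset satisfies AVG(S) > 5 ...
assert avg_exceeds(itemset, value, 5)
# ... and so does every prefix when items are written by decreasing value.
assert all_prefixes_satisfy(itemset, value, 5, decreasing=True)
# With the increasing order, the first prefix ["c"] already fails.
assert not all_prefixes_satisfy(itemset, value, 5, decreasing=False)
```

With the decreasing order, each added item is no larger than those before it, so the running average can only decrease along the prefixes. This is why a satisfying itemset guarantees satisfying prefixes.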

Another example of an ad-hoc strategy is used in the c-Spade algorithm (Zaki, 2000). This algorithm is used to extract constrained sequences where each event in the sequences is dated. One of the constraints, the max-gap constraint, states that two consecutive events occurring in a pattern must not be further apart than a given maximum gap. This constraint is neither anti-monotonic nor monotonic, and a specific algorithm has been designed for it.
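To see why the max-gap constraint is not anti-monotonic, consider the following minimal sketch. It is not the c-Spade algorithm. It is a simple backtracking matcher over a single dated sequence, with a hypothetical function name and data layout (a list of (time, event) pairs sorted by time).

```python
def occurs_with_max_gap(sequence, pattern, max_gap):
    """Does `pattern` (a list of events) occur in `sequence`, a time-sorted
    list of (time, event) pairs, with at most `max_gap` time units between
    consecutive matched events?"""

    def extend(pos, k, last_time):
        if k == len(pattern):
            return True
        for j in range(pos, len(sequence)):
            t, event = sequence[j]
            if last_time is not None and t - last_time > max_gap:
                break  # later events are even further away
            if event == pattern[k] and extend(j + 1, k + 1, t):
                return True
        return False

    return extend(0, 0, None)


# Hypothetical dated sequence: A at time 1, B at time 3, C at time 5.
seq = [(1, "A"), (3, "B"), (5, "C")]

# The pattern A, B, C respects a maximum gap of 2 ...
assert occurs_with_max_gap(seq, ["A", "B", "C"], 2)
# ... but its sub-pattern A, C does not: A and C are 4 time units apart.
assert not occurs_with_max_gap(seq, ["A", "C"], 2)
```

A pattern can thus satisfy the constraint while one of its sub-patterns violates it, so pruning on sub-patterns alone would lose valid patterns.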

17.4.4 Other Directions of Research

Among others, let us introduce here three important directions of research.

Adaptive Pruning Strategies

We mentioned the trade-off between anti-monotonic pruning, which is known to be quite efficient, and pruning based on non anti-monotonic constraints. Since the selectivity of the various constraints is generally unknown, a quite exciting challenge is to look for adaptive strategies that can decide on the pruning strategy dynamically. (Bonchi et al., 2003A, Bonchi et al., 2003B) propose algorithms for frequent itemsets under syntactical monotonic constraints. (Albert-Lorincz and Boulicaut, 2003) considers frequent sequence mining under regular expression constraints. These are promising approaches to widen the applicability of constraint-based mining techniques in real contexts.

Combining Constraints and Condensed Representations

A few papers, e.g., (Boulicaut and Jeudy, 2000, Bonchi and Lucchese, 2004), deal with the problem of extracting constrained condensed representations. In these works, the aim is to compute a condensed representation of the extended theory Th_x(D, 2^I, C_am ∧ C_m, freq). In (Boulicaut and Jeudy, 2000), the authors use free itemsets, i.e., their algorithm computes the extended theory Th_x(D, 2^I, C_am ∧ C_m ∧ C_free, freq). In (Bonchi and Lucchese, 2004), the authors use closed itemsets, i.e., their algorithm computes the extended theory Th_x(D, 2^I, C_am ∧ C_m ∧ C_clos, freq). However, in these two works, the definitions of free sets and closed sets have been modified to be able to regenerate the extended theory Th_x(D, 2^I, C_am ∧ C_m, freq) from the extracted theories. This kind of research combines the advantages of both condensed representations and constrained mining, which results in very efficient algorithms.
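As a reminder of what free and closed itemsets are, here is a minimal sketch using the standard, unmodified definitions (not the modified ones used in the two cited works). The function names and the toy database are assumptions made for the example. Itemsets and transactions are frozensets, and the closure is only defined here for itemsets that occur in at least one transaction.

```python
def support(itemset, db):
    """Number of transactions of db that contain the itemset."""
    return sum(1 for t in db if itemset <= t)


def closure(itemset, db):
    """Intersection of all transactions containing the itemset
    (assumes the itemset occurs in at least one transaction)."""
    covering = [t for t in db if itemset <= t]
    return frozenset.intersection(*covering)


def is_closed(itemset, db):
    """Closed: no proper superset has the same support."""
    return closure(itemset, db) == itemset


def is_free(itemset, db):
    """Free: no proper subset has the same support. Checking the subsets
    obtained by removing one item is enough, since support can only
    grow when items are removed."""
    s = support(itemset, db)
    return all(support(itemset - {i}, db) != s for i in itemset)


# Hypothetical toy database of four transactions.
db = [frozenset("abc"), frozenset("ab"), frozenset("ab"), frozenset("c")]

a, ab = frozenset("a"), frozenset("ab")
assert closure(a, db) == ab                       # every transaction with a also has b
assert is_free(a, db) and not is_closed(a, db)
assert is_closed(ab, db) and not is_free(ab, db)
```

Here {a} and {a, b} have the same support: the free set {a} is the minimal representative of that group of itemsets and the closed set {a, b} is the maximal one.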

Constraint-based Mining of more Complex Pattern Domains

Most of the recent results have concerned simple local pattern discovery tasks like the ones based on itemsets or sequences. We believe that inductive querying is much more general. Many open problems, however, remain to be addressed. For instance, even constraint-based mining of association rules is already much harder than constraint-based mining of itemsets (Lakshmanan et al., 1999, Jeudy and Boulicaut, 2002). The recent work on the MINE RULE query language (Meo et al., 1998) is also typical of the difficulty of optimizing constraint-based association rule mining (Meo, 2003). When considering model mining under constraints (e.g., classifier design or clustering), only very preliminary approaches are available (see, e.g., (Garofalakis and Rastogi, 2000)). We think that this will be a major issue for research in the next few years. For instance, for clustering, it seems important to go further than the classical similarity optimization constraints and to make it possible to specify other constraints on clusters (e.g., enforcing that some objects are or are not within the same clusters).

17.5 Conclusion

In this chapter, we have considered constraint-based mining approaches, i.e., the core techniques for inductive querying. This domain has been studied a lot for simple pattern domains like itemsets or sequences. Rather general forms of inductive queries on these domains (e.g., arbitrary boolean expressions over monotonic and anti-monotonic constraints) have been considered. Besides the many ad-hoc algorithms, an interesting effort has concerned generic algorithms. Many open problems remain: how to solve tough constraints? How to design relevant approximation or relaxation schemes? How to combine constraint-based mining with condensed representations, not only for simple pattern domains but also for more complex ones?

Moreover, within the inductive database framework, the problem is to optimize sequences of queries and typically sequences of correlated inductive queries. It is crucial to consider that the optimization of a query, and thus constraint-based mining, must also take into account the previously solved queries. Looking for the formal properties between inductive queries, especially containment, is thus a major priority. Here again, we believe that condensed representations might play a major role. Last but not least, a quite challenging problem is to consider where the constraints come from. The analysts can think in terms of constraints or declarative specifications which are not supported by the available solvers: an obvious example could be unexpectedness or novelty w.r.t. some explicit background knowledge. Deriving appropriate inductive queries, based on a limited number of primitives (and some associated solvers), from the constraints expressed by the analysts is challenging.

References

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996.

H. Albert-Lorincz and J.-F. Boulicaut. Mining frequent sequential patterns under regular expressions: a highly adaptative strategy for pushing constraints. In Proc. SIAM DM’03, pages 316–320, 2003.

Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proc. CL 2000, volume 1861 of LNCS, pages 972–986. Springer-Verlag, 2000.

Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2):66–75, 2000.

R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. ACM SIGMOD’98, pages 85–93, 1998.

F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. Adaptive constraint pushing in frequent pattern mining. In Proc. PKDD’03, volume 2838 of LNAI, pages 47–58. Springer-Verlag, 2003A.

F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. Examiner: Optimized level-wise frequent pattern mining with monotone constraints. In Proc. IEEE ICDM’03, pages 11–18, 2003B.

F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. Exante: Anticipated data reduction in constrained pattern mining. In Proc. PKDD’03, volume 2838 of LNAI, pages 59–70. Springer-Verlag, 2003C.

F. Bonchi and C. Lucchese. On closed constrained frequent pattern mining. In Proc. IEEE ICDM’04 (In Press), 2004.

J.-F. Boulicaut. Inductive databases and multiple uses of frequent itemsets: the cInQ approach. In Database Technologies for Data Mining - Discovering Knowledge with Inductive Queries, volume 2682 of LNCS, pages 1–23. Springer-Verlag, 2004.

J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary Data Mining. In Proc. PAKDD’00, volume 1805 of LNAI, pages 62–73. Springer-Verlag, 2000.

J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Approximation of frequency queries by means of free-sets. In Proc. PKDD’00, volume 1910 of LNAI, pages 75–85. Springer-Verlag, 2000.

J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery, 7(1):5–22, 2003.

J.-F. Boulicaut and B. Jeudy. Using constraint for itemset mining: should we prune or not? In Proc. BDA’00, pages 221–237, 2000.

J.-F. Boulicaut and B. Jeudy. Mining free-sets under constraints. In Proc. IEEE IDEAS’01, pages 322–329, 2001.

C. Bucila, J. E. Gehrke, D. Kifer, and W. White. Dualminer: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, 7(4):241–272, 2003.

D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proc. IEEE ICDE’01, pages 443–452, 2001.

A. Bykowski and C. Rigotti. DBC: a condensed representation of frequent patterns for efficient mining. Information Systems, 28(8):949–977, 2003.

T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Proc. PKDD’02, volume 2431 of LNAI, pages 74–85. Springer-Verlag, 2002.

B. Crémilleux and J.-F. Boulicaut. Simplest rules characterizing classes generated by delta-free sets. In Proc. ES 2002, pages 33–46. Springer-Verlag, 2002.

L. De Raedt. A perspective on inductive databases. SIGKDD Explorations, 4(2):69–77, 2003.

L. De Raedt, M. Jaeger, S. Lee, and H. Mannila. A theory of inductive query answering. In Proc. IEEE ICDM’02, pages 123–130, 2002.

L. De Raedt and S. Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. IJCAI’01, pages 853–862, 2001.

M. M. Garofalakis and R. Rastogi. Scalable Data Mining with model constraints. SIGKDD Explorations, 2(2):39–48, 2000.

M. M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. VLDB’99, pages 223–234, 1999.

B. Goethals and M. J. Zaki, editors. Proc. of the IEEE ICDM 2003 Workshop on Frequent Itemset Mining Implementations, volume 90 of CEUR Workshop Proceedings, 2003.

D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharma. Discovering all most specific sentences. ACM Transactions on Database Systems, 28(2):140–174, 2003.

T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58–64, 1996.

B. Jeudy and J.-F. Boulicaut. Optimization of association rule mining queries. Intelligent Data Analysis, 6(4):341–357, 2002.

D. Kifer, J. E. Gehrke, C. Bucila, and W. White. How to quickly find a witness. In Proc. ACM PODS’03, pages 272–283, 2003.

S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In Proc. ACM SIGKDD’01, pages 136–143, 2001.

L. V. Lakshmanan, R. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In Proc. ACM SIGMOD’99, pages 157–168, 1999.

D.-I. Lin and Z. M. Kedem. Pincer search: An efficient algorithm for discovering the maximum frequent sets. IEEE Transactions on Knowledge and Data Engineering, 14(3):553–566, 2002.

H. Mannila and H. Toivonen. Multiple uses of frequent sets and condensed representations. In Proc. KDD’96, pages 189–194. AAAI Press, 1996.

H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.

C. Mellish. The description identification problem. Artificial Intelligence, 52(2):151–168, 1992.

R. Meo. Optimization of a language for Data Mining. In Proc. ACM SAC’03 - Data Mining Track, pages 437–444, 2003.

R. Meo, G. Psaila, and S. Ceri. An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, 2(2):195–224, 1998.

T. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203–226, 1980.

R. Ng, L. V. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. ACM SIGMOD’98, pages 13–24, 1998.

N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, 1999.

J. Pei, G. Dong, W. Zou, and J. Han. On computing condensed frequent pattern bases. In Proc. IEEE ICDM’02, pages 378–385, 2002.

J. Pei, J. Han, and L. V. S. Lakshmanan. Mining frequent itemsets with convertible constraints. In Proc. IEEE ICDE’01, pages 433–442, 2001.

R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. ACM SIGKDD’97, pages 67–73, 1997.

M. J. Zaki. Sequence mining in categorical domains: incorporating constraints. In Proc. ACM CIKM’00, pages 422–429, 2000.


Link Analysis

Steve Donoho

Mantas, Inc

Summary. Link analysis is a collection of techniques that operate on data that can be represented as nodes and links. This chapter surveys a variety of techniques including subgraph matching, finding cliques and K-plexes, maximizing spread of influence, visualization, finding hubs and authorities, and combining with traditional techniques (classification, clustering, etc.). It also surveys applications including social network analysis, viral marketing, Internet search, fraud detection, and crime prevention.

Key words: Link analysis, Social network analysis, Graph theory

18.1 Introduction

The term "link analysis" does not refer to one specific technique or algorithm. Rather, it refers to a collection of techniques that are bound together by the type of data they operate on. Link analysis techniques are applied to data that can be represented as nodes and links, as in Figure 18.1.

A node represents an entity such as a person, a document, or a bank account. Nodes are sometimes referred to as "vertices." A link represents a relationship between two entities such as a parent/child relationship between two people, a reference relationship between two documents, or a transaction between two bank accounts. Links are sometimes referred to as "edges." Because links show relationships among entities, this type of data is often referred to as relational data.

This is as opposed to attribute vector data used by many other unsupervised and supervised Data Mining techniques. In most standard Data Mining techniques, data is represented as a set of tuples (each a vector of attribute values). Each tuple represents an entity, but there is no explicit data about relationships among entities. In link analysis, information exists about the relationships among entities, and analysis of these relationships is the focus of the field.

The roots of link analysis predate the use of modern computers. Law enforcement officials have carried out manual link analysis for many years. When a crime is investigated, a network such as in Figure 18.1 is drawn where the nodes represent people, weapons, crime scenes, etc. One person may be linked to another if they are family, friends, roommates, or business partners. A person may be linked to a weapon if it is registered in his name or if it was found at his home. Once a network of relationships is drawn out, the bigger picture of a crime emerges from the details. Holes in the network become apparent, and they are areas for further investigation. Hypotheses can be formed and tested.

Fig. 18.1. Node and Link Data Used by Link Analysis Techniques.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.

Sociologists also performed manual link analysis long before there were computers. The structure of a clan or tribe would be mapped out with nodes representing people and links representing family, work, or social relationships. From this a sociologist could deduce who held powerful positions within the clan, who might influence whom, how information might spread within the clan, and what factions might arise.

The advent of computers allowed these techniques to become much more widespread and to be applied on a much larger scale. All 10 million of a bank’s customers can be analysed for money laundering relationships. The hundreds of millions of documents on the Internet can be analysed to determine which are most respected and reliable. Large communities can be analysed to determine how information and opinions spread and who are the most influential individuals.

This chapter surveys the techniques that fall under the umbrella of link analysis and how these techniques are being applied. Section 18.2 presents some key concepts from the field of Social Network Analysis. Section 18.3 examines how link analysis techniques are used to improve search engine results. Section 18.4 looks at recent link analysis ideas emerging from the field of viral marketing. Section 18.5 shows how fraud detection and law enforcement have presented unique challenges and opportunities for link analysis. Finally, Section 18.6 surveys recent combinations of link analysis with traditional Data Mining techniques.


18.2 Social Network Analysis

The field of social network analysis (Wasserman, 1994, Hanneman, 2001) has developed over many years as sociologists developed formal methods of studying groups of people and their relationships. When studying a social network, there are many questions sociologists are interested in answering:

1. Which people are powerful?

2. Which people influence other people?

3. How does information spread within the network?

4. Who is relatively isolated, and who is well connected?

5. In a disagreement, who is likely to side with whom?

6. What roles do people play in an organization, and who has similar roles?

While concepts such as powerful, influential, isolated, and connected are somewhat subjective, social network analysis methods give us a baseline for measuring and making comparisons.

Fig. 18.2. Three Networks to Illustrate an Individual’s Power within a Network.

Many things can make a person powerful within a group. Consider the shaded nodes in the three networks shown in Figure 18.2 (Hanneman, 2001). The person at the center of the star intuitively seems more powerful than the one in the circle or the one at the end of the line. If the people in the star want to communicate with each other, they have to go through the center person, and that person has the power to either facilitate or hamper communication. If the people in the star want to engage in business, they have to go through the person in the center, and that person has the power to charge a fee as the middleman. In contrast, the shaded node in the circular network is the most convenient path of communication or trade for some nodes, but he is not the only path. Intuitively, he has less power than the center of the star. The shaded node at the end of the line is dependent on others for communication and trade but has no one who is dependent on him. Intuitively, he has little or no power.

The networks in Figure 18.2 illustrate how "centrality" is one measure of power. The node at the center of the star derives its power from being in the center of its network. The shaded nodes in the circle and line are less central to their networks and are therefore less powerful. Some quantitative methods of measuring centrality are:

1. Degree. The shaded node in the star network is linked to six other nodes and thus has a degree of six. All the other nodes in the star have a degree of one and are comparatively less central. All the nodes in the circle have the same degree: two. The shaded node in the line has a degree of one and is thus slightly less central than other nodes in the line with degree two.

2. Closeness. The average distance from the shaded node in the star to all other nodes is 1.0. This node has very direct access to everyone else. Other nodes in the star have an average distance of 1.8. All the nodes in the circle have an average distance of 2.0. The node at the end of the line has an average distance of 3.5, whereas the node in the center of the line has an average distance of 2.0.

3. Betweenness. The shaded node in the star is between all 15 pairs of other nodes. In the circle there are two paths between each pair of nodes. The shaded node in the circle is on a path between all 15 pairs of other nodes, but since there is an alternative path between each pair, the shaded node is on 50% of the paths between pairs. The node at the end of the line is between no pairs. The node one from the end of the line is on paths between 5 pairs (33% of 15 paths). The node at the center of the line is on paths between 9 pairs (60% of 15 paths).

4. Cutpoints. Related to betweenness, cutpoints are nodes that, if removed, divide the network into unconnected systems. These nodes hold particular power because they are the only point of contact between otherwise disconnected networks. If the center of the star is removed, six disconnected systems result. If a node in the circle is removed, the network is still connected. If a non-end node is removed from the line, two disconnected systems result.
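The degree, closeness, and cutpoint figures quoted above can be reproduced with a short Python sketch. It assumes seven-node star, circle, and line networks (consistent with a center of degree six and 15 pairs of other nodes); the function names and node labels are illustrative choices. Betweenness is left out because the circle case counts alternative paths in a way this simple sketch does not model.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected graph as a dict: node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def distances_from(adj, source, removed=None):
    """Breadth-first distances from source, ignoring the node `removed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt != removed and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def degree(adj, node):
    return len(adj[node])


def average_distance(adj, node):
    """Closeness as used above: mean distance to all other nodes
    (assumes a connected network)."""
    dist = distances_from(adj, node)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others)


def components_without(adj, removed):
    """Number of disconnected systems left after removing one node."""
    seen, count = set(), 0
    for node in adj:
        if node == removed or node in seen:
            continue
        seen.update(distances_from(adj, node, removed))
        count += 1
    return count


# Seven-node networks; node 0 is the star center and the end of the line.
star = make_graph([(0, i) for i in range(1, 7)])
circle = make_graph([(i, (i + 1) % 7) for i in range(7)])
line = make_graph([(i, i + 1) for i in range(6)])

assert degree(star, 0) == 6 and degree(circle, 0) == 2 and degree(line, 0) == 1
assert average_distance(star, 0) == 1.0
assert round(average_distance(star, 1), 1) == 1.8
assert average_distance(circle, 0) == 2.0
assert average_distance(line, 0) == 3.5 and average_distance(line, 3) == 2.0
assert components_without(star, 0) == 6
assert components_without(circle, 0) == 1
assert components_without(line, 3) == 2
```

A node is a cutpoint exactly when components_without returns more than one for a network that was connected to begin with.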

A clique is a small, highly-interconnected group within a larger network. Cliques are of interest for several reasons. Ideas or information may spread extremely quickly within a clique because of the high connectivity. Members of a clique often act and behave as a cohesive unit. Disputes may form between cliques ("factions"). A person can be described with respect to the clique(s) they belong to. A person who is only connected to people in his clique is called a "local" and is strongly influenced by the clique. A person who belongs to many cliques is called a "cosmopolitan" and serves to bring outside ideas and information into a clique.

The strictest definition of a clique is a complete subgraph (all nodes in the clique must be linked to all other nodes). Two more relaxed definitions are:

1. K-plexes. A group of N nodes is a K-plex if each of the nodes is connected to at least N-K other nodes in the group. Intuitively, if K=2 then every member of the clique has to be connected to at least N-2 of the other N-1 members, i.e., to all but one of them.

2. K-cores. The definition of a K-core is slightly more relaxed than that of a K-plex. A K-core is a maximal group of nodes all of which are connected to at least K other nodes in the group. For example, if K=4 then every member of the clique is connected to at least 4 other clique members.
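Both relaxed definitions translate directly into code. The sketch below checks the K-plex condition (each node linked to at least N-K other nodes of the group) and extracts a K-core by repeatedly discarding nodes with too few links inside the remaining group. The function names, the adjacency-dictionary layout, and the small example network are assumptions made for illustration.

```python
def is_k_plex(group, adj, k):
    """True if every node of the group is linked to at least N-K other
    nodes of the group, where N is the group size."""
    group = set(group)
    n = len(group)
    return all(len(adj[v] & group) >= n - k for v in group)


def k_core(adj, k):
    """Largest set of nodes in which every node is linked to at least
    k other nodes of the set (may be empty)."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive


# Hypothetical network: a square a-b-c-d with the diagonal a-c,
# plus a pendant node e attached to d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}

square = {"a", "b", "c", "d"}
assert not is_k_plex(square, adj, 1)   # K=1 is the strict clique: b and d are not linked
assert is_k_plex(square, adj, 2)       # each node is linked to at least 4-2 others
assert k_core(adj, 2) == square        # e has only one link and drops out
assert k_core(adj, 3) == set()         # no group with three links per member
```

Note that k_core returns the union of all K-cores of the network; if the result is disconnected, each connected piece is a K-core in its own right.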


The concept of "equivalence" is very important within social networks. It makes it possible to determine if a person is playing a particular role within a network. This allows both intra-network comparisons (one node has the same role as another node within one network) and inter-network comparisons (two nodes in different networks are playing the same role). Two measures of equivalence are:

1. Structural equivalence. This is a strict measure of equivalence between two nodes. Two nodes are exactly structurally equivalent if they are linked to exactly the same other nodes. If not exactly equivalent, the degree of partial structural equivalence can be measured using the degree of overlap in nodes they are linked to.

2. Regular equivalence. Regular equivalence is a less strict definition than structural equivalence. Two nodes have regular equivalence if the nodes they are linked to are regular equivalents. For example, Fred Flintstone is the regular equivalent of Barney Rubble because Fred is the husband of Wilma, and Barney is the husband of Betty, and Wilma and Betty are regular equivalents.
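Structural equivalence is simple enough to sketch in a few lines. The text does not fix a particular overlap measure, so the sketch below uses the Jaccard ratio (shared neighbours divided by all neighbours) as one plausible choice; this, the function names, and the example network are assumptions. The two compared nodes are excluded from each other's neighbour sets so that a direct link between them does not count against their equivalence. Regular equivalence is not covered here.

```python
def other_neighbours(adj, a, b):
    """Neighbours of a and of b, leaving a and b themselves out."""
    return adj[a] - {a, b}, adj[b] - {a, b}


def structurally_equivalent(adj, a, b):
    """Exact structural equivalence: linked to exactly the same other nodes."""
    na, nb = other_neighbours(adj, a, b)
    return na == nb


def structural_overlap(adj, a, b):
    """Partial structural equivalence as a Jaccard ratio in [0, 1]
    (the choice of Jaccard is an assumption of this sketch)."""
    na, nb = other_neighbours(adj, a, b)
    union = na | nb
    return len(na & nb) / len(union) if union else 1.0


# Hypothetical network: x and y are both linked to p and q; z only to p.
adj = {
    "x": {"p", "q"},
    "y": {"p", "q"},
    "z": {"p"},
    "p": {"x", "y", "z"},
    "q": {"x", "y"},
}

assert structurally_equivalent(adj, "x", "y")
assert not structurally_equivalent(adj, "x", "z")
assert structural_overlap(adj, "x", "z") == 0.5   # they share p out of {p, q}
```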

On a broader scale, equivalence of nodes lays the groundwork for measuring the similarity of one whole social network to another whole social network. This is useful for matching a network against a known template in order to identify the nature of the network, as will be seen in Section 18.5 on fraud detection and law enforcement. Many groups such as academic circles, fraud rings, business circles, shoppers with common interests, and professional societies can be represented as social networks. Because of this, Social Network Analysis lays the groundwork for many important real-world applications.

18.3 Search Engines

The Internet is rich in relational data by the simple fact that web pages are linked to other web pages. While traditional search techniques such as keyword searches focus exclusively on the content of a single page, newer techniques (Page et al., 1998, Kleinberg, 1999) exploit relationships among pages. A user performing a search wants to find results that are not only relevant but also authoritative and reliable. A keyword search on "stock market" will not only return authoritative sites such as the NASDAQ and NYSE pages, it will also return pages from thousands of self-proclaimed gurus selling books, software, and advice. The truly reliable sources of information are likely to be lost among the self-proclaimed gurus. Is there a way to separate the wheat from the chaff? This is where relational information contained in links comes into play.

An authoritative site such as the NASDAQ is likely to be recognized as authoritative by many people; therefore, many other sites are likely to point to the NASDAQ site. But a self-proclaimed stock market guru is less likely to have many other sites pointing to his site unless there truly is some merit to what he has to say. When one site references another site, it is in fact declaring that that site has some merit: it is casting a vote for the value and importance of the other site. Conceptually, this
