Trang 1

Mining, Indexing and Searching Graph Databases

Presenter: A/Prof. Do Phuc

Source: Jiawei Han, Vladimir Lipets

Trang 2

Graph, Graph, Everywhere

[Figures: aspirin; yeast protein interaction network]

Trang 3

Why Graph Mining and Searching?

Graphs are ubiquitous

Chemical compounds (Cheminformatics)

Protein structures, biological pathways/networks (Bioinformatics)

Program control flow, traffic flow, and workflow analysis

XML databases, Web, and social network analysis

Graph is a general model

Trees, lattices, sequences, and items are degenerate graphs

Trang 4

Graph Isomorphism, Subgraph Isomorphism

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis

Trang 5

Graph and subgraph isomorphism are important and very general forms of pattern matching that find practical application in areas such as:

pattern recognition and computer vision,

Trang 6

A hierarchy of pattern matching problems

Graph isomorphism

Approximate subgraph isomorphism

Graph edit distance

Trang 7

Isomorphic Graphs

Trang 8

Graph Isomorphism

Trang 9

Subgraph of a given graph

Trang 10

Subgraph Isomorphism

Trang 11

Subgraph Isomorphism and Related Problems

Given a pattern graph G and a target graph H

Decision problem: Answer whether H contains a subgraph isomorphic to G

Search problem: Return an occurrence of G as a subgraph of H

Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G

Enumeration problem: Return all occurrences of G as a subgraph of H
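
The four variants differ only in what is returned from the same search. The following is a minimal Python sketch, assuming undirected, unlabeled graphs stored as adjacency dictionaries and non-induced matching (H may have extra edges between matched vertices). The function names and the example graphs are illustrative and do not come from the slides.

```python
def embeddings(g, h):
    """Yield each injective vertex mapping g -> h that sends every edge of g to an edge of h."""
    order = sorted(g)

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        u = order[len(mapping)]
        for v in h:
            if v in mapping.values():
                continue  # keep the mapping injective
            # every already-mapped neighbour of u must land on a neighbour of v
            if all(mapping[w] in h[v] for w in g[u] if w in mapping):
                mapping[u] = v
                yield from extend(mapping)
                del mapping[u]

    yield from extend({})


def contains(g, h):
    """Decision problem: does h contain a subgraph isomorphic to g?"""
    return next(embeddings(g, h), None) is not None


def occurrences(g, h):
    """Enumeration problem: distinct subgraphs of h (as edge sets) isomorphic to g."""
    found = set()
    for m in embeddings(g, h):
        found.add(frozenset(frozenset((m[u], m[w])) for u in g for w in g[u]))
    return found


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
# Target H: the 4-cycle 1-2-3-4 plus the chord 1-3, so it holds two triangles.
h = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}

print(contains(triangle, h))          # decision problem
print(next(embeddings(triangle, h)))  # search problem: one occurrence
print(len(occurrences(triangle, h)))  # counting problem
```

Note that one subgraph can be reached by several vertex mappings (six per triangle here), which is why the counting function deduplicates by edge set.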

Trang 12

Graph Isomorphism, Subgraph Isomorphism

Trang 13

Graph Pattern Mining

Frequent subgraphs

A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold

Applications of graph pattern mining

Mining biochemical structures

Program control flow analysis

Mining XML structures or Web communities

Building blocks for graph classification, clustering, comparison, and correlation analysis
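
The support definition can be illustrated with a small Python sketch. It is restricted to single-edge patterns (a pair of vertex labels), which is an assumption made here to avoid a subgraph isomorphism test. The toy database and the function name are invented for illustration.

```python
from collections import Counter


def frequent_edges(database, min_support):
    """Return the single-edge patterns whose support reaches min_support."""
    support = Counter()
    for labels, edges in database:
        # a pattern is counted once per graph, however often it occurs inside it
        patterns = {tuple(sorted((labels[u], labels[v]))) for u, v in edges}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}


# Each graph is (vertex labels, edge list); the labels mimic atom types.
database = [
    ({0: "C", 1: "O", 2: "N"}, [(0, 1), (0, 2)]),
    ({0: "C", 1: "O", 2: "C"}, [(0, 1), (0, 2), (1, 2)]),
    ({0: "C", 1: "N", 2: "S"}, [(0, 1), (1, 2)]),
]

print(frequent_edges(database, min_support=2))
```

With a threshold of 2, only the C-O and C-N edges are frequent; C-C and N-S each occur in a single graph.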

Trang 14

Example: Frequent Subgraphs

[Figure: chemical structure diagrams (atom labels S, O, N, OH)]

Trang 15

Frequent Subgraph Mining Approaches

Apriori-based approach

AGM/AcGM: Inokuchi, et al (PKDD’00)

FSG: Kuramochi and Karypis (ICDM’01)

PATH: Vanetik and Gudes (ICDM’02, ICDM’04)

FFSM: Huan, et al (ICDM’03)

Pattern growth-based approach

MoFa, Borgelt and Berthold (ICDM’02)

gSpan: Yan and Han (ICDM’02)

Gaston: Nijssen and Kok (KDD’04)

Trang 16

Properties of Graph Mining Algorithms

Search order

breadth vs depth

Generation of candidate subgraphs

apriori vs pattern growth

Elimination of duplicate subgraphs

passive vs active

Support calculation

embedding store or not

Discover order of patterns

path → tree → graph

Trang 17

Trang 18

Graph Search: Querying Graph Databases

Querying graph databases:

Given a graph database and a query graph, find all graphs containing this query graph

[Figure: a query graph and the chemical structures in the graph database]

Trang 19

[Figure: chemical structures and the query graph]

Trang 20

Index substructures of a query graph to prune graphs that do not contain these substructures

Trang 21

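Two steps in processing graph queries

Step 1 Index Construction

Enumerate structures in the graph database, build an inverted index between structures and graphs

Step 2 Query Processing

Enumerate structures in the query graph

Calculate the candidate graphs containing these structures

Prune the false positive answers by performing subgraph isomorphism test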
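
The filter-then-verify flow can be sketched in Python as follows. This is only an illustration under stated assumptions: the indexed "structures" are taken to be labeled edges, graphs are (vertex labels, edge list) pairs, and all function names and data are hypothetical.

```python
from collections import defaultdict


def edge_features(labels, edges):
    """Label pairs of a graph's edges; they play the role of the indexed structures."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


def build_index(database):
    """Step 1: inverted index from each structure to the ids of graphs containing it."""
    index = defaultdict(set)
    for gid, (labels, edges) in database.items():
        for feature in edge_features(labels, edges):
            index[feature].add(gid)
    return index


def candidates(index, database, query):
    """Step 2, filtering: graphs that contain every structure of the query."""
    result = set(database)
    for feature in edge_features(*query):
        result &= index.get(feature, set())
    return result


def is_subgraph(query, graph):
    """Step 2, verification: backtracking labeled subgraph isomorphism test."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    g_adj = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            return True
        u = q_nodes[len(mapping)]
        for v in g_labels:
            if v in mapping.values() or g_labels[v] != q_labels[u]:
                continue
            mapping[u] = v
            ok = all(frozenset((mapping[a], mapping[b])) in g_adj
                     for a, b in q_edges if a in mapping and b in mapping)
            if ok and extend(mapping):
                return True
            del mapping[u]
        return False

    return extend({})


database = {
    "g1": ({0: "C", 1: "O", 2: "N"}, [(0, 1), (0, 2)]),
    "g2": ({0: "O", 1: "C", 2: "C", 3: "N"}, [(0, 1), (1, 2), (2, 3)]),
    "g3": ({0: "C", 1: "S"}, [(0, 1)]),
}
query = ({0: "O", 1: "C", 2: "N"}, [(0, 1), (1, 2)])  # the path O-C-N

index = build_index(database)
cands = candidates(index, database, query)
answers = sorted(g for g in cands if is_subgraph(query, database[g]))
print(sorted(cands))  # ['g1', 'g2']
print(answers)        # ['g1']
```

Here g3 is pruned by the index alone, while g2 survives filtering (it has both a C-O and a C-N edge) and is only rejected by the isomorphism test, which is why the verification step is still needed.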

Trang 22

Some recent progress on graph mining

Trang 23

Graph Clustering

Graph similarity measure

Feature-based similarity measure

Each graph is represented as a feature vector

The similarity is defined by the distance of their corresponding vectors

Frequent subgraphs can be used as features

Structure-based similarity measure

Maximal common subgraph

Graph edit distance: insertion, deletion, and relabeling

Trang 24

Graph Classification

Local structure based approach

Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

Graph pattern-based approach

Subgraph patterns from domain knowledge

Subgraph patterns from data mining

Kernel-based approach

Random walk (Gärtner ’02, Kashima et al ’02, ICML’03, Mahé et al ICML’04)

Optimal local assignment (Fröhlich et al

Trang 25

Structure Similarity Search

[Figure: (a) caffeine, (b) diurobromine, (c) Viagra]

Trang 26

Some “Straightforward” Methods

Method 1: Directly compute the similarity between the graphs in the DB and the query graph

Sequential scan

Subgraph similarity computation

Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search

Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
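
The figure of 1,140 is the number of ways to choose which 3 of the 20 edges are dropped, C(20, 3). A short Python check, assuming exactly 3 edges are missed and using integers as stand-ins for the query edges:

```python
from itertools import combinations
from math import comb

query_edges = list(range(20))  # stand-ins for the 20 edges of the query graph
missed = 3

# Each relaxed query drops exactly `missed` edges and keeps the other 17.
relaxed_queries = [
    [e for e in query_edges if e not in dropped]
    for dropped in combinations(query_edges, missed)
]

print(len(relaxed_queries))  # 1140
print(comb(20, 3))           # 1140
```

Each of these relaxed queries would then need its own exact subgraph search, which is what makes Method 2 expensive.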

Trang 27

Index: Precise vs Approximate Search

Precise Search

Use frequent patterns as indexing features

Select features in the database space based on their selectivity

Build the index

Approximate Search

Hard to build indices covering similar subgraphs: the number of subgraphs in databases is explosive

Idea: (1) keep the index structure

(2) select features in the query space

Trang 28

Substructure Similarity Measure

Query relaxation measure

The number of edges that can be relabeled or missed; the positions of these edges are not fixed

QUERY GRAPH

…

Trang 29

Substructure Similarity Measure

Feature-based similarity measure

Each graph is represented as a feature vector

X = {x1, x2, …, xn}

The similarity is defined by the distance of their corresponding vectors

Advantages

Easy to index

Fast

Disadvantages

Rough measure
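
A feature-based measure can be sketched in a few lines of Python. The slides do not fix a particular distance, so Euclidean distance is an assumption here, and the feature names and counts are invented for illustration.

```python
from math import sqrt


def feature_vector(feature_counts, features):
    """Represent a graph as X = (x1, ..., xn): one occurrence count per feature."""
    return [feature_counts.get(f, 0) for f in features]


def euclidean(x, y):
    """Distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


features = ["C-C", "C-O", "C-N"]  # hypothetical indexed features
g1 = feature_vector({"C-C": 3, "C-O": 1}, features)
g2 = feature_vector({"C-C": 3, "C-N": 1}, features)

print(g1, g2)
print(euclidean(g1, g2))
```

The measure is rough because two structurally different graphs can share the same feature counts, in which case their distance is zero.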

Trang 30

Query Processing Framework

Three steps in processing approximate graph queries

Step 1 Index Construction

Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database

Trang 31

Framework (cont.)

Step 2 Feature Miss Estimation

Determine the indexed features belonging to the query graph

Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J

On the query graph, not the graph database
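
One way to picture the estimation of J is sketched below in Python. It assumes that relaxation means deleting k query edges and that each feature occurrence in the query is recorded as the set of query edges it uses; neither detail is spelled out in the slides. The exhaustive version gives the exact maximum for small queries, and the second function is a cheaper, looser bound. All names and data are hypothetical.

```python
from itertools import combinations


def exact_max_misses(edges, embeddings, k):
    """Largest number of feature occurrences destroyed by deleting any k query edges."""
    return max(
        sum(1 for emb in embeddings if emb & set(dropped))
        for dropped in combinations(edges, k)
    )


def greedy_upper_bound(edges, embeddings, k):
    """Looser bound: add up the k largest per-edge hit counts."""
    hits = sorted(
        (sum(1 for emb in embeddings if e in emb) for e in edges),
        reverse=True,
    )
    return sum(hits[:k])


edges = [0, 1, 2, 3]  # query edges, numbered for convenience
# Each feature occurrence is the set of query edges it covers.
embeddings = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3}), frozenset({1})]

print(exact_max_misses(edges, embeddings, k=2))    # 4
print(greedy_upper_bound(edges, embeddings, k=2))  # 5
```

Both values are computed from the query graph alone, so the estimate is made once per query rather than once per database graph.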

Trang 32

Framework (cont.)

Step 3 Query Processing

Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ

If FG – FQ > J, discard G

The remaining graphs constitute a candidate answer set
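
The filtering rule can be sketched in Python. The sketch interprets the feature difference as the number of query feature occurrences that a graph fails to supply, which is an assumption about the slide's FG – FQ notation. The feature-graph matrix is stored as one dictionary of counts per graph, and all names and numbers are illustrative.

```python
def missed_features(query_counts, graph_counts):
    """Number of query feature occurrences that the graph does not supply."""
    return sum(max(0, q - graph_counts.get(f, 0)) for f, q in query_counts.items())


def candidate_answers(feature_graph_matrix, query_counts, j):
    """Keep the graphs whose miss count does not exceed the bound J."""
    return [
        gid
        for gid, counts in feature_graph_matrix.items()
        if missed_features(query_counts, counts) <= j
    ]


# One column of the feature-graph matrix per graph: feature -> occurrence count.
feature_graph_matrix = {
    "G1": {"f1": 2, "f2": 1},
    "G2": {"f1": 1},
    "G3": {"f1": 3, "f2": 2, "f3": 1},
}
query_counts = {"f1": 2, "f2": 1, "f3": 1}

print(candidate_answers(feature_graph_matrix, query_counts, j=1))  # ['G1', 'G3']
```

G2 misses three feature occurrences and is discarded without any structural comparison; the surviving candidates still need to be verified against the query.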

Trang 33

Trang 35

Data Mining Across Multiple Networks

[Figure: example networks over vertices a to k]

Trang 36

Data Mining Across Multiple Networks

[Figure: example networks]

Trang 37

Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets

[Figure: networks over vertices a to f]

Trang 38

CODENSE: Mine Coherent Dense Subgraphs

[Figure: input graphs and the summary graph Ĝ]

Trang 39

CODENSE: Mine Coherent Dense Subgraphs

(2) Identify dense subgraphs of the summary graph

[…] dense subgraph in the summary graph. However, the reverse is not true
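
The role of the summary graph can be illustrated with a small Python sketch. It assumes the summary graph keeps the edges that occur in at least a chosen number of input networks; that construction step is not spelled out in the surviving slide text, and this is not the CODENSE implementation. The density measure and the toy networks are likewise illustrative.

```python
from collections import Counter
from itertools import combinations


def edges(*pairs):
    """Build an undirected edge set from two-letter strings such as "ab"."""
    return {frozenset(p) for p in pairs}


def summary_graph(graphs, min_support):
    """Edges that occur in at least min_support of the input graphs."""
    counts = Counter(edge for graph in graphs for edge in graph)
    return {edge for edge, c in counts.items() if c >= min_support}


def density(vertices, edge_set):
    """Fraction of vertex pairs inside `vertices` that are joined by an edge."""
    pairs = [frozenset(p) for p in combinations(sorted(vertices), 2)]
    return sum(p in edge_set for p in pairs) / len(pairs)


graphs = [edges("ab", "ac"), edges("ab", "bc"), edges("ac", "bc")]
summary = summary_graph(graphs, min_support=2)

print(density({"a", "b", "c"}, summary))               # 1.0
print([density({"a", "b", "c"}, g) for g in graphs])
```

In this toy case the vertex set {a, b, c} forms a complete triangle in the summary graph although no single input network contains all three edges. This is one reading of why a dense subgraph of the summary graph is only a candidate and must be checked against the individual networks.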

Trang 40

[Figure: networks over vertices a to k]

Applying CoDense to 39 Yeast Microarray Data Sets

Trang 41

MRPL51

MRP49 YDR115W

PHB1

PET100

Discovery of New Genes Based on Similar Genes

Trang 42

Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18

MRPL32

ACN9

MRPL51 MRP49

YDR115W

PHB1

PET100 PET100

Network of Known Similar Genes

Trang 43

ACN9

MRPL51

MRP49 YDR115W

PHB1

PET100

Network Involved in the New Genes

Trang 44

Trang 45

Graph mining has wide applications

Frequent and closed subgraph mining methods

gSpan and CloseGraph: pattern-growth depth-first search approach

Graph indexing techniques:

Frequent and discriminative subgraphs as indexing features

Indexing and approximate matching help similar subgraph search

Mining coherent, dense, multiple biological networks

Many new developments along the line of graph pattern mining

Trang 46

Thanks and Questions

Title	Mining, Indexing and Searching Graph Databases
Authors	Jiawei Han, Vladimir Lipets
Institution	Unknown University
Field	Graph Databases and Data Mining
Type	Lecture Presentation
Year of publication	2010
City	Unknown City

Format
Number of pages	46
File size	484.46 KB