Page 1: Mining, Indexing and Searching Graph Databases
Presenter: A/Prof. Do Phuc
Source: Jiawei Han, Vladimir Lipets
Page 2: Graph, Graph, Everywhere
[Figures: aspirin; yeast protein interaction network]
Page 3: Why Graph Mining and Searching?
Graphs are ubiquitous
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerate graphs
Page 4
Graph Isomorphism, Subgraph Isomorphism
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Page 5
Graph and subgraph isomorphism are important and very general forms of pattern matching that find practical application in areas such as:
pattern recognition and computer vision,
Page 6: A hierarchy of pattern matching problems
Graph isomorphism
Approximate subgraph isomorphism
Graph edit distance
Page 7: Isomorphic Graphs
Page 8: Graph Isomorphism
Page 9: Subgraph of a given graph
Page 10: Subgraph Isomorphism
Page 11: Subgraph Isomorphism and Related Problems
Given a pattern graph G and a target graph H
Decision problem: Answer whether H contains a subgraph isomorphic to G
Search problem: Return an occurrence of G as a subgraph of H
Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G
Enumeration problem: Return all occurrences of G as a subgraph of H
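The four problems above can be sketched with one brute-force matcher. This is a minimal illustration, not an efficient algorithm: it assumes undirected, unlabeled graphs stored as adjacency dictionaries, uses non-induced matching (every pattern edge must map to a target edge), and all function names are hypothetical.

```python
from itertools import permutations

def occurrences(pattern, target):
    """Yield every injective vertex mapping that sends each pattern edge to a target edge."""
    p_nodes = sorted(pattern)
    # Try every ordered choice of distinct target vertices (exponential; illustration only).
    for image in permutations(sorted(target), len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if all(m[v] in target[m[u]] for u in p_nodes for v in pattern[u]):
            yield m

def matched_subgraph(pattern, m):
    """Return the target subgraph hit by mapping m as (vertex set, edge set)."""
    vertices = frozenset(m.values())
    edges = frozenset(frozenset((m[u], m[v])) for u in pattern for v in pattern[u])
    return vertices, edges

def decide(pattern, target):
    """Decision problem: does the target contain a subgraph isomorphic to the pattern?"""
    return next(occurrences(pattern, target), None) is not None

def search(pattern, target):
    """Search problem: return one occurrence (a vertex mapping), or None."""
    return next(occurrences(pattern, target), None)

def enumerate_subgraphs(pattern, target):
    """Enumeration problem: all distinct subgraphs of the target isomorphic to the pattern."""
    return {matched_subgraph(pattern, m) for m in occurrences(pattern, target)}

def count(pattern, target):
    """Counting problem: number of distinct matching subgraphs."""
    return len(enumerate_subgraphs(pattern, target))

# Pattern: a triangle. Target: a 4-cycle a-b-c-d with the chord a-c.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
target = {"a": {"b", "c", "d"}, "b": {"a", "c"},
          "c": {"a", "b", "d"}, "d": {"a", "c"}}
print(decide(triangle, target), count(triangle, target))
```

Mappings that differ only by a symmetry of the pattern hit the same target subgraph, which is why counting deduplicates on the matched vertex and edge sets rather than counting mappings.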
Page 12
Graph Isomorphism, Subgraph Isomorphism
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Page 13: Graph Pattern Mining
Frequent subgraphs
A (sub)graph is frequent if its support (occurrence
frequency) in a given dataset is no less than a
minimum support threshold
Applications of graph pattern mining
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering,
comparison, and correlation analysis
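The support definition above can be illustrated on the smallest possible patterns. The sketch below assumes each graph is given as a set of labeled edges and mines only single-edge patterns; general subgraph patterns would additionally need a subgraph isomorphism test. The function name and the toy database are hypothetical.

```python
from collections import Counter

def frequent_edges(database, min_support):
    """Return single-edge patterns whose support is at least min_support.

    database: list of graphs, each a set of labeled edges (label_u, label_v).
    Support of a pattern = number of graphs in the database containing it.
    """
    support = Counter()
    for graph in database:
        # Normalise edge direction, then count each pattern once per graph.
        patterns = {tuple(sorted(edge)) for edge in graph}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}

database = [
    {("C", "O"), ("C", "N"), ("C", "C")},
    {("O", "C"), ("C", "S")},
    {("C", "N"), ("C", "O"), ("N", "O")},
]
print(frequent_edges(database, min_support=2))
```

With a minimum support of 2, only the C-O edge (support 3) and the C-N edge (support 2) are frequent.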
Page 14: Example: Frequent Subgraphs
[Figure: example chemical structures]
Page 15: Frequent Subgraph Mining Approaches
Apriori-based approach
AGM/AcGM: Inokuchi, et al (PKDD’00)
FSG: Kuramochi and Karypis (ICDM’01)
PATH: Vanetik and Gudes (ICDM’02, ICDM’04)
FFSM: Huan, et al (ICDM’03)
Pattern growth-based approach
MoFa, Borgelt and Berthold (ICDM’02)
gSpan: Yan and Han (ICDM’02)
Gaston: Nijssen and Kok (KDD’04)
Page 16: Properties of Graph Mining Algorithms
Search order
breadth vs depth
Generation of candidate subgraphs
apriori vs pattern growth
Elimination of duplicate subgraphs
passive vs active
Support calculation
embedding store or not
Discovery order of patterns
path → tree → graph
Page 17
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Page 18: Graph Search: Querying Graph Databases
Querying graph databases:
Given a graph database and a query graph, find all graphs containing this query graph
[Figure: query graph and graph database]
Page 19
[Figure: query graph]
Page 20
Index substructures of a query graph to prune graphs that do not contain these substructures
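Page 21
Two steps in processing graph queries
Step 1 Index Construction
Enumerate structures in the graph database, build an inverted index between structures and graphs
Step 2 Query Processing
Enumerate structures in the query graph, calculate the candidate graphs containing these structures
Prune the false positive answers by performing subgraph isomorphism test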
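The two steps above can be sketched as a filter-and-verify pipeline. This is a minimal illustration under stated assumptions: the indexed structures are single labeled edges rather than richer substructures, a graph is a pair (vertex labels, edge set), and verification uses a brute-force subgraph isomorphism test. All names are hypothetical.

```python
from itertools import permutations

def features(graph):
    """Indexed structures of a graph: here, its labeled edges (an assumption)."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Step 1: inverted index from each structure to the ids of graphs containing it."""
    index = {}
    for gid, graph in database.items():
        for f in features(graph):
            index.setdefault(f, set()).add(gid)
    return index

def candidates(index, database, query):
    """Step 2a: keep only graphs that contain every structure of the query."""
    ids = set(database)
    for f in features(query):
        ids &= index.get(f, set())
    return ids

def is_subgraph(query, graph):
    """Step 2b: brute-force labeled subgraph isomorphism test (exponential)."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = sorted(q_labels)
    for image in permutations(sorted(g_labels), len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if all(q_labels[v] == g_labels[m[v]] for v in q_nodes) and \
           all(frozenset((m[u], m[v])) in g_edge_set for u, v in q_edges):
            return True
    return False

def answer(index, database, query):
    """Filter with the index, then verify each candidate."""
    return {gid for gid in candidates(index, database, query)
            if is_subgraph(query, database[gid])}

database = {
    "g1": ({0: "C", 1: "O", 2: "N"}, {(0, 1), (0, 2)}),          # O-C-N path
    "g2": ({0: "C", 1: "O", 2: "C", 3: "N"}, {(0, 1), (2, 3)}),  # C-O and C-N, disconnected
    "g3": ({0: "C", 1: "S"}, {(0, 1)}),
}
query = ({0: "O", 1: "C", 2: "N"}, {(0, 1), (1, 2)})              # O-C-N path
index = build_index(database)
print(candidates(index, database, query), answer(index, database, query))
```

Here g2 survives the index filter because it contains both a C-O and a C-N edge, but the verification step removes it because the two edges do not share a carbon atom.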
Page 22
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Some recent progress on graph mining
Page 23: Graph Clustering
Graph similarity measure
Feature-based similarity measure
Each graph is represented as a feature vector
The similarity is defined by the distance of their corresponding vectors
Frequent subgraphs can be used as features
Structure-based similarity measure
Maximal common subgraph
Graph edit distance: insertion, deletion, and relabeling
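The feature-based measure above can be sketched as follows. The example assumes labeled edges as the features (frequent subgraphs could be substituted) and Euclidean distance between the count vectors; both choices and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def feature_vector(graph_edges, features):
    """Count how often each feature (here: a labeled edge) occurs in the graph."""
    counts = Counter(tuple(sorted(e)) for e in graph_edges)
    return [counts[f] for f in features]

def distance(x, y):
    """Euclidean distance between two feature vectors."""
    return math.dist(x, y)

features = [("C", "C"), ("C", "N"), ("C", "O")]   # hypothetical feature set
g1 = [("C", "C"), ("C", "O"), ("O", "C")]         # graphs given as lists of labeled edges
g2 = [("C", "C"), ("C", "N")]

x1 = feature_vector(g1, features)   # [1, 0, 2]
x2 = feature_vector(g2, features)   # [1, 1, 0]
print(x1, x2, distance(x1, x2))
```

Two graphs with identical feature counts get distance 0 even if their structures differ, which is what makes this kind of measure cheap but coarse.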
Page 24: Graph Classification
Local structure based approach
Local structures in a graph, e.g., neighbors
surrounding a vertex, paths with fixed length
Graph pattern-based approach
Subgraph patterns from domain knowledge
Subgraph patterns from data mining
Kernel-based approach
Random walk (Gärtner ’02, Kashima et al ’02,
ICML’03, Mahé et al ICML’04)
Optimal local assignment (Fröhlich et al
Page 25: Structure Similarity Search
[Figure: (a) caffeine, (b) diurobromine, (c) Viagra]
Page 26: Some “Straightforward” Methods
Method 1: Directly compute the similarity between the graphs in the DB and the query graph
Sequential scan
Subgraph similarity computation
Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
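The figure of 1,140 is the number of ways to choose which 3 of the 20 edges are dropped, C(20, 3). A short sketch that enumerates the relaxed queries confirms the count; the 20-edge path used as the query is a hypothetical stand-in.

```python
from itertools import combinations
from math import comb

def relaxed_queries(query_edges, missed):
    """Yield each edge list obtained by dropping exactly `missed` edges from the query."""
    for dropped in combinations(query_edges, missed):
        yield [e for e in query_edges if e not in dropped]

# Hypothetical 20-edge query: a path on vertices 0..20.
query_edges = [(i, i + 1) for i in range(20)]

n_queries = sum(1 for _ in relaxed_queries(query_edges, 3))
print(n_queries, comb(20, 3))   # both equal 1140
```

The count is an upper bound on the number of distinct subgraph queries, since some edge subsets may be disconnected or isomorphic to one another.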
Page 27: Index: Precise vs Approximate Search
Precise Search
Use frequent patterns as indexing features
Select features in the database space based on their selectivity
Build the index
Approximate Search
Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases
Idea: (1) keep the index structure
(2) select features in the query space
Page 28: Substructure Similarity Measure
Query relaxation measure
The number of edges that can be relabeled or missed; the positions of these edges are not fixed
[Figure: query graph]
Page 29: Substructure Similarity Measure
Feature-based similarity measure
Each graph is represented as a feature vector
X = {x1, x2, …, xn}
The similarity is defined by the distance of
their corresponding vectors
Advantages
Easy to index
Fast
Disadvantages
Rough measure
Page 30: Query Processing Framework
Three steps in processing approximate graph queries
Step 1 Index Construction
Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
Page 31: Framework (cont.)
Step 2 Feature Miss Estimation
Determine the indexed features belonging to the query graph
Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
On the query graph, not the graph database
Page 32: Framework (cont.)
Step 3 Query Processing
Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ
If FG – FQ > J, discard G. The remaining graphs constitute a candidate answer set
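The filtering in Step 3 can be sketched as below. This is one possible reading, stated as an assumption: the difference is taken as the number of query feature occurrences that G fails to supply, summed over features. The bound J is passed in as a parameter because its computation (Step 2) is not shown here, and all names and numbers are hypothetical.

```python
def feature_misses(query_counts, graph_counts):
    """Number of query feature occurrences missing from the graph (assumed reading)."""
    return sum(max(0, q - g) for q, g in zip(query_counts, graph_counts))

def filter_candidates(matrix, query_counts, max_misses):
    """Keep graphs whose feature misses do not exceed the bound J (max_misses).

    matrix: graph id -> list of feature counts, i.e. one column of the
    feature-graph matrix per graph.
    """
    return [gid for gid, column in matrix.items()
            if feature_misses(query_counts, column) <= max_misses]

# Toy feature-graph matrix over three features.
matrix = {
    "G1": [2, 1, 0],   # misses one occurrence of the third feature
    "G2": [0, 0, 3],   # misses three occurrences in total
    "G3": [3, 2, 1],   # misses nothing
}
query_counts = [2, 1, 1]
print(filter_candidates(matrix, query_counts, max_misses=1))
```

The surviving graphs are only candidates; each still needs a structure-level similarity check, since feature counts alone are a rough measure.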
Page 33
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Page 35: Data Mining Across Multiple Networks
[Figure: several example networks over vertices a to k]
Page 36: Data Mining Across Multiple Networks
[Figure: several example networks over vertices a to j]
Page 37: Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets
[Figure: example networks over vertices a to f]
Page 38: CODENSE: Mine Coherent Dense Subgraphs
[Figure: example graphs over vertices a to i and the summary graph Ĝ]
Page 39: CODENSE: Mine Coherent Dense Subgraphs
(2) Identify dense subgraphs of the summary graph
If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
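The role of the summary graph can be illustrated with a small sketch. It assumes the summary graph keeps every edge that occurs in at least a minimum number of the input graphs, and measures density as the fraction of possible edges present; both definitions and all names are assumptions for illustration, not the CODENSE implementation.

```python
from collections import Counter

def summary_graph(graphs, min_support):
    """Edges that occur in at least min_support of the input graphs (assumed rule)."""
    support = Counter()
    for edges in graphs:
        support.update({frozenset(e) for e in edges})
    return {e for e, s in support.items() if s >= min_support}

def density(vertices, edges):
    """Fraction of possible edges among `vertices` that are present in `edges`."""
    vertex_set = set(vertices)
    n = len(vertex_set)
    possible = n * (n - 1) / 2
    inside = sum(1 for e in {frozenset(e) for e in edges} if e <= vertex_set)
    return inside / possible if possible else 0.0

# Three graphs on vertices a, b, c; each contains two of the three triangle edges.
graphs = [
    {("a", "b"), ("b", "c")},
    {("b", "c"), ("a", "c")},
    {("a", "b"), ("a", "c")},
]
summary = summary_graph(graphs, min_support=2)
print(density("abc", summary))                  # 1.0 in the summary graph
print([density("abc", g) for g in graphs])      # about 0.67 in every single graph
```

The triangle a, b, c is fully dense in the summary graph although it never occurs as a triangle in any single input graph, so dense subgraphs of the summary graph are only candidates that still have to be checked against the individual graphs.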
Page 40: Applying CoDense to 39 Yeast Microarray Data Sets
[Figure: example networks over vertices a to k]
Page 41: Discovery of New Genes Based on Similar Genes
[Figure: gene network with nodes MRPL51, MRP49, YDR115W, PHB1, PET100]
Page 42: Network of Known Similar Genes
Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18
[Figure: gene network with nodes MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]
Page 43: Network Involved in the New Genes
[Figure: gene network with nodes ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]
Page 44
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Page 45
Graph mining has wide applications
Frequent and closed subgraph mining methods
gSpan and CloseGraph: pattern-growth depth-first search approach
Graph indexing techniques:
Frequent and discriminative subgraphs as indexing features
Similarity search in graph databases
Indexing and approximate matching help similar subgraph search
Biological network analysis
Mining coherent, dense, multiple biological networks
Many new developments along the line of graph pattern mining
Page 46: Thanks and Questions