KarelVaculík
 Introduction
 Application domains
 Graph MiningAlgorithms
 Graph: G = (V, E)
 V ... set of nodes,
 E ⊆ V ⨯ V... set of edges
 Graph: G = (V, E)
 V ... set of nodes,
 E ⊆ V ⨯ V... set of edges,
 w: E → ℝ ... weight function,
 μ:V → LV ... node labeling function,
 ν: E → LE ... edge labeling function,
 ...
 Chemical data analysis
 Computational biology
 Social networking
 Web link analysis
 Computer networks
 ...
 Clustering
 Classification
 Frequent pattern / substructure mining
 Data properties
 One large graph vs. set of (smaller) graphs (also
transactions)
 Size
▪ Streaming of massive graphs (they are too large to fit in
the main memory and random access is slow in large
capacity storage devices)
 Static vs. dynamic
 ...
 Node clustering
 Based on distance functions for nodes
 Related to minimum cut (polynomially solvable)
and graph partitioning (NP-hard) problems
 Applications: determining dense regions (=>
summarization, dimensinality reduction), ...
 Graph clustering
 Based on structural behavior
 Applications: molecular biology, chemical graphs,
XML data, ...
 Node classification
 Graph classification
 Pattern ≈ subgraph
 Single-graph setting
 Frequency: number of pattern occurrences in the
single graph.
 Examples of algorithms: SUBDUE, SEuS, GREW,
SIGRAM, GBI
 Not discussed further
 Graph-transaction setting
 Frequency (of a pattern): number of graph
transactions in which the pattern occurs
Apriori-like algorithms
 Basically two steps:
 Generation of frequent substructure candidates
▪ Based on adding nodes, edges or paths
 Checking the frequencies of candidates
 Examples of algorithms : AGM, FSG
Checking the frequencies of candidates:
 (Sub)graph isomorphism
 Canonical labeling
 Unique code for the set of graphs with the same
topological structure and the same labeling
 Both problems are not known to be either in
P or in NP-complete → relaxed problems
Pattern growth algorithms
 gSpan
 Gaston
 ...
gSpan
 Without candidate generation
 Minimum DFS code as canonical label
gSpan
 Without candidate generation
 Minimum DFS code as canonical label
 DFS lexicographic orderering on DFS codes
→ DFS code tree
gSpan
 Without candidate generation
 Minimum DFS code as canonical label
 DFS lexicographic orderering on DFS codes
→ DFS code tree
 Searching frequent patterns: traversing DFS
code tree
 Diane J. Cook, Lawrence B. Holder. Mining
graph data. JohnWiley and Sons, 2007.
 Charu C. Aggarwal, HaixunWang. Managing
and Mining Graph Data. Springer, 2010
 X.Yan and J. Han. gSpan: Graph-based
substructure pattern mining. In Proceedings of
2002 IEEE InternationalConference on Data
Mining (ICDM), pp. 721–724, 2002.