Graph mining
Karel Vaculík
FI MU
PV056 13. 5. 2014
Outline
• Graph definition revision
• Motivation examples
• Graph mining settings
• Similarity measures
• Frequent subgraph mining
• Basic mining tasks (classification, clustering, …)
2
Graphs
• Directed graph 𝐺 = 𝑉, 𝐸, 𝜇, ν
• 𝑉 set of nodes
• 𝐸 ⊆ 𝑉 × 𝑉 set of edges
• 𝜇: 𝑉 → 𝐿 𝑉 node labeling function, 𝐿 𝑉 set of node labels
• ν: 𝐸 → 𝐿 𝐸 edge labeling function, 𝐿 𝐸 set of edge labels
• Undirected graph similarly
3
Motivation – graphs in real world
• World Wide Web graph
• Social networks
• Chemical compounds (atoms and bonds)
• Biological networks (protein-protein interaction, metabolic pathways, …)
• XML documents
• Software engineering (UML diagrams - work flows, …)
4
Graph mining settings
• One (large) graph (single-graph setting) vs. database of graphs (graph-transaction
setting)
• Directed vs. undirected graphs
• Dynamic vs. static graphs
5
Measuring similarity
Inter-graph similarity (between graphs)
• Isomorphism – in NP (but not known whether in P or in NP-complete)
6
Graph (b) is isomorphic to (a) and (c) is isomorphic to a subgraph of (a)
Measuring similarity
Inter-graph similarity (between graphs)
• Isomorphism – in NP (but not known whether in P or in NP-complete)
• Maximum Common Subgraph – NP-hard
7
Two graphs (a) and (b) and a maximum common subgraph (c)
Measuring similarity
Inter-graph similarity (between graphs)
• Isomorphism – in NP (but not known whether in P or in NP-complete)
• Maximum Common Subgraph – NP-hard
• Graph-edit distance – for given costs of operations (node addition/deletion, edge
addition/deletion, …), find the minimum total cost of operations needed to transform
one graph into another one.
8
Measuring similarity
Inter-graph similarity (between graphs)
• Isomorphism – in NP (but not known whether in P or in NP-complete)
• Maximum Common Subgraph – NP-hard
• Graph-edit distance
• Kernels
• Complete kernel (distinguishes nonisomorphic graphs) as hard as isomorphism
• For example: Walk-based kernels (e.g. Direct Product Kernel)
• Frequent subgraphs (next section) 9
Measuring similarity
In-graph similarity (between vertices)
• Shared Nearest Neighbor (SNN) – vertices vi and vj from the original graph are
neighbors in SNN graph, if they have at least k neighbors in common in the original
graph
10
Measuring similarity
In-graph similarity (between vertices)
• Shared Nearest Neighbor (SNN)
• Kernels
• For example: Neumann Kernel, Laplacian Kernel
11
Frequent subgraph mining (FSM)
• Frequent subgraph: A subgraph which occurs in at least n graphs (or n times in one
graph) on input for a given minimum support n.
12
Frequent subgraph mining (FSM)
• Frequent subgraph: A subgraph which occurs in at least n graphs (or n times in one
graph) on input for a given minimum support n.
• Possible to use relative support (%) for graph-transaction setting
• Applications: characterization of graphs, classifying, clustering, association rules,
indexing
• Specific example: Analyzing web server logs (each user -> one tree of visited pages; mine
frequent subgraphs to find a better way of organizing structure of hyperlinks)
• Problem: not possible to use subgraph isomorphism (NP-complete!)
13
Frequent subgraph mining (FSM)
• General process:
• Candidate generation: which patterns will be considered
• Candidate pruning: if a candidate is not a viable frequent pattern, can we exploit the
pattern to prevent unnecessary work?
• Subgraphs and subsets exponentiate as size increases!
• Support counting: how many of a given pattern exist?
• Apriori principle: if an itemset is frequent, then all of its subsets are also frequent.
• Example: if itemset {A, B, C, D} is frequent, then {A, B} is frequent
14
Frequent subgraph mining (FSM)
gSpan
• Complete (finds all frequent subgraphs) FSM algorithm on labeled undirected graphs
• Generates candidates by adding edges (one at a time) to patterns already discovered
• Encodes patterns (graphs) according to DFS traversal order (DFS code)
15
Frequent subgraph mining (FSM)
16
Frequent subgraph mining (FSM)
gSpan
• Complete (finds all frequent subgraphs) FSM algorithm on labeled undirected graphs
• Generates candidates by adding edges (one at a time) to patterns already discovered
• Encodes patterns (graphs) according to DFS traversal order (DFS code)
• Lexicographicaly minimal code for each graph
• Building candidates by rightmost expansion – adding of an edge to a vertex on the rightmost
path
17
Frequent subgraph mining (FSM)
18
Frequent subgraph mining (FSM)
gSpan
• Complete (finds all frequent subgraphs) FSM algorithm on labeled undirected graphs
• Generates candidates by adding edges (one at a time) to patterns already discovered
• Encodes patterns (graphs) according to DFS traversal order (DFS code)
• Lexicographicaly minimal code for each graph
• Building candidates by rightmost expansion – adding of an edge to a vertex on the rightmost
path
• Uses DFS Code Tree to keep all possible DFS codes
19
Frequent subgraph mining (FSM)
20
Frequent subgraph mining (FSM)
Subdue
• Approximate FSM algorithm on labeled graphs or a single graph
• Directed or undirected graphs
• Not based on support
• Outputs structures which compress the input graph well
21
Frequent subgraph mining (FSM)
Sleuth
• FSM algorithm on rooted trees – both ordered and unordered
• Mines induced (can only contain edges from the original tree) or embedded (can
have edges between ancestors and descendants) subtrees
22
Classification, regression
• Link-based Object Classification
• assign class labels to nodes according to their link characteristics (e.g. node degree,
average path length, …)
• Applications: From a graph with persons as nodes and preferences as edges, select set
of individuals (e.g., as team leaders) and then assign groups to everyone else
23
Classification, regression
• Link-based Object Classification
• assign class labels to nodes according to their link characteristics (e.g. node degree,
average path length, …)
• Applications: From a graph with persons as nodes and preferences as edges, select set
of individuals (e.g., as team leaders) and then assign groups to everyone else
• Link-based Object Ranking
• all nodes are usually understood to be of the same type, the goal is to associate a relative
quantitative assessment with each node
• Applications: ranking web pages by a search engine (PageRank)
24
Classification, regression
• Between-graph classification – each graph is assigned to a class
• Methods based on substructures – for example: gBoost, DT-CLGBI (figure)
• Attribute: a pattern/subgraph in graph graph-structured data
• Value of an attribute: existence/nonexistence of the pattern in each graph
25
Classification, regression
• Between-graph classification – each graph is assigned to a class
• Methods based on substructures – for example: gBoost, DT-CLGBI
• Attribute: a pattern/subgraph in graph graph-structured data
• Value of an attribute: existence/nonexistence of the pattern in each graph
• It is also possible to use a FSM algorithm and then an arbitrary classification algorithm for
attribute-value data
26
Classification, regression
• Between-graph classification – each graph is assigned to a class
• Methods based on substructures – for example: gBoost, DT-CLGBI
• Attribute: a pattern/subgraph in graph graph-structured data
• Value of an attribute: existence/nonexistence of the pattern in each graph
• It is also possible to use a FSM algorithm and then an arbitrary classification algorithm for
attribute-value data
• Kernel methods (e.g. Direct Product Kernel) -> kernel-based classifier (e.g. SVM)
• Applications: classifying molecules as toxic or non-toxic
27
Classification, regression
• Within-graph classification – each node in a graph is assigned to a class
• Kernel methods (e.g. Regularized Laplacian Kernel) -> kernel-based classifier (e.g. SVM)
• Applications: classification of web pages
28
Clustering
Clustering graphs
• Divides a set of graphs into different clusters
• Applications: grouping of chemical compounds into clusters based on their structural
similarity
• Kernels -> kernel k-means clustering
• Regular clustering methods on graphs represented by frequent substructures
29
Clustering
Clustering nodes
• Divides the nodes of a graph into clusters
• Applications: clustering of people in social networks; clusters group people with
similar interests
• Minimum spanning tree clustering
1. Find minimum spanning tree of a graph
2. Remove k-1 edges with the highest weight -> k clusters
30
31
Clustering
Clustering nodes
• Shared Nearest Neighbor (SNN) Clustering
• Place a pair of objects into the same cluster, if the number of common neighbors they
share is more than some threshold τ.
32
Clustering
33
Clustering
Clustering nodes
• Shared Nearest Neighbor (SNN) Clustering
• Place a pair of objects into the same cluster, if the number of common neighbors they
share is more than some threshold τ.
• Maximal Clique Enumeration
34
Examples of other tasks
• Outlier detection
• Searching for outlying nodes or whole graphs
• Methods for measuring similarity and clustering can be used
• Association rules
• Frequent subgraphs may be used as frequent patterns in classic algorithms
• Link prediction
• In dynamic graphs; task is to predict new edges that will be probably added to a graph
35
Thank you.
36
References and resources
• Samatova, N., et al.: Practical Graph Mining with R. CRC Press (2013)
• Slides from the book webpage:
http://www.csc.ncsu.edu/faculty/samatova/practical-graph-mining-with-
R/PracticalGraphMiningWithR.html
• Cook, D. J., Holder, L. B.: Mining Graph Data. John Wiley & Sons
(2007)
37