Graph mining Karel Vaculík FI MU PV056 13. 5. 2014 Outline • Graph definition revision • Motivation examples • Graph mining settings • Similarity measures • Frequent subgraph mining • Basic mining tasks (classification, clustering, …) 2 Graphs • Directed graph 𝐺 = 𝑉, 𝐸, 𝜇, ν • 𝑉 set of nodes • 𝐸 ⊆ 𝑉 × 𝑉 set of edges • 𝜇: 𝑉 → 𝐿 𝑉 node labeling function, 𝐿 𝑉 set of node labels • ν: 𝐸 → 𝐿 𝐸 edge labeling function, 𝐿 𝐸 set of edge labels • Undirected graph similarly 3 Motivation – graphs in real world • World Wide Web graph • Social networks • Chemical compounds (atoms and bonds) • Biological networks (protein-protein interaction, metabolic pathways, …) • XML documents • Software engineering (UML diagrams - work flows, …) 4 Graph mining settings • One (large) graph (single-graph setting) vs. database of graphs (graph-transaction setting) • Directed vs. undirected graphs • Dynamic vs. static graphs 5 Measuring similarity Inter-graph similarity (between graphs) • Isomorphism – in NP (but not known whether in P or in NP-complete) 6 Graph (b) is isomorphic to (a) and (c) is isomorphic to a subgraph of (a) Measuring similarity Inter-graph similarity (between graphs) • Isomorphism – in NP (but not known whether in P or in NP-complete) • Maximum Common Subgraph – NP-hard 7 Two graphs (a) and (b) and a maximum common subgraph (c) Measuring similarity Inter-graph similarity (between graphs) • Isomorphism – in NP (but not known whether in P or in NP-complete) • Maximum Common Subgraph – NP-hard • Graph-edit distance – for given costs of operations (node addition/deletion, edge addition/deletion, …), find the minimum total cost of operations needed to transform one graph into another one. 8 Measuring similarity Inter-graph similarity (between graphs) • Isomorphism – in NP (but not known whether in P or in NP-complete) • Maximum Common Subgraph – NP-hard • Graph-edit distance • Kernels • Complete kernel (distinguishes nonisomorphic graphs) as hard as isomorphism • For example: Walk-based kernels (e.g. Direct Product Kernel) • Frequent subgraphs (next section) 9 Measuring similarity In-graph similarity (between vertices) • Shared Nearest Neighbor (SNN) – vertices vi and vj from the original graph are neighbors in SNN graph, if they have at least k neighbors in common in the original graph 10 Measuring similarity In-graph similarity (between vertices) • Shared Nearest Neighbor (SNN) • Kernels • For example: Neumann Kernel, Laplacian Kernel 11 Frequent subgraph mining (FSM) • Frequent subgraph: A subgraph which occurs in at least n graphs (or n times in one graph) on input for a given minimum support n. 12 Frequent subgraph mining (FSM) • Frequent subgraph: A subgraph which occurs in at least n graphs (or n times in one graph) on input for a given minimum support n. • Possible to use relative support (%) for graph-transaction setting • Applications: characterization of graphs, classifying, clustering, association rules, indexing • Specific example: Analyzing web server logs (each user -> one tree of visited pages; mine frequent subgraphs to find a better way of organizing structure of hyperlinks) • Problem: not possible to use subgraph isomorphism (NP-complete!) 13 Frequent subgraph mining (FSM) • General process: • Candidate generation: which patterns will be considered • Candidate pruning: if a candidate is not a viable frequent pattern, can we exploit the pattern to prevent unnecessary work? • Subgraphs and subsets exponentiate as size increases! • Support counting: how many of a given pattern exist? • Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. • Example: if itemset {A, B, C, D} is frequent, then {A, B} is frequent 14 Frequent subgraph mining (FSM) gSpan • Complete (finds all frequent subgraphs) FSM algorithm on labeled undirected graphs • Generates candidates by adding edges (one at a time) to patterns already discovered • Encodes patterns (graphs) according to DFS traversal order (DFS code) 15 Frequent subgraph mining (FSM) 16 Frequent subgraph mining (FSM) gSpan • Complete (finds all frequent subgraphs) FSM algorithm on labeled undirected graphs • Generates candidates by adding edges (one at a time) to patterns already discovered • Encodes patterns (graphs) according to DFS traversal order (DFS code) • Lexicographicaly minimal code for each graph • Building candidates by rightmost expansion – adding of an edge to a vertex on the rightmost path 17 Frequent subgraph mining (FSM) 18 Frequent subgraph mining (FSM) gSpan • Complete (finds all frequent subgraphs) FSM algorithm on labeled undirected graphs • Generates candidates by adding edges (one at a time) to patterns already discovered • Encodes patterns (graphs) according to DFS traversal order (DFS code) • Lexicographicaly minimal code for each graph • Building candidates by rightmost expansion – adding of an edge to a vertex on the rightmost path • Uses DFS Code Tree to keep all possible DFS codes 19 Frequent subgraph mining (FSM) 20 Frequent subgraph mining (FSM) Subdue • Approximate FSM algorithm on labeled graphs or a single graph • Directed or undirected graphs • Not based on support • Outputs structures which compress the input graph well 21 Frequent subgraph mining (FSM) Sleuth • FSM algorithm on rooted trees – both ordered and unordered • Mines induced (can only contain edges from the original tree) or embedded (can have edges between ancestors and descendants) subtrees 22 Classification, regression • Link-based Object Classification • assign class labels to nodes according to their link characteristics (e.g. node degree, average path length, …) • Applications: From a graph with persons as nodes and preferences as edges, select set of individuals (e.g., as team leaders) and then assign groups to everyone else 23 Classification, regression • Link-based Object Classification • assign class labels to nodes according to their link characteristics (e.g. node degree, average path length, …) • Applications: From a graph with persons as nodes and preferences as edges, select set of individuals (e.g., as team leaders) and then assign groups to everyone else • Link-based Object Ranking • all nodes are usually understood to be of the same type, the goal is to associate a relative quantitative assessment with each node • Applications: ranking web pages by a search engine (PageRank) 24 Classification, regression • Between-graph classification – each graph is assigned to a class • Methods based on substructures – for example: gBoost, DT-CLGBI (figure) • Attribute: a pattern/subgraph in graph graph-structured data • Value of an attribute: existence/nonexistence of the pattern in each graph 25 Classification, regression • Between-graph classification – each graph is assigned to a class • Methods based on substructures – for example: gBoost, DT-CLGBI • Attribute: a pattern/subgraph in graph graph-structured data • Value of an attribute: existence/nonexistence of the pattern in each graph • It is also possible to use a FSM algorithm and then an arbitrary classification algorithm for attribute-value data 26 Classification, regression • Between-graph classification – each graph is assigned to a class • Methods based on substructures – for example: gBoost, DT-CLGBI • Attribute: a pattern/subgraph in graph graph-structured data • Value of an attribute: existence/nonexistence of the pattern in each graph • It is also possible to use a FSM algorithm and then an arbitrary classification algorithm for attribute-value data • Kernel methods (e.g. Direct Product Kernel) -> kernel-based classifier (e.g. SVM) • Applications: classifying molecules as toxic or non-toxic 27 Classification, regression • Within-graph classification – each node in a graph is assigned to a class • Kernel methods (e.g. Regularized Laplacian Kernel) -> kernel-based classifier (e.g. SVM) • Applications: classification of web pages 28 Clustering Clustering graphs • Divides a set of graphs into different clusters • Applications: grouping of chemical compounds into clusters based on their structural similarity • Kernels -> kernel k-means clustering • Regular clustering methods on graphs represented by frequent substructures 29 Clustering Clustering nodes • Divides the nodes of a graph into clusters • Applications: clustering of people in social networks; clusters group people with similar interests • Minimum spanning tree clustering 1. Find minimum spanning tree of a graph 2. Remove k-1 edges with the highest weight -> k clusters 30 31 Clustering Clustering nodes • Shared Nearest Neighbor (SNN) Clustering • Place a pair of objects into the same cluster, if the number of common neighbors they share is more than some threshold τ. 32 Clustering 33 Clustering Clustering nodes • Shared Nearest Neighbor (SNN) Clustering • Place a pair of objects into the same cluster, if the number of common neighbors they share is more than some threshold τ. • Maximal Clique Enumeration 34 Examples of other tasks • Outlier detection • Searching for outlying nodes or whole graphs • Methods for measuring similarity and clustering can be used • Association rules • Frequent subgraphs may be used as frequent patterns in classic algorithms • Link prediction • In dynamic graphs; task is to predict new edges that will be probably added to a graph 35 Thank you. 36 References and resources • Samatova, N., et al.: Practical Graph Mining with R. CRC Press (2013) • Slides from the book webpage: http://www.csc.ncsu.edu/faculty/samatova/practical-graph-mining-with- R/PracticalGraphMiningWithR.html • Cook, D. J., Holder, L. B.: Mining Graph Data. John Wiley & Sons (2007) 37