KarelVaculík  Introduction  Application domains  Graph MiningAlgorithms  Graph: G = (V, E)  V ... set of nodes,  E ⊆ V ⨯ V... set of edges  Graph: G = (V, E)  V ... set of nodes,  E ⊆ V ⨯ V... set of edges,  w: E → ℝ ... weight function,  μ:V → LV ... node labeling function,  ν: E → LE ... edge labeling function,  ...  Chemical data analysis  Computational biology  Social networking  Web link analysis  Computer networks  ...  Clustering  Classification  Frequent pattern / substructure mining  Data properties  One large graph vs. set of (smaller) graphs (also transactions)  Size ▪ Streaming of massive graphs (they are too large to fit in the main memory and random access is slow in large capacity storage devices)  Static vs. dynamic  ...  Node clustering  Based on distance functions for nodes  Related to minimum cut (polynomially solvable) and graph partitioning (NP-hard) problems  Applications: determining dense regions (=> summarization, dimensinality reduction), ...  Graph clustering  Based on structural behavior  Applications: molecular biology, chemical graphs, XML data, ...  Node classification  Graph classification  Pattern ≈ subgraph  Single-graph setting  Frequency: number of pattern occurrences in the single graph.  Examples of algorithms: SUBDUE, SEuS, GREW, SIGRAM, GBI  Not discussed further  Graph-transaction setting  Frequency (of a pattern): number of graph transactions in which the pattern occurs Apriori-like algorithms  Basically two steps:  Generation of frequent substructure candidates ▪ Based on adding nodes, edges or paths  Checking the frequencies of candidates  Examples of algorithms : AGM, FSG Checking the frequencies of candidates:  (Sub)graph isomorphism  Canonical labeling  Unique code for the set of graphs with the same topological structure and the same labeling  Both problems are not known to be either in P or in NP-complete → relaxed problems Pattern growth algorithms  gSpan  Gaston  ... gSpan  Without candidate generation  Minimum DFS code as canonical label gSpan  Without candidate generation  Minimum DFS code as canonical label  DFS lexicographic orderering on DFS codes → DFS code tree gSpan  Without candidate generation  Minimum DFS code as canonical label  DFS lexicographic orderering on DFS codes → DFS code tree  Searching frequent patterns: traversing DFS code tree  Diane J. Cook, Lawrence B. Holder. Mining graph data. JohnWiley and Sons, 2007.  Charu C. Aggarwal, HaixunWang. Managing and Mining Graph Data. Springer, 2010  X.Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proceedings of 2002 IEEE InternationalConference on Data Mining (ICDM), pp. 721–724, 2002.