Introduction to Visual Data Science
High-Dimensional Data Visualization & Predictive Analytics
Manuela Waldner
Institute of Visual Computing & Human-Centered Technology, TU Wien, Austria

The Visual Data Science Process
- DATA EXPLORATION: Form hypotheses about your defined problem by visually analyzing the data.
- PREDICTIVE MODELING: Train machine learning models, evaluate their performance, and use them to make predictions.
- FEATURE ENGINEERING: Select important features and construct more meaningful ones using the raw data that you have.
- PRESENTATION: Communicate the findings with key stakeholders using plots and interactive visualizations.

The Data Science Process
This part focuses on FEATURE ENGINEERING: selecting important features and constructing more meaningful ones using the raw data that you have.

Interactive Analysis of Big Data
What is big data?
- Key-value tables with billions of records → tall data
- Tables with thousands of variables per record → wide data
[Heer & Kandel, Interactive Analysis of Big Data, ACM XRDS 2012]

Side Note: Features
- In machine learning / pattern recognition: a measurable property of an observed phenomenon; feature vectors (can be high-dimensional!)
- In information visualization: attributes / variables / (data) dimensions [Munzner 2014]

High-Dimensional Data
Examples:
- Image features
- Natural language processing
- Gene expression data
- Finance / economy
- ...
High-Dimensional Data: Image Features
- "Bag of words": vocabulary of visual words
- Example: MNIST: 10,000 hand-written digits, 28x28 pixels → 784-dimensional feature vector (intensity values) per image
https://www.tensorflow.org

High-Dimensional Data: Natural Language Processing
Vector space model:
- Dimensions: terms
- Vectors: documents or queries
Wikipedia: vector space model
http://topicmodels.west.uni-koblenz.de/ckling/tmt/part1.pdf

Natural Language Processing Pipeline
- Sentence segmentation: raw text → sentences
- Tokenization: sentences → tokenized sentences
- Stop word removal: e.g., "a", "the"
- Normalization: capitalization (Graphics → graphics), stemming (visualization → visual), lemmatization (better → good)

Document-Term Matrix
- Bag of words: orderless representation!
- A document is represented by a vector of term weights (e.g., number of term occurrences)
- A word is represented by a vector of document weights (e.g., number of occurrences in documents)

Word Embeddings
Word2vec:
- Shallow neural network
- Input: text window
- Goal: prediction of nearby words
- Output: vector space model of words
https://www.tensorflow.org/tutorials/representation/word2vec

High-Dimensional Data: Gene Expression Data
- Dimensions: genes
- Samples: experimental conditions / species / ...
http://cancerres.aacrjournals.org/content/64/23/8558

Curse of Dimensionality
- The efficiency of many algorithms depends on the number of dimensions
- With an increasing number of dimensions, data becomes sparse
- Distances increase: nearest neighbors? anomalies?
- The number of required training samples grows exponentially with the number of dimensions
- Rule of thumb: at least 5 samples per dimension
- Visually inspect the features!
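The bag-of-words document-term matrix described above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy two-document corpus; a real pipeline would first run the tokenization, stop word removal, and normalization steps from the NLP pipeline:

```python
from collections import Counter

def document_term_matrix(docs):
    """Build an orderless bag-of-words representation: one row per
    document, one column per vocabulary term, cell = term count."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts.get(term, 0) for term in vocab])
    return vocab, matrix

docs = ["graphics and visualization", "visualization of graphics data"]
vocab, dtm = document_term_matrix(docs)
# Each row is a document's vector of term weights; each column, read
# top to bottom, is a term's vector of document weights.
```

Reading the matrix row-wise gives document vectors, column-wise gives word vectors, exactly the two dual views listed on the slide.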
Recap: Multivariate Data Visualization Techniques
Example: Iris dataset
- 3 species, 50 samples per species
- 4 features: length and width of sepals and petals
Wikipedia: Iris flower data set

Multi-Dimensional Data Visualization Techniques
- Parallel coordinates
- Scatterplot matrix
- Chernoff faces
- Radar chart
Scalability problems! [Icke & Sklar, 2009]

Example: Scatterplot Matrix
19 dimensions vs. ~100 dimensions [Yang et al., 2003]

Approaches
- Feature selection: selecting a subset of existing features without a transformation; using multi-dimensional data visualization techniques
- Feature extraction: transforming existing features into a lower-dimensional space; using 1D / 2D (/ 3D) / nD visualization techniques
- Hybrid approach: selecting a subset of existing features, then transforming the feature subset into a lower-dimensional space

Feature Selection
Selecting a subset of existing features without a transformation. Dimensions (or dimension pairs) are ranked based on a quality metric:
- Number of outliers
- Correlation between a pair of dimensions
- Image-based metrics
- ...
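One of the pair-wise quality metrics above, correlation between a pair of dimensions, can be computed for all dimension pairs and used for ranking. A minimal NumPy sketch on synthetic data (the hypothetical `rank_dimension_pairs` helper is for illustration; real frameworks combine several such metrics):

```python
import numpy as np

def rank_dimension_pairs(data):
    """Rank all dimension pairs of a data matrix (rows = samples,
    columns = dimensions) by absolute Pearson correlation."""
    corr = np.corrcoef(data, rowvar=False)
    n_dims = data.shape[1]
    pairs = [(abs(corr[i, j]), i, j)
             for i in range(n_dims) for j in range(i + 1, n_dims)]
    return sorted(pairs, reverse=True)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x,                                    # dim 0
                        2 * x + rng.normal(scale=0.1, size=200),  # dim 1
                        rng.normal(size=200)])                # dim 2
ranking = rank_dimension_pairs(data)
# The strongly correlated pair (dimensions 0 and 1) ranks first.
```

The top-ranked scatterplots are then the first candidates to inspect visually.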
Quality metrics can be combined. One, two, or multiple dimensions of the samples are then visualized.

Rank-by-Feature Framework [Seo and Shneiderman, 2004]
- Exploratory analysis of multidimensional data
- Axis-parallel projections are ranked based on ranking criteria
- 1D ranking criteria: normality or uniformity (entropy) of the distribution, number of potential outliers, number of unique values
- 2D ranking criteria: correlation coefficient, least squares error for linear / curvilinear regression, number of items in a region of interest, uniformity of scatterplots

Interactive Dimensionality Reduction [Johansson and Johansson, 2009]
- Predefine the number of dimensions to be visualized
- Based on quality metrics: correlation between dimensions, preservation of outliers, cluster quality
- Assigns an importance to each dimension

Class Consistency [Sips et al., 2009]
- Given: points in high-dimensional space with external class labels
- Class consistency: classes are mapped to regions that are visually separable (~ ratio of data points closest to their class centroid)
- Example: 3 classes of wine (color), 13 attributes describing chemical properties

Feature Extraction
- Transforming existing features into a lower-dimensional space
- Dimensionality reduction: linear or non-linear
- Using 1D / 2D (/ 3D) / nD visualization techniques
- Interactive visualizations can be used to steer feature extraction

Dimensionality Reduction: Linear Projection
- Linear transformation projecting data from a high-dimensional space to a low-dimensional space
- Example: find a subset of terms that accurately clusters documents
- Techniques: principal component analysis (PCA), (metric) multi-dimensional scaling (MDS), ...
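PCA, the first linear technique listed above, amounts to an SVD of the centered data matrix. A minimal NumPy sketch on random data (illustration only, not a full analysis pipeline):

```python
import numpy as np

def pca(data, n_components=2):
    """Project data (rows = samples) onto its first principal
    components via SVD of the centered data matrix."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the orthogonal principal directions, ordered by
    # the singular values (i.e., by explained variability).
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))
projected = pca(data, n_components=2)
# The first component captures as much variability as possible;
# the components are mutually orthogonal.
```

The resulting two columns are exactly the coordinates a projection scatterplot (as in the Iris and MNIST examples below) would display.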
Singular Value Decomposition (SVD)
- U: term-concept matrix
- V^T: concept-document matrix
- Keep the k largest singular values and the corresponding singular vectors from U and V: the concepts are the basis vectors of the semantic space
- Latent semantic indexing = dimensionality reduction by SVD
[Chen et al., Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications, 2013]

Principal Component Analysis
- SVD on centered data
- Projects the data onto lower dimensions (= principal components)
- First principal component: captures as much variability of the data as possible
- Principal components are orthogonal
Wikipedia: principal component analysis

Visualization of Projected Data
Scatterplot visualization, color-coded according to classes (if available). Well suited to:
- Detect / verify / name clusters
- Detect outliers
- Match clusters and classes
Example: Iris dataset (PCA)
http://projector.tensorflow.org/ [Brehmer et al., 2014]

MNIST PCA Example
http://projector.tensorflow.org/

PCA with Star Glyphs (Iris Dataset)
[Keim et al., 2005]

Biplot
- Axes: principal components
- Vectors: features (figure annotations: uncorrelated with other features vs. strongly correlated)
- Points: items
https://www.mathworks.com/matlabcentral/fileexchange/53438-biplot-by-groups

Interactive PCA-based Visual Analytics [Jeong et al., EuroVis 2009]
- Projection onto the first two eigenvectors
- Dimensions of the original data
- Eigenvectors as parallel coordinate axes
- Pearson correlation of variable pairs

Star Coordinates [Kandogan, 2000]
- Curvilinear coordinate system
- Items are represented as points: the sum over all coordinate unit vectors u_i = (u_xi, u_yi), each multiplied by the value of data element d_j for that coordinate:

P_j(x, y) = ( o_x + Σ_{i=1}^{n} u_xi (d_ji − min_i),  o_y + Σ_{i=1}^{n} u_yi (d_ji − min_i) )

Star Coordinates: Iris dataset
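The star coordinates mapping above can be sketched directly. A minimal sketch assuming evenly spaced unit axis vectors and two Iris-like toy rows; Kandogan's actual tool additionally supports scaling and rotating the axes interactively:

```python
import math

def star_coordinates(data, origin=(0.0, 0.0)):
    """Map each n-dimensional row to a 2D point: the sum of the
    coordinate unit vectors u_i, each weighted by the min-shifted
    value of the row in that dimension."""
    n = len(data[0])
    # Evenly spaced unit vectors, one per dimension.
    axes = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]
    mins = [min(col) for col in zip(*data)]
    points = []
    for row in data:
        x = origin[0] + sum(ux * (v - m)
                            for (ux, _), v, m in zip(axes, row, mins))
        y = origin[1] + sum(uy * (v - m)
                            for (_, uy), v, m in zip(axes, row, mins))
        points.append((x, y))
    return points

data = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]  # two toy rows
points = star_coordinates(data)
```

Because each 2D point is a plain weighted sum of axis vectors, scaling or rotating one axis immediately changes that dimension's contribution, which is what the axis transformations below exploit.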
Star Coordinates: Axis Transformations [Kandogan, 2000]
- Scaling the length of an axis → changing the contribution of that dimension
- Rotating an axis vector → changing the correlation with other columns
- Switching off coordinates → "feature selection"

Dust & Magnet [Ji Soo Yi et al., 2005]
- Dimensions: magnets
- Items: dust particles
- Based on attraction forces

Dimensionality Reduction
- Linear dimensionality reduction: assumes that there is a lower-dimensional linear subspace; finds a linear projection of the data
- Non-linear dimensionality reduction: a low-dimensional surface embedded non-linearly in high-dimensional space ("manifold"); preserves the neighborhood information (locally linear, pairwise distances)
Example: "swiss roll" (http://scikit-learn.org)

Pairwise Similarities
Cosine similarity:
- The corpus is represented by a set of vectors in a vector space (axes: terms)
- Document similarity is defined by the cosine similarity between the document vectors
- Document similarity matrix
http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html
https://github.com/utkuozbulak/unsupervised-learning-document-clustering/blob/master/README.md

Multi-Dimensional Scaling
Computation of a low-dimensional embedding Y that best preserves the pairwise distances between data points X:

Cost = Σ_{i<j} (d_ij − δ_ij)²,  with d_ij = ||x_i − x_j||₂ and δ_ij = ||y_i − y_j||₂

With Euclidean distances, (classical) MDS is equivalent to PCA.

Visual Analysis of Dimensionality Reductions [Stahnke et al., 2016]
Example: OECD countries (36 countries, 8 dimensions); MDS projection, agglomerative clustering, dimensions.
Inspection techniques: dimension heatmap on the projection.
Inspection techniques: projection errors.
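The cosine similarity matrix described above is straightforward to compute from document-term vectors. A minimal stdlib sketch on three toy term-count vectors over a shared vocabulary:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(doc_vectors):
    """Pairwise document similarity matrix for a corpus."""
    return [[cosine_similarity(a, b) for b in doc_vectors]
            for a in doc_vectors]

# Term-count vectors for three toy documents; the first two share
# all their terms, the third shares none.
docs = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
sim = similarity_matrix(docs)
```

A matrix like this is a typical input for the distance-preserving embeddings (MDS, and later t-SNE) discussed in this section.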
- White traces: higher similarity in high-dimensional space
- Gray traces: lower similarity in high-dimensional space
[Stahnke et al., 2016]

Visual Analysis of Dimensionality Reductions [Stahnke et al., 2016]
Inspection techniques: comparison of group selections.

t-SNE [van der Maaten and Hinton, 2008]
t-Distributed Stochastic Neighbor Embedding
- Input: matrix of pairwise similarities
- Similarities are represented as a joint probability matrix P
- Low-dimensional probability matrix Q using a Student-t distribution
- Goal: find a low-dimensional data representation that minimizes the mismatch between p_ij and q_ij
- Minimization of the sum of Kullback-Leibler divergences over all data points using a gradient descent method
- Can be implemented via Barnes-Hut approximations

MNIST t-SNE Example
http://projector.tensorflow.org/ (perplexity: 25, learning rate: 10, iterations: 342)

Tensorflow Embedding Projector: word2vec 10K

Hybrid Approaches
- Dimensionality reduction is often unwanted because domain knowledge is required to understand which dimension combinations make sense
- Combination of feature selection and feature extraction:
  feature selection via user selection based on visual analysis, or via quality metrics;
  feature extraction is then performed on the selected dimensions
- Using multi-dimensional data visualization techniques

Example: SeekAView
1995 US FBI Crime report (147 dimensions, 2000+ items) [Krause et al., 2016]

The Data Science Process
This part focuses on PREDICTIVE MODELING: training machine learning models, evaluating their performance, and using them to make predictions.

Predictive Models
- A model is trained on known input and output data to predict future outputs
- Classification: prediction of discrete responses
- Regression: prediction of continuous responses

Predictive Visual Analytics
Why do we need visualization?
- Evaluate: validation and comparison
- Train: model improvement and training
- Make predictions
- AI interpretability and explainability

Predictive Visual Analytics: Evaluate

Evaluation of Classifier Accuracy: Confusion Matrix
https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

Evaluation of Classifier Accuracy: Scatterplots
Training points vs. testing points
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

Convolutional Neural Networks (Nicolas Grossmann)
What has the network learned?
- Similarities
- Outliers
- Ambiguities

Inspecting Training Effects [Rauber et al. 2016]

CNN Code Projection
https://cs.stanford.edu/people/karpathy/cnnembed/

Predictive Visual Analytics: Train

Train SVM [Heimerl et al., TVCG 2012]
- Document classification for a given query: relevant vs. irrelevant
- Samples = documents, labeled and unlabeled
- Visualizes the SVM decision boundary

Train Naïve Bayes (Mazurek / Visual Active Learning)
[Figure: class association bar charts for a sample A across six classes and variables v1-v6]

Predictive Visual Analytics
Why do we need visualization?
- Evaluate: validation and comparison
- Train: model improvement and training
- Make predictions
- AI interpretability and explainability

Interpretability vs. Explainability
- Interpretability: the extent to which a cause and effect can be observed within a system
- Explainability: the extent to which the internal mechanics of a machine learning system can be explained in human terms

Interpretability: Partial Dependence Plot [Krause et al. 2016]
- Assesses the influence of a feature on the prediction
- Shows the marginal effect a feature has on the predicted outcome
- Based on averages in the training data
https://christophm.github.io/interpretable-ml-book/pdp.html

Interactive Partial Dependence [Krause et al. 2016]
Interactively testing scenarios

Explainability
- Feature visualization: https://distill.pub/2017/feature-visualization/
- Feature visualizations and attribution maps: https://distill.pub/2018/building-blocks/

Visual Data Science: Summary
- Data exploration / scalable visualization: perceptual scalability (model-based / aggregate visualization); interactive scalability (online aggregation, aggregate queries, data tiles)
- Feature engineering / high-dimensional data visualization: feature selection, feature extraction (dimensionality reduction), hybrid approach
- Predictive visual analytics: supervised machine learning (regression, classification); evaluation, training, interpretability & explainability

References
- Liu et al. (2015). Visualizing high-dimensional data: Advances in the past decade. EuroVis 2015 State of the Art Report.
- Brehmer, M., Sedlmair, M., Ingram, S., & Munzner, T. (2014). Visualizing dimensionally-reduced data: Interviews with analysts and a characterization of task sequences.
In Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (pp. 1-8). ACM.
- Bertini, E., Tatu, A., & Keim, D. (2011). Quality metrics in high-dimensional data visualization: An overview and systematization. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2203-2212.
- Seo, J., & Shneiderman, B. (2005). A rank-by-feature framework for interactive exploration of multidimensional data. Information Visualization, 4(2), 96-113.
- Sips, M., Neubert, B., Lewis, J. P., & Hanrahan, P. (2009). Selecting good views of high-dimensional data using class consistency. Computer Graphics Forum, 28(3), 831-838.
- Johansson, S., & Johansson, J. (2009). Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Transactions on Visualization and Computer Graphics, 15(6), 993-1000.
- Tatu, A., Albuquerque, G., Eisemann, M., Schneidewind, J., Theisel, H., Magnor, M., & Keim, D. (2009). Combining automated analysis and visualization techniques for effective exploration of high-dimensional data. IEEE VAST 2009 (pp. 59-66).
- Molchanov, V., & Linsen, L. (2014). Interactive design of multidimensional data projection layout.
- Jeong, D. H., Ziemkiewicz, C., Fisher, B., Ribarsky, W., & Chang, R. (2009). iPCA: An interactive system for PCA-based visual analytics. Computer Graphics Forum, 28(3), 767-774.
- Wise, J. A., Thomas, J. J., Pennock, K., Lantrip, D., Pottier, M., Schur, A., & Crow, V. (1995). Visualizing the non-visual: Spatial analysis and interaction with information from text documents. IEEE Information Visualization 1995 (pp. 51-58).
- Fried, D., & Kobourov, S. G. (2014). Maps of computer science. IEEE PacificVis 2014 (pp.
113-120). IEEE.
- Stahnke, J., Dörk, M., Müller, B., & Thom, A. (2016). Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. IEEE Transactions on Visualization and Computer Graphics, 22(1), 629-638.
- van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
- Rieck, B., & Leitte, H. (2015). Persistent homology for the evaluation of dimensionality reduction schemes. Computer Graphics Forum, 34(3), 431-440.
- Endert, A., Fiaux, P., & North, C. (2012). Semantic interaction for visual text analytics. ACM CHI 2012 (pp. 473-482).
- Turkay, C., Filzmoser, P., & Hauser, H. (2011). Brushing dimensions: A dual visual analysis model for high-dimensional data. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2591-2599.
- Yang, J., Ward, M. O., & Rundensteiner, E. A. (2002). Visual hierarchical dimension reduction for exploration of high dimensional datasets.
- Krause, J., Dasgupta, A., Fekete, J. D., & Bertini, E. (2016). SeekAView: An intelligent dimensionality reduction strategy for navigating high-dimensional data spaces. IEEE LDAV 2016 (pp. 11-19).
- Wilkinson, L., Anand, A., & Grossman, R. (2005). Graph-theoretic scagnostics.
- Jiang, L., Liu, S., & Chen, C. (2018). Recent research advances on interactive machine learning. Journal of Visualization, 1-17.
- Sacha, D., Zhang, L., Sedlmair, M., Lee, J. A., Peltonen, J., Weiskopf, D., ... & Keim, D. A. (2017). Visual interaction with dimensionality reduction: A structured literature analysis. IEEE Transactions on Visualization and Computer Graphics, 23(1), 241-250.
- Endert, A., Ribarsky, W., Turkay, C., Wong, B. W., Nabney, I., Blanco, I. D., & Rossi, F. (2017).
The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum, 36(8), 458-486.
- Fails, J. A., & Olsen Jr., D. R. (2003). Interactive machine learning. ACM IUI 2003 (pp. 39-45).
- Kandogan, E. (2001). Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. ACM SIGKDD 2001 (pp. 107-116).
- Zhao, Y., Luo, F., Chen, M., Wang, Y., Xia, J., Zhou, F., ... & Chen, W. (2019). Evaluating multi-dimensional visualizations for understanding fuzzy clusters. IEEE Transactions on Visualization and Computer Graphics, 25(1), 12-21.
- Liu, S., Wang, X., Liu, M., & Zhu, J. (2017). Towards better analysis of machine learning models: A visual analytics perspective. Visual Informatics, 1(1), 48-56.
- Kriegel, H. P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.
- Hu, J., & Pei, J. (2018). Subspace multi-clustering: A review. Knowledge and Information Systems, 56(2), 257-284.
- Nam, J. E., & Mueller, K. (2013). TripAdvisor^ND: A tourism-inspired high-dimensional space exploration framework with overview and detail. IEEE Transactions on Visualization and Computer Graphics, 19(2), 291-305.
- Friedman, J. H., & Stuetzle, W. (2002). John W. Tukey's work on interactive graphics. The Annals of Statistics, 30(6), 1629-1639.