IB031 Úvod do strojového učení (Introduction to Machine Learning)
Tomáš Brázdil

Course Info
Resources:
▶ Lectures & tutorials (the main source)
▶ Many books, few perfect for the introductory level. One relatively good one, especially its first part: A. Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media; 3rd edition, 2022
▶ (Almost) infinitely many online courses, tutorials, materials, etc.

Evaluation
The evaluation is composed of three parts:
▶ Mid-term exam: Written exam on the material of the first half of the semester.
▶ End-term exam: The "big" one, covering everything from the semester (with possibly more stress on the second half).
▶ Projects: During tutorials, you will work on larger projects (in pairs).
Each part contributes the following number of points:
▶ Mid-term exam: 25
▶ End-term exam: 50
▶ Project: 25
To pass, you need to obtain at least 60 points.

Distinguishing Properties of the Course
▶ Introductory, prerequisites are held to a minimum
▶ Formal and precise: Be prepared for a complete and "mathematical" description of the presented methods.
I assume that you have basic knowledge of
▶ Elementary mathematical notation (operations on sets, logic, etc.)
▶ Linear algebra: Vectors in Rⁿ, operations on vectors (including the dot product). Geometric interpretation!
▶ Calculus: Functions of multiple real variables, partial derivatives, basic differential calculus.
▶ Probability: Notion of probability distribution, random variables/vectors, expectation.

What Is Machine Learning?
Machine learning is the science (and art) of programming computers so they can learn from data.
Here is a slightly more general definition:
Arthur Samuel, 1959: Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
And a more engineering-oriented one:
Tom Mitchell, 1997: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

Example
In the context of spam filtering:
▶ The task T is to flag spam in new emails.
▶ The experience E is represented by a set of emails labeled either spam or ham by hand (the training data).
▶ The performance measure P could be the accuracy, i.e., the ratio of the number of correctly classified emails to all emails. There are many more performance measures; we will study the basic ones later.
In the context of housing price prediction:
▶ The task T is to predict prices of new houses based on their basic parameters (size, number of bathrooms, etc.).
▶ The experience E is represented by information about existing houses.
▶ The performance measure P could be, e.g., the absolute difference between the predicted and real price.

Examples (cont.)
In the context of game playing:
▶ The task T is to play chess.
▶ The experience E is represented by a series of self-plays where the computer plays against itself.
▶ The performance measure P is winning/losing the game. Here, the trick is to spread the delayed and limited feedback about the result of the game throughout the individual decisions in the game.
In the context of customer behavior:
▶ The task T is to group customers with similar shopping habits in an e-shop.
▶ The experience E consists of lists of items individual customers bought in the shop.
▶ The performance measure P? Measure how "nicely" the customers are grouped
(whether people with similar habits, as seen by humans, fall into the same group).

Comparison of Programming and Learning
How to code a spam filter?
▶ Examine what spam mails typically contain: specific words ("Viagra"), sender's address, etc.
▶ Write down a rule-based system that detects specific features.
▶ Test the program on new emails and (most probably) go back to look for more spam features.

The machine learning way:
▶ Study the problem and collect lots of emails, labeling them spam or ham.
▶ Train a machine learning model that reads an email and decides whether it is spam or ham.
▶ Test the model and (most probably) go back to collect more data and adjust the model.

ML Solutions Are Adaptive
Spam filter: Authors of spam might and will adapt to your spam filter (possibly change the wording to pass through). ML systems can be adjusted to new situations by retraining on new data (unless the data becomes ugly).

ML for Human Understanding
Spam filter: A trained system can be inspected for notorious spam features. Some models allow direct inspection, such as decision trees or linear/logistic regression models.

Usage of Machine Learning
Machine learning suits various applications, especially where traditional methods fall short. Here are some areas where it excels:
▶ Solving complex problems where fine-tuning and rule-based solutions are inadequate.
▶ Tackling complex issues that resist traditional problem-solving approaches.
▶ Adapting to fluctuating environments through retraining on new data.
▶ Gaining insights from large and complex datasets.
In summary, machine learning offers innovative solutions and adaptability for today's complex and ever-changing problems, (sometimes) providing insights beyond the reach of traditional approaches.

Types of Learning
The main categories, based on the information available during training, are:
▶ Supervised learning
▶ Unsupervised learning
▶ Semi-supervised learning
▶ Self-supervised learning
▶ Reinforcement learning

Supervised Learning
Labels are available for all input data. Typical supervised learning tasks are
▶ Classification, where the aim is to classify inputs into (typically few) classes (e.g., the spam filter, where the classes are spam/ham)
▶ Regression, where a numerical value is output for a given input (e.g., housing prices)

Unsupervised Learning
No labels are available for input data. Typical unsupervised learning tasks are
▶ Clustering, where inputs are grouped according to their features (e.g., clients of a bank grouped according to their age, wealth, etc.)
▶ Association, where interesting relations and rules are discovered among the features of inputs (e.g., market basket mining, where associations between various types of goods are learned from the behavior of customers)
▶ Dimensionality reduction, which reduces high-dimensional data to few dimensions (e.g., images to a few image features)

Semi-Supervised Learning
Labels are available only for some data. For example, medical data, where an elaborate diagnosis is available only for some patients. Combines supervised and unsupervised learning: e.g., cluster all data and label the unlabeled inputs with the most common labels in their clusters.

Self-Supervised Learning
Generate labels from (unlabeled) inputs. The goal is to learn typical features of the data. The model can later be modified to generate images, classify, etc.

Reinforcement Learning
Learn from performing actions and getting feedback from the environment.
ML Applications Highlights
▶ ChatGPT (and similar generative models)
  ▶ The basis is a generative language model, i.e., a text-generating model trained on texts in a self-supervised way
  ▶ Currently extended to multimodal versions (text, image, sound)
▶ Machine translation, image captioning
  ▶ Google Translate, etc.
  ▶ Typically (semi-)supervised learning
▶ Various image recognition and processing tasks
  ▶ In medicine, where it is slowly making its way into hospitals as assistance tools
  ▶ Automotive, advertising, quality control, etc.
▶ Science
  ▶ Chemistry & biology: e.g., prediction of features of chemical compounds (AlphaFold)
▶ Various "table" data processing in finance, management, etc.
  ▶ Often straightforward methods (linear/logistic regression)
  ▶ Essential but not fancy
▶ Game playing: More fancy than useful; learning models beating humans in several difficult games.

ML in Context

Supervised Learning

Example - Fruit Recognition
The goal: Create an automatic system for fruit recognition, concretely apples, lemons, and mandarins.
Inputs: Measurements of the height and width of each fruit.
Suppose we have a dataset of dimensions of several fruits labeled with the correct class.

Data
Use similarity to solve the problem.

KNN Classification
Given a new fruit, what is it? Find the five closest examples. Among the five closest:
▶ M = 4 mandarins
▶ A = 1 apple
▶ L = 0 lemons
It is a mandarin! Where is the machine learning?

Learning in Fruit Classification with KNN
Learning: memorize the labeled dataset. Inference: classify by the nearest neighbors.

Fruit Classification Algorithm
Input: A fruit F with dimensions height, width
Output: mandarin, lemon, apple
1: Find K examples {E1, . . . , EK} in the dataset whose dimensions are closest to the dimensions of the fruit F
2: Count the number of examples of each class in {E1, . . . , EK}: M mandarins, L lemons, A apples
3: if M ≥ L and M ≥ A then return mandarin
4: else if L ≥ A then return lemon
5: else return apple
6: end if
Does it work?

Testing the Model for Fruit Classification
Consider a test set of new instances (K = 5, d is Euclidean): Perfect classification of new data! Just deploy and sell!!

K Nearest Neighbors
. . . on ideal data

Learning and Inference
Two crucial components of machine learning are the following:
Learning: creating the model. Inference: using the model.

Training Data
Assume tabular training data, i.e., of the form

x11 x12 · · · x1n | c1
x21 x22 · · · x2n | c2
 ⋮   ⋮        ⋮  |  ⋮
xp1 xp2 · · · xpn | cp

Formally, we define the training dataset
T = {(x⃗k, ck) | k = 1, . . . , p}
Here each x⃗k ∈ Rⁿ is an input vector and ck ∈ C is the correct class. For the fruit example:
T = {((4.0, 6.5), M), ((4.47, 7.13), M), ((6.49, 7.0), A), . . .}

KNN: Learning
Consider the training set T = {(x⃗k, ck) | k = 1, . . . , p} and memorize it exactly as it is. Store it in a table. Possibly use a clever representation allowing fast computation of nearest neighbors, such as k-d trees (out of the scope of this lecture). Also,
▶ determine the number of neighbors K ∈ N,
▶ and the distance measure d.

Inference in KNN
Assume a KNN "trained" by memorizing T = {(x⃗k, ck) ∈ Rⁿ × C | k = 1, . . . , p}, a constant K ∈ N, and a distance measure d. For d, consider the Euclidean distance, but different norms may also be used to define different distance measures.
Input: A vector z⃗ = (z1, . . . , zn) ∈ Rⁿ
Output: A class from C
1: Find K indices of examples X = {i1, . . . , iK} ⊆ {1, . . . , p} with minimum distance to z⃗, i.e., satisfying
max{d(z⃗, x⃗ℓ) | ℓ ∈ X} ≤ min{d(z⃗, x⃗ℓ) | ℓ ∈ {1, . . . , p} ∖ X}
2: For every c ∈ C count the number #c of elements ℓ of X such that cℓ = c
3: Return some cmax ∈ argmax_{c∈C} #c, i.e., a class cmax ∈ C maximizing #c

The Resulting Model
What exactly constitutes the model? The model consists of
▶ The trained parameters: in this case, the memorized training data.
▶ The hyperparameters set "from the outside": in this case, the number of neighbors K and the distance measure d.
Note that different settings of K lead to different classifiers (for the same d). A minimal implementation of the inference procedure is sketched below.
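A minimal sketch of KNN inference, assuming Python with NumPy; the tiny fruit training set is hypothetical, in the spirit of the table above:

import numpy as np

def knn_classify(X_train, y_train, z, K=5):
    """Return the majority class among the K training examples closest to z."""
    dists = np.linalg.norm(X_train - z, axis=1)    # Euclidean distance d
    nearest = np.argsort(dists)[:K]                # indices of the K nearest neighbors
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]              # some cmax maximizing #c

# Hypothetical memorized training set T: (height, width) -> class
X = np.array([[4.0, 6.5], [4.47, 7.13], [6.49, 7.0], [7.2, 7.5]])
y = np.array(["M", "M", "A", "A"])
print(knn_classify(X, y, np.array([4.2, 6.8]), K=3))  # -> "M"

Note that "training" produced no computation at all; all the work happens at inference time, which is exactly the trade-off discussed above.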
In Practice
. . . to get an efficient solution:
▶ Deal with issues in the data
  ▶ Data almost always come in weird formats, with inconsistencies, missing values, wrong values, etc.
  ▶ Data rarely have the ideal form for a given learning model. We need to ingest, validate, and preprocess the data.
▶ Deal with issues in the model
  ▶ In KNN, the training just memorizes the examples, but at least K can be tuned. We need to tune the model.
▶ Deal with the wrong model by testing and validating in conditions as realistic as possible.
▶ Deal with deployment: real-world application issues involving, e.g., implementation in embedded devices with limited resources.

Models Considered in This Course
Throughout this course, we will meet the following models:
▶ KNN (already did)
▶ Decision trees
▶ (Naive) Bayes classifier
▶ Clustering: K-means and hierarchical
▶ Linear and logistic regression
▶ Support vector machines (SVM)
▶ Kernel linear models
▶ Neural networks (light intro to feed-forward networks)
▶ Ensemble methods + random forests
▶ (maybe some reinforcement learning)
. . . but first, let us see the whole machine learning pipeline.

Machine Learning Pipeline

Fetch Data
Always start with
▶ The problem formulation & understanding. For example, diagnosis of diabetes from medical records. What information is (possibly) sufficient for such a diagnosis?
▶ Finding data sources. In our example, the sources are hospitals. It would be best to persuade them to give you the data and sign a contract.
▶ Collecting the data. In our example, the data is possibly small (just tables with results of tests). But for other diagnoses, you may include huge amounts of data from MRI, CT, etc. Then, the collection itself might be a serious technical problem.
▶ Integrating data from various sources. A serious diagnostic system must be trained/tested on data from many hospitals. You must blend the data from various sources (different formats, etc.).

For simple "toy" machine learning projects, you may fetch prepared datasets from various databases on the internet.
The data should be stored in an identified location and versioned. You will probably keep adding data and training models on ever-changing datasets. You have to be able to keep track of the changes and map training data to particular models. Tools such as MLflow or Weights & Biases might be helpful.

Data Separation
At this point, you should randomize the ordering of the data and set aside a test set to be used in model evaluation! The test data are supposed to simulate the actual conditions, i.e., they should be "unseen".

Data Exploration
Compute basic statistics to identify missing values, outliers, etc. Both steps are sketched below.
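A sketch of the separation and exploration steps, assuming Python with pandas and scikit-learn; the file name "data.csv" is hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")

# Randomize and hold out a test set to be touched only in the final evaluation.
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

# Basic exploration on the training part: value ranges, missing values, duplicates.
print(train_df.describe())          # count/mean/std/min/max per numeric column
print(train_df.isna().sum())        # missing values per column
print(train_df.duplicated().sum())  # number of duplicate rows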
Clean Data
The cleaning usually comprises the following steps:
▶ Fix or remove incorrect or corrupted values.
▶ Identify outliers and decide what to do with them. Outliers may harm some training methods and are not "representative". However, sometimes they naturally belong to the dataset, and expert insight is needed.
▶ Fix formatting. For example, a date may be expressed in many ways, as may a simple Yes/No answer.
▶ Resolve missing values (by either removing the whole examples or imputing). Many methods have been developed for missing-value imputation. It is a sensitive issue because new values may strongly bias the model.
▶ Remove duplicates.
The above steps often affect the training and need expertise in the application domain. Later in this course, we will discuss techniques for data cleaning.

Prepare Data
Unlike cleaning, which is application-dependent, data preparation/transformation is model-dependent. This usually subsumes:
▶ Scaling: Setting values of inputs to a similar range. Some models, especially those utilizing distance, are sensitive to large differences in input scales.
▶ Encoding: Encode non-numeric data using real-valued vectors. Many models, especially those based on geometry, work only with numeric data. Non-numeric data such as Yes/No or Short/Medium/Long must be encoded appropriately.
▶ Binning or discretization: Convert continuous features into discrete bins to capture patterns in ranges.
Comment: Sometimes normalization, that is, changing the distribution of inputs to resemble the normal distribution, is mentioned. This step is typically not essential for machine learning itself; it is, however, important when statistical inference is used to test the significance of learned parameters.

▶ Feature selection: Throw out input features that are too "similar" to other features. For example, if the temperature is measured both in Celsius and in Kelvin, keep one of them. The relationship can, of course, be a more complex (non-linear) correlation.
▶ Dimensionality reduction: Transforming data from Rⁿ to Rᵐ where m ≪ n. Growing dimension means growing difficulty of training for all models. Some models cease to work for high-dimensional data. The reduction typically searches for a few important characteristic features of the inputs.
▶ Feature aggregation: Introducing new features using operations on the original ones. We will see kernel transformations later in this course, allowing simple models to solve complex problems.
A sketch of typical scaling and encoding steps follows.
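A sketch of model-dependent preparation with scikit-learn: scaling numeric columns and one-hot encoding a categorical one. The column names and values are hypothetical:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical cleaned training data with one categorical column.
train_df = pd.DataFrame({
    "height": [160.0, 175.0, 182.0, 158.0],
    "weight": [55.0, 72.0, 90.0, 50.0],
    "drink_size": ["small", "large", "medium", "small"],
})

prep = ColumnTransformer([
    ("scale", StandardScaler(), ["height", "weight"]),  # bring inputs to similar ranges
    ("encode", OneHotEncoder(), ["drink_size"]),        # symbols -> 0/1 vectors
])

# Fit the transformation on training data only and reuse it on new data,
# so no information from the test set leaks into preprocessing.
X_train = prep.fit_transform(train_df)
print(X_train.shape)  # 4 examples, 2 scaled + 3 one-hot columns -> (4, 5)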
Train Model
Now that the dataset has been cleaned, we may train a model. Before training, we should split the dataset into
▶ a training dataset, on which the model will learn,
▶ a validation dataset, on which we fine-tune hyperparameters.
The resulting model is obtained after several iterations of the above process.

Evaluate Model
Here, we use the test set that we separated during data fetching.
In some cases, a brand new test set can be generated; for example, patients are examined regularly, creating new records continuously. In other cases, it is tough to obtain new data; for example, new expensive and difficult measurements may be needed.
Critical issue: Make sure that you are truly testing exactly the whole inference process. Often, just the model is tested while the testing and production inference engines are separate. This leads to truly nasty errors in production!
We will discuss various generic metrics helpful in measuring the quality of the resulting model.

Deploy to Production
Deployment of machine learning models is a complex, application-dependent question. The recently emerging area of MLOps is concerned with the engineering side of model deployment.
From the technical point of view, the typical issues solved by MLOps teams are
▶ how to extract/process data in real time
▶ how much storage is required
▶ how to store/collect model (and data) artifacts/predictions
▶ how to set up APIs, tools, and software environments
▶ what the period of predictions should be (instantaneous or batch predictions)
▶ how to set up hardware requirements (or cloud requirements for on-cloud environments) according to the computational resources required
▶ how to set up a pipeline for continuous training and parameter tuning

From the user's point of view:
▶ How to get sensible and valuable user output?
  ▶ AI researchers will be satisfied with tons of running text in terminals.
  ▶ "Normal" people need a graphical interface with understandable output.
  ▶ Experts working in other domains typically demand extreme speed and clarity.
▶ How do you persuade users that the AI is working for them?
  ▶ Especially if safety is at stake, you need to have outstanding arguments and explanations ready for end users.
  ▶ In many areas (medicine, automotive), ML-based devices need to be certified.
This complex subject will only be touched on in this course.

Monitor, Collect Data
Deployed machine learning models must be constantly monitored. Because of the influx of new data, ML models work in highly dynamic environments. For example, an image-processing medical diagnostic model suddenly misdiagnosed a patient because a nurse marked the sample with a marker pen. Every customer has a different infrastructure and may produce data slightly differently.
Data for retraining and improvement should be stored. Also, many areas allow active learning, where users provide feedback for (continuous) retraining of the models.

Data

Data Science Example
You receive data from a medical researcher concerning a project that you are eager to work on. The data consists of a table with 1,000 rows and five columns:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
· · ·
The aim is to predict the last field given the others. The medical researcher does not elaborate further on the data, but it seems pretty easy to work with, right? After a few days, you have trained a model that predicts numbers resembling the ones in the table. You contact the medical researcher and discuss the results.

Model Discussion
Researcher: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few interesting results.
Researcher: Amazing. There were so many data issues with this set of patients that I couldn't do much.
Data Miner: Oh? I didn't hear about any possible problems.
Researcher: Well, first, there is field 5, the variable we want to predict. It's common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?
Data Miner: No.
Researcher: But surely you heard about what happened to field 4? It's supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but because of a data entry error, all 10's were changed into 0's.
Unfortunately, since some of the patients have missing values for this field, it's impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Researcher: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Researcher: Anyway, given all those problems, I'm surprised you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
Researcher: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Researcher: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it isn't very sensible. Sorry.
OK, what's the point? You have to understand the task you want to solve and the data!

Data Objects
Data objects represent entities we work with (e.g., classify them). For example, in cancer prediction, the data objects are patients. In fruit classification, the data objects are individual fruits.
Data objects are described by attributes (or features, or variables). For example, the age, weight, genetic profile, and other characteristics of a patient. Or the width and height of a fruit.

Attributes vs Features vs Variables
The name differs from field to field, so the following names are usually used as synonyms:
▶ Attributes - used mostly by database and data mining experts.
▶ Features - used mostly by machine learning experts.
▶ Variables - used mostly by statisticians.
One may make some distinctions:
▶ Attributes represent information about the object without any additional assumptions.
▶ Features assume that their values are somewhat characteristic of the object.
▶ Variables assume that there is some process behind them (typically a random process in the case of statistics).

Data Types - Categorical Attributes
Categorical attributes (nominal attributes) are symbols or names of things.
▶ Each value represents some kind of category, code, or state.
▶ Values are not ordered and should not be used quantitatively (in computer science, such values are known as enumerations).
▶ Examples:
hair color ∈ {black, brown, blond, red, auburn, gray, white}
marital status ∈ {single, married, divorced, widowed}
customer ID ∈ {0, 1, 2, . . .}
Even though the last one is usually expressed using numbers, it should not be used quantitatively.
Binary attributes are categorical attributes with only two values.

Data Types - Ordinal Attributes
An ordinal attribute is an attribute whose values have a meaningful order or ranking among them. Examples:
drink size ∈ {small, medium, large}
grades ∈ {A, B, C, D, E, F}
It can also be obtained by discretizing numeric quantities into a series of intervals. Ordinal attributes do not allow arithmetic operations.
Categorical and ordinal attributes are called qualitative attributes. Next, we look at numeric, i.e., quantitative attributes.

Data Types - Numeric Attributes
Numeric attributes are quantities represented by numbers. We distinguish two types: interval-scale and ratio-scale.

Interval scale:
▶ Measurement interval: equal intervals between consecutive points.
▶ Absolute zero: lacks a true zero point.
▶ Statistical analysis: limited to addition and subtraction.
▶ Meaningful ratios: ratios are not meaningful, due to the lack of a true zero.
▶ Examples: IQ scores, Celsius temperature, NPS data, etc.

Ratio scale:
▶ Measurement interval: equal intervals, with the presence of a true zero.
▶ Absolute zero: possesses a true zero point.
▶ Statistical analysis: allows for meaningful multiplication and division.
▶ Meaningful ratios: ratios are meaningful, due to the presence of a true zero.
▶ Examples: height, weight, income, etc.
Discrete vs Continuous Attributes
Often, two kinds of numeric attributes are distinguished:
▶ Discrete: A finite or countably infinite range of values, i.e., the values may be represented by integers. Some (but not all) authors count the qualitative (categorical, ordinal) attributes among the discrete attributes.
▶ Continuous: An uncountably infinite range of values, typically an interval.
There are several more or less formal definitions of continuous attributes in the literature. For example:
▶ All non-discrete variables.
▶ Variables with an infinite number of values between any two values.
▶ Variables whose values are measured (??).
Deeper characteristics of data (statistical properties, etc.) will be examined at tutorials.

Decision Trees
▶ One of the widely used methods of machine learning.
▶ Intuitively simple, directly explainable.
▶ Basis for random forests (a powerful model).
▶ We will consider the ID3 algorithm (Quinlan, 1979) and various adjustments that appear in C4.5, CART, etc.

Consider the weather forecast for tennis playing. How would you decide whether to play today? How do we obtain such a tree based on experience/data?

Learning Decision Trees
Consider data represented as follows:
▶ A finite set of attributes 𝒜 = {A1, . . . , An}.
▶ Each attribute A ∈ 𝒜 has its set of values V(A).
We start with trees on discrete datasets, that is, we assume V(A) finite for all A ∈ 𝒜.
Objects to be classified are described by vectors of values of all attributes:
x⃗ = (x1, . . . , xn) ∈ V(A1) × · · · × V(An)
Given x⃗ and an attribute Ak, we denote by Ak(x⃗) the value xk of the attribute Ak in x⃗.
Consider a set C of classes. We consider multiclass classification in general, i.e., C is an arbitrary finite set.
Example
The tennis problem:
▶ The attributes are A1 = Outlook, A2 = Temperature, A3 = Humidity, A4 = Wind.
▶ The sets of values of the attributes:
  ▶ V(A1) = {Sunny, Overcast, Rain}
  ▶ V(A2) = {Hot, Mild, Cool}
  ▶ V(A3) = {High, Normal}
  ▶ V(A4) = {Strong, Weak}
▶ Consider x⃗ = (Overcast, Hot, Normal, Weak) ∈ V(A1) × V(A2) × V(A3) × V(A4). Then
A3(x⃗) = Humidity(x⃗) = Normal
A4(x⃗) = Wind(x⃗) = Weak
▶ C = {Yes, No}

Decision Trees
Consider (directed, rooted) trees T = (T, E) where T is a set of nodes and E ⊆ T × T is a set of directed edges. Denote by Tleaf ⊆ T the set of all leaves of the tree and by Tint the set T ∖ Tleaf of internal nodes.
A decision tree is a tree T = (T, E) where
▶ each leaf τ ∈ Tleaf is assigned a class C(τ) ∈ C,
▶ each internal node τ ∈ Tint is assigned an attribute A(τ) ∈ 𝒜,
▶ and there is a bijection between the edges from τ and the values of the attribute A(τ).
Given an edge (τ, τ′) ∈ E, we write V(τ, τ′) to denote the value of the attribute A(τ) assigned to the edge.
Inference: Given an input x⃗, we traverse the tree from the root to a leaf, always choosing edges labeled with the values of attributes from x⃗. The output is the class labeling the leaf.

Example
T = {O, H, W, z1, z2, z3, z4, z5}
Tleaf = {z1, z2, z3, z4, z5}, Tint = {O, H, W}
E = {(O, H), (O, W), (H, z1), (H, z2), (O, z3), (W, z4), (W, z5)}
C(z1) = C(z4) = No, C(z2) = C(z3) = C(z5) = Yes
A(O) = Outlook, A(H) = Humidity, A(W) = Wind
V(O, H) = Sunny, V(O, z3) = Overcast, V(O, W) = Rain
V(H, z1) = High, V(H, z2) = Normal
V(W, z4) = Strong, V(W, z5) = Weak
Inference: For (Rain, Hot, High, Strong) we reach z4, yielding No. (A code sketch of this tree follows below.)

Training Dataset
Consider a training dataset
D = {(x⃗k, ck) | k = 1, . . . , p}
Here x⃗k ∈ V(A1) × · · · × V(An) and ck ∈ C for every k. Technically, D can be a multiset containing several occurrences of the same vector.
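A sketch of the tree from the example above as a nested Python structure, with inference by traversal. The representation (tuples and dicts) is my own illustration, not notation from the lecture:

tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, x):
    """Traverse from the root, following edges labeled by attribute values of x."""
    while isinstance(node, tuple):   # internal node tau: (attribute A(tau), edges)
        attribute, edges = node
        node = edges[x[attribute]]   # follow the edge labeled V(tau, tau')
    return node                      # leaf: the class C(tau)

x = {"Outlook": "Rain", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))  # -> "No" (the leaf z4 under Rain/Strong)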
Index | Outlook | Temperature | Humidity | Wind | PlayTennis
1 | Sunny | Hot | High | Weak | No
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rain | Mild | High | Weak | Yes
5 | Rain | Cool | Normal | Weak | Yes
6 | Rain | Cool | Normal | Strong | No
7 | Overcast | Cool | Normal | Strong | Yes
8 | Sunny | Mild | High | Weak | No
9 | Sunny | Cool | Normal | Weak | Yes
10 | Rain | Mild | Normal | Weak | Yes
11 | Sunny | Mild | Normal | Strong | Yes
12 | Overcast | Mild | High | Strong | Yes
13 | Overcast | Hot | Normal | Weak | Yes
14 | Rain | Mild | High | Strong | No

D = {((Sunny, Hot, High, Weak), No), ((Sunny, Hot, High, Strong), No), . . . , ((Rain, Mild, High, Strong), No)}

Learning Decision Trees
The learning algorithm ID3 works as follows:
▶ Start with the whole training dataset D.
▶ If there is just a single class in D, create a single-node decision tree that returns the class.
▶ Otherwise, identify an attribute A ∈ 𝒜 which best classifies the examples in D. For every v ∈ V(A) we obtain
D_v = {(x⃗, c) ∈ D | A(x⃗) = v}
We aim to have each D_v as pure as possible, that is, ideally, to contain examples of just a single class.
▶ Finally,
  ▶ create a root node τ of a decision tree,
  ▶ assign the attribute A to τ,
  ▶ for every v ∈ V(A), recursively construct a decision tree with a root τ_v using D_v,
  ▶ for every v ∈ V(A), introduce an edge (τ, τ_v) assigned v.

1: function ID3(dataset D, attribute set 𝒜)
2:   Create a root node τ for the tree
3:   if D = ∅ then
4:     Return the single node τ assigned a default class.
5:   else if all examples in D are of the same class c then
6:     Return the single-node tree where τ is assigned c
7:   else if the set of attributes 𝒜 is empty then
8:     Return the single-node tree where τ is assigned the most common class in D
9:   else
10:    Choose the attribute A ∈ 𝒜 best classifying the examples in D
11:    Set the decision attribute for τ to A
12:    for each value v ∈ V(A) do
13:      Compute a decision tree ID3(D_v, 𝒜 ∖ {A}) with root τ_v,
14:      add a new edge (τ, τ_v) assigned v.
15:    end for
16:  end if
17:  return τ
18: end function

Best Classifying Attribute
We aim to choose an attribute that best informs us about the class. As a result, we would use as few attributes as possible and obtain a small tree containing only class-relevant decisions.
How to choose an attribute that best classifies the examples in D? There are several measures used in practice. The most common are
▶ information gain
▶ Gini impurity decrease

Information Gain
The information gain is based on the notion of entropy. We need some notation:
▶ Given a training dataset D and a class c ∈ C, we denote by p_c the proportion of examples with class c in D.
▶ We define the entropy of D by
Entropy(D) = −∑_{c∈C} p_c · log₂ p_c
▶ The information gain of an attribute A is then defined by
Gain(D, A) = Entropy(D) − ∑_{v∈V(A)} (|D_v| / |D|) · Entropy(D_v)
In every step of the ID3 algorithm, we choose an attribute maximizing the information gain for the current dataset D.

The intuition behind information gain:
▶ Consider C = {0, 1} and let p be the proportion of examples of class 1. The entropy, as a function of p, measures the "uncertainty" of the class: it is 0 for p ∈ {0, 1} and maximal at p = 1/2.
▶ ∑_{v∈V(A)} (|D_v| / |D|) · Entropy(D_v) is the weighted uncertainty of classes in the sets D_v (weighted by the relative size of D_v).
▶ Gain(D, A) measures the reduction in the uncertainty of the class achieved by splitting D according to A. A small computation on the tennis data is sketched below.
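A minimal sketch computing entropy and information gain on the tennis data above; the standard values are Entropy(D) ≈ 0.940 and Gain(D, Outlook) ≈ 0.247:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(rows, labels, attribute_index):
    """Entropy(D) minus the weighted entropy of the split into the D_v, v in V(A)."""
    total, n = entropy(labels), len(labels)
    for v in set(row[attribute_index] for row in rows):
        sub = [c for row, c in zip(rows, labels) if row[attribute_index] == v]
        total -= len(sub) / n * entropy(sub)
    return total

# The 14 examples; attributes are (Outlook, Temperature, Humidity, Wind).
rows = [("Sunny","Hot","High","Weak"), ("Sunny","Hot","High","Strong"),
        ("Overcast","Hot","High","Weak"), ("Rain","Mild","High","Weak"),
        ("Rain","Cool","Normal","Weak"), ("Rain","Cool","Normal","Strong"),
        ("Overcast","Cool","Normal","Strong"), ("Sunny","Mild","High","Weak"),
        ("Sunny","Cool","Normal","Weak"), ("Rain","Mild","Normal","Weak"),
        ("Sunny","Mild","Normal","Strong"), ("Overcast","Mild","High","Strong"),
        ("Overcast","Hot","Normal","Weak"), ("Rain","Mild","High","Strong")]
labels = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
print(round(gain(rows, labels, 0), 3))  # Outlook -> 0.247, the largest gain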
Gini Impurity
▶ We define the Gini impurity of D by
Gini(D) = 1 − ∑_{c∈C} p_c²
▶ The impurity decrease of an attribute A is then defined similarly to the gain in the entropy case:
ImpDec(D, A) = Gini(D) − ∑_{v∈V(A)} (|D_v| / |D|) · Gini(D_v)
In every step of the ID3 algorithm, we choose an attribute maximizing the impurity decrease for the current dataset D.

What is the intuition behind Gini(D)? Assume we randomly and independently choose objects from D. Then 1 − ∑_{c∈C} p_c² is the probability of choosing two objects of different classes in two consecutive independent trials. Indeed, p_c is the probability of choosing an object of class c, p_c² the probability of choosing an object of class c twice, and ∑_{c∈C} p_c² the probability of choosing two objects of the same class.
In what follows (and at the exam), we will work only with the Gini impurity as it is easier to compute.

Example
Consider our tennis example (see the table).
▶ Consider the whole dataset D.
  ▶ p_Yes = 9/14
  ▶ p_No = 5/14
  ▶ Gini(D) = 1 − (9/14)² − (5/14)² = 0.45918
▶ For A = Outlook we get
  ▶ Gini(D_Sunny) = 1 − (2/5)² − (3/5)² = 0.48
  ▶ Gini(D_Overcast) = 1 − 1² − 0² = 0
  ▶ Gini(D_Rain) = 1 − (3/5)² − (2/5)² = 0.48
Thus
ImpDec(D, Outlook) = 0.459 − (5/14) · 0.48 − (4/14) · 0 − (5/14) · 0.48 = 0.117
▶ ImpDec(D, Temperature) = 0.018
▶ ImpDec(D, Humidity) = 0.091
▶ ImpDec(D, Wind) = 0.030
So the largest impurity decrease is achieved by Outlook.

Going further on, consider D = D_Sunny. We get
▶ ImpDec(D, Temperature) = 0.279
▶ ImpDec(D, Humidity) = 0.48
▶ ImpDec(D, Wind) = 0.013
The best attribute to split on after the Sunny branch of Outlook is Humidity.
Now consider D = D_Rain:
▶ ImpDec(D, Temperature) = 0.013
▶ ImpDec(D, Humidity) = 0.013
▶ ImpDec(D, Wind) = 0.48
The best attribute to split on after the Rain branch of Outlook is Wind.
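The same computation with the Gini impurity, a minimal sketch reproducing ImpDec(D, Outlook) ≈ 0.117 (reusing rows and labels from the previous sketch):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def imp_dec(rows, labels, attribute_index):
    """Gini(D) minus the weighted Gini impurity of the split into the D_v."""
    result, n = gini(labels), len(labels)
    for v in set(row[attribute_index] for row in rows):
        sub = [c for row, c in zip(rows, labels) if row[attribute_index] == v]
        result -= len(sub) / n * gini(sub)
    return result

# With rows/labels as defined in the information-gain sketch:
# imp_dec(rows, labels, 0) -> 0.1163...  (Outlook, the best attribute)
# imp_dec(rows, labels, 2) -> 0.0918...  (Humidity)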
Continuous-Valued Attributes
What if the values of an attribute A come from a continuous domain? That is, A is a numerical attribute that can take any value in an interval, such as temperature, size, time, etc.
Consider an internal node τ ∈ Tint assigned such a continuous attribute A. Then
▶ τ is assigned a threshold value, called a cut point, H ∈ R,
▶ there are two edges e_true, e_false from τ,
▶ e_true is labeled with True and e_false with False.
During inference, when considering an example x⃗ in the node τ,
▶ evaluate A(x⃗) ≤ H,
▶ if A(x⃗) ≤ H, then follow e_true,
▶ else follow e_false.
In training, the cut point is chosen from the attribute values occurring in the training set, using information gain/impurity decrease as for discrete attributes.

Iris Example
Attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Classes (Variety): Setosa, Versicolor, Virginica

The dataset (150 examples):
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Variety
5.5 | 3.5 | 1.3 | 0.2 | Setosa
6.8 | 2.8 | 4.8 | 1.4 | Versicolor
6.7 | 3.1 | 4.7 | 1.5 | Versicolor
6.9 | 3.1 | 5.1 | 2.3 | Virginica
7.3 | 2.9 | 6.3 | 1.8 | Virginica
5.4 | 3.7 | 1.5 | 0.2 | Setosa
4.6 | 3.4 | 1.4 | 0.3 | Setosa
6.2 | 2.8 | 4.8 | 1.8 | Virginica
5.4 | 3.0 | 4.5 | 1.5 | Versicolor
4.7 | 3.2 | 1.6 | 0.2 | Setosa
6.7 | 3.3 | 5.7 | 2.1 | Virginica
5.0 | 3.4 | 1.5 | 0.2 | Setosa
5.0 | 3.0 | 1.6 | 0.2 | Setosa
4.4 | 2.9 | 1.4 | 0.2 | Setosa
6.0 | 3.4 | 4.5 | 1.6 | Versicolor
5.1 | 3.5 | 1.4 | 0.2 | Setosa
6.6 | 3.0 | 4.4 | 1.4 | Versicolor
5.9 | 3.2 | 4.8 | 1.8 | Versicolor
5.6 | 2.8 | 4.9 | 2.0 | Virginica
· · ·

Iris Example - Decision Tree

Iris Example - Decision Tree Boundaries
If the leaves are split further, the Depth = 2 boundary would be added.

Attribute Importance Computation
How important are the attributes for the trained tree T? That depends on
▶ how close they are to the root of T,
▶ how large an information gain/decrease in impurity they give.
There are several formulae for computing the importance. One example is the mean decrease in impurity, defined as follows. Consider a decision tree T trained on a dataset D using ID3. For every node τ of T, denote by D[τ] the subset of D which was used in the ID3 procedure when the node τ was created (line 2). Consider an attribute A and denote by T[A] ⊆ Tint the set of all nodes of T assigned the attribute A by ID3 (line 11). Then define the importance as the average decrease in Gini impurity (i.e., average ImpDec) over the nodes of T[A]:
GiniImportance(A) = ∑_{τ∈T[A]} (|D[τ]| / |D|) · ImpDec(D[τ], A)
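A sketch using scikit-learn's decision tree on the Iris data; its feature_importances_ attribute implements exactly this mean decrease in impurity (the printed values are indicative, not exact figures from the lecture):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")  # the petal measurements dominate for Iris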
Decision Trees: Practical Issues
▶ Data preprocessing
▶ Model tuning (overfitting and underfitting)
▶ Sensitivity to changes in data/hyperparameters
▶ Learning representation problems (the XOR)

Data Preprocessing
Little preprocessing is needed for decision trees. Of course, ideally, clean up the data, but
▶ Missing values are not such a big issue (when considering a concrete example, exclude the attributes with missing values).
▶ Outliers are not such a big issue either; the division into nodes is based on relative counts, not concrete values.
▶ Decision trees can cope with continuous and categorical values directly (i.e., no encoding is necessary).
Imbalanced classes, however, might cause problems because of the small information gain/impurity decrease achieved by splitting.

Imbalanced Classes
Consider a dataset D where
▶ there are two classes, C = {0, 1},
▶ 10⁶ examples have the class 1,
▶ 100 examples have the class 0.
Then
▶ p₁ = 10⁶/(10⁶ + 100) ≈ 1 and p₀ = 100/(10⁶ + 100) ≈ 0,
▶ thus the Gini impurity 1 − p₁² − p₀² ≈ 0.
Consider an attribute A with V(A) = {a, b}. Splitting D according to A gives two sets D_a and D_b. What is the impurity decrease caused by the attribute?
ImpDec(D, A) = Gini(D) − (|D_a| / |D|) · Gini(D_a) − (|D_b| / |D|) · Gini(D_b)
For small |D_a| (say ≤ 1000) we have small |D_a|/|D|. For not so small D_a we have Gini(D_a) ≈ 0. In both cases, the impurity decrease is very small.

Model Tuning - Over/Under Fitting
The behavior of the model on the training set:
▶ The left one is strongly overfitting. It would possibly not work well on new data.
▶ The right one is strongly underfitting. It would probably give poor classification results.
▶ The middle one seems good (but still needs to be tested on fresh data).

Model Tuning - Overfitting in Decision Trees
See the overfitting on the left and the "nice" model on the right. Both overfitting and underfitting are best avoided. But how do we find out?

Model Tuning (In General)
Recall from the first lecture: The validation should be done on a validation set separated from the training set. We will discuss more sophisticated techniques later.
What hyperparameters to set? (See the next slide.)
What to observe? In the case of decision trees, one should observe the difference between performance measures (e.g., classification accuracy) on the training and validation sets. A too-large difference implies an improperly fitting model.

How to Fit Decision Trees?
There are several approaches available for decision trees. Generally, overfitting can be either prevented or resolved.
▶ Pre-pruning: Build the tree so that it does not overfit, by restricting its size.
▶ Post-pruning: Overfit with a large tree and remove subtrees to make it smaller.
▶ Ensemble methods: Fit several different trees and let them classify together (e.g., using majority voting).
The post-pruning approach has been more successful in practice than pre-pruning because it is usually hard to say when to stop growing the tree. We shall meet this controversy also in deep learning, where recent history shows a similar phenomenon. The ensemble methods will be covered later when we discuss random forests.

Pre-Pruning - Hyperparameters
Hyperparameters controlling the size of the tree:
▶ Maximum depth - do not grow the tree beyond the maximum depth. The deeper the tree, the more complex the models you can create ⇒ overfitting. Low depth may restrict expressivity.
▶ Minimum number of examples to split a node - if D[τ] is small, τ becomes a leaf (labeled with the majority class). Higher values prevent the model from learning relations specific to only a few examples. Too high values may result in underfitting.
▶ Minimum number of examples required to be in a leaf - similar to the previous one. A higher number means we cannot have very specific branches concerned with particular combinations of values.
▶ Minimum information gain/impurity decrease - a small impurity decrease means that the split does not contribute much to the classification (the class proportions after the split are similar to the proportions before it). However, keep in mind that it is the weighted average impurity after the split.

Post-Pruning - Reduced Error Pruning
Train a large tree and then remove nodes that make the classification worse on the validation set. Given a decision tree T and its internal node τ ∈ Tint, we denote by T₋τ the tree obtained from T by removing the subtree rooted in τ, i.e., τ is a leaf of T₋τ.
1: Train T to maximum fit on the training dataset.
2: while true do
3:   Err[T] ← the error of T on the validation set.
4:   for τ ∈ Tint do
5:     Err[T₋τ] ← the error of T₋τ on the validation set.
6:   end for
7:   if Err[T] ≤ min{Err[T₋τ] | τ ∈ Tint} then return T
8:   else
9:     T ← argmin{Err[T₋τ] | τ ∈ Tint}
10:  end if
11: end while
The error Err[T] can be any measure of the "badness" of the decision tree T, for example 1 − Accuracy.

Other Pruning Methods
There are more pruning methods.
▶ Rule post-pruning:
  ▶ Transform the tree into a set of rules. Rules correspond to paths in the tree; they have the form of an implication: specific values of attributes imply a class.
  ▶ Remove attribute conditions from the premises of the implications. This gives a more refined pruning: instead of removing a whole subtree, each path of the subtree can be pruned individually.
▶ Using a cost complexity measure: Evaluate trees not only based on the classification error but also based on their size. Typically, introduce regularization into the error function: given a decision tree T,
Errα(T) = Err(T) + α|T|
The original paper by Breiman et al. (1984) defined Err(T) to be the misclassification rate on the training dataset and |T| the number of nodes of the tree T. A sketch of this approach, as implemented in scikit-learn, follows.
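A sketch of cost-complexity pruning with scikit-learn: the library computes the "effective" alphas at which subtrees get pruned away, and we keep the alpha with the best validation error. The Iris split here is just for illustration; the exact numbers will vary:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),  # validation accuracy
)
print(best.get_n_leaves(), best.score(X_val, y_val))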
Sensitivity to Small Changes and Randomness
▶ Decision trees are sensitive to small changes in data and hyperparameters. Several attributes may provide (almost) identical information gain and yet divide the training dataset very differently.
▶ Some implementations choose attributes partially at random (e.g., scikit-learn). You may get completely different trees.
A solution is to train an ensemble of many decision trees and then use majority voting for classification. This is the fundamental idea behind random forests (see later lectures).

Iris - Illustration
Decision trees trained on the Iris dataset. Iris Setosa is perfectly separated by many choices for the first split.

Axis Sensitivity
Decision trees make divisions along particular axes. That is, rotated data may result in a completely different model. That is why decision trees are often preceded by a principal component analysis (PCA) transformation, which aligns data along the axes of maximum data variance.

XOR Training Problem
Consider the following training dataset: An ideal decision tree would look like this:

Attempts at Training on XOR
Max depth = 2. The problem: Both information gain and decrease in impurity consider only the relationship between a single attribute and the class. However, here there is no relationship between any single attribute and the class; both attributes need to be considered together!

More Attempts at Training on XOR
Max depth = 3: It's better but still fails occasionally.

Advantages of Decision Trees
▶ Simple to understand and interpret; trees can be visualized.
▶ A white-box model, whose conditions are easily explained by Boolean logic.
▶ Can approximate an arbitrary (reasonable) boundary and capture complex non-linear relationships between attributes.
▶ Capable of handling multi-class problems.
▶ Little data preparation, unlike other techniques requiring normalization, dummy variables, or missing-value removal.
▶ Handles numerical and categorical data.
▶ Not sensitive to outliers, since the splitting is based on the proportions of examples within the split ranges and not on absolute values.
▶ The cost of using a well-balanced tree is logarithmic in the number of data points used to train it.

Disadvantages of Decision Trees
▶ Overfitting: Trees can be over-complex and not generalize well, needing pruning or limits on tree depth.
▶ Instability: Small data variations can result in very different trees. This is mitigated in ensemble methods.
▶ Non-smooth predictions: Decision trees make piecewise constant approximations, which are not suitable for extrapolation.
▶ Difficulty expressing certain concepts, such as XOR, parity, or multiplexer problems.
▶ Bias in trees: Decision trees can create biased trees if some classes dominate. Balancing the dataset is recommended.
▶ Learning optimal trees is NP-complete: Heuristic (e.g., greedy) algorithms are used, which do not guarantee globally optimal trees. Ensemble methods can help.

History of Decision Trees
▶ Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
▶ In the late 1970s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
▶ Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
▶ In the 1980s, various improvements were introduced to handle noise, continuous features, and missing features, and splitting criteria were improved. Various expert-system development tools resulted.
▶ Quinlan's updated decision-tree package (C4.5) was released in 1993.

Comment on Regression Trees
Decision trees can also be used to approximate functions: assign a function value to the leaves instead of classes. Here, "mse" is the mean squared error. Intuitively, for every subinterval of x1, the predicted value (the red line) is the average y over the subinterval. How are the subintervals being set? A quick sketch with scikit-learn follows; the formal training criterion comes after it.
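A sketch of a regression tree fit to noisy 1-D data with scikit-learn (the data is synthetic, for illustration; max_depth=2 gives the piecewise-constant steps described above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x1 = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)  # single input feature x1
y = np.sin(x1).ravel() + rng.normal(0, 0.1, 80)     # noisy target values

reg = DecisionTreeRegressor(max_depth=2)  # splits chosen to minimize squared error
reg.fit(x1, y)
print(reg.predict([[1.0], [4.0]]))        # one constant value per x1 subinterval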
Regression Trees
A regression tree is a decision tree whose leaves are labeled by values from R. During inference on an input x⃗, we follow the same procedure as in decision trees.
How exactly are such trees trained? We aim to minimize the mean squared error. Assume, for the moment, that we want to train a single-node tree. What will be the best labeling value for the single node? (The average desired value, which minimizes the mean squared error.)
Now, consider training arbitrary trees. Assume ordinal attributes. The algorithm also works for discrete attributes; ordinal attributes, however, allow us to make binary splits. The training procedure is the same as for decision trees, except that the splits and cut points are selected differently.

Given a dataset D = {(x⃗1, d1), . . . , (x⃗p, dp)}, we denote by D̄ the average desired value in D, that is, D̄ = (1/p) ∑_{k=1}^p d_k.
Consider a dataset D and let us find a cut point for an attribute A in D. We are looking for a value H of the attribute A such that the split
D≤H = {(x⃗, d) ∈ D | A(x⃗) ≤ H}
D>H = {(x⃗, d) ∈ D | A(x⃗) > H}
minimizes the following split error:
(1/|D≤H|) · ∑_{(x⃗,d)∈D≤H} (d − D̄≤H)² + (1/|D>H|) · ∑_{(x⃗,d)∈D>H} (d − D̄>H)²
Denote by MinE(D) the minimum of the split error over all attributes A and all cut points H. Compute
Δ = (1/|D|) · ∑_{(x⃗,d)∈D} (d − D̄)² − MinE(D)
If Δ is large enough, split on the A and H that minimize the split error. Otherwise, stop splitting and label the leaf with D̄.

Without any lower bound on the number of examples in the leaves, the algorithm will eventually overfit by splitting into (possibly) singleton leaves. The cut-point search for a single attribute is sketched below.
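A minimal sketch of the cut-point search for one attribute: try each observed value as the threshold H and pick the one minimizing the split error defined above (the four-point dataset is made up for illustration):

import numpy as np

def best_cut(x, d):
    """x: values of attribute A; d: desired outputs. Returns (H, split error)."""
    best = (None, float("inf"))
    for H in np.unique(x)[:-1]:          # candidate thresholds among observed values
        lo, hi = d[x <= H], d[x > H]
        err = ((lo - lo.mean()) ** 2).mean() + ((hi - hi.mean()) ** 2).mean()
        if err < best[1]:
            best = (H, err)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0])
d = np.array([0.1, 0.2, 0.9, 1.0])
print(best_cut(x, d))  # -> (2.0, 0.005): separates the low and high outputs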
Probabilistic Classification

Probabilistic Classification – Idea
Imagine that
▶ I look out of a window and see a bird,
▶ it is black, approx. 25 cm long, and has a rather yellow beak.
My daughter asks: What kind of bird is this? My usual answer: This is probably a kind of blackbird (kos černý in Czech).
Here probably means that out of my extensive catalog of four kinds of birds that I can recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say that the largest proportion of birds with similar features I have ever seen were blackbirds.
The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory
▶ A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
Experiment: Roll a die once. Sample space: Ω = {1, . . . , 6}.
▶ Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
  ▶ f(ω) ∈ [0, 1] for all ω ∈ Ω,
  ▶ ∑_{ω∈Ω} f(ω) = 1.
If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, . . . , 6}.
▶ An event is any subset E of Ω.
▶ The probability of a given event E ⊆ Ω is defined as
P(E) = ∑_{ω∈E} f(ω)
Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.
▶ Basic laws: P(Ω) = 1, P(∅) = 0; for disjoint sets A, B we have P(A ∪ B) = P(A) + P(B); and P(Ω ∖ A) = 1 − P(A).

Conditional Probability and Independence
▶ P(A | B) is the probability of A given B (assuming P(B) > 0), defined by
P(A | B) = P(A ∩ B)/P(B)
(We assume that B is all and only the information known.)
A fair die: What is the probability that 3 is rolled, assuming that an odd number is rolled? . . . and assuming that an even number is rolled?
▶ A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A, B are independent iff P(A | B) = P(A).

Random Variables and Random Vectors
▶ A random variable X is a function X : Ω → R.
A die: X : {1, . . . , 6} → {0, 1} such that X(n) = n mod 2.
▶ A random vector is a function X : Ω → Rᵈ. We use X = (X1, . . . , Xd), where Xi is the random variable returning the i-th component of X.
▶ Consider random variables X1, X2, and Y. The variables X1, X2 are conditionally independent given Y if for all x1, x2, and y we have
P(X1 = x1, X2 = x2 | Y = y) = P(X1 = x1 | Y = y) · P(X2 = x2 | Y = y)

Random Vectors – Example
Let Ω be a space of colored geometric shapes that are divided into two categories (1 and 0). Assume a random vector X = (X_color, X_shape, X_cat) where
▶ X_color : Ω → {red, blue},
▶ X_shape : Ω → {circle, square},
▶ X_cat : Ω → {1, 0}.
The following tables give the probability distribution of the values:
Category 1: red circle 0.2, red square 0.02, blue circle 0.02, blue square 0.01.
Category 0: red circle 0.05, red square 0.3, blue circle 0.2, blue square 0.2.

Example: P(red, circle, 1) = P(X_color = red, X_shape = circle, X_cat = 1) = 0.2
"Summing over" all possible values of some variable(s) gives the distribution of the rest:
P(red, circle) = P(X_color = red, X_shape = circle) = P(red, circle, 1) + P(red, circle, 0) = 0.2 + 0.05 = 0.25
P(red) = 0.2 + 0.02 + 0.05 + 0.3 = 0.57
Thus also, all conditional probabilities can be computed:
P(1 | red, circle) = P(red, circle, 1) / P(red, circle) = 0.2 / 0.25 = 0.8

Bayesian Classification
Let Ω be a sample space (a universum) of all objects that can be classified. We assume a probability P on Ω. We consider the problem of binary classification:
▶ Let Y be the random variable for the category, which takes values in {0, 1}.
▶ Let X be the random vector describing the n features of a given instance, i.e., X = (X1, . . . , Xn).
▶ Denote by x⃗ ∈ Rⁿ the values of X, and by x_i ∈ R the values of X_i.
Bayes classifier: Given a vector of feature values x⃗,
C_Bayes(x⃗) := 1 if P(Y = 1 | X = x⃗) ≥ P(Y = 0 | X = x⃗), and 0 otherwise.
Intuitively, C_Bayes assigns to x⃗ the most probable category it might be in.

Bayesian Classification – Example
Imagine a conveyor belt with apples and apricots. A machine is supposed to correctly distinguish apples from apricots based on their weight and diameter. That is,
▶ Y ∈ {1, 0} (here our interpretation is 1 = apple, 0 = apricot),
▶ X = (X_weight, X_diam).
We are given a fruit of a diameter of 5 cm that weighs 40 g. The Bayes classifier compares P(Y = 1 | X = (40g, 5cm)) with P(Y = 0 | X = (40g, 5cm)) and selects the more probable category given the features.
Crucial question: Is such a classifier good? There are other classifiers, e.g., one which compares the weight divided by 10 with the diameter and decides based on the answer, or maybe a classifier that sums the weight and the diameter and compares the result with a constant, etc.

Bayes Classifier
Let C be an arbitrary classifier, that is, a function that to every feature vector x⃗ ∈ Rⁿ assigns a class from {0, 1}. Define the error of the classifier C by
E_C = P(Y ≠ C)
(Here we slightly abuse notation and apply C to samples; technically, we apply the composition C ∘ X of C and X, which first determines the features using X and then classifies according to C.)
Theorem: The Bayes classifier C_Bayes minimizes E_C, that is,
E_{C_Bayes} = min_{C is a classifier} E_C
When the whole joint distribution is known, as in the colored-shapes example above, the Bayes classifier can be computed directly; a sketch follows.
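A sketch of the Bayes classifier with a fully known joint distribution, using the colored-shapes table above (reproducing P(1 | red, circle) = 0.8):

joint = {  # (color, shape, category) -> probability, as in the tables above
    ("red", "circle", 1): 0.20, ("red", "square", 1): 0.02,
    ("blue", "circle", 1): 0.02, ("blue", "square", 1): 0.01,
    ("red", "circle", 0): 0.05, ("red", "square", 0): 0.30,
    ("blue", "circle", 0): 0.20, ("blue", "square", 0): 0.20,
}

def posterior(color, shape, category):
    """P(category | color, shape), by summing the category out of the joint."""
    evidence = joint[(color, shape, 0)] + joint[(color, shape, 1)]
    return joint[(color, shape, category)] / evidence

print(posterior("red", "circle", 1))  # -> 0.8, so C_Bayes(red, circle) = 1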
Practical Use of Bayes Classifier
The crucial problem: The probability P is not known! In particular, where do we get P(Y = 1 | X = x⃗)? Note that P(Y = 0 | X = x⃗) = 1 − P(Y = 1 | X = x⃗). Given no other assumptions, this requires a table giving the probability of the category 1 for each possible feature vector x⃗. Where do you get these probabilities?
In some cases, the probabilities might come from knowledge of the solved problem (e.g., applications in physics might be supported by a theory giving the probabilities). In most cases, however, P is estimated from sampled data by
P̄(Y = 1 | X = x⃗) = (number of samples with Y = 1 and X = x⃗) / (number of samples with X = x⃗)
(We use P̄ to denote an estimate of P from data.)

Estimating P
Consider a problem with X = (X1, X2, X3), where each Xi returns either 0 or 1. What might the data look like?
Part of the data table:
Y | X1 X2 X3
1 | 1 0 1
1 | 0 1 1
0 | 1 0 1
0 | 0 0 1
1 | 0 0 0
0 | 1 1 1
· · ·
All data with X1 = 1, X2 = 0, X3 = 1:
Y | X1 X2 X3
1 | 1 0 1
1 | 1 0 1
0 | 1 0 1
0 | 1 0 1
1 | 1 0 1
1 | 1 0 1
Estimate: P̄(1 | 1, 0, 1) = 2/3
The probability table and the necessary data are typically too large! Concretely, if all X1, . . . , Xn are binary, there are 2ⁿ probabilities P(Y = 1 | X = x⃗), one for each possible x⃗ ∈ {0, 1}ⁿ.

Let's Look at It the Other Way Round
Theorem (Bayes, 1764)
P(A | B) = P(B | A) · P(A) / P(B)
Proof:
P(A | B) = P(A ∩ B) / P(B) = [P(A ∩ B) / P(A)] · [P(A) / P(B)] = P(B | A) · P(A) / P(B)

Bayesian Classification
Determine the category for x⃗ by computing
P(Y = y | X = x⃗) = P(Y = y) · P(X = x⃗ | Y = y) / P(X = x⃗)
for both y ∈ {0, 1} and deciding whether or not the following holds:
P(Y = 1 | X = x⃗) ≥ P(Y = 0 | X = x⃗)
So, to make the classifier, we need to compute the following:
▶ The prior P(Y = 1) (then P(Y = 0) = 1 − P(Y = 1))
▶ The conditionals P(X = x⃗ | Y = y) for y ∈ {0, 1} and for every x⃗

Estimating the Prior and Conditionals
▶ P(Y = 1) can be easily estimated from data by
P̄(Y = 1) = (number of samples with Y = 1) / (number of all samples)
▶ If the dimension of the features is small, P(X = x⃗ | Y = y) can be estimated from data similarly to P(Y = 1 | X = x⃗) by
P̄(X = x⃗ | Y = y) = (number of samples with Y = y and X = x⃗) / (number of samples with Y = y)
Unfortunately, for higher-dimensional data, too many samples are needed to estimate all P(X = x⃗ | Y = y) (there are too many x⃗'s). So where is the advantage of using the Bayes theorem?? We introduce independence assumptions about the features!

Naive Bayes
▶ We assume that the features are (conditionally) independent given the category. That is, for all x⃗ = (x1, . . . , xn) and y ∈ {0, 1} we assume
P(X = x⃗ | Y = y) = P(X1 = x1, . . . , Xn = xn | Y = y) = ∏_{i=1}^n P(Xi = xi | Y = y)
▶ Therefore, we only need to specify P(Xi = xi | Y = y) for each possible pair of a feature value xi and y ∈ {0, 1}.
Note that if all Xi are binary (values in {0, 1}), this requires specifying only 2n parameters: P(Xi = 1 | Y = 1) and P(Xi = 1 | Y = 0) for each Xi, as P(Xi = 0 | Y = y) = 1 − P(Xi = 1 | Y = y) for y ∈ {0, 1}. Compare this to specifying 2ⁿ parameters without any independence assumption.
The data is
Y  X1  X2  X3
1  1   0   1
1  0   1   1
0  1   0   1
0  0   0   1
1  0   0   0
0  1   1   1

P̄(X1 = 1 | Y = 1) = 1/3    P̄(X1 = 1 | Y = 0) = 2/3
P̄(X2 = 1 | Y = 1) = 1/3    P̄(X2 = 1 | Y = 0) = 1/3
P̄(X3 = 1 | Y = 1) = 2/3    P̄(X3 = 1 | Y = 0) = 1
125

Naive Bayes – Example
Consider classification of geometric shapes:
X1 ∈ {small, medium, large}, X2 ∈ {red, blue, green}, X3 ∈ {square, triangle, circle}
Assume that we have already estimated the following probabilities:

                   Y = 1   Y = 0
P̄(Y)               0.5     0.5
P̄(small | Y)       0.4     0.4
P̄(medium | Y)      0.1     0.2
P̄(large | Y)       0.5     0.4
P̄(red | Y)         0.9     0.3
P̄(blue | Y)        0.05    0.3
P̄(green | Y)       0.05    0.4
P̄(square | Y)      0.05    0.4
P̄(triangle | Y)    0.05    0.3
P̄(circle | Y)      0.9     0.3

Does (medium, red, circle) belong to the category 1?
126

Denote ⃗x = (medium, red, circle).
P(Y = 1 | X = ⃗x) = P(1) · P(medium | 1) · P(red | 1) · P(circle | 1) / P(X = ⃗x) ≐ 0.5 · 0.1 · 0.9 · 0.9 / P(X = ⃗x) = 0.0405 / P(X = ⃗x)
P(Y = 0 | X = ⃗x) = P(0) · P(medium | 0) · P(red | 0) · P(circle | 0) / P(X = ⃗x) ≐ 0.5 · 0.2 · 0.3 · 0.3 / P(X = ⃗x) = 0.009 / P(X = ⃗x)
(Note that we used the estimates P̄ of P to finish the computation above.)
Apparently, P(Y = 1 | X = ⃗x) ≐ 0.0405/P(X = ⃗x) > 0.009/P(X = ⃗x) ≐ P(0 | X = ⃗x).
So we classify ⃗x to the category 1.
127

Estimating Probabilities in Practice
We already know that P(Xi = xi | Y = y) can be estimated by
P̄(Xi = xi | Y = y) = ℓ_{y,xi} / ℓ_y
where
▶ ℓ_{y,xi} = number of samples with Y = y and Xi = xi,
▶ ℓ_y = number of samples with Y = y.
Problem: If, by chance, a rare value xi of a feature Xi never occurs in the training data, we get
P̄(Xi = xi | Y = y) = 0 for both y ∈ {0, 1}
But then P̄(X = ⃗x) = 0 for every ⃗x containing the value xi of Xi, and thus P̄(Y = y | X = ⃗x) is not well defined. Moreover, P̄(Y = y) · P̄(X = ⃗x | Y = y) = 0 (for y ∈ {0, 1}), so even this cannot be used for classification.
128

Probability Estimation Example
Training data:
Size   Color  Shape     Class
small  red    circle    1
large  red    circle    1
small  red    triangle  0
large  blue   circle    0

Estimated probabilities:
                   Y = 1   Y = 0
P̄(Y)               0.5     0.5
P̄(small | Y)       0.5     0.5
P̄(medium | Y)      0       0
P̄(large | Y)       0.5     0.5
P̄(red | Y)         1       0.5
P̄(blue | Y)        0       0.5
P̄(green | Y)       0       0
P̄(square | Y)      0       0
P̄(triangle | Y)    0       0.5
P̄(circle | Y)      1       0.5

Note that P̄(medium | 1) = P̄(medium | 0) = 0 and thus also P̄(medium, red, circle) = 0. So what is P̄(1 | medium, red, circle)?
129

Smoothing
▶ To account for estimation from small samples, probability estimates are adjusted or smoothed.
▶ Laplace smoothing adds one to every count of feature values:
P̃(Xi = xi | Y = y) = (ℓ_{y,xi} + 1) / (ℓ_y + v_i)
where
▶ ℓ_y = number of training samples with Y = y,
▶ ℓ_{y,xi} = number of training samples with Y = y and Xi = xi,
▶ v_i is the number of all distinct values of the variable Xi.
To understand this, note that ℓ_y = ∑_{xi a value of Xi} ℓ_{y,xi} and thus
P̄(Xi = xi | Y = y) = ℓ_{y,xi} / ∑_{xi} ℓ_{y,xi}
P̃(Xi = xi | Y = y) = (ℓ_{y,xi} + 1) / ∑_{xi} (ℓ_{y,xi} + 1)
130

Laplace Smoothing Example
▶ Assume the training set contains 10 samples of category 1:
▶ 4 small, ▶ 0 medium, ▶ 6 large
▶ Estimate parameters as follows:
▶ P̃(small | 1) = (4 + 1)/(10 + 3) ≈ 0.385
▶ P̃(medium | 1) = (0 + 1)/(10 + 3) ≈ 0.077
▶ P̃(large | 1) = (6 + 1)/(10 + 3) ≈ 0.538
131
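The counting and smoothing above are mechanical and easy to check in code. The following is a minimal sketch (our own helper functions, not the lecture's reference implementation) that reproduces the estimates from the six-row table and classifies a feature vector with Laplace smoothing:

```python
# Six training samples (y, (x1, x2, x3)) from the table above.
data = [(1, (1, 0, 1)), (1, (0, 1, 1)), (0, (1, 0, 1)),
        (0, (0, 0, 1)), (1, (0, 0, 0)), (0, (1, 1, 1))]

def prior(y):
    """Estimate P(Y = y) as a fraction of samples."""
    return sum(1 for cls, _ in data if cls == y) / len(data)

def smoothed(i, xi, y, v=2):
    """Laplace-smoothed estimate of P(Xi = xi | Y = y); v = #values of Xi."""
    l_y = sum(1 for cls, _ in data if cls == y)
    l_yx = sum(1 for cls, x in data if cls == y and x[i] == xi)
    return (l_yx + 1) / (l_y + v)

def classify(x):
    """Naive Bayes: pick y maximizing P(y) * prod_i P(xi | y)."""
    score = {}
    for y in (0, 1):
        p = prior(y)
        for i, xi in enumerate(x):
            p *= smoothed(i, xi, y)
        score[y] = p
    return max(score, key=score.get), score

print(classify((1, 0, 1)))
```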
Continuous Features
Ω may be (potentially) continuous, and Xi may assign a continuum of values in R.
▶ The probabilities are then computed using a probability density p : R → R₊. A random variable X : Ω → R has a density p : R → R₊ if for every interval [a, b] we have
P(a ≤ X ≤ b) = ∫_a^b p(x) dx
Usually, P(Xi | Y = y) is used to denote the density of Xi conditioned on Y = y.
▶ The densities P(Xi | Y = y) are usually estimated using Gaussian densities as follows:
▶ Estimate the mean µ_iy and the standard deviation σ_iy based on training data.
▶ Then put
P̄(Xi | Y = y) = (1 / (σ_iy √(2π))) · exp(−(Xi − µ_iy)² / (2σ²_iy))
132

Comments on Naive Bayes
▶ Tends to work well despite the rather strong assumption of conditional independence of features.
▶ Experiments show that it is quite competitive with other classification methods. Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.
▶ Directly constructs a model from parameter estimates that are calculated from the training data.
▶ Typically handles outliers and noise well.
▶ Missing values are easy to deal with (simply average over the missing values in feature vectors).
133

Bayesian Networks (Basic Information)
In the Naive Bayes, we have assumed that all features X1, . . . , Xn are conditionally independent. This is usually not realistic. E.g., the variables "rain" and "grass wet" are (usually) strongly dependent.
What if we retain some dependencies? (But now in a well-defined sense.)
Bayesian networks are a graphical model that uses a directed acyclic graph to specify dependencies among variables.
134

Bayesian Networks – Example
Now, e.g., P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B)
Now, we may, e.g., infer the probability P(C = T | A = T) that we sit in the wrong chair, assuming that our back aches.
We have to store only 10 numbers, as opposed to 2⁵ − 1 possible probabilities for all vectors of values of C, S, W, B, A.
135

Bayesian Networks – Learning & Naive Bayes
Many algorithms have been developed for learning:
▶ the structure of the graph of the network,
▶ the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.
Can you express the naive Bayes for Y, X1, . . . , Xn using a Bayesian network?
136

Classifier Evaluation 137

Classifier
Assume binary classification into two classes {0, 1}. Consider a classification dataset:
{(⃗xk, ck) | k = 1, . . . , p}
Here ⃗xk is a vector of attributes/features and ck ∈ {0, 1} for all k.
Consider a sequence of predictions generated by a classifier:
h1, . . . , hp ∈ {0, 1}
Here each hk has been predicted for the k-th example (⃗xk, ck).
How good are the predictions h1, . . . , hp w.r.t. c1, . . . , cp? There are many possible metrics ...
I will call the class 1 positive and the class 0 negative. Note that the class 0 is not negative in the numerical sense but in the sense of the absence of something (e.g., a predicted illness).
138

Confusion Matrix for Binary Classifier

             Predicted
              1     0
Actual  1    TP    FN
        0    FP    TN

▶ TP = number of correctly classified examples with actual class 1
TP = |{k | hk = 1 ∧ ck = 1}|
▶ TN = number of correctly classified examples with actual class 0
TN = |{k | hk = 0 ∧ ck = 0}|
▶ FP = number of incorrectly classified examples with actual class 0
FP = |{k | hk = 1 ∧ ck = 0}|
▶ FN = number of incorrectly classified examples with actual class 1
FN = |{k | hk = 0 ∧ ck = 1}|
139
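The four counts can be computed directly from the definitions. A minimal sketch (our own helper, with a tiny made-up example):

```python
def confusion_counts(c, h):
    """c = actual classes, h = predictions, both sequences over {0, 1}."""
    tp = sum(1 for ck, hk in zip(c, h) if hk == 1 and ck == 1)
    tn = sum(1 for ck, hk in zip(c, h) if hk == 0 and ck == 0)
    fp = sum(1 for ck, hk in zip(c, h) if hk == 1 and ck == 0)
    fn = sum(1 for ck, hk in zip(c, h) if hk == 0 and ck == 1)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0], [1, 0, 1, 0]))   # (1, 1, 1, 1)
```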
Example
Given a sample of 12 individuals, eight have cancer, and four are cancer-free. Assume that we have trained a classifier with the following results:

Index      1   2   3   4   5   6   7   8   9   10  11  12
Actual     1   1   1   1   1   1   1   1   0   0   0   0
Predicted  0   0   1   1   1   1   1   1   1   0   0   0
Result     FN  FN  TP  TP  TP  TP  TP  TP  FP  TN  TN  TN

                     Predicted condition
Actual condition     Cancer   Non-cancer
Cancer               TP = 6   FN = 2
Non-cancer           FP = 1   TN = 3
(Total: 8 + 4 = 12)
140

Terminology
▶ TP aka hit
▶ TN aka correct rejection
▶ FP aka type I error, false alarm, overestimation
▶ FN aka type II error, miss, underestimation
Usually, TP, TN, FP, and FN are used to denote both the individual examples of a particular kind and the number of these examples. In what follows, we also use
▶ P = TP + FN, the number of all cases with the actual class 1,
▶ N = TN + FP, the number of all cases with the actual class 0,
▶ PP = TP + FP, the number of all cases with the predicted class 1,
▶ PN = TN + FN, the number of all cases with the predicted class 0.
Note that P + N = PP + PN is the number of all cases.
There is a large number of derived metrics. We consider some of the most used in practice.
141

Accuracy
Accuracy = (TP + TN) / (P + N)
Intuitively, Accuracy is the proportion of correctly classified cases w.r.t. all cases.
Example: Consider our cancer predictor with the confusion matrix above. The Accuracy is
ACC = (TP + TN) / (P + N) = (6 + 3) / 12 = 3/4
142

Accuracy - Imbalanced Classes
Accuracy can be misleading when the classes are imbalanced:
▶ Consider 100 cases, 90 in the class 0 and 10 in the class 1,
▶ consider a classifier that returns 1 for a single sample of class 1 and 0 for all other samples.

          Predicted
Actual    Pos   Neg
Pos        1     9
Neg        0    90
(Total: 90 + 10 = 100)

The Accuracy is 91/100 > 0.9. Pretty good, right? However, the classifier is pretty bad on the positive cases. In the case of cancer prediction, such a classifier would be a disaster.
143

Precision & Recall
To mitigate the defect of the Accuracy, we may compute the following metrics:
Precision = TP / PP (= how often a predicted positive is actually positive)
Precision is also known as positive predictive value (PPV).
Recall = TP / P (= how often an actual positive is predicted positive)
Recall is also known as true positive rate, sensitivity, hit rate, and power.
144

Precision & Recall - Example
Example: In our cancer example (TP = 6, FN = 2, FP = 1, TN = 3):
▶ Precision measures how often a patient predicted to be ill is truly ill (in our case, 6/7),
▶ Recall measures how often an ill patient is found to be ill (in our case, 6/8).
145

Precision & Recall - Imbalanced Classes
▶ Consider 100 cases, 90 in the class 0 and 10 in the class 1,
▶ consider the classifier above that returns 1 for a single sample of class 1 and 0 for all other samples.
Precision = 1, Recall = 1/10
You can see that the predictor is very precise (on the class 1) but useless due to the weak Recall.
146

Precision & Recall - Relative Importance
Let us get back to our cancer example (TP = 6, FN = 2, FP = 1, TN = 3). Consider Precision and Recall. By now, you should remember what they measure.
Which of the two is more important in medicine? Which of the two is more important for plagiarism detectors?
Can we get a single number summarizing both Precision and Recall? For example, to compare two classifiers.
147
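The derived metrics follow directly from the four counts. A minimal sketch (our own functions; scikit-learn offers equivalents such as accuracy_score, precision_score, and recall_score, if it is available):

```python
def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)
def precision(tp, fp): return tp / (tp + fp)            # TP / PP
def recall(tp, fn): return tp / (tp + fn)               # TP / P
def f1(tp, fp, fn): return 2 * tp / (2 * tp + fp + fn)  # see the next slide

tp, tn, fp, fn = 6, 3, 1, 2        # the cancer example above
print(accuracy(tp, tn, fp, fn))    # 0.75
print(precision(tp, fp))           # 6/7 ≈ 0.857
print(recall(tp, fn))              # 6/8 = 0.75
print(f1(tp, fp, fn))              # 0.8
```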
F1 Score
The F1 score is the harmonic mean of Recall and Precision:
F1 = 2 / (Recall⁻¹ + Precision⁻¹) = 2TP / (2TP + FP + FN)
Compare the arithmetic and harmonic means: the harmonic mean prefers two values closer to each other. For example, the harmonic mean of 2/3 and 1/3 is approximately 0.444 (while the arithmetic mean is 0.5).
148

F1 Score - Examples
Consider the cancer example (TP = 6, FN = 2, FP = 1, TN = 3). Here
F1 = 2TP / (2TP + FP + FN) = (2 · 6) / ((2 · 6) + 1 + 2) = 0.8.
Our imbalanced example (TP = 1, FN = 9, FP = 0, TN = 90). Here
F1 = 2TP / (2TP + FP + FN) = (2 · 1) / ((2 · 1) + 0 + 9) = 0.18.
Note that the average of Precision and Recall is 0.55, which would give us a much less severe warning that the classifier is bad.
149

Imbalanced Classes Once More
Note that the standard definitions of Precision and Recall for binary classifiers reveal only part of the truth. In particular, true negatives are not used in the definition of F1. Consider

          Predicted
Actual    Pos   Neg
Pos       90     0
Neg        9     1
(Total: 90 + 10 = 100)

Precision = 90/99, Recall = 90/90 = 1
F1 = 2TP / (2TP + FP + FN) = (2 · 90) / (2 · 90 + 9 + 0) ≈ 0.95
All great, except that the classifier sucks on the negative cases. If you are concerned with the negative cases, swap the classes and compute another set of metrics.
150

F1 Score
▶ F1 is often used as a summary score for binary classifiers instead of Accuracy. It works better with imbalanced classes.
▶ It is criticised for giving Precision and Recall the same importance.
▶ It is not symmetric and ignores true negatives, i.e., it is misleading for some cases of imbalanced classes.
▶ The Fowlkes-Mallows index is the geometric mean of Precision and Recall (used in clustering). The geometric mean lies between the arithmetic and harmonic means. For example, the geometric mean of 2/3 and 1/3 is approximately 0.471.
151

More Derived Metrics
You can see that the negative predictive value becomes the Precision when we swap the classes (and vice versa).
152

More Derived Metrics
Note that specificity becomes Recall when we swap the classes (and vice versa). For example, medical doctors communicate in terms of sensitivity and specificity.
153

For the cancer example (TP = 6, FN = 2, FP = 1, TN = 3):
TPR = Sensitivity = Recall = TP/P = 6/8 (how often is a positive predicted positive?)
TNR = Specificity = TN/N = 3/4 (how often is a negative predicted negative?)
FPR = probability of false alarm = FP/N = 1/4 (how often is a negative predicted positive?)
FNR = miss rate = FN/P = 2/8 (how often is a positive predicted negative?)
154

Evaluating Multi-class Classifiers 155

Classification Into Multiple Classes
Assume classification into classes from a finite set C. Consider a classification dataset:
{(⃗xk, ck) | k = 1, . . . , p}
Here ⃗xk is a vector of attributes/features and ck ∈ C for all k.
Consider a sequence of predictions generated by a classifier:
h1, . . . , hp ∈ C
Here each hk has been predicted for the k-th example (⃗xk, ck).
How good are the predictions h1, . . . , hp w.r.t. c1, . . . , cp? There are many possible metrics ...
Consider an arbitrary (finite) number of classes in C.
156

Confusion Matrix
Assume that C = {1, . . . , m}. Given two classes i, j ∈ C, denote by Mij the number of samples of class i classified into the class j. Formally,
Mij = |{k | ck = i ∧ hk = j}|
The confusion matrix has rows indexed by the actual class i and columns indexed by the predicted class j; its entry in row i and column j is Mij.
157
Example
Fifteen (actual, predicted) pairs:
(big, big), (big, big), (small, big), (medium, medium), (big, small), (big, big), (small, small), (small, small), (medium, medium), (medium, small), (small, small), (big, big), (medium, small), (small, medium), (big, big)

The corresponding confusion matrix:

          Predicted
Actual    big  medium  small
big        5     0      1
medium     0     2      2
small      1     1      3

Note that the diagonal counts the correctly classified samples. The off-diagonal elements correspond to misclassified samples.
158

Metrics
We can easily generalize Accuracy, Precision, Recall, and the F1 score from binary classification to multiple classes.
Notation:
▶ Mi• = ∑_{j=1}^m Mij
▶ M•j = ∑_{i=1}^m Mij
▶ M•• = ∑_{i=1}^m ∑_{j=1}^m Mij
Now, the metrics:
Accuracy = ∑_{k=1}^m Mkk / M••
For a given class i ∈ C:
Precision[i] = Mii / M•i
Recall[i] = Mii / Mi•
F1[i] = 2 · Precision[i] · Recall[i] / (Precision[i] + Recall[i])
Note that Precision, Recall, and F1 are defined only for a given class!
159

Example
Compute the metrics for the confusion matrix above.
160

Accuracy = (5 + 2 + 3)/15 ≈ 0.67
Precision[big] = 5/6, Precision[medium] = 2/3, Precision[small] = 3/6
Recall[big] = 5/6, Recall[medium] = 2/4, Recall[small] = 3/5
F1[big] = 2 · (5/6) · (5/6) / ((5/6) + (5/6)) = 5/6 ≈ 0.83
F1[medium] ≈ 0.57, F1[small] ≈ 0.54
How do you get a single number out of these? Average Precision, Recall, and F1 are usually computed, but one needs to be careful about the variance.
161
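The per-class metrics are simple row and column operations on the confusion matrix. A minimal sketch (our own computation, assuming NumPy is available) verifying the numbers above:

```python
import numpy as np

# Confusion matrix from the example (rows: actual, columns: predicted).
M = np.array([[5, 0, 1],
              [0, 2, 2],
              [1, 1, 3]], dtype=float)

accuracy = np.trace(M) / M.sum()
precision = np.diag(M) / M.sum(axis=0)   # M_ii / M_•i (column sums)
recall = np.diag(M) / M.sum(axis=1)      # M_ii / M_i• (row sums)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)            # 0.666...
print(precision, recall)   # [0.833 0.667 0.5] [0.833 0.5 0.6]
print(f1)                  # [0.833 0.571 0.545]
```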
(Figure: a large real-world confusion matrix. 162)
Machine learning/data mining is needed to understand the matrix.
163

Probabilistic Classifier Evaluation 164

Binary Probabilistic Classifier
Assume binary classification into two classes {0, 1}. Consider a classification dataset:
{(⃗xk, ck) | k = 1, . . . , p}
Here ⃗xk is a vector of attributes/features and ck ∈ {0, 1} for all k.
Consider a sequence of predictions generated by a classifier. Now the classifier returns the probability of class 1 for a given input:
h1, . . . , hp ∈ [0, 1]
Here each hk has been predicted for the k-th example (⃗xk, ck).
How to interpret the predictions h1, . . . , hp? How good are the predictions h1, . . . , hp w.r.t. c1, . . . , cp?
165

Probabilistic Classifier
Let us fix predictions h1, . . . , hp. Given a threshold T ∈ [0, 1], we define
hT_k = 1 if hk ≥ T, and hT_k = 0 if hk < T
For every T we can compute all the metrics (Precision, Recall, etc.). Given a metric MET and a threshold T, we denote by MET[T] the metric MET evaluated on hT_1, . . . , hT_p. We obtain
TP[T] = |{k | hT_k = 1 ∧ ck = 1}|
and similarly TN[T], FP[T], FN[T], Accuracy[T], Precision[T], Recall[T], F1[T], . . .
However, all metrics are now functions of the threshold T.
166

Thresholded Classifier Metrics

Index      1    2    3    4    5    6    7    8    9    10   11   12
Actual     1    1    1    1    1    0    0    1    1    0    0    0
Predicted  .98  .95  .9   .86  .66  .48  .43  .42  .36  .15  .1   .05
T = 0.5    TP   TP   TP   TP   TP   TN   TN   FN   FN   TN   TN   TN
T = 0.42   TP   TP   TP   TP   TP   FP   FP   TP   FN   TN   TN   TN
T = 0.1    TP   TP   TP   TP   TP   FP   FP   TP   TP   FP   FP   TN

For example, consider T = 0.42; then
TP[T] = 6, FP[T] = 2, FN[T] = 1, TN[T] = 3
Accuracy[T] = (6 + 3)/12
Precision[T] = 6/(6 + 2), Recall[T] = 6/(6 + 1)
F1[T] = 2 · (6/8) · (6/7) / ((6/8) + (6/7)) = 0.8
167

Receiver Operating Characteristic (ROC)
Consider two metrics for a given T:
TPR[T] = TP[T]/P (True Positive Rate)
FPR[T] = FP[T]/N (False Positive Rate)
(Note that P and N do not depend on T.) The ROC curve is then a function ROC : [0, 1] → [0, 1]² defined by
ROC(T) = (TPR[T], FPR[T])
Observe that ROC(0) = (1, 1), because the classifier with T = 0 simply classifies everything as positive, i.e., into the class 1. Both TPR[T] and FPR[T] are non-increasing in T.
168

For the table above:
▶ 0.00 ≤ T ≤ 0.05: TPR = 1 and FPR = 1
▶ 0.05 < T ≤ 0.10: TPR = 1 and FPR = 4/5
▶ 0.10 < T ≤ 0.15: TPR = 1 and FPR = 3/5
▶ 0.15 < T ≤ 0.36: TPR = 1 and FPR = 2/5
▶ 0.36 < T ≤ 0.42: TPR = 6/7 and FPR = 2/5
▶ 0.42 < T ≤ 0.43: TPR = 5/7 and FPR = 2/5
▶ 0.43 < T ≤ 0.48: TPR = 5/7 and FPR = 1/5
▶ 0.48 < T ≤ 0.66: TPR = 5/7 and FPR = 0
▶ 0.66 < T ≤ 0.86: TPR = 4/7 and FPR = 0
▶ 0.86 < T ≤ 0.90: TPR = 3/7 and FPR = 0
▶ 0.90 < T ≤ 0.95: TPR = 2/7 and FPR = 0
▶ 0.95 < T ≤ 0.98: TPR = 1/7 and FPR = 0
▶ 0.98 < T ≤ 1.00: TPR = 0 and FPR = 0
169

ROC
(Figure: the ROC curve of the classifier above.)
170

Iris Dataset – A Classifier
Example from the scikit-learn manual: an SVM classifier trained on Iris.
171

Using ROC and Threshold
Search for the best threshold at the elbow of the ROC curve.
172

ROC – Explanation
The larger the area under the ROC curve (ROC-AUC), the better. ROC-AUC ranges from 0 to 1. ROC-AUC ≈ 0.5 indicates random guessing.
173

ROC-AUC
For the table above, ROC-AUC ≈ 0.8857.
174

Iris – ROC-AUC
ROC-AUC = 0.79
175

ROC-AUC – Probabilistic Interpretation
How is the ROC-AUC connected with the samples? Consider our cancer detection example (the table above).
AUC has a probabilistic explanation. Consider the following experiment:
▶ Choose randomly a patient i from the positive patients; each positive patient has the same probability of being chosen.
▶ Choose randomly a patient j from the negative patients; each negative patient has the same probability of being chosen.
▶ Check whether hi > hj.
The ROC-AUC is the probability of succeeding in the hi > hj test.
176

Summary
We have discussed various metrics that can be used to evaluate the quality of a classifier. The metrics summarize the results of evaluation on a given dataset. We have discussed metrics for evaluating
▶ binary classifiers: Accuracy, Precision, Recall, F1, and a few more,
▶ multi-class classifiers: Accuracy, Precision, Recall, F1,
▶ probabilistic classifiers: threshold-parametrized metrics, ROC-AUC.
There are still several questions unanswered:
▶ When to use which metric.
▶ How to estimate the influence of sampling the dataset.
177
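As a quick cross-check of the ROC-AUC value computed above, the following sketch uses scikit-learn (assuming it is available; the data is the 12-sample table from the thresholding example):

```python
from sklearn.metrics import roc_auc_score, roc_curve

actual = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
scores = [.98, .95, .9, .86, .66, .48, .43, .42, .36, .15, .1, .05]

print(roc_auc_score(actual, scores))   # 0.8857... (= 31/35 positive-negative pairs won)
fpr, tpr, thresholds = roc_curve(actual, scores)
print(list(zip(fpr, tpr)))             # the points of the ROC curve
```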
Use of Evaluation Metrics
In our case, the following scenarios are typical:
▶ Final test: Evaluate the model on the test set (separated at the beginning of training) and then compute the metrics. This may inform the user about the quality of the model.
▶ Validation: Evaluate models on a separate validation set and use the metrics to compare models. There are (at least) two scenarios in which this happens:
▶ hyperparameter fine-tuning,
▶ comparison of different models (e.g., KNN and decision trees).
Keep in mind that the metrics are artificial and summarize the results of the model only roughly. You should always strive to test the proper functionality of your model in conditions as natural as possible. For example, a model for medical diagnosis should be evaluated by medical doctors, who may observe many features of its behavior that are difficult to express quantitatively.
178

How to Estimate Significance
Machine learning models are typically trained on (pseudo)random samples of data objects, for example, the set of patients treated by a concrete hospital. However, the purpose of testing/evaluation is to get information about the whole population (i.e., all possible patients). How do we estimate how much the specific properties of the given sample influence our model?
This is a challenging question; methods of inferential statistics are needed to get the answer. We will consider these issues in a later lecture. Concretely,
▶ the bias-variance tradeoff,
▶ statistical tests for testing
▶ the significance of the metrics values,
▶ paired t-tests for comparing models.
179

How to Compare Classifiers
Let us consider two classifiers. How do you compare them? Accuracies and F1 scores can be compared easily (they are just numbers). How to compare (Precision1, Recall1) of the first classifier with (Precision2, Recall2) of the second classifier?
Thresholding:
▶ Introduce a threshold 0 ≤ t ≤ 1.
▶ Demand one of the two metrics (typically the Recall) to be at least t. That is,
Recall1 ≥ t and Recall2 ≥ t
▶ Compare the values of the other metric numerically. In our case, decide whether Precision1 ≥ Precision2.
(We still need to be concerned about statistical significance.)
180

Example

Classifier 1:                        Classifier 2:
            Predicted                            Predicted
Actual      Canc.  Non-canc.        Actual      Canc.  Non-canc.
Cancer        6       2             Cancer        5       3
Non-cancer    1       3             Non-cancer    0       4
(Total: 8 + 4 = 12)                 (Total: 8 + 4 = 12)

Precision1 = 6/7, Recall1 = 6/8
Precision2 = 5/5 = 1, Recall2 = 5/8
Consider a threshold t on the Recall. If the threshold is t = 5/8, then both classifiers satisfy it and the second classifier is better (it has higher Precision). If the threshold is t = 6/8, then the second classifier is unacceptable (its Recall is below the threshold).
181

Numerical features
▶ Throughout this lecture, we assume that all features are numerical, i.e., feature vectors belong to Rn.
▶ Most non-numerical features can be conveniently transformed into numerical ones. For example:
▶ Colors {blue, red, yellow} can be represented by {(1, 0, 0), (0, 1, 0), (0, 0, 1)} (one-hot encoding); see the sketch below.
▶ Words can be embedded into vector spaces by various means (word2vec etc.).
▶ A black-and-white picture of x × y pixels can be encoded as a vector of xy numbers that capture the shades of gray of the pixels. (Even though this is not the best way of representing images.)
182
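A minimal one-hot encoding sketch (our own hypothetical helper, not a library API; libraries such as scikit-learn and pandas provide their own encoders):

```python
import numpy as np

values = ["blue", "red", "yellow"]
index = {v: i for i, v in enumerate(values)}

def one_hot(v):
    """Map a categorical value to a unit vector."""
    vec = np.zeros(len(values))
    vec[index[v]] = 1.0
    return vec

print(one_hot("red"))   # [0. 1. 0.]
```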
Basic Problems
We consider two basic problems:
▶ (Binary) classification. Our goal: classify inputs into two categories.
▶ Regression. Our goal: find a (hypothesized) functional dependency in data.
183

Linear Models
Binary Classification 184

Binary classification in Rn
Our goal:
▶ Given a set D of training examples of the form (⃗x, c) where ⃗x ∈ Rn and c ∈ {0, 1},
▶ construct a model h : Rn → {0, 1} that is consistent with D, i.e.,
h(⃗x) = c for all training examples (⃗x, c) ∈ D
Comments:
▶ In practice, we often do not strictly demand h(⃗x) = c for all training examples (⃗x, c) ∈ D (often it is impossible).
▶ We are more interested in good generalization, that is, how well h classifies new instances that do not belong to D. (Recall that we usually evaluate the accuracy of the resulting hypothesized function h on a test set.)
185

Models
We consider two kinds of hypothesis spaces:
▶ linear (affine) classifiers (this lecture),
▶ non-linear classifiers (kernel SVM, neural networks) (later lectures).
186

Linear Classifier – Example (figure) 187

Length and Scalar Product of Vectors
▶ We consider vectors ⃗x = (x1, . . . , xn) ∈ Rn.
▶ Euclidean metric on vectors:
‖⃗x‖ = √(∑_{i=1}^n xi²)
The distance between two vectors (points) ⃗x, ⃗y is ‖⃗x − ⃗y‖.
▶ The scalar product ⃗x · ⃗y of vectors ⃗x = (x1, . . . , xn) and ⃗y = (y1, . . . , yn) is defined by
⃗x · ⃗y = ∑_{i=1}^n xi yi
▶ Recall that ⃗x · ⃗y = ‖⃗x‖ ‖⃗y‖ cos θ, where θ is the angle between ⃗x and ⃗y. That is, ⃗x · ⃗y is the length of the projection of ⃗y onto ⃗x multiplied by ‖⃗x‖.
▶ Note that ⃗x · ⃗x = ‖⃗x‖².
188

Linear Classifier
A linear classifier h[⃗w] is determined by a vector of weights ⃗w = (w0, w1, . . . , wn) ∈ R^{n+1} as follows: Given ⃗x = (x1, . . . , xn) ∈ Rn,
h[⃗w](⃗x) := 1 if w0 + ∑_{i=1}^n wi · xi ≥ 0, and 0 if w0 + ∑_{i=1}^n wi · xi < 0.
More succinctly: h[⃗w](⃗x) = sgn(w0 + ∑_{i=1}^n wi · xi), where sgn(y) = 1 for y ≥ 0 and sgn(y) = 0 for y < 0.
We define the separating hyperplane determined by ⃗w as the set of all ⃗x ∈ Rn satisfying w0 + ∑_{i=1}^n wi · xi = 0.
189

(Figures illustrating the separating hyperplane. 190)

Linear Classifier – Geometry (figure) 191

Linear Classifier – Notation
Given ⃗x = (x1, . . . , xn) ∈ Rn, we define the augmented feature vector
x̃ = (x0, x1, . . . , xn) where x0 = 1
This makes the notation for the linear classifier more succinct:
h[⃗w](⃗x) = sgn(⃗w · x̃)
192

Linear Classifier – Learning
(Figure: labeled points in the plane.)
▶ classification in the plane using a linear classifier,
▶ if a point is incorrectly classified, the learning algorithm turns the line (hyperplane) to improve the classification.
193

Perceptron Learning
▶ Given a training set
D = {(⃗x1, c1), (⃗x2, c2), . . . , (⃗xp, cp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rn and ck ∈ {0, 1}. Recall that x̃k = (xk0, xk1, . . . , xkn) where xk0 = 1.
▶ A weight vector ⃗w ∈ R^{n+1} is consistent with D if
h[⃗w](⃗xk) = sgn(⃗w · x̃k) = ck for all k = 1, . . . , p
D is linearly separable if there is a vector ⃗w ∈ R^{n+1} which is consistent with D.
▶ Our goal is to find a consistent ⃗w, assuming that D is linearly separable.
194

Perceptron – Learning Algorithm
Online learning algorithm:
Idea: Cyclically go through the training examples in D and adapt the weights. Whenever an example is incorrectly classified, turn the hyperplane so that the example becomes closer to its correct half-space.
Compute a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .:
▶ ⃗w(0) is randomly initialized close to ⃗0 = (0, . . . , 0).
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε · (h[⃗w(t)](⃗xk) − ck) · x̃k = ⃗w(t) − ε · (sgn(⃗w(t) · x̃k) − ck) · x̃k
Here k = (t mod p) + 1, i.e., the examples are considered cyclically, and 0 < ε ≤ 1 is a learning rate.
Theorem (Rosenblatt)
If D is linearly separable, then there is t* such that ⃗w(t*) is consistent with D.
195

Example
Training set:
D = {((2, −1), 1), ((2, 1), 1), ((1, 3), 0)}
That is,
⃗x1 = (2, −1), ⃗x2 = (2, 1), ⃗x3 = (1, 3)
x̃1 = (1, 2, −1), x̃2 = (1, 2, 1), x̃3 = (1, 1, 3)
c1 = 1, c2 = 1, c3 = 0
Assume that the initial vector ⃗w(0) is ⃗w(0) = (0, −1, 1). Consider ε = 1.
196

Example: Separating by ⃗w(0)
(Figure: the points ⃗x1, ⃗x2, ⃗x3 and the line given by ⃗w(0).)
Denoting ⃗w(0) = (w0, w1, w2) = (0, −1, 1), the blue separating line is given by w0 + w1x1 + w2x2 = 0. The red vector normal to the blue line is (w1, w2). The points on the side of (w1, w2) are assigned 1 by the classifier, the others 0. (In this case, ⃗x3 is assigned 1 and ⃗x1, ⃗x2 are assigned 0, all of which is inconsistent with c1 = 1, c2 = 1, c3 = 0.)
197

Example: Computing ⃗w(1)
We have ⃗w(0) · x̃1 = (0, −1, 1) · (1, 2, −1) = 0 − 2 − 1 = −3, thus sgn(⃗w(0) · x̃1) = 0 and thus sgn(⃗w(0) · x̃1) − c1 = 0 − 1 = −1.
(I.e., ⃗x1 is not correctly classified, and ⃗w(0) is not consistent with D.) Hence,
⃗w(1) = ⃗w(0) − (sgn(⃗w(0) · x̃1) − c1) · x̃1 = ⃗w(0) + x̃1 = (0, −1, 1) + (1, 2, −1) = (1, 1, 0)
198

Example: Separating by ⃗w(1) (figure) 199

Example: Computing ⃗w(2)
We have ⃗w(1) · x̃2 = (1, 1, 0) · (1, 2, 1) = 1 + 2 = 3, thus sgn(⃗w(1) · x̃2) = 1 and thus sgn(⃗w(1) · x̃2) − c2 = 1 − 1 = 0.
(I.e., ⃗x2 is currently correctly classified by ⃗w(1). However, as we will see, ⃗x3 is not well classified.) Hence,
⃗w(2) = ⃗w(1) = (1, 1, 0)
200

Example: Computing ⃗w(3)
We have ⃗w(2) · x̃3 = (1, 1, 0) · (1, 1, 3) = 1 + 1 = 2, thus sgn(⃗w(2) · x̃3) = 1 and thus sgn(⃗w(2) · x̃3) − c3 = 1 − 0 = 1.
(This means that ⃗x3 is not well classified, and ⃗w(2) is not consistent with D.) Hence,
⃗w(3) = ⃗w(2) − (sgn(⃗w(2) · x̃3) − c3) · x̃3 = ⃗w(2) − x̃3 = (1, 1, 0) − (1, 1, 3) = (0, 0, −3)
201

Example: Separating by ⃗w(3) (figure) 202

Example: Computing ⃗w(4)
We have ⃗w(3) · x̃1 = (0, 0, −3) · (1, 2, −1) = 3, thus sgn(⃗w(3) · x̃1) = 1 and thus sgn(⃗w(3) · x̃1) − c1 = 1 − 1 = 0.
(I.e., ⃗x1 is currently correctly classified by ⃗w(3). However, we shall see that ⃗x2 is not.) Hence,
⃗w(4) = ⃗w(3) = (0, 0, −3)
203

Example: Computing ⃗w(5)
We have ⃗w(4) · x̃2 = (0, 0, −3) · (1, 2, 1) = −3, thus sgn(⃗w(4) · x̃2) = 0 and thus sgn(⃗w(4) · x̃2) − c2 = 0 − 1 = −1.
(I.e., ⃗x2 is not correctly classified, and ⃗w(4) is not consistent with D.) Hence,
⃗w(5) = ⃗w(4) − (sgn(⃗w(4) · x̃2) − c2) · x̃2 = ⃗w(4) + x̃2 = (0, 0, −3) + (1, 2, 1) = (1, 2, −2)
204

Example: Separating by ⃗w(5) (figure) 205

Example: The Result
The vector ⃗w(5) is consistent with D:
sgn(⃗w(5) · x̃1) = sgn((1, 2, −2) · (1, 2, −1)) = sgn(7) = 1 = c1
sgn(⃗w(5) · x̃2) = sgn((1, 2, −2) · (1, 2, 1)) = sgn(3) = 1 = c2
sgn(⃗w(5) · x̃3) = sgn((1, 2, −2) · (1, 1, 3)) = sgn(−3) = 0 = c3
206

Perceptron – Learning Algorithm
Batch learning algorithm:
Compute a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .:
▶ ⃗w(0) is randomly initialized close to ⃗0 = (0, . . . , 0).
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε · ∑_{k=1}^p (h[⃗w(t)](⃗xk) − ck) · x̃k = ⃗w(t) − ε · ∑_{k=1}^p (sgn(⃗w(t) · x̃k) − ck) · x̃k
Here 0 < ε ≤ 1 is a learning rate.
207
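The online algorithm is short enough to run end to end. A minimal sketch (our own implementation) reproducing the worked example above, including the final weights (1, 2, −2):

```python
import numpy as np

# Training set: augmented vectors x̃k and desired classes ck.
X = np.array([[1, 2, -1], [1, 2, 1], [1, 1, 3]], dtype=float)
c = np.array([1, 1, 0])
w = np.array([0, -1, 1], dtype=float)   # w(0) from the example
eps = 1.0

def sgn(y):
    return 1 if y >= 0 else 0

changed = True
while changed:                          # stop after a full clean pass
    changed = False
    for xk, ck in zip(X, c):            # examples considered cyclically
        err = sgn(w @ xk) - ck
        if err != 0:
            w = w - eps * err * xk      # turn the hyperplane
            changed = True

print(w)   # [ 1.  2. -2.], consistent with D as computed above
```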
Linear Regression 208

Linear Regression – Oaks in Wisconsin
(Figures: the same data fitted in different ways; an extrapolated fit is marked NO!, a more careful fit possibly YES!) This example is from How to Lie with Statistics by Darrell Huff (1954).
209

Linear Regression
Our goal:
▶ Given a set D of training examples of the form (⃗x, f) where ⃗x ∈ Rn and f ∈ R,
▶ construct a model h : Rn → R such that
h(⃗x) ≈ f for all training examples (⃗x, f) ∈ D
Here ≈ means that the values are somewhat close to each other w.r.t. an appropriate error function E. In what follows, we use the squared error defined by
E = (1/2) ∑_{(⃗x,f)∈D} (h(⃗x) − f)²
Our goal is to minimize E. The main reason is that this function has nice mathematical properties (as opposed, e.g., to ∑_{(⃗x,f)∈D} |h(⃗x) − f|).
210

Linear Function Approximation
▶ Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rn and fk ∈ R.
▶ Our goal: Find ⃗w so that h[⃗w](⃗xk) = ⃗w · x̃k is close to fk for every k = 1, . . . , p. Recall that x̃k = (xk0, xk1, . . . , xkn) where xk0 = 1.
▶ Squared error function:
E(⃗w) = (1/2) ∑_{k=1}^p (⃗w · x̃k − fk)² = (1/2) ∑_{k=1}^p (∑_{i=0}^n wi xki − fk)²
211

Error function (figure) 212

Gradient of the Error Function
Consider the gradient of the error function:
∇E(⃗w) = (∂E/∂w0 (⃗w), . . . , ∂E/∂wn (⃗w)) = ∑_{k=1}^p (⃗w · x̃k − fk) · x̃k
What is the gradient ∇E(⃗w)? It is a vector in R^{n+1} which points in the direction of the steepest ascent of E (its length corresponds to the steepness). Note that here the vectors x̃k are fixed parameters of E!
Fact: If ∇E(⃗w) = ⃗0 = (0, . . . , 0), then ⃗w is a global minimum of E. This follows from the fact that E is a convex paraboloid with a unique extreme, which is a minimum.
213

Gradient of the error function
Consider n = 1, which means that ⃗w = (w0, w1), and write x instead of ⃗x since ⃗x ∈ Rn = R1 = R. The model is then h[⃗w](x) = w0 + w1 · x.
Consider a concrete training set:
D = {(2, 1), (3, 2), (4, 5)} = {(x1, f1), (x2, f2), (x3, f3)}
The augmented feature vectors are (1, 2), (1, 3), (1, 4).
E(w0, w1) = (1/2)[(w0 + w1·2 − 1)² + (w0 + w1·3 − 2)² + (w0 + w1·4 − 5)²]
∂E/∂w0 = (w0 + w1·2 − 1)·1 + (w0 + w1·3 − 2)·1 + (w0 + w1·4 − 5)·1
∂E/∂w1 = (w0 + w1·2 − 1)·2 + (w0 + w1·3 − 2)·3 + (w0 + w1·4 − 5)·4
∇E(⃗w) = (∂E/∂w0, ∂E/∂w1) = (w0 + w1·2 − 1)·(1, 2) + (w0 + w1·3 − 2)·(1, 3) + (w0 + w1·4 − 5)·(1, 4)
214

Function Approximation – Learning
Gradient descent:
▶ Weights ⃗w(0) are initialized randomly close to ⃗0.
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε · ∇E(⃗w(t)) = ⃗w(t) − ε · ∑_{k=1}^p (⃗w(t) · x̃k − fk) · x̃k = ⃗w(t) − ε · ∑_{k=1}^p (h[⃗w(t)](⃗xk) − fk) · x̃k
Here 0 < ε ≤ 1 is a learning rate.
Note that the algorithm is almost identical to the batch perceptron algorithm!
Proposition
For sufficiently small ε > 0, the sequence ⃗w(0), ⃗w(1), ⃗w(2), . . . converges (component-wise) to the global minimum of E.
215

Example
Training set:
D = {(x1, f1), (x2, f2), (x3, f3)} = {(0, 0), (2, 1), (2, 2)}
Note that the input vectors are one-dimensional, so we write them as numbers. That is,
x1 = 0, x2 = 2, x3 = 2
x̃1 = (1, 0), x̃2 = (1, 2), x̃3 = (1, 2)
f1 = 0, f2 = 1, f3 = 2
Assume that the initial vector ⃗w(0) is ⃗w(0) = (w0(0), w1(0)) = (0, 2). Consider ε = 1/10.
216
(Figure: the data points and the initial model. 217)

Training set: D = {(x1, f1), (x2, f2), (x3, f3)} = {(0, 0), (2, 1), (2, 2)}
Augmented input vectors: x̃1 = (1, 0), x̃2 = (1, 2), x̃3 = (1, 2)
∇E(⃗w) = (∂E/∂w0 (⃗w), ∂E/∂w1 (⃗w)) = (w0 + w1·x1 − f1)·x̃1 + (w0 + w1·x2 − f2)·x̃2 + (w0 + w1·x3 − f3)·x̃3
For ⃗w(0) = (0, 2) we have
∇E(⃗w(0)) = (0 + 2·0 − 0)·(1, 0) + (0 + 2·2 − 1)·(1, 2) + (0 + 2·2 − 2)·(1, 2) = (3, 6) + (2, 4) = (5, 10)
Finally, ⃗w(1) is computed by
⃗w(1) = ⃗w(0) − ε · ∇E(⃗w(0)) = (0, 2) − (1/10)·(5, 10) = (−1/2, 1)
218

(Figure. 219)

Linear Regression – Animation
(A sequence of figures showing the model after successive gradient descent steps.)
220

Finding the Minimum in Dimension One
Assume n = 1. Then the error function E is
E(w0, w1) = (1/2) ∑_{k=1}^p (w0 + w1 xk − fk)²
Minimize E w.r.t. w0 and w1:
∂E/∂w0 = 0 ⇔ w0 = f̄ − w1 x̄ ⇔ f̄ = w0 + w1 x̄
where x̄ = (1/p) ∑_{k=1}^p xk and f̄ = (1/p) ∑_{k=1}^p fk
∂E/∂w1 = 0 ⇔ w1 = [(1/p) ∑_{k=1}^p (fk − f̄)(xk − x̄)] / [(1/p) ∑_{k=1}^p (xk − x̄)²]
i.e., w1 = cov(f, x) / var(x)
221

Effect of Outliers
(A sequence of figures: a single outlier may strongly tilt the fitted line.)
222

Maximum Likelihood vs Least Squares (Dim 1)
Fix a training set D = {(x1, f1), (x2, f2), . . . , (xp, fp)}. Assume that each fk has been generated randomly by
fk = (w0 + w1 · xk) + ϵk
where w0, w1 are unknown weights, and the ϵk are independent, normally distributed noise values with mean 0 and some variance σ².
How "probable" is it to generate the correct f1, . . . , fp? The following conditions are equivalent:
▶ w0, w1 minimize the squared error E,
▶ w0, w1 maximize the likelihood (i.e., the "probability") of generating the correct values f1, . . . , fp using fk = (w0 + w1 · xk) + ϵk.
223

Comments on Linear Models
▶ Linear models are parametric, i.e., they have a fixed form with a small number of parameters that need to be learned from data (as opposed, e.g., to decision trees, where the structure is not fixed in advance).
▶ Linear models are stable, i.e., small variations in the training data have only a limited impact on the learned model (tree models typically vary more with the training data).
▶ Linear models are less likely to overfit the training data (low variance) but sometimes tend to underfit (high bias).
▶ Linear models are prone to outliers.
224
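The worked gradient descent above is easy to replay. A minimal sketch (our own code, assuming NumPy) that reproduces ⃗w(1) = (−1/2, 1) and then iterates to the minimum:

```python
import numpy as np

X = np.array([[1, 0], [1, 2], [1, 2]], dtype=float)   # augmented vectors x̃k
f = np.array([0, 1, 2], dtype=float)
w = np.array([0, 2], dtype=float)                     # w(0)
eps = 0.1                                             # learning rate 1/10

for t in range(200):
    grad = ((X @ w - f)[:, None] * X).sum(axis=0)     # ∑_k (w·x̃k − fk)·x̃k
    w = w - eps * grad
    if t == 0:
        print(w)    # [-0.5  1. ]  (= w(1) computed above)

print(w)            # approaches (0, 0.75), the closed-form minimum
                    # w1 = cov(f, x)/var(x), w0 = f̄ − w1·x̄
```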
Unsupervised Learning 225

Clustering
Often data form clusters based on some notion of similarity. This means that the data distribution is multimodal, i.e., it contains several regions of higher probability mass.
We aim to group data into clusters of "similar" examples without using any additional information (no supervision).
226

Motivation
Clustering is useful, e.g., in
▶ customer segmentation based on purchases,
▶ data exploration - identifying patterns in data,
▶ semi-supervised learning - clustering labeled examples together with the unlabeled ones,
▶ search engines - searching for images similar to a given image,
▶ image segmentation,
▶ ...
227

Segmentation (figure) 228

Clustering Problem
Consider a dataset
D = {⃗x1, . . . , ⃗xp}
Note that no target class/value is provided. A clustering is a partition U = {U1, . . . , UK} of D into K clusters.
How do we identify the clusters? Assume that we have a distance measure d measuring how far apart the objects being clustered are. We want close objects to be clustered together.
For concreteness:
▶ We stick with numerical features, which means that the dataset D = {⃗x1, . . . , ⃗xp} contains vectors ⃗xi ∈ Rn.
▶ Assume the Euclidean distance d.
Note that clustering may be based on completely different similarity/dissimilarity measures and non-numerical data.
229

K-Means Clustering 230

K-means clustering
The K-means clustering model consists of
▶ the number of clusters K,
▶ K cluster prototypes ⃗m1, . . . , ⃗mK ∈ Rn,
▶ an assignment qij ∈ {0, 1} for i = 1, . . . , p and j = 1, . . . , K of inputs ⃗xi to clusters Uj so that ∑_j qij = 1 for i = 1, . . . , p.
A given assignment {qij} induces a clustering U = {U1, . . . , UK} where ⃗xi ∈ Uj iff qij = 1.
How good is a given model?
231

Error Function
Measure the distance of inputs ⃗xi to their cluster prototypes ⃗mj:
E({qij}, {⃗mj}) = ∑_{i=1}^p ∑_{j=1}^K qij d(⃗xi, ⃗mj)²
We aim to minimize this error, i.e., to find positions of the cluster prototypes and an assignment of inputs to them that minimize the total squared distance of examples to their prototypes.
232

K-means Clustering Algorithm
The problem: Minimize
E({qij}, {⃗mj}) = ∑_{i=1}^p ∑_{j=1}^K qij d(⃗xi, ⃗mj)²
w.r.t. {qij}, {⃗mj}. Note that
▶ If we fix {⃗mj}, we can minimize E({qij}, {⃗mj}) by setting qij = 1 iff ⃗mj is the closest prototype to ⃗xi.
▶ If we fix {qij}, we can minimize E({qij}, {⃗mj}) by letting each ⃗mj minimize the total squared distance to the inputs it represents, ∑_i qij d(⃗xi, ⃗mj)². This is achieved by putting each prototype ⃗mj into the centroid of all inputs it represents:
⃗mj = (∑_{i=1}^p qij ⃗xi) / (∑_{i=1}^p qij)
Note that ∑_{i=1}^p qij is the size of the cluster represented by ⃗mj.
233

K-Means Clustering Algorithm
Algorithm 1: K-means clustering
1: Initialize K cluster prototypes ⃗m1, ⃗m2, . . . , ⃗mK randomly
2: repeat
3:   for each data point ⃗xi do
4:     Assign ⃗xi to the nearest prototype, i.e., set qij = 1 for j = arg min_j d(⃗xi, ⃗mj)²
5:   end for
6:   for each cluster prototype ⃗mj do
7:     Update ⃗mj to be the centroid of all points assigned to it:
       ⃗mj = (∑_{i=1}^p qij ⃗xi) / (∑_{i=1}^p qij)
8:   end for
9: until convergence
234
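A minimal sketch of Algorithm 1 on toy data (our own implementation; it does not handle empty clusters, which are discussed below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three Gaussian blobs as toy data (our own choice of centers).
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
K = 3
m = X[rng.choice(len(X), K, replace=False)]     # random initial prototypes

for _ in range(100):
    # Lines 3-5: assign each point to its nearest prototype.
    dists = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    q = dists.argmin(axis=1)
    # Lines 6-8: move each prototype to the centroid of its points.
    new_m = np.array([X[q == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_m, m):                   # convergence test
        break
    m = new_m

inertia = dists.min(axis=1).sum()               # the error E({qij}, {mj})
print(m, inertia)
```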
Example
(A sequence of figures illustrating one run of K-means: Lines 3-5 assign examples to the prototypes; Lines 6-8 move the prototypes to the centroids of their examples; this alternation repeats until convergence. 235 236)

Convergence of K-means Clustering
Every step of K-means reduces the error E({qij}, {⃗mj}):
▶ We always assign an input vector to the closest prototype.
▶ We always move the prototype to be "closest" to the input vectors it represents.
Convergence can be tested by computing the error and checking whether it has changed in the last step. Convergence always occurs after finitely many steps: there are only finitely many possible assignments {qij}, and we always minimize the distance of inputs to their assigned prototypes.
(Figure: example error development during training. Blue circles mean reassignment, and red circles mean moving prototypes.)
237

Setting K - the Elbow Method
K-means clustering minimizes the inertia measure:
E({qij}, {⃗mj}) = ∑_{i=1}^p ∑_{j=1}^K qij d(⃗xi, ⃗mj)²
That is, the sum of squared distances of all examples of D to their cluster prototypes. Note that the error does not consider the distance between the centers of the clusters. Still, it is a valid measure that can be used to select the number of clusters.
238

Elbow Method
The following method for setting hyperparameters can be used in general. Let us illustrate the elbow method on K-means clustering with the inertia measure. Consider the following data: (figure)
239

Elbow Method
(Figure: inertia as a function of the number of clusters.) We could choose four clusters because adding more leads only to small decrements in the inertia.
240

Bad Behavior
Minimizing E({qij}, {⃗mj}) starting from random positions of prototypes does not always produce "nice" results. Some runs correspond to apparently bad solutions to the clustering problem, even though a better solution exists.
Possible solution: Start the algorithm several times with random initializations of the prototypes.
241

Properties of K-means Clustering
▶ Prototype initialization is a big issue in K-means. There are various strategies. For example:
▶ Start with all prototypes in a single corner.
▶ Include randomness in the setting of prototypes throughout the algorithm.
▶ Initialize sequentially: always fit the prototypes, then choose a new one as far away from the others as possible.
▶ Use hierarchical clustering (next slides) to find clusters and initialize K-means with their centroids.
▶ Empty clusters may occur and need to be resolved, e.g., by assigning the point farthest from any current prototype.
▶ As the squared error is behind the basic method, outliers may strongly affect its behavior (as in the linear regression case).
▶ Other problematic properties of data include
▶ non-convex clusters,
▶ clusters of different sizes,
▶ non-linearly separable clusters,
▶ overlapping clusters.
242 243

Agglomerative Clustering 244

Agglomerative Clustering
Consider a dataset
D = {⃗x1, . . . , ⃗xp}
Here ⃗xi ∈ Rn for all i = 1, . . . , p. Assume a distance d (e.g., Euclidean).
Idea:
▶ Start by merging the closest examples (w.r.t. d).
▶ Incrementally build larger clusters by merging smaller clusters.
More concretely:
▶ Maintain a set of clusters.
▶ Initially, each ⃗xi is in its own cluster.
▶ Repeat until only one cluster is left:
▶ pick the two closest clusters,
▶ merge them into a new cluster.
How do we determine the closest clusters?
245

Closest Clusters (figures) 246

Distance Between Clusters (figures) 247

Which One is Closer? (figure) 248

Distance Between Clusters
Consider two clusters Uj, Uk ⊆ D.
single-linkage(Uj, Uk) = min{d(⃗x, ⃗z) | ⃗x ∈ Uj, ⃗z ∈ Uk}
complete-linkage(Uj, Uk) = max{d(⃗x, ⃗z) | ⃗x ∈ Uj, ⃗z ∈ Uk}
average-linkage(Uj, Uk) = (1 / (|Uj| |Uk|)) ∑_{⃗x∈Uj} ∑_{⃗z∈Uk} d(⃗x, ⃗z)
Each linkage can result in a different clustering.
249

Agglomerative Hierarchical Clustering Algorithm
Maintain a set of clusters.
Initially, each ⃗xi is in its own cluster.
repeat
  Pick the two closest clusters (using the distance measure d and single, average, or complete linkage).
  Merge them into a new cluster.
until only one cluster is left
250
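A minimal sketch of the algorithm using SciPy's hierarchical clustering routines (assuming SciPy is available; the toy points are our own):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in [(0, 0), (2, 2)]])

# Each row of Z records one merge: (cluster a, cluster b, distance, new size).
Z = linkage(X, method='single')       # also: 'complete', 'average'
print(Z[:5])

labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy at 2 clusters
print(labels)
```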
Example
(Figure: six points, numbered 1-6, to be clustered. 251)

Example - Single Linkage
(A sequence of figures showing the merges:)
▶ d(3, 6) = 0.11, which is the minimum distance between points, so {3} and {6} are merged first.
▶ d(2, 5) = 0.14, the second smallest distance, so {2} and {5} are merged next.
▶ d(2, 3) = 0.15 = min{d(2, 3), d(2, 6), d(5, 3), d(5, 6)}, which is smaller than d(1, 2) = 0.24, d(1, 3) = 0.22, d(4, 2) = 0.2, d(4, 3) = 0.16, d(4, 1) = 0.37, the minimum distances of points in all other pairs of clusters; so {2, 5} and {3, 6} are merged.
▶ d(4, 3) = 0.16 = min{d(4, 3), d(4, 5), d(4, 2), d(4, 6)}, which is smaller than d(1, 3) = 0.22, the minimum distance of 1 to the large cluster; so 4 joins the cluster.
▶ Finally, 1 is merged with the rest.
252

Example - Average Linkage
(A sequence of figures:)
▶ d(3, 6) = 0.11 is the smallest distance, so {3} and {6} are merged first.
▶ d(2, 5) = 0.14 is the second smallest distance, so {2} and {5} are merged next.
▶ The average distance between 4 and the two points of {3, 6} is (1/2)(d(4, 3) + d(4, 6)) = 0.19, which is smaller than the average distance between the points of the clusters {2, 5} and {3, 6}, namely (d(2, 3) + d(2, 6) + d(5, 3) + d(5, 6))/4 (equal to 0.205), and smaller than the average distance of 1 to any cluster; so {4} and {3, 6} are merged.
▶ The average distance between the two remaining non-singleton clusters is 0.26, which is smaller than the average distance of 1 to either of them (the average distances are 0.273 and 0.29); so these two clusters are merged.
▶ Finally, 1 is merged with the rest.
253

Example - Complete Linkage
(A sequence of figures:)
▶ d(3, 6) = 0.11 is the minimum distance between points, so {3} and {6} are merged first.
▶ d(2, 5) = 0.14 is the second smallest distance, so {2} and {5} are merged next.
▶ d(4, 6) = 0.22 = max{d(4, 3), d(4, 6)}, which is smaller than d(4, 5) = 0.29, d(1, 5) = 0.34, d(1, 6) = 0.23, d(5, 6) = 0.39, d(4, 1) = 0.37, the maximum distances of points in all other pairs of clusters; so {4} and {3, 6} are merged.
▶ d(1, 5) = 0.34, which is smaller than d(1, 4) = 0.37 and d(5, 6) = 0.39, the maximum distances of points in all other pairs of clusters; so {1} and {2, 5} are merged.
▶ Finally, the two remaining clusters are merged.
254

Properties of Agglomerative Hierarchical Clustering
▶ Provides a hierarchy of clusters - different cut levels provide different levels of coarseness of the clusters.
▶ Compared with K-means, it does not depend on initialization and may provide better clusters than K-means.
▶ Lack of a global objective function: agglomerative hierarchical clustering uses local criteria to decide which clusters to merge.
▶ Agglomerative clustering has a "rich get richer" behavior that leads to uneven cluster sizes.
▶ Merging decisions cannot be undone - bad for noisy data.
▶ Computationally expensive.
255 256

State level statistics for US
(A sequence of figures: hierarchical clustering of US states based on state-level statistics.)
257

Cluster Validation 258

Cluster Validity
For supervised classification (= we have class labels), we have a variety of measures to evaluate how good our model is: Accuracy, Precision, Recall, F1, etc. For cluster analysis (= unsupervised learning), the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
Keep in mind that the dataset can be large and high-dimensional; visualization might be difficult.
259

Random points:
(Figures: K-means and hierarchical clusterings of uniformly random points.)
260
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure exists in the data (e.g., to avoid overfitting).
2. Internal validation: evaluating how well the cluster analysis results fit the data without reference to external information.
3. External validation: comparing the cluster analysis results to externally known class labels.
4. Comparing clusterings to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
261

Measures of Cluster Validity
Numerical measures applied to judge various aspects of cluster validity are classified into the following three types:
▶ Internal index: used to measure the goodness of a clustering structure without respect to external information.
▶ External index: used to measure the extent to which cluster labels match externally supplied class labels.
▶ Relative index: used to compare two different clusterings or clusters.
262

Internal Index
Consider a dataset
D = {⃗x1, . . . , ⃗xp}
Assume that a clustering algorithm produced a partition U = {U1, . . . , UK} of D into K clusters. No other information has been provided. We aim to measure the "niceness" of the clustering.
Assume that we have a distance measure d measuring how far apart the objects being clustered are. For concreteness:
▶ We stick with numerical features, which means that the dataset D = {⃗x1, . . . , ⃗xp} contains vectors ⃗xi ∈ Rn.
▶ Assume the Euclidean distance d.
Note that the validity measures may be based on completely different similarity/dissimilarity measures and non-numerical data.
263

Cohesion and Separation
Consider a dataset D = {⃗x1, . . . , ⃗xp} and its clustering U = {U1, . . . , UK}. Let us utilize the concept of distance to cluster prototypes and consider the distance between prototypes.
To measure the dissimilarity of examples, we use a notion of proximity defined on pairs of vectors ⃗x, ⃗z: proximity(⃗x, ⃗z). The proximity might be, e.g.,
▶ the distance d(⃗x, ⃗z),
▶ the square of the distance, that is, d(⃗x, ⃗z)²,
▶ any other notion of dissimilarity based on the application.
We consider the notions of cohesion (proximity of examples within clusters) and separation (proximity of clusters).
264

Prototype-based Cohesion and Separation
Prototype-based cohesion = the proximity of examples within a given cluster to a prototype of the cluster (e.g., its centroid). Given a cluster Uj ∈ U and its prototype ⃗mj ∈ Rn,
cohesion(Uj) = ∑_{⃗x∈Uj} proximity(⃗x, ⃗mj)
Note that the prototype does not have to be an element of Uj. Intuitively, cohesion is the proximity of the cluster's examples to a point somewhere "between" all examples of the cluster.
Prototype-based separation = the dissimilarity of prototypes of different clusters. Given a cluster Uj ∈ U, its prototype ⃗mj ∈ Rn, and a prototype ⃗m ∈ Rn of all examples (e.g., the centroid of all examples),
separation(Uj) = proximity(⃗mj, ⃗m)
Intuitively, separation is the proximity of the cluster's examples to the dataset's center.
265
Prototype-based Cohesion and Separation
Summarize the prototype-based cohesion and separation as follows:
cohesion(U) = ∑_{j=1}^K cohesion(Uj) = ∑_{j=1}^K ∑_{⃗x∈Uj} proximity(⃗x, ⃗mj)
separation(U) = ∑_{j=1}^K |Uj| · separation(Uj) = ∑_{j=1}^K |Uj| · proximity(⃗mj, ⃗m)
If proximity(⃗x, ⃗z) is defined as d(⃗x, ⃗z)², then the cohesion is the inertia. There is an interesting relationship between the above measures and the squared distances to the prototype of the whole dataset ⃗m.
266

Prototype-based Cohesion and Separation
Consider a dataset D = {⃗x1, . . . , ⃗xp} and a clustering U = {U1, . . . , UK} of D. Consider proximity(⃗x, ⃗z) = d(⃗x, ⃗z)² and all prototypes to be centroids. Let ⃗m ∈ Rn be the centroid of all examples:
⃗m = (1/|D|) ∑_{i=1}^p ⃗xi
Define
TSS = ∑_{i=1}^p proximity(⃗xi, ⃗m) = ∑_{i=1}^p d(⃗xi, ⃗m)²
The following holds:
TSS = cohesion(U) + separation(U)
Note that TSS is determined by D alone.
267

Silhouette Score
The silhouette score can be used to measure the quality of a clustering both from the point of view of individual examples and from the point of view of the overall clustering:
silhouette(⃗x) = (b − a) / max{a, b}
268

Silhouette Score
Consider a clustering U = {U1, . . . , UK} and ⃗x ∈ Uj. If |Uj| > 1, we define
a(⃗x) = (1 / (|Uj| − 1)) ∑_{⃗z∈Uj∖{⃗x}} d(⃗x, ⃗z)
b(⃗x) = min_{k≠j} (1 / |Uk|) ∑_{⃗z∈Uk} d(⃗x, ⃗z)
If |Uj| > 1, we define
silhouette(⃗x) = (b(⃗x) − a(⃗x)) / max{a(⃗x), b(⃗x)}
Otherwise, we define silhouette(⃗x) = 0.
269

Example (figure) 270
Example (figure) 271

Silhouette for Clusters and Clusterings
We have defined the silhouette for a single ⃗x ∈ D. To obtain the silhouette score for a whole cluster Uj or for D, we summarize using simple averaging:
silhouette(Uj) = (1/|Uj|) ∑_{⃗x∈Uj} silhouette(⃗x)
silhouette(D) = (1/|D|) ∑_{⃗x∈D} silhouette(⃗x)
272

Example
(A sequence of figures: the colored graphs on the left are silhouette scores of the individual elements of the clusters; the red vertical line denotes the silhouette score of the whole dataset.)
273

External Index
Consider a supervised learning dataset
D = {(⃗x1, c1), . . . , (⃗xp, cp)}
Here ci ∈ C is the class of ⃗xi. Assume that a clustering algorithm produced a partition U = {U1, . . . , UK} of D into K clusters. We measure how the clustering conforms with the given classes.
274

Purity
Consider the clustering to be a classification model. Define a classifier h : D → C such that given ⃗xi ∈ Uj ∈ U,
h(⃗xi) = the most frequent class in Uj
Now we can measure the Accuracy of h. The Accuracy of h is called purity. Intuitively, it is the proportion of majority-class elements in the clusters.
Is it a good measure? Probably not; many clusters lead to high purity (each element in its own cluster means purity = 1).
275
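Returning to the silhouette score for a moment: it is easy to compute in practice. A minimal sketch using scikit-learn (assuming it is available; the toy data is our own):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

for K in (2, 3, 4):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, silhouette_score(X, labels))   # silhouette(D); highest near K = 3

# silhouette_samples from the same module gives silhouette(x) per example.
```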
Classifier Point of View
Given ⃗xi, denote by U(⃗xi) the cluster Uj ∈ U containing ⃗xi. Distinguish the following types of pairs of examples:
▶ TP = number of pairs of examples of the same class and the same cluster
TP = |{(i, j) | U(⃗xi) = U(⃗xj) ∧ ci = cj}|
▶ TN = number of pairs of examples of different classes and different clusters
TN = |{(i, j) | U(⃗xi) ≠ U(⃗xj) ∧ ci ≠ cj}|
▶ FP = number of pairs of examples of different classes and the same cluster
FP = |{(i, j) | U(⃗xi) = U(⃗xj) ∧ ci ≠ cj}|
▶ FN = number of pairs of examples of the same class and different clusters
FN = |{(i, j) | U(⃗xi) ≠ U(⃗xj) ∧ ci = cj}|
Now, we may apply all the measures from the supervised setting.
276

Example
(Figure: two clusters; one contains 4 examples of the first class and 1 of the second, the other contains 2 of the first class and 5 of the second.)
TP = (4 choose 2) + (5 choose 2) + (2 choose 2) = 6 + 10 + 1 = 17
TN = 4 · 5 + 1 · 2 = 22
FP = 1 · 4 + 5 · 2 = 14
FN = 1 · 5 + 4 · 2 = 13

             Cluster
Class        same      different
same         TP = 17   FN = 13
different    FP = 14   TN = 22
277

Rand Index
Accuracy (in this area known as the Rand index) is
RandInd = Accuracy = (TP + TN) / (TP + TN + FP + FN)
In our example, RandInd = (17 + 22)/(17 + 22 + 14 + 13) ≈ 0.59.
Note that the Rand index considers both the purity and the number of clusters. Also note that the Rand index can be used to compare two clusterings: simply consider the class labels to be indicators of clusters. Similarly, we may compute the other measures, such as Precision, Recall, and F1, with all their benefits and limitations.
278

Logistic Regression 279

What about classification using regression?
Binary classification: desired outputs 0 and 1 ... we want to capture the probability distribution of the classes.
(Figures: fitting a straight line to 0/1 data does not capture the probability well - it is not a probability at all; the logistic sigmoid 1/(1 + e^{−⃗w·x̃}) is much better!)
280

Logistic Regression
A logistic regression model h[⃗w] is determined by a vector of weights ⃗w = (w0, w1, . . . , wn) ∈ R^{n+1} as follows: Given ⃗x = (x1, . . . , xn) ∈ Rn,
h[⃗w](⃗x) := 1 / (1 + e^{−(w0 + ∑_{k=1}^n wk xk)}) = 1 / (1 + e^{−⃗w·x̃})
Here x̃ = (x0, x1, . . . , xn), where x0 = 1, is the augmented feature vector.
281
Logistic Regression 279
What about classification using regression?
Binary classification: desired outputs 0 and 1 ... we want to capture the probability distribution of the classes. A linear function does not capture the probability well (it is not a probability at all) ... the logistic sigmoid 1/(1 + e^{−⃗w·˜x}) is much better! 280
Logistic Regression
A logistic regression model h[⃗w] is determined by a vector of weights ⃗w = (w0, w1, . . . , wn) ∈ R^{n+1} as follows: Given ⃗x = (x1, . . . , xn) ∈ Rⁿ,
h[⃗w](⃗x) := 1 / (1 + e^{−(w0 + Σ_{k=1}^n wk·xk)}) = 1 / (1 + e^{−⃗w·˜x})
Here ˜x = (x0, x1, . . . , xn), where x0 = 1, is the augmented feature vector. 281
But what is the meaning of the sigmoid?
The model gives the probability h[⃗w](⃗x) of the class 1 given an input ⃗x. But why do we model this probability using 1/(1 + e^{−⃗w·˜x})??
Denote by h̄ the probability P(Y = 1 | X = ⃗x), i.e., the "true" probability of the class 1 given features ⃗x. The probability h̄ cannot be easily modeled using a linear function (the probabilities are between 0 and 1).
What about the odds of the class 1? odds(h̄) = h̄/(1 − h̄). Better, at least it is unbounded on one side ...
What about the log odds (aka logit) of the class 1? logit(h̄) = log(h̄/(1 − h̄)). Looks almost linear, at least for probabilities not too close to 0 or 1 ... 282
But what is the meaning of the sigmoid?
Assume that h̄ is the actual probability of the class 1 for an "object" with features ⃗x ∈ Rⁿ. Put
log(h̄/(1 − h̄)) = ⃗w · ˜x
Then log((1 − h̄)/h̄) = −⃗w · ˜x and (1 − h̄)/h̄ = e^{−⃗w·˜x} and
h̄ = 1 / (1 + e^{−⃗w·˜x}) = h[⃗w](⃗x)
If we model the log odds using a linear function, the probability is obtained by applying the logistic sigmoid to the result of the linear function. 283
Logistic Regression
▶ Given a set D of training samples:
D = {(⃗x1, c1), (⃗x2, c2), . . . , (⃗xp, cp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rⁿ and ck ∈ {0, 1}.
Recall that h[⃗w](⃗xk) = 1/(1 + e^{−⃗w·˜xk}), where ˜xk = (xk0, xk1, . . . , xkn) and xk0 = 1.
Our goal: Find ⃗w such that for every k = 1, . . . , p we have h[⃗w](⃗xk) ≈ ck.
▶ Binary cross-entropy:
E(⃗w) = −Σ_{k=1}^p [ck·log(h[⃗w](⃗xk)) + (1 − ck)·log(1 − h[⃗w](⃗xk))] 284
Gradient of the Error Function
Consider the gradient of the error function:
∇E(⃗w) = (∂E/∂w0(⃗w), . . . , ∂E/∂wn(⃗w)) = Σ_{k=1}^p (h[⃗w](⃗xk) − ck)·˜xk
Fact 1: If ∇E(⃗w) = ⃗0 = (0, . . . , 0), then ⃗w is a global minimum of E. This follows from the fact that E is convex.
Using the squared error with the logistic sigmoid would lead to a non-convex error with several minima! 285
Logistic Regression – Learning
Gradient descent (a small code sketch follows at the end of this part):
▶ The weights ⃗w(0) are initialized randomly close to ⃗0.
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε·∇E(⃗w(t)) = ⃗w(t) − ε·Σ_{k=1}^p (h[⃗w(t)](⃗xk) − ck)·˜xk
Here 0 < ε ≤ 1 is the learning rate.
Note that the algorithm is almost identical to the batch perceptron algorithm!
Proposition: For sufficiently small ε > 0, the sequence ⃗w(0), ⃗w(1), ⃗w(2), . . . converges (in a component-wise manner) to the global minimum of the error function E. 286
Logistic Regression – Using the Trained Model
We have already trained our logistic regression model, i.e., we have a vector of weights ⃗w = (w0, w1, . . . , wn). The model is the function h[⃗w] which, for a given feature vector ⃗x = (x1, . . . , xn), returns the probability
h[⃗w](⃗x) = 1 / (1 + e^{−(w0 + Σ_{k=1}^n wk·xk)})
that ⃗x belongs to the class 1.
To decide whether a given ⃗x belongs to the class 1, we use h[⃗w] as a Bayes classifier: Assign ⃗x to the class 1 iff h[⃗w](⃗x) ≥ 1/2.
Other thresholds can also be used, depending on the application and properties of the model. In such a case, given a threshold ξ ∈ [0, 1], assign ⃗x to the class 1 iff h[⃗w](⃗x) ≥ ξ. 287
Maximum Likelihood vs Cross-entropy (Dim 1)
Fix a training set D = {(x1, c1), (x2, c2), . . . , (xp, cp)}.
Generate a sequence (c′1, . . . , c′p) ∈ {0, 1}^p where each c′k has been generated independently by the Bernoulli trial generating 1 with probability
h[w0, w1](xk) = 1 / (1 + e^{−(w0 + w1·xk)})
and 0 otherwise. Here w0, w1 are unknown weights.
How "probable" is it to generate the correct classes c1, . . . , cp? The following conditions are equivalent:
▶ w0, w1 minimize the binary cross-entropy E
▶ w0, w1 maximize the likelihood (i.e., the "probability") of generating the correct values c1, . . . , cp using the above described Bernoulli trials (i.e., that c′k = ck for all k = 1, . . . , p)
Note that the above equivalence is a property of the cross-entropy and does not depend on the "implementation" of h[w0, w1](xk) using the logistic sigmoid. 288
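A minimal numpy sketch of the batch gradient descent above. Only the update rule comes from the slides; the data, the learning rate, and the number of steps are made-up illustration values:

    import numpy as np

    def train_logreg(X, c, eps=0.1, steps=2000):
        # X: p x n feature matrix, c: p labels in {0, 1}
        Xa = np.hstack([np.ones((len(X), 1)), X])  # augmented vectors, x0 = 1
        w = 0.01 * np.random.randn(Xa.shape[1])    # init close to 0
        for _ in range(steps):
            h = 1.0 / (1.0 + np.exp(-Xa @ w))      # h[w](x_k) for all k
            w -= eps * (Xa.T @ (h - c))            # gradient of cross-entropy
        return w

    X = np.array([[0.0], [1.0], [2.0], [3.0]])     # made-up 1D data
    c = np.array([0.0, 0.0, 1.0, 1.0])
    w = train_logreg(X, c)
    x = np.array([1.0, 2.5])                       # augmented query (x0 = 1)
    print(1.0 / (1.0 + np.exp(-(w @ x))))          # estimated P(class 1)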
Support Vector Machines (SVM) 289
SVM Idea – Which Linear Classifier is the Best?
Benefits of maximum margin:
▶ Intuitively, the maximum margin is good w.r.t. generalization.
▶ Only the support vectors (those on the margin) matter; the others can, in principle, be ignored. 290
Support Vector Machines (SVM)
Notation:
▶ ⃗w = (w0, w1, . . . , wn) a vector of weights,
▶ ⃗w′ = (w1, . . . , wn) the vector of all weights except w0 (in the distance and margin formulas below, ∥⃗w∥ is to be understood as ∥⃗w′∥ = √(w1² + · · · + wn²)),
▶ ⃗x = (x1, . . . , xn) a (generic) feature vector,
▶ ˜x = (x0, x1, . . . , xn) an augmented feature vector where x0 = 1.
Consider a linear classifier:
h[⃗w](⃗x) := 1 if w0 + Σ_{i=1}^n wi·xi = ⃗w·˜x ≥ 0, and −1 if ⃗w·˜x < 0
The distance of ⃗x from the separating hyperplane determined by ⃗w is
d[⃗w](⃗x) = |⃗w·˜x| / ∥⃗w∥
Recall that ⃗w·˜x is positive for ⃗x on the side to which ⃗w′ points and negative on the opposite side. 291
Margin
▶ Given a training set
D = {(⃗x1, y1), (⃗x2, y2), . . . , (⃗xp, yp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ X ⊆ Rⁿ and yk ∈ {−1, 1}.
▶ Assume that D is linearly separable; let ⃗w be consistent with D.
The margin of ⃗w is twice the minimum distance between the feature vectors ⃗xk and the separating hyperplane determined by ⃗w, i.e.,
2·min_k d[⃗w](⃗xk) = 2·min_k |⃗w·˜xk| / ∥⃗w∥
▶ Our goal is to find ⃗w consistent with D that maximizes the margin.
Note that to maximize the margin it suffices to maximize min_k |⃗w·˜xk| / ∥⃗w∥ over ⃗w consistent with D. 293
Finding the Maximum Margin Classifier
We want to maximize the minimum distance of the feature vectors ⃗xk from the separating hyperplane determined by ⃗w. Formally, we use the following:
To maximize the margin, find ⃗w maximizing
min_k |⃗w·˜xk| / ∥⃗w∥ (= the distance of the closest ⃗xk's to the separating hyperplane)
over the following constraints:
⃗w·˜xk > 0 for all k satisfying yk = 1
⃗w·˜xk < 0 for all k satisfying yk = −1
(the constraints make sure that ⃗w is consistent with the training set D) 294
Can be made more succinct:
To maximize the margin, find ⃗w maximizing
min_k (yk·⃗w·˜xk) / ∥⃗w∥
over min_k (yk·⃗w·˜xk) > 0.
The reason is that ⃗w is consistent with D. That is, ⃗w·˜xk > 0 for yk = 1, and ⃗w·˜xk < 0 for yk = −1. 295
Observation: For every ⃗w satisfying min_k(yk·⃗w·˜xk) > 0 there is ⃗w′′ satisfying min_k(yk·⃗w′′·˜xk) = 1 such that
min_k (yk·⃗w·˜xk) / ∥⃗w∥ = min_k (yk·⃗w′′·˜xk) / ∥⃗w′′∥
Proof: Just consider ⃗w′′ = ⃗w/ξ where ξ = min_k(yk·⃗w·˜xk).
Hence:
To maximize the margin, find ⃗w maximizing
min_k (yk·⃗w·˜xk) / ∥⃗w∥
over min_k (yk·⃗w·˜xk) = 1. 296
This can be further simplified to:
To maximize the margin, find ⃗w maximizing 1/∥⃗w∥ over min_k (yk·⃗w·˜xk) = 1. 299
Can be adjusted by loosening the constraints:
To maximize the margin, find ⃗w maximizing 1/∥⃗w∥ over min_k (yk·⃗w·˜xk) ≥ 1.
If the latter is solved by ⃗w′′ with min_k(yk·⃗w′′·˜xk) > 1, then
min_k (yk·⃗w′′·˜xk) / ∥⃗w′′∥ > 1/∥⃗w′′∥ ≥ 1/∥⃗w∥ = min_k (yk·⃗w·˜xk) / ∥⃗w∥
for all ⃗w satisfying min_k(yk·⃗w·˜xk) = 1 (the middle inequality holds because ⃗w′′ maximizes 1/∥⃗w∥ over a feasible set containing all such ⃗w), which contradicts the fact that the maximum margin is attained by such a ⃗w.
300
To maximize the margin, find ⃗w maximizing 1/∥⃗w∥ over min_k (yk·⃗w·˜xk) ≥ 1.
Can be turned into:
To maximize the margin, find ⃗w minimizing ∥⃗w∥ over min_k (yk·⃗w·˜xk) ≥ 1.
And, finally:
To maximize the margin, find ⃗w minimizing ⃗w·⃗w over yk·⃗w·˜xk ≥ 1 for all k.
Indeed, just note that ∥⃗w∥ = √(⃗w·⃗w), so minimizing ∥⃗w∥ is the same as minimizing ⃗w·⃗w. 301
SVM – Optimization
Assume a given training set
D = {(⃗x1, y1), (⃗x2, y2), . . . , (⃗xp, yp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ X ⊆ Rⁿ and yk ∈ {−1, 1}. (Recall ˜xk = (xk0, xk1, . . . , xkn) where xk0 = 1.)
Margin maximization as a quadratic optimization problem:
Find ⃗w minimizing ⃗w·⃗w under the constraints yk·⃗w·˜xk ≥ 1 for all k.
Support vectors are the vectors ⃗xk closest to the optimal separating hyperplane, i.e., those satisfying yk·⃗w·˜xk = 1 for a minimizing ⃗w. 302
Example
Training set: D = {((0, 0), −1), ((1, 1), 1), ((0, 3), 1)}
That is:
⃗x1 = (0, 0), ⃗x2 = (1, 1), ⃗x3 = (0, 3)
˜x1 = (1, 0, 0), ˜x2 = (1, 1, 1), ˜x3 = (1, 0, 3)
y1 = −1, y2 = 1, y3 = 1 303
Find ⃗w minimizing w1² + w2² under the constraints
(−1)·(1·w0 + 0·w1 + 0·w2) = −w0 ≥ 1
1·(1·w0 + 1·w1 + 1·w2) = w0 + w1 + w2 ≥ 1
1·(1·w0 + 0·w1 + 3·w2) = w0 + 3w2 ≥ 1
It can be solved using a quadratic programming solver. To solve it by hand, assume that we know that ⃗x1 and ⃗x2 are support vectors:
Find ⃗w minimizing w1² + w2² under the constraints
−w0 = 1
w0 + w1 + w2 = 1
w0 + 3w2 ≥ 1
Note that the equality constraints correspond to our assumption that ⃗x1 and ⃗x2 are support vectors. 305
Substituting w0 = −1, this can be transformed to:
Find ⃗w minimizing w1² + w2² under the constraints
w1 + w2 = 2
3w2 ≥ 2 306
Substituting w2 = 2 − w1 into the quadratic function, we obtain
w1² + (2 − w1)² = w1² + w1² − 4w1 + 4 = 2w1² − 4w1 + 4
Substituting w2 = 2 − w1 into the inequality 3w2 ≥ 2, we obtain 6 − 3w1 ≥ 2.
This reduces our problem to:
Find ⃗w minimizing 2w1² − 4w1 + 4 under the constraint w1 ≤ 4/3. 307
This is solved by w1 = 1.
From w2 = 2 − w1 we obtain w2 = 2 − 1 = 1. From −w0 = 1 we obtain w0 = −1.
The final model h[⃗w](⃗x) is given by the sign of −1 + x1 + x2. The separating hyperplane is determined by −1 + x1 + x2 = 0. 308
SVM – Optimization
▶ We need to optimize a quadratic function subject to linear constraints.
▶ Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.
But why have SVMs been so successful? ... the improvement obtained by finding the maximum margin classifier does not seem to be that strong ... right?
The answer lies in their ability to deal with non-linearly separable sets efficiently using the kernel trick (see a later lecture). 310
Comments on Algorithms
▶ The main bottleneck of SVM is the complexity of quadratic programming (QP). A naive QP solver has cubic complexity.
▶ For small problems, any general-purpose optimization algorithm can be used.
▶ For large problems, this is usually not possible; many methods avoiding direct solutions have been devised.
▶ These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively:
▶ Start with a (smaller) subset of training examples.
▶ Find an optimal solution using any solver.
▶ Afterwards, only the support vectors matter in the solution! Leave only them in the training set, and add new training examples.
▶ This iterative procedure decreases the (general) cost function.
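As a quick cross-check of the worked example above, here is a sketch using scikit-learn (the library used in the course textbook). Approximating the hard margin with a very large C is an assumption of this sketch, not something prescribed by the slides:

    import numpy as np
    from sklearn.svm import SVC

    # the training set from the worked example above
    X = np.array([[0, 0], [1, 1], [0, 3]])
    y = np.array([-1, 1, 1])

    # a very large C makes the soft-margin solver behave like hard margin
    svm = SVC(kernel='linear', C=1e6)
    svm.fit(X, y)

    print(svm.coef_)             # expected approx. [[1. 1.]] -> w1 = w2 = 1
    print(svm.intercept_)        # expected approx. [-1.]     -> w0 = -1
    print(svm.support_vectors_)  # expected (0, 0) and (1, 1)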
311
Soft-margin SVM
Trade off a few misclassifications for a wide margin for the rest:
Find ⃗w minimizing
⃗w·⃗w + C·Σ_k ζk (C is a hyperparameter)
under the constraints
yk·⃗w·˜xk ≥ 1 − ζk for all k
ζk ≥ 0 for all k
This is the same as the following unconstrained optimization:
Find ⃗w minimizing the hinge loss
⃗w·⃗w + C·Σ_k max(0, 1 − yk·⃗w·˜xk) 312
Hard vs Soft Margin SVM
Source: Dishaa Agarwal https://www.analyticsvidhya.com/blog/2021/04/insight-into-svm-support-vector-machine-along-with-code/ 313
Comments on SVM
▶ SVMs were originally proposed by Boser, Guyon, and Vapnik in 1992 and gained increasing popularity in the late 1990s.
▶ SVMs are currently among the best performers for several classification tasks ranging from text to genomic data.
▶ SVMs can be applied to complex data types beyond feature vectors (e.g., graphs, sequences, relational data) by designing kernel functions for such data.
▶ SVM techniques have been extended to several tasks, such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc. 314
Kernel Methods 315
Quadratic Decision Boundary
Left: the original set. Right: transformed using the square of the features.
Right: the green line is a separating hyperplane in the transformed space. Left: the green ellipse maps exactly to the green line.
How to classify (in the original space): First, transform a given feature vector by squaring the features, then use a linear classifier. 316
Another Solution
Mapping from R² to R³ so that there is "more space" for linear separation. 317
Do We Need to Map Explicitly?
In general, mapping to a (much) higher-dimensional feature space helps (there are more "degrees of freedom", so linear separability might get a chance).
However, the complexity of learning grows (quickly) with the dimension. Sometimes it is even beneficial to map to infinite-dimensional spaces.
To avoid the explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first, we need to dualize our learning algorithm. 318
Linear Regression
▶ Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rⁿ and fk ∈ R.
▶ Our goal: Find ⃗w so that h[⃗w](⃗xk) = ⃗w·˜xk is close to fk for every k = 1, . . . , p. Recall that ˜xk = (xk0, xk1, . . . , xkn) where xk0 = 1.
▶ Squared error function:
E(⃗w) = (1/2) Σ_{k=1}^p (⃗w·˜xk − fk)² = (1/2) Σ_{k=1}^p (Σ_{i=0}^n wi·xki − fk)² 319
Regularized Linear Regression
Regularized squared error function:
E(⃗w) = (1/2) Σ_{k=1}^p (⃗w·˜xk − fk)² + ⃗w·⃗w
Intuition: the added term ⃗w·⃗w prevents the growth of the weights.
The Representer Theorem: The weight vector ⃗w∗ minimizing the regularized squared error function can be written as
⃗w∗ = Σ_{i=1}^p αi·fi·˜xi
Here α1, . . . , αp are suitable coefficients. Substituting this expression for the weights in E gives
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·˜xi·˜xk − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·˜xi·˜xj
and we minimize E′ w.r.t. α1, . . . , αp. What is this good for?? 320
Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rⁿ and fk ∈ R.
Find α1, . . . , αp minimizing the dual regularized squared error
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·˜xi·˜xk − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·˜xi·˜xj
The resulting coefficients α1, . . . , αp give a weight vector
⃗w∗ = Σ_{i=1}^p αi·fi·˜xi
which in turn gives a linear model
h[⃗w∗](⃗x) = ⃗w∗·˜x = Σ_{i=1}^p αi·fi·˜xi·˜x
Note that all of ˜x, ˜xi, ˜xj, ˜xk occur only within dot products! 321
Find ⃗α = (α1, . . . , αp) minimizing the dual regularized squared error
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·˜xi·˜xk − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·˜xi·˜xj
Linear model: h[⃗α](⃗x) = Σ_{i=1}^p αi·fi·˜xi·˜x
Do we need to use the dot product in the above procedure? NO!
Find ⃗α = (α1, . . . , αp) minimizing the kernel dual regularized squared error
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·κ(˜xi, ˜xk) − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·κ(˜xi, ˜xj)
Non-linear model: h[⃗α](⃗x) = Σ_{i=1}^p αi·fi·κ(˜xi, ˜x)
Here κ is a kernel function. But now what is the trick? The trick is that suitable kernel functions κ correspond to dot products in transformed spaces! 322
Recall the Quadratic Decision Boundary
Left: the original set. Right: transformed using the square of the features.
Right: the green line is a separating hyperplane in the transformed space. Left: the green ellipse maps exactly to the green line.
How to classify (in the original space): Transform a given feature vector by squaring the features, then use a linear classifier. 323
Kernel Trick
For simplicity, assume bivariate data: ˜xk = (1, xk1, xk2). The corresponding instance in the quadratic feature space is (1, xk1², xk2²).
Consider two instances ˜xk = (1, xk1, xk2) and ˜xℓ = (1, xℓ1, xℓ2). Then the dot product of their corresponding instances (1, xk1², xk2²) and (1, xℓ1², xℓ2²) in the quadratic feature space is
1 + xk1²·xℓ1² + xk2²·xℓ2²
which resembles (but is not equal to)
(˜xk·˜xℓ)² = (1 + xk1·xℓ1 + xk2·xℓ2)² = 1 + xk1²·xℓ1² + xk2²·xℓ2² + 2·xk1·xℓ1·xk2·xℓ2 + 2·xk1·xℓ1 + 2·xk2·xℓ2
But now consider a mapping ϕ to R⁶ defined by
ϕ(˜xk) = (1, xk1², xk2², √2·xk1·xk2, √2·xk1, √2·xk2)
Then
ϕ(˜xk)·ϕ(˜xℓ) = (˜xk·˜xℓ)²
THE Idea: Using the kernel κ(˜xk, ˜xℓ) = (˜xk·˜xℓ)² in the kernel dual regularized squared error corresponds to using the regularized squared error after the transformation ϕ (verified numerically in the sketch at the end of this part). 324
Quadratic Decision Boundary
Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Assume that fi ∈ {1, −1} indicates the class of ⃗xi. Yes, I know that squared error regression should not be used for classification!
Considering κ(˜xk, ˜xℓ) = (˜xk·˜xℓ)² in our kernel dual regularized squared error, we obtain:
Find ⃗α = (α1, . . . , αp) minimizing
E′(⃗w) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·(˜xi·˜xk)² − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·(˜xi·˜xj)²
Non-linear classifier: h[⃗α](⃗x) = Σ_{i=1}^p αi·fi·(˜xi·˜x)²
Intuitively, minimizing E′ in R² gives a separating hyperplane for the input vectors transformed into R⁵. This means that in R² it searches for a quadratic (i.e., non-linear) boundary. 325
Examples of Kernels
▶ Linear: κ(˜xℓ, ˜xk) = ˜xℓ·˜xk
The corresponding mapping ϕ(˜x) = ˜x is the identity (no transformation).
▶ Polynomial of power m: κ(˜xℓ, ˜xk) = (˜xℓ·˜xk)^m
The corresponding mapping assigns to ˜x ∈ R^{n+1} the vector ϕ(˜x) in R^{C(n+m, m)+1}.
▶ Gaussian (radial-basis function): κ(˜xℓ, ˜xk) = e^{−∥˜xℓ − ˜xk∥² / (2σ²)}
The corresponding mapping ϕ maps ˜x to an infinite-dimensional vector ϕ(˜x) which is, in fact, a Gaussian function; a combination of such functions for support vectors is then the separating hypersurface.
▶ · · ·
Choosing kernels remains the black magic of kernel methods. They are usually chosen based on trial and error (of course, experience and additional insight into the data help).
A similar trick can be done with (soft-margin) support vector machines. 326
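A quick numerical check of the identity ϕ(˜xk)·ϕ(˜xℓ) = (˜xk·˜xℓ)² from the kernel trick slide, using the explicit map ϕ into R⁶ given above (the concrete vectors are made up):

    import numpy as np

    def phi(x):
        # explicit feature map for the quadratic kernel, x = (1, a, b)
        _, a, b = x
        return np.array([1, a*a, b*b,
                         np.sqrt(2)*a*b, np.sqrt(2)*a, np.sqrt(2)*b])

    xk = np.array([1.0, 0.5, -2.0])   # augmented vectors (x0 = 1)
    xl = np.array([1.0, 3.0, 1.5])

    print(phi(xk) @ phi(xl))          # dot product in the transformed space
    print((xk @ xl) ** 2)             # kernel in the original space - equal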
Neural Networks 327
(Primitive) Mathematical Model of Neuron
(Figure: a neuron with inputs x1, x2, . . . , xn, inner potential ξ, activation function σ, and output y.) 328
Formal neuron
(Figure: inputs x1, . . . , xn and the special input x0 = 1, weights w1, . . . , wn and w0, inner potential ξ, activation function σ, output y.)
▶ x1, . . . , xn real inputs
▶ x0 special input, always 1
▶ w0, w1, . . . , wn real weights
▶ ξ = w0 + Σ_{i=1}^n wi·xi inner potential; in general, other potentials are considered (e.g., Gaussian), more on this in PV021.
▶ y output defined by y = σ(ξ) where σ is an activation function. We consider several activation functions, e.g., the linear threshold function
σ(ξ) = sgn(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0. 329
Formal Neuron vs Linear Models
▶ If σ is the linear threshold function σ(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0, we obtain a linear classifier.
▶ If σ is the identity, i.e., σ(ξ) = ξ, we obtain a linear (affine) function.
▶ If σ(ξ) = 1/(1 + e^{−ξ}), we obtain logistic regression.
Also, other activation functions are used in neural networks! 330
Activation Functions 331
Multilayer Perceptron (MLP)
(Figure: a layered network with an input layer, hidden layers, and an output layer.)
▶ Neurons are organized in layers (input layer, output layer, possibly several hidden layers)
▶ Layers are numbered from 0; the input layer is the 0-th
▶ Neurons in the ℓ-th layer are connected with all neurons in the (ℓ + 1)-th layer
Intuition: The network computes a function: Assign input values to the input neurons and 0 to the rest. Proceed upwards through the layers, one layer per step. In the ℓ-th step, consider the output values of neurons in the (ℓ − 1)-th layer as inputs to the neurons of the ℓ-th layer. Compute the output values of neurons in the ℓ-th layer. 332
Example
(Figures: step-by-step evaluation of a small two-layer network with linear threshold activations on the inputs (1, 1), (0, 0), (1, 0), and (0, 1); the weights match the XOR network discussed later.)
▶ Activation function: linear threshold σ(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0. 333
Classical Example – ALVINN
▶ One of the first autonomous car driving systems (in the 90s)
▶ ALVINN drives a car
▶ The net has 30 × 32 = 960 input neurons (the input space is R⁹⁶⁰).
▶ The value of each input captures the shade of gray of the corresponding pixel.
▶ Output neurons indicate where to turn (to the center of gravity).
Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html 334
A Bit of History
▶ Perceptron (Rosenblatt et al., 1957)
▶ Single layer (i.e., no hidden layers), the activation function a linear threshold (i.e., a slightly more general linear classifier)
▶ Perceptron learning algorithm
▶ Used to recognize digits
▶ Adaline (Widrow & Hoff, 1960)
▶ Single layer, the activation function the identity (i.e., essentially a linear function)
▶ Online version of the gradient descent
▶ Used a new circuitry element called the memistor, which was able to "remember" the history of current in the form of resistance
In both cases, the expressive power is somewhat limited – they can only express linear decision boundaries and linear (affine) functions. 335
A Bit of History
One of the famous (counter)examples: XOR
(Figure: the points (0, 0) and (1, 1) labeled 0, and the points (0, 1) and (1, 0) labeled 1.)
No perceptron can distinguish between the ones and the zeros. 336
XOR vs Multilayer Perceptron
(Figure: a two-layer network computing XOR; the hidden neurons realize the two half-planes given by P1 and P2. A code check appears below.)
P1: −1 + 2x1 + 2x2 = 0
P2: 3 − 2x1 − 2x2 = 0
(Here, σ is a linear threshold function.)
The output neuron performs an intersection of half-spaces. 337
Boolean functions
Activation function: the unit step function σ(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0.
▶ y = AND(x1, . . . , xn): weights w1 = · · · = wn = 1 and bias w0 = −n
▶ y = OR(x1, . . . , xn): weights w1 = · · · = wn = 1 and bias w0 = −1
▶ y = NOT(x1): weight w1 = −1 and bias w0 = 0 338
Non-linear separation
▶ Consider a three-layer network; each neuron has the unit step activation function.
▶ The network divides the input space in two subspaces according to the output (0 or 1).
▶ The first (hidden) layer divides the input space into half-spaces.
▶ The second layer may, e.g., make intersections of the half-spaces ⇒ convex sets.
▶ The third layer may, e.g., make unions of some convex sets. 339
Example
Consider a triangle T in R² determined by the three vertices (−1, −1), (1, −1), (−1, 2). Give an example of a multilayer perceptron (MLP) with two input neurons and a single output neuron computing the function F : R² → {0, 1} defined as follows:
F(x1, x2) = 1 iff (x1, x2) lies either inside, or on the border of T
All activation functions in the network should be σ(ξ) = sgn(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0.
Homework: Consider F(x1, x2) = 1 iff (x1, x2) lies inside of T (but not on the border). 340 341
Expressive Power of MLP
Cybenko's theorem:
▶ Two-layer networks with a single output neuron and a single layer of hidden neurons (with the logistic sigmoid as the activation function) can
▶ approximate with arbitrarily small error any "reasonable" decision boundary: a given input is classified as 1 iff the output value of the network is ≥ 1/2;
▶ approximate with arbitrarily small error any "reasonable" function from [0, 1]ⁿ to (0, 1).
Here, "reasonable" means that it is pretty tough to find a function that is not reasonable.
So, multilayer perceptrons are sufficiently powerful for any application. But for a long time, at least throughout the 60s and 70s, no efficient method for training multilayer networks was widely known! An efficient way of using the gradient descent was published in 1986! 342
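A minimal check of the XOR construction above. The hidden neurons implement P1 and P2; the output weights (1, 1) with bias −2, i.e., an AND of the two half-spaces, are an assumption consistent with the slide's "intersection of half-spaces":

    def step(z):
        # linear threshold activation: 1 if z >= 0 else 0
        return 1 if z >= 0 else 0

    def xor_net(x1, x2):
        h1 = step(-1 + 2*x1 + 2*x2)   # P1: -1 + 2x1 + 2x2 >= 0
        h2 = step( 3 - 2*x1 - 2*x2)   # P2:  3 - 2x1 - 2x2 >= 0
        return step(h1 + h2 - 2)      # assumed AND output neuron

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(*x))         # prints 0, 1, 1, 0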
MLP – Notation
▶ X set of input neurons
▶ Y set of output neurons
▶ Z set of all neurons (i.e., X, Y ⊆ Z)
▶ Individual neurons are denoted by indices, e.g., i, j.
▶ ξj is the inner potential of the neuron j when the computation is finished.
▶ yj is the output value of the neuron j when the computation is finished. (We formally assume y0 = 1.)
▶ wji is the weight of the arc from the neuron i to the neuron j.
▶ j← is the set of all neurons from which there are edges to j (i.e., j← is the layer directly below j)
▶ j→ is the set of all neurons with edges from j (i.e., j→ is the layer directly above j) 343
MLP – Notation
▶ Inner potential of a neuron j:
ξj = Σ_{i∈j←} wji·yi
▶ The value of a non-input neuron j ∈ Z ∖ X when the computation is finished is
yj = σj(ξj)
Here σj is an activation function of the neuron j. (yj is determined by the weights ⃗w and a given input ⃗x, so it is sometimes written as yj[⃗w](⃗x).)
▶ Fixing the weights of all neurons, the network computes a function F[⃗w] : R^|X| → R^|Y| as follows: Assign the values of a given vector ⃗x ∈ R^|X| to the input neurons, evaluate the network, then F[⃗w](⃗x) is the vector of values of the output neurons. Here, we implicitly assume a fixed ordering on input and output vectors. 344
MLP – Learning
▶ Given a set D of training examples:
D = {(⃗xk, ⃗dk) | k = 1, . . . , p}
Here ⃗xk ∈ R^|X| and ⃗dk ∈ R^|Y|. We write dkj to denote the value in ⃗dk corresponding to the output neuron j.
▶ Error function: E(⃗w), where ⃗w is a vector of all weights in the network. The choice of E depends on the solved task (classification vs regression, etc.).
Example (squared error): E(⃗w) = Σ_{k=1}^p Ek(⃗w) where
Ek(⃗w) = (1/2) Σ_{j∈Y} (yj[⃗w](⃗xk) − dkj)² 345
MLP – Batch Gradient Descent
The algorithm computes a sequence of weights ⃗w(0), ⃗w(1), . . ..
▶ The weights ⃗w(0) are initialized randomly close to 0.
▶ In the step t + 1 (here t = 0, 1, 2, . . .), ⃗w(t+1) is computed as follows:
w(t+1)ji = w(t)ji + Δw(t)ji
where
Δw(t)ji = −ε(t)·∂E/∂wji(⃗w(t))
is the change of the weight wji, and 0 < ε(t) ≤ 1 is the learning rate in the step t + 1.
Note that ∂E/∂wji(⃗w(t)) is a component of ∇E, i.e., the weight change in the step t + 1 can be written as follows: ⃗w(t+1) = ⃗w(t) − ε(t)·∇E(⃗w(t)). 346
Illustration of Gradient Descent – XOR
Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork 347
Stochastic Gradient Descent (SGD)
Assume that E(⃗w) = Σ_{k=1}^p Ek(⃗w) where Ek(⃗w) is the error w.r.t. the single training example (⃗xk, ⃗dk).
▶ The weights ⃗w(0) are randomly initialized to values close to 0.
▶ In the step t + 1 (here t = 0, 1, 2, . . .), the weights ⃗w(t+1) are computed as follows:
▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p}
▶ Compute
⃗w(t+1) = ⃗w(t) + Δ⃗w(t)
where
Δ⃗w(t) = −ε(t)·Σ_{k∈T} ∇Ek(⃗w(t))
▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1
▶ ∇Ek(⃗w(t)) is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. 348
Comments on Training Algorithm
▶ Not guaranteed to converge to zero training error; may converge to a local minimum or oscillate indefinitely.
▶ In practice, it does converge to a low error for many large networks on (big) real data.
▶ Many epochs (thousands) may be required; hours or days of training for large networks.
There are many issues concerning learning efficiency (data normalization, selection of activation functions, weight initialization, learning rate, efficiency of the gradient descent itself, etc.) – see PV021. 349
Overfitting
▶ Due to their expressive power, neural networks are quite sensitive to overfitting.
▶ Keep a hold-out validation set and test the error of the network on this set after every epoch. Stop training when additional epochs increase the validation error (see the sketch below). The validation error can be measured by completely different means than the training error E.
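A minimal sketch of this early-stopping scheme using scikit-learn. The dataset is synthetic and all parameter values are illustrative assumptions; MLPClassifier holds out validation_fraction of the training data and stops when the validation score has not improved for n_iter_no_change epochs:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlp = MLPClassifier(hidden_layer_sizes=(32,),
                        early_stopping=True,       # keep a hold-out validation set
                        validation_fraction=0.1,   # 10% of the training data
                        n_iter_no_change=10,       # patience in epochs
                        max_iter=1000, random_state=0)
    mlp.fit(X_train, y_train)
    print(mlp.n_iter_, mlp.score(X_test, y_test))  # epochs used, test accuracy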
350
Hidden Neurons Representations
Trained hidden neurons can be seen as newly constructed features. E.g., in a two-layer network used for classification, the hidden layer transforms the input so that important features become explicit (and hence the result may become linearly separable).
Consider a two-layer MLP, 64-2-3, for classifying letters (three output neurons, each corresponding to one of the letters). 351
Optimal Architecture?
▶ For MLP: Too few hidden neurons prevent the network from adequately fitting the data. Too many hidden units can result in overfitting. (There are advanced methods that prevent overfitting even for rich models, such as regularization, where the error function penalizes overfitting – see PV021.)
▶ There are (almost) infinitely many types of architectures of neural networks (convolutional, recurrent, transformers, adversarial, etc.) suitable for various tasks.
▶ Transfer learning: Start with a known solution to a related problem. Simplified view: Preserve the lower parts of the network trained to solve the related problem (feature extractors). Add your own top part and then train only the new top part (or train the whole network, but carefully). 352
How to Choose Activation Functions & Error
▶ Hidden neurons: "almost" linear activations such as (leaky) ReLU (y = max(0, ξ)). Better than sigmoidal functions, which saturate more often.
▶ Output neurons, single output:
▶ Regression: typically "linear" output, i.e., no activation on the output neuron.
▶ Binary classification: the logistic sigmoid y = 1/(1 + e^{−ξ})
▶ Error, single output:
▶ Regression: (mean) squared error
▶ Binary classification: binary cross-entropy
For multiple outputs and classification, use the softmax output and cross-entropy. 353
Applications
▶ Image recognition, segmentation, etc.
▶ Machine translation and other text processing
▶ Text generation, image generation, movie generation, theatre play generation
▶ Text to speech and vice versa
▶ Finance, business predictions, fraud detection
▶ Game playing (backgammon is a classic example, AlphaGo is the famous one, computer games are the big ones, bridge is the hard one)
▶ (artificial brain and intelligence)
▶ ...
Text and image processing are possibly the most advanced deep learning applications. 354
ALVINN 355
ALVINN
▶ Two-layer MLP, 960 − 4 − 30 (sometimes 960 − 5 − 30)
▶ Inputs correspond to pixels.
▶ Sigmoidal activation function (logistic sigmoid).
▶ Direction corresponds to the center of gravity, i.e., output neurons are considered as points of mass evenly distributed along a line. The weight of each neuron corresponds to its value. 356
ALVINN – Training
Trained while driving.
▶ A camera captured the road from the front window, approx. 25 pictures per second.
▶ Training examples (⃗xk, ⃗dk) where
▶ ⃗xk = the image of the road
▶ ⃗dk ≈ the corresponding direction of the steering wheel set by the driver
▶ The values ⃗dk were computed using a Gaussian distribution:
dki = e^{−Di²/10}
where Di is the distance of the i-th output from the one corresponding to the real direction of the steering wheel. (The authors claimed this approach is better than a binary output because similar road directions induce similar driver reactions.) 357
Selection of Training Examples
Naive approach: just take images from the camera.
Problems:
▶ A too-good driver never teaches the network how to solve deviations from the right track. A couple of harsh solutions:
▶ turn the learning off momentarily, deviate from the right track, then turn the learning on and let the network learn how to solve the situation.
▶ let the driver go crazy! (a bit dangerous, expensive, unreliable)
▶ Images are very similar (the network sees the road from the right lane) – overfitting. 358
Selection of Training Examples
Problems with the too-good driver were solved as follows:
▶ every image of the road has been transformed into 15 slightly different copies.
The repetitiveness of images was solved as follows:
▶ the system has a buffer of 200 images (including the 15 copies of the current one); in every round, it trains on these images;
▶ afterward, a new image is captured, 15 copies are made, and these new 15 substitute 15 images selected from the buffer (10 with the smallest training error, five randomly). 359
ALVINN – Training
▶ gradient descent
▶ constant learning rate (possibly different for each neuron – see PV021)
▶ some other optimizations (see PV021)
The result:
▶ Training took 5 minutes, and the speed was 4 miles per hour. (The speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm.)
▶ ALVINN was able to go through roads it had never "seen" and in different weather. 360
ALVINN – Weight Learning
(Figures: visualizations of the weights of the hidden neurons h1, . . . , h5 after rounds 0, 10, 20, and 50.) 361
Backpropagation 362
How to Compute the Gradient?
To implement a single step of the gradient descent, we need to compute the partial derivatives ∂E/∂wij of E w.r.t. all weights wij.
To simplify, consider for now a restricted case:
▶ A single-element training set D = {(x, d)} where x ∈ R and d ∈ R are numbers (i.e., not vectors as in the general case).
▶ The error function is the squared error.
▶ Assume that 1 is the output neuron,
▶ which means that y1 = y1[⃗w](x) is the output of the network with weights ⃗w and the input x,
▶ and the error on D is then E(⃗w) = (1/2)(y1 − d)². 363
(Figure: a chain of neurons connected by the weights . . . , w23, w12, ending in the output neuron 1.)
∂E/∂w12 = (∂E/∂y1)·(∂y1/∂w12) = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w12) = (∂E/∂y1)·σ′1(ξ1)·y2
Here σ′1 is just the plain derivative of σ1 as a function of a single variable.
∂E/∂y1 = y1 − d
Note that if σ1 is the identity, we obtain exactly the gradient from the linear regression method. Considering σ1 equal to the logistic sigmoid and E the cross-entropy, we get the logistic regression gradient. 364
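Before pushing the chain rule deeper into the network, here is a quick numerical sanity check of the formula just derived, with the logistic sigmoid as σ1 (the concrete numbers w12, y2, d are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w12, y2, d = 0.7, 1.5, 1.0   # toy weight, input to neuron 1, target

    def error(w):
        y1 = sigmoid(w * y2)     # xi1 = w12 * y2, y1 = sigma1(xi1)
        return 0.5 * (y1 - d) ** 2

    # backprop formula: dE/dw12 = (y1 - d) * sigma1'(xi1) * y2
    y1 = sigmoid(w12 * y2)
    analytic = (y1 - d) * y1 * (1 - y1) * y2

    # central finite difference for comparison
    eps = 1e-6
    numeric = (error(w12 + eps) - error(w12 - eps)) / (2 * eps)
    print(analytic, numeric)     # the two values agree to ~1e-10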
Going one layer deeper in the chain:
∂E/∂w12 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w12) = (∂E/∂y1)·σ′1(ξ1)·y2
∂E/∂w23 = (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂w23) = (∂E/∂y2)·σ′2(ξ2)·y3
∂E/∂y1 = y1 − d
∂E/∂y2 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂y2) = (∂E/∂y1)·σ′1(ξ1)·w12 365
And one more layer:
∂E/∂w34 = (∂E/∂y3)·(∂y3/∂ξ3)·(∂ξ3/∂w34) = (∂E/∂y3)·σ′3(ξ3)·y4
∂E/∂y3 = (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂y3) = (∂E/∂y2)·σ′2(ξ2)·w23 366
Now consider a network that is not a chain (figure: neuron 5 feeds neuron 4, neuron 4 feeds neurons 2 and 3, and both 2 and 3 feed the output neuron 1):
∂E/∂w12 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w12) = (∂E/∂y1)·σ′1(ξ1)·y2
∂E/∂w13 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w13) = (∂E/∂y1)·σ′1(ξ1)·y3
∂E/∂w24 = (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂w24) = (∂E/∂y2)·σ′2(ξ2)·y4
∂E/∂w34 = (∂E/∂y3)·(∂y3/∂ξ3)·(∂ξ3/∂w34) = (∂E/∂y3)·σ′3(ξ3)·y4
∂E/∂w45 = (∂E/∂y4)·(∂y4/∂ξ4)·(∂ξ4/∂w45) = (∂E/∂y4)·σ′4(ξ4)·y5 367
∂E/∂y1 = y1 − d
∂E/∂y2 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂y2) = (∂E/∂y1)·σ′1(ξ1)·w12
∂E/∂y3 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂y3) = (∂E/∂y1)·σ′1(ξ1)·w13
For the neuron 4, which has two successors, both paths to the output contribute:
∂E/∂y4 = (∂E/∂y2)·(∂y2/∂y4) + (∂E/∂y3)·(∂y3/∂y4)
= (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂y4) + (∂E/∂y3)·(∂y3/∂ξ3)·(∂ξ3/∂y4)
= (∂E/∂y2)·σ′2(ξ2)·w24 + (∂E/∂y3)·σ′3(ξ3)·w34 368
MLP – Gradient Computation
Under our simplifying assumptions D = {(x, d)} and E = (1/2)(y1 − d)², the gradient computation proceeds as follows. Applying the chain rule, we obtain
∂Ek/∂wji = (∂Ek/∂yj)·σ′j(ξj)·yi
where (after more applications of the chain rule)
∂Ek/∂y1 = y1 − d
(keep in mind that 1 is the only output neuron, which means that y1 is the value of the network), and
∂Ek/∂yj = Σ_{r∈j→} (∂Ek/∂yr)·σ′r(ξr)·wrj for j ∈ Z ∖ (Y ∪ X)
Here yr = yr[⃗w](x), where ⃗w are the current weights and x is the example input. 369
MLP – Gradient Computation – General!
Let us drop our simplifying assumptions!
▶ Given a set D of training examples:
D = {(⃗xk, ⃗dk) | k = 1, . . . , p}
Here ⃗xk ∈ R^|X| and ⃗dk ∈ R^|Y|. We write dkj to denote the value in ⃗dk corresponding to the output neuron j.
▶ Error function: E(⃗w), where ⃗w is a vector of all weights in the network. The choice of E depends on the solved task (classification vs regression, etc.).
Example (squared error): E(⃗w) = Σ_{k=1}^p Ek(⃗w) where
Ek(⃗w) = (1/2) Σ_{j∈Y} (yj[⃗w](⃗xk) − dkj)² 370
MLP – Gradient Computation
For every weight wji we have (obviously)
∂E/∂wji = Σ_{k=1}^p ∂Ek/∂wji
So now it suffices to compute ∂Ek/∂wji, that is, the derivative of the error for a fixed training example (⃗xk, ⃗dk).
Applying the chain rule, we obtain
∂Ek/∂wji = (∂Ek/∂yj)·σ′j(ξj)·yi
where (after more applications of the chain rule)
∂Ek/∂yj is computed directly for the output neurons j ∈ Y, and
∂Ek/∂yj = Σ_{r∈j→} (∂Ek/∂yr)·σ′r(ξr)·wrj for j ∈ Z ∖ (Y ∪ X)
(Here yr = yr[⃗w](⃗xk), where ⃗w are the current weights and ⃗xk is the input of the k-th training example.) 371
MLP – Backpropagation
Input: A training set D = {(⃗xk, ⃗dk) | k = 1, . . . , p} and the current vector of weights ⃗w.
Note that the backpropagation is repeated in every iteration of the gradient descent!
▶ Evaluate all values yi of neurons using the standard bottom-up procedure with the input ⃗xk.
▶ For every training example (⃗xk, ⃗dk), compute ∂Ek/∂yj using backpropagation through the layers, top-down:
▶ For all j ∈ Y, compute ∂Ek/∂yj by taking the derivative of the error; e.g., in the case of the squared error, we have ∂Ek/∂yj = yj − dkj.
▶ In the layer ℓ, assuming that ∂Ek/∂yr has been computed for all neurons r in the layer ℓ + 1, compute
∂Ek/∂yj = Σ_{r∈j→} (∂Ek/∂yr)·σ′r(ξr)·wrj
for all j from the ℓ-th layer. Here σ′r is the derivative of σr.
▶ Put ∂Ek/∂wji = (∂Ek/∂yj)·σ′j(ξj)·yi
Output: ∂E/∂wji = Σ_{k=1}^p ∂Ek/∂wji. 372
MLP Learning Example
Training set: D = {(x, d)} = {(1, 1)}. That is, y3 = x = 1 and d = 1.
(The network is a chain: input neuron 3, hidden neuron 2 with the activation σ2(ξ) = max{0, ξ}, output neuron 1 with the logistic sigmoid σ1.)
Error – cross-entropy:
E(⃗w) = −(d·log(y1) + (1 − d)·log(1 − y1)) = −log(y1)
Assume the initial weight vector ⃗w(0) = (w(0)12, w(0)23) = (1/4, 2). Consider the learning rate ε = 0.5. 373
MLP Learning Example – Gradient Descent
To make the gradient descent step:
w(1)12 = w(0)12 − ε·∂E/∂w12(⃗w(0))
w(1)23 = w(0)23 − ε·∂E/∂w23(⃗w(0))
we need to compute the partial derivatives ∂E/∂w12 and ∂E/∂w23. 374
MLP Learning Example – Forward Pass
We have x = 1, w(0)12 = 1/4, w(0)23 = 2.
First, compute the forward pass:
y3 = x = 1
ξ2 = w(0)23·y3 = 2·y3 = 2
y2 = max{0, ξ2} = 2
ξ1 = w(0)12·y2 = (1/4)·2 = 1/2
y1 = 1/(1 + e^{−ξ1}) = 1/(1 + e^{−1/2}) = 0.6225 375
MLP Learning Example – Backward Pass
We have w(0)12 = 1/4, w(0)23 = 2, y1 = 0.6225, y2 = 2, y3 = 1.
Proceed with the backward pass:
∂E/∂y1 = ∂(−log(y1))/∂y1 = −1/y1 = −1.6065
Since σ′1 = σ1(1 − σ1):
∂E/∂y2 = (∂E/∂y1)·σ′1(ξ1)·w(0)12 = (∂E/∂y1)·σ1(ξ1)(1 − σ1(ξ1))·w(0)12 = (∂E/∂y1)·y1(1 − y1)·w(0)12
= −1.6065 · 0.6225 · 0.3775 · (1/4) = −0.09438 376
MLP Learning Example – The Gradient
We have w(0)12 = 1/4, w(0)23 = 2, y1 = 0.6225, y2 = 2, y3 = 1, ∂E/∂y1 = −1.6065, ∂E/∂y2 = −0.09438.
Compute the derivatives of E w.r.t. the weights:
∂E/∂w12 = (∂E/∂y1)·σ′1(ξ1)·y2 = (∂E/∂y1)·y1(1 − y1)·y2 = −0.755
∂E/∂w23 = (∂E/∂y2)·σ′2(ξ2)·y3 = (∂E/∂y2)·y3 = −0.09438
(σ′2(ξ2) = 1 because ξ2 = 2 > 0 and σ2 is the ReLU.) 377
Backpropagation – Example – Summary
Note that WE HAVE NOT YET CHANGED ANY WEIGHTS! 378
MLP Learning Example – Gradient Descent Step
So ONLY NOW can we make the learning step and change the weights:
w(1)12 = w(0)12 − ε·∂E/∂w12(⃗w(0)) = 1/4 − 0.5·(−0.755) = 0.6275
w(1)23 = w(0)23 − ε·∂E/∂w23(⃗w(0)) = 2 − 0.5·(−0.09438) = 2.047
We have made just a single gradient descent step! 379
MLP Training Summary
▶ MLPs are trained using the gradient descent algorithm. In practice, modifications of GD are typically used, but most of them have strong roots in GD.
▶ The gradients are computed using the backpropagation algorithm; backpropagation is a universal method for automatic differentiation whose use reaches far beyond neural networks.
▶ Training of neural networks in practice is tricky for many reasons, in particular:
▶ the highly complex non-linear shape of the error function,
▶ the tendency to overfit very quickly,
▶ the black-box nature – it is hard to see what the network does,
▶ the huge hype around deep learning (which many people confuse with AI) results in high expectations even in cases where no (learning) algorithm may solve the given problem!
A piece of advice: Always concentrate the main effort on the formulation of the solved problem and, afterward, on the data you have at your disposal (and honestly separate the Test set right at the beginning). 380
Deep Learning
▶ Cybenko's theorem shows that two-layer networks are omnipotent – such results nearly killed NNs when support vector machines were found to be easier to train in the 2000s.
▶ Later, it was shown (experimentally) that deep networks (with many layers) have better representational properties.
▶ ... but how to train them? The gradient descent suffers from the so-called vanishing gradient; intuitively, the updates of weights in the lower layers are very slow.
▶ In 2006, a solution was found by Hinton et al.:
▶ Use unsupervised methods to initialize the weights layer by layer to capture important data features. More precisely: The lowest hidden layer learns patterns in the data, the second lowest learns patterns in the data transformed through the first layer, and so on.
▶ Then use a supervised learning algorithm to only fine-tune the weights to the desired input-output behavior.
▶ ... but the actual revolution started with convolutional networks trained on several GPUs. 381
Convolutional network 382
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
ImageNet database (16,000,000 color images, 20,000 categories) 383
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
Competition in classification over a subset of images from ImageNet. In 2012, the training set contained 1,200,000 images in 1000 categories. Validation set: 50,000 images. Test set: 150,000 images.
Many images contain several objects → the typical rule is top-5: the true category must be among the five categories to which the net assigns the highest probability. 384
KSH net
ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012). Trained on two GPUs (NVIDIA GeForce GTX 580).
Results:
▶ Accuracy 84.7% in top-5 (the second best algorithm at the time: 73.8%)
▶ 63.3% in "perfect" classification (top-1) 385
ILSVRC 2014
The same set of images as in 2012, top-5 criterion.
GoogLeNet: a deep convolutional net, 22 layers.
Results:
▶ 93.33% in top-5
Superhuman power? 386
Superhuman GoogLeNet?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000 categories quickly turned out to be highly challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First, we thought we would put it up on [Amazon Mechanical Turk]. Then, we thought we could recruit paid undergrads. Then, I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then, I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end, I realized that to get anywhere competitively close to GoogLeNet, it would be most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself.
The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some pictures (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... In the end, my error turned out to be 5.1%. 387
ILSVRC 2015
▶ Microsoft network ResNet: 152 layers, complex architecture
▶ Trained on 8 GPUs
▶ 96.43% accuracy in top-5 388
ILSVRC 389
ILSVRC 2016
Trimps-Soushen (The Third Research Institute of the Ministry of Public Security)
There is no new innovative technology or novelty by Trimps-Soushen: an ensemble of the pre-trained models from previous years. Each model is strong at classifying some categories but weak at categorizing others.
Test error: 2.99% 390
Top-k accuracy analyzed
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd 391
Top-20 typical errors
Out of 1458 misclassified images in Top-20:
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd 392
Top-k accuracy analyzed
(A series of figures from the same source.)
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd 393–396
Anomaly Detection 397
What is an Anomaly?
Anomalies are hard to define in general. Attempts at a generic definition include:
▶ Observations that seem to be generated by a different mechanism.
▶ Data instances that occur rarely and whose features differ significantly from most of the data.
▶ Outliers – observations that appear to be inconsistent with (far away from) the remainder of the data. However, this does not cover inliers, that is, anomalous data that lie within the normal data distribution (say, a reported 0 among measurement results that are distributed around 0 but are very likely non-zero).
▶ Patterns in data that do not conform to a well-defined notion of normal behavior(??)
This lecture presents algorithms detecting specific data instances that may be anomalous w.r.t. the used algorithm. It is up to the context-knowing user to decide whether such data are anomalous. 398
What is Anomaly Detection Good For?
There are two main sources of motivation for anomaly detection in machine learning:
▶ Modeling perspective: Many models are sensitive to anomalous data.
▶ Application perspective: Many tasks are based on searching for anomalies in data. 399
Modeling Perspective
Recall the behavior of the linear model. To train such a model properly, the outliers should be removed.
On the other hand, we will see that the sensitivity of some models can be used to detect the anomalies. 400
Applications Perspective
▶ Fraud Detection
▶ Analysis of purchasing behavior to detect credit card theft.
▶ Identification of theft through uncharacteristic buying patterns and behavioral changes.
▶ Intrusion Detection
▶ Monitoring for attacks on computer systems and networks.
▶ Many attacks exhibit themselves as anomalous behavior.
▶ Ecosystem Disturbances
▶ Detect atypical natural-world events.
▶ Prediction and understanding of hurricanes, floods, droughts, heat waves, and fires.
▶ Medicine
▶ Unusual patient symptoms or test results may indicate health problems.
▶ Balancing the need for further tests with the potential costs and risks.
▶ ... 401
Causes of Anomalies in Data Preprocessing
Anomaly detection is a standard step in data preprocessing. We typically search for anomalies from the following sources:
▶ Objects from different classes: a pear among apples.
▶ Natural variation: an abnormally tall person – not from a different class (humans) but in the sense of an extreme characteristic.
▶ Data measurement and collection errors:
▶ a thermometer positioned in direct sunlight (possibly a measurement error),
▶ a 40 kg infant is not usual among humans (possibly a data collection error).
The anomalies mentioned above should be inspected by domain experts and possibly corrected/removed. 402
Approaches to Anomaly Detection
▶ Model-based techniques:
▶ Build a model of the data.
▶ Anomalies should not fit the model very well. For example, using a clustering model, an anomaly may lie far away from the larger clusters.
▶ Alternatively, removing anomalies should have the strongest impact on the model parameters (more on this later).
▶ Proximity-based techniques: Given a distance between objects, anomalies are objects distant from the others.
▶ Density-based techniques: Estimate the density distribution of the objects. Anomalies will probably lie outside of the dense regions.
I would count the most recent deep learning-based anomaly detection among the model-based techniques, even though a combination of the above approaches is usually used.
We will make the above ideas more precise in the rest of this lecture. 403
Anomaly Detection Learning
There are three approaches to anomaly detection based on the available information about the data:
▶ Supervised: Given a training set distinguishing normal and anomalous data, we may train a supervised learning classifier. As anomalies are rare, approaches based on supervised learning are rare as well.
▶ Semi-supervised:
▶ We may know which instances are normal. A tumor detection system trained on healthy people detects tumors as anomalies. In this case, we detect anomalies that are dissimilar to the normal instances. We may train a model representing the normal instances and detect instances that do not fit the model.
▶ We may have information about (some) normal and some anomalous instances. Here, we can use methods for semi-supervised classification.
▶ Unsupervised: Create a model of all instances and hope that anomalous instances will still be dissimilar to the instances typical for the model. 404
Issues
Task-specific issues comprise:
▶ Single- vs. multi-attribute anomaly: The question is whether an object is anomalous due to a single attribute (a 200 kg person) or a combination of attributes (a 100 kg person of 1 meter height). This is especially important when the data is high-dimensional. For example, in genetic data, every sample is an anomaly(?)
▶ Global vs. local perspective: An object may be anomalous w.r.t. all objects but not w.r.t. the objects in its local neighborhood (a 210 cm tall basketball player).
▶ Degree of anomaly: Usually, anomalies are reported in a binary fashion. However, we often want to measure how anomalous an object is, i.e., how dissimilar it is to normal objects.
Another issue is evaluation: How can we determine how good an anomaly detector is? In the supervised case, we have a ground truth, but usually, we just observe the detected anomalies.
405
Various Approaches to Anomaly Detection
We shall have a look at five approaches to unsupervised anomaly detection:
▶ Statistical
▶ Proximity-based
▶ Density-based
▶ Cluster-based
▶ Autoencoders
The above approaches are also used in the semi-supervised setting, where we have a dataset of normal instances. However, some issues discussed further will not appear in this case. 406
Statistical
Definition 1: An outlier is an object with a low probability in the probability distribution of the data.
Suppose we knew the "true" probability distribution P on objects. In that case, we might set up a threshold t (using an appropriate training set) and say that every object A with P(A) < t is anomalous.
However, in most cases, we do not know P and have to resort to a model of P created using a dataset of feature vectors.
There are many more or less sophisticated tests based on models of P in the literature. Some use rather advanced statistical methods. Let us have a look at a few simple examples. 407
Box Plot
A simple method for outlier detection in the univariate case, that is, for values of a single attribute. 408
Univariate Gaussian Model
Consider a numeric attribute whose values x are normally distributed with mean µ and standard deviation σ. We write N(µ, σ²).
Given an attribute value x, we compute the z-score:
z = (x − µ)/σ
Then z has the distribution N(0, 1).
Now, choose c so that the probability P(|z| ≥ c) is small enough for z to be considered an outlier w.r.t. N(0, 1).
Given an attribute value x, we may decide whether x is an outlier by deciding whether |z| = |(x − µ)/σ| ≥ c. 409
Univariate Gaussian
Problem 1: The attribute does not have a normal distribution.
Before starting, use a normality test (observe the histogram, use a specialized test such as Shapiro-Wilk). Some transformations may sometimes succeed in normalizing an attribute (the Box-Cox transformation, etc.).
Different distributions can be used to model the attribute (Log-Normal, Weibull, etc.), but be aware of the assumptions of the anomaly detection tests! 410
Univariate Gaussian
Problem 2: We typically do not know the population mean µ and the population variance σ². They can be estimated from a dataset by the sample mean and the sample variance:
µ̄ = (1/p) Σ_{i=1}^p xi
s² = (1/(p − 1)) Σ_{i=1}^p (xi − µ̄)²
However, then z = (x − µ̄)/s no longer has the distribution N(0, 1). The distribution of z is normal if x is sampled independently of the data yielding µ̄ and s².
Note that µ̄ and s² are unbiased estimates of µ and σ². Also, µ̄ converges to µ and s² converges to σ² with growing sample size. 411
Univariate Gaussian Model
Problem 3: The outliers distort the estimates µ̄ and s² of the mean and the standard deviation. For example, a millionaire would not look like an outlier in a group containing a billionaire.
There is a circularity here: To get the outliers using the normal distribution model, we need to remove the outliers to have a good model.
One possible heuristic is to remove the outliers one by one, starting from the most extreme, hoping that the most extreme outlier will be detected in every step. One such heuristic is Grubbs' test. 412
Grubbs' Test
Consider a dataset D = {x1, . . . , xp} of values of a normally distributed attribute and choose α > 0.
The Grubbs statistic is defined by
G = max_{i=1,...,p} |xi − x̄| / s
Here x̄ is the sample mean and s the sample standard deviation of D.
Now P(G ≥ c_{α,p}) = α if
c_{α,p} = ((p − 1)/√p) · √(t² / (p − 2 + t²))
Here t is the value satisfying P(T ≥ t) = α/(2p) for T with the t-distribution with p − 2 degrees of freedom.
Historically, t was found using tables.
Grubbs' test simply iteratively tests whether G ≥ c_{α,p} and, if yes, removes an xi maximizing |xi − x̄| from D (a code sketch follows at the end of this part). 413
Multivariate Gaussian Model
Univariate tests cannot find all anomalies if the anomaly depends on combinations of particular attribute values.
The univariate Gaussian approach can be generalized to the multivariate case by considering the multivariate Gaussian distribution.
Mahalanobis distance:
mahalanobis(⃗x, ⃗z) = (⃗x − ⃗z) Σ⁻¹ (⃗x − ⃗z)ᵀ
Here Σ is the covariance matrix of the data.
Intuitively, the Mahalanobis distance generalizes the squared Euclidean distance (⃗x − ⃗z)(⃗x − ⃗z)ᵀ by taking into account the "spread" of the data. 414
Mahalanobis Distance
Here, we assume a multivariate normal data distribution. The covariance matrix is
Σ = ( 1.00  0.75
      0.75  3.00 ) 415
Mahalanobis Distance
Notice that A has a larger Mahalanobis distance from the cluster's center than B, even though it is closer in the Euclidean distance. The reason is that the density function falls more rapidly in the direction of A than in the direction of B. 416
Likelihood-Based Detection
Assume that the dataset D contains objects from a mixture of two probability distributions:
▶ M, the distribution of the majority of normal objects,
▶ A, the distribution of anomalous objects.
Let us choose α > 0, indicating how rare the anomalies should be.
Let us assume that we have models of M and A allowing us to compute PM(x) and PA(x), the likelihoods of generating x. A is often considered uniform, and M is learned from the data (yes, distorted by the outliers).
Now, we compute the total likelihood of the dataset before and after removing each element. If the likelihood changes significantly, we have probably eliminated an anomaly. 417
Given a partition U, V of D, we define
L(U, V) = [(1 − α)^|U| · Π_{xi∈U} PM(xi)] · [α^|V| · Π_{xi∈V} PA(xi)]
This is the likelihood of generating D using the following random process: Generate x1, x2, . . . , xp sequentially and independently, where each xi is generated as follows:
▶ Randomly choose between normal and anomaly, where the probability of choosing anomaly is α.
▶ If normal has been chosen, choose xi from the distribution M.
▶ If anomaly has been chosen, choose xi from the distribution A.
To eliminate the "large" product, we consider the log-likelihood:
LL(U, V) = log(L(U, V)) = |U|·log(1 − α) + Σ_{xi∈U} log PM(xi) + |V|·log α + Σ_{xi∈V} log PA(xi) 418
Likelihood-Based Anomaly Detection
Start with the sets M0 = D and A0 = ∅. The algorithm computes M1, M2, . . . and A1, A2, . . . by moving elements from Mk to Ak as follows:
▶ For every i = 1, . . . , p such that xi ∈ Mk, compute
Δi = |LL(Mk, Ak) − LL(Mk ∖ {xi}, Ak ∪ {xi})|
▶ Consider i maximizing Δi. If Δi ≥ c, then
Mk+1 = Mk ∖ {xi}, Ak+1 = Ak ∪ {xi}
Else, stop.
Here, c is a threshold we must set up. Ultimately, Ak will contain the anomalies the algorithm detects.
Note that we may also move all anomalies detected in a single iteration from Mk to Ak. This would result in a different (also valid) method. 419
Summary of Statistical Anomaly Detection
▶ Builds on strong statistical foundations.
▶ The tests are very effective if the dataset is sufficiently large (informative).
▶ There are lots of tests for univariate data; the area has been developing for a long time.
▶ There are fewer options for multivariate data, and the approach is problematic for highly dimensional data (the curse of dimensionality).
The likelihood-based approach does not assume a particular distribution shape: It can be used with arbitrary models of PM and PA, including deep learning ones.
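A minimal sketch of the iterative Grubbs procedure described above, using scipy for the t-distribution quantile (the sample data and the value of α are made-up illustration values):

    import numpy as np
    from scipy import stats

    def grubbs_outliers(data, alpha=0.05):
        # repeatedly test G >= c_{alpha,p} and remove the most extreme value
        x, outliers = list(data), []
        while len(x) > 2:
            p = len(x)
            mean, s = np.mean(x), np.std(x, ddof=1)  # sample mean and std
            i = int(np.argmax(np.abs(np.array(x) - mean)))
            G = abs(x[i] - mean) / s
            # t such that P(T >= t) = alpha/(2p), T ~ t-dist., p-2 deg. of freedom
            t = stats.t.ppf(1 - alpha / (2 * p), df=p - 2)
            c = (p - 1) / np.sqrt(p) * np.sqrt(t**2 / (p - 2 + t**2))
            if G >= c:
                outliers.append(x.pop(i))
            else:
                break
        return x, outliers

    data = [2.1, 2.3, 1.9, 2.0, 2.2, 2.4, 9.8]   # made-up sample
    print(grubbs_outliers(data))                 # detects 9.8 as an outlier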
416 Likelihood-Based Detection
Assume that the dataset D contains objects from a mixture of two probability distributions:
▶ M, the distribution of the majority of normal objects,
▶ A, the distribution of anomalous objects.
Let us choose α > 0, indicating how rare the anomalies should be. Assume that we have models of M and A allowing us to compute PM(x) and PA(x), the likelihoods of generating x. A is often considered uniform, and M is learned from the data (yes, distorted by the outliers).
Now, we compute the total likelihood of the dataset before and after removing each element. If the likelihood changes significantly, we have probably eliminated an anomaly.

417 Given a partition U, V of D, we define

$$L(U, V) = \left((1-\alpha)^{|U|}\prod_{x_i \in U} P_M(x_i)\right)\left(\alpha^{|V|}\prod_{x_i \in V} P_A(x_i)\right)$$

This is the likelihood of generating D using the following random process: generate x1, x2, . . . , xp sequentially and independently, where each xi is generated as follows:
▶ Randomly choose between normal and anomaly, where the probability of choosing anomaly is α.
▶ If normal has been chosen, choose xi from the distribution M.
▶ If anomaly has been chosen, choose xi from the distribution A.
To eliminate the "large" product, we consider the log-likelihood:

$$LL(U, V) = \log L(U, V) = |U|\log(1-\alpha) + \sum_{x_i \in U}\log P_M(x_i) + |V|\log\alpha + \sum_{x_i \in V}\log P_A(x_i)$$

418 Likelihood-Based Anomaly Detection
Start with sets M0 = D and A0 = ∅. The algorithm computes M1, M2, . . . and A1, A2, . . . by moving elements from Mk to Ak as follows:
▶ For every i = 1, . . . , p such that xi ∈ Mk, compute $\Delta_i = |LL(M_k, A_k) - LL(M_k \setminus \{x_i\}, A_k \cup \{x_i\})|$
▶ Consider the i maximizing ∆i. If ∆i ≥ c, then set $M_{k+1} = M_k \setminus \{x_i\}$ and $A_{k+1} = A_k \cup \{x_i\}$. Else, stop.
Here, c is a threshold we must set up. Ultimately, Ak contains the anomalies detected by the algorithm.
Note that we may also move all anomalies detected in a single iteration from Mk to Ak at once. This would result in a different (also valid) method.

419 Summary of Statistical Anomaly Detection
▶ Builds on strong statistical foundations.
▶ Tests are very effective if the dataset is sufficiently large (informative).
▶ Lots of tests exist for univariate data; the area has been developing for a long time.
▶ There are fewer options for multivariate data, and high-dimensional data are problematic (curse of dimensionality).
The likelihood-based approach does not assume a particular distribution shape: it can be used with arbitrary models of PM and PA, including deep learning ones.

420 Proximity-Based Methods

421 Proximity-Based Outlier Detection
Assume a distance measure d. That is, given two feature vectors x⃗, z⃗, their distance is d(x⃗, z⃗). We consider the Euclidean distance for simplicity.
Definition 2 The outlier score of an object is given by the distance to its k-th nearest neighbor.
A threshold on the minimum distance of an outlier can be set on a training set.

422–423 Outlier Score
(Figures omitted: the outlier score based on the distance to the fifth nearest neighbor on several datasets and, for contrast, based on the distance to the first nearest neighbor.)

423 Proximity-Based Approaches
▶ Conceptually simple and easy to implement (see the sketch below).
▶ Time complexity is typically O(p²), where p is the number of feature vectors in the dataset.
▶ Sensitive to the choice of parameters (the number of neighbors).
▶ Density/compactness is not explicitly taken into account.
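A minimal scikit-learn sketch of the k-th-nearest-neighbor outlier score; the toy dataset, the choice k = 5, and the 95th-percentile threshold are illustrative assumptions, not prescribed values.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Outlier score = Euclidean distance to the k-th nearest neighbor."""
    # ask for k + 1 neighbors, because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, -1]                      # distance to the k-th neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),  # a normal cluster
               [[8.0, 8.0]]])                        # one planted outlier
scores = knn_outlier_scores(X, k=5)
threshold = np.percentile(scores, 95)            # threshold set on "training" data
print(X[scores > threshold])                     # the planted outlier should appear
```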
424 Density-Based Methods

425 Density-Based Approaches
Definition 3 The outlier score of an object is the inverse of the density around the object.
We consider the Local Outlier Factor (LOF) method: the local density of an object corresponds to the inverse of the average distance to its k-nearest neighbors. Compute the LOF score of an object A by
▶ computing the local density DA of A,
▶ computing the average D̄ of the local densities of the k-nearest neighbors of A,
▶ dividing the average local density by the local density of A, that is, computing D̄/DA.
The question is, how exactly should the local density be defined?

426 LOF - Illustration
Here, the density around A is smaller than the densities around the other points (figure omitted).

427 LOF - Formally
Let us fix k ∈ N. Define k-distance(A) as the distance from A to its k-th nearest neighbor. Denote by Nk(A) the set of all objects within distance k-distance(A) from A. Note that if several objects are at the same distance from A, Nk(A) may have more than k elements.
Define

$$\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\, d(A, B)\}$$

The reachability distance of A from B is the distance between A and B, but at least the distance from B to its k-th nearest neighbor.

428 Reachability Distance
(Figure omitted.)

429 LOF
Define the local reachability density

$$\mathrm{lrd}_k(A) = \left(\frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|}\right)^{-1}$$

That is the reciprocal of the average reachability distance of A from its k-nearest neighbors. We define the local outlier factor of an object A by

$$\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{|N_k(A)|}\,\Big/\,\mathrm{lrd}_k(A)$$

Now
▶ LOFk(A) ≈ 1 – A has a density similar to that of its neighbors
▶ LOFk(A) < 1 – A has a higher local density than its neighbors (inlier?)
▶ LOFk(A) > 1 – A has a smaller local density than its neighbors (outlier?)

430–431 Example
Consider a dataset D = {A, B, C, D} = {5, 2, 1, 0} with k = 2 and the distance d(U, V) = |U − V|.

Nk(A) = {B, C}
     k-distance(·)   d(A, ·)   reach-dist_k(A, ·)
B    2               3         3
C    1               4         4
lrd_k(A) = (½ (3 + 4))⁻¹ = 2/7

Nk(B) = {C, D}
     k-distance(·)   d(B, ·)   reach-dist_k(B, ·)
C    1               1         1
D    2               2         2
lrd_k(B) = (½ (1 + 2))⁻¹ = 2/3

Nk(C) = {B, D}
     k-distance(·)   d(C, ·)   reach-dist_k(C, ·)
B    2               1         2
D    2               1         2
lrd_k(C) = (½ (2 + 2))⁻¹ = 1/2

Nk(D) = {B, C}
     k-distance(·)   d(D, ·)   reach-dist_k(D, ·)
B    2               2         2
C    1               1         1
lrd_k(D) = (½ (2 + 1))⁻¹ = 2/3

Altogether, lrd_k(A) = 2/7, lrd_k(B) = 2/3, lrd_k(C) = 1/2, lrd_k(D) = 2/3, and

$$\mathrm{LOF}_k(A) = \tfrac{1}{2}(\mathrm{lrd}_k(B) + \mathrm{lrd}_k(C))/\mathrm{lrd}_k(A) = \tfrac{1}{2}(2/3 + 1/2)/(2/7) \approx 2.042$$
$$\mathrm{LOF}_k(B) = \tfrac{1}{2}(\mathrm{lrd}_k(C) + \mathrm{lrd}_k(D))/\mathrm{lrd}_k(B) = \tfrac{1}{2}(1/2 + 2/3)/(2/3) = 0.875$$
$$\mathrm{LOF}_k(C) = \tfrac{1}{2}(\mathrm{lrd}_k(B) + \mathrm{lrd}_k(D))/\mathrm{lrd}_k(C) = \tfrac{1}{2}(2/3 + 2/3)/(1/2) = 4/3 \approx 1.333$$
$$\mathrm{LOF}_k(D) = \tfrac{1}{2}(\mathrm{lrd}_k(B) + \mathrm{lrd}_k(C))/\mathrm{lrd}_k(D) = \tfrac{1}{2}(2/3 + 1/2)/(2/3) = 0.875$$

432 Another Example (figure omitted)

433 Yet Another Example (figure omitted)

434 Comments on Density-Based Methods
▶ As opposed to the distance-based methods, the density of the neighborhood of a (potential) outlier is explicitly taken into account.
▶ Time complexity is still O(p²) but may be reduced (to O(p log p)) for low-dimensional data using special data structures.
▶ We still need to determine the parameter k.
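The hand computation above can be cross-checked with scikit-learn's LocalOutlierFactor, which follows the same LOF definition and stores the negated scores in negative_outlier_factor_. With n_neighbors = 2 on the example dataset, it should print approximately the values derived above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[5.0], [2.0], [1.0], [0.0]])   # the points A, B, C, D

lof = LocalOutlierFactor(n_neighbors=2)      # k = 2, as in the example
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier

# sklearn stores the *negated* LOF scores
print(-lof.negative_outlier_factor_)         # expected approx. [2.042 0.875 1.333 0.875]
```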
435 Clustering-Based Methods

436 Clustering Approach
Clustering is supposed to group similar objects; anomaly detection looks for objects dissimilar to the others. So, clustering inherently solves a closely related problem. But how do we detect anomalies based on clustering?
We may detect anomalies as elements of small clusters. This approach is problematic, as it strongly depends on the size/number of clusters. A better approach is to assess how strongly objects belong to their clusters.
Definition 4 An object is a cluster-based outlier if the object does not strongly belong to any cluster.
Depending on the particular clustering algorithm, we have (at least some) information about the cohesion of objects.

437 Anomaly Detection Using k-Means Clustering
In k-means, the clusters are represented by centroids. We may measure the anomaly of a given object by its distance to the cluster centroid. This approach does not take the density of clusters into account.

438 Anomaly Detection Using k-Means Clustering
Here, we compute the relative distance to the cluster center, that is, the ratio of the object's distance to the centroid and the median distance of all objects of the same cluster to the centroid (a code sketch follows below).

439 Clustering-Based Anomaly Detection
Other algorithms may provide different information (probability density of the clusters, etc.).
The usual problem: outliers have an impact on the clustering itself. Some algorithms can be modified to treat potential outliers specially. For example, during k-means clustering, put objects very far away from the current cluster centroids into a special category not used to move the centroids. After every step, test whether some of these objects have become close enough to at least one centroid (in which case they are removed from the special category).
Setting the proper number of clusters is an issue as well. For anomaly detection, a larger number of smaller clusters may be beneficial, as smaller clusters tend to be more cohesive. If an object seems to be an outlier even with small clusters, it probably is an outlier.

440 Comments on Clustering-Based Methods
Some clustering methods have sub-quadratic complexity. Conceptually, clustering complements anomaly detection, so it is natural to compute both together. On the other hand, the results strongly depend on the number of clusters, and anomalies may distort the clustering.
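A small sketch of the relative-distance idea with scikit-learn's KMeans; the toy data, the number of clusters, and the cut-off of 3 (three times the median distance) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outlier_scores(X, n_clusters=3, seed=0):
    """Score = distance to own centroid / median distance within that cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    scores = np.empty(len(X))
    for c in range(n_clusters):
        in_c = km.labels_ == c
        scores[in_c] = dist[in_c] / np.median(dist[in_c])  # relative distance
    return scores

rng = np.random.default_rng(0)
# three compact clusters plus one planted outlier
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 0], [0, 5])] + [[[10.0, 10.0]]])
scores = kmeans_outlier_scores(X)
print(X[scores > 3])   # objects far from their centroid, relative to the cluster
```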
441 Autoencoders as Anomaly Detectors
... just a short comment

442 Neural Networks in Anomaly Detection
The previous approaches work well with (relatively) low-dimensional data. However, what should we do with data of high or even variable dimension?
▶ Images
▶ Text
▶ Video
▶ Semantic graphs
▶ ...
One way is to transform them into lower-dimensional data: train a neural network that takes the large input and returns a smaller representation which (hopefully) preserves the crucial features of the input.

443 Autoencoders
An autoencoder consists of two parts:
▶ ϕ : R^n → R^m, the encoder
▶ ψ : R^m → R^n, the decoder
The goal is to find ϕ, ψ so that ψ ◦ ϕ is (almost) the identity. The value h⃗ = ϕ(x⃗) is called the latent representation of x⃗.
Training: Assume T = {x⃗1, . . . , x⃗p} where x⃗i ∈ R^n for all i ∈ {1, . . . , p}. Minimize the reconstruction error

$$E = \sum_{i=1}^{p} \left\| \vec{x}_i - \psi(\varphi(\vec{x}_i)) \right\|^2$$

444 Autoencoders – Neural Networks
Both ϕ and ψ can be represented by MLPs Mϕ and Mψ, respectively. Mϕ and Mψ can be connected into a single network.

445 Autoencoders – Usage
▶ Compression/feature extraction – from x⃗ to h⃗.
▶ Dimensionality reduction – the latent representation h⃗ has a smaller dimension.
▶ Generative versions – (roughly) generate h⃗ from a known distribution and let Mψ generate realistic inputs/outputs x⃗.
▶ Anomaly detection (see the next slide).

446 Anomaly Detection with Autoencoders
Straightforward approach:
▶ Train the autoencoder on the normal data.
▶ Detect an anomaly by its large reconstruction error.
The idea is that an anomaly will not be properly reconstructed.
More general approach:
▶ Train the autoencoder and transform the normal data to their low-dimensional latent representations.
▶ Use one of the previous approaches to anomaly detection on the latent representations.
The assumption is that the latent representation of an anomaly will substantially differ from the latent representations of normal instances.

447 Summary of Anomaly Detection
▶ It is hard even to define an anomaly – context knowledge is usually needed.
▶ Supervised anomaly detection reduces to classification.
▶ Unsupervised anomaly detection can be done in various ways; we have seen the following:
▶ Statistical
▶ Proximity-based
▶ Density-based
▶ Cluster-based
▶ Autoencoders
▶ Semi-supervised anomaly detection is usually concerned with data where the normal class is known.
▶ There are many more methods: one-class SVM, isolation forests, etc.

448 Ensemble Methods

449 Voting Classifiers
Train several models. They may differ in
▶ Structure (completely different models)
▶ Training data (subsample the training set)

450 Voting Classifiers
During inference, ensemble the predictions. For binary classifiers, you may do the following:
▶ Take the majority vote.
▶ Aggregate the output probabilities (e.g., by averaging).
▶ If logistic regression or neural networks with a logistic output activation are used, aggregate the outputs before the application of the last logistic sigmoid.

451 The Ensemble Idea
An ensemble of weak classifiers may be a strong classifier.
Analogy: Assume a biased coin with a 51 percent chance of heads and 49 percent of tails. Toss it 1,000 times in a row and repeat the experiment many times. In about 75 percent of the experiments, you get more heads than tails. With 10,000 tosses, it would be about 97 percent.
Intuitively, we might suspect that a voting ensemble of 1,000 sufficiently independent classifiers, each with an accuracy of around 51 percent, would have an accuracy of around 70 percent.
The ensemble methods work best when the models are as independent as possible. Use different learning methods or independently chosen training sets (see later slides).
Let us try to formalize the above intuition.

452 Bias/Variance Decomposition
Consider a random function g(x) + ε, where g : R → R and ε is random noise with mean 0 and variance σ². Consider a dataset for regression: D = {(x1, d1), . . . , (xp, dp)}. Here x1, . . . , xp are fixed inputs, and each dk = g(xk) + εk, where ε1, . . . , εp are independent samples of the noise ε.
We get different models for different samples of the training set D. We may ask:
▶ How much do the models trained on different samples differ?
▶ How much do they differ (on average) from g?
Let us denote by hD : R → R a model trained on the dataset D by minimizing the squared error $\sum_{k=1}^{p}(h_D(x_k) - d_k)^2$.

453 Bias/Variance Decomposition
Let us consider the expected value of the squared error:

$$\mathbb{E}\left[\sum_{k=1}^{p}(h_D(x_k) - d_k)^2\right] = \sum_{k=1}^{p}\mathbb{E}\left[(h_D(x_k) - d_k)^2\right]$$

The expectation is computed over the distribution of the datasets D. One can prove that

$$\sum_{k=1}^{p}\mathbb{E}\left[(h_D(x_k) - d_k)^2\right] = \sum_{k=1}^{p}\left(\mathbb{E}\left[(h_D(x_k) - \mathbb{E}[h_D(x_k)])^2\right] + \left(\mathbb{E}[h_D(x_k)] - g(x_k)\right)^2 + \sigma^2\right)$$

Here
▶ $\mathbb{E}\left[(h_D(x_k) - \mathbb{E}[h_D(x_k)])^2\right]$ is the variance of the model on xk: how much the model's value jumps around its average on xk.
▶ $\mathbb{E}[h_D(x_k)] - g(x_k)$ is the bias of the model: how much the average model's value differs from the true value.
▶ σ² is the noise variance.

454 (Illustration of the decomposition; figure omitted.)

455 B/V Decomposition Proof (Optional)
Let us study $\mathbb{E}\left[(h_D(x_k) - d_k)^2\right]$. To simplify notation, we drop the index k and write $\mathbb{E}\left[(h_D(x) - d)^2\right]$.

$$(h_D(x) - d)^2 = (h_D(x) - \mathbb{E}[h_D(x)] + \mathbb{E}[h_D(x)] - d)^2 = (h_D(x) - \mathbb{E}[h_D(x)])^2 + (\mathbb{E}[h_D(x)] - d)^2 + 2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - d)$$

Now, just apply E to both sides. As

$$\mathbb{E}(d) = \mathbb{E}(g(x) + \varepsilon) = g(x) \qquad \mathbb{E}(d^2) = \mathbb{E}\left((g(x) + \varepsilon)^2\right) = \mathbb{E}\left(g(x)^2 + 2\varepsilon g(x) + \varepsilon^2\right) = g(x)^2 + \sigma^2$$

we obtain ...

456 B/V Decomposition Proof (Optional)
... this

$$\mathbb{E}\left[(\mathbb{E}[h_D(x)] - d)^2\right] = \mathbb{E}[h_D(x)]^2 - 2\,\mathbb{E}(d)\,\mathbb{E}[h_D(x)] + \mathbb{E}(d^2) = \mathbb{E}[h_D(x)]^2 - 2g(x)\,\mathbb{E}[h_D(x)] + g(x)^2 + \sigma^2 = \left(\mathbb{E}[h_D(x)] - g(x)\right)^2 + \sigma^2$$

and this

$$\begin{aligned} \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - d)\right] &= \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - g(x) - \varepsilon)\right] \\ &= \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - g(x)) - 2(h_D(x) - \mathbb{E}[h_D(x)])\,\varepsilon\right] \\ &= \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - g(x))\right] \\ &= (\mathbb{E}[h_D(x)] - g(x))\,\mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])\right] = 0 \end{aligned}$$

Here, the third equality follows from the zero mean of ε, and the final expectation is zero because $\mathbb{E}[h_D(x) - \mathbb{E}[h_D(x)]] = 0$.

457 Ensemble Methods vs Bias/Variance
Suppose that we could somehow sample m independent datasets D1, . . . , Dm. We train m models hD1, . . . , hDm and consider the ensemble model

$$h_{\mathrm{ens}}(x) = \frac{1}{m}\sum_{\ell=1}^{m} h_{D_\ell}(x)$$

Does the ensemble hens have a different bias/variance decomposition?
▶ The noise variance σ² does not change.
▶ The bias does not change ($\mathbb{E}[h_{D_\ell}(x_k)]$ is independent of ℓ):

$$\mathbb{E}\left[\frac{1}{m}\sum_{\ell=1}^{m} h_{D_\ell}(x_k)\right] - g(x_k) = \mathbb{E}[h_{D_\ell}(x_k)] - g(x_k)$$

▶ Only the variance changes; it is reduced:

$$\mathbb{E}\left[(h_{\mathrm{ens}}(x_k) - \mathbb{E}[h_{\mathrm{ens}}(x_k)])^2\right] = \frac{1}{m}\,\mathbb{E}\left[(h_{D_\ell}(x_k) - \mathbb{E}[h_{D_\ell}(x_k)])^2\right]$$

The equality follows from basic facts about the variance of a sum of independent random variables.
Now, having m independently sampled datasets is a real luxury!

458 Bagging
In practice, we use sampled subsets of a given dataset. One example of using different subsets of the training set is bagging (Bootstrap Aggregating).
▶ Bootstrap sampling = sampling data subsets with replacement
▶ Aggregating = majority voting (classification) or averaging (regression)

459 Bagging Decision Trees
(Figure omitted.) Left: a single decision tree (overfit). Right: a bagging ensemble of 500 trees. A code sketch of this comparison follows the summary below.

460 Summary of Bagging
▶ Reduces overfitting (variance)
▶ Can work with any type of classifier
▶ Easy to parallelize: just train each model independently; inference can also be parallelized.
▶ Loses (some) interpretability even for interpretable models (how could you read 500 decision trees?)
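A minimal scikit-learn sketch of bagging decision trees in the spirit of the figure above (a single tree vs. an ensemble of 500 trees); the moons dataset and all hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# a single, most likely overfit, decision tree ...
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# ... vs. a bagging ensemble of 500 trees
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # base_estimator in older sklearn
    n_estimators=500,     # number of trees
    bootstrap=True,       # sample the training subsets with replacement
    n_jobs=-1,            # trees are independent, so training parallelizes
    random_state=42,
).fit(X_train, y_train)

print(tree.score(X_test, y_test), bag.score(X_test, y_test))
```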
461 Random Forest
Random forest = a bagging ensemble of decision trees. It has all the hyperparameters of decision trees + the number of trees.
The random forest algorithm of scikit-learn introduces the following additional randomness: when searching for a split attribute, only a randomly sampled subset of the attributes is considered (by default, √n out of the n attributes). This gives a greater diversity of the trees in the forest.
The notion of feature importance can be easily generalized to random forests (computed over all trees).

462 Boosting
An ensemble method in which weak learners are trained sequentially. Each new learner tries to correct the errors of the current ensemble.
There are several variants of boosting. We just have a look at the basic idea behind AdaBoost (adaptive boosting). To implement this kind of learning algorithm, we need to be able to train models on weighted datasets.

463 Weighted Training
We are going to use weights on samples. Note that these are different from the weights used in neural-like models!
▶ Linear regression: Assume a given dataset D = {(x⃗1, f1), . . . , (x⃗p, fp)} and weights a1, . . . , ap associated with the examples from D. Let w⃗ be the vector of model weights. We minimize the weighted mean squared error

$$E(\vec{w}) = \frac{1}{p}\sum_{i=1}^{p} a_i (\vec{w} \cdot \vec{x}_i - f_i)^2$$

▶ Decision trees: Given a dataset D with p samples, we have weights a1, . . . , ap ∈ R (one weight for each example in D). Given a class c ∈ C, we compute the class probability pc as

$$p_c = \frac{\sum\{a_i \mid \text{the } i\text{-th sample belongs to } c\}}{\sum_{i=1}^{p} a_i}$$

Everything else (Gini impurity, etc.) stays the same.

464 AdaBoost Idea
▶ Train classifiers using weighted samples in the training dataset.
▶ Start with uniform weights, that is, each sample in the dataset has the same weight.
▶ In the k-th iteration:
▶ train a new classifier hk on the weighted samples,
▶ obtain a coefficient αk > 0 of the classifier hk,
▶ consider the current ensemble classifier

$$h^{k}_{\mathrm{ens}}(x) = \mathrm{sign}\left(\sum_{\ell=1}^{k} \alpha_\ell h_\ell(x)\right)$$

and shift the weights so that the "most wrong" training instances for the current ensemble have the largest weights. Consider a training example (x, c) with c ∈ {−1, 1}. Then h^k_ens misclassifies x iff $c \cdot \sum_{\ell=1}^{k} \alpha_\ell h_\ell(x) < 0$; the more negative this number is, the more wrong the ensemble classifier is.
After K iterations, the ensemble classifier h^K_ens is the output. For details, see more advanced courses in machine learning.

465 Summary of Ensemble Methods
▶ Ensemble methods may improve "independently" trained models by grouping them and letting them decide collectively.
▶ Technically, ensembling decreases the model variance.
▶ There are several approaches to ensembling:
▶ Bagging - training several models on bootstrap subsamples of the training dataset (random forest is an example)
▶ Boosting - the ensemble is built sequentially, each newly added model focusing on the instances the current ensemble finds hardest
▶ There are many more algorithms (gradient boosting, etc.)

466 THE END