IB031 Úvod do strojového učení (Introduction to Machine Learning)
Tomáš Brázdil

Course Info
Resources:
▶ Lectures & tutorials (the main source)
▶ Many books, few perfect for the introductory level. One relatively good one, especially its first part: A. Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media; 3rd edition, 2022
▶ (Almost) infinitely many online courses, tutorials, materials, etc.

Evaluation
The evaluation is composed of three parts:
▶ Mid-term exam: Written exam on the material of the first half of the semester.
▶ End-term exam: The "big" one, covering everything from the semester (with possibly more stress on the second half).
▶ Projects: During tutorials, you will work on larger projects (in pairs).
Each part contributes the following number of points:
▶ Mid-term exam: 25
▶ End-term exam: 50
▶ Project: 25
To pass, you need to obtain at least 60 points.

Distinguishing Properties of the Course
▶ Introductory, prerequisites are held to a minimum
▶ Formal and precise: Be prepared for a complete and "mathematical" description of the presented methods.
I assume that you have basic knowledge of
▶ Elementary mathematical notation (operations on sets, logic, etc.)
▶ Linear algebra: Vectors in Rⁿ, operations on vectors (including the dot product). Geometric interpretation!
▶ Calculus: Functions of multiple real variables, partial derivatives, basic differential calculus.
▶ Probability: Notion of probability distribution, random variables/vectors, expectation.

What Is Machine Learning?
Machine learning is the science (and art) of programming computers so they can learn from data.
Here is a slightly more general definition:
Arthur Samuel, 1959: Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
And a more engineering-oriented one:
Tom Mitchell, 1997: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

Example
In the context of spam filtering:
▶ The task T is to flag spam in new emails.
▶ The experience E is represented by a set of emails labeled either spam or ham by hand (the training data).
▶ The performance measure P could be the accuracy, i.e., the ratio of the number of correctly classified emails to all emails. There are many more performance measures; we will study the basic ones later.
In the context of housing price prediction:
▶ The task T is to predict prices of new houses based on their basic parameters (size, number of bathrooms, etc.).
▶ The experience E is represented by information about existing houses.
▶ The performance measure P could be, e.g., the absolute difference between the predicted and real price.

Examples (cont.)
In the context of game playing:
▶ The task T is to play chess.
▶ The experience E is represented by a series of self-plays where the computer plays against itself.
▶ The performance measure P is winning/losing the game. Here, the trick is to spread the delayed and limited feedback about the result of the game throughout the individual decisions in the game.
In the context of customer behavior:
▶ The task T is to group customers with similar shopping habits in an e-shop.
▶ The experience E consists of lists of items individual customers bought in the shop.
▶ The performance measure P? Measure how "nicely" the customers are grouped
(whether people with similar habits, as seen by humans, fall into the same group).

Comparison of Programming and Learning
How to code a spam filter?
▶ Examine what spam mails typically contain: specific words ("Viagra"), sender's address, etc.
▶ Write down a rule-based system that detects specific features.
▶ Test the program on new emails and (most probably) go back to look for more spam features.

The machine learning way:
▶ Study the problem and collect lots of emails, labeling them spam or ham.
▶ Train a machine learning model that reads an email and decides whether it is spam or ham.
▶ Test the model and (most probably) go back to collect more data and adjust the model.

ML Solutions Are Adaptive
Spam filter: Authors of spam might and will adapt to your spam filter (possibly change the wording to pass through). ML systems can be adjusted to new situations by retraining on new data (unless the data becomes ugly).

ML for Human Understanding
Spam filter: A trained system can be inspected for notorious spam features. Some models allow direct inspection, such as decision trees or linear/logistic regression models.

Usage of Machine Learning
Machine learning suits various applications, especially where traditional methods fall short. Here are some areas where it excels:
▶ Solving complex problems where fine-tuning and rule-based solutions are inadequate.
▶ Tackling complex issues that resist traditional problem-solving approaches.
▶ Adapting to fluctuating environments through retraining on new data.
▶ Gaining insights from large and complex datasets.
In summary, machine learning offers innovative solutions and adaptability for today's complex and ever-changing problems, (sometimes) providing insights beyond the reach of traditional approaches.

Types of Learning
The main categories, based on the information available during training, are:
▶ Supervised learning
▶ Unsupervised learning
▶ Semi-supervised learning
▶ Self-supervised learning
▶ Reinforcement learning

Supervised Learning
Labels are available for all input data. Typical supervised learning tasks are
▶ Classification, where the aim is to classify inputs into (typically few) classes (e.g., the spam filter, where the classes are spam/ham)
▶ Regression, where a numerical value is output for a given input (e.g., housing prices)

Unsupervised Learning
No labels are available for input data. Typical unsupervised learning tasks are
▶ Clustering, where inputs are grouped according to their features (e.g., clients of a bank grouped according to their age, wealth, etc.)
▶ Association, where interesting relations and rules are discovered among the features of inputs (e.g., market basket mining, where associations between various types of goods are learned from the behavior of customers)
▶ Dimensionality reduction, which reduces high-dimensional data to few dimensions (e.g., images to a few image features)

Semi-Supervised Learning
Labels are available only for some data. For example, medical data, where an elaborate diagnosis is available only for some patients. Combines supervised and unsupervised learning: e.g., cluster all data and label the unlabeled inputs with the most common labels in their clusters.

Self-Supervised Learning
Generate labels from (unlabeled) inputs. The goal is to learn typical features of the data. The model can later be modified to generate images, classify, etc.

Reinforcement Learning
Learn from performing actions and getting feedback from the environment.
ML Applications Highlights
▶ ChatGPT (and similar generative models)
  ▶ The basis is a generative language model, i.e., a text-generating model trained on texts in a self-supervised way
  ▶ Currently extended to multimodal versions (text, image, sound)
▶ Machine translation, image captioning
  ▶ Google Translate, etc.
  ▶ Typically (semi-)supervised learning
▶ Various image recognition and processing tasks
  ▶ In medicine, where it is slowly making its way into hospitals as assistance tools
  ▶ Automotive, advertising, quality control, etc.
▶ Science
  ▶ Chemistry & biology: e.g., prediction of features of chemical compounds (AlphaFold)
▶ Various "table" data processing in finance, management, etc.
  ▶ Often straightforward methods (linear/logistic regression)
  ▶ Essential but not fancy
▶ Game playing: More fancy than useful; learning models beating humans in several difficult games.

ML in Context

Supervised Learning

Example - Fruit Recognition
The goal: Create an automatic system for fruit recognition, concretely apples, lemons, and mandarins.
Inputs: Measurements of the height and width of each fruit.
Suppose we have a dataset of dimensions of several fruits labeled with the correct class.

Data
Use similarity to solve the problem.

KNN Classification
Given a new fruit, what is it? Find the five closest examples. Among the five closest:
▶ M = 4 mandarins
▶ A = 1 apple
▶ L = 0 lemons
It is a mandarin! Where is the machine learning?

Learning in Fruit Classification with KNN
Learning: memorize the labeled dataset. Inference: classify by the nearest neighbors.

Fruit Classification Algorithm
Input: A fruit F with dimensions height, width
Output: mandarin, lemon, apple
1: Find K examples {E1, . . . , EK} in the dataset whose dimensions are closest to the dimensions of the fruit F
2: Count the number of examples of each class in {E1, . . . , EK}: M mandarins, L lemons, A apples
3: if M ≥ L and M ≥ A then return mandarin
4: else if L ≥ A then return lemon
5: else return apple
6: end if
Does it work?

Testing the Model for Fruit Classification
Consider a test set of new instances (K = 5, d is Euclidean): Perfect classification of new data! Just deploy and sell!!

K Nearest Neighbors
. . . on ideal data

Learning and Inference
Two crucial components of machine learning are the following:
Learning: creating the model. Inference: using the model.

Training Data
Assume tabular training data, i.e., of the form

x11 x12 · · · x1n | c1
x21 x22 · · · x2n | c2
 ⋮   ⋮        ⋮  |  ⋮
xp1 xp2 · · · xpn | cp

Formally, we define the training dataset
T = {(x⃗k, ck) | k = 1, . . . , p}
Here each x⃗k ∈ Rⁿ is an input vector and ck ∈ C is the correct class. For the fruit example:
T = {((4.0, 6.5), M), ((4.47, 7.13), M), ((6.49, 7.0), A), . . .}

KNN: Learning
Consider the training set T = {(x⃗k, ck) | k = 1, . . . , p} and memorize it exactly as it is. Store it in a table. Possibly use a clever representation allowing fast computation of nearest neighbors, such as k-d trees (out of the scope of this lecture). Also,
▶ determine the number of neighbors K ∈ N,
▶ and the distance measure d.

Inference in KNN
Assume a KNN "trained" by memorizing T = {(x⃗k, ck) ∈ Rⁿ × C | k = 1, . . . , p}, a constant K ∈ N, and a distance measure d. For d, consider the Euclidean distance, but different norms may also be used to define different distance measures.
Input: A vector z⃗ = (z1, . . . , zn) ∈ Rⁿ
Output: A class from C
1: Find K indices of examples X = {i1, . . . , iK} ⊆ {1, . . . , p} with minimum distance to z⃗, i.e., satisfying
max{d(z⃗, x⃗ℓ) | ℓ ∈ X} ≤ min{d(z⃗, x⃗ℓ) | ℓ ∈ {1, . . . , p} ∖ X}
2: For every c ∈ C count the number #c of elements ℓ of X such that cℓ = c
3: Return some cmax ∈ argmax_{c∈C} #c, i.e., a class cmax ∈ C maximizing #c

The Resulting Model
What exactly constitutes the model? The model consists of
▶ The trained parameters: in this case, the memorized training data.
▶ The hyperparameters set "from the outside": in this case, the number of neighbors K and the distance measure d.
Note that different settings of K lead to different classifiers (for the same d). A minimal implementation of the inference procedure is sketched below.
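A minimal sketch of KNN inference, assuming Python with NumPy; the tiny fruit training set is hypothetical, in the spirit of the table above:

import numpy as np

def knn_classify(X_train, y_train, z, K=5):
    """Return the majority class among the K training examples closest to z."""
    dists = np.linalg.norm(X_train - z, axis=1)    # Euclidean distance d
    nearest = np.argsort(dists)[:K]                # indices of the K nearest neighbors
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]              # some cmax maximizing #c

# Hypothetical memorized training set T: (height, width) -> class
X = np.array([[4.0, 6.5], [4.47, 7.13], [6.49, 7.0], [7.2, 7.5]])
y = np.array(["M", "M", "A", "A"])
print(knn_classify(X, y, np.array([4.2, 6.8]), K=3))  # -> "M"

Note that "training" produced no computation at all; all the work happens at inference time, which is exactly the trade-off discussed above.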
In Practice
. . . to get an efficient solution:
▶ Deal with issues in the data
  ▶ Data almost always come in weird formats, with inconsistencies, missing values, wrong values, etc.
  ▶ Data rarely have the ideal form for a given learning model. We need to ingest, validate, and preprocess the data.
▶ Deal with issues in the model
  ▶ In KNN, the training just memorizes the examples, but at least K can be tuned. We need to tune the model.
▶ Deal with the wrong model by testing and validating in conditions as realistic as possible.
▶ Deal with deployment: real-world application issues involving, e.g., implementation in embedded devices with limited resources.

Models Considered in This Course
Throughout this course, we will meet the following models:
▶ KNN (already did)
▶ Decision trees
▶ (Naive) Bayes classifier
▶ Clustering: K-means and hierarchical
▶ Linear and logistic regression
▶ Support vector machines (SVM)
▶ Kernel linear models
▶ Neural networks (light intro to feed-forward networks)
▶ Ensemble methods + random forests
▶ (maybe some reinforcement learning)
. . . but first, let us see the whole machine learning pipeline.

Machine Learning Pipeline

Fetch Data
Always start with
▶ The problem formulation & understanding. For example, diagnosis of diabetes from medical records. What information is (possibly) sufficient for such a diagnosis?
▶ Finding data sources. In our example, the sources are hospitals. It would be best to persuade them to give you the data and sign a contract.
▶ Collecting the data. In our example, the data is possibly small (just tables with results of tests). But for other diagnoses, you may include huge amounts of data from MRI, CT, etc. Then, the collection itself might be a serious technical problem.
▶ Integrating data from various sources. A serious diagnostic system must be trained/tested on data from many hospitals. You must blend the data from various sources (different formats, etc.).

For simple "toy" machine learning projects, you may fetch prepared datasets from various databases on the internet.
The data should be stored in an identified location and versioned. You will probably keep adding data and training models on ever-changing datasets. You have to be able to keep track of the changes and map training data to particular models. Tools such as MLflow or Weights & Biases might be helpful.

Data Separation
At this point, you should randomize the ordering of the data and set aside a test set to be used in model evaluation! The test data are supposed to simulate the actual conditions, i.e., they should be "unseen".

Data Exploration
Compute basic statistics to identify missing values, outliers, etc. Both steps are sketched below.
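A sketch of the separation and exploration steps, assuming Python with pandas and scikit-learn; the file name "data.csv" is hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")

# Randomize and hold out a test set to be touched only in the final evaluation.
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

# Basic exploration on the training part: value ranges, missing values, duplicates.
print(train_df.describe())          # count/mean/std/min/max per numeric column
print(train_df.isna().sum())        # missing values per column
print(train_df.duplicated().sum())  # number of duplicate rows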
Clean Data
The cleaning usually comprises the following steps:
▶ Fix or remove incorrect or corrupted values.
▶ Identify outliers and decide what to do with them. Outliers may harm some training methods and are not "representative". However, sometimes they naturally belong to the dataset, and expert insight is needed.
▶ Fix formatting. For example, a date may be expressed in many ways, as may a simple Yes/No answer.
▶ Resolve missing values (by either removing the whole examples or imputing). Many methods have been developed for missing-value imputation. It is a sensitive issue because new values may strongly bias the model.
▶ Remove duplicates.
The above steps often affect the training and need expertise in the application domain. Later in this course, we will discuss techniques for data cleaning.

Prepare Data
Unlike cleaning, which is application-dependent, data preparation/transformation is model-dependent. This usually subsumes:
▶ Scaling: Setting values of inputs to a similar range. Some models, especially those utilizing distance, are sensitive to large differences in input scales.
▶ Encoding: Encode non-numeric data using real-valued vectors. Many models, especially those based on geometry, work only with numeric data. Non-numeric data such as Yes/No or Short/Medium/Long must be encoded appropriately.
▶ Binning or discretization: Convert continuous features into discrete bins to capture patterns in ranges.
Comment: Sometimes normalization, that is, changing the distribution of inputs to resemble the normal distribution, is mentioned. This step is typically not essential for machine learning itself; it is, however, important when statistical inference is used to test the significance of learned parameters.

▶ Feature selection: Throw out input features that are too "similar" to other features. For example, if the temperature is measured both in Celsius and in Kelvin, keep one of them. The relationship can, of course, be a more complex (non-linear) correlation.
▶ Dimensionality reduction: Transforming data from Rⁿ to Rᵐ where m ≪ n. Growing dimension means growing difficulty of training for all models. Some models cease to work for high-dimensional data. The reduction typically searches for a few important characteristic features of the inputs.
▶ Feature aggregation: Introducing new features using operations on the original ones. We will see kernel transformations later in this course, allowing simple models to solve complex problems.
A sketch of typical scaling and encoding steps follows.
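A sketch of model-dependent preparation with scikit-learn: scaling numeric columns and one-hot encoding a categorical one. The column names and values are hypothetical:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical cleaned training data with one categorical column.
train_df = pd.DataFrame({
    "height": [160.0, 175.0, 182.0, 158.0],
    "weight": [55.0, 72.0, 90.0, 50.0],
    "drink_size": ["small", "large", "medium", "small"],
})

prep = ColumnTransformer([
    ("scale", StandardScaler(), ["height", "weight"]),  # bring inputs to similar ranges
    ("encode", OneHotEncoder(), ["drink_size"]),        # symbols -> 0/1 vectors
])

# Fit the transformation on training data only and reuse it on new data,
# so no information from the test set leaks into preprocessing.
X_train = prep.fit_transform(train_df)
print(X_train.shape)  # 4 examples, 2 scaled + 3 one-hot columns -> (4, 5)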
Train Model
Now that the dataset has been cleaned, we may train a model. Before training, we should split the dataset into
▶ a training dataset, on which the model will learn,
▶ a validation dataset, on which we fine-tune hyperparameters.
The resulting model is obtained after several iterations of the above process.

Evaluate Model
Here, we use the test set that we separated during data fetching.
In some cases, a brand new test set can be generated; for example, patients are examined regularly, creating new records continuously. In other cases, it is tough to obtain new data; for example, new expensive and difficult measurements may be needed.
Critical issue: Make sure that you are truly testing exactly the whole inference process. Often, just the model is tested while the testing and production inference engines are separate. This leads to truly nasty errors in production!
We will discuss various generic metrics helpful in measuring the quality of the resulting model.

Deploy to Production
Deployment of machine learning models is a complex, application-dependent question. The recently emerging area of MLOps is concerned with the engineering side of model deployment.
From the technical point of view, the typical issues solved by MLOps teams are
▶ how to extract/process data in real time
▶ how much storage is required
▶ how to store/collect model (and data) artifacts/predictions
▶ how to set up APIs, tools, and software environments
▶ what the period of predictions should be (instantaneous or batch predictions)
▶ how to set up hardware requirements (or cloud requirements for on-cloud environments) according to the computational resources required
▶ how to set up a pipeline for continuous training and parameter tuning

From the user's point of view:
▶ How to get sensible and valuable user output?
  ▶ AI researchers will be satisfied with tons of running text in terminals.
  ▶ "Normal" people need a graphical interface with understandable output.
  ▶ Experts working in other domains typically demand extreme speed and clarity.
▶ How do you persuade users that the AI is working for them?
  ▶ Especially if safety is at stake, you need to have outstanding arguments and explanations ready for end users.
  ▶ In many areas (medicine, automotive), ML-based devices need to be certified.
This complex subject will only be touched on in this course.

Monitor, Collect Data
Deployed machine learning models must be constantly monitored. Because of the influx of new data, ML models work in highly dynamic environments. For example, an image-processing medical diagnostic model suddenly misdiagnosed a patient because a nurse marked the sample with a marker pen. Every customer has a different infrastructure and may produce data slightly differently.
Data for retraining and improvement should be stored. Also, many areas allow active learning, where users provide feedback for (continuous) retraining of the models.

Data

Data Science Example
You receive data from a medical researcher concerning a project that you are eager to work on. The data consists of a table with 1,000 rows and five columns:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
· · ·
The aim is to predict the last field given the others. The medical researcher does not elaborate further on the data, but it seems pretty easy to work with, right? After a few days, you have trained a model that predicts numbers resembling the ones in the table. You contact the medical researcher and discuss the results.

Model Discussion
Researcher: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few interesting results.
Researcher: Amazing. There were so many data issues with this set of patients that I couldn't do much.
Data Miner: Oh? I didn't hear about any possible problems.
Researcher: Well, first, there is field 5, the variable we want to predict. It's common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?
Data Miner: No.
Researcher: But surely you heard about what happened to field 4? It's supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but because of a data entry error, all 10's were changed into 0's.
Unfortunately, since some of the patients have missing values for this field, it's impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Researcher: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Researcher: Anyway, given all those problems, I'm surprised you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
Researcher: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Researcher: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it isn't very sensible. Sorry.
OK, what's the point? You have to understand the task you want to solve and the data!

Data Objects
Data objects represent entities we work with (e.g., classify them). For example, in cancer prediction, the data objects are patients. In fruit classification, the data objects are individual fruits.
Data objects are described by attributes (or features, or variables). For example, the age, weight, genetic profile, and other characteristics of a patient. Or the width and height of a fruit.

Attributes vs Features vs Variables
The name differs from field to field, so the following names are usually used as synonyms:
▶ Attributes - used mostly by database and data mining experts.
▶ Features - used mostly by machine learning experts.
▶ Variables - used mostly by statisticians.
One may make some distinctions:
▶ Attributes represent information about the object without any additional assumptions.
▶ Features assume that their values are somewhat characteristic of the object.
▶ Variables assume that there is some process behind them (typically a random process in the case of statistics).

Data Types - Categorical Attributes
Categorical attributes (nominal attributes) are symbols or names of things.
▶ Each value represents some kind of category, code, or state.
▶ Values are not ordered and should not be used quantitatively (in computer science, such values are known as enumerations).
▶ Examples:
hair color ∈ {black, brown, blond, red, auburn, gray, white}
marital status ∈ {single, married, divorced, widowed}
customer ID ∈ {0, 1, 2, . . .}
Even though the last one is usually expressed using numbers, it should not be used quantitatively.
Binary attributes are categorical attributes with only two values.

Data Types - Ordinal Attributes
An ordinal attribute is an attribute whose values have a meaningful order or ranking among them. Examples:
drink size ∈ {small, medium, large}
grades ∈ {A, B, C, D, E, F}
It can also be obtained by discretizing numeric quantities into a series of intervals. Ordinal attributes do not allow arithmetic operations.
Categorical and ordinal attributes are called qualitative attributes. Next, we look at numeric, i.e., quantitative attributes.

Data Types - Numeric Attributes
Numeric attributes are quantities represented by numbers. We distinguish two types: interval-scale and ratio-scale.

Interval scale:
▶ Measurement interval: equal intervals between consecutive points.
▶ Absolute zero: lacks a true zero point.
▶ Statistical analysis: limited to addition and subtraction.
▶ Meaningful ratios: ratios are not meaningful, due to the lack of a true zero.
▶ Examples: IQ scores, Celsius temperature, NPS data, etc.

Ratio scale:
▶ Measurement interval: equal intervals, with the presence of a true zero.
▶ Absolute zero: possesses a true zero point.
▶ Statistical analysis: allows for meaningful multiplication and division.
▶ Meaningful ratios: ratios are meaningful, due to the presence of a true zero.
▶ Examples: height, weight, income, etc.
Discrete vs Continuous Attributes
Often, two kinds of numeric attributes are distinguished:
▶ Discrete: A finite or countably infinite range of values, i.e., the values may be represented by integers. Some (but not all) authors count the qualitative (categorical, ordinal) attributes among the discrete attributes.
▶ Continuous: An uncountably infinite range of values, typically an interval.
There are several more or less formal definitions of continuous attributes in the literature. For example:
▶ All non-discrete variables.
▶ Variables with an infinite number of values between any two values.
▶ Variables whose values are measured (??).
Deeper characteristics of data (statistical properties, etc.) will be examined at tutorials.

Decision Trees
▶ One of the widely used methods of machine learning.
▶ Intuitively simple, directly explainable.
▶ Basis for random forests (a powerful model).
▶ We will consider the ID3 algorithm (Quinlan, 1979) and various adjustments that appear in C4.5, CART, etc.

Consider the weather forecast for tennis playing. How would you decide whether to play today? How do we obtain such a tree based on experience/data?

Learning Decision Trees
Consider data represented as follows:
▶ A finite set of attributes 𝒜 = {A1, . . . , An}.
▶ Each attribute A ∈ 𝒜 has its set of values V(A).
We start with trees on discrete datasets, that is, we assume V(A) finite for all A ∈ 𝒜.
Objects to be classified are described by vectors of values of all attributes:
x⃗ = (x1, . . . , xn) ∈ V(A1) × · · · × V(An)
Given x⃗ and an attribute Ak, we denote by Ak(x⃗) the value xk of the attribute Ak in x⃗.
Consider a set C of classes. We consider multiclass classification in general, i.e., C is an arbitrary finite set.
Example
The tennis problem:
▶ The attributes are A1 = Outlook, A2 = Temperature, A3 = Humidity, A4 = Wind.
▶ The sets of values of the attributes:
  ▶ V(A1) = {Sunny, Overcast, Rain}
  ▶ V(A2) = {Hot, Mild, Cool}
  ▶ V(A3) = {High, Normal}
  ▶ V(A4) = {Strong, Weak}
▶ Consider x⃗ = (Overcast, Hot, Normal, Weak) ∈ V(A1) × V(A2) × V(A3) × V(A4). Then
A3(x⃗) = Humidity(x⃗) = Normal
A4(x⃗) = Wind(x⃗) = Weak
▶ C = {Yes, No}

Decision Trees
Consider (directed, rooted) trees T = (T, E) where T is a set of nodes and E ⊆ T × T is a set of directed edges. Denote by Tleaf ⊆ T the set of all leaves of the tree and by Tint the set T ∖ Tleaf of internal nodes.
A decision tree is a tree T = (T, E) where
▶ each leaf τ ∈ Tleaf is assigned a class C(τ) ∈ C,
▶ each internal node τ ∈ Tint is assigned an attribute A(τ) ∈ 𝒜,
▶ and there is a bijection between the edges from τ and the values of the attribute A(τ).
Given an edge (τ, τ′) ∈ E, we write V(τ, τ′) to denote the value of the attribute A(τ) assigned to the edge.
Inference: Given an input x⃗, we traverse the tree from the root to a leaf, always choosing edges labeled with the values of attributes from x⃗. The output is the class labeling the leaf.

Example
T = {O, H, W, z1, z2, z3, z4, z5}
Tleaf = {z1, z2, z3, z4, z5}, Tint = {O, H, W}
E = {(O, H), (O, W), (H, z1), (H, z2), (O, z3), (W, z4), (W, z5)}
C(z1) = C(z4) = No, C(z2) = C(z3) = C(z5) = Yes
A(O) = Outlook, A(H) = Humidity, A(W) = Wind
V(O, H) = Sunny, V(O, z3) = Overcast, V(O, W) = Rain
V(H, z1) = High, V(H, z2) = Normal
V(W, z4) = Strong, V(W, z5) = Weak
Inference: For (Rain, Hot, High, Strong) we reach z4, yielding No. (A code sketch of this tree follows below.)

Training Dataset
Consider a training dataset
D = {(x⃗k, ck) | k = 1, . . . , p}
Here x⃗k ∈ V(A1) × · · · × V(An) and ck ∈ C for every k. Technically, D can be a multiset containing several occurrences of the same vector.
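A sketch of the tree from the example above as a nested Python structure, with inference by traversal. The representation (tuples and dicts) is my own illustration, not notation from the lecture:

tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, x):
    """Traverse from the root, following edges labeled by attribute values of x."""
    while isinstance(node, tuple):   # internal node tau: (attribute A(tau), edges)
        attribute, edges = node
        node = edges[x[attribute]]   # follow the edge labeled V(tau, tau')
    return node                      # leaf: the class C(tau)

x = {"Outlook": "Rain", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))  # -> "No" (the leaf z4 under Rain/Strong)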
Index | Outlook | Temperature | Humidity | Wind | PlayTennis
1 | Sunny | Hot | High | Weak | No
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rain | Mild | High | Weak | Yes
5 | Rain | Cool | Normal | Weak | Yes
6 | Rain | Cool | Normal | Strong | No
7 | Overcast | Cool | Normal | Strong | Yes
8 | Sunny | Mild | High | Weak | No
9 | Sunny | Cool | Normal | Weak | Yes
10 | Rain | Mild | Normal | Weak | Yes
11 | Sunny | Mild | Normal | Strong | Yes
12 | Overcast | Mild | High | Strong | Yes
13 | Overcast | Hot | Normal | Weak | Yes
14 | Rain | Mild | High | Strong | No

D = {((Sunny, Hot, High, Weak), No), ((Sunny, Hot, High, Strong), No), . . . , ((Rain, Mild, High, Strong), No)}

Learning Decision Trees
The learning algorithm ID3 works as follows:
▶ Start with the whole training dataset D.
▶ If there is just a single class in D, create a single-node decision tree that returns the class.
▶ Otherwise, identify an attribute A ∈ 𝒜 which best classifies the examples in D. For every v ∈ V(A) we obtain
D_v = {(x⃗, c) ∈ D | A(x⃗) = v}
We aim to have each D_v as pure as possible, that is, ideally, to contain examples of just a single class.
▶ Finally,
  ▶ create a root node τ of a decision tree,
  ▶ assign the attribute A to τ,
  ▶ for every v ∈ V(A), recursively construct a decision tree with a root τ_v using D_v,
  ▶ for every v ∈ V(A), introduce an edge (τ, τ_v) assigned v.

1: function ID3(dataset D, attribute set 𝒜)
2:   Create a root node τ for the tree
3:   if D = ∅ then
4:     Return the single node τ assigned a default class.
5:   else if all examples in D are of the same class c then
6:     Return the single-node tree where τ is assigned c
7:   else if the set of attributes 𝒜 is empty then
8:     Return the single-node tree where τ is assigned the most common class in D
9:   else
10:    Choose the attribute A ∈ 𝒜 best classifying the examples in D
11:    Set the decision attribute for τ to A
12:    for each value v ∈ V(A) do
13:      Compute a decision tree ID3(D_v, 𝒜 ∖ {A}) with root τ_v,
14:      add a new edge (τ, τ_v) assigned v.
15:    end for
16:  end if
17:  return τ
18: end function

Best Classifying Attribute
We aim to choose an attribute that best informs us about the class. As a result, we would use as few attributes as possible and obtain a small tree containing only class-relevant decisions.
How to choose an attribute that best classifies the examples in D? There are several measures used in practice. The most common are
▶ information gain
▶ Gini impurity decrease

Information Gain
The information gain is based on the notion of entropy. We need some notation:
▶ Given a training dataset D and a class c ∈ C, we denote by p_c the proportion of examples with class c in D.
▶ We define the entropy of D by
Entropy(D) = −∑_{c∈C} p_c · log₂ p_c
▶ The information gain of an attribute A is then defined by
Gain(D, A) = Entropy(D) − ∑_{v∈V(A)} (|D_v| / |D|) · Entropy(D_v)
In every step of the ID3 algorithm, we choose an attribute maximizing the information gain for the current dataset D.

The intuition behind information gain:
▶ Consider C = {0, 1} and let p be the proportion of examples of class 1. The entropy, as a function of p, measures the "uncertainty" of the class: it is 0 for p ∈ {0, 1} and maximal at p = 1/2.
▶ ∑_{v∈V(A)} (|D_v| / |D|) · Entropy(D_v) is the weighted uncertainty of classes in the sets D_v (weighted by the relative size of D_v).
▶ Gain(D, A) measures the reduction in the uncertainty of the class achieved by splitting D according to A. A small computation on the tennis data is sketched below.
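A minimal sketch computing entropy and information gain on the tennis data above; the standard values are Entropy(D) ≈ 0.940 and Gain(D, Outlook) ≈ 0.247:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(rows, labels, attribute_index):
    """Entropy(D) minus the weighted entropy of the split into the D_v, v in V(A)."""
    total, n = entropy(labels), len(labels)
    for v in set(row[attribute_index] for row in rows):
        sub = [c for row, c in zip(rows, labels) if row[attribute_index] == v]
        total -= len(sub) / n * entropy(sub)
    return total

# The 14 examples; attributes are (Outlook, Temperature, Humidity, Wind).
rows = [("Sunny","Hot","High","Weak"), ("Sunny","Hot","High","Strong"),
        ("Overcast","Hot","High","Weak"), ("Rain","Mild","High","Weak"),
        ("Rain","Cool","Normal","Weak"), ("Rain","Cool","Normal","Strong"),
        ("Overcast","Cool","Normal","Strong"), ("Sunny","Mild","High","Weak"),
        ("Sunny","Cool","Normal","Weak"), ("Rain","Mild","Normal","Weak"),
        ("Sunny","Mild","Normal","Strong"), ("Overcast","Mild","High","Strong"),
        ("Overcast","Hot","Normal","Weak"), ("Rain","Mild","High","Strong")]
labels = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
print(round(gain(rows, labels, 0), 3))  # Outlook -> 0.247, the largest gain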
Gini Impurity
▶ We define the Gini impurity of D by
Gini(D) = 1 − ∑_{c∈C} p_c²
▶ The impurity decrease of an attribute A is then defined similarly to the gain in the entropy case:
ImpDec(D, A) = Gini(D) − ∑_{v∈V(A)} (|D_v| / |D|) · Gini(D_v)
In every step of the ID3 algorithm, we choose an attribute maximizing the impurity decrease for the current dataset D.

What is the intuition behind Gini(D)? Assume we randomly and independently choose objects from D. Then 1 − ∑_{c∈C} p_c² is the probability of choosing two objects of different classes in two consecutive independent trials. Indeed, p_c is the probability of choosing an object of class c, p_c² the probability of choosing an object of class c twice, and ∑_{c∈C} p_c² the probability of choosing two objects of the same class.
In what follows (and at the exam), we will work only with the Gini impurity as it is easier to compute.

Example
Consider our tennis example (see the table).
▶ Consider the whole dataset D.
  ▶ p_Yes = 9/14
  ▶ p_No = 5/14
  ▶ Gini(D) = 1 − (9/14)² − (5/14)² = 0.45918
▶ For A = Outlook we get
  ▶ Gini(D_Sunny) = 1 − (2/5)² − (3/5)² = 0.48
  ▶ Gini(D_Overcast) = 1 − 1² − 0² = 0
  ▶ Gini(D_Rain) = 1 − (3/5)² − (2/5)² = 0.48
Thus
ImpDec(D, Outlook) = 0.459 − (5/14) · 0.48 − (4/14) · 0 − (5/14) · 0.48 = 0.117
▶ ImpDec(D, Temperature) = 0.018
▶ ImpDec(D, Humidity) = 0.091
▶ ImpDec(D, Wind) = 0.030
So the largest impurity decrease is achieved by Outlook.

Going further on, consider D = D_Sunny. We get
▶ ImpDec(D, Temperature) = 0.279
▶ ImpDec(D, Humidity) = 0.48
▶ ImpDec(D, Wind) = 0.013
The best attribute to split on after the Sunny branch of Outlook is Humidity.
Now consider D = D_Rain:
▶ ImpDec(D, Temperature) = 0.013
▶ ImpDec(D, Humidity) = 0.013
▶ ImpDec(D, Wind) = 0.48
The best attribute to split on after the Rain branch of Outlook is Wind.
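The same computation with the Gini impurity, a minimal sketch reproducing ImpDec(D, Outlook) ≈ 0.117 (reusing rows and labels from the previous sketch):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def imp_dec(rows, labels, attribute_index):
    """Gini(D) minus the weighted Gini impurity of the split into the D_v."""
    result, n = gini(labels), len(labels)
    for v in set(row[attribute_index] for row in rows):
        sub = [c for row, c in zip(rows, labels) if row[attribute_index] == v]
        result -= len(sub) / n * gini(sub)
    return result

# With rows/labels as defined in the information-gain sketch:
# imp_dec(rows, labels, 0) -> 0.1163...  (Outlook, the best attribute)
# imp_dec(rows, labels, 2) -> 0.0918...  (Humidity)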
Continuous-Valued Attributes
What if the values of an attribute A come from a continuous domain? That is, A is a numerical attribute that can take any value in an interval, such as temperature, size, time, etc.
Consider an internal node τ ∈ Tint assigned such a continuous attribute A. Then
▶ τ is assigned a threshold value, called a cut point, H ∈ R,
▶ there are two edges e_true, e_false from τ,
▶ e_true is labeled with True and e_false with False.
During inference, when considering an example x⃗ in the node τ,
▶ evaluate A(x⃗) ≤ H,
▶ if A(x⃗) ≤ H, then follow e_true,
▶ else follow e_false.
In training, the cut point is chosen from the attribute values occurring in the training set, using information gain/impurity decrease as for discrete attributes.

Iris Example
Attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Classes (Variety): Setosa, Versicolor, Virginica

The dataset (150 examples):
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Variety
5.5 | 3.5 | 1.3 | 0.2 | Setosa
6.8 | 2.8 | 4.8 | 1.4 | Versicolor
6.7 | 3.1 | 4.7 | 1.5 | Versicolor
6.9 | 3.1 | 5.1 | 2.3 | Virginica
7.3 | 2.9 | 6.3 | 1.8 | Virginica
5.4 | 3.7 | 1.5 | 0.2 | Setosa
4.6 | 3.4 | 1.4 | 0.3 | Setosa
6.2 | 2.8 | 4.8 | 1.8 | Virginica
5.4 | 3.0 | 4.5 | 1.5 | Versicolor
4.7 | 3.2 | 1.6 | 0.2 | Setosa
6.7 | 3.3 | 5.7 | 2.1 | Virginica
5.0 | 3.4 | 1.5 | 0.2 | Setosa
5.0 | 3.0 | 1.6 | 0.2 | Setosa
4.4 | 2.9 | 1.4 | 0.2 | Setosa
6.0 | 3.4 | 4.5 | 1.6 | Versicolor
5.1 | 3.5 | 1.4 | 0.2 | Setosa
6.6 | 3.0 | 4.4 | 1.4 | Versicolor
5.9 | 3.2 | 4.8 | 1.8 | Versicolor
5.6 | 2.8 | 4.9 | 2.0 | Virginica
· · ·

Iris Example - Decision Tree

Iris Example - Decision Tree Boundaries
If the leaves are split further, the Depth = 2 boundary would be added.

Attribute Importance Computation
How important are the attributes for the trained tree T? That depends on
▶ how close they are to the root of T,
▶ how large an information gain/decrease in impurity they give.
There are several formulae for computing the importance. One example is the mean decrease in impurity, defined as follows. Consider a decision tree T trained on a dataset D using ID3. For every node τ of T, denote by D[τ] the subset of D which was used in the ID3 procedure when the node τ was created (line 2). Consider an attribute A and denote by T[A] ⊆ Tint the set of all nodes of T assigned the attribute A by ID3 (line 11). Then define the importance as the average decrease in Gini impurity (i.e., average ImpDec) over the nodes of T[A]:
GiniImportance(A) = ∑_{τ∈T[A]} (|D[τ]| / |D|) · ImpDec(D[τ], A)
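A sketch using scikit-learn's decision tree on the Iris data; its feature_importances_ attribute implements exactly this mean decrease in impurity (the printed values are indicative, not exact figures from the lecture):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")  # the petal measurements dominate for Iris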
Decision Trees: Practical Issues
▶ Data preprocessing
▶ Model tuning (overfitting and underfitting)
▶ Sensitivity to changes in data/hyperparameters
▶ Learning representation problems (the XOR)

Data Preprocessing
Little preprocessing is needed for decision trees. Of course, ideally, clean up the data, but
▶ Missing values are not such a big issue (when considering a concrete example, exclude the attributes with missing values).
▶ Outliers are not such a big issue either; the division into nodes is based on relative counts, not concrete values.
▶ Decision trees can cope with continuous and categorical values directly (i.e., no encoding is necessary).
Imbalanced classes, however, might cause problems because of the small information gain/impurity decrease achieved by splitting.

Imbalanced Classes
Consider a dataset D where
▶ there are two classes, C = {0, 1},
▶ 10⁶ examples have the class 1,
▶ 100 examples have the class 0.
Then
▶ p₁ = 10⁶/(10⁶ + 100) ≈ 1 and p₀ = 100/(10⁶ + 100) ≈ 0,
▶ thus the Gini impurity 1 − p₁² − p₀² ≈ 0.
Consider an attribute A with V(A) = {a, b}. Splitting D according to A gives two sets D_a and D_b. What is the impurity decrease caused by the attribute?
ImpDec(D, A) = Gini(D) − (|D_a| / |D|) · Gini(D_a) − (|D_b| / |D|) · Gini(D_b)
For small |D_a| (say ≤ 1000) we have small |D_a|/|D|. For not so small D_a we have Gini(D_a) ≈ 0. In both cases, the impurity decrease is very small.

Model Tuning - Over/Under Fitting
The behavior of the model on the training set:
▶ The left one is strongly overfitting. It would possibly not work well on new data.
▶ The right one is strongly underfitting. It would probably give poor classification results.
▶ The middle one seems good (but still needs to be tested on fresh data).

Model Tuning - Overfitting in Decision Trees
See the overfitting on the left and the "nice" model on the right. Both overfitting and underfitting are best avoided. But how do we find out?

Model Tuning (In General)
Recall from the first lecture: The validation should be done on a validation set separated from the training set. We will discuss more sophisticated techniques later.
What hyperparameters to set? (See the next slide.)
What to observe? In the case of decision trees, one should observe the difference between performance measures (e.g., classification accuracy) on the training and validation sets. A too-large difference implies an improperly fitting model.

How to Fit Decision Trees?
There are several approaches available for decision trees. Generally, overfitting can be either prevented or resolved.
▶ Pre-pruning: Build the tree so that it does not overfit, by restricting its size.
▶ Post-pruning: Overfit with a large tree and remove subtrees to make it smaller.
▶ Ensemble methods: Fit several different trees and let them classify together (e.g., using majority voting).
The post-pruning approach has been more successful in practice than pre-pruning because it is usually hard to say when to stop growing the tree. We shall meet this controversy also in deep learning, where recent history shows a similar phenomenon. The ensemble methods will be covered later when we discuss random forests.

Pre-Pruning - Hyperparameters
Hyperparameters controlling the size of the tree:
▶ Maximum depth - do not grow the tree beyond the maximum depth. The deeper the tree, the more complex the models you can create ⇒ overfitting. Low depth may restrict expressivity.
▶ Minimum number of examples to split a node - if D[τ] is small, τ becomes a leaf (labeled with the majority class). Higher values prevent the model from learning relations specific to only a few examples. Too high values may result in underfitting.
▶ Minimum number of examples required to be in a leaf - similar to the previous one. A higher number means we cannot have very specific branches concerned with particular combinations of values.
▶ Minimum information gain/impurity decrease - a small impurity decrease means that the split does not contribute much to the classification (the class proportions after the split are similar to the proportions before it). However, keep in mind that it is the weighted average impurity after the split.

Post-Pruning - Reduced Error Pruning
Train a large tree and then remove nodes that make the classification worse on the validation set. Given a decision tree T and its internal node τ ∈ Tint, we denote by T₋τ the tree obtained from T by removing the subtree rooted in τ, i.e., τ is a leaf of T₋τ.
1: Train T to maximum fit on the training dataset.
2: while true do
3:   Err[T] ← the error of T on the validation set.
4:   for τ ∈ Tint do
5:     Err[T₋τ] ← the error of T₋τ on the validation set.
6:   end for
7:   if Err[T] ≤ min{Err[T₋τ] | τ ∈ Tint} then return T
8:   else
9:     T ← argmin{Err[T₋τ] | τ ∈ Tint}
10:  end if
11: end while
The error Err[T] can be any measure of the "badness" of the decision tree T, for example 1 − Accuracy.

Other Pruning Methods
There are more pruning methods.
▶ Rule post-pruning:
  ▶ Transform the tree into a set of rules. Rules correspond to paths in the tree; they have the form of an implication: specific values of attributes imply a class.
  ▶ Remove attribute conditions from the premises of the implications. This gives a more refined pruning: instead of removing a whole subtree, each path of the subtree can be pruned individually.
▶ Using a cost complexity measure: Evaluate trees not only based on the classification error but also based on their size. Typically, introduce regularization into the error function: given a decision tree T,
Errα(T) = Err(T) + α|T|
The original paper by Breiman et al. (1984) defined Err(T) to be the misclassification rate on the training dataset and |T| the number of nodes of the tree T. A sketch of this approach, as implemented in scikit-learn, follows.
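A sketch of cost-complexity pruning with scikit-learn: the library computes the "effective" alphas at which subtrees get pruned away, and we keep the alpha with the best validation error. The Iris split here is just for illustration; the exact numbers will vary:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),  # validation accuracy
)
print(best.get_n_leaves(), best.score(X_val, y_val))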
Sensitivity to Small Changes and Randomness
▶ Decision trees are sensitive to small changes in data and hyperparameters. Several attributes may provide (almost) identical information gain and yet divide the training dataset very differently.
▶ Some implementations choose attributes partially at random (e.g., scikit-learn). You may get completely different trees.
A solution is to train an ensemble of many decision trees and then use majority voting for classification. This is the fundamental idea behind random forests (see later lectures).

Iris - Illustration
Decision trees trained on the Iris dataset. Iris Setosa is perfectly separated by many choices for the first split.

Axis Sensitivity
Decision trees make divisions along particular axes. That is, rotated data may result in a completely different model. That is why decision trees are often preceded by a principal component analysis (PCA) transformation, which aligns data along the axes of maximum data variance.

XOR Training Problem
Consider the following training dataset: An ideal decision tree would look like this:

Attempts at Training on XOR
Max depth = 2. The problem: Both information gain and decrease in impurity consider only the relationship between a single attribute and the class. However, here there is no relationship between any single attribute and the class; both attributes need to be considered together!

More Attempts at Training on XOR
Max depth = 3: It's better but still fails occasionally.

Advantages of Decision Trees
▶ Simple to understand and interpret; trees can be visualized.
▶ A white-box model, whose conditions are easily explained by Boolean logic.
▶ Can approximate an arbitrary (reasonable) boundary and capture complex non-linear relationships between attributes.
▶ Capable of handling multi-class problems.
▶ Little data preparation, unlike other techniques requiring normalization, dummy variables, or missing-value removal.
▶ Handles numerical and categorical data.
▶ Not sensitive to outliers, since the splitting is based on the proportions of examples within the split ranges and not on absolute values.
▶ The cost of using a well-balanced tree is logarithmic in the number of data points used to train it.

Disadvantages of Decision Trees
▶ Overfitting: Trees can be over-complex and not generalize well, needing pruning or limits on tree depth.
▶ Instability: Small data variations can result in very different trees. This is mitigated in ensemble methods.
▶ Non-smooth predictions: Decision trees make piecewise constant approximations, which are not suitable for extrapolation.
▶ Difficulty expressing certain concepts, such as XOR, parity, or multiplexer problems.
▶ Bias in trees: Decision trees can create biased trees if some classes dominate. Balancing the dataset is recommended.
▶ Learning optimal trees is NP-complete: Heuristic (e.g., greedy) algorithms are used, which do not guarantee globally optimal trees. Ensemble methods can help.

History of Decision Trees
▶ Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
▶ In the late 1970s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
▶ Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
▶ In the 1980s, various improvements were introduced to handle noise, continuous features, and missing features, and splitting criteria were improved. Various expert-system development tools resulted.
▶ Quinlan's updated decision-tree package (C4.5) was released in 1993.

Comment on Regression Trees
Decision trees can also be used to approximate functions: assign a function value to the leaves instead of classes. Here, "mse" is the mean squared error. Intuitively, for every subinterval of x1, the predicted value (the red line) is the average y over the subinterval. How are the subintervals being set? A quick sketch with scikit-learn follows; the formal training criterion comes after it.
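A sketch of a regression tree fit to noisy 1-D data with scikit-learn (the data is synthetic, for illustration; max_depth=2 gives the piecewise-constant steps described above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x1 = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)  # single input feature x1
y = np.sin(x1).ravel() + rng.normal(0, 0.1, 80)     # noisy target values

reg = DecisionTreeRegressor(max_depth=2)  # splits chosen to minimize squared error
reg.fit(x1, y)
print(reg.predict([[1.0], [4.0]]))        # one constant value per x1 subinterval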
Regression Trees
A regression tree is a decision tree whose leaves are labeled by values from R. During inference on an input x⃗, we follow the same procedure as in decision trees.
How exactly are such trees trained? We aim to minimize the mean squared error. Assume, for the moment, that we want to train a single-node tree. What will be the best labeling value for the single node? (The average desired value, which minimizes the mean squared error.)
Now, consider training arbitrary trees. Assume ordinal attributes. The algorithm also works for discrete attributes; ordinal attributes, however, allow us to make binary splits. The training procedure is the same as for decision trees, except that the splits and cut points are selected differently.

Given a dataset D = {(x⃗1, d1), . . . , (x⃗p, dp)}, we denote by D̄ the average desired value in D, that is, D̄ = (1/p) ∑_{k=1}^p d_k.
Consider a dataset D and let us find a cut point for an attribute A in D. We are looking for a value H of the attribute A such that the split
D≤H = {(x⃗, d) ∈ D | A(x⃗) ≤ H}
D>H = {(x⃗, d) ∈ D | A(x⃗) > H}
minimizes the following split error:
(1/|D≤H|) · ∑_{(x⃗,d)∈D≤H} (d − D̄≤H)² + (1/|D>H|) · ∑_{(x⃗,d)∈D>H} (d − D̄>H)²
Denote by MinE(D) the minimum of the split error over all attributes A and all cut points H. Compute
Δ = (1/|D|) · ∑_{(x⃗,d)∈D} (d − D̄)² − MinE(D)
If Δ is large enough, split on the A and H that minimize the split error. Otherwise, stop splitting and label the leaf with D̄.

Without any lower bound on the number of examples in the leaves, the algorithm will eventually overfit by splitting into (possibly) singleton leaves. The cut-point search for a single attribute is sketched below.
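A minimal sketch of the cut-point search for one attribute: try each observed value as the threshold H and pick the one minimizing the split error defined above (the four-point dataset is made up for illustration):

import numpy as np

def best_cut(x, d):
    """x: values of attribute A; d: desired outputs. Returns (H, split error)."""
    best = (None, float("inf"))
    for H in np.unique(x)[:-1]:          # candidate thresholds among observed values
        lo, hi = d[x <= H], d[x > H]
        err = ((lo - lo.mean()) ** 2).mean() + ((hi - hi.mean()) ** 2).mean()
        if err < best[1]:
            best = (H, err)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0])
d = np.array([0.1, 0.2, 0.9, 1.0])
print(best_cut(x, d))  # -> (2.0, 0.005): separates the low and high outputs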
Probabilistic Classification

Probabilistic Classification – Idea
Imagine that
▶ I look out of a window and see a bird,
▶ it is black, approx. 25 cm long, and has a rather yellow beak.
My daughter asks: What kind of bird is this? My usual answer: This is probably a kind of blackbird (kos černý in Czech).
Here probably means that out of my extensive catalog of four kinds of birds that I can recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say that the largest proportion of birds with similar features I have ever seen were blackbirds.
The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory
▶ A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
Experiment: Roll a die once. Sample space: Ω = {1, . . . , 6}.
▶ Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
  ▶ f(ω) ∈ [0, 1] for all ω ∈ Ω,
  ▶ ∑_{ω∈Ω} f(ω) = 1.
If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, . . . , 6}.
▶ An event is any subset E of Ω.
▶ The probability of a given event E ⊆ Ω is defined as
P(E) = ∑_{ω∈E} f(ω)
Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.
▶ Basic laws: P(Ω) = 1, P(∅) = 0; for disjoint sets A, B we have P(A ∪ B) = P(A) + P(B); and P(Ω ∖ A) = 1 − P(A).

Conditional Probability and Independence
▶ P(A | B) is the probability of A given B (assuming P(B) > 0), defined by
P(A | B) = P(A ∩ B)/P(B)
(We assume that B is all and only the information known.)
A fair die: What is the probability that 3 is rolled, assuming that an odd number is rolled? . . . and assuming that an even number is rolled?
▶ A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A, B are independent iff P(A | B) = P(A).

Random Variables and Random Vectors
▶ A random variable X is a function X : Ω → R.
A die: X : {1, . . . , 6} → {0, 1} such that X(n) = n mod 2.
▶ A random vector is a function X : Ω → Rᵈ. We use X = (X1, . . . , Xd), where Xi is the random variable returning the i-th component of X.
▶ Consider random variables X1, X2, and Y. The variables X1, X2 are conditionally independent given Y if for all x1, x2, and y we have
P(X1 = x1, X2 = x2 | Y = y) = P(X1 = x1 | Y = y) · P(X2 = x2 | Y = y)

Random Vectors – Example
Let Ω be a space of colored geometric shapes that are divided into two categories (1 and 0). Assume a random vector X = (X_color, X_shape, X_cat) where
▶ X_color : Ω → {red, blue},
▶ X_shape : Ω → {circle, square},
▶ X_cat : Ω → {1, 0}.
The following tables give the probability distribution of the values:
Category 1: red circle 0.2, red square 0.02, blue circle 0.02, blue square 0.01.
Category 0: red circle 0.05, red square 0.3, blue circle 0.2, blue square 0.2.

Example: P(red, circle, 1) = P(X_color = red, X_shape = circle, X_cat = 1) = 0.2
"Summing over" all possible values of some variable(s) gives the distribution of the rest:
P(red, circle) = P(X_color = red, X_shape = circle) = P(red, circle, 1) + P(red, circle, 0) = 0.2 + 0.05 = 0.25
P(red) = 0.2 + 0.02 + 0.05 + 0.3 = 0.57
Thus also, all conditional probabilities can be computed:
P(1 | red, circle) = P(red, circle, 1) / P(red, circle) = 0.2 / 0.25 = 0.8

Bayesian Classification
Let Ω be a sample space (a universum) of all objects that can be classified. We assume a probability P on Ω. We consider the problem of binary classification:
▶ Let Y be the random variable for the category, which takes values in {0, 1}.
▶ Let X be the random vector describing the n features of a given instance, i.e., X = (X1, . . . , Xn).
▶ Denote by x⃗ ∈ Rⁿ the values of X, and by x_i ∈ R the values of X_i.
Bayes classifier: Given a vector of feature values x⃗,
C_Bayes(x⃗) := 1 if P(Y = 1 | X = x⃗) ≥ P(Y = 0 | X = x⃗), and 0 otherwise.
Intuitively, C_Bayes assigns to x⃗ the most probable category it might be in.

Bayesian Classification – Example
Imagine a conveyor belt with apples and apricots. A machine is supposed to correctly distinguish apples from apricots based on their weight and diameter. That is,
▶ Y ∈ {1, 0} (here our interpretation is 1 = apple, 0 = apricot),
▶ X = (X_weight, X_diam).
We are given a fruit of a diameter of 5 cm that weighs 40 g. The Bayes classifier compares P(Y = 1 | X = (40g, 5cm)) with P(Y = 0 | X = (40g, 5cm)) and selects the more probable category given the features.
Crucial question: Is such a classifier good? There are other classifiers, e.g., one which compares the weight divided by 10 with the diameter and decides based on the answer, or maybe a classifier that sums the weight and the diameter and compares the result with a constant, etc.

Bayes Classifier
Let C be an arbitrary classifier, that is, a function that to every feature vector x⃗ ∈ Rⁿ assigns a class from {0, 1}. Define the error of the classifier C by
E_C = P(Y ≠ C)
(Here we slightly abuse notation and apply C to samples; technically, we apply the composition C ∘ X of C and X, which first determines the features using X and then classifies according to C.)
Theorem: The Bayes classifier C_Bayes minimizes E_C, that is,
E_{C_Bayes} = min_{C is a classifier} E_C
When the whole joint distribution is known, as in the colored-shapes example above, the Bayes classifier can be computed directly; a sketch follows.
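A sketch of the Bayes classifier with a fully known joint distribution, using the colored-shapes table above (reproducing P(1 | red, circle) = 0.8):

joint = {  # (color, shape, category) -> probability, as in the tables above
    ("red", "circle", 1): 0.20, ("red", "square", 1): 0.02,
    ("blue", "circle", 1): 0.02, ("blue", "square", 1): 0.01,
    ("red", "circle", 0): 0.05, ("red", "square", 0): 0.30,
    ("blue", "circle", 0): 0.20, ("blue", "square", 0): 0.20,
}

def posterior(color, shape, category):
    """P(category | color, shape), by summing the category out of the joint."""
    evidence = joint[(color, shape, 0)] + joint[(color, shape, 1)]
    return joint[(color, shape, category)] / evidence

print(posterior("red", "circle", 1))  # -> 0.8, so C_Bayes(red, circle) = 1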
Practical Use of Bayes Classifier
The crucial problem: The probability P is not known! In particular, where do we get P(Y = 1 | X = x⃗)? Note that P(Y = 0 | X = x⃗) = 1 − P(Y = 1 | X = x⃗). Given no other assumptions, this requires a table giving the probability of the category 1 for each possible feature vector x⃗. Where do you get these probabilities?
In some cases, the probabilities might come from knowledge of the solved problem (e.g., applications in physics might be supported by a theory giving the probabilities). In most cases, however, P is estimated from sampled data by
P̄(Y = 1 | X = x⃗) = (number of samples with Y = 1 and X = x⃗) / (number of samples with X = x⃗)
(We use P̄ to denote an estimate of P from data.)

Estimating P
Consider a problem with X = (X1, X2, X3), where each Xi returns either 0 or 1. What might the data look like?
Part of the data table:
Y | X1 X2 X3
1 | 1 0 1
1 | 0 1 1
0 | 1 0 1
0 | 0 0 1
1 | 0 0 0
0 | 1 1 1
· · ·
All data with X1 = 1, X2 = 0, X3 = 1:
Y | X1 X2 X3
1 | 1 0 1
1 | 1 0 1
0 | 1 0 1
0 | 1 0 1
1 | 1 0 1
1 | 1 0 1
Estimate: P̄(1 | 1, 0, 1) = 2/3
The probability table and the necessary data are typically too large! Concretely, if all X1, . . . , Xn are binary, there are 2ⁿ probabilities P(Y = 1 | X = x⃗), one for each possible x⃗ ∈ {0, 1}ⁿ.

Let's Look at It the Other Way Round
Theorem (Bayes, 1764)
P(A | B) = P(B | A) · P(A) / P(B)
Proof:
P(A | B) = P(A ∩ B) / P(B) = [P(A ∩ B) / P(A)] · [P(A) / P(B)] = P(B | A) · P(A) / P(B)

Bayesian Classification
Determine the category for x⃗ by computing
P(Y = y | X = x⃗) = P(Y = y) · P(X = x⃗ | Y = y) / P(X = x⃗)
for both y ∈ {0, 1} and deciding whether or not the following holds:
P(Y = 1 | X = x⃗) ≥ P(Y = 0 | X = x⃗)
So, to make the classifier, we need to compute the following:
▶ The prior P(Y = 1) (then P(Y = 0) = 1 − P(Y = 1))
▶ The conditionals P(X = x⃗ | Y = y) for y ∈ {0, 1} and for every x⃗

Estimating the Prior and Conditionals
▶ P(Y = 1) can be easily estimated from data by
P̄(Y = 1) = (number of samples with Y = 1) / (number of all samples)
▶ If the dimension of the features is small, P(X = x⃗ | Y = y) can be estimated from data similarly to P(Y = 1 | X = x⃗) by
P̄(X = x⃗ | Y = y) = (number of samples with Y = y and X = x⃗) / (number of samples with Y = y)
Unfortunately, for higher-dimensional data, too many samples are needed to estimate all P(X = x⃗ | Y = y) (there are too many x⃗'s). So where is the advantage of using the Bayes theorem?? We introduce independence assumptions about the features!

Naive Bayes
▶ We assume that the features are (conditionally) independent given the category. That is, for all x⃗ = (x1, . . . , xn) and y ∈ {0, 1} we assume
P(X = x⃗ | Y = y) = P(X1 = x1, . . . , Xn = xn | Y = y) = ∏_{i=1}^n P(Xi = xi | Y = y)
▶ Therefore, we only need to specify P(Xi = xi | Y = y) for each possible pair of a feature value xi and y ∈ {0, 1}.
Note that if all Xi are binary (values in {0, 1}), this requires specifying only 2n parameters: P(Xi = 1 | Y = 1) and P(Xi = 1 | Y = 0) for each Xi, as P(Xi = 0 | Y = y) = 1 − P(Xi = 1 | Y = y) for y ∈ {0, 1}. Compare this to specifying 2ⁿ parameters without any independence assumption.
The data is
Y  X1  X2  X3
1  1   0   1
1  0   1   1
0  1   0   1
0  0   0   1
1  0   0   0
0  1   1   1

P̄(X1 = 1 | Y = 1) = 1/3    P̄(X1 = 1 | Y = 0) = 2/3
P̄(X2 = 1 | Y = 1) = 1/3    P̄(X2 = 1 | Y = 0) = 1/3
P̄(X3 = 1 | Y = 1) = 2/3    P̄(X3 = 1 | Y = 0) = 1
125

Naive Bayes – Example
Consider classification of geometric shapes:
X1 ∈ {small, medium, large}, X2 ∈ {red, blue, green}, X3 ∈ {square, triangle, circle}
Assume that we have already estimated the following probabilities:

                   Y = 1   Y = 0
P̄(Y)               0.5     0.5
P̄(small | Y)       0.4     0.4
P̄(medium | Y)      0.1     0.2
P̄(large | Y)       0.5     0.4
P̄(red | Y)         0.9     0.3
P̄(blue | Y)        0.05    0.3
P̄(green | Y)       0.05    0.4
P̄(square | Y)      0.05    0.4
P̄(triangle | Y)    0.05    0.3
P̄(circle | Y)      0.9     0.3

Does (medium, red, circle) belong to the category 1?
126

Denote ⃗x = (medium, red, circle).
P(Y = 1 | X = ⃗x) = P(1) · P(medium | 1) · P(red | 1) · P(circle | 1) / P(X = ⃗x) ≐ 0.5 · 0.1 · 0.9 · 0.9 / P(X = ⃗x) = 0.0405 / P(X = ⃗x)
P(Y = 0 | X = ⃗x) = P(0) · P(medium | 0) · P(red | 0) · P(circle | 0) / P(X = ⃗x) ≐ 0.5 · 0.2 · 0.3 · 0.3 / P(X = ⃗x) = 0.009 / P(X = ⃗x)
(Note that we used the estimates P̄ of P to finish the computation above.)
Apparently, P(Y = 1 | X = ⃗x) ≐ 0.0405/P(X = ⃗x) > 0.009/P(X = ⃗x) ≐ P(0 | X = ⃗x).
So we classify ⃗x to the category 1.
127

Estimating Probabilities in Practice
We already know that P(Xi = xi | Y = y) can be estimated by
P̄(Xi = xi | Y = y) = ℓ_{y,xi} / ℓ_y
where
▶ ℓ_{y,xi} = number of samples with Y = y and Xi = xi,
▶ ℓ_y = number of samples with Y = y.
Problem: If, by chance, a rare value xi of a feature Xi never occurs in the training data, we get
P̄(Xi = xi | Y = y) = 0 for both y ∈ {0, 1}
But then P̄(X = ⃗x) = 0 for every ⃗x containing the value xi of Xi, and thus P̄(Y = y | X = ⃗x) is not well defined. Moreover, P̄(Y = y) · P̄(X = ⃗x | Y = y) = 0 (for y ∈ {0, 1}), so even this cannot be used for classification.
128

Probability Estimation Example
Training data:
Size   Color  Shape     Class
small  red    circle    1
large  red    circle    1
small  red    triangle  0
large  blue   circle    0

Estimated probabilities:
                   Y = 1   Y = 0
P̄(Y)               0.5     0.5
P̄(small | Y)       0.5     0.5
P̄(medium | Y)      0       0
P̄(large | Y)       0.5     0.5
P̄(red | Y)         1       0.5
P̄(blue | Y)        0       0.5
P̄(green | Y)       0       0
P̄(square | Y)      0       0
P̄(triangle | Y)    0       0.5
P̄(circle | Y)      1       0.5

Note that P̄(medium | 1) = P̄(medium | 0) = 0 and thus also P̄(medium, red, circle) = 0. So what is P̄(1 | medium, red, circle)?
129

Smoothing
▶ To account for estimation from small samples, probability estimates are adjusted or smoothed.
▶ Laplace smoothing adds one to every count of feature values:
P̃(Xi = xi | Y = y) = (ℓ_{y,xi} + 1) / (ℓ_y + v_i)
where
▶ ℓ_y = number of training samples with Y = y,
▶ ℓ_{y,xi} = number of training samples with Y = y and Xi = xi,
▶ v_i is the number of all distinct values of the variable Xi.
To understand this, note that ℓ_y = ∑_{xi a value of Xi} ℓ_{y,xi} and thus
P̄(Xi = xi | Y = y) = ℓ_{y,xi} / ∑_{xi} ℓ_{y,xi}
P̃(Xi = xi | Y = y) = (ℓ_{y,xi} + 1) / ∑_{xi} (ℓ_{y,xi} + 1)
130

Laplace Smoothing Example
▶ Assume the training set contains 10 samples of category 1:
▶ 4 small, ▶ 0 medium, ▶ 6 large
▶ Estimate parameters as follows:
▶ P̃(small | 1) = (4 + 1)/(10 + 3) ≈ 0.385
▶ P̃(medium | 1) = (0 + 1)/(10 + 3) ≈ 0.077
▶ P̃(large | 1) = (6 + 1)/(10 + 3) ≈ 0.538
131
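The counting and smoothing above are mechanical and easy to check in code. The following is a minimal sketch (our own helper functions, not the lecture's reference implementation) that reproduces the estimates from the six-row table and classifies a feature vector with Laplace smoothing:

```python
# Six training samples (y, (x1, x2, x3)) from the table above.
data = [(1, (1, 0, 1)), (1, (0, 1, 1)), (0, (1, 0, 1)),
        (0, (0, 0, 1)), (1, (0, 0, 0)), (0, (1, 1, 1))]

def prior(y):
    """Estimate P(Y = y) as a fraction of samples."""
    return sum(1 for cls, _ in data if cls == y) / len(data)

def smoothed(i, xi, y, v=2):
    """Laplace-smoothed estimate of P(Xi = xi | Y = y); v = #values of Xi."""
    l_y = sum(1 for cls, _ in data if cls == y)
    l_yx = sum(1 for cls, x in data if cls == y and x[i] == xi)
    return (l_yx + 1) / (l_y + v)

def classify(x):
    """Naive Bayes: pick y maximizing P(y) * prod_i P(xi | y)."""
    score = {}
    for y in (0, 1):
        p = prior(y)
        for i, xi in enumerate(x):
            p *= smoothed(i, xi, y)
        score[y] = p
    return max(score, key=score.get), score

print(classify((1, 0, 1)))
```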
Continuous Features
Ω may be (potentially) continuous, and Xi may assign a continuum of values in R.
▶ The probabilities are then computed using a probability density p : R → R₊. A random variable X : Ω → R has a density p : R → R₊ if for every interval [a, b] we have
P(a ≤ X ≤ b) = ∫_a^b p(x) dx
Usually, P(Xi | Y = y) is used to denote the density of Xi conditioned on Y = y.
▶ The densities P(Xi | Y = y) are usually estimated using Gaussian densities as follows:
▶ Estimate the mean µ_iy and the standard deviation σ_iy based on training data.
▶ Then put
P̄(Xi | Y = y) = (1 / (σ_iy √(2π))) · exp(−(Xi − µ_iy)² / (2σ²_iy))
132

Comments on Naive Bayes
▶ Tends to work well despite the rather strong assumption of conditional independence of features.
▶ Experiments show that it is quite competitive with other classification methods. Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.
▶ Directly constructs a model from parameter estimates that are calculated from the training data.
▶ Typically handles outliers and noise well.
▶ Missing values are easy to deal with (simply average over the missing values in feature vectors).
133

Bayesian Networks (Basic Information)
In the Naive Bayes, we have assumed that all features X1, . . . , Xn are conditionally independent. This is usually not realistic. E.g., the variables "rain" and "grass wet" are (usually) strongly dependent.
What if we retain some dependencies? (But now in a well-defined sense.)
Bayesian networks are a graphical model that uses a directed acyclic graph to specify dependencies among variables.
134

Bayesian Networks – Example
Now, e.g., P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B)
Now, we may, e.g., infer the probability P(C = T | A = T) that we sit in the wrong chair, assuming that our back aches.
We have to store only 10 numbers, as opposed to 2⁵ − 1 possible probabilities for all vectors of values of C, S, W, B, A.
135

Bayesian Networks – Learning & Naive Bayes
Many algorithms have been developed for learning:
▶ the structure of the graph of the network,
▶ the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.
Can you express the naive Bayes for Y, X1, . . . , Xn using a Bayesian network?
136

Classifier Evaluation 137

Classifier
Assume binary classification into two classes {0, 1}. Consider a classification dataset:
{(⃗xk, ck) | k = 1, . . . , p}
Here ⃗xk is a vector of attributes/features and ck ∈ {0, 1} for all k.
Consider a sequence of predictions generated by a classifier:
h1, . . . , hp ∈ {0, 1}
Here each hk has been predicted for the k-th example (⃗xk, ck).
How good are the predictions h1, . . . , hp w.r.t. c1, . . . , cp? There are many possible metrics ...
I will call the class 1 positive and the class 0 negative. Note that the class 0 is not negative in the numerical sense but in the sense of the absence of something (e.g., a predicted illness).
138

Confusion Matrix for Binary Classifier

             Predicted
              1     0
Actual  1    TP    FN
        0    FP    TN

▶ TP = number of correctly classified examples with actual class 1
TP = |{k | hk = 1 ∧ ck = 1}|
▶ TN = number of correctly classified examples with actual class 0
TN = |{k | hk = 0 ∧ ck = 0}|
▶ FP = number of incorrectly classified examples with actual class 0
FP = |{k | hk = 1 ∧ ck = 0}|
▶ FN = number of incorrectly classified examples with actual class 1
FN = |{k | hk = 0 ∧ ck = 1}|
139
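The four counts can be computed directly from the definitions. A minimal sketch (our own helper, with a tiny made-up example):

```python
def confusion_counts(c, h):
    """c = actual classes, h = predictions, both sequences over {0, 1}."""
    tp = sum(1 for ck, hk in zip(c, h) if hk == 1 and ck == 1)
    tn = sum(1 for ck, hk in zip(c, h) if hk == 0 and ck == 0)
    fp = sum(1 for ck, hk in zip(c, h) if hk == 1 and ck == 0)
    fn = sum(1 for ck, hk in zip(c, h) if hk == 0 and ck == 1)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0], [1, 0, 1, 0]))   # (1, 1, 1, 1)
```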
Example
Given a sample of 12 individuals, eight have cancer, and four are cancer-free. Assume that we have trained a classifier with the following results:

Index      1   2   3   4   5   6   7   8   9   10  11  12
Actual     1   1   1   1   1   1   1   1   0   0   0   0
Predicted  0   0   1   1   1   1   1   1   1   0   0   0
Result     FN  FN  TP  TP  TP  TP  TP  TP  FP  TN  TN  TN

                     Predicted condition
Actual condition     Cancer   Non-cancer
Cancer               TP = 6   FN = 2
Non-cancer           FP = 1   TN = 3
(Total: 8 + 4 = 12)
140

Terminology
▶ TP aka hit
▶ TN aka correct rejection
▶ FP aka type I error, false alarm, overestimation
▶ FN aka type II error, miss, underestimation
Usually, TP, TN, FP, and FN are used to denote both the individual examples of a particular kind and the number of these examples. In what follows, we also use
▶ P = TP + FN, the number of all cases with the actual class 1,
▶ N = TN + FP, the number of all cases with the actual class 0,
▶ PP = TP + FP, the number of all cases with the predicted class 1,
▶ PN = TN + FN, the number of all cases with the predicted class 0.
Note that P + N = PP + PN is the number of all cases.
There is a large number of derived metrics. We consider some of the most used in practice.
141

Accuracy
Accuracy = (TP + TN) / (P + N)
Intuitively, Accuracy is the proportion of correctly classified cases w.r.t. all cases.
Example: Consider our cancer predictor with the confusion matrix above. The Accuracy is
ACC = (TP + TN) / (P + N) = (6 + 3) / 12 = 3/4
142

Accuracy - Imbalanced Classes
Accuracy can be misleading when the classes are imbalanced:
▶ Consider 100 cases, 90 in the class 0 and 10 in the class 1,
▶ consider a classifier that returns 1 for a single sample of class 1 and 0 for all other samples.

          Predicted
Actual    Pos   Neg
Pos        1     9
Neg        0    90
(Total: 90 + 10 = 100)

The Accuracy is 91/100 > 0.9. Pretty good, right? However, the classifier is pretty bad on the positive cases. In the case of cancer prediction, such a classifier would be a disaster.
143

Precision & Recall
To mitigate the defect of the Accuracy, we may compute the following metrics:
Precision = TP / PP (= how often a predicted positive is actually positive)
Precision is also known as positive predictive value (PPV).
Recall = TP / P (= how often an actual positive is predicted positive)
Recall is also known as true positive rate, sensitivity, hit rate, and power.
144

Precision & Recall - Example
Example: In our cancer example (TP = 6, FN = 2, FP = 1, TN = 3):
▶ Precision measures how often a patient predicted to be ill is truly ill (in our case, 6/7),
▶ Recall measures how often an ill patient is found to be ill (in our case, 6/8).
145

Precision & Recall - Imbalanced Classes
▶ Consider 100 cases, 90 in the class 0 and 10 in the class 1,
▶ consider the classifier above that returns 1 for a single sample of class 1 and 0 for all other samples.
Precision = 1, Recall = 1/10
You can see that the predictor is very precise (on the class 1) but useless due to the weak Recall.
146

Precision & Recall - Relative Importance
Let us get back to our cancer example (TP = 6, FN = 2, FP = 1, TN = 3). Consider Precision and Recall. By now, you should remember what they measure.
Which of the two is more important in medicine? Which of the two is more important for plagiarism detectors?
Can we get a single number summarizing both Precision and Recall? For example, to compare two classifiers.
147
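The derived metrics follow directly from the four counts. A minimal sketch (our own functions; scikit-learn offers equivalents such as accuracy_score, precision_score, and recall_score, if it is available):

```python
def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)
def precision(tp, fp): return tp / (tp + fp)            # TP / PP
def recall(tp, fn): return tp / (tp + fn)               # TP / P
def f1(tp, fp, fn): return 2 * tp / (2 * tp + fp + fn)  # see the next slide

tp, tn, fp, fn = 6, 3, 1, 2        # the cancer example above
print(accuracy(tp, tn, fp, fn))    # 0.75
print(precision(tp, fp))           # 6/7 ≈ 0.857
print(recall(tp, fn))              # 6/8 = 0.75
print(f1(tp, fp, fn))              # 0.8
```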
F1 Score
The F1 score is the harmonic mean of Recall and Precision:
F1 = 2 / (Recall⁻¹ + Precision⁻¹) = 2TP / (2TP + FP + FN)
Compare the arithmetic and harmonic means: the harmonic mean prefers two values closer to each other. For example, the harmonic mean of 2/3 and 1/3 is approximately 0.444 (while the arithmetic mean is 0.5).
148

F1 Score - Examples
Consider the cancer example (TP = 6, FN = 2, FP = 1, TN = 3). Here
F1 = 2TP / (2TP + FP + FN) = (2 · 6) / ((2 · 6) + 1 + 2) = 0.8.
Our imbalanced example (TP = 1, FN = 9, FP = 0, TN = 90). Here
F1 = 2TP / (2TP + FP + FN) = (2 · 1) / ((2 · 1) + 0 + 9) = 0.18.
Note that the average of Precision and Recall is 0.55, which would give us a much less severe warning that the classifier is bad.
149

Imbalanced Classes Once More
Note that the standard definitions of Precision and Recall for binary classifiers reveal only part of the truth. In particular, true negatives are not used in the definition of F1. Consider

          Predicted
Actual    Pos   Neg
Pos       90     0
Neg        9     1
(Total: 90 + 10 = 100)

Precision = 90/99, Recall = 90/90 = 1
F1 = 2TP / (2TP + FP + FN) = (2 · 90) / (2 · 90 + 9 + 0) ≈ 0.95
All great, except that the classifier sucks on the negative cases. If you are concerned with the negative cases, swap the classes and compute another set of metrics.
150

F1 Score
▶ F1 is often used as a summary score for binary classifiers instead of Accuracy. It works better with imbalanced classes.
▶ It is criticised for giving Precision and Recall the same importance.
▶ It is not symmetric and ignores true negatives, i.e., it is misleading for some cases of imbalanced classes.
▶ The Fowlkes-Mallows index is the geometric mean of Precision and Recall (used in clustering). The geometric mean lies between the arithmetic and harmonic means. For example, the geometric mean of 2/3 and 1/3 is approximately 0.471.
151

More Derived Metrics
You can see that the negative predictive value becomes the Precision when we swap the classes (and vice versa).
152

More Derived Metrics
Note that specificity becomes Recall when we swap the classes (and vice versa). For example, medical doctors communicate in terms of sensitivity and specificity.
153

For the cancer example (TP = 6, FN = 2, FP = 1, TN = 3):
TPR = Sensitivity = Recall = TP/P = 6/8 (how often is a positive predicted positive?)
TNR = Specificity = TN/N = 3/4 (how often is a negative predicted negative?)
FPR = probability of false alarm = FP/N = 1/4 (how often is a negative predicted positive?)
FNR = miss rate = FN/P = 2/8 (how often is a positive predicted negative?)
154

Evaluating Multi-class Classifiers 155

Classification Into Multiple Classes
Assume classification into classes from a finite set C. Consider a classification dataset:
{(⃗xk, ck) | k = 1, . . . , p}
Here ⃗xk is a vector of attributes/features and ck ∈ C for all k.
Consider a sequence of predictions generated by a classifier:
h1, . . . , hp ∈ C
Here each hk has been predicted for the k-th example (⃗xk, ck).
How good are the predictions h1, . . . , hp w.r.t. c1, . . . , cp? There are many possible metrics ...
Consider an arbitrary (finite) number of classes in C.
156

Confusion Matrix
Assume that C = {1, . . . , m}. Given two classes i, j ∈ C, denote by Mij the number of samples of class i classified into the class j. Formally,
Mij = |{k | ck = i ∧ hk = j}|
The confusion matrix has rows indexed by the actual class i and columns indexed by the predicted class j; its entry in row i and column j is Mij.
157
Example
Fifteen (actual, predicted) pairs:
(big, big), (big, big), (small, big), (medium, medium), (big, small), (big, big), (small, small), (small, small), (medium, medium), (medium, small), (small, small), (big, big), (medium, small), (small, medium), (big, big)

The corresponding confusion matrix:

          Predicted
Actual    big  medium  small
big        5     0      1
medium     0     2      2
small      1     1      3

Note that the diagonal counts the correctly classified samples. The off-diagonal elements correspond to misclassified samples.
158

Metrics
We can easily generalize Accuracy, Precision, Recall, and the F1 score from binary classification to multiple classes.
Notation:
▶ Mi• = ∑_{j=1}^m Mij
▶ M•j = ∑_{i=1}^m Mij
▶ M•• = ∑_{i=1}^m ∑_{j=1}^m Mij
Now, the metrics:
Accuracy = ∑_{k=1}^m Mkk / M••
For a given class i ∈ C:
Precision[i] = Mii / M•i
Recall[i] = Mii / Mi•
F1[i] = 2 · Precision[i] · Recall[i] / (Precision[i] + Recall[i])
Note that Precision, Recall, and F1 are defined only for a given class!
159

Example
Compute the metrics for the confusion matrix above.
160

Accuracy = (5 + 2 + 3)/15 ≈ 0.67
Precision[big] = 5/6, Precision[medium] = 2/3, Precision[small] = 3/6
Recall[big] = 5/6, Recall[medium] = 2/4, Recall[small] = 3/5
F1[big] = 2 · (5/6) · (5/6) / ((5/6) + (5/6)) = 5/6 ≈ 0.83
F1[medium] ≈ 0.57, F1[small] ≈ 0.54
How do you get a single number out of these? Average Precision, Recall, and F1 are usually computed, but one needs to be careful about the variance.
161
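The per-class metrics are simple row and column operations on the confusion matrix. A minimal sketch (our own computation, assuming NumPy is available) verifying the numbers above:

```python
import numpy as np

# Confusion matrix from the example (rows: actual, columns: predicted).
M = np.array([[5, 0, 1],
              [0, 2, 2],
              [1, 1, 3]], dtype=float)

accuracy = np.trace(M) / M.sum()
precision = np.diag(M) / M.sum(axis=0)   # M_ii / M_•i (column sums)
recall = np.diag(M) / M.sum(axis=1)      # M_ii / M_i• (row sums)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)            # 0.666...
print(precision, recall)   # [0.833 0.667 0.5] [0.833 0.5 0.6]
print(f1)                  # [0.833 0.571 0.545]
```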
(Figure: a large real-world confusion matrix. 162)
Machine learning/data mining is needed to understand the matrix.
163

Probabilistic Classifier Evaluation 164

Binary Probabilistic Classifier
Assume binary classification into two classes {0, 1}. Consider a classification dataset:
{(⃗xk, ck) | k = 1, . . . , p}
Here ⃗xk is a vector of attributes/features and ck ∈ {0, 1} for all k.
Consider a sequence of predictions generated by a classifier. Now the classifier returns the probability of class 1 for a given input:
h1, . . . , hp ∈ [0, 1]
Here each hk has been predicted for the k-th example (⃗xk, ck).
How to interpret the predictions h1, . . . , hp? How good are the predictions h1, . . . , hp w.r.t. c1, . . . , cp?
165

Probabilistic Classifier
Let us fix predictions h1, . . . , hp. Given a threshold T ∈ [0, 1], we define
hT_k = 1 if hk ≥ T, and hT_k = 0 if hk < T
For every T we can compute all the metrics (Precision, Recall, etc.). Given a metric MET and a threshold T, we denote by MET[T] the metric MET evaluated on hT_1, . . . , hT_p. We obtain
TP[T] = |{k | hT_k = 1 ∧ ck = 1}|
and similarly TN[T], FP[T], FN[T], Accuracy[T], Precision[T], Recall[T], F1[T], . . .
However, all metrics are now functions of the threshold T.
166

Thresholded Classifier Metrics

Index      1    2    3    4    5    6    7    8    9    10   11   12
Actual     1    1    1    1    1    0    0    1    1    0    0    0
Predicted  .98  .95  .9   .86  .66  .48  .43  .42  .36  .15  .1   .05
T = 0.5    TP   TP   TP   TP   TP   TN   TN   FN   FN   TN   TN   TN
T = 0.42   TP   TP   TP   TP   TP   FP   FP   TP   FN   TN   TN   TN
T = 0.1    TP   TP   TP   TP   TP   FP   FP   TP   TP   FP   FP   TN

For example, consider T = 0.42; then
TP[T] = 6, FP[T] = 2, FN[T] = 1, TN[T] = 3
Accuracy[T] = (6 + 3)/12
Precision[T] = 6/(6 + 2), Recall[T] = 6/(6 + 1)
F1[T] = 2 · (6/8) · (6/7) / ((6/8) + (6/7)) = 0.8
167

Receiver Operating Characteristic (ROC)
Consider two metrics for a given T:
TPR[T] = TP[T]/P (True Positive Rate)
FPR[T] = FP[T]/N (False Positive Rate)
(Note that P and N do not depend on T.) The ROC curve is then a function ROC : [0, 1] → [0, 1]² defined by
ROC(T) = (TPR[T], FPR[T])
Observe that ROC(0) = (1, 1), because the classifier with T = 0 simply classifies everything as positive, i.e., into the class 1. Both TPR[T] and FPR[T] are non-increasing in T.
168

For the table above:
▶ 0.00 ≤ T ≤ 0.05: TPR = 1 and FPR = 1
▶ 0.05 < T ≤ 0.10: TPR = 1 and FPR = 4/5
▶ 0.10 < T ≤ 0.15: TPR = 1 and FPR = 3/5
▶ 0.15 < T ≤ 0.36: TPR = 1 and FPR = 2/5
▶ 0.36 < T ≤ 0.42: TPR = 6/7 and FPR = 2/5
▶ 0.42 < T ≤ 0.43: TPR = 5/7 and FPR = 2/5
▶ 0.43 < T ≤ 0.48: TPR = 5/7 and FPR = 1/5
▶ 0.48 < T ≤ 0.66: TPR = 5/7 and FPR = 0
▶ 0.66 < T ≤ 0.86: TPR = 4/7 and FPR = 0
▶ 0.86 < T ≤ 0.90: TPR = 3/7 and FPR = 0
▶ 0.90 < T ≤ 0.95: TPR = 2/7 and FPR = 0
▶ 0.95 < T ≤ 0.98: TPR = 1/7 and FPR = 0
▶ 0.98 < T ≤ 1.00: TPR = 0 and FPR = 0
169

ROC
(Figure: the ROC curve of the classifier above.)
170

Iris Dataset – A Classifier
Example from the scikit-learn manual: an SVM classifier trained on Iris.
171

Using ROC and Threshold
Search for the best threshold at the elbow of the ROC curve.
172

ROC – Explanation
The larger the area under the ROC curve (ROC-AUC), the better. ROC-AUC ranges from 0 to 1. ROC-AUC ≈ 0.5 indicates random guessing.
173

ROC-AUC
For the table above, ROC-AUC ≈ 0.8857.
174

Iris – ROC-AUC
ROC-AUC = 0.79
175

ROC-AUC – Probabilistic Interpretation
How is the ROC-AUC connected with the samples? Consider our cancer detection example (the table above).
AUC has a probabilistic explanation. Consider the following experiment:
▶ Choose randomly a patient i from the positive patients; each positive patient has the same probability of being chosen.
▶ Choose randomly a patient j from the negative patients; each negative patient has the same probability of being chosen.
▶ Check whether hi > hj.
The ROC-AUC is the probability of succeeding in the hi > hj test.
176

Summary
We have discussed various metrics that can be used to evaluate the quality of a classifier. The metrics summarize the results of evaluation on a given dataset. We have discussed metrics for evaluating
▶ binary classifiers: Accuracy, Precision, Recall, F1, and a few more,
▶ multi-class classifiers: Accuracy, Precision, Recall, F1,
▶ probabilistic classifiers: threshold-parametrized metrics, ROC-AUC.
There are still several questions unanswered:
▶ When to use which metric.
▶ How to estimate the influence of sampling the dataset.
177
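As a quick cross-check of the ROC-AUC value computed above, the following sketch uses scikit-learn (assuming it is available; the data is the 12-sample table from the thresholding example):

```python
from sklearn.metrics import roc_auc_score, roc_curve

actual = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
scores = [.98, .95, .9, .86, .66, .48, .43, .42, .36, .15, .1, .05]

print(roc_auc_score(actual, scores))   # 0.8857... (= 31/35 positive-negative pairs won)
fpr, tpr, thresholds = roc_curve(actual, scores)
print(list(zip(fpr, tpr)))             # the points of the ROC curve
```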
Use of Evaluation Metrics
In our case, the following scenarios are typical:
▶ Final test: Evaluate the model on the test set (separated at the beginning of training) and then compute the metrics. This may inform the user about the quality of the model.
▶ Validation: Evaluate models on a separate validation set and use the metrics to compare models. There are (at least) two scenarios in which this happens:
▶ hyperparameter fine-tuning,
▶ comparison of different models (e.g., KNN and decision trees).
Keep in mind that the metrics are artificial and summarize the results of the model only roughly. You should always strive to test the proper functionality of your model in conditions as natural as possible. For example, a model for medical diagnosis should be evaluated by medical doctors, who may observe many features of its behavior that are difficult to express quantitatively.
178

How to Estimate Significance
Machine learning models are typically trained on (pseudo)random samples of data objects, for example, the set of patients treated by a concrete hospital. However, the purpose of testing/evaluation is to get information about the whole population (i.e., all possible patients). How do we estimate how much the specific properties of the given sample influence our model?
This is a challenging question; methods of inferential statistics are needed to get the answer. We will consider these issues in a later lecture. Concretely,
▶ the bias-variance tradeoff,
▶ statistical tests for testing
▶ the significance of the metrics values,
▶ paired t-tests for comparing models.
179

How to Compare Classifiers
Let us consider two classifiers. How do you compare them? Accuracies and F1 scores can be compared easily (they are just numbers). How to compare (Precision1, Recall1) of the first classifier with (Precision2, Recall2) of the second classifier?
Thresholding:
▶ Introduce a threshold 0 ≤ t ≤ 1.
▶ Demand one of the two metrics (typically the Recall) to be at least t. That is,
Recall1 ≥ t and Recall2 ≥ t
▶ Compare the values of the other metric numerically. In our case, decide whether Precision1 ≥ Precision2.
(We still need to be concerned about statistical significance.)
180

Example

Classifier 1:                        Classifier 2:
            Predicted                            Predicted
Actual      Canc.  Non-canc.        Actual      Canc.  Non-canc.
Cancer        6       2             Cancer        5       3
Non-cancer    1       3             Non-cancer    0       4
(Total: 8 + 4 = 12)                 (Total: 8 + 4 = 12)

Precision1 = 6/7, Recall1 = 6/8
Precision2 = 5/5 = 1, Recall2 = 5/8
Consider a threshold t on the Recall. If the threshold is t = 5/8, then both classifiers satisfy it and the second classifier is better (it has higher Precision). If the threshold is t = 6/8, then the second classifier is unacceptable (its Recall is below the threshold).
181

Numerical features
▶ Throughout this lecture, we assume that all features are numerical, i.e., feature vectors belong to Rn.
▶ Most non-numerical features can be conveniently transformed into numerical ones. For example:
▶ Colors {blue, red, yellow} can be represented by {(1, 0, 0), (0, 1, 0), (0, 0, 1)} (one-hot encoding); see the sketch below.
▶ Words can be embedded into vector spaces by various means (word2vec etc.).
▶ A black-and-white picture of x × y pixels can be encoded as a vector of xy numbers that capture the shades of gray of the pixels. (Even though this is not the best way of representing images.)
182
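A minimal one-hot encoding sketch (our own hypothetical helper, not a library API; libraries such as scikit-learn and pandas provide their own encoders):

```python
import numpy as np

values = ["blue", "red", "yellow"]
index = {v: i for i, v in enumerate(values)}

def one_hot(v):
    """Map a categorical value to a unit vector."""
    vec = np.zeros(len(values))
    vec[index[v]] = 1.0
    return vec

print(one_hot("red"))   # [0. 1. 0.]
```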
Basic Problems
We consider two basic problems:
▶ (Binary) classification. Our goal: classify inputs into two categories.
▶ Regression. Our goal: find a (hypothesized) functional dependency in data.
183

Linear Models
Binary Classification 184

Binary classification in Rn
Our goal:
▶ Given a set D of training examples of the form (⃗x, c) where ⃗x ∈ Rn and c ∈ {0, 1},
▶ construct a model h : Rn → {0, 1} that is consistent with D, i.e.,
h(⃗x) = c for all training examples (⃗x, c) ∈ D
Comments:
▶ In practice, we often do not strictly demand h(⃗x) = c for all training examples (⃗x, c) ∈ D (often it is impossible).
▶ We are more interested in good generalization, that is, how well h classifies new instances that do not belong to D. (Recall that we usually evaluate the accuracy of the resulting hypothesized function h on a test set.)
185

Models
We consider two kinds of hypothesis spaces:
▶ linear (affine) classifiers (this lecture),
▶ non-linear classifiers (kernel SVM, neural networks) (later lectures).
186

Linear Classifier – Example (figure) 187

Length and Scalar Product of Vectors
▶ We consider vectors ⃗x = (x1, . . . , xn) ∈ Rn.
▶ Euclidean metric on vectors:
‖⃗x‖ = √(∑_{i=1}^n xi²)
The distance between two vectors (points) ⃗x, ⃗y is ‖⃗x − ⃗y‖.
▶ The scalar product ⃗x · ⃗y of vectors ⃗x = (x1, . . . , xn) and ⃗y = (y1, . . . , yn) is defined by
⃗x · ⃗y = ∑_{i=1}^n xi yi
▶ Recall that ⃗x · ⃗y = ‖⃗x‖ ‖⃗y‖ cos θ, where θ is the angle between ⃗x and ⃗y. That is, ⃗x · ⃗y is the length of the projection of ⃗y onto ⃗x multiplied by ‖⃗x‖.
▶ Note that ⃗x · ⃗x = ‖⃗x‖².
188

Linear Classifier
A linear classifier h[⃗w] is determined by a vector of weights ⃗w = (w0, w1, . . . , wn) ∈ R^{n+1} as follows: Given ⃗x = (x1, . . . , xn) ∈ Rn,
h[⃗w](⃗x) := 1 if w0 + ∑_{i=1}^n wi · xi ≥ 0, and 0 if w0 + ∑_{i=1}^n wi · xi < 0.
More succinctly: h[⃗w](⃗x) = sgn(w0 + ∑_{i=1}^n wi · xi), where sgn(y) = 1 for y ≥ 0 and sgn(y) = 0 for y < 0.
We define the separating hyperplane determined by ⃗w as the set of all ⃗x ∈ Rn satisfying w0 + ∑_{i=1}^n wi · xi = 0.
189

(Figures illustrating the separating hyperplane. 190)

Linear Classifier – Geometry (figure) 191

Linear Classifier – Notation
Given ⃗x = (x1, . . . , xn) ∈ Rn, we define the augmented feature vector
x̃ = (x0, x1, . . . , xn) where x0 = 1
This makes the notation for the linear classifier more succinct:
h[⃗w](⃗x) = sgn(⃗w · x̃)
192

Linear Classifier – Learning
(Figure: labeled points in the plane.)
▶ classification in the plane using a linear classifier,
▶ if a point is incorrectly classified, the learning algorithm turns the line (hyperplane) to improve the classification.
193

Perceptron Learning
▶ Given a training set
D = {(⃗x1, c1), (⃗x2, c2), . . . , (⃗xp, cp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rn and ck ∈ {0, 1}. Recall that x̃k = (xk0, xk1, . . . , xkn) where xk0 = 1.
▶ A weight vector ⃗w ∈ R^{n+1} is consistent with D if
h[⃗w](⃗xk) = sgn(⃗w · x̃k) = ck for all k = 1, . . . , p
D is linearly separable if there is a vector ⃗w ∈ R^{n+1} which is consistent with D.
▶ Our goal is to find a consistent ⃗w, assuming that D is linearly separable.
194

Perceptron – Learning Algorithm
Online learning algorithm:
Idea: Cyclically go through the training examples in D and adapt the weights. Whenever an example is incorrectly classified, turn the hyperplane so that the example becomes closer to its correct half-space.
Compute a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .:
▶ ⃗w(0) is randomly initialized close to ⃗0 = (0, . . . , 0).
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε · (h[⃗w(t)](⃗xk) − ck) · x̃k = ⃗w(t) − ε · (sgn(⃗w(t) · x̃k) − ck) · x̃k
Here k = (t mod p) + 1, i.e., the examples are considered cyclically, and 0 < ε ≤ 1 is a learning rate.
Theorem (Rosenblatt)
If D is linearly separable, then there is t* such that ⃗w(t*) is consistent with D.
195

Example
Training set:
D = {((2, −1), 1), ((2, 1), 1), ((1, 3), 0)}
That is,
⃗x1 = (2, −1), ⃗x2 = (2, 1), ⃗x3 = (1, 3)
x̃1 = (1, 2, −1), x̃2 = (1, 2, 1), x̃3 = (1, 1, 3)
c1 = 1, c2 = 1, c3 = 0
Assume that the initial vector ⃗w(0) is ⃗w(0) = (0, −1, 1). Consider ε = 1.
196

Example: Separating by ⃗w(0)
(Figure: the points ⃗x1, ⃗x2, ⃗x3 and the line given by ⃗w(0).)
Denoting ⃗w(0) = (w0, w1, w2) = (0, −1, 1), the blue separating line is given by w0 + w1x1 + w2x2 = 0. The red vector normal to the blue line is (w1, w2). The points on the side of (w1, w2) are assigned 1 by the classifier, the others 0. (In this case, ⃗x3 is assigned 1 and ⃗x1, ⃗x2 are assigned 0, all of which is inconsistent with c1 = 1, c2 = 1, c3 = 0.)
197

Example: Computing ⃗w(1)
We have ⃗w(0) · x̃1 = (0, −1, 1) · (1, 2, −1) = 0 − 2 − 1 = −3, thus sgn(⃗w(0) · x̃1) = 0 and thus sgn(⃗w(0) · x̃1) − c1 = 0 − 1 = −1.
(I.e., ⃗x1 is not correctly classified, and ⃗w(0) is not consistent with D.) Hence,
⃗w(1) = ⃗w(0) − (sgn(⃗w(0) · x̃1) − c1) · x̃1 = ⃗w(0) + x̃1 = (0, −1, 1) + (1, 2, −1) = (1, 1, 0)
198

Example: Separating by ⃗w(1) (figure) 199

Example: Computing ⃗w(2)
We have ⃗w(1) · x̃2 = (1, 1, 0) · (1, 2, 1) = 1 + 2 = 3, thus sgn(⃗w(1) · x̃2) = 1 and thus sgn(⃗w(1) · x̃2) − c2 = 1 − 1 = 0.
(I.e., ⃗x2 is currently correctly classified by ⃗w(1). However, as we will see, ⃗x3 is not well classified.) Hence,
⃗w(2) = ⃗w(1) = (1, 1, 0)
200

Example: Computing ⃗w(3)
We have ⃗w(2) · x̃3 = (1, 1, 0) · (1, 1, 3) = 1 + 1 = 2, thus sgn(⃗w(2) · x̃3) = 1 and thus sgn(⃗w(2) · x̃3) − c3 = 1 − 0 = 1.
(This means that ⃗x3 is not well classified, and ⃗w(2) is not consistent with D.) Hence,
⃗w(3) = ⃗w(2) − (sgn(⃗w(2) · x̃3) − c3) · x̃3 = ⃗w(2) − x̃3 = (1, 1, 0) − (1, 1, 3) = (0, 0, −3)
201

Example: Separating by ⃗w(3) (figure) 202

Example: Computing ⃗w(4)
We have ⃗w(3) · x̃1 = (0, 0, −3) · (1, 2, −1) = 3, thus sgn(⃗w(3) · x̃1) = 1 and thus sgn(⃗w(3) · x̃1) − c1 = 1 − 1 = 0.
(I.e., ⃗x1 is currently correctly classified by ⃗w(3). However, we shall see that ⃗x2 is not.) Hence,
⃗w(4) = ⃗w(3) = (0, 0, −3)
203

Example: Computing ⃗w(5)
We have ⃗w(4) · x̃2 = (0, 0, −3) · (1, 2, 1) = −3, thus sgn(⃗w(4) · x̃2) = 0 and thus sgn(⃗w(4) · x̃2) − c2 = 0 − 1 = −1.
(I.e., ⃗x2 is not correctly classified, and ⃗w(4) is not consistent with D.) Hence,
⃗w(5) = ⃗w(4) − (sgn(⃗w(4) · x̃2) − c2) · x̃2 = ⃗w(4) + x̃2 = (0, 0, −3) + (1, 2, 1) = (1, 2, −2)
204

Example: Separating by ⃗w(5) (figure) 205

Example: The Result
The vector ⃗w(5) is consistent with D:
sgn(⃗w(5) · x̃1) = sgn((1, 2, −2) · (1, 2, −1)) = sgn(7) = 1 = c1
sgn(⃗w(5) · x̃2) = sgn((1, 2, −2) · (1, 2, 1)) = sgn(3) = 1 = c2
sgn(⃗w(5) · x̃3) = sgn((1, 2, −2) · (1, 1, 3)) = sgn(−3) = 0 = c3
206

Perceptron – Learning Algorithm
Batch learning algorithm:
Compute a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .:
▶ ⃗w(0) is randomly initialized close to ⃗0 = (0, . . . , 0).
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε · ∑_{k=1}^p (h[⃗w(t)](⃗xk) − ck) · x̃k = ⃗w(t) − ε · ∑_{k=1}^p (sgn(⃗w(t) · x̃k) − ck) · x̃k
Here 0 < ε ≤ 1 is a learning rate.
207
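The online algorithm is short enough to run end to end. A minimal sketch (our own implementation) reproducing the worked example above, including the final weights (1, 2, −2):

```python
import numpy as np

# Training set: augmented vectors x̃k and desired classes ck.
X = np.array([[1, 2, -1], [1, 2, 1], [1, 1, 3]], dtype=float)
c = np.array([1, 1, 0])
w = np.array([0, -1, 1], dtype=float)   # w(0) from the example
eps = 1.0

def sgn(y):
    return 1 if y >= 0 else 0

changed = True
while changed:                          # stop after a full clean pass
    changed = False
    for xk, ck in zip(X, c):            # examples considered cyclically
        err = sgn(w @ xk) - ck
        if err != 0:
            w = w - eps * err * xk      # turn the hyperplane
            changed = True

print(w)   # [ 1.  2. -2.], consistent with D as computed above
```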
Linear Regression 208

Linear Regression – Oaks in Wisconsin
(Figures: the same data fitted in different ways; an extrapolated fit is marked NO!, a more careful fit possibly YES!) This example is from How to Lie with Statistics by Darrell Huff (1954).
209

Linear Regression
Our goal:
▶ Given a set D of training examples of the form (⃗x, f) where ⃗x ∈ Rn and f ∈ R,
▶ construct a model h : Rn → R such that
h(⃗x) ≈ f for all training examples (⃗x, f) ∈ D
Here ≈ means that the values are somewhat close to each other w.r.t. an appropriate error function E. In what follows, we use the squared error defined by
E = (1/2) ∑_{(⃗x,f)∈D} (h(⃗x) − f)²
Our goal is to minimize E. The main reason is that this function has nice mathematical properties (as opposed, e.g., to ∑_{(⃗x,f)∈D} |h(⃗x) − f|).
210

Linear Function Approximation
▶ Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rn and fk ∈ R.
▶ Our goal: Find ⃗w so that h[⃗w](⃗xk) = ⃗w · x̃k is close to fk for every k = 1, . . . , p. Recall that x̃k = (xk0, xk1, . . . , xkn) where xk0 = 1.
▶ Squared error function:
E(⃗w) = (1/2) ∑_{k=1}^p (⃗w · x̃k − fk)² = (1/2) ∑_{k=1}^p (∑_{i=0}^n wi xki − fk)²
211

Error function (figure) 212

Gradient of the Error Function
Consider the gradient of the error function:
∇E(⃗w) = (∂E/∂w0 (⃗w), . . . , ∂E/∂wn (⃗w)) = ∑_{k=1}^p (⃗w · x̃k − fk) · x̃k
What is the gradient ∇E(⃗w)? It is a vector in R^{n+1} which points in the direction of the steepest ascent of E (its length corresponds to the steepness). Note that here the vectors x̃k are fixed parameters of E!
Fact: If ∇E(⃗w) = ⃗0 = (0, . . . , 0), then ⃗w is a global minimum of E. This follows from the fact that E is a convex paraboloid with a unique extreme, which is a minimum.
213

Gradient of the error function
Consider n = 1, which means that ⃗w = (w0, w1), and write x instead of ⃗x since ⃗x ∈ Rn = R1 = R. The model is then h[⃗w](x) = w0 + w1 · x.
Consider a concrete training set:
D = {(2, 1), (3, 2), (4, 5)} = {(x1, f1), (x2, f2), (x3, f3)}
The augmented feature vectors are (1, 2), (1, 3), (1, 4).
E(w0, w1) = (1/2)[(w0 + w1·2 − 1)² + (w0 + w1·3 − 2)² + (w0 + w1·4 − 5)²]
∂E/∂w0 = (w0 + w1·2 − 1)·1 + (w0 + w1·3 − 2)·1 + (w0 + w1·4 − 5)·1
∂E/∂w1 = (w0 + w1·2 − 1)·2 + (w0 + w1·3 − 2)·3 + (w0 + w1·4 − 5)·4
∇E(⃗w) = (∂E/∂w0, ∂E/∂w1) = (w0 + w1·2 − 1)·(1, 2) + (w0 + w1·3 − 2)·(1, 3) + (w0 + w1·4 − 5)·(1, 4)
214

Function Approximation – Learning
Gradient descent:
▶ Weights ⃗w(0) are initialized randomly close to ⃗0.
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε · ∇E(⃗w(t)) = ⃗w(t) − ε · ∑_{k=1}^p (⃗w(t) · x̃k − fk) · x̃k = ⃗w(t) − ε · ∑_{k=1}^p (h[⃗w(t)](⃗xk) − fk) · x̃k
Here 0 < ε ≤ 1 is a learning rate.
Note that the algorithm is almost identical to the batch perceptron algorithm!
Proposition
For sufficiently small ε > 0, the sequence ⃗w(0), ⃗w(1), ⃗w(2), . . . converges (component-wise) to the global minimum of E.
215

Example
Training set:
D = {(x1, f1), (x2, f2), (x3, f3)} = {(0, 0), (2, 1), (2, 2)}
Note that the input vectors are one-dimensional, so we write them as numbers. That is,
x1 = 0, x2 = 2, x3 = 2
x̃1 = (1, 0), x̃2 = (1, 2), x̃3 = (1, 2)
f1 = 0, f2 = 1, f3 = 2
Assume that the initial vector ⃗w(0) is ⃗w(0) = (w0(0), w1(0)) = (0, 2). Consider ε = 1/10.
216
(Figure: the data points and the initial model. 217)

Training set: D = {(x1, f1), (x2, f2), (x3, f3)} = {(0, 0), (2, 1), (2, 2)}
Augmented input vectors: x̃1 = (1, 0), x̃2 = (1, 2), x̃3 = (1, 2)
∇E(⃗w) = (∂E/∂w0 (⃗w), ∂E/∂w1 (⃗w)) = (w0 + w1·x1 − f1)·x̃1 + (w0 + w1·x2 − f2)·x̃2 + (w0 + w1·x3 − f3)·x̃3
For ⃗w(0) = (0, 2) we have
∇E(⃗w(0)) = (0 + 2·0 − 0)·(1, 0) + (0 + 2·2 − 1)·(1, 2) + (0 + 2·2 − 2)·(1, 2) = (3, 6) + (2, 4) = (5, 10)
Finally, ⃗w(1) is computed by
⃗w(1) = ⃗w(0) − ε · ∇E(⃗w(0)) = (0, 2) − (1/10)·(5, 10) = (−1/2, 1)
218

(Figure. 219)

Linear Regression – Animation
(A sequence of figures showing the model after successive gradient descent steps.)
220

Finding the Minimum in Dimension One
Assume n = 1. Then the error function E is
E(w0, w1) = (1/2) ∑_{k=1}^p (w0 + w1 xk − fk)²
Minimize E w.r.t. w0 and w1:
∂E/∂w0 = 0 ⇔ w0 = f̄ − w1 x̄ ⇔ f̄ = w0 + w1 x̄
where x̄ = (1/p) ∑_{k=1}^p xk and f̄ = (1/p) ∑_{k=1}^p fk
∂E/∂w1 = 0 ⇔ w1 = [(1/p) ∑_{k=1}^p (fk − f̄)(xk − x̄)] / [(1/p) ∑_{k=1}^p (xk − x̄)²]
i.e., w1 = cov(f, x) / var(x)
221

Effect of Outliers
(A sequence of figures: a single outlier may strongly tilt the fitted line.)
222

Maximum Likelihood vs Least Squares (Dim 1)
Fix a training set D = {(x1, f1), (x2, f2), . . . , (xp, fp)}. Assume that each fk has been generated randomly by
fk = (w0 + w1 · xk) + ϵk
where w0, w1 are unknown weights, and the ϵk are independent, normally distributed noise values with mean 0 and some variance σ².
How "probable" is it to generate the correct f1, . . . , fp? The following conditions are equivalent:
▶ w0, w1 minimize the squared error E,
▶ w0, w1 maximize the likelihood (i.e., the "probability") of generating the correct values f1, . . . , fp using fk = (w0 + w1 · xk) + ϵk.
223

Comments on Linear Models
▶ Linear models are parametric, i.e., they have a fixed form with a small number of parameters that need to be learned from data (as opposed, e.g., to decision trees, where the structure is not fixed in advance).
▶ Linear models are stable, i.e., small variations in the training data have only a limited impact on the learned model (tree models typically vary more with the training data).
▶ Linear models are less likely to overfit the training data (low variance) but sometimes tend to underfit (high bias).
▶ Linear models are prone to outliers.
224
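The worked gradient descent above is easy to replay. A minimal sketch (our own code, assuming NumPy) that reproduces ⃗w(1) = (−1/2, 1) and then iterates to the minimum:

```python
import numpy as np

X = np.array([[1, 0], [1, 2], [1, 2]], dtype=float)   # augmented vectors x̃k
f = np.array([0, 1, 2], dtype=float)
w = np.array([0, 2], dtype=float)                     # w(0)
eps = 0.1                                             # learning rate 1/10

for t in range(200):
    grad = ((X @ w - f)[:, None] * X).sum(axis=0)     # ∑_k (w·x̃k − fk)·x̃k
    w = w - eps * grad
    if t == 0:
        print(w)    # [-0.5  1. ]  (= w(1) computed above)

print(w)            # approaches (0, 0.75), the closed-form minimum
                    # w1 = cov(f, x)/var(x), w0 = f̄ − w1·x̄
```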
Unsupervised Learning 225

Clustering
Often data form clusters based on some notion of similarity. This means that the data distribution is multimodal, i.e., it contains several regions of higher probability mass.
We aim to group data into clusters of "similar" examples without using any additional information (no supervision).
226

Motivation
Clustering is useful, e.g., in
▶ customer segmentation based on purchases,
▶ data exploration - identifying patterns in data,
▶ semi-supervised learning - clustering labeled examples together with the unlabeled ones,
▶ search engines - searching for images similar to a given image,
▶ image segmentation,
▶ ...
227

Segmentation (figure) 228

Clustering Problem
Consider a dataset
D = {⃗x1, . . . , ⃗xp}
Note that no target class/value is provided. A clustering is a partition U = {U1, . . . , UK} of D into K clusters.
How do we identify the clusters? Assume that we have a distance measure d measuring how far apart the objects being clustered are. We want close objects to be clustered together.
For concreteness:
▶ We stick with numerical features, which means that the dataset D = {⃗x1, . . . , ⃗xp} contains vectors ⃗xi ∈ Rn.
▶ Assume the Euclidean distance d.
Note that clustering may be based on completely different similarity/dissimilarity measures and non-numerical data.
229

K-Means Clustering 230

K-means clustering
The K-means clustering model consists of
▶ the number of clusters K,
▶ K cluster prototypes ⃗m1, . . . , ⃗mK ∈ Rn,
▶ an assignment qij ∈ {0, 1} for i = 1, . . . , p and j = 1, . . . , K of inputs ⃗xi to clusters Uj so that ∑_j qij = 1 for i = 1, . . . , p.
A given assignment {qij} induces a clustering U = {U1, . . . , UK} where ⃗xi ∈ Uj iff qij = 1.
How good is a given model?
231

Error Function
Measure the distance of inputs ⃗xi to their cluster prototypes ⃗mj:
E({qij}, {⃗mj}) = ∑_{i=1}^p ∑_{j=1}^K qij d(⃗xi, ⃗mj)²
We aim to minimize this error, i.e., to find positions of the cluster prototypes and an assignment of inputs to them that minimize the total squared distance of examples to their prototypes.
232

K-means Clustering Algorithm
The problem: Minimize
E({qij}, {⃗mj}) = ∑_{i=1}^p ∑_{j=1}^K qij d(⃗xi, ⃗mj)²
w.r.t. {qij}, {⃗mj}. Note that
▶ If we fix {⃗mj}, we can minimize E({qij}, {⃗mj}) by setting qij = 1 iff ⃗mj is the closest prototype to ⃗xi.
▶ If we fix {qij}, we can minimize E({qij}, {⃗mj}) by letting each ⃗mj minimize the total squared distance to the inputs it represents, ∑_i qij d(⃗xi, ⃗mj)². This is achieved by putting each prototype ⃗mj into the centroid of all inputs it represents:
⃗mj = (∑_{i=1}^p qij ⃗xi) / (∑_{i=1}^p qij)
Note that ∑_{i=1}^p qij is the size of the cluster represented by ⃗mj.
233

K-Means Clustering Algorithm
Algorithm 1: K-means clustering
1: Initialize K cluster prototypes ⃗m1, ⃗m2, . . . , ⃗mK randomly
2: repeat
3:   for each data point ⃗xi do
4:     Assign ⃗xi to the nearest prototype, i.e., set qij = 1 for j = arg min_j d(⃗xi, ⃗mj)²
5:   end for
6:   for each cluster prototype ⃗mj do
7:     Update ⃗mj to be the centroid of all points assigned to it:
       ⃗mj = (∑_{i=1}^p qij ⃗xi) / (∑_{i=1}^p qij)
8:   end for
9: until convergence
234
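A minimal sketch of Algorithm 1 on toy data (our own implementation; it does not handle empty clusters, which are discussed below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three Gaussian blobs as toy data (our own choice of centers).
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
K = 3
m = X[rng.choice(len(X), K, replace=False)]     # random initial prototypes

for _ in range(100):
    # Lines 3-5: assign each point to its nearest prototype.
    dists = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    q = dists.argmin(axis=1)
    # Lines 6-8: move each prototype to the centroid of its points.
    new_m = np.array([X[q == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_m, m):                   # convergence test
        break
    m = new_m

inertia = dists.min(axis=1).sum()               # the error E({qij}, {mj})
print(m, inertia)
```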
Example
(A sequence of figures illustrating one run of K-means: Lines 3-5 assign examples to the prototypes; Lines 6-8 move the prototypes to the centroids of their examples; this alternation repeats until convergence. 235 236)

Convergence of K-means Clustering
Every step of K-means reduces the error E({qij}, {⃗mj}):
▶ We always assign an input vector to the closest prototype.
▶ We always move the prototype to be "closest" to the input vectors it represents.
Convergence can be tested by computing the error and checking whether it has changed in the last step. Convergence always occurs after finitely many steps: there are only finitely many possible assignments {qij}, and we always minimize the distance of inputs to their assigned prototypes.
(Figure: example error development during training. Blue circles mean reassignment, and red circles mean moving prototypes.)
237

Setting K - the Elbow Method
K-means clustering minimizes the inertia measure:
E({qij}, {⃗mj}) = ∑_{i=1}^p ∑_{j=1}^K qij d(⃗xi, ⃗mj)²
That is, the sum of squared distances of all examples of D to their cluster prototypes. Note that the error does not consider the distance between the centers of the clusters. Still, it is a valid measure that can be used to select the number of clusters.
238

Elbow Method
The following method for setting hyperparameters can be used in general. Let us illustrate the elbow method on K-means clustering with the inertia measure. Consider the following data: (figure)
239

Elbow Method
(Figure: inertia as a function of the number of clusters.) We could choose four clusters because adding more leads only to small decrements in the inertia.
240

Bad Behavior
Minimizing E({qij}, {⃗mj}) starting from random positions of prototypes does not always produce "nice" results. Some runs correspond to apparently bad solutions to the clustering problem, even though a better solution exists.
Possible solution: Start the algorithm several times with random initializations of the prototypes.
241

Properties of K-means Clustering
▶ Prototype initialization is a big issue in K-means. There are various strategies. For example:
▶ Start with all prototypes in a single corner.
▶ Include randomness in the setting of prototypes throughout the algorithm.
▶ Initialize sequentially: always fit the prototypes, then choose a new one as far away from the others as possible.
▶ Use hierarchical clustering (next slides) to find clusters and initialize K-means with their centroids.
▶ Empty clusters may occur and need to be resolved, e.g., by assigning the point farthest from any current prototype.
▶ As the squared error is behind the basic method, outliers may strongly affect its behavior (as in the linear regression case).
▶ Other problematic properties of data include
▶ non-convex clusters,
▶ clusters of different sizes,
▶ non-linearly separable clusters,
▶ overlapping clusters.
242 243

Agglomerative Clustering 244

Agglomerative Clustering
Consider a dataset
D = {⃗x1, . . . , ⃗xp}
Here ⃗xi ∈ Rn for all i = 1, . . . , p. Assume a distance d (e.g., Euclidean).
Idea:
▶ Start by merging the closest examples (w.r.t. d).
▶ Incrementally build larger clusters by merging smaller clusters.
More concretely:
▶ Maintain a set of clusters.
▶ Initially, each ⃗xi is in its own cluster.
▶ Repeat until only one cluster is left:
▶ pick the two closest clusters,
▶ merge them into a new cluster.
How do we determine the closest clusters?
245

Closest Clusters (figures) 246

Distance Between Clusters (figures) 247

Which One is Closer? (figure) 248

Distance Between Clusters
Consider two clusters Uj, Uk ⊆ D.
single-linkage(Uj, Uk) = min{d(⃗x, ⃗z) | ⃗x ∈ Uj, ⃗z ∈ Uk}
complete-linkage(Uj, Uk) = max{d(⃗x, ⃗z) | ⃗x ∈ Uj, ⃗z ∈ Uk}
average-linkage(Uj, Uk) = (1 / (|Uj| |Uk|)) ∑_{⃗x∈Uj} ∑_{⃗z∈Uk} d(⃗x, ⃗z)
Each linkage can result in a different clustering.
249

Agglomerative Hierarchical Clustering Algorithm
Maintain a set of clusters.
Initially, each ⃗xi is in its own cluster.
repeat
  Pick the two closest clusters (using the distance measure d and single, average, or complete linkage).
  Merge them into a new cluster.
until only one cluster is left
250
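A minimal sketch of the algorithm using SciPy's hierarchical clustering routines (assuming SciPy is available; the toy points are our own):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in [(0, 0), (2, 2)]])

# Each row of Z records one merge: (cluster a, cluster b, distance, new size).
Z = linkage(X, method='single')       # also: 'complete', 'average'
print(Z[:5])

labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy at 2 clusters
print(labels)
```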
Example
(Figure: six points, numbered 1-6, to be clustered. 251)

Example - Single Linkage
(A sequence of figures showing the merges:)
▶ d(3, 6) = 0.11, which is the minimum distance between points, so {3} and {6} are merged first.
▶ d(2, 5) = 0.14, the second smallest distance, so {2} and {5} are merged next.
▶ d(2, 3) = 0.15 = min{d(2, 3), d(2, 6), d(5, 3), d(5, 6)}, which is smaller than d(1, 2) = 0.24, d(1, 3) = 0.22, d(4, 2) = 0.2, d(4, 3) = 0.16, d(4, 1) = 0.37, the minimum distances of points in all other pairs of clusters; so {2, 5} and {3, 6} are merged.
▶ d(4, 3) = 0.16 = min{d(4, 3), d(4, 5), d(4, 2), d(4, 6)}, which is smaller than d(1, 3) = 0.22, the minimum distance of 1 to the large cluster; so 4 joins the cluster.
▶ Finally, 1 is merged with the rest.
252

Example - Average Linkage
(A sequence of figures:)
▶ d(3, 6) = 0.11 is the smallest distance, so {3} and {6} are merged first.
▶ d(2, 5) = 0.14 is the second smallest distance, so {2} and {5} are merged next.
▶ The average distance between 4 and the two points of {3, 6} is (1/2)(d(4, 3) + d(4, 6)) = 0.19, which is smaller than the average distance between the points of the clusters {2, 5} and {3, 6}, namely (d(2, 3) + d(2, 6) + d(5, 3) + d(5, 6))/4 (equal to 0.205), and smaller than the average distance of 1 to any cluster; so {4} and {3, 6} are merged.
▶ The average distance between the two remaining non-singleton clusters is 0.26, which is smaller than the average distance of 1 to either of them (the average distances are 0.273 and 0.29); so these two clusters are merged.
▶ Finally, 1 is merged with the rest.
253

Example - Complete Linkage
(A sequence of figures:)
▶ d(3, 6) = 0.11 is the minimum distance between points, so {3} and {6} are merged first.
▶ d(2, 5) = 0.14 is the second smallest distance, so {2} and {5} are merged next.
▶ d(4, 6) = 0.22 = max{d(4, 3), d(4, 6)}, which is smaller than d(4, 5) = 0.29, d(1, 5) = 0.34, d(1, 6) = 0.23, d(5, 6) = 0.39, d(4, 1) = 0.37, the maximum distances of points in all other pairs of clusters; so {4} and {3, 6} are merged.
▶ d(1, 5) = 0.34, which is smaller than d(1, 4) = 0.37 and d(5, 6) = 0.39, the maximum distances of points in all other pairs of clusters; so {1} and {2, 5} are merged.
▶ Finally, the two remaining clusters are merged.
254

Properties of Agglomerative Hierarchical Clustering
▶ Provides a hierarchy of clusters - different cut levels provide different levels of coarseness of the clusters.
▶ Compared with K-means, it does not depend on initialization and may provide better clusters than K-means.
▶ Lack of a global objective function: agglomerative hierarchical clustering uses local criteria to decide which clusters to merge.
▶ Agglomerative clustering has a "rich get richer" behavior that leads to uneven cluster sizes.
▶ Merging decisions cannot be undone - bad for noisy data.
▶ Computationally expensive.
255 256

State level statistics for US
(A sequence of figures: hierarchical clustering of US states based on state-level statistics.)
257

Cluster Validation 258

Cluster Validity
For supervised classification (= we have class labels), we have a variety of measures to evaluate how good our model is: Accuracy, Precision, Recall, F1, etc. For cluster analysis (= unsupervised learning), the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
Keep in mind that the dataset can be large and high-dimensional; visualization might be difficult.
259

Random points:
(Figures: K-means and hierarchical clusterings of uniformly random points.)
260
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure exists in the data (e.g., to avoid overfitting).
2. Internal validation: evaluating how well the cluster analysis results fit the data without reference to external information.
3. External validation: comparing the cluster analysis results to externally known class labels.
4. Comparing clusterings to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
261

Measures of Cluster Validity
Numerical measures applied to judge various aspects of cluster validity are classified into the following three types:
▶ Internal index: used to measure the goodness of a clustering structure without respect to external information.
▶ External index: used to measure the extent to which cluster labels match externally supplied class labels.
▶ Relative index: used to compare two different clusterings or clusters.
262

Internal Index
Consider a dataset
D = {⃗x1, . . . , ⃗xp}
Assume that a clustering algorithm produced a partition U = {U1, . . . , UK} of D into K clusters. No other information has been provided. We aim to measure the "niceness" of the clustering.
Assume that we have a distance measure d measuring how far apart the objects being clustered are. For concreteness:
▶ We stick with numerical features, which means that the dataset D = {⃗x1, . . . , ⃗xp} contains vectors ⃗xi ∈ Rn.
▶ Assume the Euclidean distance d.
Note that the validity measures may be based on completely different similarity/dissimilarity measures and non-numerical data.
263

Cohesion and Separation
Consider a dataset D = {⃗x1, . . . , ⃗xp} and its clustering U = {U1, . . . , UK}. Let us utilize the concept of distance to cluster prototypes and consider the distance between prototypes.
To measure the dissimilarity of examples, we use a notion of proximity defined on pairs of vectors ⃗x, ⃗z: proximity(⃗x, ⃗z). The proximity might be, e.g.,
▶ the distance d(⃗x, ⃗z),
▶ the square of the distance, that is, d(⃗x, ⃗z)²,
▶ any other notion of dissimilarity based on the application.
We consider the notions of cohesion (proximity of examples within clusters) and separation (proximity of clusters).
264

Prototype-based Cohesion and Separation
Prototype-based cohesion = the proximity of examples within a given cluster to a prototype of the cluster (e.g., its centroid). Given a cluster Uj ∈ U and its prototype ⃗mj ∈ Rn,
cohesion(Uj) = ∑_{⃗x∈Uj} proximity(⃗x, ⃗mj)
Note that the prototype does not have to be an element of Uj. Intuitively, cohesion is the proximity of the cluster's examples to a point somewhere "between" all examples of the cluster.
Prototype-based separation = the dissimilarity of prototypes of different clusters. Given a cluster Uj ∈ U, its prototype ⃗mj ∈ Rn, and a prototype ⃗m ∈ Rn of all examples (e.g., the centroid of all examples),
separation(Uj) = proximity(⃗mj, ⃗m)
Intuitively, separation is the proximity of the cluster's examples to the dataset's center.
265
Prototype-based Cohesion and Separation
Summarize the prototype-based cohesion and separation as follows:
cohesion(U) = ∑_{j=1}^K cohesion(Uj) = ∑_{j=1}^K ∑_{⃗x∈Uj} proximity(⃗x, ⃗mj)
separation(U) = ∑_{j=1}^K |Uj| · separation(Uj) = ∑_{j=1}^K |Uj| · proximity(⃗mj, ⃗m)
If proximity(⃗x, ⃗z) is defined as d(⃗x, ⃗z)², then the cohesion is the inertia. There is an interesting relationship between the above measures and the squared distances to the prototype of the whole dataset ⃗m.
266

Prototype-based Cohesion and Separation
Consider a dataset D = {⃗x1, . . . , ⃗xp} and a clustering U = {U1, . . . , UK} of D. Consider proximity(⃗x, ⃗z) = d(⃗x, ⃗z)² and all prototypes to be centroids. Let ⃗m ∈ Rn be the centroid of all examples:
⃗m = (1/|D|) ∑_{i=1}^p ⃗xi
Define
TSS = ∑_{i=1}^p proximity(⃗xi, ⃗m) = ∑_{i=1}^p d(⃗xi, ⃗m)²
The following holds:
TSS = cohesion(U) + separation(U)
Note that TSS is determined by D alone.
267

Silhouette Score
The silhouette score can be used to measure the quality of a clustering both from the point of view of individual examples and from the point of view of the overall clustering:
silhouette(⃗x) = (b − a) / max{a, b}
268

Silhouette Score
Consider a clustering U = {U1, . . . , UK} and ⃗x ∈ Uj. If |Uj| > 1, we define
a(⃗x) = (1 / (|Uj| − 1)) ∑_{⃗z∈Uj∖{⃗x}} d(⃗x, ⃗z)
b(⃗x) = min_{k≠j} (1 / |Uk|) ∑_{⃗z∈Uk} d(⃗x, ⃗z)
If |Uj| > 1, we define
silhouette(⃗x) = (b(⃗x) − a(⃗x)) / max{a(⃗x), b(⃗x)}
Otherwise, we define silhouette(⃗x) = 0.
269

Example (figure) 270
Example (figure) 271

Silhouette for Clusters and Clusterings
We have defined the silhouette for a single ⃗x ∈ D. To obtain the silhouette score for a whole cluster Uj or for D, we summarize using simple averaging:
silhouette(Uj) = (1/|Uj|) ∑_{⃗x∈Uj} silhouette(⃗x)
silhouette(D) = (1/|D|) ∑_{⃗x∈D} silhouette(⃗x)
272

Example
(A sequence of figures: the colored graphs on the left are silhouette scores of the individual elements of the clusters; the red vertical line denotes the silhouette score of the whole dataset.)
273

External Index
Consider a supervised learning dataset
D = {(⃗x1, c1), . . . , (⃗xp, cp)}
Here ci ∈ C is the class of ⃗xi. Assume that a clustering algorithm produced a partition U = {U1, . . . , UK} of D into K clusters. We measure how the clustering conforms with the given classes.
274

Purity
Consider the clustering to be a classification model. Define a classifier h : D → C such that given ⃗xi ∈ Uj ∈ U,
h(⃗xi) = the most frequent class in Uj
Now we can measure the Accuracy of h. The Accuracy of h is called purity. Intuitively, it is the proportion of majority-class elements in the clusters.
Is it a good measure? Probably not; many clusters lead to high purity (each element in its own cluster means purity = 1).
275
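Returning to the silhouette score for a moment: it is easy to compute in practice. A minimal sketch using scikit-learn (assuming it is available; the toy data is our own):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

for K in (2, 3, 4):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, silhouette_score(X, labels))   # silhouette(D); highest near K = 3

# silhouette_samples from the same module gives silhouette(x) per example.
```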
Classifier Point of View
Given ⃗xi, denote by U(⃗xi) the cluster Uj ∈ U containing ⃗xi. Distinguish the following types of pairs of examples:
▶ TP = number of pairs of examples of the same class and the same cluster
TP = |{(i, j) | U(⃗xi) = U(⃗xj) ∧ ci = cj}|
▶ TN = number of pairs of examples of different classes and different clusters
TN = |{(i, j) | U(⃗xi) ≠ U(⃗xj) ∧ ci ≠ cj}|
▶ FP = number of pairs of examples of different classes and the same cluster
FP = |{(i, j) | U(⃗xi) = U(⃗xj) ∧ ci ≠ cj}|
▶ FN = number of pairs of examples of the same class and different clusters
FN = |{(i, j) | U(⃗xi) ≠ U(⃗xj) ∧ ci = cj}|
Now, we may apply all the measures from the supervised setting.
276

Example
(Figure: two clusters; one contains 4 examples of the first class and 1 of the second, the other contains 2 of the first class and 5 of the second.)
TP = (4 choose 2) + (5 choose 2) + (2 choose 2) = 6 + 10 + 1 = 17
TN = 4 · 5 + 1 · 2 = 22
FP = 1 · 4 + 5 · 2 = 14
FN = 1 · 5 + 4 · 2 = 13

             Cluster
Class        same      different
same         TP = 17   FN = 13
different    FP = 14   TN = 22
277

Rand Index
Accuracy (in this area known as the Rand index) is
RandInd = Accuracy = (TP + TN) / (TP + TN + FP + FN)
In our example, RandInd = (17 + 22)/(17 + 22 + 14 + 13) ≈ 0.59.
Note that the Rand index considers both the purity and the number of clusters. Also note that the Rand index can be used to compare two clusterings: simply consider the class labels to be indicators of clusters. Similarly, we may compute the other measures, such as Precision, Recall, and F1, with all their benefits and limitations.
278

Logistic Regression 279

What about classification using regression?
Binary classification: desired outputs 0 and 1 ... we want to capture the probability distribution of the classes.
(Figures: fitting a straight line to 0/1 data does not capture the probability well - it is not a probability at all; the logistic sigmoid 1/(1 + e^{−⃗w·x̃}) is much better!)
280

Logistic Regression
A logistic regression model h[⃗w] is determined by a vector of weights ⃗w = (w0, w1, . . . , wn) ∈ R^{n+1} as follows: Given ⃗x = (x1, . . . , xn) ∈ Rn,
h[⃗w](⃗x) := 1 / (1 + e^{−(w0 + ∑_{k=1}^n wk xk)}) = 1 / (1 + e^{−⃗w·x̃})
Here x̃ = (x0, x1, . . . , xn), where x0 = 1, is the augmented feature vector.
281
Logistic Regression 279
What about classification using regression?
Binary classification: desired outputs 0 and 1 ... we want to capture the probability distribution of the classes. A linear function does not capture the probability well (it is not a probability at all) ... the logistic sigmoid 1/(1 + e^{−⃗w·˜x}) is much better! 280
Logistic Regression
A logistic regression model h[⃗w] is determined by a vector of weights ⃗w = (w0, w1, . . . , wn) ∈ R^{n+1} as follows: Given ⃗x = (x1, . . . , xn) ∈ Rⁿ,
h[⃗w](⃗x) := 1 / (1 + e^{−(w0 + Σ_{k=1}^n wk·xk)}) = 1 / (1 + e^{−⃗w·˜x})
Here ˜x = (x0, x1, . . . , xn), where x0 = 1, is the augmented feature vector. 281
But what is the meaning of the sigmoid?
The model gives the probability h[⃗w](⃗x) of the class 1 given an input ⃗x. But why do we model this probability using 1/(1 + e^{−⃗w·˜x})??
Denote by h̄ the probability P(Y = 1 | X = ⃗x), i.e., the "true" probability of the class 1 given features ⃗x. The probability h̄ cannot be easily modeled using a linear function (the probabilities are between 0 and 1).
What about the odds of the class 1? odds(h̄) = h̄/(1 − h̄). Better, at least it is unbounded on one side ...
What about the log odds (aka logit) of the class 1? logit(h̄) = log(h̄/(1 − h̄)). Looks almost linear, at least for probabilities not too close to 0 or 1 ... 282
But what is the meaning of the sigmoid?
Assume that h̄ is the actual probability of the class 1 for an "object" with features ⃗x ∈ Rⁿ. Put
log(h̄/(1 − h̄)) = ⃗w · ˜x
Then log((1 − h̄)/h̄) = −⃗w · ˜x and (1 − h̄)/h̄ = e^{−⃗w·˜x} and
h̄ = 1 / (1 + e^{−⃗w·˜x}) = h[⃗w](⃗x)
If we model the log odds using a linear function, the probability is obtained by applying the logistic sigmoid to the result of the linear function. 283
Logistic Regression
▶ Given a set D of training samples:
D = {(⃗x1, c1), (⃗x2, c2), . . . , (⃗xp, cp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rⁿ and ck ∈ {0, 1}.
Recall that h[⃗w](⃗xk) = 1/(1 + e^{−⃗w·˜xk}), where ˜xk = (xk0, xk1, . . . , xkn) and xk0 = 1.
Our goal: Find ⃗w such that for every k = 1, . . . , p we have h[⃗w](⃗xk) ≈ ck.
▶ Binary cross-entropy:
E(⃗w) = −Σ_{k=1}^p [ck·log(h[⃗w](⃗xk)) + (1 − ck)·log(1 − h[⃗w](⃗xk))] 284
Gradient of the Error Function
Consider the gradient of the error function:
∇E(⃗w) = (∂E/∂w0(⃗w), . . . , ∂E/∂wn(⃗w)) = Σ_{k=1}^p (h[⃗w](⃗xk) − ck)·˜xk
Fact 1: If ∇E(⃗w) = ⃗0 = (0, . . . , 0), then ⃗w is a global minimum of E. This follows from the fact that E is convex.
Using the squared error with the logistic sigmoid would lead to a non-convex error with several minima! 285
Logistic Regression – Learning
Gradient descent (a small code sketch follows at the end of this part):
▶ The weights ⃗w(0) are initialized randomly close to ⃗0.
▶ In the (t + 1)-th step, ⃗w(t+1) is computed as follows:
⃗w(t+1) = ⃗w(t) − ε·∇E(⃗w(t)) = ⃗w(t) − ε·Σ_{k=1}^p (h[⃗w(t)](⃗xk) − ck)·˜xk
Here 0 < ε ≤ 1 is the learning rate.
Note that the algorithm is almost identical to the batch perceptron algorithm!
Proposition: For sufficiently small ε > 0, the sequence ⃗w(0), ⃗w(1), ⃗w(2), . . . converges (in a component-wise manner) to the global minimum of the error function E. 286
Logistic Regression – Using the Trained Model
We have already trained our logistic regression model, i.e., we have a vector of weights ⃗w = (w0, w1, . . . , wn). The model is the function h[⃗w] which, for a given feature vector ⃗x = (x1, . . . , xn), returns the probability
h[⃗w](⃗x) = 1 / (1 + e^{−(w0 + Σ_{k=1}^n wk·xk)})
that ⃗x belongs to the class 1.
To decide whether a given ⃗x belongs to the class 1, we use h[⃗w] as a Bayes classifier: Assign ⃗x to the class 1 iff h[⃗w](⃗x) ≥ 1/2.
Other thresholds can also be used, depending on the application and properties of the model. In such a case, given a threshold ξ ∈ [0, 1], assign ⃗x to the class 1 iff h[⃗w](⃗x) ≥ ξ. 287
Maximum Likelihood vs Cross-entropy (Dim 1)
Fix a training set D = {(x1, c1), (x2, c2), . . . , (xp, cp)}.
Generate a sequence (c′1, . . . , c′p) ∈ {0, 1}^p where each c′k has been generated independently by the Bernoulli trial generating 1 with probability
h[w0, w1](xk) = 1 / (1 + e^{−(w0 + w1·xk)})
and 0 otherwise. Here w0, w1 are unknown weights.
How "probable" is it to generate the correct classes c1, . . . , cp? The following conditions are equivalent:
▶ w0, w1 minimize the binary cross-entropy E
▶ w0, w1 maximize the likelihood (i.e., the "probability") of generating the correct values c1, . . . , cp using the above described Bernoulli trials (i.e., that c′k = ck for all k = 1, . . . , p)
Note that the above equivalence is a property of the cross-entropy and does not depend on the "implementation" of h[w0, w1](xk) using the logistic sigmoid. 288
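A minimal numpy sketch of the batch gradient descent above. Only the update rule comes from the slides; the data, the learning rate, and the number of steps are made-up illustration values:

    import numpy as np

    def train_logreg(X, c, eps=0.1, steps=2000):
        # X: p x n feature matrix, c: p labels in {0, 1}
        Xa = np.hstack([np.ones((len(X), 1)), X])  # augmented vectors, x0 = 1
        w = 0.01 * np.random.randn(Xa.shape[1])    # init close to 0
        for _ in range(steps):
            h = 1.0 / (1.0 + np.exp(-Xa @ w))      # h[w](x_k) for all k
            w -= eps * (Xa.T @ (h - c))            # gradient of cross-entropy
        return w

    X = np.array([[0.0], [1.0], [2.0], [3.0]])     # made-up 1D data
    c = np.array([0.0, 0.0, 1.0, 1.0])
    w = train_logreg(X, c)
    x = np.array([1.0, 2.5])                       # augmented query (x0 = 1)
    print(1.0 / (1.0 + np.exp(-(w @ x))))          # estimated P(class 1)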
Support Vector Machines (SVM) 289
SVM Idea – Which Linear Classifier is the Best?
Benefits of maximum margin:
▶ Intuitively, the maximum margin is good w.r.t. generalization.
▶ Only the support vectors (those on the margin) matter; the others can, in principle, be ignored. 290
Support Vector Machines (SVM)
Notation:
▶ ⃗w = (w0, w1, . . . , wn) a vector of weights,
▶ ⃗w′ = (w1, . . . , wn) the vector of all weights except w0 (in the distance and margin formulas below, ∥⃗w∥ is to be understood as ∥⃗w′∥ = √(w1² + · · · + wn²)),
▶ ⃗x = (x1, . . . , xn) a (generic) feature vector,
▶ ˜x = (x0, x1, . . . , xn) an augmented feature vector where x0 = 1.
Consider a linear classifier:
h[⃗w](⃗x) := 1 if w0 + Σ_{i=1}^n wi·xi = ⃗w·˜x ≥ 0, and −1 if ⃗w·˜x < 0
The distance of ⃗x from the separating hyperplane determined by ⃗w is
d[⃗w](⃗x) = |⃗w·˜x| / ∥⃗w∥
Recall that ⃗w·˜x is positive for ⃗x on the side to which ⃗w′ points and negative on the opposite side. 291
Margin
▶ Given a training set
D = {(⃗x1, y1), (⃗x2, y2), . . . , (⃗xp, yp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ X ⊆ Rⁿ and yk ∈ {−1, 1}.
▶ Assume that D is linearly separable; let ⃗w be consistent with D.
The margin of ⃗w is twice the minimum distance between the feature vectors ⃗xk and the separating hyperplane determined by ⃗w, i.e.,
2·min_k d[⃗w](⃗xk) = 2·min_k |⃗w·˜xk| / ∥⃗w∥
▶ Our goal is to find ⃗w consistent with D that maximizes the margin.
Note that to maximize the margin it suffices to maximize min_k |⃗w·˜xk| / ∥⃗w∥ over ⃗w consistent with D. 293
Finding the Maximum Margin Classifier
We want to maximize the minimum distance of the feature vectors ⃗xk from the separating hyperplane determined by ⃗w. Formally, we use the following:
To maximize the margin, find ⃗w maximizing
min_k |⃗w·˜xk| / ∥⃗w∥ (= the distance of the closest ⃗xk's to the separating hyperplane)
over the following constraints:
⃗w·˜xk > 0 for all k satisfying yk = 1
⃗w·˜xk < 0 for all k satisfying yk = −1
(the constraints make sure that ⃗w is consistent with the training set D) 294
Can be made more succinct:
To maximize the margin, find ⃗w maximizing
min_k (yk·⃗w·˜xk) / ∥⃗w∥
over min_k (yk·⃗w·˜xk) > 0.
The reason is that ⃗w is consistent with D. That is, ⃗w·˜xk > 0 for yk = 1, and ⃗w·˜xk < 0 for yk = −1. 295
Observation: For every ⃗w satisfying min_k(yk·⃗w·˜xk) > 0 there is ⃗w′′ satisfying min_k(yk·⃗w′′·˜xk) = 1 such that
min_k (yk·⃗w·˜xk) / ∥⃗w∥ = min_k (yk·⃗w′′·˜xk) / ∥⃗w′′∥
Proof: Just consider ⃗w′′ = ⃗w/ξ where ξ = min_k(yk·⃗w·˜xk).
Hence:
To maximize the margin, find ⃗w maximizing
min_k (yk·⃗w·˜xk) / ∥⃗w∥
over min_k (yk·⃗w·˜xk) = 1. 296
This can be further simplified to:
To maximize the margin, find ⃗w maximizing 1/∥⃗w∥ over min_k (yk·⃗w·˜xk) = 1. 299
Can be adjusted by loosening the constraints:
To maximize the margin, find ⃗w maximizing 1/∥⃗w∥ over min_k (yk·⃗w·˜xk) ≥ 1.
If the latter is solved by ⃗w′′ with min_k(yk·⃗w′′·˜xk) > 1, then
min_k (yk·⃗w′′·˜xk) / ∥⃗w′′∥ > 1/∥⃗w′′∥ ≥ 1/∥⃗w∥ = min_k (yk·⃗w·˜xk) / ∥⃗w∥
for all ⃗w satisfying min_k(yk·⃗w·˜xk) = 1 (the middle inequality holds because ⃗w′′ maximizes 1/∥⃗w∥ over a feasible set containing all such ⃗w), which contradicts the fact that the maximum margin is attained by such a ⃗w.
300
To maximize the margin, find ⃗w maximizing 1/∥⃗w∥ over min_k (yk·⃗w·˜xk) ≥ 1.
Can be turned into:
To maximize the margin, find ⃗w minimizing ∥⃗w∥ over min_k (yk·⃗w·˜xk) ≥ 1.
And, finally:
To maximize the margin, find ⃗w minimizing ⃗w·⃗w over yk·⃗w·˜xk ≥ 1 for all k.
Indeed, just note that ∥⃗w∥ = √(⃗w·⃗w), so minimizing ∥⃗w∥ is the same as minimizing ⃗w·⃗w. 301
SVM – Optimization
Assume a given training set
D = {(⃗x1, y1), (⃗x2, y2), . . . , (⃗xp, yp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ X ⊆ Rⁿ and yk ∈ {−1, 1}. (Recall ˜xk = (xk0, xk1, . . . , xkn) where xk0 = 1.)
Margin maximization as a quadratic optimization problem:
Find ⃗w minimizing ⃗w·⃗w under the constraints yk·⃗w·˜xk ≥ 1 for all k.
Support vectors are the vectors ⃗xk closest to the optimal separating hyperplane, i.e., those satisfying yk·⃗w·˜xk = 1 for a minimizing ⃗w. 302
Example
Training set: D = {((0, 0), −1), ((1, 1), 1), ((0, 3), 1)}
That is:
⃗x1 = (0, 0), ⃗x2 = (1, 1), ⃗x3 = (0, 3)
˜x1 = (1, 0, 0), ˜x2 = (1, 1, 1), ˜x3 = (1, 0, 3)
y1 = −1, y2 = 1, y3 = 1 303
Find ⃗w minimizing w1² + w2² under the constraints
(−1)·(1·w0 + 0·w1 + 0·w2) = −w0 ≥ 1
1·(1·w0 + 1·w1 + 1·w2) = w0 + w1 + w2 ≥ 1
1·(1·w0 + 0·w1 + 3·w2) = w0 + 3w2 ≥ 1
It can be solved using a quadratic programming solver. To solve it by hand, assume that we know that ⃗x1 and ⃗x2 are support vectors:
Find ⃗w minimizing w1² + w2² under the constraints
−w0 = 1
w0 + w1 + w2 = 1
w0 + 3w2 ≥ 1
Note that the equality constraints correspond to our assumption that ⃗x1 and ⃗x2 are support vectors. 305
Substituting w0 = −1, this can be transformed to:
Find ⃗w minimizing w1² + w2² under the constraints
w1 + w2 = 2
3w2 ≥ 2 306
Substituting w2 = 2 − w1 into the quadratic function, we obtain
w1² + (2 − w1)² = w1² + w1² − 4w1 + 4 = 2w1² − 4w1 + 4
Substituting w2 = 2 − w1 into the inequality 3w2 ≥ 2, we obtain 6 − 3w1 ≥ 2.
This reduces our problem to:
Find ⃗w minimizing 2w1² − 4w1 + 4 under the constraint w1 ≤ 4/3. 307
This is solved by w1 = 1.
From w2 = 2 − w1 we obtain w2 = 2 − 1 = 1. From −w0 = 1 we obtain w0 = −1.
The final model h[⃗w](⃗x) is given by the sign of −1 + x1 + x2. The separating hyperplane is determined by −1 + x1 + x2 = 0. 308
SVM – Optimization
▶ We need to optimize a quadratic function subject to linear constraints.
▶ Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.
But why have SVMs been so successful? ... the improvement obtained by finding the maximum margin classifier does not seem to be that strong ... right?
The answer lies in their ability to deal with non-linearly separable sets efficiently using the kernel trick (see a later lecture). 310
Comments on Algorithms
▶ The main bottleneck of SVM is the complexity of quadratic programming (QP). A naive QP solver has cubic complexity.
▶ For small problems, any general-purpose optimization algorithm can be used.
▶ For large problems, this is usually not possible; many methods avoiding direct solutions have been devised.
▶ These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively:
▶ Start with a (smaller) subset of training examples.
▶ Find an optimal solution using any solver.
▶ Afterwards, only the support vectors matter in the solution! Leave only them in the training set, and add new training examples.
▶ This iterative procedure decreases the (general) cost function.
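As a quick cross-check of the worked example above, here is a sketch using scikit-learn (the library used in the course textbook). Approximating the hard margin with a very large C is an assumption of this sketch, not something prescribed by the slides:

    import numpy as np
    from sklearn.svm import SVC

    # the training set from the worked example above
    X = np.array([[0, 0], [1, 1], [0, 3]])
    y = np.array([-1, 1, 1])

    # a very large C makes the soft-margin solver behave like hard margin
    svm = SVC(kernel='linear', C=1e6)
    svm.fit(X, y)

    print(svm.coef_)             # expected approx. [[1. 1.]] -> w1 = w2 = 1
    print(svm.intercept_)        # expected approx. [-1.]     -> w0 = -1
    print(svm.support_vectors_)  # expected (0, 0) and (1, 1)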
311
Soft-margin SVM
Trade off a few misclassifications for a wide margin for the rest:
Find ⃗w minimizing
⃗w·⃗w + C·Σ_k ζk (C is a hyperparameter)
under the constraints
yk·⃗w·˜xk ≥ 1 − ζk for all k
ζk ≥ 0 for all k
This is the same as the following unconstrained optimization:
Find ⃗w minimizing the hinge loss
⃗w·⃗w + C·Σ_k max(0, 1 − yk·⃗w·˜xk) 312
Hard vs Soft Margin SVM
Source: Dishaa Agarwal https://www.analyticsvidhya.com/blog/2021/04/insight-into-svm-support-vector-machine-along-with-code/ 313
Comments on SVM
▶ SVMs were originally proposed by Boser, Guyon, and Vapnik in 1992 and gained increasing popularity in the late 1990s.
▶ SVMs are currently among the best performers for several classification tasks ranging from text to genomic data.
▶ SVMs can be applied to complex data types beyond feature vectors (e.g., graphs, sequences, relational data) by designing kernel functions for such data.
▶ SVM techniques have been extended to several tasks, such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc. 314
Kernel Methods 315
Quadratic Decision Boundary
Left: the original set. Right: transformed using the square of the features.
Right: the green line is a separating hyperplane in the transformed space. Left: the green ellipse maps exactly to the green line.
How to classify (in the original space): First, transform a given feature vector by squaring the features, then use a linear classifier. 316
Another Solution
Mapping from R² to R³ so that there is "more space" for linear separation. 317
Do We Need to Map Explicitly?
In general, mapping to a (much) higher-dimensional feature space helps (there are more "degrees of freedom", so linear separability might get a chance).
However, the complexity of learning grows (quickly) with the dimension. Sometimes it is even beneficial to map to infinite-dimensional spaces.
To avoid the explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first, we need to dualize our learning algorithm. 318
Linear Regression
▶ Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rⁿ and fk ∈ R.
▶ Our goal: Find ⃗w so that h[⃗w](⃗xk) = ⃗w·˜xk is close to fk for every k = 1, . . . , p. Recall that ˜xk = (xk0, xk1, . . . , xkn) where xk0 = 1.
▶ Squared error function:
E(⃗w) = (1/2) Σ_{k=1}^p (⃗w·˜xk − fk)² = (1/2) Σ_{k=1}^p (Σ_{i=0}^n wi·xki − fk)² 319
Regularized Linear Regression
Regularized squared error function:
E(⃗w) = (1/2) Σ_{k=1}^p (⃗w·˜xk − fk)² + ⃗w·⃗w
Intuition: the added term ⃗w·⃗w prevents the growth of the weights.
The Representer Theorem: The weight vector ⃗w∗ minimizing the regularized squared error function can be written as
⃗w∗ = Σ_{i=1}^p αi·fi·˜xi
Here α1, . . . , αp are suitable coefficients. Substituting this expression for the weights in E gives
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·˜xi·˜xk − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·˜xi·˜xj
and we minimize E′ w.r.t. α1, . . . , αp. What is this good for?? 320
Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Here ⃗xk = (xk1, . . . , xkn) ∈ Rⁿ and fk ∈ R.
Find α1, . . . , αp minimizing the dual regularized squared error
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·˜xi·˜xk − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·˜xi·˜xj
The resulting coefficients α1, . . . , αp give a weight vector
⃗w∗ = Σ_{i=1}^p αi·fi·˜xi
which in turn gives a linear model
h[⃗w∗](⃗x) = ⃗w∗·˜x = Σ_{i=1}^p αi·fi·˜xi·˜x
Note that all of ˜x, ˜xi, ˜xj, ˜xk occur only within dot products! 321
Find ⃗α = (α1, . . . , αp) minimizing the dual regularized squared error
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·˜xi·˜xk − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·˜xi·˜xj
Linear model: h[⃗α](⃗x) = Σ_{i=1}^p αi·fi·˜xi·˜x
Do we need to use the dot product in the above procedure? NO!
Find ⃗α = (α1, . . . , αp) minimizing the kernel dual regularized squared error
E′(⃗w∗) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·κ(˜xi, ˜xk) − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·κ(˜xi, ˜xj)
Non-linear model: h[⃗α](⃗x) = Σ_{i=1}^p αi·fi·κ(˜xi, ˜x)
Here κ is a kernel function. But now what is the trick? The trick is that suitable kernel functions κ correspond to dot products in transformed spaces! 322
Recall the Quadratic Decision Boundary
Left: the original set. Right: transformed using the square of the features.
Right: the green line is a separating hyperplane in the transformed space. Left: the green ellipse maps exactly to the green line.
How to classify (in the original space): Transform a given feature vector by squaring the features, then use a linear classifier. 323
Kernel Trick
For simplicity, assume bivariate data: ˜xk = (1, xk1, xk2). The corresponding instance in the quadratic feature space is (1, xk1², xk2²).
Consider two instances ˜xk = (1, xk1, xk2) and ˜xℓ = (1, xℓ1, xℓ2). Then the dot product of their corresponding instances (1, xk1², xk2²) and (1, xℓ1², xℓ2²) in the quadratic feature space is
1 + xk1²·xℓ1² + xk2²·xℓ2²
which resembles (but is not equal to)
(˜xk·˜xℓ)² = (1 + xk1·xℓ1 + xk2·xℓ2)² = 1 + xk1²·xℓ1² + xk2²·xℓ2² + 2·xk1·xℓ1·xk2·xℓ2 + 2·xk1·xℓ1 + 2·xk2·xℓ2
But now consider a mapping ϕ to R⁶ defined by
ϕ(˜xk) = (1, xk1², xk2², √2·xk1·xk2, √2·xk1, √2·xk2)
Then
ϕ(˜xk)·ϕ(˜xℓ) = (˜xk·˜xℓ)²
THE Idea: Using the kernel κ(˜xk, ˜xℓ) = (˜xk·˜xℓ)² in the kernel dual regularized squared error corresponds to using the regularized squared error after the transformation ϕ (verified numerically in the sketch at the end of this part). 324
Quadratic Decision Boundary
Given a set D of training examples:
D = {(⃗x1, f1), (⃗x2, f2), . . . , (⃗xp, fp)}
Assume that fi ∈ {1, −1} indicates the class of ⃗xi. Yes, I know that squared error regression should not be used for classification!
Considering κ(˜xk, ˜xℓ) = (˜xk·˜xℓ)² in our kernel dual regularized squared error, we obtain:
Find ⃗α = (α1, . . . , αp) minimizing
E′(⃗w) = (1/2) Σ_{k=1}^p (Σ_{i=1}^p αi·fi·(˜xi·˜xk)² − fk)² + Σ_{i=1}^p Σ_{j=1}^p αi·αj·fi·fj·(˜xi·˜xj)²
Non-linear classifier: h[⃗α](⃗x) = Σ_{i=1}^p αi·fi·(˜xi·˜x)²
Intuitively, minimizing E′ in R² gives a separating hyperplane for the input vectors transformed into R⁵. This means that in R² it searches for a quadratic (i.e., non-linear) boundary. 325
Examples of Kernels
▶ Linear: κ(˜xℓ, ˜xk) = ˜xℓ·˜xk
The corresponding mapping ϕ(˜x) = ˜x is the identity (no transformation).
▶ Polynomial of power m: κ(˜xℓ, ˜xk) = (˜xℓ·˜xk)^m
The corresponding mapping assigns to ˜x ∈ R^{n+1} the vector ϕ(˜x) in R^{C(n+m, m)+1}.
▶ Gaussian (radial-basis function): κ(˜xℓ, ˜xk) = e^{−∥˜xℓ − ˜xk∥² / (2σ²)}
The corresponding mapping ϕ maps ˜x to an infinite-dimensional vector ϕ(˜x) which is, in fact, a Gaussian function; a combination of such functions for support vectors is then the separating hypersurface.
▶ · · ·
Choosing kernels remains the black magic of kernel methods. They are usually chosen based on trial and error (of course, experience and additional insight into the data help).
A similar trick can be done with (soft-margin) support vector machines. 326
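A quick numerical check of the identity ϕ(˜xk)·ϕ(˜xℓ) = (˜xk·˜xℓ)² from the kernel trick slide, using the explicit map ϕ into R⁶ given above (the concrete vectors are made up):

    import numpy as np

    def phi(x):
        # explicit feature map for the quadratic kernel, x = (1, a, b)
        _, a, b = x
        return np.array([1, a*a, b*b,
                         np.sqrt(2)*a*b, np.sqrt(2)*a, np.sqrt(2)*b])

    xk = np.array([1.0, 0.5, -2.0])   # augmented vectors (x0 = 1)
    xl = np.array([1.0, 3.0, 1.5])

    print(phi(xk) @ phi(xl))          # dot product in the transformed space
    print((xk @ xl) ** 2)             # kernel in the original space - equal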
Neural Networks 327
(Primitive) Mathematical Model of Neuron
(Figure: a neuron with inputs x1, x2, . . . , xn, inner potential ξ, activation function σ, and output y.) 328
Formal neuron
(Figure: inputs x1, . . . , xn and the special input x0 = 1, weights w1, . . . , wn and w0, inner potential ξ, activation function σ, output y.)
▶ x1, . . . , xn real inputs
▶ x0 special input, always 1
▶ w0, w1, . . . , wn real weights
▶ ξ = w0 + Σ_{i=1}^n wi·xi inner potential; in general, other potentials are considered (e.g., Gaussian), more on this in PV021.
▶ y output defined by y = σ(ξ) where σ is an activation function. We consider several activation functions, e.g., the linear threshold function
σ(ξ) = sgn(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0. 329
Formal Neuron vs Linear Models
▶ If σ is the linear threshold function σ(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0, we obtain a linear classifier.
▶ If σ is the identity, i.e., σ(ξ) = ξ, we obtain a linear (affine) function.
▶ If σ(ξ) = 1/(1 + e^{−ξ}), we obtain logistic regression.
Also, other activation functions are used in neural networks! 330
Activation Functions 331
Multilayer Perceptron (MLP)
(Figure: a layered network with an input layer, hidden layers, and an output layer.)
▶ Neurons are organized in layers (input layer, output layer, possibly several hidden layers)
▶ Layers are numbered from 0; the input layer is the 0-th
▶ Neurons in the ℓ-th layer are connected with all neurons in the (ℓ + 1)-th layer
Intuition: The network computes a function: Assign input values to the input neurons and 0 to the rest. Proceed upwards through the layers, one layer per step. In the ℓ-th step, consider the output values of neurons in the (ℓ − 1)-th layer as inputs to the neurons of the ℓ-th layer. Compute the output values of neurons in the ℓ-th layer. 332
Example
(Figures: step-by-step evaluation of a small two-layer network with linear threshold activations on the inputs (1, 1), (0, 0), (1, 0), and (0, 1); the weights match the XOR network discussed later.)
▶ Activation function: linear threshold σ(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0. 333
Classical Example – ALVINN
▶ One of the first autonomous car driving systems (in the 90s)
▶ ALVINN drives a car
▶ The net has 30 × 32 = 960 input neurons (the input space is R⁹⁶⁰).
▶ The value of each input captures the shade of gray of the corresponding pixel.
▶ Output neurons indicate where to turn (to the center of gravity).
Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html 334
A Bit of History
▶ Perceptron (Rosenblatt et al., 1957)
▶ Single layer (i.e., no hidden layers), the activation function a linear threshold (i.e., a slightly more general linear classifier)
▶ Perceptron learning algorithm
▶ Used to recognize digits
▶ Adaline (Widrow & Hoff, 1960)
▶ Single layer, the activation function the identity (i.e., essentially a linear function)
▶ Online version of the gradient descent
▶ Used a new circuitry element called the memistor, which was able to "remember" the history of current in the form of resistance
In both cases, the expressive power is somewhat limited – they can only express linear decision boundaries and linear (affine) functions. 335
A Bit of History
One of the famous (counter)examples: XOR
(Figure: the points (0, 0) and (1, 1) labeled 0, and the points (0, 1) and (1, 0) labeled 1.)
No perceptron can distinguish between the ones and the zeros. 336
XOR vs Multilayer Perceptron
(Figure: a two-layer network computing XOR; the hidden neurons realize the two half-planes given by P1 and P2. A code check appears below.)
P1: −1 + 2x1 + 2x2 = 0
P2: 3 − 2x1 − 2x2 = 0
(Here, σ is a linear threshold function.)
The output neuron performs an intersection of half-spaces. 337
Boolean functions
Activation function: the unit step function σ(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0.
▶ y = AND(x1, . . . , xn): weights w1 = · · · = wn = 1 and bias w0 = −n
▶ y = OR(x1, . . . , xn): weights w1 = · · · = wn = 1 and bias w0 = −1
▶ y = NOT(x1): weight w1 = −1 and bias w0 = 0 338
Non-linear separation
▶ Consider a three-layer network; each neuron has the unit step activation function.
▶ The network divides the input space in two subspaces according to the output (0 or 1).
▶ The first (hidden) layer divides the input space into half-spaces.
▶ The second layer may, e.g., make intersections of the half-spaces ⇒ convex sets.
▶ The third layer may, e.g., make unions of some convex sets. 339
Example
Consider a triangle T in R² determined by the three vertices (−1, −1), (1, −1), (−1, 2). Give an example of a multilayer perceptron (MLP) with two input neurons and a single output neuron computing the function F : R² → {0, 1} defined as follows:
F(x1, x2) = 1 iff (x1, x2) lies either inside, or on the border of T
All activation functions in the network should be σ(ξ) = sgn(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0.
Homework: Consider F(x1, x2) = 1 iff (x1, x2) lies inside of T (but not on the border). 340 341
Expressive Power of MLP
Cybenko's theorem:
▶ Two-layer networks with a single output neuron and a single layer of hidden neurons (with the logistic sigmoid as the activation function) can
▶ approximate with arbitrarily small error any "reasonable" decision boundary: a given input is classified as 1 iff the output value of the network is ≥ 1/2;
▶ approximate with arbitrarily small error any "reasonable" function from [0, 1]ⁿ to (0, 1).
Here, "reasonable" means that it is pretty tough to find a function that is not reasonable.
So, multilayer perceptrons are sufficiently powerful for any application. But for a long time, at least throughout the 60s and 70s, no efficient method for training multilayer networks was widely known! An efficient way of using the gradient descent was published in 1986! 342
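A minimal check of the XOR construction above. The hidden neurons implement P1 and P2; the output weights (1, 1) with bias −2, i.e., an AND of the two half-spaces, are an assumption consistent with the slide's "intersection of half-spaces":

    def step(z):
        # linear threshold activation: 1 if z >= 0 else 0
        return 1 if z >= 0 else 0

    def xor_net(x1, x2):
        h1 = step(-1 + 2*x1 + 2*x2)   # P1: -1 + 2x1 + 2x2 >= 0
        h2 = step( 3 - 2*x1 - 2*x2)   # P2:  3 - 2x1 - 2x2 >= 0
        return step(h1 + h2 - 2)      # assumed AND output neuron

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(*x))         # prints 0, 1, 1, 0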
MLP – Notation
▶ X set of input neurons
▶ Y set of output neurons
▶ Z set of all neurons (i.e., X, Y ⊆ Z)
▶ Individual neurons are denoted by indices, e.g., i, j.
▶ ξj is the inner potential of the neuron j when the computation is finished.
▶ yj is the output value of the neuron j when the computation is finished. (We formally assume y0 = 1.)
▶ wji is the weight of the arc from the neuron i to the neuron j.
▶ j← is the set of all neurons from which there are edges to j (i.e., j← is the layer directly below j)
▶ j→ is the set of all neurons with edges from j (i.e., j→ is the layer directly above j) 343
MLP – Notation
▶ Inner potential of a neuron j:
ξj = Σ_{i∈j←} wji·yi
▶ The value of a non-input neuron j ∈ Z ∖ X when the computation is finished is
yj = σj(ξj)
Here σj is an activation function of the neuron j. (yj is determined by the weights ⃗w and a given input ⃗x, so it is sometimes written as yj[⃗w](⃗x).)
▶ Fixing the weights of all neurons, the network computes a function F[⃗w] : R^|X| → R^|Y| as follows: Assign the values of a given vector ⃗x ∈ R^|X| to the input neurons, evaluate the network, then F[⃗w](⃗x) is the vector of values of the output neurons. Here, we implicitly assume a fixed ordering on input and output vectors. 344
MLP – Learning
▶ Given a set D of training examples:
D = {(⃗xk, ⃗dk) | k = 1, . . . , p}
Here ⃗xk ∈ R^|X| and ⃗dk ∈ R^|Y|. We write dkj to denote the value in ⃗dk corresponding to the output neuron j.
▶ Error function: E(⃗w), where ⃗w is a vector of all weights in the network. The choice of E depends on the solved task (classification vs regression, etc.).
Example (squared error): E(⃗w) = Σ_{k=1}^p Ek(⃗w) where
Ek(⃗w) = (1/2) Σ_{j∈Y} (yj[⃗w](⃗xk) − dkj)² 345
MLP – Batch Gradient Descent
The algorithm computes a sequence of weights ⃗w(0), ⃗w(1), . . ..
▶ The weights ⃗w(0) are initialized randomly close to 0.
▶ In the step t + 1 (here t = 0, 1, 2, . . .), ⃗w(t+1) is computed as follows:
w(t+1)ji = w(t)ji + Δw(t)ji
where
Δw(t)ji = −ε(t)·∂E/∂wji(⃗w(t))
is the change of the weight wji, and 0 < ε(t) ≤ 1 is the learning rate in the step t + 1.
Note that ∂E/∂wji(⃗w(t)) is a component of ∇E, i.e., the weight change in the step t + 1 can be written as follows: ⃗w(t+1) = ⃗w(t) − ε(t)·∇E(⃗w(t)). 346
Illustration of Gradient Descent – XOR
Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork 347
Stochastic Gradient Descent (SGD)
Assume that E(⃗w) = Σ_{k=1}^p Ek(⃗w) where Ek(⃗w) is the error w.r.t. the single training example (⃗xk, ⃗dk).
▶ The weights ⃗w(0) are randomly initialized to values close to 0.
▶ In the step t + 1 (here t = 0, 1, 2, . . .), the weights ⃗w(t+1) are computed as follows:
▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p}
▶ Compute
⃗w(t+1) = ⃗w(t) + Δ⃗w(t)
where
Δ⃗w(t) = −ε(t)·Σ_{k∈T} ∇Ek(⃗w(t))
▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1
▶ ∇Ek(⃗w(t)) is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. 348
Comments on Training Algorithm
▶ Not guaranteed to converge to zero training error; may converge to a local minimum or oscillate indefinitely.
▶ In practice, it does converge to a low error for many large networks on (big) real data.
▶ Many epochs (thousands) may be required; hours or days of training for large networks.
There are many issues concerning learning efficiency (data normalization, selection of activation functions, weight initialization, learning rate, efficiency of the gradient descent itself, etc.) – see PV021. 349
Overfitting
▶ Due to their expressive power, neural networks are quite sensitive to overfitting.
▶ Keep a hold-out validation set and test the error of the network on this set after every epoch. Stop training when additional epochs increase the validation error (see the sketch below). The validation error can be measured by completely different means than the training error E.
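A minimal sketch of this early-stopping scheme using scikit-learn. The dataset is synthetic and all parameter values are illustrative assumptions; MLPClassifier holds out validation_fraction of the training data and stops when the validation score has not improved for n_iter_no_change epochs:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlp = MLPClassifier(hidden_layer_sizes=(32,),
                        early_stopping=True,       # keep a hold-out validation set
                        validation_fraction=0.1,   # 10% of the training data
                        n_iter_no_change=10,       # patience in epochs
                        max_iter=1000, random_state=0)
    mlp.fit(X_train, y_train)
    print(mlp.n_iter_, mlp.score(X_test, y_test))  # epochs used, test accuracy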
350
Hidden Neurons Representations
Trained hidden neurons can be seen as newly constructed features. E.g., in a two-layer network used for classification, the hidden layer transforms the input so that important features become explicit (and hence the result may become linearly separable).
Consider a two-layer MLP, 64-2-3, for classifying letters (three output neurons, each corresponding to one of the letters). 351
Optimal Architecture?
▶ For MLP: Too few hidden neurons prevent the network from adequately fitting the data. Too many hidden units can result in overfitting. (There are advanced methods that prevent overfitting even for rich models, such as regularization, where the error function penalizes overfitting – see PV021.)
▶ There are (almost) infinitely many types of architectures of neural networks (convolutional, recurrent, transformers, adversarial, etc.) suitable for various tasks.
▶ Transfer learning: Start with a known solution to a related problem. Simplified view: Preserve the lower parts of the network trained to solve the related problem (feature extractors). Add your own top part and then train only the new top part (or train the whole network, but carefully). 352
How to Choose Activation Functions & Error
▶ Hidden neurons: "almost" linear activations such as (leaky) ReLU (y = max(0, ξ)). Better than sigmoidal functions, which saturate more often.
▶ Output neurons, single output:
▶ Regression: typically "linear" output, i.e., no activation on the output neuron.
▶ Binary classification: the logistic sigmoid y = 1/(1 + e^{−ξ})
▶ Error, single output:
▶ Regression: (mean) squared error
▶ Binary classification: binary cross-entropy
For multiple outputs and classification, use the softmax output and cross-entropy. 353
Applications
▶ Image recognition, segmentation, etc.
▶ Machine translation and other text processing
▶ Text generation, image generation, movie generation, theatre play generation
▶ Text to speech and vice versa
▶ Finance, business predictions, fraud detection
▶ Game playing (backgammon is a classic example, AlphaGo is the famous one, computer games are the big ones, bridge is the hard one)
▶ (artificial brain and intelligence)
▶ ...
Text and image processing are possibly the most advanced deep learning applications. 354
ALVINN 355
ALVINN
▶ Two-layer MLP, 960 − 4 − 30 (sometimes 960 − 5 − 30)
▶ Inputs correspond to pixels.
▶ Sigmoidal activation function (logistic sigmoid).
▶ Direction corresponds to the center of gravity, i.e., output neurons are considered as points of mass evenly distributed along a line. The weight of each neuron corresponds to its value. 356
ALVINN – Training
Trained while driving.
▶ A camera captured the road from the front window, approx. 25 pictures per second.
▶ Training examples (⃗xk, ⃗dk) where
▶ ⃗xk = the image of the road
▶ ⃗dk ≈ the corresponding direction of the steering wheel set by the driver
▶ The values ⃗dk were computed using a Gaussian distribution:
dki = e^{−Di²/10}
where Di is the distance of the i-th output from the one corresponding to the real direction of the steering wheel. (The authors claimed this approach is better than a binary output because similar road directions induce similar driver reactions.) 357
Selection of Training Examples
Naive approach: just take images from the camera.
Problems:
▶ A too-good driver never teaches the network how to solve deviations from the right track. A couple of harsh solutions:
▶ turn the learning off momentarily, deviate from the right track, then turn the learning on and let the network learn how to solve the situation.
▶ let the driver go crazy! (a bit dangerous, expensive, unreliable)
▶ Images are very similar (the network sees the road from the right lane) – overfitting. 358
Selection of Training Examples
Problems with the too-good driver were solved as follows:
▶ every image of the road has been transformed into 15 slightly different copies.
The repetitiveness of images was solved as follows:
▶ the system has a buffer of 200 images (including the 15 copies of the current one); in every round, it trains on these images;
▶ afterward, a new image is captured, 15 copies are made, and these new 15 substitute 15 images selected from the buffer (10 with the smallest training error, five randomly). 359
ALVINN – Training
▶ gradient descent
▶ constant learning rate (possibly different for each neuron – see PV021)
▶ some other optimizations (see PV021)
The result:
▶ Training took 5 minutes, and the speed was 4 miles per hour. (The speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm.)
▶ ALVINN was able to go through roads it had never "seen" and in different weather. 360
ALVINN – Weight Learning
(Figures: visualizations of the weights of the hidden neurons h1, . . . , h5 after rounds 0, 10, 20, and 50.) 361
Backpropagation 362
How to Compute the Gradient?
To implement a single step of the gradient descent, we need to compute the partial derivatives ∂E/∂wij of E w.r.t. all weights wij.
To simplify, consider for now a restricted case:
▶ A single-element training set D = {(x, d)} where x ∈ R and d ∈ R are numbers (i.e., not vectors as in the general case).
▶ The error function is the squared error.
▶ Assume that 1 is the output neuron,
▶ which means that y1 = y1[⃗w](x) is the output of the network with weights ⃗w and the input x,
▶ and the error on D is then E(⃗w) = (1/2)(y1 − d)². 363
(Figure: a chain of neurons connected by the weights . . . , w23, w12, ending in the output neuron 1.)
∂E/∂w12 = (∂E/∂y1)·(∂y1/∂w12) = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w12) = (∂E/∂y1)·σ′1(ξ1)·y2
Here σ′1 is just the plain derivative of σ1 as a function of a single variable.
∂E/∂y1 = y1 − d
Note that if σ1 is the identity, we obtain exactly the gradient from the linear regression method. Considering σ1 equal to the logistic sigmoid and E the cross-entropy, we get the logistic regression gradient. 364
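Before pushing the chain rule deeper into the network, here is a quick numerical sanity check of the formula just derived, with the logistic sigmoid as σ1 (the concrete numbers w12, y2, d are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w12, y2, d = 0.7, 1.5, 1.0   # toy weight, input to neuron 1, target

    def error(w):
        y1 = sigmoid(w * y2)     # xi1 = w12 * y2, y1 = sigma1(xi1)
        return 0.5 * (y1 - d) ** 2

    # backprop formula: dE/dw12 = (y1 - d) * sigma1'(xi1) * y2
    y1 = sigmoid(w12 * y2)
    analytic = (y1 - d) * y1 * (1 - y1) * y2

    # central finite difference for comparison
    eps = 1e-6
    numeric = (error(w12 + eps) - error(w12 - eps)) / (2 * eps)
    print(analytic, numeric)     # the two values agree to ~1e-10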
Going one layer deeper in the chain:
∂E/∂w12 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w12) = (∂E/∂y1)·σ′1(ξ1)·y2
∂E/∂w23 = (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂w23) = (∂E/∂y2)·σ′2(ξ2)·y3
∂E/∂y1 = y1 − d
∂E/∂y2 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂y2) = (∂E/∂y1)·σ′1(ξ1)·w12 365
And one more layer:
∂E/∂w34 = (∂E/∂y3)·(∂y3/∂ξ3)·(∂ξ3/∂w34) = (∂E/∂y3)·σ′3(ξ3)·y4
∂E/∂y3 = (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂y3) = (∂E/∂y2)·σ′2(ξ2)·w23 366
Now consider a network that is not a chain (figure: neuron 5 feeds neuron 4, neuron 4 feeds neurons 2 and 3, and both 2 and 3 feed the output neuron 1):
∂E/∂w12 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w12) = (∂E/∂y1)·σ′1(ξ1)·y2
∂E/∂w13 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂w13) = (∂E/∂y1)·σ′1(ξ1)·y3
∂E/∂w24 = (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂w24) = (∂E/∂y2)·σ′2(ξ2)·y4
∂E/∂w34 = (∂E/∂y3)·(∂y3/∂ξ3)·(∂ξ3/∂w34) = (∂E/∂y3)·σ′3(ξ3)·y4
∂E/∂w45 = (∂E/∂y4)·(∂y4/∂ξ4)·(∂ξ4/∂w45) = (∂E/∂y4)·σ′4(ξ4)·y5 367
∂E/∂y1 = y1 − d
∂E/∂y2 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂y2) = (∂E/∂y1)·σ′1(ξ1)·w12
∂E/∂y3 = (∂E/∂y1)·(∂y1/∂ξ1)·(∂ξ1/∂y3) = (∂E/∂y1)·σ′1(ξ1)·w13
For the neuron 4, which has two successors, both paths to the output contribute:
∂E/∂y4 = (∂E/∂y2)·(∂y2/∂y4) + (∂E/∂y3)·(∂y3/∂y4)
= (∂E/∂y2)·(∂y2/∂ξ2)·(∂ξ2/∂y4) + (∂E/∂y3)·(∂y3/∂ξ3)·(∂ξ3/∂y4)
= (∂E/∂y2)·σ′2(ξ2)·w24 + (∂E/∂y3)·σ′3(ξ3)·w34 368
MLP – Gradient Computation
Under our simplifying assumptions D = {(x, d)} and E = (1/2)(y1 − d)², the gradient computation proceeds as follows. Applying the chain rule, we obtain
∂Ek/∂wji = (∂Ek/∂yj)·σ′j(ξj)·yi
where (after more applications of the chain rule)
∂Ek/∂y1 = y1 − d
(keep in mind that 1 is the only output neuron, which means that y1 is the value of the network), and
∂Ek/∂yj = Σ_{r∈j→} (∂Ek/∂yr)·σ′r(ξr)·wrj for j ∈ Z ∖ (Y ∪ X)
Here yr = yr[⃗w](x), where ⃗w are the current weights and x is the example input. 369
MLP – Gradient Computation – General!
Let us drop our simplifying assumptions!
▶ Given a set D of training examples:
D = {(⃗xk, ⃗dk) | k = 1, . . . , p}
Here ⃗xk ∈ R^|X| and ⃗dk ∈ R^|Y|. We write dkj to denote the value in ⃗dk corresponding to the output neuron j.
▶ Error function: E(⃗w), where ⃗w is a vector of all weights in the network. The choice of E depends on the solved task (classification vs regression, etc.).
Example (squared error): E(⃗w) = Σ_{k=1}^p Ek(⃗w) where
Ek(⃗w) = (1/2) Σ_{j∈Y} (yj[⃗w](⃗xk) − dkj)² 370
MLP – Gradient Computation
For every weight wji we have (obviously)
∂E/∂wji = Σ_{k=1}^p ∂Ek/∂wji
So now it suffices to compute ∂Ek/∂wji, that is, the derivative of the error for a fixed training example (⃗xk, ⃗dk).
Applying the chain rule, we obtain
∂Ek/∂wji = (∂Ek/∂yj)·σ′j(ξj)·yi
where (after more applications of the chain rule)
∂Ek/∂yj is computed directly for the output neurons j ∈ Y, and
∂Ek/∂yj = Σ_{r∈j→} (∂Ek/∂yr)·σ′r(ξr)·wrj for j ∈ Z ∖ (Y ∪ X)
(Here yr = yr[⃗w](⃗xk), where ⃗w are the current weights and ⃗xk is the input of the k-th training example.) 371
MLP – Backpropagation
Input: A training set D = {(⃗xk, ⃗dk) | k = 1, . . . , p} and the current vector of weights ⃗w.
Note that the backpropagation is repeated in every iteration of the gradient descent!
▶ Evaluate all values yi of neurons using the standard bottom-up procedure with the input ⃗xk.
▶ For every training example (⃗xk, ⃗dk), compute ∂Ek/∂yj using backpropagation through the layers, top-down:
▶ For all j ∈ Y, compute ∂Ek/∂yj by taking the derivative of the error; e.g., in the case of the squared error, we have ∂Ek/∂yj = yj − dkj.
▶ In the layer ℓ, assuming that ∂Ek/∂yr has been computed for all neurons r in the layer ℓ + 1, compute
∂Ek/∂yj = Σ_{r∈j→} (∂Ek/∂yr)·σ′r(ξr)·wrj
for all j from the ℓ-th layer. Here σ′r is the derivative of σr.
▶ Put ∂Ek/∂wji = (∂Ek/∂yj)·σ′j(ξj)·yi
Output: ∂E/∂wji = Σ_{k=1}^p ∂Ek/∂wji. 372
MLP Learning Example
Training set: D = {(x, d)} = {(1, 1)}. That is, y3 = x = 1 and d = 1.
(The network is a chain: input neuron 3, hidden neuron 2 with the activation σ2(ξ) = max{0, ξ}, output neuron 1 with the logistic sigmoid σ1.)
Error – cross-entropy:
E(⃗w) = −(d·log(y1) + (1 − d)·log(1 − y1)) = −log(y1)
Assume the initial weight vector ⃗w(0) = (w(0)12, w(0)23) = (1/4, 2). Consider the learning rate ε = 0.5. 373
MLP Learning Example – Gradient Descent
To make the gradient descent step:
w(1)12 = w(0)12 − ε·∂E/∂w12(⃗w(0))
w(1)23 = w(0)23 − ε·∂E/∂w23(⃗w(0))
we need to compute the partial derivatives ∂E/∂w12 and ∂E/∂w23. 374
MLP Learning Example – Forward Pass
We have x = 1, w(0)12 = 1/4, w(0)23 = 2.
First, compute the forward pass:
y3 = x = 1
ξ2 = w(0)23·y3 = 2·y3 = 2
y2 = max{0, ξ2} = 2
ξ1 = w(0)12·y2 = (1/4)·2 = 1/2
y1 = 1/(1 + e^{−ξ1}) = 1/(1 + e^{−1/2}) = 0.6225 375
MLP Learning Example – Backward Pass
We have w(0)12 = 1/4, w(0)23 = 2, y1 = 0.6225, y2 = 2, y3 = 1.
Proceed with the backward pass:
∂E/∂y1 = ∂(−log(y1))/∂y1 = −1/y1 = −1.6065
Since σ′1 = σ1(1 − σ1):
∂E/∂y2 = (∂E/∂y1)·σ′1(ξ1)·w(0)12 = (∂E/∂y1)·σ1(ξ1)(1 − σ1(ξ1))·w(0)12 = (∂E/∂y1)·y1(1 − y1)·w(0)12
= −1.6065 · 0.6225 · 0.3775 · (1/4) = −0.09438 376
MLP Learning Example – The Gradient
We have w(0)12 = 1/4, w(0)23 = 2, y1 = 0.6225, y2 = 2, y3 = 1, ∂E/∂y1 = −1.6065, ∂E/∂y2 = −0.09438.
Compute the derivatives of E w.r.t. the weights:
∂E/∂w12 = (∂E/∂y1)·σ′1(ξ1)·y2 = (∂E/∂y1)·y1(1 − y1)·y2 = −0.755
∂E/∂w23 = (∂E/∂y2)·σ′2(ξ2)·y3 = (∂E/∂y2)·y3 = −0.09438
(σ′2(ξ2) = 1 because ξ2 = 2 > 0 and σ2 is the ReLU.) 377
Backpropagation – Example – Summary
Note that WE HAVE NOT YET CHANGED ANY WEIGHTS! 378
MLP Learning Example – Gradient Descent Step
So ONLY NOW can we make the learning step and change the weights:
w(1)12 = w(0)12 − ε·∂E/∂w12(⃗w(0)) = 1/4 − 0.5·(−0.755) = 0.6275
w(1)23 = w(0)23 − ε·∂E/∂w23(⃗w(0)) = 2 − 0.5·(−0.09438) = 2.047
We have made just a single gradient descent step! 379
MLP Training Summary
▶ MLPs are trained using the gradient descent algorithm. In practice, modifications of GD are typically used, but most of them have strong roots in GD.
▶ The gradients are computed using the backpropagation algorithm; backpropagation is a universal method for automatic differentiation whose use reaches far beyond neural networks.
▶ Training of neural networks in practice is tricky for many reasons, in particular:
▶ the highly complex non-linear shape of the error function,
▶ the tendency to overfit very quickly,
▶ the black-box nature – it is hard to see what the network does,
▶ the huge hype around deep learning (which many people confuse with AI) results in high expectations even in cases where no (learning) algorithm may solve the given problem!
A piece of advice: Always concentrate the main effort on the formulation of the solved problem and, afterward, on the data you have at your disposal (and honestly separate the Test set right at the beginning). 380
Deep Learning
▶ Cybenko's theorem shows that two-layer networks are omnipotent – such results nearly killed NNs when support vector machines were found to be easier to train in the 2000s.
▶ Later, it was shown (experimentally) that deep networks (with many layers) have better representational properties.
▶ ... but how to train them? The gradient descent suffers from the so-called vanishing gradient; intuitively, the updates of weights in the lower layers are very slow.
▶ In 2006, a solution was found by Hinton et al.:
▶ Use unsupervised methods to initialize the weights layer by layer to capture important data features. More precisely: The lowest hidden layer learns patterns in the data, the second lowest learns patterns in the data transformed through the first layer, and so on.
▶ Then use a supervised learning algorithm to only fine-tune the weights to the desired input-output behavior.
▶ ... but the actual revolution started with convolutional networks trained on several GPUs. 381
Convolutional network 382
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
ImageNet database (16,000,000 color images, 20,000 categories) 383
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
Competition in classification over a subset of images from ImageNet. In 2012, the training set contained 1,200,000 images in 1000 categories. Validation set: 50,000 images. Test set: 150,000 images.
Many images contain several objects → the typical rule is top-5: the true category must be among the five categories to which the net assigns the highest probability. 384
KSH net
ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012). Trained on two GPUs (NVIDIA GeForce GTX 580).
Results:
▶ Accuracy 84.7% in top-5 (the second best algorithm at the time: 73.8%)
▶ 63.3% in "perfect" classification (top-1) 385
ILSVRC 2014
The same set of images as in 2012, top-5 criterion.
GoogLeNet: a deep convolutional net, 22 layers.
Results:
▶ 93.33% in top-5
Superhuman power? 386
Superhuman GoogLeNet?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000 categories quickly turned out to be highly challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First, we thought we would put it up on [Amazon Mechanical Turk]. Then, we thought we could recruit paid undergrads. Then, I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then, I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end, I realized that to get anywhere competitively close to GoogLeNet, it would be most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself.
The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some pictures (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... In the end, my error turned out to be 5.1%. 387
ILSVRC 2015
▶ Microsoft network ResNet: 152 layers, complex architecture
▶ Trained on 8 GPUs
▶ 96.43% accuracy in top-5 388
ILSVRC 389
ILSVRC 2016
Trimps-Soushen (The Third Research Institute of the Ministry of Public Security)
There is no new innovative technology or novelty by Trimps-Soushen: an ensemble of the pre-trained models from previous years. Each model is strong at classifying some categories but weak at categorizing others.
Test error: 2.99% 390
Top-k accuracy analyzed
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd 391
Top-20 typical errors
Out of 1458 misclassified images in Top-20:
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd 392
Top-k accuracy analyzed
(A series of figures from the same source.)
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd 393–396
Anomaly Detection 397
What is an Anomaly?
Anomalies are hard to define in general. Attempts at a generic definition include:
▶ Observations that seem to be generated by a different mechanism.
▶ Data instances that occur rarely and whose features differ significantly from most of the data.
▶ Outliers – observations that appear to be inconsistent with (far away from) the remainder of the data. However, this does not cover inliers, that is, anomalous data that lie within the normal data distribution (say, a reported 0 among measurement results that are distributed around 0 but are very likely non-zero).
▶ Patterns in data that do not conform to a well-defined notion of normal behavior(??)
This lecture presents algorithms detecting specific data instances that may be anomalous w.r.t. the used algorithm. It is up to the context-knowing user to decide whether such data are anomalous. 398
What is Anomaly Detection Good For?
There are two main sources of motivation for anomaly detection in machine learning:
▶ Modeling perspective: Many models are sensitive to anomalous data.
▶ Application perspective: Many tasks are based on searching for anomalies in data. 399
Modeling Perspective
Recall the behavior of the linear model. To train such a model properly, the outliers should be removed.
On the other hand, we will see that the sensitivity of some models can be used to detect the anomalies. 400
Applications Perspective
▶ Fraud Detection
▶ Analysis of purchasing behavior to detect credit card theft.
▶ Identification of theft through uncharacteristic buying patterns and behavioral changes.
▶ Intrusion Detection
▶ Monitoring for attacks on computer systems and networks.
▶ Many attacks exhibit themselves as anomalous behavior.
▶ Ecosystem Disturbances
▶ Detect atypical natural-world events.
▶ Prediction and understanding of hurricanes, floods, droughts, heat waves, and fires.
▶ Medicine
▶ Unusual patient symptoms or test results may indicate health problems.
▶ Balancing the need for further tests with the potential costs and risks.
▶ ... 401
Causes of Anomalies in Data Preprocessing
Anomaly detection is a standard step in data preprocessing. We typically search for anomalies from the following sources:
▶ Objects from different classes: a pear among apples.
▶ Natural variation: an abnormally tall person – not from a different class (humans) but in the sense of an extreme characteristic.
▶ Data measurement and collection errors:
▶ a thermometer positioned in direct sunlight (possibly a measurement error),
▶ a 40 kg infant is not usual among humans (possibly a data collection error).
The anomalies mentioned above should be inspected by domain experts and possibly corrected/removed. 402
Approaches to Anomaly Detection
▶ Model-based techniques:
▶ Build a model of the data.
▶ Anomalies should not fit the model very well. For example, using a clustering model, an anomaly may lie far away from the larger clusters.
▶ Alternatively, removing anomalies should have the strongest impact on the model parameters (more on this later).
▶ Proximity-based techniques: Given a distance between objects, anomalies are objects distant from the others.
▶ Density-based techniques: Estimate the density distribution of the objects. Anomalies will probably lie outside of the dense regions.
I would count the most recent deep learning-based anomaly detection among the model-based techniques, even though a combination of the above approaches is usually used.
We will make the above ideas more precise in the rest of this lecture. 403
Anomaly Detection Learning
There are three approaches to anomaly detection based on the available information about the data:
▶ Supervised: Given a training set distinguishing normal and anomalous data, we may train a supervised learning classifier. As anomalies are rare, approaches based on supervised learning are rare as well.
▶ Semi-supervised:
▶ We may know which instances are normal. A tumor detection system trained on healthy people detects tumors as anomalies. In this case, we detect anomalies that are dissimilar to the normal instances. We may train a model representing the normal instances and detect instances that do not fit the model.
▶ We may have information about (some) normal and some anomalous instances. Here, we can use methods for semi-supervised classification.
▶ Unsupervised: Create a model of all instances and hope that anomalous instances will still be dissimilar to the instances typical for the model. 404
Issues
Task-specific issues comprise:
▶ Single- vs. multi-attribute anomaly: The question is whether an object is anomalous due to a single attribute (a 200 kg person) or a combination of attributes (a 100 kg person of 1 meter height). This is especially important when the data is high-dimensional. For example, in genetic data, every sample is an anomaly(?)
▶ Global vs. local perspective: An object may be anomalous w.r.t. all objects but not w.r.t. the objects in its local neighborhood (a 210 cm tall basketball player).
▶ Degree of anomaly: Usually, anomalies are reported in a binary fashion. However, we often want to measure how anomalous an object is, i.e., how dissimilar it is to normal objects.
Another issue is evaluation: How can we determine how good an anomaly detector is? In the supervised case, we have a ground truth, but usually, we just observe the detected anomalies.
405
Various Approaches to Anomaly Detection
We shall have a look at five approaches to unsupervised anomaly detection:
▶ Statistical
▶ Proximity-based
▶ Density-based
▶ Cluster-based
▶ Autoencoders
The above approaches are also used in the semi-supervised setting, where we have a dataset of normal instances. However, some issues discussed further will not appear in this case. 406
Statistical
Definition 1: An outlier is an object with a low probability in the probability distribution of the data.
Suppose we knew the "true" probability distribution P on objects. In that case, we might set up a threshold t (using an appropriate training set) and say that every object A with P(A) < t is anomalous.
However, in most cases, we do not know P and have to resort to a model of P created using a dataset of feature vectors.
There are many more or less sophisticated tests based on models of P in the literature. Some use rather advanced statistical methods. Let us have a look at a few simple examples. 407
Box Plot
A simple method for outlier detection in the univariate case, that is, for values of a single attribute. 408
Univariate Gaussian Model
Consider a numeric attribute whose values x are normally distributed with mean µ and standard deviation σ. We write N(µ, σ²).
Given an attribute value x, we compute the z-score:
z = (x − µ)/σ
Then z has the distribution N(0, 1).
Now, choose c so that the probability P(|z| ≥ c) is small enough for z to be considered an outlier w.r.t. N(0, 1).
Given an attribute value x, we may decide whether x is an outlier by deciding whether |z| = |(x − µ)/σ| ≥ c. 409
Univariate Gaussian
Problem 1: The attribute does not have a normal distribution.
Before starting, use a normality test (observe the histogram, use a specialized test such as Shapiro-Wilk). Some transformations may sometimes succeed in normalizing an attribute (the Box-Cox transformation, etc.).
Different distributions can be used to model the attribute (Log-Normal, Weibull, etc.), but be aware of the assumptions of the anomaly detection tests! 410
Univariate Gaussian
Problem 2: We typically do not know the population mean µ and the population variance σ². They can be estimated from a dataset by the sample mean and the sample variance:
µ̄ = (1/p) Σ_{i=1}^p xi
s² = (1/(p − 1)) Σ_{i=1}^p (xi − µ̄)²
However, then z = (x − µ̄)/s no longer has the distribution N(0, 1). The distribution of z is normal if x is sampled independently of the data yielding µ̄ and s².
Note that µ̄ and s² are unbiased estimates of µ and σ². Also, µ̄ converges to µ and s² converges to σ² with growing sample size. 411
Univariate Gaussian Model
Problem 3: The outliers distort the estimates µ̄ and s² of the mean and the standard deviation. For example, a millionaire would not look like an outlier in a group containing a billionaire.
There is a circularity here: To get the outliers using the normal distribution model, we need to remove the outliers to have a good model.
One possible heuristic is to remove the outliers one by one, starting from the most extreme, hoping that the most extreme outlier will be detected in every step. One such heuristic is Grubbs' test. 412
Grubbs' Test
Consider a dataset D = {x1, . . . , xp} of values of a normally distributed attribute and choose α > 0.
The Grubbs statistic is defined by
G = max_{i=1,...,p} |xi − x̄| / s
Here x̄ is the sample mean and s the sample standard deviation of D.
Now P(G ≥ c_{α,p}) = α if
c_{α,p} = ((p − 1)/√p) · √(t² / (p − 2 + t²))
Here t is the value satisfying P(T ≥ t) = α/(2p) for T with the t-distribution with p − 2 degrees of freedom.
Historically, t was found using tables.
Grubbs' test simply iteratively tests whether G ≥ c_{α,p} and, if yes, removes an xi maximizing |xi − x̄| from D (a code sketch follows at the end of this part). 413
Multivariate Gaussian Model
Univariate tests cannot find all anomalies if the anomaly depends on combinations of particular attribute values.
The univariate Gaussian approach can be generalized to the multivariate case by considering the multivariate Gaussian distribution.
Mahalanobis distance:
mahalanobis(⃗x, ⃗z) = (⃗x − ⃗z) Σ⁻¹ (⃗x − ⃗z)ᵀ
Here Σ is the covariance matrix of the data.
Intuitively, the Mahalanobis distance generalizes the squared Euclidean distance (⃗x − ⃗z)(⃗x − ⃗z)ᵀ by taking into account the "spread" of the data. 414
Mahalanobis Distance
Here, we assume a multivariate normal data distribution. The covariance matrix is
Σ = ( 1.00  0.75
      0.75  3.00 ) 415
Mahalanobis Distance
Notice that A has a larger Mahalanobis distance from the cluster's center than B, even though it is closer in the Euclidean distance. The reason is that the density function falls more rapidly in the direction of A than in the direction of B. 416
Likelihood-Based Detection
Assume that the dataset D contains objects from a mixture of two probability distributions:
▶ M, the distribution of the majority of normal objects,
▶ A, the distribution of anomalous objects.
Let us choose α > 0, indicating how rare the anomalies should be.
Let us assume that we have models of M and A allowing us to compute PM(x) and PA(x), the likelihoods of generating x. A is often considered uniform, and M is learned from the data (yes, distorted by the outliers).
Now, we compute the total likelihood of the dataset before and after removing each element. If the likelihood changes significantly, we have probably eliminated an anomaly. 417
Given a partition U, V of D, we define
L(U, V) = [(1 − α)^|U| · Π_{xi∈U} PM(xi)] · [α^|V| · Π_{xi∈V} PA(xi)]
This is the likelihood of generating D using the following random process: Generate x1, x2, . . . , xp sequentially and independently, where each xi is generated as follows:
▶ Randomly choose between normal and anomaly, where the probability of choosing anomaly is α.
▶ If normal has been chosen, choose xi from the distribution M.
▶ If anomaly has been chosen, choose xi from the distribution A.
To eliminate the "large" product, we consider the log-likelihood:
LL(U, V) = log(L(U, V)) = |U|·log(1 − α) + Σ_{xi∈U} log PM(xi) + |V|·log α + Σ_{xi∈V} log PA(xi) 418
Likelihood-Based Anomaly Detection
Start with the sets M0 = D and A0 = ∅. The algorithm computes M1, M2, . . . and A1, A2, . . . by moving elements from Mk to Ak as follows:
▶ For every i = 1, . . . , p such that xi ∈ Mk, compute
Δi = |LL(Mk, Ak) − LL(Mk ∖ {xi}, Ak ∪ {xi})|
▶ Consider i maximizing Δi. If Δi ≥ c, then
Mk+1 = Mk ∖ {xi}, Ak+1 = Ak ∪ {xi}
Else, stop.
Here, c is a threshold we must set up. Ultimately, Ak will contain the anomalies the algorithm detects.
Note that we may also move all anomalies detected in a single iteration from Mk to Ak. This would result in a different (also valid) method. 419
Summary of Statistical Anomaly Detection
▶ Builds on strong statistical foundations.
▶ The tests are very effective if the dataset is sufficiently large (informative).
▶ There are lots of tests for univariate data; the area has been developing for a long time.
▶ There are fewer options for multivariate data, and the approach is problematic for highly dimensional data (the curse of dimensionality).
The likelihood-based approach does not assume a particular distribution shape: It can be used with arbitrary models of PM and PA, including deep learning ones.
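A minimal sketch of the iterative Grubbs procedure described above, using scipy for the t-distribution quantile (the sample data and the value of α are made-up illustration values):

    import numpy as np
    from scipy import stats

    def grubbs_outliers(data, alpha=0.05):
        # repeatedly test G >= c_{alpha,p} and remove the most extreme value
        x, outliers = list(data), []
        while len(x) > 2:
            p = len(x)
            mean, s = np.mean(x), np.std(x, ddof=1)  # sample mean and std
            i = int(np.argmax(np.abs(np.array(x) - mean)))
            G = abs(x[i] - mean) / s
            # t such that P(T >= t) = alpha/(2p), T ~ t-dist., p-2 deg. of freedom
            t = stats.t.ppf(1 - alpha / (2 * p), df=p - 2)
            c = (p - 1) / np.sqrt(p) * np.sqrt(t**2 / (p - 2 + t**2))
            if G >= c:
                outliers.append(x.pop(i))
            else:
                break
        return x, outliers

    data = [2.1, 2.3, 1.9, 2.0, 2.2, 2.4, 9.8]   # made-up sample
    print(grubbs_outliers(data))                 # detects 9.8 as an outlier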
416 Likelihood-Based Detection
Assume that the dataset D contains objects from a mixture of two probability distributions:
▶ M, the distribution of the majority of normal objects,
▶ A, the distribution of anomalous objects.
Let us choose α > 0, indicating how rare the anomalies should be. Assume that we have models of M and A allowing us to compute PM(x) and PA(x), the likelihoods of generating x. A is often considered uniform, and M is learned from the data (yes, distorted by the outliers).
Now, we compute the total likelihood of the dataset before and after removing each element. If the likelihood changes significantly, we have probably eliminated an anomaly.

417 Given a partition U, V of D, we define

$$L(U, V) = \left((1-\alpha)^{|U|}\prod_{x_i \in U} P_M(x_i)\right)\left(\alpha^{|V|}\prod_{x_i \in V} P_A(x_i)\right)$$

This is the likelihood of generating D using the following random process: generate x1, x2, . . . , xp sequentially and independently, where each xi is generated as follows:
▶ Randomly choose between normal and anomaly, where the probability of choosing anomaly is α.
▶ If normal has been chosen, choose xi from the distribution M.
▶ If anomaly has been chosen, choose xi from the distribution A.
To eliminate the "large" product, we consider the log-likelihood:

$$LL(U, V) = \log L(U, V) = |U|\log(1-\alpha) + \sum_{x_i \in U}\log P_M(x_i) + |V|\log\alpha + \sum_{x_i \in V}\log P_A(x_i)$$

418 Likelihood-Based Anomaly Detection
Start with sets M0 = D and A0 = ∅. The algorithm computes M1, M2, . . . and A1, A2, . . . by moving elements from Mk to Ak as follows:
▶ For every i = 1, . . . , p such that xi ∈ Mk, compute $\Delta_i = |LL(M_k, A_k) - LL(M_k \setminus \{x_i\}, A_k \cup \{x_i\})|$
▶ Consider the i maximizing ∆i. If ∆i ≥ c, then set $M_{k+1} = M_k \setminus \{x_i\}$ and $A_{k+1} = A_k \cup \{x_i\}$. Else, stop.
Here, c is a threshold we must set up. Ultimately, Ak contains the anomalies detected by the algorithm.
Note that we may also move all anomalies detected in a single iteration from Mk to Ak at once. This would result in a different (also valid) method.

419 Summary of Statistical Anomaly Detection
▶ Builds on strong statistical foundations.
▶ Tests are very effective if the dataset is sufficiently large (informative).
▶ Lots of tests exist for univariate data; the area has been developing for a long time.
▶ There are fewer options for multivariate data, and high-dimensional data are problematic (curse of dimensionality).
The likelihood-based approach does not assume a particular distribution shape: it can be used with arbitrary models of PM and PA, including deep learning ones.

420 Proximity-Based Methods

421 Proximity-Based Outlier Detection
Assume a distance measure d. That is, given two feature vectors x⃗, z⃗, their distance is d(x⃗, z⃗). We consider the Euclidean distance for simplicity.
Definition 2 The outlier score of an object is given by the distance to its k-th nearest neighbor.
A threshold on the minimum distance of an outlier can be set on a training set.

422–423 Outlier Score
(Figures omitted: the outlier score based on the distance to the fifth nearest neighbor on several datasets and, for contrast, based on the distance to the first nearest neighbor.)

423 Proximity-Based Approaches
▶ Conceptually simple and easy to implement (see the sketch below).
▶ Time complexity is typically O(p²), where p is the number of feature vectors in the dataset.
▶ Sensitive to the choice of parameters (the number of neighbors).
▶ Density/compactness is not explicitly taken into account.
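A minimal scikit-learn sketch of the k-th-nearest-neighbor outlier score; the toy dataset, the choice k = 5, and the 95th-percentile threshold are illustrative assumptions, not prescribed values.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Outlier score = Euclidean distance to the k-th nearest neighbor."""
    # ask for k + 1 neighbors, because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, -1]                      # distance to the k-th neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),  # a normal cluster
               [[8.0, 8.0]]])                        # one planted outlier
scores = knn_outlier_scores(X, k=5)
threshold = np.percentile(scores, 95)            # threshold set on "training" data
print(X[scores > threshold])                     # the planted outlier should appear
```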
424 Density-Based Methods

425 Density-Based Approaches
Definition 3 The outlier score of an object is the inverse of the density around the object.
We consider the Local Outlier Factor (LOF) method: the local density of an object corresponds to the inverse of the average distance to its k-nearest neighbors. Compute the LOF score of an object A by
▶ computing the local density DA of A,
▶ computing the average D̄ of the local densities of the k-nearest neighbors of A,
▶ dividing the average local density by the local density of A, that is, computing D̄/DA.
The question is, how exactly should the local density be defined?

426 LOF - Illustration
Here, the density around A is smaller than the densities around the other points (figure omitted).

427 LOF - Formally
Let us fix k ∈ N. Define k-distance(A) as the distance from A to its k-th nearest neighbor. Denote by Nk(A) the set of all objects within distance k-distance(A) from A. Note that if several objects are at the same distance from A, Nk(A) may have more than k elements.
Define

$$\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\, d(A, B)\}$$

The reachability distance of A from B is the distance between A and B, but at least the distance from B to its k-th nearest neighbor.

428 Reachability Distance
(Figure omitted.)

429 LOF
Define the local reachability density

$$\mathrm{lrd}_k(A) = \left(\frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|}\right)^{-1}$$

That is the reciprocal of the average reachability distance of A from its k-nearest neighbors. We define the local outlier factor of an object A by

$$\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{|N_k(A)|}\,\Big/\,\mathrm{lrd}_k(A)$$

Now
▶ LOFk(A) ≈ 1 – A has a density similar to that of its neighbors
▶ LOFk(A) < 1 – A has a higher local density than its neighbors (inlier?)
▶ LOFk(A) > 1 – A has a smaller local density than its neighbors (outlier?)

430–431 Example
Consider a dataset D = {A, B, C, D} = {5, 2, 1, 0} with k = 2 and the distance d(U, V) = |U − V|.

Nk(A) = {B, C}
     k-distance(·)   d(A, ·)   reach-dist_k(A, ·)
B    2               3         3
C    1               4         4
lrd_k(A) = (½ (3 + 4))⁻¹ = 2/7

Nk(B) = {C, D}
     k-distance(·)   d(B, ·)   reach-dist_k(B, ·)
C    1               1         1
D    2               2         2
lrd_k(B) = (½ (1 + 2))⁻¹ = 2/3

Nk(C) = {B, D}
     k-distance(·)   d(C, ·)   reach-dist_k(C, ·)
B    2               1         2
D    2               1         2
lrd_k(C) = (½ (2 + 2))⁻¹ = 1/2

Nk(D) = {B, C}
     k-distance(·)   d(D, ·)   reach-dist_k(D, ·)
B    2               2         2
C    1               1         1
lrd_k(D) = (½ (2 + 1))⁻¹ = 2/3

Altogether, lrd_k(A) = 2/7, lrd_k(B) = 2/3, lrd_k(C) = 1/2, lrd_k(D) = 2/3, and

$$\mathrm{LOF}_k(A) = \tfrac{1}{2}(\mathrm{lrd}_k(B) + \mathrm{lrd}_k(C))/\mathrm{lrd}_k(A) = \tfrac{1}{2}(2/3 + 1/2)/(2/7) \approx 2.042$$
$$\mathrm{LOF}_k(B) = \tfrac{1}{2}(\mathrm{lrd}_k(C) + \mathrm{lrd}_k(D))/\mathrm{lrd}_k(B) = \tfrac{1}{2}(1/2 + 2/3)/(2/3) = 0.875$$
$$\mathrm{LOF}_k(C) = \tfrac{1}{2}(\mathrm{lrd}_k(B) + \mathrm{lrd}_k(D))/\mathrm{lrd}_k(C) = \tfrac{1}{2}(2/3 + 2/3)/(1/2) = 4/3 \approx 1.333$$
$$\mathrm{LOF}_k(D) = \tfrac{1}{2}(\mathrm{lrd}_k(B) + \mathrm{lrd}_k(C))/\mathrm{lrd}_k(D) = \tfrac{1}{2}(2/3 + 1/2)/(2/3) = 0.875$$

432 Another Example (figure omitted)

433 Yet Another Example (figure omitted)

434 Comments on Density-Based Methods
▶ As opposed to the distance-based methods, the density of the neighborhood of a (potential) outlier is explicitly taken into account.
▶ Time complexity is still O(p²) but may be reduced (to O(p log p)) for low-dimensional data using special data structures.
▶ We still need to determine the parameter k.
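The hand computation above can be cross-checked with scikit-learn's LocalOutlierFactor, which follows the same LOF definition and stores the negated scores in negative_outlier_factor_. With n_neighbors = 2 on the example dataset, it should print approximately the values derived above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[5.0], [2.0], [1.0], [0.0]])   # the points A, B, C, D

lof = LocalOutlierFactor(n_neighbors=2)      # k = 2, as in the example
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier

# sklearn stores the *negated* LOF scores
print(-lof.negative_outlier_factor_)         # expected approx. [2.042 0.875 1.333 0.875]
```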
435 Clustering-Based Methods

436 Clustering Approach
Clustering is supposed to group similar objects; anomaly detection looks for objects dissimilar to the others. So, clustering inherently solves a closely related problem. But how do we detect anomalies based on clustering?
We may detect anomalies as elements of small clusters. This approach is problematic, as it strongly depends on the size/number of clusters. A better approach is to assess how strongly objects belong to their clusters.
Definition 4 An object is a cluster-based outlier if the object does not strongly belong to any cluster.
Depending on the particular clustering algorithm, we have (at least some) information about the cohesion of objects.

437 Anomaly Detection Using k-Means Clustering
In k-means, the clusters are represented by centroids. We may measure the anomaly of a given object by its distance to the cluster centroid. This approach does not take the density of clusters into account.

438 Anomaly Detection Using k-Means Clustering
Here, we compute the relative distance to the cluster center, that is, the ratio of the object's distance to the centroid and the median distance of all objects of the same cluster to the centroid (a code sketch follows below).

439 Clustering-Based Anomaly Detection
Other algorithms may provide different information (probability density of the clusters, etc.).
The usual problem: outliers have an impact on the clustering itself. Some algorithms can be modified to treat potential outliers specially. For example, during k-means clustering, put objects very far away from the current cluster centroids into a special category not used to move the centroids. After every step, test whether some of these objects have become close enough to at least one centroid (in which case they are removed from the special category).
Setting the proper number of clusters is an issue as well. For anomaly detection, a larger number of smaller clusters may be beneficial, as smaller clusters tend to be more cohesive. If an object seems to be an outlier even with small clusters, it probably is an outlier.

440 Comments on Clustering-Based Methods
Some clustering methods have sub-quadratic complexity. Conceptually, clustering complements anomaly detection, so it is natural to compute both together. On the other hand, the results strongly depend on the number of clusters, and anomalies may distort the clustering.
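A small sketch of the relative-distance idea with scikit-learn's KMeans; the toy data, the number of clusters, and the cut-off of 3 (three times the median distance) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outlier_scores(X, n_clusters=3, seed=0):
    """Score = distance to own centroid / median distance within that cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    scores = np.empty(len(X))
    for c in range(n_clusters):
        in_c = km.labels_ == c
        scores[in_c] = dist[in_c] / np.median(dist[in_c])  # relative distance
    return scores

rng = np.random.default_rng(0)
# three compact clusters plus one planted outlier
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 0], [0, 5])] + [[[10.0, 10.0]]])
scores = kmeans_outlier_scores(X)
print(X[scores > 3])   # objects far from their centroid, relative to the cluster
```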
441 Autoencoders as Anomaly Detectors
... just a short comment

442 Neural Networks in Anomaly Detection
The previous approaches work well with (relatively) low-dimensional data. However, what should we do with data of high or even variable dimension?
▶ Images
▶ Text
▶ Video
▶ Semantic graphs
▶ ...
One way is to transform them into lower-dimensional data: train a neural network that takes the large input and returns a smaller representation which (hopefully) preserves the crucial features of the input.

443 Autoencoders
An autoencoder consists of two parts:
▶ ϕ : R^n → R^m, the encoder
▶ ψ : R^m → R^n, the decoder
The goal is to find ϕ, ψ so that ψ ◦ ϕ is (almost) the identity. The value h⃗ = ϕ(x⃗) is called the latent representation of x⃗.
Training: Assume T = {x⃗1, . . . , x⃗p} where x⃗i ∈ R^n for all i ∈ {1, . . . , p}. Minimize the reconstruction error

$$E = \sum_{i=1}^{p} \left\| \vec{x}_i - \psi(\varphi(\vec{x}_i)) \right\|^2$$

444 Autoencoders – Neural Networks
Both ϕ and ψ can be represented by MLPs Mϕ and Mψ, respectively. Mϕ and Mψ can be connected into a single network.

445 Autoencoders – Usage
▶ Compression/feature extraction – from x⃗ to h⃗.
▶ Dimensionality reduction – the latent representation h⃗ has a smaller dimension.
▶ Generative versions – (roughly) generate h⃗ from a known distribution and let Mψ generate realistic inputs/outputs x⃗.
▶ Anomaly detection (see the next slide).

446 Anomaly Detection with Autoencoders
Straightforward approach:
▶ Train the autoencoder on the normal data.
▶ Detect an anomaly by its large reconstruction error.
The idea is that an anomaly will not be properly reconstructed.
More general approach:
▶ Train the autoencoder and transform the normal data to their low-dimensional latent representations.
▶ Use one of the previous approaches to anomaly detection on the latent representations.
The assumption is that the latent representation of an anomaly will substantially differ from the latent representations of normal instances.

447 Summary of Anomaly Detection
▶ It is hard even to define an anomaly – context knowledge is usually needed.
▶ Supervised anomaly detection reduces to classification.
▶ Unsupervised anomaly detection can be done in various ways; we have seen the following:
▶ Statistical
▶ Proximity-based
▶ Density-based
▶ Cluster-based
▶ Autoencoders
▶ Semi-supervised anomaly detection is usually concerned with data where the normal class is known.
▶ There are many more methods: one-class SVM, isolation forests, etc.

448 Ensemble Methods

449 Voting Classifiers
Train several models. They may differ in
▶ Structure (completely different models)
▶ Training data (subsample the training set)

450 Voting Classifiers
During inference, ensemble the predictions. For binary classifiers, you may do the following:
▶ Take the majority vote.
▶ Aggregate the output probabilities (e.g., by averaging).
▶ If logistic regression or neural networks with a logistic output activation are used, aggregate the outputs before the application of the last logistic sigmoid.

451 The Ensemble Idea
An ensemble of weak classifiers may be a strong classifier.
Analogy: Assume a biased coin with a 51 percent chance of heads and 49 percent of tails. Toss it 1,000 times in a row and repeat the experiment many times. In about 75 percent of the experiments, you get more heads than tails. With 10,000 tosses, it would be about 97 percent.
Intuitively, we might suspect that a voting ensemble of 1,000 sufficiently independent classifiers, each with an accuracy of around 51 percent, would have an accuracy of around 70 percent.
The ensemble methods work best when the models are as independent as possible. Use different learning methods or independently chosen training sets (see later slides).
Let us try to formalize the above intuition.

452 Bias/Variance Decomposition
Consider a random function g(x) + ε, where g : R → R and ε is random noise with mean 0 and variance σ². Consider a dataset for regression: D = {(x1, d1), . . . , (xp, dp)}. Here x1, . . . , xp are fixed inputs, and each dk = g(xk) + εk, where ε1, . . . , εp are independent samples of the noise ε.
We get different models for different samples of the training set D. We may ask:
▶ How much do the models trained on different samples differ?
▶ How much do they differ (on average) from g?
Let us denote by hD : R → R a model trained on the dataset D by minimizing the squared error $\sum_{k=1}^{p}(h_D(x_k) - d_k)^2$.

453 Bias/Variance Decomposition
Let us consider the expected value of the squared error:

$$\mathbb{E}\left[\sum_{k=1}^{p}(h_D(x_k) - d_k)^2\right] = \sum_{k=1}^{p}\mathbb{E}\left[(h_D(x_k) - d_k)^2\right]$$

The expectation is computed over the distribution of the datasets D. One can prove that

$$\sum_{k=1}^{p}\mathbb{E}\left[(h_D(x_k) - d_k)^2\right] = \sum_{k=1}^{p}\left(\mathbb{E}\left[(h_D(x_k) - \mathbb{E}[h_D(x_k)])^2\right] + \left(\mathbb{E}[h_D(x_k)] - g(x_k)\right)^2 + \sigma^2\right)$$

Here
▶ $\mathbb{E}\left[(h_D(x_k) - \mathbb{E}[h_D(x_k)])^2\right]$ is the variance of the model on xk: how much the model's value jumps around its average on xk.
▶ $\mathbb{E}[h_D(x_k)] - g(x_k)$ is the bias of the model: how much the average model's value differs from the true value.
▶ σ² is the noise variance.

454 (Illustration of the decomposition; figure omitted.)

455 B/V Decomposition Proof (Optional)
Let us study $\mathbb{E}\left[(h_D(x_k) - d_k)^2\right]$. To simplify notation, we drop the index k and write $\mathbb{E}\left[(h_D(x) - d)^2\right]$.

$$(h_D(x) - d)^2 = (h_D(x) - \mathbb{E}[h_D(x)] + \mathbb{E}[h_D(x)] - d)^2 = (h_D(x) - \mathbb{E}[h_D(x)])^2 + (\mathbb{E}[h_D(x)] - d)^2 + 2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - d)$$

Now, just apply E to both sides. As

$$\mathbb{E}(d) = \mathbb{E}(g(x) + \varepsilon) = g(x) \qquad \mathbb{E}(d^2) = \mathbb{E}\left((g(x) + \varepsilon)^2\right) = \mathbb{E}\left(g(x)^2 + 2\varepsilon g(x) + \varepsilon^2\right) = g(x)^2 + \sigma^2$$

we obtain ...

456 B/V Decomposition Proof (Optional)
... this

$$\mathbb{E}\left[(\mathbb{E}[h_D(x)] - d)^2\right] = \mathbb{E}[h_D(x)]^2 - 2\,\mathbb{E}(d)\,\mathbb{E}[h_D(x)] + \mathbb{E}(d^2) = \mathbb{E}[h_D(x)]^2 - 2g(x)\,\mathbb{E}[h_D(x)] + g(x)^2 + \sigma^2 = \left(\mathbb{E}[h_D(x)] - g(x)\right)^2 + \sigma^2$$

and this

$$\begin{aligned} \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - d)\right] &= \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - g(x) - \varepsilon)\right] \\ &= \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - g(x)) - 2(h_D(x) - \mathbb{E}[h_D(x)])\,\varepsilon\right] \\ &= \mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - g(x))\right] \\ &= (\mathbb{E}[h_D(x)] - g(x))\,\mathbb{E}\left[2(h_D(x) - \mathbb{E}[h_D(x)])\right] = 0 \end{aligned}$$

Here, the third equality follows from the zero mean of ε, and the final expectation is zero because $\mathbb{E}[h_D(x) - \mathbb{E}[h_D(x)]] = 0$.

457 Ensemble Methods vs Bias/Variance
Suppose that we could somehow sample m independent datasets D1, . . . , Dm. We train m models hD1, . . . , hDm and consider the ensemble model

$$h_{\mathrm{ens}}(x) = \frac{1}{m}\sum_{\ell=1}^{m} h_{D_\ell}(x)$$

Does the ensemble hens have a different bias/variance decomposition?
▶ The noise variance σ² does not change.
▶ The bias does not change ($\mathbb{E}[h_{D_\ell}(x_k)]$ is independent of ℓ):

$$\mathbb{E}\left[\frac{1}{m}\sum_{\ell=1}^{m} h_{D_\ell}(x_k)\right] - g(x_k) = \mathbb{E}[h_{D_\ell}(x_k)] - g(x_k)$$

▶ Only the variance changes; it is reduced:

$$\mathbb{E}\left[(h_{\mathrm{ens}}(x_k) - \mathbb{E}[h_{\mathrm{ens}}(x_k)])^2\right] = \frac{1}{m}\,\mathbb{E}\left[(h_{D_\ell}(x_k) - \mathbb{E}[h_{D_\ell}(x_k)])^2\right]$$

The equality follows from basic facts about the variance of a sum of independent random variables.
Now, having m independently sampled datasets is a real luxury!

458 Bagging
In practice, we use sampled subsets of a given dataset. One example of using different subsets of the training set is bagging (Bootstrap Aggregating).
▶ Bootstrap sampling = sampling data subsets with replacement
▶ Aggregating = majority voting (classification) or averaging (regression)

459 Bagging Decision Trees
(Figure omitted.) Left: a single decision tree (overfit). Right: a bagging ensemble of 500 trees. A code sketch of this comparison follows the summary below.

460 Summary of Bagging
▶ Reduces overfitting (variance)
▶ Can work with any type of classifier
▶ Easy to parallelize: just train each model independently; inference can also be parallelized.
▶ Loses (some) interpretability even for interpretable models (how could you read 500 decision trees?)
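A minimal scikit-learn sketch of bagging decision trees in the spirit of the figure above (a single tree vs. an ensemble of 500 trees); the moons dataset and all hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# a single, most likely overfit, decision tree ...
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# ... vs. a bagging ensemble of 500 trees
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # base_estimator in older sklearn
    n_estimators=500,     # number of trees
    bootstrap=True,       # sample the training subsets with replacement
    n_jobs=-1,            # trees are independent, so training parallelizes
    random_state=42,
).fit(X_train, y_train)

print(tree.score(X_test, y_test), bag.score(X_test, y_test))
```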
461 Random Forest
Random forest = a bagging ensemble of decision trees. It has all the hyperparameters of decision trees + the number of trees.
The random forest algorithm of scikit-learn introduces the following additional randomness: when searching for a split attribute, only a randomly sampled subset of the attributes is considered (by default, √n out of the n attributes). This gives a greater diversity of the trees in the forest.
The notion of feature importance can be easily generalized to random forests (computed over all trees).

462 Boosting
An ensemble method in which weak learners are trained sequentially. Each new learner tries to correct the errors of the current ensemble.
There are several variants of boosting. We just have a look at the basic idea behind AdaBoost (adaptive boosting). To implement this kind of learning algorithm, we need to be able to train models on weighted datasets.

463 Weighted Training
We are going to use weights on samples. Note that these are different from the weights used in neural-like models!
▶ Linear regression: Assume a given dataset D = {(x⃗1, f1), . . . , (x⃗p, fp)} and weights a1, . . . , ap associated with the examples from D. Let w⃗ be the vector of model weights. We minimize the weighted mean squared error

$$E(\vec{w}) = \frac{1}{p}\sum_{i=1}^{p} a_i (\vec{w} \cdot \vec{x}_i - f_i)^2$$

▶ Decision trees: Given a dataset D with p samples, we have weights a1, . . . , ap ∈ R (one weight for each example in D). Given a class c ∈ C, we compute the class probability pc as

$$p_c = \frac{\sum\{a_i \mid \text{the } i\text{-th sample belongs to } c\}}{\sum_{i=1}^{p} a_i}$$

Everything else (Gini impurity, etc.) stays the same.

464 AdaBoost Idea
▶ Train classifiers using weighted samples in the training dataset.
▶ Start with uniform weights, that is, each sample in the dataset has the same weight.
▶ In the k-th iteration:
▶ train a new classifier hk on the weighted samples,
▶ obtain a coefficient αk > 0 of the classifier hk,
▶ consider the current ensemble classifier

$$h^{k}_{\mathrm{ens}}(x) = \mathrm{sign}\left(\sum_{\ell=1}^{k} \alpha_\ell h_\ell(x)\right)$$

and shift the weights so that the "most wrong" training instances for the current ensemble have the largest weights. Consider a training example (x, c) with c ∈ {−1, 1}. Then h^k_ens misclassifies x iff $c \cdot \sum_{\ell=1}^{k} \alpha_\ell h_\ell(x) < 0$; the more negative this number is, the more wrong the ensemble classifier is.
After K iterations, the ensemble classifier h^K_ens is the output. For details, see more advanced courses in machine learning.

465 Summary of Ensemble Methods
▶ Ensemble methods may improve "independently" trained models by grouping them and letting them decide collectively.
▶ Technically, ensembling decreases the model variance.
▶ There are several approaches to ensembling:
▶ Bagging - training several models on bootstrap subsamples of the training dataset (random forest is an example)
▶ Boosting - the ensemble is built sequentially, each newly added model focusing on the instances the current ensemble finds hardest
▶ There are many more algorithms (gradient boosting, etc.)

466 THE END