PV056 Machine Learning and Data Mining – Introduction
Jan Sedmidubský, Masaryk University, sedmidubsky@mail.muni.cz

Outline
• Course information
• Objectives of the course
• Evaluation – semestral project + final exam
• Outline of the lectures + what you can learn in other courses
• Data mining pipeline
• Data preprocessing
• Tasks – classification, regression, prediction, event detection, anomaly detection
• Learning – supervised, self-supervised, semi-supervised, unsupervised, active, meta
• Semestral project in detail – conditions and tasks
• Existing machine-learning tools/libraries
• Deep learning frameworks – TensorFlow, Keras, PyTorch

Course objectives
• Learn the principles of selected machine-learning (ML) and data-mining (DM) techniques
• Understand how the selected techniques can be applied to specific real-life use cases
• Solve practical tasks within a group of students (semestral projects)

Course evaluation
• Final exam: 80%
  • Open questions, focused on main principles + applicability
• Semestral project: 20%
  • 3–4 students per group
  • Goal – solve a given machine-learning/data-mining problem
    • E.g., classification of plant-disease images
  • You are expected to:
    • Implement your solution in the Google Colab environment (cloud Jupyter notebooks)
    • Write a 2-page project report
    • Present your project (10-minute presentation + 5-minute discussion)
  • Details specified later
  • Project organizer: Ondřej Sotolář (xsotolar@fi.muni.cz)

Course topics
1) Introduction to machine learning and data mining + projects
2) Metric learning, product quantization, approximate searching
3) Advanced clustering methods
4) Advanced anomaly detection
5) Bayesian optimization
6) Automated machine learning
7) Time-series data mining
8) Processing of multidimensional time series of human motion
9) Cross-modal learning
10) Applied machine learning: examples of real-life applications

LLMs as a personal tutor
• You can use LLMs (e.g., ChatGPT) to
  • Discuss suitable methods and parameter settings for different use cases
  • Generate and debug Python code for experimenting with the methods
  • Generate multiple-choice and open questions for self-assessment

Literature and sources
• Textbooks:
  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining. 2nd Edition. Pearson / Addison Wesley, 2019.
  • Aurélien Géron: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow. 2nd or 3rd Edition. O'Reilly, 2019 or 2022.
• Other sources:
  • University of Mannheim – Introduction to Data Mining by Christian Bizer
  • University of Minnesota – Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar
  • Purdue University – Deep Learning by Avi Kak and Charles Bouman

Big data everywhere
• US Library of Congress: ≈ 235 TB archived ≈ 40 Wikipedias
• arXiv preprint server: > 2 million papers
  • Tasks:
    • Discover topic distributions or citation networks
    • Train large language models
• Facebook
  • 4 petabytes of new data generated every day
  • Over 300 petabytes in Facebook's data warehouse
  • Tasks:
    • Predict the interests and behavior of over one billion people
• Law enforcement agencies
  • Collect unknown amounts of data from various sources
    • Cell phone calls
    • Location data
    • Web browsing behavior
    • Credit card transactions
    • Online profiles (Facebook)
    • …
  • Tasks:
    • Predict terrorist activity
    • Find compromising photos
  • Source: https://www.novinky.cz/clanek/krimi-fbi-odhalila-mladeho-cecha-ktery-vyhrozoval-odpalenim-bomby-40391731

Data mining
• Data mining – the process of discovering patterns, relationships, and insights from large datasets
  • Goal – extract useful information from raw data
  • DM methods help us make decisions based on the discovered patterns
  • (Chart: the amount of data that is collected grows far faster than the amount of data that can be looked at by humans.)
• Definitions:
  • "Exploration & analysis of large quantities of data in order to discover meaningful patterns"
  • "Non-trivial extraction of implicit, previously unknown, and potentially useful information from data"

Machine learning
• Machine learning – a branch of AI that enables computers to learn from data and make predictions/decisions without explicit programming
  • Goal – develop models that improve their performance automatically through experience
• Key components:
  • Experience – the input data or historical information the system learns from
  • Task – the specific problem the system is trying to solve (e.g., image classification, speech recognition)
  • Performance measure – a metric used to evaluate how well the system performs the task (e.g., accuracy, precision, recall)
• Definitions:
  • "Statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions"
  • "Improving performance on a specific task by recognizing patterns, making predictions or decisions based on input data"
• (Diagram: deep learning is a subset of machine learning, which is a subset of AI.)
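To make the experience / task / performance triple concrete, here is a minimal scikit-learn sketch; the Iris dataset and the logistic-regression model are illustrative choices, not part of the course material:

```python
# Experience = labeled training data, task = flower-species classification,
# performance measure = accuracy on held-out data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # experience: measurements + labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)                # task: predict the species
model.fit(X_train, y_train)                              # learning from experience

acc = accuracy_score(y_test, model.predict(X_test))      # performance measure
print(f"Accuracy on unseen data: {acc:.2f}")
```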
Data mining vs. machine learning
• Purpose: data mining – find patterns & insights; machine learning – make predictions & automate decisions
• Approach: data mining – exploratory analysis; machine learning – algorithm-based learning
• Dependence on humans: data mining – more human-driven (analysis & interpretation); machine learning – more automated (self-improving models)
• Outcome: data mining – knowledge discovery; machine learning – predictive modeling
• Example: data mining – market basket analysis (which products are bought together); machine learning – recommender system (suggesting products based on user behavior)

Data mining pipeline
• (Figure: the knowledge-discovery process model – data selection, preprocessing and transformation, data mining, and evaluation/deployment – detailed on the following slides.)
• Source: https://www.linkedin.com/pulse/data-mining-knowledge-discovery-process-model-leandro-guerra/

1) Data selection
• Data types – text, images, videos, audio, time series, spatio-temporal data, etc.
• Selection
  • What data is potentially useful for the task at hand?
  • What data is available?
  • What do I know about the quality of the data?
• Exploration / profiling
  • Get an initial understanding of the data
  • Calculate basic summary statistics
  • Visualize the data
  • Identify data problems such as outliers, missing values, duplicate records

2) Preprocessing and transformation
• Data cleaning – handling missing values, removing duplicate records
• Transformation of the data into a suitable representation
  • Scales of attributes (nominal, ordinal, numeric)
  • Number of dimensions (represent the relevant information using fewer attributes)
  • Amount of data (determines hardware requirements)
• Methods (see the pipeline sketch below)
  • Discretization and binarization
  • Feature subset selection / dimensionality reduction
  • Attribute transformation / text to term vector / embeddings
  • Aggregation, sampling
  • Integration of data from multiple sources
• Good data preparation is key to producing valid and reliable models
  • Data integration/preparation takes 70–80% of the time and effort

3) Data mining
• Input: preprocessed data
• Output: model / patterns
• Steps:
  1) Apply a data mining method
  2) Evaluate the resulting model / patterns
  3) Iterate
    • Experiment with different hyperparameter settings
    • Experiment with multiple alternative methods
    • Improve preprocessing and feature generation
    • Increase the amount or quality of the training data
  4) Deploy – use the most promising model in the business context
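Referring back to step 2) Preprocessing and transformation, a minimal sketch of a cleaning-and-transformation pipeline in scikit-learn; the toy DataFrame, column names, and chosen methods are assumptions for illustration only:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical raw data with a missing value and a duplicate record
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 41],
    "income": [30_000, 52_000, 47_000, 61_000, 61_000],
    "city":   ["Brno", "Praha", "Brno", "Ostrava", "Ostrava"],
})
df = df.drop_duplicates()          # data cleaning: remove duplicate records

numeric = ["age", "income"]
nominal = ["city"]

preprocess = ColumnTransformer([
    # numeric attributes: impute missing values, then standardize the scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # nominal attributes: binarize via one-hot encoding
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal),
])

X = preprocess.fit_transform(df)
print(X.shape)                     # transformed feature matrix ready for a mining/learning method
```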
Tasks and applications
• Descriptive tasks
  • Goal: find human-interpretable patterns that describe the data
  • Example: Which products are often bought together?
• Predictive tasks
  • Goal: use some variables to predict unknown or future values of other variables, given observations (e.g., from the past)
  • Example: Will a person click an online advertisement, given their browsing history?
• Machine learning terminology
  • Descriptive ~ unsupervised
  • Predictive ~ supervised

Tasks
• Cluster analysis [descriptive]
• Classification [predictive]
• Regression [predictive]
• Association analysis [descriptive]
• Anomaly detection [predictive]
• Time-series forecasting [predictive]
• Event detection [predictive]
• (Cross-modal) retrieval [descriptive]

Cluster analysis
• Goal: given a set of data points, each having a set of attributes, and a similarity measure among them, find groups such that
  • Data points in one group are more similar to one another (intra-cluster distances are minimized)
  • Data points in separate groups are less similar to one another (inter-cluster distances are maximized)
• Similarity measures
  • Euclidean distance if attributes are continuous
  • Other task-specific similarity measures
• Result: a descriptive grouping of data points (see the k-means sketch below)

Cluster analysis – example
• Application area: market segmentation
• Goal: find groups of similar customers
  • A group may be conceived as a marketing target to be reached with a distinct marketing mix
• Approach:
  1) Collect information about customers
  2) Find clusters of similar customers
  3) Measure the clustering quality by observing buying patterns after targeting the customer groups with distinct marketing mixes

Classification
• Goal: previously unseen records should be assigned a class from a given set of classes as accurately as possible
• Approach:
  1) Given a collection of records (training set)
    • Each record contains a set of attributes
    • One attribute is the class attribute (label) that should be predicted
  2) Find a model for predicting the class attribute as a function of the values of the other attributes (see the classification sketch below)

Classification – example
• Application area: fraud detection
• Goal: predict fraudulent cases in credit card transactions
• Approach:
  1) Use credit card transactions and information about account holders as attributes
    • When and where does a customer buy? What do they buy? How often do they pay on time? Etc.
  2) Label past transactions as fraudulent or fair
    • This forms the class attribute
  3) Learn a model for the class attribute from the transactions
  4) Use this model to detect fraud by observing credit card transactions on an account

Regression
• Goal: predict the value of a continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency
• Examples – predicting:
  • The price of a house or car
  • Sales amounts of a new product based on advertising expenditure
  • Miles per gallon (MPG) of a car as a function of its weight and horsepower
  • Wind velocities as a function of temperature, humidity, air pressure, etc.
• Difference from classification: the predicted attribute is continuous, while classification is used to predict nominal attributes (e.g., yes/no)
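For the cluster-analysis task above, a minimal k-means sketch on synthetic "customer" data; the generated features and the number of clusters are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic customers described by two numeric attributes (e.g., age, yearly spending)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # descriptive grouping of the data points

# Silhouette score: higher values mean tighter intra-cluster and larger inter-cluster distances
print("silhouette:", silhouette_score(X, labels))
```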
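For the classification task, a minimal train-then-predict sketch; the fraud-like synthetic dataset and the random-forest model are illustrative choices, not methods prescribed by the course:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic labeled records: attributes describe transactions, the class attribute is fraud/fair
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)          # learn the class attribute from the training records

# Previously unseen records are assigned a class as accurately as possible
print(classification_report(y_test, clf.predict(X_test)))
```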
Association analysis
• Goal: given a set of records, each of which contains some number of items from a given collection, discover frequent itemsets
• Produce association rules that predict the occurrence of an item based on the occurrences of other items

Association analysis – example
• Application area: supermarket shelf management
• Goal: identify items that are bought together by sufficiently many customers
• Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items (see the frequent-pair sketch below)
• A classic rule and its implications:
  • If a customer buys diapers and milk, then they are likely to buy beer as well
  • So, don't be surprised if you find six-packs stacked next to diapers
  • Promote diapers to boost beer sales
  • If selling diapers is discontinued, this will affect beer sales as well

Deviation / anomaly detection
• Goal: detect significant deviations from normal behavior (see the IsolationForest sketch below)
• Examples:
  • Network intrusion detection
  • Identifying anomalous behavior in sensor networks for monitoring and surveillance
  • Detecting changes in the global forest cover

Time-series forecasting
• Goal: predict future values of a series of data points based on historical data, which is typically organized chronologically
• Examples:
  • Predict future demand for products to optimize inventory and reduce costs
  • Predict energy usage to balance supply and demand effectively
  • Forecast stock prices, currency exchange rates, or economic indicators

Event detection
• Goal: identify, classify, and analyze significant occurrences or patterns in data streams
  • These events often represent meaningful changes, anomalies, or predefined patterns in the data
• Examples:
  • Monitor network activity to identify suspicious or unauthorized access attempts
  • Identify trending topics and significant news in real time from social media platforms
  • Detect suspicious human-motion actions (e.g., kicking) from surveillance cameras

Cross-modal retrieval
• Goal: retrieve relevant data from one modality (e.g., text, image, audio, or video) using a query from another modality
  • This enables seamless interaction between different types of data, leveraging the relationships between them to deliver meaningful results
• Examples:
  • Retrieve unannotated images based on textual descriptions, or vice versa
  • Index large video datasets (e.g., YouTube, Netflix) for content-based retrieval
  • Retrieve clinical notes based on visual annotations or imaging results
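For the association-analysis task above, a minimal frequent-pair sketch in plain Python (no extra libraries needed); the baskets are made up for illustration:

```python
from itertools import combinations
from collections import Counter

# Hypothetical point-of-sale baskets (each basket is the set of items in one transaction)
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk", "beer", "bread"},
    {"beer", "bread"},
]

# Count how often each pair of items occurs together (frequent 2-itemsets)
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))
print(pair_counts.most_common(3))

# Support and confidence of the classic rule {diapers, milk} -> {beer}
n = len(baskets)
both = sum(1 for b in baskets if {"diapers", "milk", "beer"} <= b)
antecedent = sum(1 for b in baskets if {"diapers", "milk"} <= b)
print("support:", both / n, "confidence:", both / antecedent)
```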
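For the deviation / anomaly-detection task, a minimal sketch using scikit-learn's IsolationForest, one of several possible detectors and chosen here only for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # normal behavior
outliers = rng.uniform(low=-6, high=6, size=(10, 2))     # significant deviations
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)   # -1 = anomaly, 1 = normal
print("detected anomalies:", int((labels == -1).sum()))
```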
Learning
• Supervised
• Semi-supervised
• Unsupervised (self-supervised)
• Active learning
• Meta learning

Learning
• Supervised learning
  • Learning from a labeled dataset where the input-output relationship is known
• Key features:
  • The data has labels
  • The model learns a mapping function (e.g., classification or regression tasks)
• Examples: image classification, speech recognition
• Challenges: requires a large amount of labeled data

Learning
• Unsupervised (self-supervised) learning
  • Learns patterns from unlabeled data
• Key features:
  • No labeled data
  • Focuses on clustering, dimensionality reduction, and anomaly detection
• Examples: clustering customers into segments, discovering hidden patterns

Learning
• Semi-supervised learning
  • Combines a small amount of labeled data with a large amount of unlabeled data
• Key features:
  • Uses both labeled and unlabeled data
  • Improves performance when labeled data is scarce
• Examples: text classification where only a few labeled examples are available, but a large amount of raw text can be leveraged

Learning
• Active learning
  • The model actively queries for labels on the data it is most uncertain about (see the code sketch below)
• Key features:
  • Reduces labeling costs by asking for human annotations only on difficult or ambiguous samples
• Examples: real-world scenarios where labeling all data is expensive, such as medical diagnosis

Learning
• Meta learning
  • Learning to learn – the model adapts to new tasks by leveraging past learning experiences
• Key features:
  • Focuses on fast learning from few examples
  • Helps in generalizing to new tasks quickly
• Examples: few-shot learning, where the model learns to classify from only a few examples per class
  • A model trained on various handwriting styles can quickly adapt to recognizing a new, unseen script with minimal samples

Learning – comparison
• Supervised learning: labeled data; maps input to output; e.g., image classification
• Semi-supervised learning: mixed labeled/unlabeled data; leverages limited labeled data; e.g., text categorization
• Unsupervised learning: unlabeled data; finds hidden structures or patterns; e.g., clustering, dimensionality reduction
• Active learning: minimal labels; queries only the most uncertain data points; e.g., medical image labeling
• Meta learning: few labeled examples; learns how to learn new tasks faster; e.g., few-shot classification
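Referring back to the active-learning slide, a minimal uncertainty-sampling loop; the synthetic data, the logistic-regression model, and the query budget are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
labeled = list(range(10))                      # start with a few labeled examples
unlabeled = list(range(10, len(X)))

model = LogisticRegression(max_iter=1000)
for _ in range(20):                            # query budget
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    # pick the sample the model is most uncertain about (lowest maximum class probability)
    uncertainty = 1 - proba.max(axis=1)
    query = unlabeled[int(np.argmax(uncertainty))]
    labeled.append(query)                      # in practice: ask a human annotator for this label
    unlabeled.remove(query)

print("accuracy on the remaining pool:", model.score(X[unlabeled], y[unlabeled]))
```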
Top ML algorithms in industry
• (Chart: most-used ML algorithms according to a Kaggle online poll, 2022, 23,997 respondents.)
• The reasons for their use:
  • High accuracy for structured data
  • Easy to implement and train
  • Interpretability and explainability
• Source: https://www.kaggle.com/code/eraikako/data-science-and-mlops-landscape-in-industry

Semestral project
• Goal – design, implement, and test a solution to a given ML/DM task
• Requirements – you will:
  • Select one of the offered topics
  • Form a team of 3–4 students and collaborate
  • Implement your solution in Google Colab
    • You provide a link to your solution in Colab
  • Write a compact technical report with a hard limit of 2 pages
    • You upload the report into IS MU
  • Present your project within a 10-minute presentation
• Limitations:
  • If you decide to use prompt engineering, you need to include at least 2 additional techniques in your solution, such as RAG or prompt augmentation through external signals

Semestral project – textual report
• Write a compact technical report with a hard limit of 2 pages + appendices (additional plots or tables, author contributions)
• Use the Springer template – LaTeX or Word
• It should contain the following sections:
  • Introduction
  • Related work
  • Proposed method(s)
  • Results and discussion
  • Appendix
    • Author contributions: very short descriptions of individual author contributions
• Recommendation – the report should include one table/plot with results; additional tables/plots can be included in the Appendix or in the Colab notebook

Semestral project – deadlines
• Feb 18–28: forming groups of 3–4 students + topic selection
  • Task: enter the information into the provided Excel sheet
• March 1 – April 15: implementation phase (i.e., deadline April 15)
  • If you have any issue, you can ask for feedback (Ondřej Sotolář: xsotolar@fi.muni.cz)
  • Task: hand over the link to your Colab solution & the report PDF into the IS MU vault
    • Test that the notebook is set to shared
• April 15–29: preparation of the presentation
  • Task: prepare a 10-minute presentation – presentations start from April 29
  • In short: introduce your problem & related work, but mainly focus on your approach and results
• Evaluation – you will be notified about the final score, which constitutes 20% of the final mark
  • 14% for a basic solution, 6% bonus for addressing the reviewer's issues or for high-quality work

Semestral project – topics
• Topic 1 – Human activity recognition (~time-series classification)
  • https://www.kaggle.com/datasets/uciml/human-activity-recognition-with-smartphones
• Topic 2 – Food hazard detection (~text classification, optionally multi-modal)
  • https://food-hazard-detection-semeval-2025.github.io
• Topic 3 – Plant disease classification (~image classification)
  • https://github.com/Denisganga/the_plant_doctor/tree/main
• Topic 4 (harder) – Urban sound classification (~signal processing)
  • https://www.kaggle.com/code/aadith0/rnn-audio-classification

Topic 1: Human Activity Recognition (HAR)
• Classify recordings of study participants performing activities while carrying a smartphone with embedded inertial sensors
• Data:
  • https://www.kaggle.com/datasets/uciml/human-activity-recognition-with-smartphones
• Data analysis:
  • https://www.kaggle.com/code/anushareddy56/starter-human-activity-recognition-6cad9ae9-2
  • https://sakshamchecker.medium.com/human-activity-recognition-7abaa9a1cf34
• Baseline: naïve LSTM, ~0.8 macro F1
  • https://colab.research.google.com/drive/1a1QP9gyS9Rptq2escYjGj6hqcO5DsPwA?usp=sharing

Topic 1: HAR continued…
• Data:
  • The 30 subjects' data is randomly split into 70% training and 30% test data
  • Each data point corresponds to one of 6 activities
  • Classes are almost balanced
• Baseline:
  • Most feature-based and neural-net-based models should easily reach > 0.9 macro F1
• Dataset paper:
  • https://www.esann.org/sites/default/files/proceedings/legacy/es2013-84.pdf

Topic 1: HAR conclusion
• Steps:
  1. Get acquainted with the area of human activity recognition
    • Hint: read blog posts and research papers to get an idea
  2. Perform exploratory data analysis
    • Plot both overview statistics and intuitively cherry-picked feature statistics
      • Hint: acceleration should separate walking/sitting, etc.
    • Try automatic clustering to hypothesize about problematic classes
      • Hint: t-SNE
  3. Train a predictive model of your own choice (a possible starting point is sketched below)
    • You need to improve over the naïve baseline in the provided Colab
    • Use Google Colab! This is a non-negotiable requirement
    • The solution does not need to include a neural network
  4. Perform an error analysis
    • Identify classes that are easy/hard to predict
  5. Hand over the Colab link and the technical report PDF
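A possible starting point for Topic 1, assuming the Kaggle CSVs (train.csv / test.csv with an Activity label and a subject column) have been uploaded to the Colab session; the file layout, column names, and the random-forest model are assumptions to verify against the actual dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Assumed layout of the Kaggle HAR dataset: precomputed sensor features + Activity label
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["Activity", "subject"])
y_train = train["Activity"]
X_test = test.drop(columns=["Activity", "subject"])
y_test = test["Activity"]

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)

# The project asks for an improvement over the naive baseline (~0.8 macro F1)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```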
Topic 2: Food hazard detection
• Classify titles of food-incident reports collected from the web (NLP)
• Data:
  • https://food-hazard-detection-semeval-2025.github.io/
• Baseline:
  • TF-IDF + LinearRegression (see the sketch below)
  • https://colab.research.google.com/drive/1hv6QifrJ6qRddffoQR1ZQlaWCDBalDhY?usp=sharing
• Leaderboard:
  • https://codalab.lisn.upsaclay.fr/competitions/leaderboard_widget/19955/

Topic 3: Plant disease classification
• Train a model to discriminate plant diseases given their images (CV)
• Data:
  • https://github.com/Denisganga/the_plant_doctor/tree/main
• Baseline:
  • https://colab.research.google.com/drive/1o5gXJuB8B-kV0ehfDW1Kv1w0iNHivAKU?usp=sharing

Topic 4: Urban sound classification
• Train a model on the train split to classify short sound recordings
• Data:
  • UrbanSound8K: https://www.kaggle.com/code/aadith0/rnn-audio-classification
  • Short sound recordings (e.g., dog bark)
  • 10 classes, slightly imbalanced numbers of instances per class
• Baseline:
  • Evaluate the model on the test split
• How to:
  • A naïve solution is to plot the sound recordings and use image classification, or to use raw time-series models
  • Better solutions use signal-processing features (e.g., MFCC – see the sketch below) or image processing (spectrograms): https://github.com/mashrin/UrbanSound-Spectrogram
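For Topic 4, a sketch of extracting MFCC features, assuming the librosa library is available and using a placeholder file name; the resulting fixed-length statistics can then feed any classifier:

```python
import numpy as np
import librosa

# Load one short recording (placeholder file name) and compute MFCC features
y, sr = librosa.load("dog_bark.wav", sr=22050)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, number of frames)

# A simple fixed-length representation: mean and std of each coefficient over time
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(features.shape)   # (26,) feature vector for a downstream classifier
```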
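Looking back at Topic 2, a sketch of a TF-IDF text baseline in the spirit of the one linked there; a logistic-regression classifier stands in for the linear model of the provided notebook, and the example titles and labels are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical report titles and hazard labels; replace with the SemEval-2025 data
titles = ["Recall of cheese due to Listeria", "Undeclared peanuts in cookies",
          "Glass fragments found in jam", "Salmonella detected in chicken"]
labels = ["biological", "allergens", "foreign bodies", "biological"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
baseline.fit(titles, labels)
print(baseline.predict(["Metal pieces in canned soup"]))
```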
Past projects – examples of solutions
• Selected examples from past semesters:
  • AlphaZero for 2-player games
    • https://colab.research.google.com/drive/1l9sGcW466SBNRLsl0KvqVVt4NDJhBShi
  • Spatio-temporal prediction
    • https://colab.research.google.com/drive/1LMOP3UqRpRy92mUdfh6iqOGUS781I1y1
  • Feature construction using genetic programming
    • https://colab.research.google.com/drive/1Y-FuI07lnYutbh2rw2SM1NnqJ3ONFujh#scrollTo=2DN7msn0i7ZD
  • Feature hashing
    • https://colab.research.google.com/drive/1KKtwurErcvkEnQfsczCmyJ-PF5DL209C
  • Object recognition with the Vision Transformer
    • https://colab.research.google.com/drive/1_GCpaFtSoRLdW6u7R7Hv-4NshM0Th5rg#scrollTo=XZtQuNSsgYy7

Semestral project – ML frameworks to use

ML frameworks – continued
• Scikit-learn
  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
• Warning! Learn about neural networks before working with the following.
• PyTorch & TensorFlow
  • Open-source deep learning frameworks
  • Autograd: dynamic computation graphs
  • GPU acceleration
• Hugging Face
  • Hub of state-of-the-art pretrained models for NLP & CV
  • Python library

Development tools
• Colab
  • Pros: free GPU; online, so no setup and all platforms; seamless graphical interface
  • Cons: time limits; debugging in the terminal
• Vim/NeoVim
  • Pros: skill building; easy setup on remote machines
  • Cons: requires access to a machine with a GPU; coding & debugging in the terminal; too much fun
• VS Code
  • Pros: GUI; can run on remote folders; Copilot
  • Cons: requires your own/access to a machine with a GPU; difficulties in setting up remote development
• PyCharm
  • Pros: excellent GUI debugging; very capable IDE; data inspection tools
  • Cons: requires your own/access to a machine with a GPU; difficulties in setting up remote development

Optional development resources
• Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Well-written book on theory, with exercises; relevant for data science
• Bishop, C. M., & Bishop, H. (2023). Deep Learning: Foundations and Concepts. Springer Nature.
  • Newer book with a focus on neural networks
• Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  • Foundational theory behind neural networks
• Jurafsky, D., & Martin, J. H. Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
  • Foundations of NLP with both feature-based and neural-net models
• Kaggle for code and data analysis