Research: “How to do cool stuff and get paid for it”

Today’s speakers: Miriama Jánošová, Jaroslav Oľha, Terézia Slanináková, David Procházka
Topics: Complex data, AlphaFind, Learned Metric Index, PhD study

What is research?
Realizing your ideas, programming, presenting, designing experiments, staying abroad, meeting clever people, writing publications, creating diagrams, teaching*, being paid, doing cool stuff, supervising students*, going to conferences, being creative, making friends, working on cutting-edge things.

What is research? Doing cool stuff.

How to make cool stuff? Either find something you think is cool, or find someone who makes cool things.

CODA Research Group
We find patterns in data and mine information from complexity.
• We use Python, Rust, PyTorch, Docker, Kubernetes, JupyterHub, …
• We work with images, proteins, human motion, …
• We cooperate with partners from Switzerland, Denmark, Germany, …
• We organize invited talks from Kiwi.com, JAMF, SAP, …

Complex Data
Today’s data is big, complex, and abundant, and that is a problem…
How does Spotify recommend similar songs? How does Netflix determine what you should watch next?

Everything can be a vector…
Complex object → Embedding model → high-dimensional dense vector embedding, e.g. (0.9259, -0.4775, …, 0.7019, -0.5630)

Dimensionality of Embeddings
Data source                       Dimensionality
DINOv2 (image)                    384 – 1,536
CLIP (image + text description)   512 – 1,024
Llama 3 (text)                    4,096 – 16,384

The scale:
(0.9259, -0.4775, …, 0.7019, -0.5630)
(7.1041, -3.8554, …, -12.5602, 0.9923)
(-2.5482, 0.2563, …, 3.002, 8.8223)
(3.8520, -39.459, …, 15.3019, -1.0592)
(1.3222, -0.8269, …, 7.3929, 4.6901)
…
1,000,000,000+ vectors, 1,000+ dimensions, 4+ TB of memory.

Any two vectors have similar distance
(Source: Chávez et al., Searching in metric spaces, 2001)
[Figure: low-dimensional dataset vs. high-dimensional dataset. Clusters? No!]

Curse of Dimensionality
1. Problems get exponentially harder
2. Any two vectors have similar distance
3. All vectors are near orthogonal
4. No notion of locality
5. No clusters

Recall the scale: 1,000,000,000+ vectors, 1,000+ dimensions, 4+ TB of memory.
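To sanity-check the 4+ TB figure, a back-of-the-envelope calculation in Python (a minimal sketch; float32 storage and the round counts of 10^9 vectors and 1,000 dimensions are assumptions taken from the numbers above):

# Back-of-the-envelope memory footprint of a billion-scale embedding collection,
# assuming float32 storage (4 bytes per value) and the counts quoted above.
n_vectors = 1_000_000_000   # 1,000,000,000+ vectors
n_dims = 1_000              # 1,000+ dimensions
bytes_per_value = 4         # float32

total_bytes = n_vectors * n_dims * bytes_per_value
print(f"{total_bytes / 1e12:.1f} TB")  # 4.0 TB of raw vectors, before any index overhead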
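The claim that any two vectors have similar distance can be checked empirically. Below is a minimal sketch (assuming NumPy and uniformly random points, which is a simplification of real embedding distributions) that compares the spread of pairwise Euclidean distances, relative to their mean, as dimensionality grows.

import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim, n_points=2_000, n_pairs=10_000):
    """Std/mean of pairwise Euclidean distances between random points in [0, 1]^dim."""
    points = rng.random((n_points, dim))
    a = points[rng.integers(n_points, size=n_pairs)]
    b = points[rng.integers(n_points, size=n_pairs)]
    dists = np.linalg.norm(a - b, axis=1)
    return dists.std() / dists.mean()

for dim in (2, 10, 100, 1_000):
    print(f"{dim:>5}D: relative spread of distances = {relative_spread(dim):.3f}")

# The ratio shrinks as dimensionality grows: distances concentrate around their mean,
# so "nearest" and "farthest" neighbours become hard to tell apart (the core of the curse).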
Learned Metric Index
Next-generation indexing for high-dimensional data.

Learned Indexing for 1D data (Kraska et al. 2018)
Replace the tree with a model.
Why? Because we could reduce O(log n) to O(1).

Learned Indexing: there is more to it
- One-dimensional data (1D): learn the cumulative distribution function.
- Multi-dimensional data (2 to 20D): transform into the 1D case.
- High-dimensional data (20D+): learn an existing clustering or iteratively improve one.
(Recall the embedding dimensionalities above, hundreds to tens of thousands of dimensions: embeddings are squarely the high-dimensional case.)

The Learned Metric Index is built from two ingredients:
- A bucket: the basic organizational unit of a collection of vectors.
- A machine learning model solving a supervised classification problem: “Where are the most similar objects?”
(A code sketch of this bucket-and-classifier idea follows the AlphaFind approach below.)

Learned indexes for high-dimensional data: Learned Metric Index, NeuralLSH, BLISS, BATL, FLEX, …

AlphaFind
Redefining what it means to efficiently search within 214M proteins.
AlphaFind: similarity search applied to 214 million proteins (T. Slaninakova, 27. 9. 2024).

The why
• Proteins are chains of amino acids folded in 3D space.
• The shape determines the function.
• Use case for similarity: drug design.

The “can-I-do-sim-search-on-it” checklist
1. There is a ground truth we can rely on (necessary):
   - We know what similar and not similar look like.
   - Ideally, we can quantify it. For proteins: % of protein alignment.
2. We can represent the data as vectors (optional, but saves us a lot of headache).
   For proteins: graph neural networks, polynomials, …

Ok, now what?
- Task: given an input protein, find the k most similar proteins in the 214M-protein AlphaFold database.
- Approach:
  1. Offline phase (optimize as much as you want, you have, in theory, all the time in the world):
     1. transform the data into vectors
     2. pre-cluster based on mutual distance
     3. create an index to help with navigation to the clusters
  2. Online phase (you had better be fast here):
     1. locate the protein ID in the database
     2. predict its location with the index
     3. do quick (but not as accurate) pre-filtering
     4. do slow (but accurate) post-filtering
- The challenge: carefully strike the balance between accuracy and speed.
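To make the bucket-and-classifier idea and the offline/online split concrete, here is a minimal sketch, not the actual LMI or AlphaFind code: it assumes scikit-learn, synthetic vectors in place of real protein embeddings, plain k-means buckets, and a single-level index (the real systems use their own models, multi-level structures, and far larger data).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
data = rng.normal(size=(20_000, 64)).astype(np.float32)   # stand-in for embedded proteins

# --- Offline phase: pre-cluster, then learn "which bucket holds the similar objects?" ---
n_buckets = 32
kmeans = KMeans(n_clusters=n_buckets, n_init=10, random_state=0).fit(data)
bucket_of = kmeans.labels_                                  # bucket label per vector
index_model = LogisticRegression(max_iter=1_000).fit(data, bucket_of)
buckets = [np.flatnonzero(bucket_of == b) for b in range(n_buckets)]

# --- Online phase: predict likely buckets, pre-filter cheaply, post-filter exactly ---
def search(query, k=10, n_probe=3):
    # 1. the index predicts which buckets most likely contain the nearest neighbours
    probs = index_model.predict_proba(query[None, :])[0]
    top_buckets = index_model.classes_[np.argsort(probs)[-n_probe:]]
    candidates = np.concatenate([buckets[b] for b in top_buckets])
    # 2. exact (slower) distances are computed only on the small candidate set
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

ids, dists = search(data[42])
print(ids[:3], dists[:3])   # the query itself should come back at distance 0 if routing works

The accuracy/speed balance mentioned above lives in n_probe and in the quality of the classifier: probing more buckets means more exact distance computations, i.e. slower but more accurate answers.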
Result
- The backend of the AlphaFind app running at alphafind.fi.muni.cz.
- An associated publication in the Nucleic Acids Research journal.

Summary
1. Research does not have to be theoretical.
2. Research is not reserved for professors: your presenters today, bio friends from CEITEC, and a Bc. student who created the entire front-end as his bachelor’s thesis.
Next plans:
- search in protein complexes / RNA / DNA
- search in the ESM Atlas (700M proteins)
- discussing the integration into big protein databases
We’re looking for collaborators :)
Also, check us out today in Sitola (A502) at Researchers’ Night (Noc vědců).

PhD
What is it? Why should I care? How much does it cost? …
What about the 💸💸💸?
16k + up to 14k + extra = ~50k net income
Involvement in projects, teaching, …

What do we offer?
muni.cz/go/students

Thesis Topics About Cool Stuff #1 (thesis tag: CODA research group)
Machine Learning
• Indexing Complex Data With Transformers
• Continual Learning for Evolving Data
• Designing Model Architecture for Learned Indexing of Complex Data
Human Motion Data
• Quantization of Auto-Encoded Human Motion Features
Algorithms
• Designing Clustering Algorithm for Indexing Complex Data

Thesis Topics About Cool Stuff #2 (thesis tag: CODA research group)
Curse of Dimensionality
• Understanding the Curse of Dimensionality: Implications for High-Dimensional Indexing
Indexing
• Nearest Neighbor Ordering Under Dimensionality Reduction Techniques
• Graph Navigation Approaches to ANN Indexing
Bioinformatics
• Search systems for biomolecular complexes (RNA + proteins)
• Extending metadata search with actual data for large molecular dynamics repositories

Open Position: Machine Learning Enthusiast
• Develop novel machine learning and data mining approaches to uncover patterns within large datasets.
• Collaborate with other researchers to design and implement algorithms for fast indexing of complex data such as human motion, proteins, images, etc.
• Analyze and interpret results from experiments using various visualization tools and statistical methods.

Get involved with the CODA Research Group: muni.cz/go/coda
1. Collaborate on research topics (Learned Metric Index, AlphaFind, …)
2. Do your PhD in our group
3. Sign up for one of our thesis topics
Still not sure? Come ask us in person or contact us (dohnal@fi.muni.cz) and we can figure it out together!