👷 Introduction to Information Retrieval

Anatomy of the web-scale IR system and embedding revolution (27 March 2024)

Lectures (week 6)

This week summarizes the first part of the course: building an inverted index and querying it at local and global scales. It also introduces the basics of the new generation of indexing based on embeddings. By the end of the week, students should understand inverted index-based retrieval both locally (Sketch Engine, digital libraries) and globally (Google).
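As a reminder of what inverted index-based retrieval means at the smallest scale, here is a minimal sketch in Python. The toy documents, the whitespace tokenizer, and the function names (`build_inverted_index`, `intersect`, `and_query`) are illustrative assumptions, not code from the lectures or from Sketch Engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to a sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():  # naive whitespace tokenization (assumption)
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Intersect two sorted postings lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Return IDs of documents that contain every query term."""
    # Process the shortest postings lists first to keep intermediate results small.
    lists = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical toy collection.
docs = [
    "sketch engine indexes text corpora",
    "google indexes the web",
    "embeddings map text to vectors",
]
index = build_inverted_index(docs)
print(and_query(index, ["indexes", "text"]))  # [0]
```

The same idea carries over to web scale, where the postings lists are compressed and sharded across many machines rather than held in one dictionary.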

Ota Mikušek (Lexical Computing): Indexing and querying texts locally - The story of Sketch Engine (40 min)

Indexing and querying corpora in the Sketch Engine. How are texts crawled from the web and corpora created?

PV211 Information Retrieval Sketch Engine
How are texts crawled from the web and corpora created? Creating, indexing, and querying corpora in the Sketch Engine.

Santosh Kesiraju (FIT VUT): The basics of embedding IR era (60 min)

Introduction to word embeddings
Word2vec: skip-gram and CBOW. Loss function, gradient descent, derivation, and interpretation. Paragraph vector (bag-of-words): loss function and interpretation. Document embeddings. Similarity metrics (cosine, Euclidean) for IR.
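To make the two similarity metrics concrete, the following is a small sketch in Python of ranking documents against a query by cosine similarity and by Euclidean distance. The two-dimensional vectors and the document names `d1` to `d3` are made up for illustration; real document embeddings would come from a trained model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (higher means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v (lower means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy embeddings, chosen so that the two metrics disagree.
query = [1.0, 0.0]
doc_vectors = {
    "d1": [2.0, 0.0],  # same direction as the query, but twice the length
    "d2": [0.6, 0.8],  # unit length, different direction
    "d3": [0.0, 1.0],  # orthogonal to the query
}

by_cosine = sorted(
    doc_vectors,
    key=lambda d: cosine_similarity(query, doc_vectors[d]),
    reverse=True,
)
by_euclidean = sorted(
    doc_vectors,
    key=lambda d: euclidean_distance(query, doc_vectors[d]),
)

print(by_cosine)     # ['d1', 'd2', 'd3']
print(by_euclidean)  # ['d2', 'd1', 'd3']
```

Cosine similarity ignores vector length, so `d1` ranks first; Euclidean distance penalizes the length difference, so `d2` ranks first. On length-normalized vectors the two metrics produce the same ranking.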

Recording of the 2023 PV211 lecture (including extensive comments on the second-term project assignments)

Second-term project assignment
BEIR CQADupStack and ARQMath Collection
Infrastructure building at Google
Slides from the lecture in 2013 by Jeff Dean, principal engineer of Google

Challenges in Building Large-Scale Information Retrieval Systems
Slides from the lecture in 2015 by Jeff Dean, principal engineer of Google

Readings

Google Crash Course (in Czech)
A web page by Dušan "Yuhů" Janovský
The Anatomy of a Large-Scale Hypertextual Web Search Engine
A 1998 paper by Sergey Brin and Lawrence Page
The Google File System
A 2003 paper by Ghemawat et al.
The Anatomy of Google Architecture
Slides for a lecture from 2009 by Ed Austin
Building Software Systems At Google and Lessons Learned
A lecture from 2010-11-10 by Jeff Dean
Lessons Learned While Building Infrastructure Software at Google
Slides for a lecture from 2013 by Jeff Dean
How Google Works (in Czech)
A tutorial from 2014 by Tomáš Effenberger

Seminar

Second-term project assignment (CQADupStack Collection)
Google Colaboratory code for the second-term project
Second-term project leaderboard (CQADupStack Collection)
Google Spreadsheet leaderboard for the second-term project
Alternative second-term project assignment (ARQMath Collection)
Google Colaboratory code for the alternative second-term project
Alternative second-term project leaderboard (ARQMath Collection)
Google Spreadsheet leaderboard for the alternative second-term project
Second-term project JupyterHub
Dedicated computational resources for your second-term projects