Cranfield collection IR system
Václav Sobotka
454828@mail.muni.cz
May 7, 2021
Document & query preprocessing techniques
Tokenization – nltk word tokenizer
Trimming of hyphens, apostrophes, slashes and similar
characters
Splitting tokens by hyphens and slashes into more tokens
Filtering of ignored tokens – slightly expanded list of stopwords
from nltk
Stemming using SnowballStemmer from nltk
Bigrams are derived and used as separate terms
Bigrams are created as a concatenation of two subsequent terms
after both are completely preprocessed including the stemming
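The steps above can be sketched as a small pipeline. This is a minimal, dependency-free illustration and not the author's code: whitespace splitting stands in for the nltk word tokenizer, `toy_stem` is a hypothetical stand-in for nltk's `SnowballStemmer`, and the stopword list, the set of trimmed characters and the `_` bigram separator are assumptions.

```python
import re

# Assumed stand-in list; the slides use a slightly expanded nltk stopword list.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "is", "for", "to"}
# Assumed set of characters trimmed from both ends of a token.
TRIM_CHARS = "-'/.,;:()\"`"


def toy_stem(token):
    """Hypothetical stand-in for SnowballStemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem=toy_stem):
    """Return unigram terms followed by bigram terms."""
    terms = []
    # Whitespace split stands in for the nltk word tokenizer.
    for raw in text.lower().split():
        # Trim edge characters, then split on inner hyphens and slashes.
        for part in re.split(r"[-/]", raw.strip(TRIM_CHARS)):
            part = part.strip(TRIM_CHARS)
            if part and part not in STOPWORDS:
                terms.append(stem(part))
    # Bigrams join two subsequent, fully preprocessed terms.
    bigrams = [a + "_" + b for a, b in zip(terms, terms[1:])]
    return terms + bigrams
```

For example, `preprocess("The boundary-layer flows")` yields the three stemmed terms and two bigrams built from them. Passing a real stemmer's `stem` method as the `stem` argument would replace the toy one.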
Inverted index structure & vector model
Custom implementation
Items in posting lists keep occurrence counts for title, authors,
bibliography and body separately instead of total count for term
in document
Further used for different importance of document sections in the
TF-IDF model
TF-IDF with (weighted) cosine similarity is used as the vector
model
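A rough sketch of such an index follows, assuming that section importance enters the model by scaling each section's count before it is multiplied by the IDF. The weight values in `SECTION_WEIGHTS`, the `log(N / df)` form of the IDF and all function names are assumptions for illustration; the slides do not give them.

```python
import math
from collections import defaultdict

# Hypothetical section weights; the actual values are not stated in the slides.
SECTION_WEIGHTS = {"title": 3.0, "authors": 1.0, "bibliography": 1.0, "body": 1.0}


def build_index(docs):
    """docs: {doc_id: {section: [terms]}} -> {term: {doc_id: {section: count}}}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for doc_id, sections in docs.items():
        for section, terms in sections.items():
            for term in terms:
                index[term][doc_id][section] += 1
    return index


def weighted_tf(counts):
    """Combine per-section counts into one section-weighted term frequency."""
    return sum(SECTION_WEIGHTS[section] * n for section, n in counts.items())


def doc_vectors(index, n_docs):
    """Build sparse TF-IDF vectors {doc_id: {term: weight}} from the index."""
    vectors = defaultdict(dict)
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))
        for doc_id, counts in postings.items():
            vectors[doc_id][term] = weighted_tf(counts) * idf
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0
```

Keeping the counts separate in the posting list means the section weights can be tuned without rebuilding the index.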
Pseudo-relevance feedback
Proved to have a significant impact on the results
After obtaining the initial result, all documents with scores of at
least 90 % of the best document are taken as relevant
Top 100 important terms (based on TF-IDF) per document are
kept
The second, expanded query is enriched by the most important
terms from all the documents that were assumed to be relevant
Only one additional query is constructed and executed
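The feedback step might look roughly like the sketch below. The 90 % threshold and the limit of 100 terms per document come from the slide; the function name `expand_query`, the input shapes and the choice to append new terms without any reweighting are assumptions.

```python
def expand_query(query_terms, ranked, doc_vectors, threshold=0.9, top_terms=100):
    """Build the single expanded query used for the second retrieval pass.

    ranked: list of (doc_id, score) pairs sorted best first.
    doc_vectors: {doc_id: {term: tfidf_weight}}.
    """
    if not ranked:
        return list(query_terms)
    best_score = ranked[0][1]
    # Documents scoring at least `threshold` of the best one are assumed relevant.
    relevant = [doc_id for doc_id, score in ranked if score >= threshold * best_score]
    expanded = list(query_terms)
    seen = set(expanded)
    for doc_id in relevant:
        vector = doc_vectors[doc_id]
        # Keep only the highest-weighted terms of each assumed-relevant document.
        for term in sorted(vector, key=vector.get, reverse=True)[:top_terms]:
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded
```

The expanded query is then run once against the same index; no further feedback rounds follow.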
Topic modeling
In addition to the main index used for TF-IDF searching, 3 more
indices with topic modeling were used
The topic modeling code was taken from the example notebook
for LSI
The three indices use 300, 150 and 50 topics respectively
The final ordering of answers is based on a weighted scheme
The main TF-IDF index is the most important
The importance of the topic indices decreases with the
decreasing number of topics
The final score is just a weighted sum of scores achieved across
all of the four indices
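The weighted scheme can be illustrated as follows. The numeric weights and the index names in `INDEX_WEIGHTS` are purely hypothetical; the slides only state that the TF-IDF index counts most and that the topic indices matter less as the number of topics decreases.

```python
# Hypothetical weights, ordered as described: TF-IDF first, then 300, 150, 50 topics.
INDEX_WEIGHTS = {"tfidf": 1.0, "lsi_300": 0.5, "lsi_150": 0.3, "lsi_50": 0.1}


def combine_scores(scores_by_index, weights=INDEX_WEIGHTS):
    """Rank documents by the weighted sum of their per-index scores.

    scores_by_index: {index_name: {doc_id: score}}.
    Returns doc ids sorted from best to worst.
    """
    total = {}
    for name, scores in scores_by_index.items():
        for doc_id, score in scores.items():
            total[doc_id] = total.get(doc_id, 0.0) + weights[name] * score
    return sorted(total, key=total.get, reverse=True)
```

A document missing from one index simply contributes nothing for that index, so a strong topic-model match cannot outweigh a clearly better TF-IDF score under these example weights.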