Matrix decompositions & latent semantic indexing (Chapter 18)

Definition 1 (Latent semantic analysis (LSA)) Tf-idf term and document representations are high-dimensional and sparse. This poses computational problems and reduces recall: two terms that occur in similar but distinct sets of documents have dissimilar representations even though the terms themselves are similar. We can tackle this problem by mapping the term-document matrix $C$ to a low-rank representation $C_k$ of rank $k$ that minimizes the Frobenius norm of $C - C_k$:

$$\|C - C_k\|_F = \sqrt{\sum_t \sum_d (C - C_k)_{t,d}^2}.$$

The Eckart-Young theorem states that $C_k = U \Sigma_k V^T$, where $C = U \Sigma V^T$ is the singular value decomposition (SVD) of $C$ and $\Sigma_k$ is $\Sigma$ with only the $k$ largest diagonal entries (singular values) retained. The rows of $U \Sigma_k$ then give us dense $k$-dimensional tf-idf term representations, while the columns of $\Sigma_k V^T$ give us dense $k$-dimensional tf-idf document representations. The parameter $k$ is usually in the low hundreds. Thanks to the low dimensionality, LSA makes the representations of similar terms more similar, which increases recall and may also increase precision.

Exercise 18/1 Assume we have a term-document incidence matrix $C$ for three terms (rows) and two documents (columns) with the following singular value decomposition (SVD) $C = U \Sigma V^T$:

$$C = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} -0.816 & 0.000 \\ -0.408 & -0.707 \\ -0.408 & 0.707 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 1.732 & 0.000 \\ 0.000 & 1.000 \end{pmatrix}, \quad
V^T = \begin{pmatrix} -0.707 & -0.707 \\ 0.707 & -0.707 \end{pmatrix}.$$

Using a Google Colaboratory notebook, compute: a) a rank 1 representation $C_1$ of $C$, b) the 1-dimensional document representations, and c) the 1-dimensional term representations. (A NumPy sketch of this computation appears after Definition 5 below.)

Distributed Representations (Chapter 18)

Definition 2 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term $t$ in a document $d$ has weight

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t,$$

where $\text{tf}_{t,d}$ is the number of tokens of term $t$ in document $d$ (the term frequency of $t$ in $d$) and $\text{idf}_t = \log \frac{N}{\text{df}_t}$ is the inverse document frequency of $t$ in a collection of $N$ documents, $\text{df}_t$ being the number of documents that contain $t$.

Definition 3 (β„“2 (cosine) normalization) A vector $v$ is cosine-normalized by

$$v_j \leftarrow \frac{v_j}{\|v\|} = \frac{v_j}{\sqrt{\sum_{k=1}^{|v|} v_k^2}},$$

where $v_j$ is the element at the $j$-th position of $v$.

Definition 4 (tf-idf term representation) Just as a document $d$ can be represented as a vector of tf-idf weights of terms $t$, a term $t$ can be represented as a vector of its weights in documents $d$. If we represent our document collection as a term-document matrix $C_{t,d} = \text{tf-idf}_{t,d}$ (see Definition 2), then the tf-idf term representations correspond to the rows of $C$ and the tf-idf document representations correspond to the columns of $C$.

Definition 5 (Word2Vec) Word2Vec is a neural network language model. Given terms $t_1$ and $t_2$, Word2Vec predicts the probability $p(t_1 \mid t_2)$ of term $t_1$ appearing in a fixed-size context window surrounding term $t_2$. Word2Vec is trained on a text corpus to maximize the probabilities of terms that actually appear in each other's context. As a by-product, Word2Vec produces dense $k$-dimensional term representations. The parameter $k$ is usually in the low hundreds.
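For Exercise 18/1, the following minimal NumPy sketch (the variable names are ours) computes the rank-1 approximation $C_1$, the 1-dimensional document representations, and the 1-dimensional term representations:

import numpy as np

# Term-document incidence matrix from Exercise 18/1: three terms (rows), two documents (columns).
C = np.array([[1, 1],
              [0, 1],
              [1, 0]], dtype=float)

# Full SVD: C = U @ np.diag(S) @ Vt, with the singular values S in decreasing order.
U, S, Vt = np.linalg.svd(C, full_matrices=False)

k = 1  # rank of the low-rank approximation

C1 = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # a) rank-1 representation of C
doc_repr = np.diag(S[:k]) @ Vt[:k, :]        # b) 1-dimensional document representations (columns)
term_repr = U[:, :k] @ np.diag(S[:k])        # c) 1-dimensional term representations (rows)

print(C1, doc_repr, term_repr, sep="\n")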
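Definitions 2-4 can be made concrete with a small NumPy sketch. The term-frequency counts below are toy values of ours (they are not the weights from Exercises 6/1 through 6/3), and a base-10 logarithm is assumed for the idf:

import numpy as np

# Toy term frequencies: rows = terms, columns = documents (made-up counts).
tf = np.array([[27,  4,  0],
               [ 3,  0,  0],
               [ 0, 14,  5]], dtype=float)

N = tf.shape[1]                 # number of documents in the collection
df = (tf > 0).sum(axis=1)       # document frequency of each term
idf = np.log10(N / df)          # inverse document frequency (Definition 2)
C = tf * idf[:, np.newaxis]     # term-document matrix of tf-idf weights

# l2 (cosine) normalization of each document vector, i.e. each column (Definition 3).
C = C / np.linalg.norm(C, axis=0, keepdims=True)

term_repr = C[0, :]   # tf-idf term representation = a row of C (Definition 4)
doc_repr = C[:, 0]    # tf-idf document representation = a column of C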
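Definition 5 can be tried out directly in a Colab notebook. The sketch below assumes the gensim library (version 4 or later) as one possible Word2Vec implementation; the two-sentence toy corpus is ours:

from gensim.models import Word2Vec

# Tiny tokenized toy corpus; a real training corpus would have millions of tokens.
corpus = [
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "president", "greets", "the", "press", "in", "chicago"],
]

# vector_size corresponds to the parameter k of Definition 5,
# window to the size of the context window.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=100)

vector = model.wv["president"]                    # dense k-dimensional term representation
print(model.wv.similarity("obama", "president"))  # cosine similarity of two term representations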
Definition 6 (Soft vector space model) The tf-idf document representation with the scalar product as the similarity score underestimates the similarity of documents that use similar but distinct terms. Replacing the scalar product

$$\text{score}(d, q) = d^T \cdot q = \sum_{i=1}^{T} d_i \cdot q_i,$$

where $d$ is a tf-idf document representation, $q$ is a tf-idf query representation, and $T$ is the number of terms, with the soft scalar product

$$\text{score}(d, q) = d^T \cdot S \cdot q = \sum_{i=1}^{T} \sum_{j=1}^{T} d_i \cdot S_{i,j} \cdot q_j,$$

where $S_{i,j}$ is the similarity of terms $i$ and $j$, solves this problem.

Algorithm 1 (Levenshtein distance - declarative approach) The distance between two strings $a$ and $b$ is given by $\text{lev}_{a,b}(|a|, |b|)$, where

$$\text{lev}_{a,b}(i, j) = \begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\
\min \begin{cases}
\text{lev}_{a,b}(i-1, j) + 1 \\
\text{lev}_{a,b}(i, j-1) + 1 \\
\text{lev}_{a,b}(i-1, j-1) + \mathbb{1}(a_i \neq b_j)
\end{cases} & \text{otherwise,}
\end{cases}$$

where $\mathbb{1}(a_i \neq b_j)$ is the indicator function equal to 1 when $a_i \neq b_j$ and 0 otherwise. $\text{lev}_{a,b}(i, j)$ is the distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$. (A Python transcription of Algorithm 1 is sketched at the end of this section.)

Exercise 18/2 Consider the Euclidean (β„“2) normalized tf-idf weights from Exercises 6/1 through 6/3.

a) What are the tf-idf representations of the terms car, auto, insurance, and best?

b) What is the similarity score (scalar product) between the tf-idf representations of the terms car and auto?

c) What is the similarity score (scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$?

d) What is the similarity score (soft scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$ if we use the vector dot product of tf-idf term representations as the term similarity $S$? Use Google Colaboratory to perform the computations.

e) What is the similarity score (soft scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$ if we use the inverse of the Levenshtein distance (see Algorithm 1) as the term similarity $S$? Use Google Colaboratory to perform the computations.

f) What is the similarity score (soft scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$ if we use the vector dot product of Word2Vec representations as the term similarity $S$? Use Google Colaboratory to train a Word2Vec model and to perform the computations.

Exercise 18/3 Consider the tf representations of the following two documents:

β€’ obama speaks to the media in illinois
β€’ the president greets the press in chicago

What is the similarity score of the two documents if we use a) the scalar product, b) the soft scalar product with the following term similarity $S$:

             obama  speaks   to   the  media   in  illinois  president  greets  press  chicago
obama         1.0    0.0    0.0   0.0   0.0   0.0    0.0        0.7      0.0    0.0     0.0
speaks        0.0    1.0    0.0   0.0   0.0   0.0    0.0        0.0      0.0    0.0     0.0
to            0.0    0.0    1.0   0.0   0.0   0.0    0.0        0.0      0.0    0.0     0.0
the           0.0    0.0    0.0   1.0   0.0   0.0    0.0        0.0      0.0    0.0     0.0
media         0.0    0.0    0.0   0.0   1.0   0.0    0.0        0.0      0.0    0.8     0.0
in            0.0    0.0    0.0   0.0   0.0   1.0    0.0        0.0      0.0    0.0     0.0
illinois      0.0    0.0    0.0   0.0   0.0   0.0    1.0        0.0      0.0    0.0     0.2
president     0.7    0.0    0.0   0.0   0.0   0.0    0.0        1.0      0.0    0.0     0.0
greets        0.0    0.0    0.0   0.0   0.0   0.0    0.0        0.0      1.0    0.0     0.0
press         0.0    0.0    0.0   0.0   0.8   0.0    0.0        0.0      0.0    1.0     0.0
chicago       0.0    0.0    0.0   0.0   0.0   0.0    0.2        0.0      0.0    0.0     1.0

Use Google Colaboratory to perform the computations (see the sketches below).
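Algorithm 1 can be transcribed almost literally into Python. The memoized recursion below is a sketch (the function names are ours) that can also serve as the basis of the term similarity in Exercise 18/2 e):

from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (Algorithm 1), memoized."""
    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        if min(i, j) == 0:
            return max(i, j)
        return min(
            lev(i - 1, j) + 1,                           # deletion
            lev(i, j - 1) + 1,                           # insertion
            lev(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution, 1(a_i != b_j)
        )
    return lev(len(a), len(b))

print(levenshtein("car", "auto"))  # prints 4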
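For Exercise 18/3, a minimal NumPy sketch of both scores might look as follows; the tf vectors and the term similarity matrix S are built from the table above, and the helper function is ours:

import numpy as np

vocab = ["obama", "speaks", "to", "the", "media", "in", "illinois",
         "president", "greets", "press", "chicago"]
index = {term: i for i, term in enumerate(vocab)}

def tf_vector(document: str) -> np.ndarray:
    """Term-frequency vector of a whitespace-tokenized document over vocab."""
    v = np.zeros(len(vocab))
    for token in document.split():
        v[index[token]] += 1
    return v

d1 = tf_vector("obama speaks to the media in illinois")
d2 = tf_vector("the president greets the press in chicago")

# Term similarity matrix S: identity plus the off-diagonal entries from the table.
S = np.eye(len(vocab))
for t1, t2, sim in [("obama", "president", 0.7),
                    ("media", "press", 0.8),
                    ("illinois", "chicago", 0.2)]:
    S[index[t1], index[t2]] = S[index[t2], index[t1]] = sim

print("a) scalar product:     ", d1 @ d2)
print("b) soft scalar product:", d1 @ S @ d2)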