Matrix decompositions & latent semantic indexing (Chapter 18)

Definition 1 (Latent semantic analysis (LSA)) Tf-idf term and document representations are high-dimensional and sparse. This poses computational problems and reduces recall: two terms that occur in similar but distinct sets of documents have dissimilar representations even though the terms themselves are similar. We can tackle this problem by mapping the term-document matrix $C$ to a low-rank representation $C_k$ of rank $k$ that minimizes the Frobenius norm of $C - C_k$:

$$\|C - C_k\|_F = \sqrt{\sum_t \sum_d (C - C_k)_{t,d}^2}.$$

The Eckart-Young theorem states that $C_k = U \Sigma_k V^T$, where $C = U \Sigma V^T$ is the singular value decomposition (SVD) of $C$ and $\Sigma_k$ is $\Sigma$ with only the $k$ largest diagonal entries (singular values) retained. The rows of $U \Sigma_k$ then give us dense $k$-dimensional tf-idf term representations, while the columns of $\Sigma_k V^T$ give us dense $k$-dimensional tf-idf document representations. The parameter $k$ is usually in the low hundreds. Thanks to the low dimensionality, LSA makes the representations of similar terms more similar, which increases recall and may also increase precision.

Exercise 18/1 Assume we have a term-document incidence matrix $C$ for three terms (rows) and two documents (columns) with the following singular value decomposition (SVD) $C = U \Sigma V^T$:

$$C = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} -0.816 & 0.000 \\ -0.408 & -0.707 \\ -0.408 & 0.707 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 1.732 & 0.000 \\ 0.000 & 1.000 \end{pmatrix}, \quad
V^T = \begin{pmatrix} -0.707 & -0.707 \\ 0.707 & -0.707 \end{pmatrix}.$$

Using a Google Colaboratory notebook, compute: a) a rank 1 representation $C_1$ of $C$, b) the 1-dimensional document representations, and c) the 1-dimensional term representations. (A NumPy sketch of this computation appears after Definition 5 below.)

Distributed Representations (Chapter 18)

Definition 2 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term $t$ in a document $d$ has weight

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t,$$

where $\text{tf}_{t,d}$ is the number of tokens of term $t$ in document $d$ (the term frequency of $t$ in $d$) and $\text{idf}_t = \log \frac{N}{\text{df}_t}$ is the inverse document frequency of $t$ in a collection of $N$ documents, $\text{df}_t$ being the number of documents that contain $t$.

Definition 3 (β„“2 (cosine) normalization) A vector $v$ is cosine-normalized by

$$v_j \leftarrow \frac{v_j}{\|v\|} = \frac{v_j}{\sqrt{\sum_{k=1}^{|v|} v_k^2}},$$

where $v_j$ is the element at the $j$-th position of $v$.

Definition 4 (tf-idf term representation) Just as a document $d$ can be represented as a vector of tf-idf weights of terms $t$, a term $t$ can be represented as a vector of its weights in documents $d$. If we represent our document collection as a term-document matrix $C_{t,d} = \text{tf-idf}_{t,d}$ (see Definition 2), then the tf-idf term representations correspond to the rows of $C$ and the tf-idf document representations correspond to the columns of $C$.

Definition 5 (Word2Vec) Word2Vec is a neural network language model. Given terms $t_1$ and $t_2$, Word2Vec predicts the probability $p(t_1 \mid t_2)$ of term $t_1$ appearing in a fixed-size context window surrounding term $t_2$. Word2Vec is trained on a text corpus to maximize the probabilities of terms that actually appear in each other's context. As a by-product, Word2Vec produces dense $k$-dimensional term representations. The parameter $k$ is usually in the low hundreds.
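For Exercise 18/1, the following minimal NumPy sketch (the variable names are ours) computes the rank-1 approximation $C_1$, the 1-dimensional document representations, and the 1-dimensional term representations:

import numpy as np

# Term-document incidence matrix from Exercise 18/1: three terms (rows), two documents (columns).
C = np.array([[1, 1],
              [0, 1],
              [1, 0]], dtype=float)

# Full SVD: C = U @ np.diag(S) @ Vt, with the singular values S in decreasing order.
U, S, Vt = np.linalg.svd(C, full_matrices=False)

k = 1  # rank of the low-rank approximation

C1 = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # a) rank-1 representation of C
doc_repr = np.diag(S[:k]) @ Vt[:k, :]        # b) 1-dimensional document representations (columns)
term_repr = U[:, :k] @ np.diag(S[:k])        # c) 1-dimensional term representations (rows)

print(C1, doc_repr, term_repr, sep="\n")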
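Definitions 2-4 can be made concrete with a small NumPy sketch. The term-frequency counts below are toy values of ours (they are not the weights from Exercises 6/1 through 6/3), and a base-10 logarithm is assumed for the idf:

import numpy as np

# Toy term frequencies: rows = terms, columns = documents (made-up counts).
tf = np.array([[27,  4,  0],
               [ 3,  0,  0],
               [ 0, 14,  5]], dtype=float)

N = tf.shape[1]                 # number of documents in the collection
df = (tf > 0).sum(axis=1)       # document frequency of each term
idf = np.log10(N / df)          # inverse document frequency (Definition 2)
C = tf * idf[:, np.newaxis]     # term-document matrix of tf-idf weights

# l2 (cosine) normalization of each document vector, i.e. each column (Definition 3).
C = C / np.linalg.norm(C, axis=0, keepdims=True)

term_repr = C[0, :]   # tf-idf term representation = a row of C (Definition 4)
doc_repr = C[:, 0]    # tf-idf document representation = a column of C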
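Definition 5 can be tried out directly in a Colab notebook. The sketch below assumes the gensim library (version 4 or later) as one possible Word2Vec implementation; the two-sentence toy corpus is ours:

from gensim.models import Word2Vec

# Tiny tokenized toy corpus; a real training corpus would have millions of tokens.
corpus = [
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "president", "greets", "the", "press", "in", "chicago"],
]

# vector_size corresponds to the parameter k of Definition 5,
# window to the size of the context window.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=100)

vector = model.wv["president"]                    # dense k-dimensional term representation
print(model.wv.similarity("obama", "president"))  # cosine similarity of two term representations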
Definition 6 (Soft vector space model) The tf-idf document representation with the scalar product as the similarity score underestimates the similarity of documents that use similar but distinct terms. Replacing the scalar product

$$\text{score}(d, q) = d^T \cdot q = \sum_{i=1}^{T} d_i \cdot q_i,$$

where $d$ is a tf-idf document representation, $q$ is a tf-idf query representation, and $T$ is the number of terms, with the soft scalar product

$$\text{score}(d, q) = d^T \cdot S \cdot q = \sum_{i=1}^{T} \sum_{j=1}^{T} d_i \cdot S_{i,j} \cdot q_j,$$

where $S_{i,j}$ is the similarity of terms $i$ and $j$, solves this problem.

Algorithm 1 (Levenshtein distance - declarative approach) The distance between two strings $a$ and $b$ is given by $\text{lev}_{a,b}(|a|, |b|)$, where

$$\text{lev}_{a,b}(i, j) = \begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\
\min \begin{cases}
\text{lev}_{a,b}(i-1, j) + 1 \\
\text{lev}_{a,b}(i, j-1) + 1 \\
\text{lev}_{a,b}(i-1, j-1) + \mathbb{1}(a_i \neq b_j)
\end{cases} & \text{otherwise,}
\end{cases}$$

where $\mathbb{1}(a_i \neq b_j)$ is the indicator function equal to 1 when $a_i \neq b_j$ and 0 otherwise. $\text{lev}_{a,b}(i, j)$ is the distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$. (A Python transcription of Algorithm 1 is sketched at the end of this section.)

Exercise 18/2 Consider the Euclidean (β„“2) normalized tf-idf weights from Exercises 6/1 through 6/3.

a) What are the tf-idf representations of the terms car, auto, insurance, and best?

b) What is the similarity score (scalar product) between the tf-idf representations of the terms car and auto?

c) What is the similarity score (scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$?

d) What is the similarity score (soft scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$ if we use the vector dot product of tf-idf term representations as the term similarity $S$? Use Google Colaboratory to perform the computations.

e) What is the similarity score (soft scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$ if we use the inverse of the Levenshtein distance (see Algorithm 1) as the term similarity $S$? Use Google Colaboratory to perform the computations.

f) What is the similarity score (soft scalar product) between the tf-idf representations of documents $doc_1$ and $doc_2$ if we use the vector dot product of Word2Vec representations as the term similarity $S$? Use Google Colaboratory to train a Word2Vec model and to perform the computations.

Exercise 18/3 Consider the tf representations of the following two documents:

β€’ obama speaks to the media in illinois
β€’ the president greets the press in chicago

What is the similarity score of the two documents if we use a) the scalar product, b) the soft scalar product with the following term similarity $S$:

             obama  speaks   to   the  media   in  illinois  president  greets  press  chicago
obama         1.0    0.0    0.0   0.0   0.0   0.0    0.0        0.7      0.0    0.0     0.0
speaks        0.0    1.0    0.0   0.0   0.0   0.0    0.0        0.0      0.0    0.0     0.0
to            0.0    0.0    1.0   0.0   0.0   0.0    0.0        0.0      0.0    0.0     0.0
the           0.0    0.0    0.0   1.0   0.0   0.0    0.0        0.0      0.0    0.0     0.0
media         0.0    0.0    0.0   0.0   1.0   0.0    0.0        0.0      0.0    0.8     0.0
in            0.0    0.0    0.0   0.0   0.0   1.0    0.0        0.0      0.0    0.0     0.0
illinois      0.0    0.0    0.0   0.0   0.0   0.0    1.0        0.0      0.0    0.0     0.2
president     0.7    0.0    0.0   0.0   0.0   0.0    0.0        1.0      0.0    0.0     0.0
greets        0.0    0.0    0.0   0.0   0.0   0.0    0.0        0.0      1.0    0.0     0.0
press         0.0    0.0    0.0   0.0   0.8   0.0    0.0        0.0      0.0    1.0     0.0
chicago       0.0    0.0    0.0   0.0   0.0   0.0    0.2        0.0      0.0    0.0     1.0

Use Google Colaboratory to perform the computations (see the sketches below).
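Algorithm 1 can be transcribed almost literally into Python. The memoized recursion below is a sketch (the function names are ours) that can also serve as the basis of the term similarity in Exercise 18/2 e):

from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (Algorithm 1), memoized."""
    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        if min(i, j) == 0:
            return max(i, j)
        return min(
            lev(i - 1, j) + 1,                           # deletion
            lev(i, j - 1) + 1,                           # insertion
            lev(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution, 1(a_i != b_j)
        )
    return lev(len(a), len(b))

print(levenshtein("car", "auto"))  # prints 4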
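For Exercise 18/3, a minimal NumPy sketch of both scores might look as follows; the tf vectors and the term similarity matrix S are built from the table above, and the helper function is ours:

import numpy as np

vocab = ["obama", "speaks", "to", "the", "media", "in", "illinois",
         "president", "greets", "press", "chicago"]
index = {term: i for i, term in enumerate(vocab)}

def tf_vector(document: str) -> np.ndarray:
    """Term-frequency vector of a whitespace-tokenized document over vocab."""
    v = np.zeros(len(vocab))
    for token in document.split():
        v[index[token]] += 1
    return v

d1 = tf_vector("obama speaks to the media in illinois")
d2 = tf_vector("the president greets the press in chicago")

# Term similarity matrix S: identity plus the off-diagonal entries from the table.
S = np.eye(len(vocab))
for t1, t2, sim in [("obama", "president", 0.7),
                    ("media", "press", 0.8),
                    ("illinois", "chicago", 0.2)]:
    S[index[t1], index[t2]] = S[index[t2], index[t1]] = sim

print("a) scalar product:     ", d1 @ d2)
print("b) soft scalar product:", d1 @ S @ d2)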