Seminar 4
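Definition 1 (Term weight) The weight of a term $t$ in a document $d$ is computed as
$$w_{t,d} = \begin{cases} 1 + \log\left(\mathrm{tf}_{t,d}\right) & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathrm{tf}_{t,d}$ is the number of occurrences of the term $t$ in the document $d$.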
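A minimal Python sketch of the term weight from Definition 1 may help when checking hand calculations. The function name `term_weight` is hypothetical, and the base-10 logarithm is an assumption, since the definition does not fix the base.

```python
import math


def term_weight(tf: int) -> float:
    """Log-scaled term weight: 1 + log(tf) if tf > 0, otherwise 0."""
    # Assumption: base-10 logarithm; the definition leaves the base open.
    if tf > 0:
        return 1.0 + math.log10(tf)
    return 0.0


print(term_weight(0))    # term absent from the document
print(term_weight(1))    # a single occurrence
print(term_weight(100))  # many occurrences grow only logarithmically
```

The logarithm dampens the effect of repeated occurrences: a hundredfold increase in the raw count raises the weight by only 2 under the base-10 assumption.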
Definition 2 (Inverse document frequency) The inverse document frequency of a term $t$ is
defined as
$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right)$$
where $N$ is the number of all documents and $\mathrm{df}_t$ (document frequency) is the number of
documents that contain $t$.
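The inverse document frequency from Definition 2 can be sketched in the same way. The function name `idf` and the sample collection sizes are hypothetical, and the base-10 logarithm is again an assumption.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: log(N / df)."""
    # Assumption: base-10 logarithm; df must be positive, i.e. the term
    # occurs in at least one document of the collection.
    if df <= 0:
        raise ValueError("df must be positive")
    return math.log10(n_docs / df)


# Hypothetical collection of 1,000,000 documents.
rare = idf(1_000_000, 1_000)       # rare term: large idf (about 3)
common = idf(1_000_000, 500_000)   # common term: small idf (about 0.3)
print(rare, common)
```

A term that occurs in every document gets an idf of 0, so it contributes nothing to a score weighted by idf.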
Definition 3 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term $t$ in a
document $d$ has weight
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$$
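As a small illustration of Definition 3, the sketch below multiplies raw term frequencies by idf values, exactly as the definition is written. The function name `tf_idf`, the terms, and all numbers are made up for the example and are not taken from the exercises.

```python
def tf_idf(tf: int, idf_value: float) -> float:
    """tf-idf weight as in Definition 3: raw term frequency times idf."""
    return tf * idf_value


# Hypothetical term counts for one document and hypothetical idf values.
counts = {"apple": 10, "pear": 0}
idfs = {"apple": 1.5, "pear": 2.0}

weights = {term: tf_idf(counts[term], idfs[term]) for term in counts}
print(weights)  # {'apple': 15.0, 'pear': 0.0}
```

A term that does not occur in the document gets weight 0 regardless of its idf.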
Definition 4 (Cosine (Euclidean) normalization) A vector $v$ is cosine-normalized by
$$\hat{v}_j = \frac{v_j}{\|v\|} = \frac{v_j}{\sqrt{\sum_{k=1}^{|v|} v_k^2}}$$
where $v_j$ is the component at the $j$-th position of $v$ and $\hat{v}_j$ is its normalized value.
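Cosine normalization as in Definition 4 can be sketched as follows. The function name `cosine_normalize` is hypothetical, and returning a zero vector unchanged is an assumption made here to avoid division by zero; the definition does not cover that case.

```python
import math


def cosine_normalize(v: list[float]) -> list[float]:
    """Divide every component by the Euclidean length of the vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # Assumption: a zero vector is returned as is.
        return list(v)
    return [x / norm for x in v]


unit = cosine_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

After normalization the vector has Euclidean length 1, so long and short documents become comparable.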
Exercise 1
Consider the frequency table of the words of three documents 𝑑𝑜𝑐1, 𝑑𝑜𝑐2, 𝑑𝑜𝑐3 below.
Calculate the tf-idf weights of the terms car, auto, insurance, and best for each document. The idf
values of the terms are given in the table.
            𝑑𝑜𝑐1   𝑑𝑜𝑐2   𝑑𝑜𝑐3   idf
car           27      4     24   1.65
auto           3     33      0   2.08
insurance      0     33     29   1.62
best          14      0     17   1.5
Table 1: Exercise.
Exercise 2
Compute the document representations as Euclidean-normalized weight vectors for each document
from the previous exercise. Each vector has four components, one for each term.
Exercise 3
Based on the weights from the last exercise, compute the relevance scores of the three
documents for the query car insurance. Use each of the following two weighting schemes for the query:
a) Term weight is 1 if the query contains the word and 0 otherwise.
b) Euclidean normalized tf-idf.
Please note that a document and a representation of that document are different things.
The document is always fixed, but its representation may vary under different settings and
conditions. In this exercise we fix the document representations from the previous exercise and
compute the relevance scores of the documents under two different representations of the
query. It may help to view the query as another document, since it is also a sequence of
words.
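To illustrate how a query can be treated as another document, here is a minimal sketch that scores a document against a binary query vector with a dot product. The vocabulary, the document vector, and the names `score` and `binary_query` are hypothetical and deliberately unrelated to the exercise data.

```python
def score(query_vec: list[float], doc_vec: list[float]) -> float:
    """Relevance score as the dot product of query and document vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


# Hypothetical vocabulary and an already normalized document vector.
vocab = ["red", "green", "blue"]
doc_vec = [0.6, 0.0, 0.8]

# Binary query representation: 1 if the query contains the term, else 0.
query_terms = {"red", "blue"}
binary_query = [1.0 if term in query_terms else 0.0 for term in vocab]

result = score(binary_query, doc_vec)
print(result)  # approximately 1.4
```

Swapping `binary_query` for a normalized tf-idf vector of the query changes only the query representation; the document vector and the scoring function stay the same.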
Exercise 4
Calculate the vector-space similarity between the query digital cameras and a document
containing digital cameras and video cameras by filling in the blank columns in the table
below. Assume 𝑁 = 10 000 000, logarithmic term weighting (columns w) for both the query and
the document, idf weighting for the query only, and cosine normalization for the document only.
The word "and" is a stop word.
                     Query                Document      relevance
          df         tf   w   idf   𝑞     tf   w   𝑑    𝑞 · 𝑑
digital    10 000
video     100 000
cameras    50 000
Table 2: Exercise.
Exercise 5
Show that for the query 𝑞1 = affection the documents in the table below are ranked by
relevance in the opposite order from the ranking for the query 𝑞2 = jealous gossip. The query
uses normalized tf weights.
            SaS     PaP     WH
affection   0.996   0.993   0.847
jealous     0.087   0.120   0.466
gossip      0.017   0       0.254
Table 3: Exercise.