Seminar 4

Definition 1 (Term weight) The weight of a term $t$ in a document $d$ is computed as
$$w_{t,d} = \begin{cases} 1 + \log\left(\mathrm{tf}_{t,d}\right) & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathrm{tf}_{t,d}$ is the number of occurrences of the term $t$ in the document $d$. (All logarithms in this seminar are base 10.)

Definition 2 (Inverse document frequency) The inverse document frequency of a term $t$ is defined as
$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right)$$
where $N$ is the number of all documents and $\mathrm{df}_t$ (document frequency) is the number of documents that contain $t$.

Definition 3 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term $t$ in a document $d$ has weight
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$$

Definition 4 (Cosine (Euclidean) normalization) A vector $v$ is cosine-normalized by
$$v_j = \frac{v_j}{\lVert v \rVert} = \frac{v_j}{\sqrt{\sum_{k=1}^{|v|} v_k^2}}$$
where $v_j$ is the component at the $j$-th position of $v$.

Exercise 1 Consider the frequency table of the words of the three documents doc1, doc2, doc3 below. Calculate the tf-idf weight of the terms car, auto, insurance, best for each document. The idf values of the terms are given in the table.

             doc1   doc2   doc3    idf
car            27      4     24   1.65
auto            3     33      0   2.08
insurance       0     33     29   1.62
best           14      0     17    1.5

Table 1: Exercise.

Computing the tf-idf weights by Definition 3 individually for each term, we get the following table:

tf-idf       doc1   doc2   doc3
car         44.55    6.6   39.6
auto         6.24  68.64      0
insurance       0  53.46  46.98
best           21      0   25.5

Table 2: Solution.

Exercise 2 Compute the document representations as normalized Euclidean weight vectors for each document from the previous exercise. Each vector has four components, one for each term.

Normalized Euclidean weight vectors are computed by Definition 4.
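Table 2 can be reproduced with a few lines of Python; the dictionary layout is mine, not part of the exercise:

```python
# A minimal sketch of Exercise 1: tf-idf_{t,d} = tf_{t,d} * idf_t (Definition 3).
# Term frequencies per document (doc1, doc2, doc3) and idf values from Table 1.
tf = {"car": [27, 4, 24], "auto": [3, 33, 0],
      "insurance": [0, 33, 29], "best": [14, 0, 17]}
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}

# Raw term frequencies are multiplied by idf; the results match Table 2.
tfidf = {t: [round(f * idf[t], 2) for f in freqs] for t, freqs in tf.items()}
print(tfidf["car"])   # -> [44.55, 6.6, 39.6]
```

Note that the raw term frequency is used here, not the logarithmic weight of Definition 1; that matches the numbers in Table 2.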
The denominators $m_{doc_n}$ for the individual documents are
$$m_{doc_1} = \sqrt{44.55^2 + 6.24^2 + 0^2 + 21^2} = 49.6451$$
$$m_{doc_2} = \sqrt{6.6^2 + 68.64^2 + 53.46^2 + 0^2} = 87.2524$$
$$m_{doc_3} = \sqrt{39.6^2 + 0^2 + 46.98^2 + 25.5^2} = 66.5247$$
and the document representations are
$$d_1 = \left(\frac{44.55}{49.6451}; \frac{6.24}{49.6451}; \frac{0}{49.6451}; \frac{21}{49.6451}\right) = (0.8974; 0.1257; 0; 0.423)$$
$$d_2 = \left(\frac{6.6}{87.2524}; \frac{68.64}{87.2524}; \frac{53.46}{87.2524}; \frac{0}{87.2524}\right) = (0.0756; 0.7867; 0.6127; 0)$$
$$d_3 = \left(\frac{39.6}{66.5247}; \frac{0}{66.5247}; \frac{46.98}{66.5247}; \frac{25.5}{66.5247}\right) = (0.5953; 0; 0.7062; 0.3833)$$

Exercise 3 Based on the weights from the last exercise, compute the relevance scores of the three documents for the query car insurance. Use each of the two weighting schemes:

a) The term weight is 1 if the query contains the word and 0 otherwise.
b) Euclidean-normalized tf-idf.

Please note that a document and a representation of that document are different things. The document is always fixed, but its representations may vary under different settings and conditions. In this exercise we fix the document representations from the previous exercises and compute relevance scores for the query and the documents under two different representations of the query. It may be helpful to view the query as just another document, since it is also a sequence of words.
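The document representations derived above for Exercise 2 can be reproduced with a short script (a sketch; the names are mine):

```python
import math

# Cosine normalization (Definition 4) of the tf-idf vectors from Table 2.
# Component order: car, auto, insurance, best.
docs = {"doc1": [44.55, 6.24, 0.0, 21.0],
        "doc2": [6.6, 68.64, 53.46, 0.0],
        "doc3": [39.6, 0.0, 46.98, 25.5]}

def normalize(v):
    """Divide every component of v by its Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

reps = {name: [round(x, 4) for x in normalize(v)] for name, v in docs.items()}
print(reps["doc1"])   # -> [0.8974, 0.1257, 0.0, 0.423]
```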
We compute the relevance scores for a) as the scalar products of the query representation $q = (1, 0, 1, 0)$ with the document representations $d_n$ from the last exercise:
$$q \cdot d_1 = 1 \cdot 0.8974 + 0 \cdot 0.1257 + 1 \cdot 0 + 0 \cdot 0.423 = 0.8974$$
$$q \cdot d_2 = 1 \cdot 0.0756 + 0 \cdot 0.7867 + 1 \cdot 0.6127 + 0 \cdot 0 = 0.6883$$
$$q \cdot d_3 = 1 \cdot 0.5953 + 0 \cdot 0 + 1 \cdot 0.7062 + 0 \cdot 0.3833 = 1.3015$$

For b) we first need the normalized tf-idf vector $q$, which is obtained by dividing each component of the query's tf-idf vector by its Euclidean length $\sqrt{1.65^2 + 0^2 + 1.62^2 + 0^2} = 2.3123$.

             tf    idf   tf-idf       q
car           1   1.65     1.65  0.7136
auto          0   2.08        0       0
insurance     1   1.62     1.62  0.7006
best          0    1.5        0       0

Table 3: Process of finding the Euclidean-normalized tf-idf.

Now we multiply $q$ with the document vectors and obtain the relevance scores:
$$q \cdot d_1 = 0.7136 \cdot 0.8974 + 0 \cdot 0.1257 + 0.7006 \cdot 0 + 0 \cdot 0.423 = 0.6404$$
$$q \cdot d_2 = 0.7136 \cdot 0.0756 + 0 \cdot 0.7867 + 0.7006 \cdot 0.6127 + 0 \cdot 0 = 0.4832$$
$$q \cdot d_3 = 0.7136 \cdot 0.5953 + 0 \cdot 0 + 0.7006 \cdot 0.7062 + 0 \cdot 0.3833 = 0.9196$$

Exercise 4 Calculate the vector-space similarity between the query digital cameras and a document containing digital cameras and video cameras by filling in the blank columns in the table below. Assume $N = 10\,000\,000$ documents, logarithmic term weighting (columns $w$) for both the query and the document, idf weighting only for the query, and cosine normalization only for the document. The word and is a stop word.

                          Query                    Document    relevance
              df     tf    w    idf    q      tf    w    d     q · d
digital    10 000
video     100 000
cameras    50 000

Table 4: Exercise.

The tf values are filled in according to the occurrences of the terms in the query and in the document:
$$\mathrm{tf}_q(\text{digital cameras}) = (1, 0, 1)$$
$$\mathrm{tf}_d(\text{digital cameras and video cameras}) = (1, 1, 2)$$

Logarithmic weighting uses Definition 1.
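Definition 1 can be written as a small helper and applied to these tf vectors; base-10 logarithms are assumed, which matches the arithmetic in the worked solution:

```python
import math

def log_weight(tf):
    # Definition 1: 1 + log10(tf) for tf > 0, otherwise 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

tf_q = (1, 0, 1)   # digital, video, cameras in the query
tf_d = (1, 1, 2)   # digital, video, cameras in the document
print([round(log_weight(t), 3) for t in tf_q])   # -> [1.0, 0.0, 1.0]
print([round(log_weight(t), 3) for t in tf_d])   # -> [1.0, 1.0, 1.301]
```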
For the query the values are
$$w_{\text{digital}} = 1 + \log(1) = 1 + 0 = 1$$
$$w_{\text{video}} = 0$$
$$w_{\text{cameras}} = 1 + \log(1) = 1 + 0 = 1$$
and for the document
$$w_{\text{digital}} = 1 + \log(1) = 1 + 0 = 1$$
$$w_{\text{video}} = 1 + \log(1) = 1 + 0 = 1$$
$$w_{\text{cameras}} = 1 + \log(2) = 1 + 0.301 = 1.301$$

Next we compute the idf weights for the query by Definition 2:
$$\mathrm{idf}_{\text{digital}} = \log\left(\frac{10^7}{10^4}\right) = \log\left(10^3\right) = 3$$
$$\mathrm{idf}_{\text{video}} = \log\left(\frac{10^7}{10^5}\right) = \log\left(10^2\right) = 2$$
$$\mathrm{idf}_{\text{cameras}} = \log\left(\frac{10^7}{5 \times 10^4}\right) = \log(200) = 2.301$$
and $q = w \cdot \mathrm{idf}$, componentwise. Cosine normalization for the document is computed by Definition 4 using $w$, as in the previous exercises:
$$d_{\text{digital}} = \frac{1}{\sqrt{1^2 + 1^2 + 1.301^2}} = 0.5204$$
$$d_{\text{video}} = \frac{1}{\sqrt{1^2 + 1^2 + 1.301^2}} = 0.5204$$
$$d_{\text{cameras}} = \frac{1.301}{\sqrt{1^2 + 1^2 + 1.301^2}} = 0.677$$

The relevance score is the scalar product of $q$ and $d$. The final table is

                          Query                      Document          relevance
              df     tf    w    idf      q       tf    w       d        q · d
digital    10 000     1    1    3        3        1    1       0.5204   1.5612
video     100 000     0    0    2        0        1    1       0.5204   0
cameras    50 000     1    1    2.301    2.301    2    1.301   0.677    1.5578

Table 5: Solution.

and the similarity score is
$$\mathrm{score}(d, q) = \sum_{i=1}^{3} d_i \cdot q_i = 3.119$$

Exercise 5 Show that for the query $q_1$ = affection the documents in the table below are sorted by relevance in the opposite order than for the query $q_2$ = jealous gossip. The queries are tf-weighted and normalized.

             SaS     PaP     WH
affection  0.996   0.993  0.847
jealous    0.087   0.120  0.466
gossip     0.017       0  0.254

Table 6: Exercise.

We add the queries to the original table:

             SaS     PaP     WH   q1   q2
affection  0.996   0.993  0.847    1    0
jealous    0.087   0.120  0.466    0    1
gossip     0.017       0  0.254    0    1

Table 7: Exercise with queries.
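The whole Exercise 4 computation above can be cross-checked with a short script (variable names are mine; base-10 logarithms and $N = 10^7$ as stated in the exercise):

```python
import math

N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
tf_q = {"digital": 1, "video": 0, "cameras": 1}   # query: digital cameras
tf_d = {"digital": 1, "video": 1, "cameras": 2}   # "and" is a stop word

def log_weight(tf):
    """Logarithmic term weight of Definition 1 (base-10 log)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query side: logarithmic tf times idf, no normalization.
q = {t: log_weight(tf_q[t]) * math.log10(N / df[t]) for t in df}
# Document side: logarithmic tf with cosine normalization (Definition 4).
w = {t: log_weight(tf_d[t]) for t in df}
length = math.sqrt(sum(x * x for x in w.values()))
d = {t: x / length for t, x in w.items()}

score = sum(q[t] * d[t] for t in df)
print(round(score, 3))   # -> 3.119
```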
Now we normalize the vectors $q_i$ by Definition 4 and get

             SaS     PaP     WH   q1   q2   q1n     q2n
affection  0.996   0.993  0.847    1    0     1       0
jealous    0.087   0.120  0.466    0    1     0  0.7071
gossip     0.017       0  0.254    0    1     0  0.7071

Table 8: Exercise with queries after normalization.

In the last step we compute the similarity scores between the queries and the documents by $\mathrm{score}(d, q) = \sum_{i=1}^{|d|} d_i \cdot q_i$:
$$\mathrm{score}(SaS, q_1) = 0.996 \cdot 1 + 0.087 \cdot 0 + 0.017 \cdot 0 = 0.996$$
$$\mathrm{score}(PaP, q_1) = 0.993 \cdot 1 + 0.120 \cdot 0 + 0 \cdot 0 = 0.993$$
$$\mathrm{score}(WH, q_1) = 0.847 \cdot 1 + 0.466 \cdot 0 + 0.254 \cdot 0 = 0.847$$
$$\mathrm{score}(SaS, q_2) = 0.996 \cdot 0 + 0.087 \cdot 0.7071 + 0.017 \cdot 0.7071 = 0.0735$$
$$\mathrm{score}(PaP, q_2) = 0.993 \cdot 0 + 0.120 \cdot 0.7071 + 0 \cdot 0.7071 = 0.0849$$
$$\mathrm{score}(WH, q_2) = 0.847 \cdot 0 + 0.466 \cdot 0.7071 + 0.254 \cdot 0.7071 = 0.5091$$

The ordering for $q_1$ is SaS > PaP > WH and for $q_2$ it is WH > PaP > SaS, so the two orderings are indeed opposite.
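The comparison can also be checked programmatically; the structure below is mine, but the numbers come from Table 6:

```python
import math

# Document weights from Table 6, component order: affection, jealous, gossip.
docs = {"SaS": [0.996, 0.087, 0.017],
        "PaP": [0.993, 0.120, 0.0],
        "WH":  [0.847, 0.466, 0.254]}

def normalize(v):
    """Cosine normalization (Definition 4)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def score(q, d):
    """Scalar product of a query and a document representation."""
    return sum(qi * di for qi, di in zip(q, d))

q1 = normalize([1, 0, 0])   # affection
q2 = normalize([0, 1, 1])   # jealous gossip

order_q1 = sorted(docs, key=lambda name: score(q1, docs[name]), reverse=True)
order_q2 = sorted(docs, key=lambda name: score(q2, docs[name]), reverse=True)
print(order_q1)   # -> ['SaS', 'PaP', 'WH']
print(order_q2)   # -> ['WH', 'PaP', 'SaS']
```

The two rankings come out reversed, as the exercise asks us to show.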