Seminar 4

Definition 1 (Term weight) The weight of a term $t$ in a document $d$ is computed as
$$w_{t,d} = \begin{cases} 1 + \log\left(\mathrm{tf}_{t,d}\right) & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathrm{tf}_{t,d}$ is the number of occurrences of the term $t$ in the document $d$. (All logarithms in this seminar are base 10.)

Definition 2 (Inverse document frequency) The inverse document frequency of a term $t$ is defined as
$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right)$$
where $N$ is the number of all documents and $\mathrm{df}_t$ (document frequency) is the number of documents that contain $t$.

Definition 3 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term $t$ in a document $d$ has weight
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$$

Definition 4 (Cosine (Euclidean) normalization) A vector $v$ is cosine-normalized by
$$v_j = \frac{v_j}{\lVert v \rVert} = \frac{v_j}{\sqrt{\sum_{k=1}^{|v|} v_k^2}}$$
where $v_j$ is the component at the $j$-th position of $v$.

Exercise 1 Consider the frequency table of the words of the three documents doc1, doc2, doc3 below. Calculate the tf-idf weight of the terms car, auto, insurance, best for each document. The idf values of the terms are given in the table.

             doc1   doc2   doc3    idf
car            27      4     24   1.65
auto            3     33      0   2.08
insurance       0     33     29   1.62
best           14      0     17    1.5

Table 1: Exercise.

Computing the tf-idf weights by Definition 3 individually for each term, we get the following table:

tf-idf       doc1   doc2   doc3
car         44.55    6.6   39.6
auto         6.24  68.64      0
insurance       0  53.46  46.98
best           21      0   25.5

Table 2: Solution.

Exercise 2 Compute the document representations as normalized Euclidean weight vectors for each document from the previous exercise. Each vector has four components, one for each term.

Normalized Euclidean weight vectors are computed by Definition 4.
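Table 2 can be reproduced with a few lines of Python; the dictionary layout is mine, not part of the exercise:

```python
# A minimal sketch of Exercise 1: tf-idf_{t,d} = tf_{t,d} * idf_t (Definition 3).
# Term frequencies per document (doc1, doc2, doc3) and idf values from Table 1.
tf = {"car": [27, 4, 24], "auto": [3, 33, 0],
      "insurance": [0, 33, 29], "best": [14, 0, 17]}
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}

# Raw term frequencies are multiplied by idf; the results match Table 2.
tfidf = {t: [round(f * idf[t], 2) for f in freqs] for t, freqs in tf.items()}
print(tfidf["car"])   # -> [44.55, 6.6, 39.6]
```

Note that the raw term frequency is used here, not the logarithmic weight of Definition 1; that matches the numbers in Table 2.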
The denominators $m_{doc_n}$ for the individual documents are
$$m_{doc_1} = \sqrt{44.55^2 + 6.24^2 + 0^2 + 21^2} = 49.6451$$
$$m_{doc_2} = \sqrt{6.6^2 + 68.64^2 + 53.46^2 + 0^2} = 87.2524$$
$$m_{doc_3} = \sqrt{39.6^2 + 0^2 + 46.98^2 + 25.5^2} = 66.5247$$
and the document representations are
$$d_1 = \left(\frac{44.55}{49.6451}; \frac{6.24}{49.6451}; \frac{0}{49.6451}; \frac{21}{49.6451}\right) = (0.8974; 0.1257; 0; 0.423)$$
$$d_2 = \left(\frac{6.6}{87.2524}; \frac{68.64}{87.2524}; \frac{53.46}{87.2524}; \frac{0}{87.2524}\right) = (0.0756; 0.7867; 0.6127; 0)$$
$$d_3 = \left(\frac{39.6}{66.5247}; \frac{0}{66.5247}; \frac{46.98}{66.5247}; \frac{25.5}{66.5247}\right) = (0.5953; 0; 0.7062; 0.3833)$$

Exercise 3 Based on the weights from the last exercise, compute the relevance scores of the three documents for the query car insurance. Use each of the two weighting schemes:

a) The term weight is 1 if the query contains the word and 0 otherwise.
b) Euclidean-normalized tf-idf.

Please note that a document and a representation of that document are different things. The document is always fixed, but its representations may vary under different settings and conditions. In this exercise we fix the document representations from the previous exercises and compute relevance scores for the query and the documents under two different representations of the query. It may be helpful to view the query as just another document, since it is also a sequence of words.
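The document representations derived above for Exercise 2 can be reproduced with a short script (a sketch; the names are mine):

```python
import math

# Cosine normalization (Definition 4) of the tf-idf vectors from Table 2.
# Component order: car, auto, insurance, best.
docs = {"doc1": [44.55, 6.24, 0.0, 21.0],
        "doc2": [6.6, 68.64, 53.46, 0.0],
        "doc3": [39.6, 0.0, 46.98, 25.5]}

def normalize(v):
    """Divide every component of v by its Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

reps = {name: [round(x, 4) for x in normalize(v)] for name, v in docs.items()}
print(reps["doc1"])   # -> [0.8974, 0.1257, 0.0, 0.423]
```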
We compute the relevance scores for a) as the scalar products of the query representation $q = (1, 0, 1, 0)$ with the document representations $d_n$ from the last exercise:
$$q \cdot d_1 = 1 \cdot 0.8974 + 0 \cdot 0.1257 + 1 \cdot 0 + 0 \cdot 0.423 = 0.8974$$
$$q \cdot d_2 = 1 \cdot 0.0756 + 0 \cdot 0.7867 + 1 \cdot 0.6127 + 0 \cdot 0 = 0.6883$$
$$q \cdot d_3 = 1 \cdot 0.5953 + 0 \cdot 0 + 1 \cdot 0.7062 + 0 \cdot 0.3833 = 1.3015$$

For b) we first need the normalized tf-idf vector $q$, which is obtained by dividing each component of the query's tf-idf vector by its Euclidean length $\sqrt{1.65^2 + 0^2 + 1.62^2 + 0^2} = 2.3123$.

             tf    idf   tf-idf       q
car           1   1.65     1.65  0.7136
auto          0   2.08        0       0
insurance     1   1.62     1.62  0.7006
best          0    1.5        0       0

Table 3: Process of finding the Euclidean-normalized tf-idf.

Now we multiply $q$ with the document vectors and obtain the relevance scores:
$$q \cdot d_1 = 0.7136 \cdot 0.8974 + 0 \cdot 0.1257 + 0.7006 \cdot 0 + 0 \cdot 0.423 = 0.6404$$
$$q \cdot d_2 = 0.7136 \cdot 0.0756 + 0 \cdot 0.7867 + 0.7006 \cdot 0.6127 + 0 \cdot 0 = 0.4832$$
$$q \cdot d_3 = 0.7136 \cdot 0.5953 + 0 \cdot 0 + 0.7006 \cdot 0.7062 + 0 \cdot 0.3833 = 0.9196$$

Exercise 4 Calculate the vector-space similarity between the query digital cameras and a document containing digital cameras and video cameras by filling in the blank columns in the table below. Assume $N = 10\,000\,000$ documents, logarithmic term weighting (columns $w$) for both the query and the document, idf weighting only for the query, and cosine normalization only for the document. The word and is a stop word.

                          Query                    Document    relevance
              df     tf    w    idf    q      tf    w    d     q · d
digital    10 000
video     100 000
cameras    50 000

Table 4: Exercise.

The tf values are filled in according to the occurrences of the terms in the query and in the document:
$$\mathrm{tf}_q(\text{digital cameras}) = (1, 0, 1)$$
$$\mathrm{tf}_d(\text{digital cameras and video cameras}) = (1, 1, 2)$$

Logarithmic weighting uses Definition 1.
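Definition 1 can be written as a small helper and applied to these tf vectors; base-10 logarithms are assumed, which matches the arithmetic in the worked solution:

```python
import math

def log_weight(tf):
    # Definition 1: 1 + log10(tf) for tf > 0, otherwise 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

tf_q = (1, 0, 1)   # digital, video, cameras in the query
tf_d = (1, 1, 2)   # digital, video, cameras in the document
print([round(log_weight(t), 3) for t in tf_q])   # -> [1.0, 0.0, 1.0]
print([round(log_weight(t), 3) for t in tf_d])   # -> [1.0, 1.0, 1.301]
```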
For the query the values are
$$w_{\text{digital}} = 1 + \log(1) = 1 + 0 = 1$$
$$w_{\text{video}} = 0$$
$$w_{\text{cameras}} = 1 + \log(1) = 1 + 0 = 1$$
and for the document
$$w_{\text{digital}} = 1 + \log(1) = 1 + 0 = 1$$
$$w_{\text{video}} = 1 + \log(1) = 1 + 0 = 1$$
$$w_{\text{cameras}} = 1 + \log(2) = 1 + 0.301 = 1.301$$

Next we compute the idf weights for the query by Definition 2:
$$\mathrm{idf}_{\text{digital}} = \log\left(\frac{10^7}{10^4}\right) = \log\left(10^3\right) = 3$$
$$\mathrm{idf}_{\text{video}} = \log\left(\frac{10^7}{10^5}\right) = \log\left(10^2\right) = 2$$
$$\mathrm{idf}_{\text{cameras}} = \log\left(\frac{10^7}{5 \times 10^4}\right) = \log(200) = 2.301$$
and $q = w \cdot \mathrm{idf}$, componentwise. Cosine normalization for the document is computed by Definition 4 using $w$, as in the previous exercises:
$$d_{\text{digital}} = \frac{1}{\sqrt{1^2 + 1^2 + 1.301^2}} = 0.5204$$
$$d_{\text{video}} = \frac{1}{\sqrt{1^2 + 1^2 + 1.301^2}} = 0.5204$$
$$d_{\text{cameras}} = \frac{1.301}{\sqrt{1^2 + 1^2 + 1.301^2}} = 0.677$$

The relevance score is the scalar product of $q$ and $d$. The final table is

                          Query                      Document          relevance
              df     tf    w    idf      q       tf    w       d        q · d
digital    10 000     1    1    3        3        1    1       0.5204   1.5612
video     100 000     0    0    2        0        1    1       0.5204   0
cameras    50 000     1    1    2.301    2.301    2    1.301   0.677    1.5578

Table 5: Solution.

and the similarity score is
$$\mathrm{score}(d, q) = \sum_{i=1}^{3} d_i \cdot q_i = 3.119$$

Exercise 5 Show that for the query $q_1$ = affection the documents in the table below are sorted by relevance in the opposite order than for the query $q_2$ = jealous gossip. The queries are tf-weighted and normalized.

             SaS     PaP     WH
affection  0.996   0.993  0.847
jealous    0.087   0.120  0.466
gossip     0.017       0  0.254

Table 6: Exercise.

We add the queries to the original table:

             SaS     PaP     WH   q1   q2
affection  0.996   0.993  0.847    1    0
jealous    0.087   0.120  0.466    0    1
gossip     0.017       0  0.254    0    1

Table 7: Exercise with queries.
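The whole Exercise 4 computation above can be cross-checked with a short script (variable names are mine; base-10 logarithms and $N = 10^7$ as stated in the exercise):

```python
import math

N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
tf_q = {"digital": 1, "video": 0, "cameras": 1}   # query: digital cameras
tf_d = {"digital": 1, "video": 1, "cameras": 2}   # "and" is a stop word

def log_weight(tf):
    """Logarithmic term weight of Definition 1 (base-10 log)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query side: logarithmic tf times idf, no normalization.
q = {t: log_weight(tf_q[t]) * math.log10(N / df[t]) for t in df}
# Document side: logarithmic tf with cosine normalization (Definition 4).
w = {t: log_weight(tf_d[t]) for t in df}
length = math.sqrt(sum(x * x for x in w.values()))
d = {t: x / length for t, x in w.items()}

score = sum(q[t] * d[t] for t in df)
print(round(score, 3))   # -> 3.119
```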
Now we normalize the vectors $q_i$ by Definition 4 and get

             SaS     PaP     WH   q1   q2   q1n     q2n
affection  0.996   0.993  0.847    1    0     1       0
jealous    0.087   0.120  0.466    0    1     0  0.7071
gossip     0.017       0  0.254    0    1     0  0.7071

Table 8: Exercise with queries after normalization.

In the last step we compute the similarity scores between the queries and the documents by $\mathrm{score}(d, q) = \sum_{i=1}^{|d|} d_i \cdot q_i$:
$$\mathrm{score}(SaS, q_1) = 0.996 \cdot 1 + 0.087 \cdot 0 + 0.017 \cdot 0 = 0.996$$
$$\mathrm{score}(PaP, q_1) = 0.993 \cdot 1 + 0.120 \cdot 0 + 0 \cdot 0 = 0.993$$
$$\mathrm{score}(WH, q_1) = 0.847 \cdot 1 + 0.466 \cdot 0 + 0.254 \cdot 0 = 0.847$$
$$\mathrm{score}(SaS, q_2) = 0.996 \cdot 0 + 0.087 \cdot 0.7071 + 0.017 \cdot 0.7071 = 0.0735$$
$$\mathrm{score}(PaP, q_2) = 0.993 \cdot 0 + 0.120 \cdot 0.7071 + 0 \cdot 0.7071 = 0.0849$$
$$\mathrm{score}(WH, q_2) = 0.847 \cdot 0 + 0.466 \cdot 0.7071 + 0.254 \cdot 0.7071 = 0.5091$$

The ordering for $q_1$ is SaS > PaP > WH and for $q_2$ it is WH > PaP > SaS, so the two orderings are indeed opposite.
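The comparison can also be checked programmatically; the structure below is mine, but the numbers come from Table 6:

```python
import math

# Document weights from Table 6, component order: affection, jealous, gossip.
docs = {"SaS": [0.996, 0.087, 0.017],
        "PaP": [0.993, 0.120, 0.0],
        "WH":  [0.847, 0.466, 0.254]}

def normalize(v):
    """Cosine normalization (Definition 4)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def score(q, d):
    """Scalar product of a query and a document representation."""
    return sum(qi * di for qi, di in zip(q, d))

q1 = normalize([1, 0, 0])   # affection
q2 = normalize([0, 1, 1])   # jealous gossip

order_q1 = sorted(docs, key=lambda name: score(q1, docs[name]), reverse=True)
order_q2 = sorted(docs, key=lambda name: score(q2, docs[name]), reverse=True)
print(order_q1)   # -> ['SaS', 'PaP', 'WH']
print(order_q2)   # -> ['WH', 'PaP', 'SaS']
```

The two rankings come out reversed, as the exercise asks us to show.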