Seminar 4

Definition 1 (Term weight)
The weight of a term $t$ in a document $d$ is computed as
\[
w_{t,d} =
\begin{cases}
1 + \log(\mathrm{tf}_{t,d}) & \text{if } \mathrm{tf}_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases}
\]
where $\mathrm{tf}_{t,d}$ is the number of occurrences of the term $t$ in the document $d$.

Definition 2 (Inverse document frequency)
The inverse document frequency of a term $t$ is defined as
\[
\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right)
\]
where $N$ is the number of all documents and $\mathrm{df}_t$ (document frequency) is the number of documents that contain $t$.

Definition 3 (tf-idf weighting scheme)
In the tf-idf weighting scheme, a term $t$ in a document $d$ has the weight
\[
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t
\]

Definition 4 (Cosine (Euclidean) normalization)
A vector $v$ is cosine-normalized by
\[
\hat{v}_j = \frac{v_j}{\|v\|} = \frac{v_j}{\sqrt{\sum_{k=1}^{|v|} v_k^2}}
\]
where $v_j$ is the $j$-th component of $v$ and $\hat{v}_j$ is the $j$-th component of the normalized vector.

Exercise 1
Consider the frequency table of the words of three documents $doc_1$, $doc_2$, $doc_3$ below. Calculate the tf-idf weight of the terms car, auto, insurance, best for each document. The idf values of the terms are given in the table.

             doc1   doc2   doc3   idf
car            27      4     24   1.65
auto            3     33      0   2.08
insurance       0     33     29   1.62
best           14      0     17   1.5

Table 1: Exercise.

Exercise 2
Compute the document representations as Euclidean-normalized weight vectors for each document from the previous exercise. Each vector has four components, one for each term.

Exercise 3
Based on the weights from the last exercise, compute the relevance scores of the three documents for the query car insurance. Use each of the following two weighting schemes for the query:
a) Term weight is 1 if the query contains the word and 0 otherwise.
b) Euclidean-normalized tf-idf.

Please note that a document and a representation of this document are different things. The document is always fixed, but its representation may vary under different settings and conditions.
In this exercise we fix the document representations from the last exercise and compute the relevance scores of the documents under two different representations of the query. It might be helpful to view a query as just another document, since it is also a sequence of words.

Exercise 4
Calculate the vector-space similarity between the query digital cameras and a document containing digital cameras and video cameras by filling in the blank columns in the table below. Assume $N = 10\,000\,000$, logarithmic term weighting (columns $w$) for both the query and the document, idf weighting only for the query, and cosine normalization only for the document. The word "and" is a stop word.

          |               Query               |    Document    | relevance
          |   df      tf    w    idf    q     |  tf    w    d   |   q · d
digital   |  10 000                           |                |
video     | 100 000                           |                |
cameras   |  50 000                           |                |

Table 2: Exercise.

Exercise 5
Show that for the query $q_1$ = affection the documents in the table below are ranked by relevance in the opposite order from the ranking for the query $q_2$ = jealous gossip. The query uses normalized tf weights.

             SaS     PaP     WH
affection   0.996   0.993   0.847
jealous     0.087   0.120   0.466
gossip      0.017   0       0.254

Table 3: Exercise.
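The building blocks from Definitions 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference solution to the exercises: the function names are hypothetical, and the base-10 logarithm is an assumption, since the definitions write only "log".

```python
import math


def log_weight(tf):
    """Definition 1: logarithmic term weight (base-10 log is an assumption)."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """Definition 2: inverse document frequency (base-10 log is an assumption)."""
    return math.log10(n_docs / df)


def tf_idf(tf, idf_value):
    """Definition 3: raw term frequency multiplied by idf."""
    return tf * idf_value


def cosine_normalize(vec):
    """Definition 4: divide every component by the Euclidean length."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return list(vec)
    return [x / length for x in vec]


def score(query_vec, doc_vec):
    """Relevance score as the dot product of two equally long vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))
```

For example, `log_weight(10)` gives 2.0 under the base-10 assumption, and `cosine_normalize([3.0, 4.0])` gives a vector of length 1 pointing in the same direction. Once both the query and the document are represented as vectors over the same terms, `score` yields the relevance used in Exercises 3 to 5.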