The Term Vocabulary (Chapter 2) Definition 1 (Recall) Recall describes how many of the relevant documents are retrieved. π‘Ÿπ‘’π‘π‘Žπ‘™π‘™ = 𝑅 = #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ Definition 2 (Precision) Precision describes how many of the retrieved documents are relevant. π‘π‘Ÿπ‘’π‘π‘–π‘ π‘–π‘œπ‘› = 𝑃 = #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ #π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ Definition 3 (Porter stemmer) The entire Porter’s algorithm is too complex to present here. It consists of 5 phases of word reductions, applied sequentially. The first phase uses the following rule group. Importantly, only the rule that applies to the longest suffix is used. Rule SSES β†’ SS IES β†’ I SS β†’ SS S β†’ Example caresses β†’ caress ponies β†’ poni caress β†’ caress cats β†’ cat Exercise 2/1 Are the following statements true or false? 1. In a Boolean retrieval system, stemming never lowers precision. 2. In a Boolean retrieval system, stemming never lowers recall. 3. Stemming increases the size of the vocabulary. 4. Stemming should be invoked at indexing time but not while processing a query. Exercise 2/2 Suggest what normalized form should be used for these words (including the word itself as a possibility): 1. ’Cos 2. Shi’ite 3. cont’d 4. Hawai’i 5. O’Rourke 1 Exercise 2/3 The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated. Give your reasoning. 1. abandon/abandonment 2. absorbency/absorbent 3. marketing/markets 4. university/universe 5. volume/volumes Exercise 2/4 For the Porter stemmer group shown in Definition 3: 1. What is the purpose of including an identity rule such as SS β†’ SS? 2. Applying just this rule group, what will the following words be stemmed to? circus, canaries, boss 3. What rule should be added to correctly stem pony? 4. The stemming for pony and ponies might seem strange. Does it have a deleterious effect on retrieval? Why or why not? Posting Lists (Chapter 2) Exercise 2/5 Below is a part of index with positions in the form doc1: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, π‘π‘œπ‘ 3, . . .⟩; doc2: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, . . .⟩; . . . β€’ angels: 2 : ⟨36, 174, 252, 651⟩; 4 : ⟨12, 22, 102, 432⟩; 7 : ⟨17⟩; β€’ fools: 2 : ⟨1, 17, 74, 222⟩; 4 : ⟨8, 78, 108, 458⟩; 7 : ⟨3, 13, 23, 193⟩; β€’ fear: 2 : ⟨87, 704, 722, 901⟩; 4 : ⟨13, 43, 113, 433⟩; 7 : ⟨18, 328, 528⟩; β€’ in: 2 : ⟨3, 37, 76, 444, 851⟩; 4 : ⟨10, 20, 110, 470, 500⟩; 7 : ⟨5, 15, 25, 195⟩; β€’ rush: 2 : ⟨2, 66, 194, 321, 702⟩; 4 : ⟨9, 69, 149, 429, 569⟩; 7 : ⟨4, 14, 404⟩; β€’ to: 2 : ⟨47, 86, 234, 999⟩; 4 : ⟨14, 24, 774, 944⟩; 7 : ⟨19, 319, 599, 709⟩; β€’ tread: 2 : ⟨57, 94, 333⟩; 4 : ⟨15, 35, 155⟩; 7 : ⟨20, 320⟩; 2 β€’ where: 2 : ⟨67, 124, 393, 1001⟩; 4 : ⟨11, 41, 101, 421, 431⟩; 7 : ⟨15, 35, 735⟩; The following terms are phrase queries. Which documents correspond to the following queries and on which positions? a) fools rush in b) fools rush in AND angels fear to tread. The index is incorrect. How? Exercise 2/6 Below is a part of index with positions in the form doc1: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, π‘π‘œπ‘ 3, . . .⟩; doc2: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, . . .⟩; . . . β€’ ostrich: 1 : <1,7>; 2 : <4,5>; β€’ hippo: 1 : <5,8,9>; 2 : <6,9>; β€’ lion: 1 : <3,6>; 2 : <3,7>; β€’ giraffe: 1 : <2,4>; 2 : <1,2,8>; Which documents correspond to the phrase query lion giraffe hippo and on which positions? Include intermediate results. Exercise 2/7 Consider a query composed of two terms. Non-positional postings list of one term is composed of 16 items 𝑃 = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 215, 300, 500] and the second term has the postings list of only a single element 𝑅 = [47]. Find out how many comparisons (and why) are necessary to find out the intersection of the lists that are organized as follows: a) standard postings lists b) postings lists with skip pointers of skip frequency βˆšοΈ€ |𝑃| Exercise 2/8 Consider a query composed of two terms. Non-positional postings list with skip pointers of one term is composed of 16 items 𝑃1 = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 215, 300, 500] with skip frequency of square root of its length and the second term has the standard postings list 𝑃2 = [18, 32, 60]. How many comparisons are necessary to find out the intersection of the lists? 3 Exercise 2/9 List the comparisons performed to intersect the following sorted non-positional postings lists with skip pointers of frequency 5. 𝑃1 = [2, 10, 12, 16] and 𝑃2 = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] Exercise 2/10 List the comparisons performed to intersect the following sorted non-positional postings lists with skip pointers of frequency 5. 𝑃1 = [4, 5, 6, 7, 8, 9, 10, 13, 14, 15] and 𝑃2 = [1, 2, 3, 4, 5, 10, 11, 15, 16] 4