Seminar 2 Algorithm 1 (Soundex Code) Transformation of a string to a 4-character soundex code 1. Keep the first character 2. Rewrite {𝐴, 𝐸, 𝐼, 𝑂, 𝑈, 𝐻, 𝑊, 𝑌 } to 0 3. Rewrite characters (a) {𝐵, 𝐹, 𝑃, 𝑉 } to 1 (b) {𝐶, 𝐺, 𝐽, 𝐾, 𝑄, 𝑆, 𝑋, 𝑍} to 2 (c) {𝐷, 𝑇} to 3 (d) {𝐿} to 4 (e) {𝑀, 𝑁} to 5 (f) {𝑅} to 6 4. Remove duplicities 5. Remove zeros 6. Change to length 4 (truncate or add zeros) Algorithm 2 (Querying in Permuterm Index) For query 𝑞, find keys according to the following scheme: ∙ for 𝑞 = 𝑋, find keys in the form 𝑋$ ∙ for 𝑞 = 𝑋* , find keys in the form $𝑋* ∙ for 𝑞 = * 𝑋, find keys in the form 𝑋$* ∙ for 𝑞 = * 𝑋* , find keys in the form 𝑋* ∙ for 𝑞 = 𝑋* 𝑌 , find keys in the form 𝑌 $𝑋* Exercise 1 Below is a part of index with positions in the form doc1: ⟨𝑝𝑜𝑠1, 𝑝𝑜𝑠2, 𝑝𝑜𝑠3, . . .⟩; doc2: ⟨𝑝𝑜𝑠1, 𝑝𝑜𝑠2, . . .⟩; . . . ∙ angels: 2 : ⟨36, 174, 252, 651⟩; 4 : ⟨12, 22, 102, 432⟩; 7 : ⟨17⟩; ∙ fools: 2 : ⟨1, 17, 74, 222⟩; 4 : ⟨8, 78, 108, 458⟩; 7 : ⟨3, 13, 23, 193⟩; ∙ fear: 2 : ⟨87, 704, 722, 901⟩; 4 : ⟨13, 43, 113, 433⟩; 7 : ⟨18, 328, 528⟩; ∙ in: 2 : ⟨3, 37, 76, 444, 851⟩; 4 : ⟨10, 20, 110, 470, 500⟩; 7 : ⟨5, 15, 25, 195⟩; ∙ rush: 2 : ⟨2, 66, 194, 321, 702⟩; 4 : ⟨9, 69, 149, 429, 569⟩; 7 : ⟨4, 14, 404⟩; ∙ to: 2 : ⟨47, 86, 234, 999⟩; 4 : ⟨14, 24, 774, 944⟩; 7 : ⟨19, 319, 599, 709⟩; ∙ tread: 2 : ⟨57, 94, 333⟩; 4 : ⟨15, 35, 155⟩; 7 : ⟨20, 320⟩; 1 ∙ where: 2 : ⟨67, 124, 393, 1001⟩; 4 : ⟨11, 41, 101, 421, 431⟩; 7 : ⟨15, 35, 735⟩; The following terms are phrase queries. Which documents correspond to the following queries and on which positions? a) fools rush in b) fools rush in AND angels fear to tread. The index is incorrect. How? In order to retrieve the query it is necessary that the words are in a sequence. That is, if the word angels is in doc2 on position 36, then the word fear has to be in the same document on the position 37 and so on. For the exercise a) we calculate all possible positions of the phrase. Word fools appears in doc2 on positions ⟨1, 17, 74, 222⟩. That means that the word rush has to appear on positions ⟨2, 18, 75, 223⟩ and the word in on positions ⟨3, 19, 76, 224⟩. Similar process is applied on doc4 and doc7 which retrieves the requested results. doc2 ⟨1, 2, 3⟩, ⟨17, 18, 19⟩, ⟨74, 75, 76⟩, ⟨222, 223, 224⟩ doc4 ⟨8, 9, 10⟩, ⟨78, 79, 80⟩, ⟨108, 109, 110⟩, ⟨458, 459, 460⟩ doc7 ⟨3, 4, 5⟩, ⟨13, 14, 15⟩, ⟨23, 24, 25⟩, ⟨193, 194, 195⟩ Now we look at the original position index and search for whether there is a conjunction between requested and real positions. Take doc2 and check whether the words fools, rush and in are in a sequence on positions ⟨1, 2, 3⟩. Since yes, the system returns doc2 as relevant to our query. Same analogy is used for the remaining documents for which we get the result doc2: {⟨1, 2, 3⟩}; doc4: {⟨8, 9, 10⟩}; and doc7: {⟨3, 4, 5⟩, ⟨13, 14, 15⟩}. For the exercise b) we find the requested positions for also the term angels fear to tread. doc2 ⟨36, 37, 38, 39⟩, ⟨174, 175, 176, 177⟩, ⟨252, 253, 254, 255⟩, ⟨651, 652, 653, 654⟩ doc4 ⟨12, 13, 14, 15⟩, ⟨22, 23, 24, 25⟩, ⟨102, 103, 104, 105⟩, ⟨432, 433, 434, 435⟩ doc7 ⟨17, 18, 19, 20⟩ They appear in the correct order in doc4: {⟨12, 13, 14, 15⟩} and in doc7: {⟨17, 18, 19, 20⟩}. Taking the first part from a), we only check whether the results overlap {doc2(1), doc4(8), doc7(3), doc7(13)} ∩ {doc4(12), doc7(17)} = {doc4(8,12), doc7(3,17), doc7(13,17)}. The index is incorrect. We need to have a look into doc7, where on position 15 are two terms in and where. Exercise 2 Consider a query composed of two terms. Non-positional postings list of one term is composed of 16 items 𝑃 = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 215, 300, 500] and the second term has the postings list of only a single element 𝑅 = [47]. Find out how many comparisons (and why) are necessary to find out the intersection of the lists that are organized as follows: a) standard postings lists 2 b) postings lists with skip pointers of skip frequency √︀ |𝑃| a) A naive algorithm compares each element from 𝑃 with each element from 𝑅, which is 11. b) With skip pointers of frequency √︀ |𝑃| = 4 we reduce the number of comparisons. Instead of the next element 𝑖 + 1 only, the pointer goes directly to 𝑖 + √︀ |𝑃| until the referred value is larger than the searched value, after which it jumps back and then step forward by 1. Starting at position 0, the algorithm proceeds as follows: compare 4:47, jump to 14, compare 14:47, jump to 22, compare 22:47, jump to 120, compare 120:47, jump back to 22, step to 32, compare 32:47, step to 47, compare 47:47, done. The number of compare operations is 6. Exercise 3 a) Find two different words of the same soundex code. b) Find two phonetically similar words of different soundex codes. a) sword and short have codes S630 b) fog and thug have codes F200 and T200 Exercise 4 Write elements in a dictionary of the permuterm index generated by the term mama. 𝑚𝑎𝑚𝑎$, 𝑎𝑚𝑎$𝑚, 𝑚𝑎$𝑚𝑎, 𝑎$𝑚𝑎𝑚, $𝑚𝑎𝑚𝑎. Exercise 5 Which keys are usable for finding the term s*ng in a permuterm wildcard index? 𝑛𝑔$𝑠* 3