Dictionaries and Tolerant Retrieval (Chapter 3) Algorithm 1 (Soundex Code) Transformation of a string to a 4-character soundex code 1. Keep the first character 2. Rewrite {𝐴, 𝐸, 𝐼, 𝑂, π‘ˆ, 𝐻, π‘Š, π‘Œ } to 0 3. Rewrite characters (a) {𝐡, 𝐹, 𝑃, 𝑉 } to 1 (b) {𝐢, 𝐺, 𝐽, 𝐾, 𝑄, 𝑆, 𝑋, 𝑍} to 2 (c) {𝐷, 𝑇} to 3 (d) {𝐿} to 4 (e) {𝑀, 𝑁} to 5 (f) {𝑅} to 6 4. Remove duplicities 5. Remove zeros 6. Change to length 4 (truncate or add trailing zeros) Algorithm 2 (Querying in Permuterm Index) For query π‘ž, find keys according to the following scheme: β€’ for π‘ž = 𝑋, find keys in the form 𝑋$ β€’ for π‘ž = 𝑋* , find keys in the form $𝑋* β€’ for π‘ž = * 𝑋, find keys in the form 𝑋$* β€’ for π‘ž = * 𝑋* , find keys in the form 𝑋* β€’ for π‘ž = 𝑋* π‘Œ , find keys in the form π‘Œ $𝑋* Algorithm 3 (Levenshtein Distance – declarative approach) Distance between two strings π‘Ž and 𝑏 is given by lev π‘Ž,𝑏(|π‘Ž|, |𝑏|) where lev π‘Ž,𝑏(𝑖, 𝑗) = ⎧ βŽͺβŽͺ⎨ βŽͺβŽͺ⎩ max(𝑖, 𝑗) if min(𝑖, 𝑗) = 0 min ⎧ ⎨ ⎩ lev π‘Ž,𝑏(𝑖 βˆ’ 1, 𝑗) + 1 lev π‘Ž,𝑏(𝑖, 𝑗 βˆ’ 1) + 1 lev π‘Ž,𝑏(𝑖 βˆ’ 1, 𝑗 βˆ’ 1) + 1(π‘Ž 𝑖̸= 𝑏 𝑗 ) otherwise where 1(π‘Ž 𝑖̸= 𝑏 𝑗 ) is the indicator function equal to 1 when π‘Žπ‘– ΜΈ= 𝑏 𝑗, and 0 otherwise. lev π‘Ž,𝑏(𝑖, 𝑗) is the distance between the first 𝑖 characters of string π‘Ž and the first 𝑗 characters of string 𝑏. Algorithm 4 (Levenshtein distance – imperative approach) 1: function LevenshteinDistance(𝑠1, 𝑠2) 2: for i = 0 to |𝑠1| do 3: π‘š[𝑖, 0] = 𝑖 4: end for 5: for j = 0 to |𝑠2| do 6: π‘š[0, 𝑗] = 𝑗 1 7: end for 8: for i = 1 to |𝑠1| do 9: for j = 1 to |𝑠2| do 10: if 𝑠1[𝑖] == 𝑠2[𝑗] then 11: π‘š[𝑖, 𝑗] = min{π‘š[𝑖 βˆ’ 1, 𝑗] + 1, π‘š[𝑖, 𝑗 βˆ’ 1] + 1, π‘š[𝑖 βˆ’ 1, 𝑗 βˆ’ 1]} 12: else 13: π‘š[𝑖, 𝑗] = min{π‘š[𝑖 βˆ’ 1, 𝑗] + 1, π‘š[𝑖, 𝑗 βˆ’ 1] + 1, π‘š[𝑖 βˆ’ 1, 𝑗 βˆ’ 1] + 1} 14: end if 15: end for 16: end for 17: return π‘š[|𝑠1|, |𝑠2|] 18: end function Exercise 3/1 a) Find two different words of the same soundex code. b) Find two phonetically similar words of different soundex codes. Exercise 3/2 Write elements in a dictionary of the permuterm index generated by the term mama. Exercise 3/3 Which keys are usable for finding the term s*ng in a permuterm wildcard index? Exercise 3/4 What is the complexity of intersection of two un-ordered posting lists of lengths π‘š and 𝑛? Exercise 3/5 What is the complexity (in π’ͺ-notation) of intersecting of two ordered posting lists of lengths π‘š and 𝑛? Exercise 3/6 What is the worst-case complexity of searching in hash tables? 2 Exercise 3/7 Compute the Levenshtein distance between paris and alice. Write down the matrix of distances between all prefixes as computed by Algorithm 4. 3