Web search basics (Chapter 19) Definition 1 (Mark and Recapture) Suppose that we could pick a random page from the index of 𝐸1 and test whether it is in 𝐸2’s index and symmetrically, test whether a random page from 𝐸2 is in 𝐸1. These experiments give us the fraction π‘₯ of the pages in 𝐸1 that are also in 𝐸2, and the fraction 𝑦 of the pages in 𝐸2 that are also in 𝐸1. Let |𝐸𝑖| denote the size of the index 𝐸𝑖. Then π‘₯|𝐸1| β‰ˆ 𝑦|𝐸2| ⇐⇒ |𝐸1| |𝐸2| β‰ˆ 𝑦 π‘₯ . Definition 2 (Shingling) By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads. Given a positive integer π‘˜ and a sequence of terms in a document 𝑑, define the π‘˜-shingles of 𝑑 to be the set of all consecutive sequences of π‘˜ terms in 𝑑. As an example, consider the following text: β€œa rose is a rose is a rose”. The 4-shingles for this text (π‘˜ = 4 is a typical value used in the detection of near-duplicate web pages) are β€œa rose is a”, β€œrose is a rose”, and β€œis a rose is”. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. Let 𝑆(𝑑 𝑗) denote the set of shingles of document 𝑑 𝑗. The Jaccard coefficient measures the degree of overlap between the sets 𝑆(𝑑1) and 𝑆(𝑑2) as 𝐽(𝑆(𝑑1), 𝑆(𝑑2)) = |𝑆(𝑑1) ∩ 𝑆(𝑑2)| |𝑆(𝑑1) βˆͺ 𝑆(𝑑2)| . If the Jaccard index exceeds a preset threshold (say, 0.9), we declare documents 𝑑1 and 𝑑2 near-duplicates and eliminate one from indexing. Since computing the Jaccard index between all pairs of documents is time-consuming, an estimate is often used, see Section 19.6 in the Manning book. Exercise 19/1 Pick a topic of your interest and describe it by 5–10 words. Open Sketch Engine at https://app.sketchengine.eu/, use institutional login, click β€œNew Corpus”, and create a corpus using the description words as seeds. Wait until data are downloaded and search the word corpus for collocations using the β€œWord Sketch Difference” tool. Exercise 19/2 Two web search engines 𝐴 and 𝐡 each generate a large number of pages uniformly at random from their indexes. 30% of 𝐴’s pages are present in 𝐡’s index, while 50% of 𝐡’s pages are present in 𝐴’s index. What is the number of pages in 𝐴’s index relative to 𝐡? Exercise 19/3 Using shingling with π‘˜ = 4 and the threshold 0.9 to decide whether the documents 𝑑1 = β€œnow is the time for all good men to come to the aid of their country”, and 𝑑2 = β€œnow is the time for all good men to come to the aid of their party” are near-duplicates. 1 2