PV211: Introduction to Information Retrieval
http://www.fi.muni.cz/~sojka/PV211
IIR 1: Boolean Retrieval
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2017-02-28

Take-away
• Basic information about the course, teachers, evaluation, exercises
• Boolean Retrieval: design and data structures of a simple information retrieval system
• What topics will be covered in this class (overview)?

Overview
1. Introduction
2. History of information retrieval
3. Boolean model
4. Inverted index
5. Processing Boolean queries
6. Query optimization
7. Course overview and agenda

Information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Prerequisites
Curiosity about how Information Retrieval works. But seriously:
• Chapters 1-5 benefit from a basic course on algorithms and data structures.
• Chapters 6-7 need, in addition, linear algebra: vectors and dot products.
• For Chapters 11-13, basic probability notions are needed.
• Chapters 18-21 demand a course in linear algebra: notions of matrix rank, eigenvalues and eigenvectors.

Active learning features in PV211
• Student activities are explicitly welcomed and built into the classification system (10 pts).
• Mentoring rather than 'ex cathedra' lectures: "The flipped classroom is a pedagogical model in which the typical lecture and homework elements of a course are reversed."
• Respect for individual learning speed and knowledge.
• Questions on the PV211 IS discussion forum are welcome, especially before lectures.
• Richness of materials available in advance: MOOCs (Massive open online courses) are becoming widespread; parts of the IIR Stanford courses are available, together with other freely available teaching materials, including the whole IIR book.

Teachers
• Petr Sojka, sojka@fi.muni.cz
• Consulting hours Spring 2017: Tuesday 15:15-16:00 and Thursday 10:00-11:00, or write an email with other suggestions to meet.
• Room C523 or C522 or A502, fifth floor, Botanická 68a.
• Course web page: http://www.fi.muni.cz/~sojka/PV211/
• TA: Michal Balážia. Consulting hours: Wednesday 14:00-16:00, room C516.

Evaluation of students
The classification system is based on points achieved (100 pts max).
You can get 50 points during the term (20 pts for each of the 2 midterm tests and 10 pts for your activity during the term: lectures, discussion forums, ...) and 50 pts for the final test. The final written exam will consist of open exercises (30 pts, similar to the midterm ones) and multiple-choice questions (20 pts). In addition, one can get premium points for activities during lectures and exercises (good answers) or for negotiated related projects. The classification scale (with adjustments based on ECTS suggestions) z/k[/E/D/C/B/A] corresponds to approximately 50/57/[64/71/78/85/92] points. Dates of [final] exams will be announced via IS.muni.cz (at least three terms); those who were ill may take the midterm tests at the first exam term. Questions?

How we will proceed
Questions? Language preferences? Warm-ups? Personal cards. PA212? Bc. or Mgr.?

History of information retrieval: gradual changes
[Figure: timeline of information retrieval milestones.]

Gradual speedup of changes in IR
[Figure: timeline of IR milestones, 1920-2008; the pace of change accelerates over time. Screenshot: "Google" circa 1997 (google.stanford.edu).]

google.stanford.edu
• 'flipped IS', collaborative project with Stanford faculty
• on collected disks
• Google 1998 'Anatomy paper' (Page, Brin)

[Figure: timeline of changes in IR extended to 2020. Figures: unstructured (text) vs. structured (database) data in 1996 and in 2006, comparing data volume and market capitalization.]

Boolean retrieval
• The Boolean model is arguably the simplest model to base an information retrieval system on.
• Queries are Boolean expressions, e.g., Caesar AND Brutus.
• The search engine returns all documents that satisfy the Boolean expression.
Does Google use the Boolean model?

Does Google use the Boolean model?
• On Google, the default interpretation of a query [w1 w2 ... wn] is w1 AND w2 AND ... AND wn.
• Cases where you get hits that do not contain one of the wi:
  • anchor text
  • the page contains a variant of wi (morphology, spelling correction, synonym)
  • long queries (n large)
  • the Boolean expression generates very few hits
• Simple Boolean vs. ranking of the result set:
  • Simple Boolean retrieval returns matching documents in no particular order.
  • Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.
Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar, but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is grep not the solution?
  • Slow (for large collections)
  • grep is line-oriented, IR is document-oriented
  • "NOT Calpurnia" is non-trivial
  • Other operations (e.g., find the word Romans near countryman) are not feasible

Term-document incidence matrix
            Anthony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar  Tempest
Anthony     1            1       0        0       0        1
Brutus      1            1       0        1       0        0
Caesar      1            1       0        1       1        1
Calpurnia   0            1       0        0       0        0
Cleopatra   1            0       0        0       0        0
mercy       1            0       1        1       1        1
worser      1            0       1        1       1        0

Entry is 1 if the term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if the term doesn't occur. Example: Calpurnia doesn't occur in The Tempest.

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia:
  • Take the vectors for Brutus, Caesar, and Calpurnia.
  • Complement the vector of Calpurnia.
  • Do a (bitwise) AND on the three vectors.
• 110100 AND 110111 AND 101111 = 100100

0/1 vectors and the result of the bitwise operations
            Anthony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar  Tempest
Anthony     1            1       0        0       0        1
Brutus      1            1       0        1       0        0
Caesar      1            1       0        1       1        1
Calpurnia   0            1       0        0       0        0
Cleopatra   1            0       0        0       0        0
mercy       1            0       1        1       1        1
worser      1            0       1        1       1        0
result:     1            0       0        1       0        0

Answers to query
Anthony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger collections
• Consider N = 10^6 documents, each with about 1,000 tokens, for a total of 10^9 tokens.
• On average 6 bytes per token, including spaces and punctuation, so the size of the document collection is about 6 · 10^9 bytes = 6 GB.
• Assume there are M = 500,000 distinct terms in the collection.
• (Notice that we are making a term/token distinction.)

Can't build the incidence matrix
• M = 500,000 × 10^6 = half a trillion 0s and 1s.
• But the matrix has no more than one billion 1s.
  • The matrix is extremely sparse.
• What is a better representation?
• We only record the 1s: the inverted index!
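To make the bitwise-AND idea concrete, here is a minimal Python sketch, purely for illustration: the 0/1 vectors are the rows of the toy matrix above, and the six columns are the six plays in the order shown.

    # Rows of the term-document incidence matrix above (one 0/1 vector per term).
    Brutus    = [1, 1, 0, 1, 0, 0]
    Caesar    = [1, 1, 0, 1, 1, 1]
    Calpurnia = [0, 1, 0, 0, 0, 0]

    # Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, then bitwise AND.
    result = [b & c & (1 - p) for b, c, p in zip(Brutus, Caesar, Calpurnia)]
    print(result)  # [1, 0, 0, 1, 0, 0] -> Anthony and Cleopatra, Hamlet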
Inverted index
For each term t, we store a list of all documents that contain t.

Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Caesar    → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132 → ...
Calpurnia → 2 → 31 → 54 → 101

The terms on the left form the dictionary; the lists on the right are the postings ("vyskytník" in Czech).

Inverted index construction
1. Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar ...
2. Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So ...
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so ...
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

Tokenization and preprocessing
Doc 1. I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
⇓
Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

Generate postings
Each token is paired with the docID of the document it occurs in, in document order:
(i,1) (did,1) (enact,1) (julius,1) (caesar,1) (i,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
(so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

Sort postings
The (term, docID) pairs are sorted alphabetically by term (and then by docID):
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (i,1) (i,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)

Create postings lists, determine document frequency
Multiple occurrences of a term in the same document are collapsed, the document frequency is recorded, and the docIDs form the postings list:

term        doc. freq.  postings list
ambitious   1           → 2
be          1           → 2
brutus      2           → 1 → 2
capitol     1           → 1
caesar      2           → 1 → 2
did         1           → 1
enact       1           → 1
hath        1           → 2
i           1           → 1
i'          1           → 1
it          1           → 2
julius      1           → 1
killed      1           → 1
let         1           → 2
me          1           → 1
noble       1           → 2
so          1           → 2
the         2           → 1 → 2
told        1           → 2
you         1           → 2
was         2           → 1 → 2
with        1           → 2

Split the result into dictionary and postings file
Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Caesar    → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132 → ...
Calpurnia → 2 → 31 → 54 → 101

The terms form the dictionary; the lists of docIDs form the postings file.

Later in this course
• Index construction: how can we create inverted indexes for large collections?
• How much space do we need for dictionary and index?
• Index compression: how can we efficiently store and process indexes for large collections?
• Ranked retrieval: what does the inverted index look like when we want the "best" answer?

Simple conjunctive query
• Consider the query: Brutus AND Calpurnia
• To find all matching documents using the inverted index:
  1. Locate Brutus in the dictionary.
  2. Retrieve its postings list from the postings file.
  3. Locate Calpurnia in the dictionary.
  4. Retrieve its postings list from the postings file.
  5. Intersect the two postings lists.
  6. Return the intersection to the user.

Intersecting two postings lists
Brutus       → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia    → 2 → 31 → 54 → 101
Intersection → 2 → 31
• This is linear in the length of the postings lists.
• Note: This only works if postings lists are sorted.

Intersecting two postings lists
Intersect(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
    do if docID(p1) = docID(p2)
         then Add(answer, docID(p1))
              p1 ← next(p1)
              p2 ← next(p2)
         else if docID(p1) < docID(p2)
                then p1 ← next(p1)
                else p2 ← next(p2)
  return answer

Query processing: Exercise
france → 1 → 2 → 3 → 4 → 5 → 7 → 8 → 9 → 11 → 12 → 13 → 14 → 15
paris  → 2 → 6 → 10 → 12 → 14
lear   → 12 → 15
Compute the hit list for ((paris AND NOT france) OR lear).
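As an illustration, here is a minimal Python sketch of the Intersect merge shown above. It operates on in-memory Python lists of sorted docIDs, which is an assumption for the sketch; a real system iterates over on-disk postings.

    def intersect(p1, p2):
        """Merge two sorted postings lists, returning the docIDs common to both."""
        answer = []
        i, j = 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [1, 2, 4, 11, 31, 45, 173, 174]
    calpurnia = [2, 31, 54, 101]
    print(intersect(brutus, calpurnia))  # [2, 31]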
Boolean queries
• The Boolean retrieval model can answer any query that is a Boolean expression.
  • Boolean queries are queries that use AND, OR and NOT to join query terms.
  • It views each document as a set of terms.
  • It is precise: a document either matches the condition or it does not.
• Boolean retrieval was the primary commercial retrieval tool for three decades.
• Many professional searchers (e.g., lawyers) still like Boolean queries.
  • You know exactly what you are getting.
• Many search systems you use are also Boolean: Spotlight, email, intranet, etc.

Commercially successful Boolean retrieval: Westlaw
• Largest commercial legal search service in terms of the number of paying subscribers
• Over half a million subscribers performing millions of searches a day over tens of terabytes of text data
• The service was started in 1975.
• In 2005, Boolean search (called "Terms and Connectors" by Westlaw) was still the default, and used by a large percentage of users ...
• ... although ranked retrieval has been available since 1992.

Westlaw: Example queries
Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
Query: "trade secret" /s disclos! /s prevent /s employe!

Information need: Requirements for disabled people to be able to access a workplace
Query: disab! /p access! /s work-site work-place (employment /3 place)

Information need: Cases about a host's responsibility for drunk guests
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Westlaw: Comments
• Proximity operators: /3 = within 3 words, /s = within a sentence, /p = within a paragraph
• Space is disjunction, not conjunction! (This was the default in search pre-Google.)
• Long, precise queries: incrementally developed, not like web search
• Why professional searchers often like Boolean search: precision, transparency, control
• When are Boolean queries the best way of searching? It depends on the information need, the searcher, the document collection, ...

Query optimization
• Consider a query that is an AND of n terms, n > 2.
• For each of the terms, get its postings list, then AND them together.
• Example query: Brutus AND Calpurnia AND Caesar
• What is the best order for processing this query?

Query optimization
• Example query: Brutus AND Calpurnia AND Caesar
• Simple and effective optimization: process the terms in order of increasing frequency.
• Start with the shortest postings list, then keep cutting further.
• In this example: first Caesar, then Calpurnia, then Brutus.
Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia → 2 → 31 → 54 → 101
Caesar    → 5 → 31

Optimized intersection algorithm for conjunctive queries
Intersect(⟨t1, ..., tn⟩)
  terms ← SortByIncreasingFrequency(⟨t1, ..., tn⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms ≠ NIL and result ≠ NIL
    do result ← Intersect(result, postings(first(terms)))
       terms ← rest(terms)
  return result

More general optimization
• Example query: (madding OR crowd) AND (ignoble OR strife)
• Get frequencies for all terms.
• Estimate the size of each OR by the sum of its frequencies (a conservative estimate).
• Process in increasing order of OR sizes.

Exercise
Recommend a query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
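Below is a minimal Python sketch of the frequency-ordered processing described above. It assumes an in-memory dict from term to a sorted postings list and reuses the intersect function sketched earlier; both are simplifications for illustration.

    def intersect_query(terms, index):
        """AND together the postings lists of all terms, shortest list first."""
        # Process in order of increasing document frequency (= postings list length).
        ordered = sorted(terms, key=lambda t: len(index[t]))
        result = index[ordered[0]]
        for term in ordered[1:]:
            if not result:   # empty intermediate result: we can stop early
                break
            result = intersect(result, index[term])
        return result

    index = {
        "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
        "calpurnia": [2, 31, 54, 101],
        "caesar":    [5, 31],
    }
    print(intersect_query(["brutus", "calpurnia", "caesar"], index))  # [31]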
Course overview and agenda
• We are done with Chapter 1 of IIR (IIR 01).
• Plan for the rest of the semester: 16-18 of the 21 chapters of IIR
• In addition to experts from FI, there will be lectures by leading industry experts from Facebook (Tomáš Mikolov on March 21st, as part of the FI Informatics Colloquium), Seznam.cz (Tomáš Vrábel), and RaRe Technologies (Radim Řehůřek).
• In what follows: teasers for most chapters, to give you a sense of what will be covered.

The term vocabulary and postings lists
• Phrase queries: "Stanford University"
• Proximity queries: Gates NEAR Microsoft
• We need an index that captures position information for phrase queries and proximity queries.

Dictionaries and tolerant retrieval
A bigram index mapping bigrams to the terms that contain them:
bo → aboard → about → boardroom → border
or → border → lord → morbid → sordid
rd → aboard → ardent → boardroom → border

[Figure: distributed index construction: splits → map phase → segment files → reduce phase → postings.]

[Figure: log-log plot of term frequency vs. rank, illustrating Zipf's law.]

IIR 06: Scoring, term weighting and the vector space model
• Ranking search results:
  • Boolean queries only give inclusion or exclusion of documents.
  • For ranked retrieval, we measure the proximity between the query and each document.
  • One formalism for doing this: the vector space model.
• Key challenge in ranked retrieval: evidence accumulation for a term in a document
  • 1 vs. 0 occurrences of a query term in the document
  • 3 vs. 2 occurrences of a query term in the document
  • Usually: more is better
  • But by how much?
• We need a scoring function that translates frequency into a score or weight.

Scoring in a complete search system
[Figure: architecture of a complete search system: documents → parsing and linguistics → indexers → document cache, metadata in zone and field indexes, tiered inverted positional index, k-gram indexes; user query → free text query parser → spell correction → inexact top-K retrieval → scoring and ranking (scoring parameters trained on an MLR training set) → results page.]

[Screenshot: web search results for a query about Manitoba's second-largest city (Wikipedia, canadavisa.com, and CBC Manitoba results).]
Relevance feedback & query expansion
[Figure: table of retrieval scores for documents before and after relevance feedback.]

Text classification & Naive Bayes
• Text classification = assigning documents automatically to predefined classes
• Examples:
  • Language (English vs. French)
  • Adult content
  • Region

Support vector machines
[Figure: SVM decision boundary with support vectors; the margin between the classes is maximized.]

Flat clustering
[Screenshot: Vivisimo clustered search results for the query "jaguar"; clusters include Cars, Club, Cat, Animal, Restoration, Mac OS X, Jaguar Model, Request, Mark Webber, and Maya.]
Hierarchical clustering
[Screenshot: hierarchically clustered news at http://news.google.com]

Latent semantic indexing
[Figure: term-context matrix illustrating latent semantic indexing.]

The web and its challenges
• Unusual and diverse documents
• Unusual and diverse users and information needs
• Beyond terms and text: exploit link analysis, user data
• How do web search engines work?
• How can we make them better?

[Figure: architecture of a distributed web crawler, with a URL frontier fed from other nodes.]

Invited lecture by Tomáš Mikolov (Facebook)
Abstract: Artificial neural networks are currently very successful in various machine learning tasks that involve natural language. In this talk, I will describe how recurrent neural network language models have been developed, as well as their most frequent applications to speech recognition and machine translation. I will also talk about distributed word representations, their interesting properties, and efficient ways how to compute them. Finally, I will describe our latest efforts to create a novel dataset that could be used to develop machines that can truly communicate with human users in natural language.

Invited lecture: image search
Ready-to-deploy image search, as researched by the Seznam.cz team lead (Tomáš Vrábel).
Abstract: Introduction to the Seznam.cz image search architecture. We will talk about the system overview and basic signals used in machine learning algorithms for relevance computation. We will cover the effect of user feedback on the quality of results, the technology behind user query understanding, and the deep convolutional neural networks for computer vision and image understanding.

Recognition from motion capture data
Classification metrics for all implemented methods. Homogeneous setup, evaluated on the full database with 10-fold cross-validation. Methods are ordered by their CCR score.

method          CCR    EER    AUC    MAP
Raw_JC          0.872  0.321  0.731  0.317
MMC_BR          0.868  0.305  0.739  0.332
Raw_BR          0.867  0.333  0.701  0.259
MMC_JC          0.861  0.325  0.72   0.309
PCA+LDA_BR      0.845  0.335  0.682  0.247
KwolekB         0.823  0.367  0.711  0.296
KrzeszowskiT    0.802  0.348  0.717  0.273
PCA+LDA_JC      0.79   0.417  0.634  0.189
DikovskiB       0.787  0.376  0.679  0.227
AhmedF          0.771  0.371  0.664  0.22
AnderssonVO     0.76   0.352  0.703  0.228
NareshKumarMS   0.717  0.459  0.613  0.19
JiangS          0.692  0.407  0.637  0.204
BallA           0.667  0.356  0.698  0.207
SinhaA          0.598  0.362  0.69   0.176
AhmedM          0.58   0.392  0.646  0.145
SedmidubskyJ    0.464  0.394  0.65   0.138
AliS            0.186  0.394  0.662  0.096
PreisJ          0.131  0.407  0.618  0.066
Random          0.039

Take-away
• Basic information about the course, teachers, evaluation, exercises
• Boolean Retrieval: design and data structures of a simple information retrieval system
• What topics will be covered in this class (overview)?

Resources
• Chapter 1 of IIR
• Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
  • course schedule and overview
  • information retrieval links
  • Shakespeare search engine http://www.rhymezone.com/Shakespeare/

PV211: Introduction to Information Retrieval
http://www.fi.muni.cz/~sojka/PV211
IIR 2: The term vocabulary and postings lists
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2017-02-28

Overview
1. Documents
2. Terms (general + non-English; English)
3. Skip pointers
4. Phrase queries

Take-away
• Understanding of the basic units of classical information retrieval systems: words and documents. What is a document, what is a term?
• Tokenization: how to get from raw text to words (or tokens)
• More complex indexes: skip pointers and phrases

Major steps in inverted index construction
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.

Documents
• Last lecture: a simple Boolean retrieval system
• Our assumptions were:
  • We know what a document is.
  • We can "machine-read" each document.
• This can be complex in reality: "God is in the details." (Mies van der Rohe)

Parsing a document
• We need to deal with the format and language of each document.
• What format is it in? PDF, Word, Excel, HTML, etc.
• What language is it in?
• What character set is in use?
• Each of these is a classification problem, which we will study later in this course (IIR 13).
• Alternative: use heuristics.

Language: Complications
• A single index usually contains terms of several languages.
• Sometimes a document or its components contain multiple languages/formats.
  • French email with Spanish PDF attachment
• What is the document unit for indexing?
  • A file?
  • An email?
  • An email with 5 attachments?
  • A group of files (ppt or LaTeX in HTML)?
• Upshot: Answering the question "what is a document?" is not trivial and requires some design decisions.
• Also: XML

Definitions
• Word: a delimited string of characters as it appears in the text.
• Term: a "normalized" word (case, morphology, spelling, etc.); an equivalence class of words.
• Token: an instance of a word or term occurring in a document.
• Type: the same as term in most cases; an equivalence class of tokens. More informally: what we consider the same in the index, e.g., the abstraction of a line in the incidence matrix.

Normalization
• We need to "normalize" words in the indexed text, as well as query terms, into the same form.
• Example: We want to match U.S.A. and USA.
• We most commonly implicitly define equivalence classes of terms.
• Alternatively: do asymmetric expansion.
  • window → window, windows
  • windows → Windows, windows
  • Windows → Windows (no expansion)
• More powerful, but less efficient.
• Why don't you want to put window, Window, windows, and Windows in the same equivalence class?

Normalization: other languages
• Normalization and language detection interact.
• PETER WILL NICHT MIT. Here MIT = mit.
• He got his PhD from MIT. Here MIT ≠ mit.

Recall: inverted index construction
• Input: Friends, Romans, countrymen. So let it be with Caesar ...
• Output: friend roman countryman so ...
• Each token is a candidate for a postings entry.
• What are valid tokens to emit?

Exercises
• In June, the dog likes to chase the cat in the barn. How many word tokens? How many word types?
• Why tokenization is difficult, even in English. Tokenize: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.

Tokenization problems: one word or two (or several)?
• Hewlett-Packard
• State-of-the-art
• co-education
• the hold-him-back-and-drag-him-away maneuver
• data base
• San Francisco
• Los Angeles-based company
• cheap San Francisco-Los Angeles fares
• York University vs. New York University

Tokenization problems: numbers
• 3/20/91
• 20/3/91
• Mar 20, 1991
• B-52
• 100.2.86.144
• (800) 234-2333
• 800.234.2333
• Older IR systems may not index numbers ...
• ... but generally it's a useful feature.
• Google example (1+1)

Chinese: no whitespace
[Example: a Chinese sentence written without spaces between words.]

Ambiguous segmentation in Chinese
[Example: a two-character Chinese string.] The two characters can be treated as one word meaning 'monk' or as a sequence of two words meaning 'and' and 'still'.
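A minimal Python sketch of naive whitespace-and-punctuation tokenization with case folding, illustrating the token/type distinction from the exercise above. This is only a toy tokenizer: it sidesteps all of the hard cases discussed in this section, such as hyphens, numbers, and languages without whitespace.

    import re

    def tokenize(text):
        """Naive tokenizer: lowercase, then split on runs of characters that are
        neither word characters nor apostrophes."""
        return [t for t in re.split(r"[^\w']+", text.lower()) if t]

    sentence = "In June, the dog likes to chase the cat in the barn."
    tokens = tokenize(sentence)
    print(len(tokens), "tokens:", tokens)                    # 12 tokens
    print(len(set(tokens)), "types:", sorted(set(tokens)))   # 9 types ('in' and 'the' repeat)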
Other cases of "no whitespace"
• Compounds in Dutch, German, Swedish, Czech (čistokapsonosoplena)
• Computerlinguistik → Computer + Linguistik
• Lebensversicherungsgesellschaftsangestellter → leben + versicherung + gesellschaft + angestellter
• Inuit: tusaatsiarunnanngittualuujunga ("I can't hear very well.")
• Many other languages have segmentation difficulties: Finnish, Urdu, ...

Japanese
[Example: a Japanese sentence mixing several writing systems.]
Four different "alphabets": Chinese characters, the hiragana syllabary for inflectional endings and function words, the katakana syllabary for transcription of foreign words and other uses, and Latin. No spaces (as in Chinese). The end user can express a query entirely in hiragana!

Arabic script
Example: the word /kitābun/ 'a book' is written right to left, with short vowels marked by diacritics; reading the written form left to right gives ⟨un b ā t i k⟩.

Arabic script: bidirectionality
[Example: an Arabic sentence containing the numerals 132 and 1962, so the reading order switches direction within the line.]
'Algeria achieved its independence in 1962 after 132 years of French occupation.'
Bidirectionality is not a problem if the text is coded in Unicode.

Accents and diacritics
• Accents: résumé vs. resume (simple omission of the accent)
• Umlauts: Universität vs. Universitaet (substitution with the special letter sequence "ae")
• Most important criterion: How are users likely to write their queries for these words?
• Even in languages that standardly have accents, users often do not type them. (Polish?)

Case folding
• Reduce all letters to lower case.
• Even though case can be semantically meaningful:
  • capitalized words in mid-sentence
  • MIT vs. mit
  • Fed vs. fed
  • ...
• It's often best to lowercase everything, since users will use lowercase regardless of correct capitalization.

Stop words
• Stop words = extremely common words which would appear to be of little value in helping to select documents matching a user need
• Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
• Stop word elimination used to be standard in older IR systems.
• But you need stop words for phrase queries, e.g., "King of Denmark".
• Most web search engines index stop words.

More equivalence classing
• Soundex: IIR 3 (phonetic equivalence, Müller = Mueller)
• Thesauri: IIR 9 (semantic equivalence, car = automobile)

Lemmatization
• Reduce inflectional/variant forms to the base form.
• Example: am, are, is → be
• Example: car, cars, car's, cars' → car
• Example: the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing "proper" reduction to the dictionary headword form (the lemma).
• Inflectional morphology (cutting → cut) vs.
derivational morphology (destruction → destroy)

Stemming
• Definition of stemming: a crude heuristic process that chops off the ends of words in the hope of achieving what "principled" lemmatization attempts to do with a lot of linguistic knowledge.
• Language dependent
• Often both inflectional and derivational
• Example for derivational: automate, automatic, automation all reduce to automat.

Porter algorithm
• Most common algorithm for stemming English
• Results suggest that it is at least as good as other stemming options.
• Conventions + 5 phases of reductions
• Phases are applied sequentially.
• Each phase consists of a set of commands.
  • Sample command: Delete final ement if what remains is longer than 1 character.
    • replacement → replac
    • cement → cement
  • Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Porter stemmer: A few rules
Rule           Example
SSES → SS      caresses → caress
IES  → I       ponies → poni
SS   → SS      caress → caress
S    →         cats → cat

Three stemmers: A comparison
Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Does stemming improve effectiveness?
• In general, stemming increases effectiveness for some queries and decreases effectiveness for others.
• Queries where stemming is likely to help: [tartan sweaters], [sightseeing tour san francisco]
  • (equivalence classes: {sweater, sweaters}, {tour, tours})
• The Porter stemmer equivalence class oper contains all of operate, operating, operates, operation, operative, operatives, operational.
• Queries where stemming hurts: [operational AND research], [operating AND system], [operative AND dentistry]
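A minimal Python sketch of the flavor of the Porter rules listed above: within one rule group, the rule with the longest matching suffix wins. This toy version implements only the four example rules and the "ement" command, not the real five-phase algorithm.

    def toy_stem(word):
        """Apply a few Porter-style suffix rules; the longest matching suffix wins."""
        rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
        for suffix, replacement in rules:      # rules are ordered longest suffix first
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word

    def strip_ement(word):
        """Delete final 'ement' if what remains is longer than 1 character."""
        if word.endswith("ement") and len(word) - 5 > 1:
            return word[:-5]
        return word

    print([toy_stem(w) for w in ["caresses", "ponies", "caress", "cats"]])
    # ['caress', 'poni', 'caress', 'cat']
    print(strip_ement("replacement"), strip_ement("cement"))  # replac cement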
Does Google do these?
• Stop words
• Normalization
• Tokenization
• Lowercasing
• Stemming
• Non-Latin alphabets
• Umlauts
• Compounds
• Numbers

Recall basic intersection algorithm
Brutus       → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia    → 2 → 31 → 54 → 101
Intersection → 2 → 31
• Linear in the length of the postings lists.
• Can we do better?

Skip pointers
[Figure: a postings list augmented with skip pointers that jump over several postings at a time.]

Intersecting with skip pointers
IntersectWithSkips(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
    do if docID(p1) = docID(p2)
         then Add(answer, docID(p1))
              p1 ← next(p1)
              p2 ← next(p2)
         else if docID(p1) < docID(p2)
                then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                       then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                              do p1 ← skip(p1)
                       else p1 ← next(p1)
                else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                       then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                              do p2 ← skip(p2)
                       else p2 ← next(p2)
  return answer

Where do we place skips?
• Tradeoff: number of items skipped vs. frequency with which a skip can be taken
• More skips: each skip pointer skips only a few items, but we can use it frequently.
• Fewer skips: each skip pointer skips many items, but we cannot use it very often.

Where do we place skips?
• Simple heuristic: for a postings list of length P, use √P evenly-spaced skip pointers.
• This ignores the distribution of query terms.
• Easy if the index is static; harder in a dynamic environment because of updates.
• How much do skip pointers help?
  • They used to help a lot.
  • With today's fast CPUs, they don't help that much anymore.

Phrase queries
• We want to answer a query such as [Masaryk university] as a phrase.
• Thus "The president Tomáš Garrigue Masaryk never went to Stanford university" should not be a match.
• The concept of a phrase query has proven easily understood by users.
• About 10% of web queries are phrase queries.
• Consequence for the inverted index: it no longer suffices to store docIDs in postings lists.
• Two ways of extending the inverted index:
  • biword index
  • positional index
• Any ideas?

Biword indexes
• Index every consecutive pair of terms in the text as a phrase.
• For example, "Friends, Romans, Countrymen" would generate two biwords: "friends romans" and "romans countrymen".
• Each of these biwords is now a vocabulary term.
• Two-word phrase queries can now easily be answered.

Longer phrase queries
• A long phrase like "masaryk university brno" can be represented as the Boolean query "masaryk university" AND "university brno".
• We need to do post-filtering of hits to identify the subset that actually contains the 3-word phrase.

Issues with biword indexes
• Why are biword indexes rarely used?
• False positives, as noted above
• Index blowup due to a very large term vocabulary

Positional indexes
• Positional indexes are a more efficient alternative to biword indexes.
• Postings lists in a nonpositional index: each posting is just a docID.
• Postings lists in a positional index: each posting is a docID and a list of positions.

Positional indexes: Example
Query: "to1 be2 or3 not4 to5 be6"

to, 993427:
  ⟨ 1: ⟨7, 18, 33, 72, 86, 231⟩;
    2: ⟨1, 17, 74, 222, 255⟩;
    4: ⟨8, 16, 190, 429, 433⟩;
    5: ⟨363, 367⟩;
    7: ⟨13, 23, 191⟩; ...⟩

be, 178239:
  ⟨ 1: ⟨17, 25⟩;
    4: ⟨17, 191, 291, 430, 434⟩;
    5: ⟨14, 19, 101⟩; ...⟩

Document 4 is a match!

Exercise
Shown below is a portion of a positional index in the format:
term: doc1: (position1, position2, ...); doc2: (position1, position2, ...); etc.

angels: 2: (36,174,252,651); 4: (12,22,102,432); 7: (17);
fools:  2: (1,17,74,222); 4: (8,78,108,458); 7: (3,13,23,193);
fear:   2: (87,704,722,901); 4: (13,43,113,433); 7: (18,328,528);
in:     2: (3,37,76,444,851); 4: (10,20,110,470,500); 7: (5,15,25,195);
rush:   2: (2,66,194,321,702); 4: (9,69,149,429,569); 7: (4,14,404);
to:     2: (47,86,234,999); 4: (14,24,774,944); 7: (19,319,599,709);
tread:  2: (57,94,333); 4: (15,35,155); 7: (20,320);
where:  2: (67,124,393,1001); 4: (11,41,101,421,431); 7: (16,36,736);

Which document(s), if any, match each of the following two queries, where each expression within quotes is a phrase query?
"fools rush in"
"fools rush in" AND "angels fear to tread"

Proximity search
• We just saw how to use a positional index for phrase searches.
• We can also use it for proximity search.
• For example: employment /4 place
• Find all documents that contain employment and place within 4 words of each other.
• "Employment agencies that place healthcare workers are seeing growth" is a hit.
• "Employment agencies that have learned to adapt now place healthcare workers" is not a hit.

Proximity search
• Use the positional index.
• Simplest algorithm: look at the cross-product of the positions of (i) employment in the document and (ii) place in the document.
• Very inefficient for frequent words, especially stop words.
• Note that we want to return the actual matching positions, not just a list of documents.
• This is important for dynamic summaries etc.

Proximity intersection
PositionalIntersect(p1, p2, k)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
    do if docID(p1) = docID(p2)
         then l ← ⟨ ⟩
              pp1 ← positions(p1)
              pp2 ← positions(p2)
              while pp1 ≠ NIL
                do while pp2 ≠ NIL
                     do if |pos(pp1) - pos(pp2)| ≤ k
                          then Add(l, pos(pp2))
                          else if pos(pp2) > pos(pp1)
                                 then break
                        pp2 ← next(pp2)
                   while l ≠ ⟨ ⟩ and |l[0] - pos(pp1)| > k
                     do Delete(l[0])
                   for each ps ∈ l
                     do Add(answer, ⟨docID(p1), pos(pp1), ps⟩)
                   pp1 ← next(pp1)
              p1 ← next(p1)
              p2 ← next(p2)
         else if docID(p1) < docID(p2)
                then p1 ← next(p1)
                else p2 ← next(p2)
  return answer

Combination scheme
• Biword indexes and positional indexes can be profitably combined.
• Many biwords are extremely frequent: Michael Jackson, Britney Spears, etc.
• For these biwords, the increased speed compared to positional postings intersection is substantial.
• Combination scheme: include frequent biwords as vocabulary terms in the index; do all other phrases by positional intersection.
• Williams et al.
(2004) evaluate a more sophisticated mixed indexing scheme. It is faster than a positional index, at a cost of 26% more space for the index.

Positional queries on Google
• For web search engines, positional queries are much more expensive than regular Boolean queries.
• Let's look at the example of phrase queries.
• Why are they more expensive than regular Boolean queries?
• Can you demonstrate on Google that phrase queries are more expensive than Boolean queries?

Take-away
• Understanding of the basic units of classical information retrieval systems: words and documents. What is a document, what is a term?
• Tokenization: how to get from raw text to words (or tokens)
• More complex indexes: skip pointers and phrases

Resources
• Chapter 2 of IIR
• Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
  • Porter stemmer
  • A fun number search on Google

PV211: Introduction to Information Retrieval
http://www.fi.muni.cz/~sojka/PV211
IIR 3: Dictionaries and tolerant retrieval
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2017-02-28

Overview
1. Dictionaries
2. Wildcard queries
3. Edit distance
4. Spelling correction
5. Soundex

Take-away
• Tolerant retrieval: what to do if there is no exact match between a query term and a document term
• Wildcard queries
• Spelling correction

Inverted index
For each term t, we store a list of all documents that contain t.
Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Caesar    → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132 → ...
Calpurnia → 2 → 31 → 54 → 101
The terms form the dictionary; the lists of docIDs form the postings.

Dictionaries
• The dictionary is the data structure for storing the term vocabulary.
• Term vocabulary: the data
• Dictionary: the data structure for storing the term vocabulary

Dictionary as array of fixed-width entries
• For each term, we need to store a couple of items:
  • document frequency
  • pointer to postings list
  • ...
• Assume for the time being that we can store this information in a fixed-length entry.
• Assume that we store these entries in an array.
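A minimal Python sketch of such an array of fixed-width entries. The field widths (a 20-byte term field and two 4-byte integers) are simply the figures from the next slide; keeping the array sorted and using binary search is just one simple way to find a row, and the slides below discuss the main alternatives (hashes and trees).

    import bisect
    import struct

    # One fixed-width entry: 20-byte term field, 4-byte document frequency,
    # 4-byte pointer (offset) into the postings file.
    ENTRY = struct.Struct("20s i i")

    def pack_entry(term, df, postings_offset):
        return ENTRY.pack(term.encode("utf-8"), df, postings_offset)

    entries = [pack_entry(t, df, off) for t, df, off in
               [("a", 656265, 0), ("aachen", 65, 4096), ("zulu", 221, 8192)]]

    # Entries are kept sorted by term, so a row can be located by binary search.
    terms = [e[:20].rstrip(b"\0").decode("utf-8") for e in entries]
    row = bisect.bisect_left(terms, "aachen")
    print(row, ENTRY.unpack(entries[row]))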
Dictionary as array of fixed-width entries
term      document    pointer to
          frequency   postings list
a         656,265     →
aachen    65          →
...       ...         ...
zulu      221         →
space needed: 20 bytes / 4 bytes / 4 bytes

How do we look up a query term qi in this array at query time? That is: which data structure do we use to locate the entry (row) in the array where qi is stored?

Data structures for looking up terms
• Two main classes of data structures: hashes and trees
• Some IR systems use hashes, some use trees.
• Criteria for when to use hashes vs. trees:
  • Is there a fixed number of terms, or will it keep growing?
  • What are the relative frequencies with which various keys will be accessed?
  • How many terms are we likely to have?

Hashes
• Each vocabulary term is hashed into an integer, its row number in the array.
• At query time: hash the query term, locate the entry in the fixed-width array.
• Pros: lookup in a hash is faster than lookup in a tree.
  • Lookup time is constant.
• Cons:
  • no way to find minor variants (résumé vs. resume)
  • no prefix search (all terms starting with automat)
  • need to rehash everything periodically if the vocabulary keeps growing

Trees
• Trees solve the prefix problem (find all terms starting with automat).
• Simplest tree: binary tree
• Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
  • O(log M) only holds for balanced trees.
  • Rebalancing binary trees is expensive.
  • B-trees mitigate the rebalancing problem.
• B-tree definition: every internal node has a number of children in the interval [a, b], where a, b are appropriate positive integers, e.g., [2, 4].

Wildcard queries
• mon*: find all docs containing any term beginning with mon
  • Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo.
• *mon: find all docs containing any term ending with mon
  • Maintain an additional tree for terms written backwards.
  • Then retrieve all terms t in the range nom ≤ t < non.
• Result: a set of terms that are matches for the wildcard query.
• Then retrieve the documents that contain any of these terms.

Query processing
• At this point, we have an enumeration of all terms in the dictionary that match the wildcard query.
• We still have to look up the postings for each enumerated term.

How to handle * in the middle of a term
• Example: m*nchen
• We could look up m* and *nchen in the B-tree and intersect the two term sets.
  • Expensive
• Alternative: the permuterm index
• Basic idea: rotate every wildcard query, so that the * occurs at the end.
• Store each of these rotations in the dictionary, say, in a B-tree.
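A minimal Python sketch of the rotation idea: generate every rotation of term$ and store each rotation as a key pointing back to the original term. A plain dict stands in for the B-tree here, purely for illustration; the query-rotation rules used at the end are spelled out on the next slides.

    def permuterm_rotations(term):
        """All rotations of term + '$', where '$' marks the end of the term."""
        augmented = term + "$"
        return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

    print(permuterm_rotations("hello"))
    # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

    # A toy permuterm "index": every rotation maps back to the original term.
    permuterm = {}
    for term in ["hello", "help", "hell"]:
        for rotation in permuterm_rotations(term):
            permuterm.setdefault(rotation, set()).add(term)

    # Query hel*o: rotate so that * is at the end (o$hel*), then match by prefix.
    prefix = "o$hel"
    print(sorted(t for rot, terms in permuterm.items()
                 if rot.startswith(prefix) for t in terms))  # ['hello']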
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 16 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex m :o nam mm term • Example: m*nchen • We could look up m* and *nchen in the B-tree and intersect the two term sets. • Expensive • Alternative: permuterm index • Basic idea: Rotate every wildcard query, so that the * occurs at the end. • Store each of these rotations in the dictionary, say, in a B-tree Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 17 / 109 • For term hello: add hello$, ello$h, llo$he, lo$hel, o$hell, and $hello to the B-tree where $ is a special symbol Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 18 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 19 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex ermuterm index • For hello, we've stored: hello$, ello$h, llo$he, lo$hel, o$hell, $hello • Queries • For X, look up X$ • For X*, look up $X* a For *X, look up X$* o For *X*, look up X* o For X*Y, look up Y$X* • Example: For hel*o, look up o$hel* • Permuterm index would better be called a permuterm tree. • But permuterm index is the more common name. Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 20 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex cessmg a lookup in tne permuterm index • Rotate query wildcard to the right o Use B-tree lookup as before • Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number) Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 21 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex / 1 K-gram indexes • More space-efficient than permuterm index • Enumerate all character /c-grams (sequence of k characters) occurring in a term • 2-grams are called bigrams. • Example: from April is the cruelest month we get the bigrams: $a ap pr ri il 1$ $i is s$ $t th he e$ $c er ru ue el le es st t$ $m mo on nt th h$ • $ is a special word boundary symbol, as before. • Maintain an inverted index from bigrams to the terms that contain the bigram Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 22 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex ostmgs list in a o-gram inverted maex et r BEETROOT METRIC PETRIFY RETRIEVAL Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 23 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex f /1 ■ . ■ \ ■ 1 K-gram ^Digram, trigram, ... j indexes • Note that we now have two different types of inverted indexes • The term-document inverted index for finding documents based on a query consisting of terms o The /c-gram index for finding terms based on a query consisting of /c-grams Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 24 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex cessmg wi ea terms in a Digram index • Query mon* can now be run as: $m and mo and on • Gets us all terms with the prefix mon ... • . .. but also many "false positives" like moon. • We must postfilter these terms against query. • Surviving terms are then looked up in the term-document inverted index. • /c-gram index vs. permuterm index o /c-gram index is more space efficient. • Permuterm index doesn't require postfiltering. 
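The bigram-index processing of mon* just described can be sketched as follows (a toy example with a made-up vocabulary): intersect the term sets for the query bigrams $m, mo, on, then postfilter the surviving terms against the wildcard pattern before looking them up in the term-document index.

from collections import defaultdict
import fnmatch

def bigrams(term):
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

vocabulary = ["money", "month", "moon", "salmon", "monkey"]
bigram_index = defaultdict(set)          # bigram -> set of terms containing it
for t in vocabulary:
    for bg in bigrams(t):
        bigram_index[bg].add(t)

query_bigrams = ["$m", "mo", "on"]       # the bigrams of mon* (ignoring the trailing *)
candidates = set.intersection(*(bigram_index[bg] for bg in query_bigrams))
print(sorted(candidates))                # ['money', 'monkey', 'month', 'moon'] -- moon is a false positive

# Postfilter: keep only terms that really match the wildcard pattern.
matches = [t for t in candidates if fnmatch.fnmatch(t, "mon*")]
print(sorted(matches))                   # ['money', 'monkey', 'month']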
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 25 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex • Google has very limited support for wildcard queries. • For example, this query doesn't work very well on Google: [gen* universit*] o Intention: you are looking for the University of Geneva, but don't know which accents to use for the French words for university and Geneva. • According to Google search basics, 2010-04-29: "Note that the * operator works only on whole words, not parts of words." • But this is not entirely true. Try [pythag*] and [m*nchen] • Exercise: Why doesn't Google fully support wildcard queries? Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 26 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex cessmg wi queries in tne term-aocument maex o Problem 1: we must potentially execute a large number of Boolean queries. • Most straightforward semantics: Conjunction of disjunctions • For [gen* universit*]: geneva university OR geneva universitě OR geněve university OR geněve universitě OR general universities OR ... • Very expensive • Problem 2: Users hate to type. • If abbreviated queries like [pyth* theo*] for [pythagoras' theorem] are allowed, users will use them a lot. o This would significantly increase the cost of answering queries. • Somewhat alleviated by Google Suggest Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 27 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex pel ling correction • Two principal uses o Correcting documents being indexed • Correcting user queries • Two different methods for spelling correction • Isolated word spelling correction o Check each word on its own for misspelling o Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky • Context-sensitive spelling correction • Look at surrounding words • Can correct form/from error above Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 29 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex orrectmg documents • We are not interested in interactive spelling correction of documents (e.g., MS Word) in this class. • In IR, we use document correction primarily for OCR'ed documents. (OCR = optical character recognition) • The general philosophy in IR is: do not change the documents. Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 30 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex correcting queries • First: isolated word spelling correction • Premise 1: There is a list of "correct words" from which the correct spellings come. • Premise 2: We have a way of computing the distance between a misspelled word and a correct word. • Simple spelling correction algorithm: return the "correct" word that has the smallest distance to the misspelled word. • Example: informaton —>► information • For the list of correct words, we can use the vocabulary of all words that occur in our collection. • Why is this problematic? Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 31 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex ernatives to using :erm vocaouiary • A standard dictionary (Webster's, OED etc.) 
• An industry-specific dictionary (for specialized IR systems)
• The term vocabulary of the collection, appropriately weighted
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 32 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Distance between misspelled word and correct word
• We will study several alternatives.
• Edit distance and Levenshtein distance
• Weighted edit distance
• k-gram overlap
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 33 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Edit distance
• The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
• Levenshtein distance: The admissible basic operations are insert, delete, and replace.
• Levenshtein distance dog-do: 1
• Levenshtein distance cat-cart: 1
• Levenshtein distance cat-cut: 1
• Levenshtein distance cat-act: 2
• Damerau-Levenshtein distance cat-act: 1
• Damerau-Levenshtein includes transposition as a fourth possible operation.
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 34 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Levenshtein distance: computation
     f  a  s  t
  0  1  2  3  4
c 1  1  2  3  4
a 2  2  1  2  3
t 3  3  2  2  2
s 4  4  3  2  3
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 35 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Levenshtein distance: algorithm
LevenshteinDistance(s1, s2)
 1  for i <- 0 to |s1|
 2    do m[i, 0] = i
 3  for j <- 0 to |s2|
 4    do m[0, j] = j
 5  for i <- 1 to |s1|
 6    do for j <- 1 to |s2|
 7      do if s1[i] = s2[j]
 8        then m[i, j] = min{m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1]}
 9        else m[i, j] = min{m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1] + 1}
10  return m[|s1|, |s2|]
Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy (cost 0)
(Slides 36-40 repeat this algorithm, highlighting in turn the initialization and the three terms of the recurrence.)
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 40 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Levenshtein distance: example
(Slides 41-43 show the cats-fast matrix again with each cell expanded into four numbers, explained as follows:)
• cost of getting here from my upper left neighbor (copy or replace)
• cost of getting here from my upper neighbor (delete)
• cost of getting here from my left neighbor (insert)
• the minimum of the three possible "movements"; the cheapest way of getting here
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 43 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Dynamic programming (Cormen et al.)
• Optimal substructure: The optimal solution to the problem contains within it subsolutions, i.e., optimal solutions to subproblems.
• Overlapping subsolutions: The subsolutions overlap. These subsolutions are computed over and over again when computing the global optimal solution in a brute-force algorithm.
• Subproblem in the case of edit distance: what is the edit distance of two prefixes?
• Overlapping subsolutions: We need most distances of prefixes three times - this corresponds to moving right, diagonally, and down.
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 44 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Weighted edit distance
• As above, but the weight of an operation depends on the characters involved.
• Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
• Therefore, replacing m by n is a smaller edit distance than replacing m by q.
• We now require a weight matrix as input.
• Modify the dynamic programming algorithm to handle weights.
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 45 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Using edit distance for spelling correction
• Given a query, first enumerate all character sequences within a preset (possibly weighted) edit distance.
• Intersect this set with our list of "correct" words.
• Then suggest terms in the intersection to the user.
• -> exercise in a few slides
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 46 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Exercise
(1) Compute the Levenshtein distance matrix for oslo - snow.
(2) What are the Levenshtein editing operations that transform cat into catcat?
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 47 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
(Slides 48-81 fill in the oslo-snow matrix cell by cell; only the completed matrix of minimum costs is reproduced here.)
     s  n  o  w
  0  1  2  3  4
o 1  1  2  2  3
s 2  1  2  3  3
l 3  2  2  3  4
o 4  3  3  2  3
The Levenshtein distance between oslo and snow is 3.
How do I read out the editing operations that transform OSLO into SNOW?
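Before reading the answer off the slides, here is a minimal Python sketch (my own code, with names of my own choosing) of the dynamic program above, extended with a backtrace that recovers one optimal sequence of operations.

def levenshtein_with_trace(s1, s2):
    n1, n2 = len(s1), len(s2)
    m = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        m[i][0] = i
    for j in range(n2 + 1):
        m[0][j] = j
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            diag = m[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            m[i][j] = min(m[i - 1][j] + 1,      # delete s1[i-1]
                          m[i][j - 1] + 1,      # insert s2[j-1]
                          diag)                 # copy or replace
    # Backtrace from the bottom-right corner to the top-left corner.
    ops, i, j = [], n1, n2
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1):
            ops.append(("copy" if s1[i - 1] == s2[j - 1] else "replace", s1[i - 1], s2[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and m[i][j] == m[i - 1][j] + 1:
            ops.append(("delete", s1[i - 1], "*"))
            i -= 1
        else:
            ops.append(("insert", "*", s2[j - 1]))
            j -= 1
    return m[n1][n2], list(reversed(ops))

dist, ops = levenshtein_with_trace("oslo", "snow")
print(dist)       # 3
for op in ops:    # delete o, copy s, replace l->n, copy o, insert w
    print(op)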
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 82 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex s n o w 0 1 1 2 2 3 3 4 4 o 1 1 2 2 3 2 4 4 5 1 2 1 2 2 3 2 3 3 s 2 1 2 2 3 3 3 3 4 2 3 1 2 2 3 3 4 3 1 3 3 2 2 3 3 4 4 4 3 4 2 3 2 3 3 4 4 o 4 4 3 3 3 2 4 4 5 4 5 3 4 3 4 2 3 3 cost operation input output 1 insert * w Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 83 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex s n o w 0 1 1 2 2 3 3 4 4 o 1 1 2 2 3 2 4 4 5 1 2 1 2 2 3 2 3 3 s 2 1 2 2 3 3 3 3 4 2 3 1 2 2 3 3 4 3 1 3 3 2 2 3 3 4 4 4 3 4 2 3 2 3 3 4 4 o 4 4 3 3 3 2 4 4 5 4 5 3 4 3 4 2 3 3 cost operation input output 0 (copy) o o 1 insert * w Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 84 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex s n o w 0 1 1 2 2 3 3 4 4 o 1 1 2 2 3 2 4 4 5 1 2 1 2 2 3 2 3 3 s 2 1 2 2 3 3 3 3 4 2 3 1 2 2 3 3 4 3 1 3 3 2 2 3 3 4 4 4 3 4 2 3 2 3 3 4 4 o 4 4 3 3 3 2 4 4 5 4 5 3 4 3 4 2 3 3 cost operation input output 1 replace 1 n 0 (copy) o o 1 insert * w Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 85 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex s n o w 0 1 1 2 2 3 3 4 4 o 1 1 2 2 3 2 4 4 5 1 2 1 2 2 3 2 3 3 s 2 1 2 2 3 3 3 3 4 2 3 1 2 2 3 3 4 3 1 3 3 2 2 3 3 4 4 4 3 4 2 3 2 3 3 4 4 o 4 4 3 3 3 2 4 4 5 4 5 3 4 3 4 2 3 3 cost operation input output 0 (copy) s s 1 replace 1 n 0 (copy) o o 1 insert * w Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 86 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex s n o w 0 1 1 2 2 3 3 4 4 o 1 1 2 2 3 2 4 4 5 1 2 1 2 2 3 2 3 3 s 2 1 2 2 3 3 3 3 4 2 3 1 2 2 3 3 4 3 1 3 3 2 2 3 3 4 4 4 3 4 2 3 2 3 3 4 4 o 4 4 3 3 3 2 4 4 5 4 5 3 4 3 4 2 3 3 cost operation input output 1 delete o * 0 (copy) s s 1 replace 1 n 0 (copy) o o 1 insert * w Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 87 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex c a t c a t 0 1 1 2 2 3 3 4 4 5 5 6 6 c 1 0 2 2 3 3 4 3 5 5 6 6 7 1 2 0 1 1 2 2 3 3 4 4 5 5 a 2 2 1 0 2 2 3 3 4 3 5 5 6 2 3 1 2 0 1 1 2 2 3 3 4 4 t 3 3 2 2 1 0 2 2 3 3 4 3 5 3 4 2 3 1 2 0 1 1 2 2 3 3 Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 88 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex c a t c a t 0 1 1 2 2 3 3 4 4 5 5 6 6 c 1 0 2 2 3 3 4 3 5 5 6 6 7 1 2 0 1 1 2 2 3 3 4 4 5 5 a 2 2 1 0 2 2 3 3 4 3 5 5 6 2 3 1 2 0 1 1 2 2 3 3 4 4 t 3 3 2 2 1 0 2 2 3 3 4 3 5 3 4 2 3 1 2 0 1 1 2 2 3 3 cost operation input output 1 insert * c 1 insert * a 1 insert * t 0 (copy) c c 0 (copy) a a 0 (copy) t t Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 89 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex c a t c a t 0 1 1 2 2 3 3 4 4 5 5 6 6 c 1 0 2 2 3 3 4 3 5 5 6 6 7 1 2 0 1 1 2 2 3 3 4 4 5 5 a 2 2 1 0 2 2 3 3 4 3 5 5 6 2 3 1 2 0 1 1 2 2 3 3 4 4 t 3 3 2 2 1 0 2 2 3 3 4 3 5 3 4 2 3 1 2 0 1 1 2 2 3 3 cost operation input output 0 (copy) c c 1 insert * a 1 insert * t 1 insert * c 0 (copy) a a 0 (copy) t t Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 90 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex c a t c a t 0 1 1 2 2 3 3 4 4 5 5 6 6 c 1 0 2 2 3 3 4 3 5 5 6 6 7 1 2 0 1 1 2 2 3 3 4 4 5 5 a 2 2 1 0 2 2 3 3 4 3 5 5 6 2 3 1 2 0 1 1 2 2 3 3 4 4 t 3 3 2 2 1 0 2 2 3 3 4 3 5 3 4 2 3 1 2 0 1 1 2 2 3 3 cost operation input output 0 (copy) c c 0 
(copy) a a 1 insert * t 1 insert * c 1 insert * a 0 (copy) t t Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 91 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex c a t c a t 0 1 1 2 2 3 3 4 4 5 5 6 6 c 1 0 2 2 3 3 4 3 5 5 6 6 7 1 2 0 1 1 2 2 3 3 4 4 5 5 a 2 2 1 0 2 2 3 3 4 3 5 5 6 2 3 1 2 0 1 1 2 2 3 3 4 4 t 3 3 2 2 1 0 2 2 3 3 4 3 5 3 4 2 3 1 2 0 1 1 2 2 3 3 cost operation input output 0 (copy) c c 0 (copy) a a 0 (copy) t t 1 insert * c 1 insert * a 1 insert * t Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 92 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex pel ling correction • Now that we can compute edit distance: how to use it for isolated word spelling correction - this is the last slide in this section. • /c-gram indexes for isolated word spelling correction. • Context-sensitive spelling correction • General issues Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 94 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex £ 1 1 ■ i ■ /c-gram md exes tor spel ling correction • Enumerate all /c-grams in the query term • Example: bigram index, misspelled word bordroom 9 Bigrams: bo, or, rd, dr, ro, oo, om 9 Use the /c-gram index to retrieve "correct" words that match query term /c-grams o Threshold by number of matching /c-grams o E.g., only vocabulary terms that differ by at most 3 /c-grams Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 95 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex K-gram ina exes Tor spei nng correction: i ora roo m BO -► aboard —► about —► boardroom -► border OR -► border —► lord —► morbid -► sordid RD -► aboard —► ardent —► boardroom -► border Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 96 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex 1 ^ontext-sensitive spei nng correctioi 1 • Our example was: an asteroid that fell form the sky • How can we correct form here? • One idea: hit-based spelling correction • Retrieve "correct" terms close to each query term • for flew form munich: flea for flew, from for form, munch for munich • Now try all possible resulting phrases as queries with one word "fixed" at a time o Try query "flea form munich" o Try query "flew from munich" • Try query "flew form munch" o The correct query "flew from munich" has the most hits. • Suppose we have 7 alternatives for flew, 20 for form and 3 for munich, how many "corrected" phrases will we enumerate? Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 97 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex 1 ^ontext-sensitive spei nng correctioi 1 • The "hit-based" algorithm we just outlined is not very efficient. • More efficient alternative: look at "collection" of queries, not documents. Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 98 / 109 Dictionaries Wildcard queries Edit distance Spelling correction Soundex rai issues in spelling correction • User interface o automatic vs. suggested correction o Did you mean only works for one suggestion. • What about multiple possible corrections? o Tradeoff: simple vs. powerful Ul o Cost o Spelling correction is potentially expensive, o Avoid running on every query? Maybe just on queries that match few documents. • Guess: Spelling correction of major search engines is efficient enough to be run on every query. 
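Putting the pieces above together, here is a minimal sketch (my own toy code and vocabulary; the overlap threshold of two shared bigrams is arbitrary) of isolated-word spelling correction: generate candidates via bigram overlap, then rank the surviving candidates by Levenshtein distance to the misspelled query term.

from collections import defaultdict

def bigrams(term):
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def edit_distance(s1, s2):
    # Standard two-row Levenshtein dynamic program.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

vocabulary = ["border", "boardroom", "aboard", "lord", "sordid", "ardent", "about"]
bigram_index = defaultdict(set)
for t in vocabulary:
    for bg in bigrams(t):
        bigram_index[bg].add(t)

def correct(query, min_overlap=2):
    # Candidates: vocabulary terms sharing at least min_overlap bigrams with the query.
    counts = defaultdict(int)
    for bg in bigrams(query):
        for t in bigram_index[bg]:
            counts[t] += 1
    candidates = [t for t, c in counts.items() if c >= min_overlap]
    # Rank the candidates by edit distance to the query.
    return sorted(candidates, key=lambda t: edit_distance(query, t))

print(correct("bordroom"))   # 'boardroom' should come first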
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 99 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Peter Norvig's spelling corrector
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

nwords = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

def known(words): return set(w for w in words if w in nwords)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=nwords.get)
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 100 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                       # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +   # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] +   # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])    # insertion

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 101 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Soundex
• Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
• Example: chebyshev / tchebyscheff
• Algorithm:
• Turn every token to be indexed into a 4-character reduced form
• Do the same with query terms
• Build and search an index on the reduced forms
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 103 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Soundex algorithm
(1) Retain the first letter of the term.
(2) Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
(3) Change letters to digits as follows:
• B, F, P, V to 1
• C, G, J, K, Q, S, X, Z to 2
• D, T to 3
• L to 4
• M, N to 5
• R to 6
(4) Repeatedly remove one out of each pair of consecutive identical digits.
(5) Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 104 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Example: Soundex of HERMAN
• Retain H
• ERMAN -> 0RM0N
• 0RM0N -> 06505
• 06505 -> 06505
• 06505 -> 655
• Return H655
• Note: HERMANN will generate the same code.
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 105 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
How useful is Soundex?
• Not very - for information retrieval.
• OK for "high recall" tasks in other applications (e.g., Interpol).
• Zobel and Dart (1996) suggest better alternatives for phonetic matching in IR.
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 106 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 107 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Take-away
• Tolerant retrieval: What to do if there is no exact match between query term and document term
• Wildcard queries
• Spelling correction
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 108 / 109
Dictionaries Wildcard queries Edit distance Spelling correction Soundex
Resources
• Chapter 3 of MR
• Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
• trie vs. hash vs. ternary tree
• Soundex demo
• Edit distance demo
• Peter Norvig's spelling corrector
• Google: wildcard search, spelling correction gone wrong, a misspelling that is more frequent than the correct spelling
Sojka, MR Group: PV211: Dictionaries and tolerant retrieval 109 / 109
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211
MR 4: Index construction
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-07 Sojka, MR Group: PV211: Index construction 1/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing Q Recap Q Introduction Q BSBI algorithm O SPIMI algorithm Q Distributed indexing Q Dynamic indexing Sojka, MR Group: PV211: Index construction 2/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing term document pointer to frequency postings list a 656,265 —> aachen 65 —> zulu 221 —> space needed: 20 bytes 4 bytes 4 bytes Sojka, MR Group: PV211: Index construction 4/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing mg up entries in array Sojka, MR Group: PV211: Index construction 5/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing qu ing a permuierm inaex Queries: o For X, look up X$ o For X*, look up X*$ o For *X, look up X$* 9 For *X*, look up X* • For X*Y, look up Y$X* Sojka, MR Group: PV211: Index construction 6/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing -gram index r spelling correction: BO -► aboard —► about —► board room -► border OR -► border —► lord —► morbid -► sordid RD -► aboard —► ardent —► board room -► border Sojka, MR Group: PV211: Index construction 7/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing 1 .evens n tem a istance tor spei nng correction LevenshteinDistance(si, s2) 1 for / 0 to lsi 2 do at?[/, 0] = / 3 for j 0 to 4 do m[0J] =j 5 for / 1 to Si 6 do for y «— 1 to ls2| 7 do if si[i] — S2 [/I 8 then at7[/,7'] 9 else 10 return m[ si > S2|] min{m[/ min{m[/ + 1, m[/,j - 1] + 1, m[/- 1J - 1]} 1,7] + 1, m[/,7 - 1] + 1, m[i - 1,7 - 1] + 1} Operations: insert, delete, replace, copy Sojka, MR Group: PV211: Index construction 8/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing import re, collections def words(text): return re.findall('[a-z]+', text.lower()) def train(features): model = collections.defaultdict(lambda: 1) for f in features: model [f] += 1 return model NWORDS = train(words(file('big.txt').read())) alphabet = 'abcdefghijklmnopqrstuvwxyz' def edits1(word): splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [a + b[l:] for a, b in splits if b] transposes = [a + b[l] + b[0] + b[2:] for a, b in splits if len(b) gt 1] replaces = [a + c + b[l:] for a, b in splits for c in alphabet if b] inserts = [a + c + b for a, b in splits for c in alphabet] return set(deletes + transposes + replaces + inserts) def known_edits2(word): return set(e2 for el in editsl(word) for e2 in editsl(el) if e2 in NWORDS) def known(words): return set(w for w in words if w in NWORDS) def correct(word): candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] return max(candidates, key=NWORDS.get) Sojka, MR Group: PV211: Index construction 9/53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing • Two index construction algorithms: BSBI (simple) and SPIMI (more realistic) • Distributed index construction: MapReduce • Dynamic index construction: how to keep the index up-to-date as the collection changes Sojka, MR Group: PV211: Index construction 10 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing ara ware oasics o Many design decisions in information retrieval are based 
on hardware constraints.
• We begin by reviewing hardware basics that we'll need in this course.
Sojka, MR Group: PV211: Index construction 12 / 53
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
Hardware basics
• Access to data is much faster in memory than on disk (roughly a factor of 10 for SSDs, 100+ for rotational disks).
• Disk seeks are "idle" time: No data is transferred from disk while the disk head is being positioned.
• To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.
• Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
• Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than reading uncompressed data.
• Servers used in IR systems typically have many GBs of main memory and TBs of disk space.
• Fault tolerance is expensive: It's cheaper to use many regular machines than one fault-tolerant machine.
Sojka, MR Group: PV211: Index construction 13 / 53
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
symbol  statistic                                          value
s       average seek time                                  5 ms = 5 x 10^-3 s
b       transfer time per byte                             0.02 us = 2 x 10^-8 s
        processor's clock rate                             10^9 s^-1
p       lowlevel operation (e.g., compare & swap a word)   0.01 us = 10^-8 s
        size of main memory                                several GB
        size of disk space                                 1 TB or more
Sojka, MR Group: PV211: Index construction 14 / 53
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
• Shakespeare's collected works are not large enough for demonstrating many of the points in this course.
• As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
• English newswire articles sent over the wire in 1995 and 1996 (one year).
Sojka, MR Group: PV211: Index construction 15 / 53
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
(Screenshot of a Reuters news web page: "Extreme conditions create rare Antarctic clouds", Tue Aug 1, 2006.)
Sojka, MR Group: PV211: Index construction 16 / 53
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
symbol  statistic                                        value
N       documents                                        800,000
L       avg. # tokens per document                       200
M       terms (= word types)                             400,000
        avg. # bytes per token (incl. spaces/punct.)     6
        avg. # bytes per token (without spaces/punct.)   4.5
        avg. # bytes per term (= word type)              7.5
T       non-positional postings                          100,000,000
Exercise: Average frequency of a term (how many tokens)? 4.5 bytes per word token vs. 7.5 bytes per word type: why the difference? How many positional postings?
Sojka, MR Group: PV211: Index construction 17 / 53
Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing
Brutus —> 1 2 4 11 31 45 173 174
Caesar —> 1 2 4 5 6 16 57 132 ...
Calpurnia —> 2 31 54 101 V-v-' dictionary postings Sojka, MR Group: PV211: Index construction 19 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing term docID term docID 1 1 ambitious 2 did 1 be 2 enact 1 brutus 1 Julius 1 brutus 2 caesar 1 capitol 1 1 1 caesar 1 was 1 caesar 2 killed 1 caesar 2 i' 1 did 1 the 1 enact 1 capitol 1 hath 1 brutus 1 1 1 killed 1 1 1 me 1 —> V 1 so 2 ' it 2 let 2 Julius 1 it 2 killed 1 be 2 killed 1 with 2 let 2 caesar 2 me 1 the 2 noble 2 noble 2 so 2 brutus 2 the 1 hath 2 the 2 told 2 told 2 you 2 you 2 caesar 2 was 1 was 2 was 2 ambitious 2 with 2 mgs in m Sojka, MR Group: PV211: Index construction 20 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing • As we build index, we parse docs one at a time. o The final postings for any term are incomplete until the end. • Can we keep all postings in memory and then do the sort in-memory at the end? o No, not for large collections • Thus: We need to store intermediate results on disk. Sojka, MR Group: PV211: Index construction 21 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing r ai • Can we use the same index construction algorithm for larger collections, but by using disk instead of memory? o No: Sorting very large sets of records on disk is too slow - too many disk seeks. • We need an external sorting algorithm. Sojka, MR Group: PV211: Index construction 22 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing rtmg aigori • We must sort T = 100,000,000 non-positional postings. o Each posting has size 12 bytes (4+4+4: termID, docID, term frequency). • Define a block to consist of 10,000,000 such postings • We can easily fit that many postings into memory, o We will have 10 such blocks for RCV1. • Basic idea of algorithm: o For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk • Then merge the blocks into one long sorted order. Sojka, MR Group: PV211: Index construction 23 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing rgmg postings to be merged Block 1 brutus d3 caesar d4 noble d3 with d4 Block 2 brutus 62 caesar dl Julius dl killed 62 brutus 62 brutus 63 caesar 61 caesar d4 Julius dl killed 62 noble 63 with d4 / merged postings Sojka, MR Group: PV211: Index construction 24 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing ea indexing BSBIndexConstruction() 1 n • list(/c, v) —> output -ř list(termlD,doclD) —> (postings_listi, postings_list2, ((C, d2), (died,c/2), (C,c/i), (CAME.di), (C,c/i), (c'ed.c/i)) reduce: ((C,(d2,di,di)), (died, (d2)}, (came, (di)), (c'ed, (di))) -+ ((C,(di:2,d2:l)}, (died, (c/2:1)), (came, (c'ed, (c/i:1)>) Sojka, MR Group: PV211: Index construction 40 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing • What information does the task description contain that the master gives to a parser? • What information does the parser report back to the master upon completion of the task? • What information does the task description contain that the master gives to an inverter? • What information does the inverter report back to the master upon completion of the task? Sojka, MR Group: PV211: Index construction 41 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing ynamic indexing • Up to now, we have assumed that collections are static. 
o They rarely are: Documents are inserted, deleted and modified. o This means that the dictionary and postings lists have to be dynamically modified. Sojka, MR Group: PV211: Index construction 43 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing E ynamic indexing: oimp ppr • Maintain big main index on disk • New docs go into small auxiliary index in memory. • Search across both, merge results • Periodically, merge auxiliary index into big index • Deletions: o Invalidation bit-vector for deleted docs o Filter docs returned by index using this bit-vector Sojka, MR Group: PV211: Index construction 44 / 53 Recap Introduction BSBI algorithm 5PIMI algorithm Distribut ed indexing Dynamic indexing UXII nary ana mam ma ex • Frequent merges • Poor search performance during index merge Sojka, MR Group: PV211: Index construction 45 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing garitnmic merg • Logarithmic merging amortizes the cost of merging indexes over time. o —>► Users see smaller effect on response times. o Maintain a series of indexes, each twice as large as the previous one. • Keep smallest (Zq) in memory • Larger ones (/n, /i, ...) on disk o If Zq gets too big (> n), write to disk as /n o ... or merge with /q (if /n already exists) and write merger to 11 etc. Sojka, MR Group: PV211: Index construction 46 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing LMergeAddToken(/7?c/exes, Z0, token) 1 Z0 <- Merge(Z0, {token}) 2 if|Z0| = A7 3 then for /V 0 to oo 4 do if // G indexes 5 then Zi+1 <- Merge(//, Z,-) 6 /s 3 temporary index on disk.) 7 indexes <— indexes — {//} 8 else // <— Z/ (Z-, becomes the permanent index \\.) 9 indexes <— indexes U {//} 10 Break 11 Z0 <- 0 LogarithmicMerge() 1 Zo ^— 0 (Zo /s t/?e in-memory index.) 2 indexes <— 0 3 while true 4 do LMERGEADDToKEN(/a?c/exes, Z0, getNextToken()) Sojka, MR Group: PV211: Index construction 47 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing 3ina ry num oers: j o 0001 o 0010 o 0011 o 0100 o 0101 o 0110 o 0111 o 1000 o 1001 o 1010 o 1011 o 1100 Sojka, MR Group: PV211: Index construction 48 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing garitnmic merg • Number of indexes bounded by 0(log 7") (7 is total number of postings read so far) • So query processing requires the merging of 0(log 7") indexes o Time complexity of index construction is 0(7" log 7"). o . .. because each of T postings is merged 0(log 7") times. • Auxiliary index: index construction time is 0(7"2) as each posting is touched in each merge. • Suppose auxiliary index has size a © a + 2a + 3a + 4a + ... + na = a^^- = 0(n2) • So logarithmic merging is an order of magnitude more efficient. Sojka, MR Group: PV211: Index construction 49 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing ynamic indexing • Often a combination o Frequent incremental changes o Rotation of large parts of the index that can then be swapped in • Occasional complete rebuild (becomes harder with increasing size - not clear if Google can do a complete rebuild) Sojka, MR Group: PV211: Index construction 50 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing ing position • Basically the same problem except that the intermediate data structures are large. 
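To make the logarithmic merging scheme above concrete, here is a compact Python sketch (my own toy model, following the LMergeAddToken/LogarithmicMerge idea): an "index" is modeled simply as a sorted list of tokens, Z0 is the in-memory index with capacity n, and on-disk indexes I0, I1, ... hold n, 2n, 4n, ... tokens.

N = 4                              # capacity of the in-memory index Z0

class LogarithmicMerger:
    def __init__(self):
        self.z0 = []               # in-memory index Z0
        self.indexes = {}          # level i -> on-disk index of size N * 2**i (or absent)

    def add_token(self, token):
        self.z0 = sorted(self.z0 + [token])          # merge the token into Z0
        if len(self.z0) < N:
            return
        z, i = self.z0, 0                            # Z0 is full: flush it down the levels
        while i in self.indexes:
            z = sorted(self.indexes.pop(i) + z)      # merge with the existing index I_i
            i += 1
        self.indexes[i] = z                          # first free level becomes the permanent I_i
        self.z0 = []

    def query(self, token):
        # A query must consult Z0 and every on-disk index -- O(log T) of them.
        return token in self.z0 or any(token in idx for idx in self.indexes.values())

m = LogarithmicMerger()
for t in "the quick brown fox jumps over the lazy dog it was a dark night".split():
    m.add_token(t)
print(sorted(m.indexes))                # occupied levels, here [0, 1] after 14 tokens
print(m.query("fox"), m.query("cat"))   # True False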
Sojka, MR Group: PV211: Index construction 51 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing • Two index construction algorithms: BSBI (simple) and SPIMI (more realistic) • Distributed index construction: MapReduce • Dynamic index construction: how to keep the index up-to-date as the collection changes Sojka, MR Group: PV211: Index construction 52 / 53 Recap Introduction BSBI algorithm SPIMI algorithm Distributed indexing Dynamic indexing • Chapter 4 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o Original publication on MapReduce by Dean and Ghemawat (2004) o Original publication on SPIMI by Heinz and Zobel (2003) o YouTube video: Google data centers Sojka, MR Group: PV211: Index construction 53 / 53 Compression Term statistics Dictionary compression Postings compression PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 5: Index compression Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-28 Sojka, MR Group: PV211: Index compression 1/57 Compression Term statistics Dictionary compression Postings compression Q Compression Q Term statistics Q Dictionary compression Q Postings compression Sojka, MR Group: PV211: Index compression 2/57 Compression Term statistics Dictionary compression Postings compression m • Today: index compression, and vector space model • Next week: the whole picture of complete search system, scoring and ranking • In two weeks time: invited lectures (Seznam, Facebook Al) + midterm Sojka, MR Group: PV211: Index compression 3/57 Compression Term statistics Dictionary compression Postings compression For each term t, we store a list of all documents that contain f. Brutus ~| —► | 1 2 4 | 11 | 31 | 45 | 173 | 174~] Caesar | —► | 1 2 | 4 | 5 | 6 | 16 57 | 132 | ... Calpurnia~| —> | 2 | 31 | 54 | 101 | dictionary postings file • Motivation for compression in information retrieval systems • How can we compress the dictionary component of the inverted index? • How can we compress the postings component of the inverted index? o Term statistics: how are terms distributed in document collections? Sojka, MR Group: PV211: Index compression 4/57 For each term t, we store a list of all documents that contain t. Brutus —> 1 2 4 11 31 45 173 174 Caesar —> 1 2 4 5 6 16 57 132 .. . Calpurnia —> 2 31 54 101 dictionary postings file Today: • How much space do we need for the dictionary? • How much space do we need for the postings file? • How can we compress them? Sojka, MR Group: PV211: Index compression 6/57 • Use less disk space (saves money). • Keep more stuff in memory (increases speed). • Increase speed of transferring data from disk to memory (again, increases speed). [read compressed data and decompress in memory] is faster than [read uncompressed data] • Premise: Decompression algorithms are fast. • This is true of the decompression algorithms we will use. Sojka, MR Group: PV211: Index compression 7/57 Compression Term statistics Dictionary compression Postings compression I n in inTormaiion reiriev • First, we will consider space for dictionary: • Main motivation for dictionary compression: make it small enough to keep in main memory • Then for the postings file • Motivation: reduce disk space needed, decrease time needed to read from disk. 
• Note: Large search engines keep a significant part of the postings in memory.
• We will devise various compression schemes for dictionary and postings.
Sojka, MR Group: PV211: Index compression 8/57
Compression Term statistics Dictionary compression Postings compression
Lossy vs. lossless compression
• Lossy compression: Discard some information
• Several of the preprocessing steps we frequently use can be viewed as lossy compression:
• downcasing, stop words, porter, number elimination
• Lossless compression: All information is preserved.
• What we mostly do in index compression
Sojka, MR Group: PV211: Index compression 9/57
Compression Term statistics Dictionary compression Postings compression
symbol  statistic                                             value
N       documents                                             800,000
L       avg. # word tokens per document                       200
M       word types                                            400,000
        avg. # bytes per word token (incl. spaces/punct.)     6
        avg. # bytes per word token (without spaces/punct.)   4.5
        avg. # bytes per word type                            7.5
T       non-positional postings                               100,000,000
Sojka, MR Group: PV211: Index compression 11 / 57
Compression Term statistics Dictionary compression Postings compression
Effect of preprocessing for Reuters
                 word types (terms)        non-positional postings     positional postings (word tokens)
                 (dictionary)              (non-positional index)      (positional index)
                 size     d%   cml%        size          d%   cml%     size          d%   cml%
unfiltered       484,494                   109,971,179                 197,879,290
no numbers       473,723  -2   -2          100,680,242   -8   -8       179,158,204   -9   -9
case folding     391,523  -17  -19         96,969,056    -3   -12      179,158,204   -0   -9
30 stopw's       391,493  -0   -19         83,390,443    -14  -24      121,857,825   -31  -38
150 stopw's      391,373  -0   -19         67,001,847    -30  -39      94,516,599    -47  -52
stemming         322,383  -17  -33         63,812,300    -4   -42      94,516,599    -0   -52
Explain the differences between the non-positional and positional numbers: -3 vs -0, -14 vs -31, -30 vs -47, -4 vs -0.
Sojka, MR Group: PV211: Index compression 12 / 57
Compression Term statistics Dictionary compression Postings compression
Term vocabulary
• That is, how many distinct words are there?
• Can we assume there is an upper bound?
• Not really: At least 70^20 ~ 10^37 different words of length 20.
• The vocabulary will keep growing with collection size.
• Heaps' law: M = kT^b
• M is the size of the vocabulary, T is the number of tokens in the collection.
• Typical values for the parameters k and b are: 30 <= k <= 100 and b ~ 0.5.
• Heaps' law is linear in log-log space.
• It is the simplest possible relationship between collection size and vocabulary size in log-log space.
• Empirical law
Sojka, MR Group: PV211: Index compression 13 / 57
Compression Term statistics Dictionary compression Postings compression
(Plot of log10 M vs. log10 T.) Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 * log10 T + 1.64 is the best least-squares fit. Thus, M = 10^1.64 T^0.49, with k = 10^1.64 ~ 44 and b = 0.49.
Sojka, MR Group: PV211: Index compression 14 / 57
Compression Term statistics Dictionary compression Postings compression
• Good, as we just saw in the graph.
• Example: for the first 1,000,020 tokens Heaps' law predicts 38,323 terms: 44 x 1,000,020^0.49 ~ 38,323.
• The actual number is 38,365 terms, very close to the prediction.
• Empirical observation: fit is good in general.
Sojka, MR Group: PV211: Index compression 15 / 57
Compression Term statistics Dictionary compression Postings compression
Exercise
(1) What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps' law?
9 Compute vocabulary size M 9 Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. o Assume a search engine indexes a total of 20,000,000,000 (2 x 1010) pages, containing 200 tokens on average • What is the size of the vocabulary of the indexed collection as predicted by Heaps' law? Sojka, MR Group: PV211: Index compression 16 / 57 Compression Term statistics Dictionary compression Postings compression • Now we have characterized the growth of the vocabulary in collections. • We also want to know how many frequent vs. infrequent terms we should expect in a collection. • In natural language, there are a few very frequent terms and very many very rare terms. • Zipf's law: The /th most frequent term has frequency cf/ proportional to 1//. • cf/ oc i • cf/ is collection frequency: the number of occurrences of the term t\ in the collection. Sojka, MR Group: PV211: Index compression 17 / 57 Compression Term statistics Dictionary compression Postings compression • Zipf's law: The / most frequent term has frequency proportional to 1//. • cf / oc i • cf is collection frequency: the number of occurrences of the term in the collection. • So if the most frequent term (the) occurs cfi times, then the second most frequent term (of) has half as many occurrences cf2 = |cfi .. . o .. . and the third most frequent term (and) has a third as many occurrences cf3 = |cfi etc. • Equivalent: cf; = cik and logcf; = log c +/clog / (for k = —1) o Example of a power law Sojka, MR Group: PV211: Index compression 18 / 57 Compression Term statistics Dictionary compression Postings compression o o o Fit is not great. What is important is the key insight: Few frequent terms, many rare terms. Iog10 rank Sojka, MR Group: PV211: Index compression 19 / 57 Compression Term statistics Dictionary compression Postings compression mpression • The dictionary is small compared to the postings file. • But we want to keep it in memory. • Also: competition with other applications, cell phones, onboard computers, fast startup time • So compressing the dictionary is important. Sojka, MR Group: PV211: Index compression 21 / 57 Compression Term statistics Dictionary compression Postings compression term document pointer to frequency postings list a 656,265 —> aachen 65 —> zulu 221 —> space needed: 20 bytes 4 bytes 4 bytes Space for Reuters: (20+4+4)*400,000 = 11.2 MB Sojka, MR Group: PV211: Index compression 22 / 57 • Most of the bytes in the term column are wasted. • We allot 20 bytes for terms of length 1. • We cannot handle hydrochlorofluorocarbons and supercalifragilisticexpialidocious 9 Average length of a term in English: 8 characters (or a littl bit less) • How can we use on average 8 characters per term? Sojka, MR Group: PV211: Index compression 23 / 57 Compression Term statistics Dictionary compression Postings compression . ..systi Iesyzygeticsyzygia Isyzygyszaibe Iyiteszecinszono... 
4 bytes 4 bytes 3 bytes Sojka, MR Group: PV211: Index compression 24 / 57 Compression Term statistics Dictionary compression Postings compression r dictionary • 4 bytes per term for frequency • 4 bytes per term for pointer to postings list • 8 bytes (on average) for term in string • 3 bytes per pointer into string (need log28 • 400,000 < 24 bits to resolve 8 • 400,000 positions) • Space: 400,000 x (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for fixed-width array) Sojka, MR Group: PV211: Index compression 25 / 57 Compression Term statistics Dictionary compression Postings compression . . .7syst i Ie9syzyget i cSsyzyg i a Iösyzygyllsza ibelyiteöszecin. Sojka, MR Group: PV211: Index compression 26 / 57 Compression Term statistics Dictionary compression Postings compression ocking r dictionary • Example block size k — 4 • Where we used 4x3 bytes for term pointers without blocking • ... we now use 3 bytes for one pointer plus 4 bytes for indicating the length of each term. • We save 12 - (3 + 4) = 5 bytes per block. • Total savings: 400,000/4*5 = 0.5 MB o This reduces the size of the dictionary from 7.6 MB to 7.1 MB. Sojka, MR Group: PV211: Index compression 27 / 57 Compression Term statistics Dictionary compression Postings compression Sojka, MR Group: PV211: Index compression 28 / 57 Compression Term statistics Dictionary compression Postings compression Sojka, MR Group: PV211: Index compression 29 / 57 Compression Term statistics Dictionary compression Postings compression One block in blocked compression (k = 4) ... 8automata8automate9automatic 10 automation . . .further compressed with front coding. 8automat*aloe2o i c 3 o i o n Sojka, MR Group: PV211: Index compression 30 / 57 Compression Term statistics Dictionary compression Postings compression mpression Tor ummar data structure size in MB dictionary, fixed-width 11.2 dictionary, term pointers into string 7.6 ~, with blocking, k = 4 7.1 ~, with blocking & front coding 5.9 Sojka, MR Group: PV211: Index compression 31 / 57 Compression Term statistics Dictionary compression Postings compression • Which prefixes should be used for front coding? What are the tradeoffs? • Input: list of terms (= the term vocabulary) • Output: list of prefixes that will be used in front coding Sojka, MR Group: PV211: Index compression 32 / 57 Compression Term statistics Dictionary compression Postings compression g o The postings file is much larger than the dictionary, factor of at least 10. • Key desideratum: store each posting compactly • A posting for our purposes is a doclD. 9 For Reuters (800,000 documents), we would use 32 bits per doclD when using 4-byte integers. • Alternatively, we can use log2 800,000 « 19.6 < 20 bits per doclD. • Our goal: use a lot less than 20 bits per doclD. Sojka, MR Group: PV211: Index compression 34 / 57 Compression Term statistics Dictionary compression Postings compression gaps in • Each postings list is ordered in increasing order of doclD. • Example postings list: computer: 283154, 283159, 283202, ... • It suffices to store gaps: 283159 - 283154 = 5, 283202 - 283159 = 43 • Example postings list using gaps: computer: 283154, 5, 43, ... • Gaps for frequent terms are small. • Thus: We can encode small gaps with fewer than 20 bits. Sojka, MR Group: PV211: Index compression 35 / 57 Compression Term statistics Dictionary compression Postings compression encoding postings list THE doclDs gaps 283042 1 283043 1 283044 1 283045 ... COMPUTER doclDs gaps 283047 107 283154 5 283159 43 283202 ... 
ARACHNOCENTRic doclDs 252000 500100 gaps 252000 248100 Sojka, MR Group: PV211: Index compression 36 / 57 Compression Term statistics Dictionary compression Postings compression • Aim: o For arachnocentric and other rare terms, we will use about 20 bits per gap (= posting). • For the and other very frequent terms, we will use only a few bits per gap (= posting). • In order to implement this, we need to devise some form of variable length encoding. • Variable length encoding uses few bits for small gaps and many bits for large gaps. Sojka, MR Group: PV211: Index compression 37 / 57 Compression Term statistics Dictionary compression Postings compression • Used by many commercial/research systems • Good low-tech blend of variable-length coding and sensitivity to alignment matches (bit-level codes, see later). • Dedicate 1 bit (high bit) to be a continuation bit c. • If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1. • Else: encode lower-order 7 bits and then use one or more additional bytes to encode the higher order bits using the same algorithm. • At the end set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0). Sojka, MR Group: PV211: Index compression 38 / 57 Compression Term statistics Dictionary compression Postings compression doclDs 824 829 215406 gaps 5 214577 VB code 00000110 10111000 10000101 00001101 00001100 10110001 Sojka, MR Group: PV211: Index compression 39 / 57 Compression Term statistics Dictionary compression Postings compression mg aigi VBEncodeNumber(a?) 1 bytes ► 1101 —>► 101 = offset • Length is the length of offset. • For 13 (offset 101), this is 3. o Encode length in unary code: 1110. 9 Gamma code of 13 is the concatenation of length and offset 1110101. Sojka, MR Group: PV211: Index compression 44 / 57 Compression Term statistics Dictionary compression Postings compression Gamma code (7) examples number unary code length offset 7 code 0 0 1 10 0 0 2 110 10 0 10,0 3 1110 10 1 10,1 4 11110 110 00 110,00 9 1111111110 1110 001 1110,001 13 1110 101 1110,101 24 11110 1000 11110,1000 511 111111110 11111111 111111110,11111111 1025 11111111110 0000000001 11111111110,0000000001 Sojka, MR Group: PV211: Index compression 45 / 57 Compression Term statistics Dictionary compression Postings compression i Sojka, MR Group: PV211: Index compression 46 / 57 Compression Term statistics Dictionary compression Postings compression univer v& unary code a(N) = U^J. 0. a(4) = 11110 N » binary code /3(1) = 1, /3(2N + j) = (3(N)j, j = 0,1. /3(4) = 100 ^ f3 is not uniquely decodable (it is not a prefix code). » ternary r(/V) = /3(A/)#. r(4) = 100# » /3'(1) = 6, /3'(2/V) = /3'(A/)0, /3'(2/V + 1) = /3'(A/)1, r'(/V) = /3/(A/)#. /3'(4) = 00. » 7(/V) = a\P'(N)\P'(N). 7(4) = 11000 alternatively 7': every bit /3f(N) is inserted between a pair from a(\/3'(N)\). the same length as 7 (bit permutation 7(A/)), but less readable i®" example: 77(4) = 10100 C7 = {7(/V) : A/ > 0} = (1{0,1})*0 is regular and therefore it is decodable by finite automaton. Sojka, MR Group: PV211: Index compression 46 / 57 Compression Term statistics Dictionary compression Postings compression a, omega: tormai deTinition las codes: gamma, de «r S(N) = >y(\ß{N)\)ß'(N) «" example: 5(4) = 7(3)00 = 11000 es? decoder 6: 6(1) = 1001? K" := 0; while [\og2(N)\ > 0 do begin K :=ß(N)K; N := Llog2(A/)J end. 
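To make the byte-level and bit-level codes above concrete, here is a minimal Python sketch (function names are my own) of the variable byte and gamma encoders as defined on these slides: VB sets the continuation bit only on the last byte of a gap, and the gamma code concatenates the unary-coded offset length with the offset. The printed values match the examples in the tables above.

```python
def vb_encode(gap):
    """Variable byte code of a single gap, returned as a string of bits."""
    chunks = []
    while True:
        chunks.insert(0, gap % 128)   # lower-order 7 bits first (prepended)
        if gap < 128:
            break
        gap //= 128
    chunks[-1] += 128                 # set continuation bit c=1 on the last byte
    return " ".join(f"{byte:08b}" for byte in chunks)

def gamma_encode(gap):
    """Gamma code: unary code of the offset length, then the offset
    (binary representation of the gap with the leading 1 removed)."""
    offset = bin(gap)[3:]             # strip '0b' and the leading 1
    length = "1" * len(offset) + "0"  # unary code of len(offset)
    return length + offset

# Examples from the slides:
print(vb_encode(824))       # 00000110 10111000
print(vb_encode(5))         # 10000101
print(vb_encode(214577))    # 00001101 00001100 10110001
print(gamma_encode(13))     # 1110101
print(gamma_encode(511))    # 111111110 followed by 11111111
```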
Sojka, MR Group: PV211: Index compression 47 / 57 Compression Term statistics Dictionary compression Postings compression • Compute the variable byte code of 130 o Compute the gamma code of 130 • Compute 5(42) Sojka, MR Group: PV211: Index compression 48 / 57 Compression Term statistics Dictionary compression Postings compression • The length of offset is [log2 G\ bits. • The length of length is [log2 G\ + 1 bits, o So the length of the entire code is 2 x [log2 G\ + 1 bits. • 7 codes are always of odd length. o Gamma codes are within a factor of 2 of the optimal encoding length log2 G. • (assuming the frequency of a gap G is proportional to log2 G -only approximately true) Sojka, MR Group: PV211: Index compression 49 / 57 • Gamma code is prefix-free: a valid code word is not a prefix of any other valid code. • Encoding is optimal within a factor of 3 (and within a factor of 2 making additional assumptions). o This result is independent of the distribution of gaps! • We can use gamma codes for any distribution. Gamma code is universal. • Gamma code is parameter-free. Sojka, MR Group: PV211: Index compression 50 / 57 Compression Term statistics Dictionary compression Postings compression mm [tri ignm 9 Machines have word boundaries - 8, 16, 32 bits • Compressing and manipulating at granularity of bits can be slow. • Variable byte encoding is aligned and thus potentially more efficient. • Another word aligned scheme: Anh and Moffat 2005 • Regardless of efficiency, variable byte is conceptually simpler at little additional space cost. Sojka, MR Group: PV211: Index compression 51 / 57 Compression Term statistics Dictionary compression Postings compression mpression data structure size in MB dictionary, fixed-width 11.2 dictionary, term pointers into string 7.6 ~, with blocking, k = 4 7.1 ~, with blocking & front coding 5.9 collection (text, xml markup etc) 3600.0 collection (text) 960.0 T/D incidence matrix 40,000.0 postings, uncompressed (32-bit words) 400.0 postings, uncompressed (20 bits) 250.0 postings, variable byte encoded 116.0 postings, 7 encoded 101.0 Sojka, MR Group: PV211: Index compression 52 / 57 Compression Term statistics Dictionary compression Postings compression umení inciaen Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar. Entry is 0 if term does not occur. Example: Calpurnia doesn't occur in The tempest. Sojka, MR Group: PV211: Index compression 53 / 57 Compression Term statistics Dictionary compression Postings compression mpression data structure size in MB dictionary, fixed-width 11.2 dictionary, term pointers into string 7.6 ~, with blocking, k = 4 7.1 ~, with blocking & front coding 5.9 collection (text, xml markup etc) 3600.0 collection (text) 960.0 T/D incidence matrix 40,000.0 postings, uncompressed (32-bit words) 400.0 postings, uncompressed (20 bits) 250.0 postings, variable byte encoded 116.0 postings, 7 encoded 101.0 Sojka, MR Group: PV211: Index compression 54 / 57 Compression Term statistics Dictionary compression Postings compression ummary • We can now create an index for highly efficient Boolean retrieval that is very space efficient. • Only 4% of the total size of the collection. • Only 10-15% of the total size of the text in the collection. • However, we've ignored positional and frequency information. 
• For this reason, space savings are less in reality. Sojka, MR Group: PV211: Index compression 55 / 57 Compression Term statistics Dictionary compression Postings compression For each term t, we store a list of all documents that contain f. Brutus ~| —► | 1 2 4 | 11 | 31 | 45 | 173 | 174~] Caesar | —► | 1 2 | 4 | 5 | 6 | 16 57 | 132 | ... Calpurnia~| —> | 2 | 31 | 54 | 101 | dictionary postings file • Motivation for compression in information retrieval systems • How can we compress the dictionary component of the inverted index? • How can we compress the postings component of the inverted index? o Term statistics: how are terms distributed in document collections? Sojka, MR Group: PV211: Index compression 56 / 57 http://ske.f i.muni.cz • Chapter 5 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a). • Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002). o More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006). Sojka, MR Group: PV211: Index compression 57 / 57 Why ranked retrieval? Term frequency tf-idf weighting The vector space model PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-07 Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 1/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model Q Why ranked retrieval? Q Term frequency Q tf-idf weighting Q The vector space model Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 2/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model o Ranking search results: why it is important (as opposed to just presenting a set of unordered Boolean results) o Term frequency: This is a key ingredient for ranking. • Tf-idf ranking: best known traditional ranking scheme • Vector space model: One of the most important formal models for information retrieval (along with Boolean and probabilistic models) Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 3/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • Thus far, our queries have been Boolean. • Documents either match or do not. • Good for expert users with precise understanding of their needs and of the collection. • Also good for applications: Applications can easily consume 1000s of results. • Not good for the majority of users • Most users are not capable of writing Boolean queries ... j ... or they are, but they think it's too much work. o Most users don't want to wade through 1000s of results. o This is particularly true of web search. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 5/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model n mm • Boolean queries often result in either too few (=0) or too many (1000s) results. 
• Query 1 (boolean conjunction): [standard user dlink 650] o -> 200,000 hits - feast • Query 2 (boolean conjunction): [standard user dlink 650 no card found] o —y 0 hits - famine • In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 6/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model mm m in ran • With ranking, large result sets are not an issue, o Just show the top 10 results o Doesn't overwhelm the user • Premise: the ranking algorithm works: More relevant results are ranked higher than less relevant results. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 7/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • We wish to rank documents that are more relevant higher than documents that are less relevant. • How can we accomplish such a ranking of the documents in the collection with respect to a query? • Assign a score to each query-document pair, say in [0,1]. • This score measures how well document and query "match". Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 8/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • How do we compute the score of a query-document pair? o Let's start with a one-term query. o If the query term does not occur in the document: score should be 0. • The more frequent the query term in the document, the higher the score. • We will look at a number of alternatives for doing this. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 9/56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • A commonly used measure of overlap of two sets • Let A and B be two sets • Jaccard coefficient: jaccard(^, B) = AnB AuB (A ^ 0 or B ^ 0) o jaccard(^,^) = 1 o jaccard(^,e) = oif^ne = o • A and B don't have to be the same size. • Always assigns a number between 0 and 1 Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 10 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model 9 What is the query-document match score that the Jaccard coefficient computes for: o Query: "ides of March" o Document "Caesar died in March" o jaccard(<7, d) = 1/6 Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 11 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • It doesn't consider term frequency (how many occurrences a term has). • Rare terms are more informative than frequent terms. Jaccard does not consider this information. • We need a more sophisticated way of normalizing for the length of a document. o Later in this lecture, we'll use \A n B\/^/\A U B\ (cosine) ... • .. . instead of \A n B\/\A U B\ (Jaccard) for length normalization. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 12 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model mary in Anthony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 Each document is represented as a binary vector G {0,1} v Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 14 / 56 Why ranked retrieval? 
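As a quick check of the Jaccard example above, a minimal sketch (naive whitespace tokenization and lowercasing; names my own):

```python
# Jaccard match score on the sets of terms of query and document.
def jaccard(query, document):
    q, d = set(query.lower().split()), set(document.lower().split())
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)

# Example from the slide: intersection {march}, union of 6 terms -> 1/6.
print(jaccard("ides of March", "Caesar died in March"))  # ~0.1667
```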
Term frequency tf-idf weighting The vector space model Anthony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Anthony 157 73 0 0 0 1 Brutus 4 157 0 2 0 0 Caesar 232 227 0 2 1 0 Calpurnia 0 10 0 0 0 0 Cleopatra 57 0 0 0 0 0 mercy 2 0 3 8 5 8 worser 2 0 1 1 1 5 Each document is now represented as a count vector G N' Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 15 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • We do not consider the order of words in a document. • John is quicker than Mary and Mary is quicker than John are represented the same way. o This is called a bag of words model. • In a sense, this is a step back: The positional index was able to distinguish these two documents. • We will look at "recovering" positional information later in this course. • For now: bag of words model Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 16 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model E erm irequency • The term frequency tft?cy of term t in document d is defined as the number of times that t occurs in d. • We want to use tf when computing query-document match scores. • But how? • Raw term frequency is not what we want because: • A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term. • But not 10 times more relevant. o Relevance does not increase proportionally with term frequency. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 17 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model quency: Log irequency weigi • The log frequency weight of term t in d is defined as follows w 1 + log10 tft,d 'f tft,d > 0 l'd I 0 otherwise • tft,d ~> ™t,d- 0 0, 1 1, 2 1.3, 10 2, 1000 4, etc. • Score for a document-query pair: sum over terms t in both q and c/: tf-matching-score(q, d) = EteqncX1 + logtft,cy) • The score is 0 if none of the query terms is present in the document. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 18 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • Compute the Jaccard matching score and the tf matching score for the following query-document pairs. • q: [information on cars] d: "all you've ever wanted to know about cars" o q: [information on cars] d: "information on trucks, information on planes, information on trains" o q: [red cars and red trucks] d: "cops stop red cars more often" Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 19 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model iquency in aocum ;quency in • In addition, to term frequency (the frequency of the term in the document) .. . • ... we also want to use the frequency of the term in the collection for weighting and ranking. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 21 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • Rare terms are more informative than frequent terms. • Consider a term in the query that is rare in the collection (e.g., ARACHNOCENTRIC). • A document containing this term is very likely to be relevant, o —>► We want high weights for rare terms like ARACHNOCENTRIC. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 22 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model :qu rm • Frequent terms are less informative than rare terms. 
• Consider a term in the query that is frequent in the collection (e.g., good, increase, line). • A document containing this term is more likely to be relevant than a document that doesn't .. . • ... but words like good, increase and line are not sure indicators of relevance. o —>► For frequent terms like good, increase, and line, we want positive weights . .. • .. . but lower weights than for rare terms. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 23 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model ument Trequen 9 We want high weights for rare terms like arachnocentric. • We want low (positive) weights for frequent words like good, increase, and line. • We will use document frequency to factor this into computing the matching score. o The document frequency is the number of documents in the collection that the term occurs in. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 24 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model weig • dft is the document frequency, the number of documents that t occurs in. • dft is an inverse measure of the informativeness of term t. • We define the idf weight of term t as follows: -a* i N 'dft = logic ^ (A/ is the number of documents in the collection.) • idft is a measure of the informativeness of the term. • [log A//dft] instead of [A//dft] to "dampen" the effect of idf • Note that we use the log transformation for both term frequency and document frequency. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 25 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model term dft idft calpurnia 1 6 animal 100 4 Sunday 1000 3 fly 10,000 2 under 100,000 1 the 1,000,000 0 Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 26 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model n ranKin • idf affects the ranking of documents for queries with at least two terms. • For example, in the query "arachnocentric line", idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line. • idf has little effect on ranking for one-term queries. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 27 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model :ion irequency vs. :umeni irequency word collection frequency document frequency insurance 10440 3997 try 10422 8760 • Collection frequency of t\ number of tokens of t in the collection d Document frequency of t\ number of documents t occurs in • Why these numbers? • Which word is a better search term (and should get a higher weight)? • This example suggests that df (and idf) is better for weighting than cf (and "icf"). Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 28 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model -idT weighting o The tf-idf weight of a term is the product of its tf weight and its idf weight. o wt,c/ = (l + logtftjC/)- log^ a tf-weight a idf-weight o Best known weighting scheme in information retrieval o Note: the "-" in tf-idf is a hyphen, not a minus sign! a Alternative names: tf.idf, tf x idf Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 29 / 56 Why ranked retrieval? 
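The idf table above implicitly assumes N = 1,000,000 documents (idf of "the" is 0). A minimal sketch that reproduces that table and combines tf and idf into the tf-idf weight just defined (function names are my own):

```python
from math import log10

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return log10(N / df)

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df), 0 if the term is absent."""
    return (1 + log10(tf)) * idf(df, N) if tf > 0 else 0.0

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(df, N))        # 6, 4, 3, 2, 1, 0 as in the table

print(tf_idf(tf=10, df=100, N=N))  # (1 + 1) * 4 = 8.0
```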
Term frequency tf-idf weighting The vector space model ummary: • Assign a tf-idf weight for each term t in each document d wt,d = (1 + logtftjC/) • log j)r o The tf-idf weight ... o . .. increases with the number of occurrences within a document, (term frequency) «... increases with the rarity of the term in the collection. (inverse document frequency) Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 30 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model [•[•MlliiTdilBi Quantity Symbol Definition term frequency tft,d document frequency dft collection frequency cft number of occurrences of t in d number of documents in the collection that t occurs in total number of occurrences of t in the collection • Relationship between df and cf? o Relationship between tf and cf? • Relationship between tf and df? Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 31 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model mary in Anthony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 Each document is represented as a binary vector G {0,1} v Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 33 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model Anthony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Anthony 157 73 0 0 0 1 Brutus 4 157 0 2 0 0 Caesar 232 227 0 2 1 0 Calpurnia 0 10 0 0 0 0 Cleopatra 57 0 0 0 0 0 mercy 2 0 3 8 5 8 worser 2 0 1 1 1 5 Each document is now represented as a count vector G N' Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 34 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model mary Anthony Julius The Hamlet Othello Macbe and Caesar Tempest Cleopatra Anthony 5.25 3.18 0.0 0.0 0.0 0.35 Brutus 1.21 6.10 0.0 1.0 0.0 0.0 Caesar 8.59 2.54 0.0 1.51 0.25 0.0 Calpurnia 0.0 1.54 0.0 0.0 0.0 0.0 Cleopatra 2.85 0.0 0.0 0.0 0.0 0.0 mercy 1.51 0.0 1.90 0.12 5.25 0.88 worser 1.37 0.0 0.11 4.15 0.25 1.95 Each document is now represented as a real-valued vector of tf-idf weights GRl^l. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 35 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • Each document is now represented as a real-valued vector of tf-idf weights e R v\ • So we have a | \/|-dimensional real-valued vector space, o Terms are axes of the space. • Documents are points or vectors in this space. • Very high-dimensional: tens of millions of dimensions when you apply this to web search engines • Each vector is very sparse - most entries are zero. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 36 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space • Key idea 2: Rank documents according to their proximity to the query • proximity = similarity • proximity « negative distance • Recall: We're doing this because we want to get away from the you're-either-in-or-out, feast-or-famine Boolean model. • Instead: rank relevant documents higher than nonrelevant documents. Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 37 / 56 Why ranked retrieval? 
Term frequency tf-idf weighting The vector space model
• First cut: (negative) distance between two points
• (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea ...
• ... because Euclidean distance is large for vectors of different lengths.
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 38 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
[figure: query q and documents d1, d2, d3 in a two-term vector space]
The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar. Questions about basic vector space setup?
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 39 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
• Rank documents according to angle with query.
• Thought experiment: take a document d and append it to itself. Call this document d'. d' is twice as long as d.
• "Semantically" d and d' have the same content.
• The angle between the two documents is 0, corresponding to maximal similarity ...
• ... even though the Euclidean distance between the two documents can be quite large.
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 40 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
From angles to cosines
• The following two notions are equivalent:
• Rank documents according to the angle between query and document in increasing order.
• Rank documents according to cosine(query, document) in decreasing order.
• Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°].
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 41 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
[figure: plot of the cosine function on the interval [0°, 180°]]
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 42 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
• How do we compute the cosine?
• A vector can be (length-) normalized by dividing each of its components by its length; here we use the L2 norm: ||x||_2 = sqrt(Σ_i x_i^2).
• This maps vectors onto the unit sphere ...
• ... since after normalization: ||x||_2 = sqrt(Σ_i x_i^2) = 1.0.
• As a result, longer documents and shorter documents have weights of the same order of magnitude.
• Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 43 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
Cosine similarity between query and document
cos(q, d) = sim(q, d) = (q · d) / (|q| · |d|) = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i^2) · sqrt(Σ_{i=1..|V|} d_i^2) )
• q_i is the tf-idf weight of term i in the query.
• d_i is the tf-idf weight of term i in the document.
• |q| and |d| are the lengths of q and d.
• This is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 44 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
Cosine for normalized vectors
• For normalized vectors, the cosine is equivalent to the dot product or scalar product.
• cos(q, d) = q · d = Σ_i q_i · d_i
• (if q and d are length-normalized).
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 45 / 56
Why ranked retrieval? Term frequency tf-idf weighting The vector space model
[figure: cosine similarity illustrated]
Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 46 / 56
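Putting length normalization and the cosine formula together, a minimal sketch (plain Python lists; function names are my own) that also checks the thought experiment above: d' = d appended to itself has cosine similarity 1 with d, even though their Euclidean distance is large.

```python
from math import sqrt

def l2_normalize(v):
    """Divide each component by the vector's L2 norm (maps onto the unit sphere)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(q, d):
    """Cosine similarity = dot product of the length-normalized vectors."""
    q, d = l2_normalize(q), l2_normalize(d)
    return sum(qi * di for qi, di in zip(q, d))

d = [3.0, 0.0, 2.0]                    # made-up tf-idf weights
d_doubled = [2 * x for x in d]         # "d appended to itself"
print(cosine(d, d_doubled))            # ~1.0 (maximal similarity)
print(sqrt(sum((a - b) ** 2 for a, b in zip(d, d_doubled))))  # Euclidean distance is not small
```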
Term frequency tf-idf weighting The vector space model in How similar are these novels? SaS: Sense and Sensibility PaP: Pride and Prejudice WH: Wuthering Heights term frequencies (counts) term SaS PaP WH AFFECTION 115 58 20 JEALOUS 10 7 11 GOSSIP 2 0 6 WUTHERING 0 0 38 Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 47 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model term frequencies (counts) log frequency weighting term SaS PaP WH term SaS PaP WH AFFECTION 115 58 20 AFFECTION 3.06 2.76 2.30 JEALOUS 10 7 11 JEALOUS 2.0 1.85 2.04 GOSSIP 2 0 6 GOSSIP 1.30 0 1.78 WUTHERING 0 0 38 WUTHERING 0 0 2.58 (To simplify this example, we don't do idf weighting.) Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 48 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model log frequency weighting log frequency weighting &i cosine normalization term SaS PaP WH term SaS PaP WH AFFECTION 3.06 2.76 2.30 AFFECTION 0.789 0.832 0.524 JEALOUS 2.0 1.85 2.04 JEALOUS 0.515 0.555 0.465 GOSSIP 1.30 0 1.78 GOSSIP 0.335 0.0 0.405 WUTHERING 0 0 2.58 WUTHERING 0.0 0.0 0.588 • cos(SaS,PaP) fa 0.789 * 0.832 + 0.515 * 0.555 + 0.335 * 0.0 + 0.0 * 0.0 « 0.94. • cos(SaS,WH) « 0.79 • cos(PaP.WH) fa 0.69 • Why do we have cos(SaS.PaP) > cos(SAS,WH)? Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 49 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model mpunng C0SINESC0RE(q) 1 float Scores[N] = 0 2 float Length[N] 3 for each query term t 4 do calculate wt?c/ and fetch postings list for t 5 for each pair(c/,tftc/) in postings list 6 do Scores[d]+ = wtjC/ x wt?c/ 7 Read the array Length 8 for each d 9 do Scores[d] = Scores[d]/Length[d] 10 return Top /< components of Scores[] Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 50 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model mponerr g Term frequency Document frequency Normalization n (natural) tft,d 1 (logarithm) 1 + log(tftjd) a (augmented) 0.5 + °-5x^^ maxt(trt)d) , f. . , f 1 if tft d > 0 b (boolean) 1 . ' . [0 otherwise L floe ave) i+iog(tft,d) n (no) 1 t (idf) log p (prob idf) max{0,log N~dft} n (none) c (cosine) u (pivoted 1/u unique) b (byte size) 1/CharLengtha, a < 1 Best known combination of weighting options Default: no weighting Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 51 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model X • We often use different weightings for queries and documents. • Notation: ddd.qqq o Example: Inc.ltn • document: logarithmic tf, no df weighting, cosine normalization • query: logarithmic tf, idf, no normalization • Isn't it bad to not idf-weight the document? • Example query: "best car insurance" • Example document: "car insurance auto insurance" Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 52 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model Query: "best car insurance". Document: "car insurance auto insurance". 
word tf-raw tf-wght query df idf weight tf-raw document tf-wght weight n'lized product auto 0 0 5000 2.3 0 1 1 1 0.52 0 best 1 1 50000 1.3 1.3 0 0 0 0 0 car 1 1 10000 2.0 2.0 1 1 1 0.52 1.04 insurance 1 1 1000 3.0 3.0 2 1.3 1.3 0.68 2.04 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n'lized: document weights after cosine normalization, product: the product of final query weight and final document weight Vl2 + 02 + l2 + 1.32 « 1.92 1/1.92 « 0.52 1.3/1.92 « 0.68 Final similarity score between query and document: J2i wqi ' wdi = 0 + 0 + 1.04 + 2.04 = 3.08 Questions? Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 53 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model ummary: o Represent the query as a weighted tf-idf vector • Represent each document as a weighted tf-idf vector • Compute the cosine similarity between the query vector and each document vector • Rank documents with respect to the query • Return the top K (e.g., K = 10) to the user Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 54 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model o Ranking search results: why it is important (as opposed to just presenting a set of unordered Boolean results) o Term frequency: This is a key ingredient for ranking. • Tf-idf ranking: best known traditional ranking scheme • Vector space model: One of the most important formal models for information retrieval (along with Boolean and probabilistic models) Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 55 / 56 Why ranked retrieval? Term frequency tf-idf weighting The vector space model • Chapter 6 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o Vector space for dummies o Exploring the similarity space (Moffat and Zobel, 2005) • Okapi BM25 (a state-of-the-art weighting method, 11.4.3 of MR) Sojka, MR Group: PV211: Scoring, term weighting, the vector space model 56 / 56 Recap Why rank? More on cosine The complete search system Implementation of ranking PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 7: Scores in a complete search system Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-07 Sojka, MR Group: PV211: Scores in a complete search system 1/61 Recap Why rank? More on cosine The complete search system Implementation of ranking Q Recap Q Why rank? ore on cosine Q The complete search system Q Implementation of ranking Sojka, MR Group: PV211: Scores in a complete search system 2/61 Recap Why rank? More on cosine The complete search system Implementation of ranking E erm 1 frequency weig • The log frequency weight of term t in d is defined as follows w 1 + log10 tftjC/ if tfM > 0 l'd I 0 otherwise Sojka, MR Group: PV211: Scores in a complete search system 4/61 Recap Why rank? More on cosine The complete search system Implementation of ranking ■ jr ht idt weig • The document frequency dft is defined as the number of documents that t occurs in. • df is an inverse measure of the informativeness of the term. 
• We define the idf weight of term t as follows: idf_t = log10(N/df_t).
• idf is a measure of the informativeness of the term.
Sojka, MR Group: PV211: Scores in a complete search system 5/61
Recap Why rank? More on cosine The complete search system Implementation of ranking
tf-idf weight
• The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = (1 + log tf_{t,d}) · log(N/df_t).
Ranking is important because it effectively reduces a large set of results to a very small one.
• Next: More data on "users only look at a few results".
• Actually, in the vast majority of cases they only examine 1, 2, or 3 results.
Sojka, MR Group: PV211: Scores in a complete search system 12 / 61
Recap Why rank? More on cosine The complete search system Implementation of ranking
• The following slides are from Dan Russell's JCDL talk.
• Dan Russell was the "Uber Tech Lead for Search Quality & User Happiness" at Google.
• How can we measure how important ranking is?
• Observe what searchers do when they are searching in a controlled setting:
• Videotape them
• Ask them to "think aloud"
• Interview them
• Eye-track them
• Time them
• Record and count their clicks
Sojka, MR Group: PV211: Scores in a complete search system 13 / 61
[interview video transcript] So ... did you notice the FTD official site? To be honest, I didn't even look at that. At first I saw "from $20" and $20 is what I was looking for. To be honest, 1800-flowers is what I'm familiar with and why I went there next, even though I kind of assumed they wouldn't have $20 flowers. And you knew they were expensive? I knew they were expensive but I thought "hey, maybe they've got some flowers for under $20 here ..." But you didn't notice the FTD? No I didn't, actually ... that's really funny.
Rapidly scanning the results. Note scan pattern on page 3: Result 1, Result 2, Result 3, Result 4, Result 3, Result 2, Result 4, Result 5, Result 6. Q: Why do this? A: What's learned later influences judgment of earlier content.
[screenshot: eye-tracking scan path over a Google results page for the query "children's unicycle"]
Kinds of behaviors we see in the data: query reformulation, stacking behavior.
How many links do users view? [chart: total number of abstracts viewed per page] Mean: 3.07, Median/Mode: 2.00
Looking vs. Clicking: [chart: number of times a result is selected and time spent in its abstract, by rank of result] Users view results one and two more often / thoroughly. Users click most frequently on result one.
Presentation bias (reversed results): Order of presentation influences where users look AND where they click. [chart: fixation and click positions for normal vs. swapped result order]
led a charity event Saturday that benefits the Orangewood Children's Foundation The Unicycle Club ol Southern ... www. oc register com''crcregister/news/homepage/ar1icle_1293785.php - 31k -Cached - Similar pane; li* steps V Go #Ie 25 Kinds of behaviors we see in the data Query reform Stacking behavior Go £le How many links do users view? Total number of abstracts viewed per page 120 Total number of abstracts viewed Go #le Mean: 3.07 Median/Mode: 2.00 28 Looking vs. Clicking Si E 180 160 120 100 80 60 40 20 0 ■ # times result selected □ time spent in abstract 4 Li 11 - - ■ L Q_D _L 1 5 Rank of result 8 9 10 Users view results one and two more often / thoroughly Users click most frequently on result one 11 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 o E a> E Go sjle 29 Presentation bias - reversed results Order of presentation influences where users look AND where they click normal sw apped Go £le Recap Why rank? More on cosine The complete search system Implementation of ranking m porta n ranking: nummary 9 Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower ranked pages (7, 8, 9, 10). • Clicking: Distribution is even more skewed for clicking • In 1 out of 2 cases, users click on the top-ranked page. • Even if the top-ranked page is not relevant, 30% of users will click on it. • —>► Getting the ranking right is very important. • —>► Getting the top-ranked page right is most important. Sojka, MR Group: PV211: Scores in a complete search system 20 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Ranking is also one of the high barriers to entry for competitors to established players in the search engine market. o Why? Sojka, MR Group: PV211: Scores in a complete search system 21 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking 0 1 The Euclidean distance of q and d^ is large although the distribution of terms in the query q and the distribution of terms in the document c/2 are very similar. That's why we do length normalization or, equivalently, use cosine to compute query-document matching scores. Sojka, MR Group: PV211: Scores in a complete search system 23 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Query q\ "anti-doping rules Beijing 2008 Olympics" • Compare three documents • d\\ a short document on anti-doping rules at 2008 Olympics • d2'. a long document that consists of a copy of d\ and 5 other news stories, all on topics different from Olympics/anti-doping • dsm. a short document on anti-doping rules at the 2004 Athens Olympics • What ranking do we expect in the vector space model? • What can we do about this? Sojka, MR Group: PV211: Scores in a complete search system 24 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking z • Cosine normalization produces weights that are too large for short documents and too small for long documents (on average). • Adjust cosine normalization by linear adjustment: "turning" the average normalization on the pivot • Effect: Similarities of short documents with query decrease; similarities of long documents with query increase. • This removes the unfair advantage that short documents have. Sojka, MR Group: PV211: Scores in a complete search system 25 / 61 Recap Why rank? 
More on cosine The complete search system Implementation of ranking Relevance vs Retrieval with cosine normalization document leneth source: Lillian Lee Sojka, MR Group: PV211: Scores in a complete search system 26 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking Cosine Normalization Factor source: Lillian Le Sojka, MR Group: PV211: Scores in a complete search system 27 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking xperimen g z Pivoted Cosine Normalization Cosine Slope 0.60 0.65 0.70 0.75 0.80 6,526 6,342 6,458 6,574 6,629 6,671 0.2840 0.3024 0.3097 0.3144 0.3171 0,3162 Improvement + 6.5% + 9.0% +10.7% +11.7% +11.3% (relevant documents retrieved and (change in) average precision) Sojka, MR Group: PV211: Scores in a complete search system 28 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking Parsing Linguistics Documents Document cache Indexers ,.se^ q,.erv Free text query parser Results page Spell correction Scoring and ranking Metadata in zone and field indexes Inexact top K retrieval Tiered inverted positional index /f-gram Indexes training set Sojka, MR Group: PV211: Scores in a complete search system 30 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Basic idea: o Create several tiers of indexes, corresponding to importance of indexing terms o During query processing, start with highest-tier index • If highest-tier index returns at least k (e.g., k = 100) results: stop and return results to user o If we've only found < k hits: repeat for next index in tier cascade • Example: two-tier system o Tier 1: Index of all titles o Tier 2: Index of the rest of documents • Pages containing the search words in the title are better hits than pages containing the search words in the body of the text. Sojka, MR Group: PV211: Scores in a complete search system 31 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking Tier 1 auto -► Doc2 best car Doc1 Doc3 insurance Doc2 Doc3 auto Tier 2 best Doc1 Doc3 -► -► car insurance auto Doc1 -► car Doc2 -► insurance Sojka, MR Group: PV211: Scores in a complete search system 32 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking o The use of tiered indexes is believed to be one of the reasons that Google search quality was significantly higher initially (2000/01) than that of competitors. • (along with PageRank, use of anchor text and proximity constraints) Sojka, MR Group: PV211: Scores in a complete search system 33 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking Parsing Linguistics Documents Document cache Indexers ,.se^ q,.erv Free text query parser Results page Spell correction Scoring and ranking Metadata in zone and field indexes Inexact top K retrieval Tiered inverted positional index /f-gram Indexes training set Sojka, MR Group: PV211: Scores in a complete search system 34 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking mp iniroau • Document preprocessing (linguistic and otherwise) • Positional indexes o Tiered indexes • Spelling correction • k-gram indexes for wildcard queries and spelling correction • Query processing o Document scoring Sojka, MR Group: PV211: Scores in a complete search system 35 / 61 Recap Why rank? 
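Returning to the tiered indexes introduced above, a minimal sketch of the tier cascade: try the highest tier first and only fall through to the next tier if fewer than k results have been found. The toy two-tier index and all names are my own, and ranking within a tier is omitted.

```python
def tiered_search(query_terms, tiers, k=2):
    """Collect documents from the highest-priority tier downwards,
    stopping as soon as at least k results have been found."""
    results = []
    for tier in tiers:                       # highest-priority tier first
        for term in query_terms:
            for doc in tier.get(term, []):
                if doc not in results:
                    results.append(doc)
        if len(results) >= k:                # enough hits: stop cascading
            break
    return results[:k]

tier1 = {"car": ["Doc1", "Doc3"], "auto": ["Doc2"]}          # e.g. title index
tier2 = {"insurance": ["Doc2", "Doc3"], "best": ["Doc1"]}    # e.g. body index
print(tiered_search(["car", "insurance"], [tier1, tier2], k=2))  # found in tier 1 alone
```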
More on cosine The complete search system Implementation of ranking mp • Document cache: we need this for generating snippets (= dynamic summaries) • Zone indexes: They separate the indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields,.. . • Machine-learned ranking functions • Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other) • Query parser Sojka, MR Group: PV211: Scores in a complete search system 36 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking mp o IR systems often guess what the user intended. • The two-term query London tower (without quotes) may be interpreted as the phrase query "London tower". • The query 100 Madison Avenue, New York may be interpreted as a request for a map. • How do we "parse" the query and translate it into a formal specification containing phrase operators, proximity operators, indexes to search etc.? Sojka, MR Group: PV211: Scores in a complete search system 37 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • How do we combine phrase retrieval with vector space retrieval? • We do not want to compute document frequency / idf for every possible phrase. Why? • How do we combine Boolean retrieval with vector space retrieval? • For example: " + "-constraints and " — "-constraints • Postfiltering is simple, but can be very inefficient - no easy answer. • How do we combine wild cards with vector space retrieval? • Again, no easy answer. Sojka, MR Group: PV211: Scores in a complete search system 38 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking o Design criteria for tiered system o Each tier should be an order of magnitude smaller than the next tier. o The top 100 hits for most queries should be in tier 1, the top 100 hits for most of the remaining queries in tier 2 etc. o We need a simple test for "can I stop at this tier or do I have to go to the next one?" o There is no advantage to tiering if we have to hit most tiers for most queries anyway. • Consider a two-tier system where the first tier indexes titles and the second tier everything. • Question: Can you think of a better way of setting up a multitier system? Which "zones" of a document should be indexed in the different tiers (title, body of document, others?)? What criterion do you want to use for including a document in tier 1? Sojka, MR Group: PV211: Scores in a complete search system 39 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking w w rm it it u Brutus —> 1,2 7,3 83,1 87,2 . . . Caesar —> 1,1 5,1 13,1 17,1 . . . Calpurnia —> 7,1 8,2 40,1 97,3 term frequencies We also need positions. Not shown here. Sojka, MR Group: PV211: Scores in a complete search system 41 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking erm rrequencies in in • Thus: In each posting, store tftcy in addition to docID d. • As an integer frequency, not as a (log-)weighted real number • .. . because real numbers are difficult to compress. • Overall, additional space requirements are small: a byte per posting or less Sojka, MR Group: PV211: Scores in a complete search system 42 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking [•J lilt • We usually do not need a complete ranking. 
• We just need the top k for a small k (e.g., k = 100). o If we don't need a complete ranking, is there an efficient of computing just the top kl • Naive: o Compute scores for all N documents o Sort o Return the top k • Not very efficient • Alternative: min heap Sojka, MR Group: PV211: Scores in a complete search system 43 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking mm • A binary min heap is a binary tree in which each node's value is less than the values of its children. • Takes 0(A/log/c) operations to construct (where N is the number of documents) . .. • .. . then read off k winners in 0(k log k) steps Sojka, MR Group: PV211: Scores in a complete search system 44 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking Sojka, MR Group: PV211: Scores in a complete search system 45 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Goal: Keep the top k documents seen so far • Use a binary min heap • To process a new document d' with score s'\ o Get current minimum hm of heap (0(1)) • If s/ < hm skip to next document o If s/ > hm heap-delete-root (0(log/c)) o Heap-add d'/s' (0(log/c)) Sojka, MR Group: PV211: Scores in a complete search system 46 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking mputation • Ranking has time complexity 0(/V) where N is the number of documents. • Optimizations reduce the constant factor, but they are still 0(/V), N > 1010 • Are there sublinear algorithms? • What we're doing in effect: solving the /c-nearest neighbor (kNN) problem for the query vector (= query point). o There are no general solutions to this problem that are sublinear. Sojka, MR Group: PV211: Scores in a complete search system 47 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Idea 1: Reorder postings lists • Instead of ordering according to docID . .. • . .. order according to some measure of "expected relevance". • Idea 2: Heuristics to prune the search space o Not guaranteed to be correct ... o . .. but fails rarely, o In practice, close to constant time, o For this, we'll need the concepts of document-at-a-time processing and term-at-a-time processing. Sojka, MR Group: PV211: Scores in a complete search system 48 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • So far: postings lists have been ordered according to doclD. • Alternative: a query-independent measure of "goodness" (credibility) of a page • Example: PageRank g(d) of page d, a measure of how many "good" pages hyperlink to d (chapter 21) • Order documents in postings lists according to PageRank: S(c/i) >g(d2) > g(d3) > ... • Define composite score of a document: net-score(q, d) = g(d) + cos(q, d) • This scheme supports early termination: We do not have to process postings lists in their entirety to find top k. Sojka, MR Group: PV211: Scores in a complete search system 49 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Order documents in postings lists according to PageRank: g(d{) >g(d2) > g(d3) > ... • Define composite score of a document: net-score(q, d) = g(d) + cos(q, d) • Suppose: (i) g [0,1]; (ii) g(d) < 0.1 for the document d we're currently processing; (iii) smallest top k score we've found so far is 1.2 • Then all subsequent scores will be < 1.1. • So we've already found the top k and can stop processing the remainder of postings lists. 
• Questions? Sojka, MR Group: PV211: Scores in a complete search system 50 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Both doclD-ordering and PageRank-ordering impose a consistent ordering on documents in postings lists. • Computing cosines in this scheme is document-at-a-time. • We complete computation of the query-document similarity score of document d\ before starting to compute the query-document similarity score of c//+i. • Alternative: term-at-a-time processing Sojka, MR Group: PV211: Scores in a complete search system 51 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking a Idea: don't process postings that contribute little to final score • Order documents in postings list according to weight • Simplest case: normalized tf-idf weight (rarely done: hard to compress) o Documents in the top k are likely to occur early in these ordered lists. • —>► Early termination while processing postings lists is unlikely to change the top k. • But: o We no longer have a consistent ordering of documents in postings lists. o We no longer can employ document-at-a-time processing. Sojka, MR Group: PV211: Scores in a complete search system 52 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • Simplest case: completely process the postings list of the first query term • Create an accumulator for each docID you encounter • Then completely process the postings list of the second query term • .. . and so forth Sojka, MR Group: PV211: Scores in a complete search system 53 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking C0SINESC0RE(q) 1 float Scores[N] = 0 2 float Length[N] 3 for each query term t 4 do calculate wt?c/ and fetch postings list for t 5 for each pair(c/,tftc/) in postings list 6 do Scores[d]+ = wtjC/ x wt?c/ 7 Read the array Length 8 for each d 9 do Scores[d] = Scores[d]/Length[d] 10 return Top k components of Scores[] The elements of the array "Scores" are called accumulators. Sojka, MR Group: PV211: Scores in a complete search system 54 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking mputing ine scor • Use inverted index • At query time use an array of accumulators A to store sum (= the cosine score) o Aj = wqk • wdjk k (for document dj) • "Accumulate" scores as postings lists are being processed. Sojka, MR Group: PV211: Scores in a complete search system 55 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking • For the web (20 billion documents), an array of accumulators A in memory is infeasible. • Thus: Only create accumulators for docs occurring in postings lists o This is equivalent to: Do not create accumulators for docs with zero scores (i.e., docs that do not contain any of the query terms) Sojka, MR Group: PV211: Scores in a complete search system 56 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking Brutus —> 1,2 7,3 83,1 87,2 . . . Caesar —> 1,1 5,1 13,1 17,1 . . . Calpurnia —> 7,1 8,2 40,1 97,3 • For query: [Brutus Caesar]: • Only need accumulators for 1, 5, 7, 13, 17, 83, 87 • Don't need accumulators for 3, 8 etc. Sojka, MR Group: PV211: Scores in a complete search system 57 / 61 Recap Why rank? More on cosine The complete search system Implementation of ranking njunctiv • We can enforce conjunctive search (a la Google): only consider documents (and create accumulators) if all terms occur. 
Conjunctive search
• We can enforce conjunctive search (à la Google): only consider documents (and create accumulators) if all query terms occur.
• Example: just one accumulator for [Brutus Caesar] in the example above ...
• ... because only d_1 contains both words.

Implementation of ranking: summary
• Ranking is very expensive in applications where we have to compute similarity scores for all documents in the collection.
• In most applications, the vast majority of documents have similarity score 0 for a given query → lots of potential for speeding things up.
• However, there is no fast nearest-neighbor algorithm that is guaranteed to be correct even in this scenario.
• In practice: use heuristics to prune the search space - this usually works very well.

• The importance of ranking: user studies at Google
• Length normalization: pivot normalization
• The complete search system
• Implementation of ranking

• Chapters 6 and 7 of IIR
• Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
• How Google tweaks its ranking function
• Interview with Google search guru Udi Manber
• Amit Singhal on Google ranking
• SEO perspective: ranking factors
• Yahoo Search BOSS: opens up the search engine to developers; for example, you can rerank search results.
• Compare Google and Yahoo ranking for a query.
• How Google uses eye tracking to improve search.

PV211: Introduction to Information Retrieval
http://www.fi.muni.cz/~sojka/PV211
IIR 8: Evaluation & Result Summaries
Handout version
Petr Sojka, Martin Líska, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2017-03-07

• Recap
• Introduction
• Unranked evaluation
• Ranked evaluation
• Benchmarks
• Result summaries
Looking vs. clicking
[Figure: for each result rank 1-10, the number of times the result was selected vs. the time spent in the abstract.]
• Users view results one and two more often / thoroughly.
• Users click most frequently on result one.

Relevance vs. retrieval with cosine normalization
[Figure: "true" relevance and the cosine-normalized retrieval score plotted against document length, with a crossing point; source: Lillian Lee.]

• Goal: keep the top k documents seen so far
• Use a binary min heap
• To process a new document d' with score s':
  o Get the current minimum h_m of the heap (in O(1))
  o If s' < h_m, skip to the next document
  o If s' > h_m, heap-delete-root (in O(log k))
  o Heap-add d'/s' (in O(1))
  o Reheapify (in O(log k))

• Document-at-a-time processing
  o We complete the computation of the query-document similarity score of document d_i before starting to compute the query-document similarity score of d_{i+1}.
  o Requires a consistent ordering of documents in the postings lists
• Term-at-a-time processing
  o We complete processing the postings list of query term t_i before starting to process the postings list of t_{i+1}.
  o Requires an accumulator for each document "still in the running"
• The most effective heuristics switch back and forth between term-at-a-time and document-at-a-time processing.

[Figure: tiered index for the terms "auto", "best", "car", "insurance" - Tier 1, Tier 2 and Tier 3 postings lists pointing to Doc1, Doc2, Doc3.]

• Introduction to evaluation: measures of an IR system
• Evaluation of unranked and ranked retrieval
• Evaluation benchmarks
• Result summaries

How well does an IR system work?
[Screenshot (in Czech): a Google result page for the query [pv211] reporting about 108,000,000 results - all of them about the singer Rihanna (Wikipedia, Osobnosti.cz, Super.cz, Twitter).]
• How fast does it index?
  o e.g., number of bytes per hour
• How fast does it search?
  o e.g., latency as a function of queries per second
• What is the cost per query?
  o in dollars

• All of the preceding criteria are measurable: we can quantify speed / size / money.
• However, the key measure for a search engine is user happiness.
• What is user happiness? Factors include:
  o Speed of response
  o Size of index
  o Uncluttered UI
  o Most important: relevance (actually, maybe even more important: it's free)
• Note that none of these is sufficient: blindingly fast, but useless answers won't make a user happy.
• How can we quantify user happiness?

• Who is the user we are trying to make happy?
• Web search engine: searcher. Success: the searcher finds what she was looking for. Measure: rate of return to this search engine
• Web search engine: advertiser. Success: the searcher clicks on the ad. Measure: clickthrough rate
• E-commerce: buyer. Success: the buyer buys something. Measures: time to purchase, fraction of "conversions" of searchers to buyers
• E-commerce: seller. Success: the seller sells something. Measure: profit per item sold
• Enterprise: CEO. Success: employees are more productive (because of effective search). Measure: profit of the company

• User happiness is equated with the relevance of the search results to the query.
• But how do you measure relevance?
• The standard methodology in information retrieval consists of three elements:
  o A benchmark document collection
  o A benchmark suite of queries
  o An assessment of the relevance of each query-document pair

Relevance: query vs. information need
• Relevance to what?
• First take: relevance to the query
• "Relevance to the query" is very problematic.
• Information need i: "I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
• This is an information need, not a query.
• Query q: [red wine white wine heart attack]
• Consider document d': At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.
• d' is an excellent match for query q ...
• d' is not relevant to the information need i.

Relevance: query vs. information need
• User happiness can only be measured by relevance to an information need, not by relevance to queries.
• Our terminology is sloppy in these slides and in IIR: we talk about query-document relevance judgments even though we mean information-need-document relevance judgments.
Precision and recall
• Precision (P) is the fraction of retrieved documents that are relevant:
  Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
• Recall (R) is the fraction of relevant documents that are retrieved:
  Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

                 Relevant               Nonrelevant
Retrieved        true positives (TP)    false positives (FP)
Not retrieved    false negatives (FN)   true negatives (TN)

P = TP / (TP + FP)
R = TP / (TP + FN)

• You can increase recall by returning more docs.
• Recall is a non-decreasing function of the number of docs retrieved.
• A system that returns all docs has 100% recall!
• The converse is also true (usually): it's easy to get high precision for very low recall.
• Suppose the document with the largest score is relevant. How can we maximize precision?

• F allows us to trade off precision against recall:
  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = ((β² + 1)·P·R) / (β²·P + R),   where β² = (1 − α)/α
• α ∈ [0, 1] and thus β² ∈ [0, ∞]
• Most frequently used: balanced F with β = 1 (i.e., α = 0.5)
• This is the harmonic mean of P and R: 1/F = ½·(1/P + 1/R)
• What value range of β weights recall higher than precision?

F: example
                 relevant   not relevant      total
retrieved            20            40            60
not retrieved        60     1,000,000     1,000,060
total                80     1,000,040     1,000,120

• P = 20/(20 + 40) = 1/3
• R = 20/(20 + 60) = 1/4
• F_1 = (2 · P · R)/(P + R) = (2 · 1/3 · 1/4)/(1/3 + 1/4) = 2/7

• Why do we use complex measures like precision, recall, and F?
• Why not something simple like accuracy?
• Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
• In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).
• Why is accuracy not a useful measure for web information retrieval?

Exercise
• Compute precision, recall, and F_1 for this result set:
                 relevant   not relevant
  retrieved          18            2
  not retrieved      82    1,000,000,000
• The snoogle search engine below always returns 0 results ("0 matching results found"), regardless of the query. Why does snoogle demonstrate that accuracy is not a useful measure in IR?
  [Mockup of a search box: "Search for:" → "0 matching results found."]

• Simple trick to maximize accuracy in IR: always say no and return nothing.
• You then get 99.99% accuracy on most queries.
• Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
• It's better to return some bad hits as long as you return something.
• → We use precision, recall, and F for evaluation, not accuracy.
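A small Python sketch (an illustration, not part of the slides) that reproduces the contingency-table computations above, including the exercise:

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from a 2x2 contingency table."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    b2 = beta ** 2
    f = (b2 + 1) * p * r / (b2 * p + r)   # harmonic mean of P and R when beta = 1
    return p, r, f

# Worked example from the slide: TP=20, FP=40, FN=60 -> P=1/3, R=1/4, F1=2/7
print(precision_recall_f(20, 40, 60))

# Exercise from the slide: TP=18, FP=2, FN=82 -> P=0.9, R=0.18, F1=0.3
print(precision_recall_f(18, 2, 82))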
• Why don't we use a different mean of P and R as a measure?
• e.g., the arithmetic mean
• The simple (arithmetic) mean is 50% for a "return-everything" search engine, which is too high.
• Desideratum: punish really bad performance on either precision or recall.
• Taking the minimum achieves this.
• But the minimum is not smooth and is hard to weight.
• F (harmonic mean) is a kind of smooth minimum.

[Figure: combined measure as a function of precision, with recall fixed at 70%.]
• We can view the harmonic mean as a kind of soft minimum.

• We need relevance judgments for information-need-document pairs - but they are expensive to produce.
• For alternatives to using precision/recall and having to produce relevance judgments, see the end of this lecture.

• MAP(Q) = (1/|Q|) · Σ_{j=1}^{|Q|} (1/m_j) · Σ_{k=1}^{m_j} Precision(R_jk),
  where Q is the set of queries, m_j is the number of relevant documents for query j, and R_jk is the set of top results up to and including the k-th relevant document.
• For one query it is the area under the uninterpolated precision-recall curve,
• and so the MAP is roughly the average area under the precision-recall curve for a set of queries.

• Precision/recall/F are measures for unranked sets.
• We can easily turn set measures into measures of ranked lists.
• Just compute the set measure for each "prefix": the top 1 (P@1), top 2, top 3, top 4, etc. results.
• Doing this for precision and recall gives you a precision-recall curve.

[Figure: precision-recall curve, precision plotted against recall.]
• Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, ...).
• Interpolation (in red): take the maximum of all future points.
• Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better.
• Questions?

11-point interpolated average precision
Recall   Interpolated Precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08
11-point average: 0.425
• How can precision at recall 0.0 be > 0?

[Figure: averaged 11-point precision-recall graph.]
• Compute interpolated precision at recall levels 0.0, 0.1, 0.2, ...
• Do this for each of the queries in the evaluation benchmark.
• Average over queries.
• This measure measures performance at all recall levels.
• The curve is typical of performance levels at TREC.
• Note that performance is not very good!

[Figure: ROC curve, with 1 − specificity on the x-axis.]
• Similar to the precision-recall graph
• But we are only interested in the small area in the lower left corner.
• The precision-recall graph "blows up" this area.
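A minimal sketch of the prefix-based precision/recall points and the 11-point interpolated average described above; the ranked relevance judgments are invented for illustration and nothing here is part of the original slides.

def precision_recall_points(ranked_relevance, num_relevant):
    """(recall, precision) after each prefix of a ranked list (1 = relevant, 0 = not)."""
    points, tp = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        tp += rel
        points.append((tp / num_relevant, tp / k))
    return points

def eleven_point_interpolated_ap(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 and their average.

    Interpolation rule from the slide: at recall level r, take the maximum
    precision over all points with recall >= r ("maximum of all future points").
    """
    levels = [i / 10 for i in range(11)]
    interp = []
    for r in levels:
        candidates = [p for (rec, p) in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return interp, sum(interp) / len(interp)

# Toy example: 10 ranked results, 4 relevant documents in the collection
points = precision_recall_points([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], num_relevant=4)
print(eleven_point_interpolated_ap(points))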
Sojka, MR Group: PV211: Evaluation & Result Summaries 38 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries precision 9 For a test collection, it is usual that a system does badly on some information needs (e.g., P — 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1). • Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query. • That is, there are easy information needs and hard ones. Sojka, MR Group: PV211: Evaluation & Result Summaries 39 / 67 • A collection of documents o Documents must be representative of the documents we expect to see in reality. • A collection of information needs • . . .which we will often incorrectly refer to as queries • Information needs must be representative of the information needs we expect to see in reality. • Human relevance assessments • We need to hire/pay "judges" or assessors to do this. 9 Expensive, time-consuming • Judges must be representative of the users we expect to see in reality. Sojka, MR Group: PV211: Evaluation & Result Summaries 41 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries rirst sianaara relevance oe n mar o Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness • Late 1950s, UK 9 1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document-pairs 9 Too small, too untypical for serious IR evaluation today Sojka, MR Group: PV211: Evaluation & Result Summaries 42 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks mar Result summaries 9 TREC = Text Retrieval Conference (TREC) • Organized by the U.S. National Institute of Standards and Technology (NIST) • TREC is actually a set of several different relevance benchmarks. 9 Best known: TREC Ad Hoc, used for first 8 TREC evaluations between 1992 and 1999 • 1.89 million documents, mainly newswire articles, 450 information needs o No exhaustive relevance judgments - too expensive 9 Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed. Sojka, MR Group: PV211: Evaluation & Result Summaries 43 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries • G0V2 9 Another TREC/NIST collection • 25 million web pages • Used to be largest collection that is easily available • But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index • NTCIR: East Asian language and cross-language information retrieval • CLEF: Cross Language Evaluation Forum: This evaluation series has concentrated on European languages and cross-language information retrieval. • Many others Sojka, MR Group: PV211: Evaluation & Result Summaries 44 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries Clueweb09: 9 1 billion web pages, 25 terabytes (compressed: 5 terabyte) collected during January/February 2009 9 crawl of pages in 10 languages Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed) • Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed) Cluewebl2: 9 733,019,372 docs, 27.3 TB (5.54 TB compressed) Indexed in Sketch Engine, cf. LREC 2012 paper. 
• Relevance assessments are only usable if they are consistent.
• If they are not consistent, then there is no "truth" and experiments are not repeatable.
• How can we measure this consistency or agreement among judges?
• → Kappa measure

• Kappa is a measure of how much judges agree or disagree.
• Designed for categorical judgments
• Corrects for chance agreement
• P(A) - proportion of the time the judges agree
• P(E) - the agreement we would get by chance
  κ = (P(A) − P(E)) / (1 − P(E))
• κ = 0 for chance agreement, κ = 1 for total agreement.

• Values of κ in the interval [2/3, 1.0] are seen as acceptable.
• With smaller values: the relevance assessment methodology used needs to be redesigned, etc.

Calculating the kappa statistic
                      Judge 2 relevance
                      Yes     No     Total
Judge 1    Yes        300     20     320
relevance  No          10     70      80
           Total      310     90     400

Observed proportion of the times the judges agreed:
P(A) = (300 + 70)/400 = 370/400 = 0.925
Pooled marginals:
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
Probability that the two judges agreed by chance:
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
Kappa statistic:
κ = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776 (still in the acceptable range)

Interjudge agreement at TREC
information need   number of docs judged   disagreements
       51                  211                   6
       62                  400                 157
       67                  400                  68
       95                  400                 110
      127                  400                 106

Impact of interjudge disagreement
• Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless?
• No.
• Large impact on absolute performance numbers
• Virtually no impact on the ranking of systems
• Suppose we want to know if algorithm A is better than algorithm B.
• An information retrieval experiment will give us a reliable answer to this question ...
• ... even if there is a lot of disagreement between judges.

Evaluation at large search engines
• Recall is difficult to measure on the web.
• Search engines often use precision at top k, e.g., k = 10 ...
• ... or use measures that reward you more for getting rank 1 right than for getting rank 10 right.
• Search engines also use non-relevance-based measures.
• Example 1: clickthrough on the first result
  o Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant) ...
  o ... but pretty reliable in the aggregate.
• Example 2: ongoing studies of user behavior in the lab - recall the last lecture
• Example 3: A/B testing
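The kappa computation worked through above can be reproduced with a short sketch (an illustration only; the four arguments are the cells of the judge-agreement table):

def kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges with yes/no relevance judgments.

    Arguments are the four cells of the agreement table
    (judge1/judge2: yes-yes, yes-no, no-yes, no-no).
    """
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pooled marginals over both judges
    p_rel = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * n)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Worked example from the slide: 300/20/10/70 -> kappa ~= 0.776
print(kappa(300, 20, 10, 70))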
A/B testing
• Purpose: test a single innovation
• Prerequisite: you have a large search engine up and running.
• Have most users use the old system.
• Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
• Evaluate with an "automatic" measure like clickthrough on the first result.
• Now we can directly see whether the innovation improves user happiness.
• Probably the evaluation methodology that large search engines trust most
• Variant: give users the option to switch to the new algorithm/interface.

Critique of pure relevance
• We've defined relevance for an isolated query-document pair.
• Alternative definition: marginal relevance
• The marginal relevance of a document at position k in the result list is the additional information it contributes over and above the information that was contained in documents d_1 ... d_{k−1}.
• Exercise:
  o Why is marginal relevance a more realistic measure of user happiness?
  o Give an example where a non-marginal measure like precision or recall is a misleading measure of user happiness, but marginal relevance is a good measure.
  o In a practical application, what is the difficulty of using marginal measures instead of non-marginal measures?

[Screenshot (in Czech): Google result list for a PV211 query, titles only - about 87 results in 0.42 s, e.g. "Studijní materiály předmětu FI:PV211 - Masarykova univerzita" and "PV211 -- Úvod do získávání informací (jaro 2014)".]

[Screenshot: the same result list with URLs (https://is.muni.cz/el/1433/jaro2014/PV211/, .../jaro2013/..., .../jaro2009/..., www.fi.muni.cz/~sojka/PV211/) and a short snippet summary shown for each hit.]

• Most often: as a list - aka "10 blue links"
• How should each document in the list be described?
• This description is crucial.
• The user can often identify good hits (= relevant hits) based on the description.
o No need to actually view any document Sojka, MR Group: PV211: Evaluation & Result Summaries 58 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries ripiion in resu • Most commonly: doc title, url, some metadata ... • . .. and a summary • How do we "compute" the summary? Sojka, MR Group: PV211: Evaluation & Result Summaries 59 / 67 • Two basic kinds: (i) static (ii) dynamic • A static summary of a document is always the same, regardless of the query that was issued by the user. • Dynamic summaries are query-dependent. They attempt to explain why the document was retrieved for the query at hand Sojka, MR Group: PV211: Evaluation & Result Summaries 60 / 67 • In typical systems, the static summary is a subset of the document. • Simplest heuristic: the first 50 or so words of the document o More sophisticated: extract from each document a set of "key" sentences • Simple NLP heuristics to score each sentence • Summary is made up of top-scoring sentences. o Machine learning approach: see MR 13 o Most sophisticated: complex NLP to synthesize/generate a summary 9 For most IR applications: not quite ready for prime time yet Sojka, MR Group: PV211: Evaluation & Result Summaries 61 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries ynamic summari • Present one or more "windows" or snippets within the document that contain several of the query terms. • Prefer snippets in which query terms occurred as a phrase • Prefer snippets in which query terms occurred jointly in a small window • The summary that is computed this way gives the entire content of the window - all terms, not just the query terms. Sojka, MR Group: PV211: Evaluation & Result Summaries 62 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries ynamic summari No Meal Athlete I Vegetarian Running and Fitness www.noimeataUilete.com/ T Vegetarian Running and Fitness. ... (Oh, and did I mention Rich did it aJI on a 9 plant-based diet?) In £his episode of No Meat Athlete Radio, Doug and I had the ... Vegetarian Recipes for Athletes - Vegetarian Shirts - How to Run Long - About Running on a vegetarian diet - Top tips I Freedom2Train Blog www.freedonn2train.com/blog/7p-4 ▼ Nov 0, 2012 - !n this article we look to tackle the Issues faced by Jong distance runners on a vegetarian diet. By its very nature, a vegetarian diet can lead to ... 9 HowStuffWorks "5 Nutrition Tips for Vegetarian Runners" www. hcwstuffworks .com/.. ./run n in g/.. ./5-nutrition-ti ps-for-vg geta ri a n -r... T Even without meat, you can get enough fuel to keep on running. Stockbyte/Thinkstook ... Unfortunately, a vegetarian diet is not a panacea for runners, it could, for... Nutrition Guide for Vegetarian and Vegan Runners - The Running Bug theru n ni ng bug .co. u kA. Vn utrition -guide-for-vegetarian-and■ vegan-runne... t Feb 26, 2012 - The Running Bug's guide to nutrition for vegetarian and vegan ... different types of vegetarian diet ranging from Iacto-ovo-vegetarians who eat... Vegetarian Runner www.vegetarianrunner.com/ * Vegetarian Runner - A resource center for vegetarianism and running and how to make sure you have proper nutrition as an athlete with a vegetarian diet. running Good example that snippet selection is non-trivial. 
Criteria: occurrence of keywords, density of keywords, coherence of snippet, number of different snippets in summary, good cutting points etc Sojka, MR Group: PV211: Evaluation & Result Summaries 63 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries ynamic summar Query: [new guinea economic development] Snippets (in bold) that were extracted from a document: ... In recent years, Papua New Guinea has faced severe economic difficulties and economic growth has slowed, partly as a result of weak governance and civil war, and partly as a result of external factors such as the Bougainville civil war which led to the closure in 1989 of the Panguna mine (at that time the most important foreign exchange earner and contributor to Government finances), the Asian financial crisis, a decline in the prices of gold and copper, and a fall in the production of oil. PNG's economic development record over the past few years is evidence that governance issues underly many of the country's problems. Good governance, which may be defined as the transparent and accountable management of human, natural, economic and financial resources for the purposes of equitable and sustainable development, flows from proper public sector management, efficient fiscal and accounting mechanisms, and a willingness to make service delivery a priority in practice. ... Sojka, MR Group: PV211: Evaluation & Result Summaries 64 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries mg dynamic summari • Where do we get these other terms in the snippet from? o We cannot construct a dynamic summary from the positional inverted index - at least not efficiently. 9 We need to cache documents. • The positional index tells us: query term occurs at position 4378 in the document. • Byte offset or word offset? 9 Note that the cached copy can be outdated • Don't cache very long documents - just cache a short prefix Sojka, MR Group: PV211: Evaluation & Result Summaries 65 / 67 Recap Introduction Unranked evaluation Ranked evaluation Benchmarks Result summaries ynamic summari 9 Real estate on the search result page is limited —>► snippets must be short . .. • . .. but snippets must be long enough to be meaningful. 9 Snippets should communicate whether and how the document answers the query. 9 Ideally: linguistically well-formed snippets 9 Ideally: the snippet should answer the query, so we don't have to look at the document. 9 Dynamic summaries are a big part of user happiness because • . .. we can quickly scan them to find the relevant document we then click on. • ... in many cases, we don't have to click at all and save time. Sojka, MR Group: PV211: Evaluation & Result Summaries 66 / 67 • Chapter 8 of MR 9 Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library • The TREC home page - TREC had a huge impact on information retrieval evaluation. • Originator of F-measure: Keith van Rijsbergen • More on A/B testing • Too much A/B testing at Google? • Tombros & Sanderson 1998: one of the first papers on dynamic summaries • Google VP of Engineering on search quality evaluation at Google » ClueWebl2 or other datasets available in Sketch Engine Sojka, MR Group: PV211: Evaluation & Result Summaries 67 / 67 PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 9: Relevance feedback & Query expansion Handout version Petr Sojka, Hinrich Schütze et al. 
Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-31 Sojka, MR Group: PV211: Relevance feedback & Query expansion Sojka, MR Group: PV211: Relevance feedback & Query expansion 2/ • Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are relevant / nonrelevant o Best known relevance feedback method: Rocchio feedback • Query expansion: improve retrieval results by adding synonyms / related terms to the query • Sources for related terms: Manual thesauri, automatic thesauri, query logs Sojka, MR Group: PV211: Relevance feedback & Query expansion 3/ • Main topic today: two ways of improving recall: relevance feedback and query expansion • As an example consider query q\ [aircraft] .. . o .. . and document d containing "plane", but not containing aircraft • A simple IR system will not return d for q. • Even if d is the most relevant document for q\ • We want to change this: Return relevant documents even if there is no term match with the (original) query Sojka, MR Group: PV211: Relevance feedback & Query expansion 5/ • Loose definition of recall in this lecture: "increasing the number of relevant documents returned to user" • This may actually decrease recall on some measures, e.g., when expanding "jaguar" with "panthera" o . . .which eliminates some relevant documents, but increases relevant documents returned on top pages Sojka, MR Group: PV211: Relevance feedback & Query expansion 6/ ions Tor improving r • Local: Do a "local", on-demand analysis for a user query • Main local method: relevance feedback • Part 1 • Global: Do a global analysis once (e.g., of collection) to produce thesaurus o Use thesaurus for query expansion o Part 2 Sojka, MR Group: PV211: Relevance feedback & Query expansion 7/1 'g m X r query expansion • One that works well • "flights - flight 9 One that doesn't work so well o "dogs -dog Sojka, MR Group: PV211: Relevance feedback & Query expansion 8/1 • The user issues a (short, simple) query. • The search engine returns a set of documents. • User marks some docs as relevant, some as non-relevant. • Search engine computes a new representation of the information need. Hope: better than the initial query. • Search engine runs new query and returns new results. • New results have (hopefully) better recall. • We can iterate this: several rounds of relevance feedback. • We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. Sojka, MR Group: PV211: Relevance feedback & Query expansion 10/1 9 We will now look at three different examples of relevance feedback that highlight different aspects of the process. Sojka, MR Group: PV211: Relevance feedback & Query expansion 11/1 tíÓNew Page 1 - Netscape _|n x| a. File Edit View Ge Bookmarks Toels Window Help Q Q Q Q http: ŕJnayana, ece, ucih, edu/i ■ -ji Home ^sŕ Browsing and ,,. Shopping related 6u7r0uu images are indexed and classified in the database Only One keyword is allowed!! 
I bikd Search Designed by Bans Sumeiigen and Shawn News am Powered by JLAMF20O0 (Java, Linux, Apache, Mysql PerL WmdovtsIQQQ) Sojka, MR Group: PV211: Relevance feedback & Query expansion 12/1 u or initial query Sojka, MR Group: PV211: Relevance feedback & Query expansion 13/1 Sojka, MR Group: PV211: Relevance feedback & Query expansion 14/1 u Browse Search P rev Next Random | (144538, 5234Pirj (1444 73, 1«4Í)) (U44:5(5, 245X5Í4; Q 4Ó35 (h453*ř 525529) 0.55427$ O.iSGöSL [14445(1, JJJöPi] 0.+7(54 5 0.2ÖÜ4JL Ll4445fi,35i5ÖSj 0.64501 0.3513« 0.293ÖL5 [14447Í, lóí^S) 0.351557 iL4445G, 2535*58] Ů.41L745 Ú 23*53 (14 4538,5237^ f Ú 353033 Ú 3 (WOJí 1144483, 2052(54j íl JfllTOľSo ň 3(5176 (144478,512410 í 0.7020 j Ú 4(5*111 ú .233539 Sojka, MR Group: PV211: Relevance feedback & Query expansion 15/1 Sojka, MR Group: PV211: Relevance feedback & Query expansion 16/1 source: Fernand Sojka, MR Group: PV211: Relevance feedback & Query expansion 17/1 Sojka, MR Group: PV211: Relevance feedback & Query expansion 18/1 u source: Fernando Diaz Sojka, MR Group: PV211: Relevance feedback & Query expansion 19/1 non-image) examp Initial query: [new space satellite applications] Results for initial query: (r = rank) r + 1 0.539 NASA Hasn't Scrapped Imaging Spectrometer + 2 0.533 NASA Scratches Environment Gear From Satellite Plan 3 0.528 Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes 4 0.526 A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget 5 0.525 Scientist Who Exposed Global Warming Proposes Satellites for Climate Research 6 0.524 Report Provides Support for the Critics Of Using Big Satellites to Study Climate 7 0.516 Arianespace Receives Satellite Launch Pact From Telesat Canada + 8 0.509 Telecommunications Tale of Two Companies User then marks relevant documents with "+". Sojka, MR Group: PV211: Relevance feedback & Query expansion 20/1 2.074 new 15.106 space 30.816 satellite 5.660 application 5.991 nasa 5.196 eos 4.196 launch 3.972 aster 3.516 instrument 3.446 arianespace 3.004 bundespost 2.806 ss 2.790 rocket 2.053 scientist 2.003 broadcast 1.172 earth 0.836 oil 0.646 measure Compare to original query: [new space satellite applications] Sojka, MR Group: PV211: Relevance feedback & Query expansion 21/1 or expan query 1(2) 2(1) 3 5(8) 6 7 8 0.513 NASA Scratches Environment Gear From Satellite Plan 0.500 NASA Hasn't Scrapped Imaging Spectrometer 0.493 When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own 0.493 NASA Uses 'Warm' Superconductors For Fast Circuit 0.492 Telecommunications Tale of Two Companies 0.491 Soviets May Adapt Parts of SS-20 Missile For Commercial Use 0.490 Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers 0.490 Rescue of Satellite By Space Agency To Cost $90 Million Sojka, MR Group: PV211: Relevance feedback & Query expansion 22/1 y • The centroid is the center of mass of a set of points. • Recall that we represent documents as points in a high-dimensional space. o Thus: we can compute centroids of documents. • Definition: where D is a set of documents and v(d) — d is the vector we use to represent document d. Sojka, MR Group: PV211: Relevance feedback & Query expansion 24/1 Sojka, MR Group: PV211: Relevance feedback & Query expansion 25/1 o The Rocchio algorithm implements relevance feedback in the vector space model. 
• Rocchio chooses the query q_opt that maximizes
  q_opt = arg max_q [sim(q, μ(D_r)) − sim(q, μ(D_nr))]
  D_r: set of relevant docs; D_nr: set of nonrelevant docs
• Intent: q_opt is the vector that separates relevant and nonrelevant docs maximally.
• Making some additional assumptions, we can rewrite q_opt as:

• The optimal query vector is:
  q_opt = μ(D_r) + [μ(D_r) − μ(D_nr)]
        = (1/|D_r|) Σ_{d_j ∈ D_r} d_j + [ (1/|D_r|) Σ_{d_j ∈ D_r} d_j − (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j ]
• We move the centroid of the relevant documents by the difference between the two centroids.

[Figure: Rocchio illustrated - circles: relevant documents, Xs: nonrelevant documents; compute q_opt = μ(D_r) + [μ(D_r) − μ(D_nr)].]

[Figure: Rocchio illustrated, step by step - circles: relevant documents, Xs: nonrelevant documents. The centroid of the relevant documents μ_R does not separate relevant from nonrelevant. μ_NR: centroid of the nonrelevant documents. μ_R − μ_NR: difference vector. Adding the difference vector to μ_R gives q_opt, which separates relevant/nonrelevant perfectly.]

Terminology
• So far, we have used the name Rocchio for the theoretically better motivated original version of Rocchio.
• The implementation that is actually used in most cases is the SMART implementation - this SMART version of Rocchio is what we will refer to from now on.

Rocchio algorithm (SMART)
• Used in practice:
  q_m = α·q_0 + β·(1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|) Σ_{d_j ∈ D_nr} d_j
  q_m: modified query vector; q_0: original query vector; D_r and D_nr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights
• The new query moves towards the relevant documents and away from the nonrelevant documents.
• Trade-off α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ.
• Set negative term weights to 0.
• A "negative weight" for a term doesn't make sense in the vector space model.

Positive vs. negative relevance feedback
• Positive feedback is more valuable than negative feedback.
• For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback.
• Many systems only allow positive feedback.

• When can relevance feedback enhance recall?
• Assumption A1: The user knows the terms in the collection well enough for an initial query.
• Assumption A2: Relevant documents contain similar terms (so I can "hop" from one relevant document to a different one when giving relevance feedback).

• Assumption A1: The user knows the terms in the collection well enough for an initial query.
• Violation: mismatch of the searcher's vocabulary and the collection vocabulary
• Example: cosmonaut / astronaut

• Assumption A2: Relevant documents are similar.
• Example of a violation: [contradictory government policies]
• Several unrelated "prototypes":
  o Subsidies for tobacco farmers vs. anti-smoking campaigns
  o Aid for developing countries vs. high tariffs on imports from developing countries
• Relevance feedback on tobacco docs will not help with finding docs on developing countries.
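A minimal sketch of the SMART-style Rocchio update described above. The β = 0.75, γ = 0.25 weights and the clipping of negative weights follow the slides; the numpy representation and the toy vectors are assumptions made for illustration.

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """SMART-style Rocchio update of a query vector.

    q_m = alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant),
    with negative components clipped to 0.
    """
    qm = alpha * np.asarray(q0, dtype=float)
    if relevant:
        qm += beta * np.mean(relevant, axis=0)       # centroid of relevant docs
    if nonrelevant:
        qm -= gamma * np.mean(nonrelevant, axis=0)   # centroid of nonrelevant docs
    return np.maximum(qm, 0.0)                       # no negative term weights

# Tiny example over a 4-term vocabulary
q0 = [1, 0, 0, 1]
rel = [[0.9, 0.8, 0.0, 0.1], [0.7, 0.9, 0.0, 0.2]]
nonrel = [[0.0, 0.0, 0.9, 0.8]]
print(rocchio(q0, rel, nonrel))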
When can relevance feedback enhance recall?
• Assumption A1: The user knows the terms in the collection well enough for an initial query.
• Assumption A2: Relevant documents contain similar terms (so I can "hop" from one relevant document to a different one when giving relevance feedback).

• Pick an evaluation measure, e.g., precision in the top 10: P@10
• Compute P@10 for the original query q_0
• Compute P@10 for the modified relevance feedback query q_1
• In most cases: q_1 is spectacularly better than q_0!
• Is this a fair evaluation?

• A fair evaluation must be on the "residual" collection: docs not yet judged by the user.
• Studies have shown that relevance feedback is successful when evaluated this way.
• Empirically, one round of relevance feedback is often very useful. Two rounds are marginally useful.

• A true evaluation of usefulness must compare to other methods taking the same amount of time.
• Alternative to relevance feedback: the user revises and resubmits the query.
• Users may prefer revision/resubmission to having to judge the relevance of documents.
• There is no clear evidence that relevance feedback is the "best use" of the user's time.

• Do search engines use relevance feedback?
• Why?

• Relevance feedback is expensive.
• Relevance feedback creates long modified queries.
• Long queries are expensive to process.
• Users are reluctant to provide explicit feedback.
• It's often hard to understand why a particular document was retrieved after applying relevance feedback.
• The search engine Excite had full relevance feedback at one point, but abandoned it later.

• Pseudo-relevance feedback automates the "manual" part of true relevance feedback.
• Pseudo-relevance feedback algorithm:
  o Retrieve a ranked list of hits for the user's query.
  o Assume that the top k documents are relevant.
  o Do relevance feedback (e.g., Rocchio).
• Works very well on average
• But can go horribly wrong for some queries
  o Because of query drift
  o If you do several iterations of pseudo-relevance feedback, then you will get query drift for a large proportion of queries.

• Cornell SMART system
• Results show the number of relevant documents out of the top 100 for 50 queries (so the total number of documents is 5000):
  method          number of relevant documents
  lnc.ltc                 3210
  lnc.ltc-PsRF            3634
  Lnu.ltu                 3709
  Lnu.ltu-PsRF            4350
• Results contrast two length normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF).
• The pseudo-relevance feedback method used added only 20 terms to the query. (Rocchio will add many more.)
• This demonstrates that pseudo-relevance feedback is effective on average.

Query expansion: example
[Screenshot: Yahoo search for [palm], with "Also try: palm springs, palm pilot, palm trees, palm reading". Sponsored results:]
• Palms Hotel - Best Rate Guarantee www.vegas.com Book the Palms Hotel Casino with our best rate guarantee at VEGAS.com, the official Vegas travel site. v> Palm Pilots - Palm Downloads Yahoo! Shortcut - About 1. Palm, Inc. ^ Maker of handheld PDA devices that allow mobile users to manage schedules, contacts, and other personal and business information. Category: 52B > Personal Digital Assistants (PDAs) www.palm.com - 20k - Cached - More from this site - Save 1 - 10 of about 160.000.000 for palm - 0.07 sec. fAbout this paoel SPONSOR RESULTS Palm Memory Memory Giant is fast and easy. Guaranteed compatible memory. Great... www.memorygiant.com The Palms, Turks and Caicos Islands Resort/Condo photos, rates, availability and reservations.... www.wo rl dwi d e re s e rvat i o n sy st erns.c The Palms Casino Resort, Las Vegas Low price guarantee at the Palms Casino resort in Las Vegas. Book... lasvegas.hotelscorp.com Sojka, MR Group: PV211: Relevance feedback & Query expansion 45/1 • User gives feedback on documents. o More common in relevance feedback • User gives feedback on words or phrases. a More common in query expansion Sojka, MR Group: PV211: Relevance feedback & Query expansion 46/1 uery expansion • Query expansion is another method for increasing recall. • We use "global query expansion" to refer to "global methods for query reformulation". • In global query expansion, the query is modified based on some global resource, i.e. a resource that is not query-dependent. • Main information we use: (near-)synonymy Sojka, MR Group: PV211: Relevance feedback & Query expansion 47/1 r query expansion • A publication or database that collects (near-)synonyms is called a thesaurus. o Manual thesaurus (maintained by editors, e.g., PubMed) • Automatically derived thesaurus (e.g., based on co-occurrence statistics) o Query-equivalence based on query log mining (common on the web as in the "palm" example) Sojka, MR Group: PV211: Relevance feedback & Query expansion 48/1 uru query expansion • For each term t in the query, expand the query with words the thesaurus lists as semantically related with t. • Example from earlier: hospital —>► medical • Generally increases recall o May significantly decrease precision, particularly with ambiguous terms O INTEREST RATE —>► INTEREST RATE FASCINATE • Widely used in specialized search engines for science and engineering • It's very expensive to create a manual thesaurus and to maintain it over time. Sojka, MR Group: PV211: Relevance feedback & Query expansion 49/1 r manu uru NCBI PubtySjed National Library | of Medicine M M Pub Mí d N'.icliotidí Protein Stiuctur-; PopSet I Sfiarch jPubM^d *j f<>r jtancer Limits Go I Cigar 'revtewflnflex Histcry Glipboard Taxonomy Abcut Enliez Texf Version Entrez Pud m e a Overview Heap I FAQ Tutorial New/Noteworthy E-Utilities PutrMed Services Journals Database MeSH Browser Single Citation FubMed Query: ("neoplasms"{HeSH Terms] OR cancer[Text Word]) Search | URL Sojka, MR Group: PV211: Relevance feedback & Query expansion 50/1 m urus generation • Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents • Fundamental notion: similarity between two words o Definition 1: Two words are similar if they co-occur with similar words. o ca r "motorcycle" because both occur with "road", "gas" and "license", so they must be similar. • Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. • You can harvest, peel, eat, prepare, etc. 
apples and pears, so apples and pears must be similar. • Co-occurrence is more robust, grammatical relations are more accurate. Sojka, MR Group: PV211: Relevance feedback & Query expansion 51/1 Word Nearest neighbors absolutely bottomed captivating doghouse makeup mediating keeping lithographs pathogens senses absurd whatsoever totally exactly nothing dip copper drops topped slide trimmed shimmer stunningly superbly plucky witty dog porch crawling beside downstairs repellent lotion glossy sunscreen skin gel reconciliation negotiate case conciliation hoping bring wiping could some would drawings Picasso Dali sculptures Gauguin toxins bacteria organisms bacterial parasite grasp psyche truly clumsy naive innate WordSpace demo on web Sojka, MR Group: PV211: Relevance feedback & Query expansion 52/1 uery expansion g • Main source of query expansion at search engines: query logs • Example 1: After issuing the query [herbs], users frequently search for [herbal remedies]. o —> "herbal remedies" is potential expansion of "herb". • Example 2: Users searching for [flower pix] frequently click on the URL photobucket.com/flower. Users searching for [flower clipart] frequently click on the same URL. • —> "flower clipart" and "flower pix" are potential expansions of each other. Sojka, MR Group: PV211: Relevance feedback & Query expansion 53/1 • Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are relevant / nonrelevant o Best known relevance feedback method: Rocchio feedback • Query expansion: improve retrieval results by adding synonyms / related terms to the query • Sources for related terms: Manual thesauri, automatic thesauri, query logs Sojka, MR Group: PV211: Relevance feedback & Query expansion 54/1 • Chapter 9 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library • Daniel Tunkelang's articles on query understanding, namely on query relaxation and query expansion. • Salton and Buckley 1990 (original relevance feedback paper) • Spink, Jansen, Ozmultu 2000: Relevance feedback at Excite o Justin Bieber: related searches fail o Word Space • Schütze 1998: Automatic word sense discrimination (describes a simple method for automatic thesaurus generation) Sojka, MR Group: PV211: Relevance feedback & Query expansion 55/1 Introduction Basic XML concepts Challenges in XML IR Vector space model for XML IR Evaluation of XML IR PV211: Introduction to Information Retrieva http://www.f i.muni.cz/~ s oj ka/PV211 R 10: XML retrieva Handout version Petr Sojka, Hinrich Schütze et al Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-05-09 Sojka, MR Group: PV211: XML retrieval 1/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR verview Q Introduction Q Basic XML concepts O Challenges in XML IR Q Vector space model for XML IR 0 Evaluation of XML IR 0 Math (MathML) retrieval Sojka, MR Group: PV211: xml retrieval 2/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR IR systems are often contrasted with relational databases (RDB). • Traditionally, IR systems retrieve information from unstructured text ("raw" text without markup). o RDB systems are used for querying relational data: sets of records that have values for predefined attributes such as employee number, title and salary. 
RDB search unstructured IR objects records unstructured docs main data structure table inverted index model relational model vector space & others queries SQL free text queries Some structured data sources containing text are best modeled as structured documents rather than relational data (Structured retrieval). Sojka, MR Group: PV211: xml retrieval 4/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR :ruciurea ret rieva Basic setting: queries are structured or unstructured; documents are structured. Applications of structured retrieval Digital libraries, patent databases, blogs, tagged text with entities like persons and locations (named entity tagging). Example 9 Digital libraries: give me a full-length article on fast Fourier transforms o Patents: give me patents whose claims mention RSA public key encryption and that cite US patent 4,405,829 9 Entity-tagged text: give me articles about sightseeing tours of the Vatican and the Coliseum Sojka, MR Group: PV211: xml retrieval 5/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR Three main problems O An unranked system (DB) would return a potentially large number of articles that mention the Vatican, the Coliseum and sightseeing tours without ranking them by relevance to the query. Q Difficult for users to precisely state structural constraints—may not know which structured elements are supported by the system. tours AND (COUNTRY: Vatican OR LANDMARK: Coliseum) ? tours AND (STATE: Vatican OR BUILDING: Coliseum) O Users may be completely unfamiliar with structured search and advanced search interfaces or unwilling to use them. Solution: adapt ranked retrieval to structured documents to address these problems. Sojka, MR Group: PV211: xml retrieval 6/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR r RDB search, Unstructured IR, Structured IR RDB search unstructured retrieval structured retrieval objects records unstructured docs trees with text at leaves main data table inverted index ? structure model relational model vector space & others ? queries SQL free text queries ? Standard for encoding structured documents: Extensible Markup Language (XML) • structured IR XML IR • also applicable to other types of markup (HTML, SGML,...) Sojka, MR Group: PV211: xml retrieval 7/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR ocumerr • Ordered, labeled tree • Each node of the tree is an XML element, written with an opening and closing XML tag (e.g. , ) • An element can have one or more XML attributes (e.g. number) • Attributes can have values (e.g. vii) • Attributes can have child elements (e.g. title, verse) Shakespeare Macbeth Macbeth's castle Will I with wine ... 
Sojka, MR Group: PV211: xml retrieval 9/44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR ocumen attribute number="l" attribute number="vii" root element play element act element scene element verse text Will I with 1 element title text Macbeth 1 element title text Macbeth's castte Sojka, MR Group: PV211: xml retrieval 10 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR ocumen The leaf nodes consist of text element author text Shakespeare attribute number="l" attribute iumber="vii root element play element act element scene ± element verse I text Will I with element title text Macbeth element title fv text acbeth's casile Sojka, MR Group: PV211: xml retrieval 11 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR ocumen The internal nodes encode document structure or metadata functions £ element author text Shakespeare £ attribute number="l" £ attribute number="vii" root element play element act element scene element verse text Will I with ... element title text Macbeth 1 element title text Macbeth's castHe Sojka, MR Group: PV211: xml retrieval 12 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR asic o XML Document Object Model (XML DOM): standard for accessing and processing XML documents o The DOM represents elements, attributes and text within elements as nodes in a tree, o With a DOM API, we can process an XML document by starting at the root element and then descending down the tree from parents to children. • XPath: standard for enumerating paths in an XML document collection. • We will also refer to paths as XML contexts or simply contexts • Schema: puts constraints on the structure of allowable XML documents. E.g. a schema for Shakespeare's plays: scenes can only occur as children of acts. j Two standards for schemas for XML documents are: XML DTD (document type definition) and XML Schema. Sojka, MR Group: PV211: xml retrieval 13 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR íaiienge: document parts to retrieve Structured or XML retrieval: users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. Example If we query Shakespeare's plays for Macbeth's castle, should we return the scene, the act or the entire play? • In this case, the user is probably looking for the scene. • However, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit. Solution: structured document retrieval principle Sojka, MR Group: PV211: xml retrieval 15 / 44 Introduction Basic xml concepts Chal lenges in xml IR Vector space model for xml IR Evaluation of xml IR retrieval principle Structured document retrieval principle One criterion for selecting the most appropriate part of a document: A system should always retrieve the most specific part of a document answering the query. • Motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. • Hard to implement this principle algorithmically. E.g. query: title:Macbeth can match both the title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth's castle. 9 But in this case, the title of the tragedy (higher node) is preferred. 
o Difficult to decide which level of the tree satisfies the query. Sojka, MR Group: PV211: xml retrieval 16 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR C J 1 11 J j_ j_ j 1 becond challenge: document parts 1 :o index Central notion for indexing and ranking in IR: document unit or indexing unit. • In unstructured retrieval, usually straightforward: files on your desktop, email messages, web pages on the web etc. • In structured retrieval, there are four main different approaches to defining the indexing unit. 9 non-overlapping pseudodocuments Q top down O bottom up Q all Sojka, MR Group: PV211: xml retrieval 17 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR indexing unit: approac Group nodes into non-overlapping pseudodocuments. Indexing units: books, chapters, sections, but without overlap. Disadvantage: pseudodocuments may not make sense to the user because they are not coherent units. Sojka, MR Group: PV211: xml retrieval 18 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR indexing unit: approac Top down (2-stage process): O start with one of the largest elements as the indexing unit, e.g. the book element in a collection of books Q then, postprocess search results to find for each book the subelement that is the best hit. This two-stage retrieval process often fails to return the best subelement because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it. Sojka, MR Group: PV211: xml retrieval 19 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR indexing unit: approac Bottom up: Instead of retrieving large units and identifying subelements (top down), we can search all leaves, select the most relevant ones and then extend them to larger units in postprocessing. Similar problem as top down: the relevance of a leaf element is often not a good predictor of the relevance of elements it is contained in. Sojka, MR Group: PV211: xml retrieval 20 / 44 Introduction E Jasic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR _ ma exmg unit: approaci Index all elements: the least restrictive approach. Also problematic: • many XML elements are not meaningful search results, e.g., an ISBN number. • indexing all elements means that search results will be highly redundant. Example For the query Macbeth s castle we would return all of the play, act, scene and title elements on the path between the root node and Macbeth's castle. The leaf node would then occur 4 times in the result set: 1 directly and 3 as part of other elements. We call elements that are contained within each other nested elements. Returning redundant nested elements in a list of returned hits is not very user-friendly. Sojka, MR Group: PV211: xml retrieval 21 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR m cnanenge: nestea elements Because of the redundancy caused by nested elements it is common to restrict the set of elements eligible for retrieval. 
Restriction strategies include: • discard all small elements • discard all element types that users do not look at (working XML retrieval system logs) • discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available) • only keep element types that a system designer or librarian has deemed to be useful search results In most of these approaches, result sets will still contain nested elements. Sojka, MR Group: PV211: xml retrieval 22 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR m cnanenge: nestea elements Further techniques: • remove nested elements in a postprocessing step to reduce redundancy. • collapse several nested elements in the results list and use highlighting of query terms to draw the user's attention to the relevant passages. Highlighting 9 Gain 1: enables users to scan medium-sized elements (e.g., a section); thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. 9 Gain 2: paragraphs are presented in-context (i.e., their embedding section). This context may be helpful in interpreting the paragraph. Sojka, MR Group: PV211: xml retrieval 23 / 44 Introduction Basic XML concepts Chal lenges in XML IR Vector space model for XML IR Evaluation of XML IR estea elements an i :erm statistic s Further challenge related to nesting: we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idfi). Example The term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example. Solution: compute idf for XML-con text term pairs. • sparse data problems (many XML-context pairs occur too rarely to reliably estimate df) • compromise: consider the parent node x of the term and not the rest of the path from the root to x to distinguish contexts. Sojka, MR Group: PV211: XML retrieval 24 / 44 Introduction Basic xml concepts Challenge: ; in xml IR Vector space model for xml IR Evaluation of xml IR am ia ea: exica isea su Aim: to have each dimension of the vector space encode a word together with its position within the XML tree. How: Map XML documents to lexicalised subtrees. Sojka, MR Group: PV211: xml retrieval 26 / 44 Introduction Basic xml concepts Challenge: ; in xml IR Vector space model for xml IR Evaluation of xml IR am ia ea: exica isea su 9 Take each text node (leaf) and break it into multiple nodes, one for each word. E.g. split Bill Gates into Bill and Gates. 9 Define the dimensions of the vector space to be lexicalized subtrees of documents - subtrees that contain at least one vocabulary term. Sojka, MR Group: PV211: xml retrieval 27 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them, e.g. using the vector space formalism. Vector space formalism in unstructured VS. structured IR The main difference is that the dimensions of vector space in unstructured retrieval are vocabulary terms whereas they are lexicalized subtrees in XML retrieval. 
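A minimal sketch of one crude way to build such a representation, pairing every term with the root-to-node path of the element it occurs in (a simplification of full lexicalised subtrees; the function and variable names are mine, not from the lecture):

import xml.etree.ElementTree as ET
from collections import Counter

def structural_terms(xml_string):
    # Map a document to a bag of (context, term) pairs, where the context is
    # the root-to-node path of the element whose text contains the term.
    # Only paths ending in a single vocabulary term are kept.
    root = ET.fromstring(xml_string)
    bag = Counter()

    def walk(elem, path):
        path = path + "/" + elem.tag
        for token in (elem.text or "").split():
            bag[(path, token.lower())] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return bag

doc = "<book><author>Bill Gates</author><title>Business</title></book>"
print(structural_terms(doc))
# Counter({('/book/author', 'bill'): 1, ('/book/author', 'gates'): 1, ('/book/title', 'business'): 1})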
Sojka, MR Group: PV211: xml retrieval 28 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR There is a tradeoff between the dimensionality of the space and accuracy of query results. • If we restrict dimensions to vocabulary terms, then we have a standard vector space retrieval system that will retrieve many documents that do not match the structure of the query (e.g., Gates in the title as opposed to the author element). • If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large. Compromise: index all paths that end in a single vocabulary term, in other words, all XML-con text term pairs. We call such an XML-context term pair a structural term and denote it by (c, t): a pair of XML-context c and vocabulary term t. Sojka, MR Group: PV211: xml retrieval 29 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR I ontext resemoiance A simple measure of the similarity of a path cq in a query and path q in a document is the following context resemblance function Cr: GR(cq, cd) = I SUf if c« matches c* [0 \f cq does not match Qy and |q/| are the number of nodes in the query path and document path, resp. cq matches q/ iff we can transform cq into q/ by inserting additional nodes. Sojka, MR Group: PV211: xml retrieval 30 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR 1 cexi c resem D ance example Cr(cc/4, q/2) = 3/4 = 0.75. The value of Cr(cc/, q/) is 1.0 if q and d are identical. Sojka, MR Group: PV211: xml retrieval 31 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR ontext resemoiance exercise book Gates Q3 book ± Gates Q4 book creator Gates d2 CR(cq,Cd) = i íílSl 0 if cq matches cd if cq does not match cd CR(cq4, cd3) =? CR(cq4, cd3) = 3/5 = 0.6 Sojka, MR Group: PV211: xml retrieval 32 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR •ocument simn larity measure The final score for a document is computed as a variant of the cosine measure, which we call SimNoMerge. SimNoMerge((7, d) = Ev^ ^ / \ v ^ . . , x weightfcf, t, q) 2^ CR(ckl q) 2^ weighty t, ck) _ = ckeBqeB tev \/J2ceB,tev we'ght (c/, t, c) • V is the vocabulary of non-structural terms • B is the set of all XML contexts • weight(q, t, c), weight(c/, t, c) are the weights of term t in XML context c in query q and document d, resp. (standard weighting e.g. idft • wft?cy, where idft depends on which elements we use to compute dft. ) SimNoMerge((7, d) is not a true cosine measure since its value can be larger than 1.0. Sojka, MR Group: PV211: xml retrieval 33 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR ScoreDocumentsWithSimNoMerge((7, ß, V, A/, normalizer) 1 for n <- 1 to N 2 do score[n] <— 0 3 for each (cq, t) G q 4 do n/qf- weight(q, t, cq) 5 for each c G ß 6 do if CR(cq, c) > 0 7 then postings <- GetPostings((c, t)) 8 for each posting G postings 9 do xf- Cr(cc/, c) * i/i/q * weight (posting) 10 score[c/oc/D(post/A7g-)]+ = x 11 for a? 
<- 1 to A/ 12 do score[n] ^- score[n]/normalizer[n] 13 return score Sojka, MR Group: PV211: xml retrieval 34 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR INEX: standard benchmark evaluation (yearly) that has produced test collections (documents, sets of queries, and relevance judgments). Based on IEEE journal collection (since 2006 INEX uses the much larger English Wikipedia as a test collection). The relevance of documents is judged by human assessors. INEX 2002 collection statistics 12,107 number of documents 494 MB size 1995-2002 time of publication of articles 1,532 average number of XML nodes per document 6.9 average depth of a node 30 number of CAS topics 30 number of CO topics Sojka, MR Group: PV211: xml retrieval 36 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR Two types: Q content-only or CO topics: regular keyword queries as in unstructured information retrieval Q content-and-structure or CAS topics: have structural constraints in addition to keywords Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval. Sojka, MR Group: PV211: xml retrieval 37 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR IN EX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. Component coverage Evaluates whether the element retrieved is "structurally" correct, i.e., neither too low nor too high in the tree. We distinguish four cases: O Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information. O Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information. O Too large (L): The information sought is present in the component, but is not the main topic. Q No coverage (N): The information sought is not a topic of the component. Sojka, MR Group: PV211: xml retrieval 38 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR assessments The topical relevance dimension also has four levels: highly relevant (3), fairly relevant (2), marginally relevant (1) and nonrelevant (0). Combining the relevance dimensions Components are judged on both dimensions and the judgments are then combined into a digit-letter code, e.g. 2S is a fairly relevant component that is too small. In theory, there are 16 combinations of coverage and relevance, but many cannot occur. For example, a nonrelevant component cannot have exact coverage, so the combination 3N is not possible. Sojka, MR Group: PV211: xml retrieval 39 / 44 Introduction Basic xml concepts Challenges in xml IR Vector space model for xml IR Evaluation of xml IR assessments The relevance-coverage combinations are quantized as follows: Q(re/, cov) = < 1.00 if (rel, cov) = 3E 0.75 if (re/, cov) G{2E,3L} 0.50 if (re/, cov) G{1E,2L,2S} 0.25 if (re/, cov) G{1S,1L} 0.00 if (rel, cov) = ON This evaluation scheme takes account of the fact that binary relevance judgments, which are standard in unstructured IR, are not appropriate for XML retrieval. The quantization function Q does not impose a binary choice relevant/nonrelevant and instead allows us to grade the component as partially relevant. 
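A minimal sketch of this quantization as a lookup table; the graded count of relevant components in a retrieved set, formalized just below, is then a simple sum of these values (names are assumptions):

# Quantization of (topical relevance, component coverage) judgments, following
# the scheme above: 3 = highly relevant ... 0 = nonrelevant;
# E = exact, S = too small, L = too large, N = no coverage.
Q = {
    (3, "E"): 1.00,
    (2, "E"): 0.75, (3, "L"): 0.75,
    (1, "E"): 0.50, (2, "L"): 0.50, (2, "S"): 0.50,
    (1, "S"): 0.25, (1, "L"): 0.25,
    (0, "N"): 0.00,
}

def graded_relevant_count(judged_components):
    # Sum of quantized judgments over a retrieved set A of components.
    return sum(Q.get((rel, cov), 0.0) for rel, cov in judged_components)

retrieved = [(3, "E"), (2, "S"), (0, "N")]   # judgment codes 3E, 2S, 0N
print(graded_relevant_count(retrieved))      # 1.5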
The number of relevant components in a retrieved set A of components can then be computed as:

#(relevant items retrieved) = Σ_{c ∈ A} Q(rel(c), cov(c))

Sojka, MR Group: PV211: xml retrieval 40 / 44
Introduction Basic XML concepts Challenges in XML IR Vector space model for XML IR Evaluation of XML IR
INEX evaluation measures As an approximation, the standard definitions of precision and recall can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments. Drawback: overlap is not accounted for. This is accentuated by the problem of multiple nested elements occurring in a search result. Recent INEX focus: develop algorithms and evaluation measures that return non-redundant result lists and evaluate them properly.
Sojka, MR Group: PV211: xml retrieval 41 / 44
Introduction Basic XML concepts Challenges in XML IR Vector space model for XML IR Evaluation of XML IR
Indexer and search: https://mir.fi.muni.cz
Sojka, MR Group: PV211: xml retrieval 43 / 44
Introduction Basic XML concepts Challenges in XML IR Vector space model for XML IR Evaluation of XML IR
Summary: • Structured or XML IR: an effort to port unstructured (standard) IR know-how onto a scenario that uses structured (DB-like) data. • Specialised applications (e.g. patents, digital libraries). • A decade-old, unsolved problem. • http://inex.is.informatik.uni-duisburg.de/
Sojka, MR Group: PV211: xml retrieval 44 / 44
Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2015-02-18
Sojka, MR Group: PV211: Probabilistic Information Retrieval 1/51
Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions
Overview: Q Recap Q Probabilistic Approach to IR Q Basic Probability Theory Q Probability Ranking Principle Q Appraisal&Extensions
Sojka, MR Group: PV211: Probabilistic Information Retrieval 2/51
Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions
Relevance feedback recap: • The user issues a (short, simple) query. • The search engine returns a set of documents. • The user marks some docs as relevant, some as irrelevant. • The search engine computes a new representation of the information need - it should be better than the initial query. • The search engine runs the new query and returns new results. • The new results have (hopefully) better recall.
Sojka, MR Group: PV211: Probabilistic Information Retrieval 4/51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions Sojka, MR Group: PV211: Probabilistic Information Retrieval 5/51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions query expansi o Manual thesaurus (maintained by editors, e.g., PubMed) • Automatically derived thesaurus (e.g., based on co-occurrence statistics) o Query-equivalence based on query log mining (common on the web as in the "palm" example) Sojka, MR Group: PV211: Probabilistic Information Retrieval 6/51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions uery expansion g • Main source of query expansion at search engines: query logs • Example 1: After issuing the query [herbs], users frequently search for [herbal remedies]. o —>» "herbal remedies" is potential expansion of "herb". • Example 2: Users searching for [flower pix] frequently click on the URL photobucket.com/flower. Users searching for [flower clipart] frequently click on the same URL. • —>» "flower clipart" and "flower pix" are potential expansions of each other. Sojka, MR Group: PV211: Probabilistic Information Retrieval 7/51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Probabilistically grounded approach to IR • Probability Ranking Principle • Models: BIM, BM25 • Assumptions these models make Sojka, MR Group: PV211: Probabilistic Information Retrieval 8/51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Previous lecture: in relevance feedback, the user marks documents as relevant/irrelevant • Given some known relevant and irrelevant documents, we compute weights for non-query terms that indicate how likely they will occur in relevant documents. o Today: develop a probabilistic approach for relevance feedback and also a general probabilistic model for IR. 
Sojka, MR Group: PV211: Probabilistic Information Retrieval 10 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query o An IR system has an uncertain understanding of the user query, and makes an uncertain guess of whether a document satisfies the query • Probability theory provides a principled foundation for such reasoning under uncertainty • Probabilistic models exploit this foundation to estimate how likely it is that a document is relevant to a query Sojka, MR Group: PV211: Probabilistic Information Retrieval 11 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Classical probabilistic retrieval model o Probability ranking principle o Binary Independence Model, Best Match 25 (Okapi) • Bayesian networks for text retrieval • Language model approach to IR o Important recent work, will be covered in the next lecture o Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR Sojka, MR Group: PV211: Probabilistic Information Retrieval 12 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions Boolean model o Probabilistic models support ranking and thus are better than the simple Boolean model. Vector space model o The vector space model is also a formally defined model that supports ranking, o Why would we want to look for an alternative to the vector space model? Sojka, MR Group: PV211: Probabilistic Information Retrieval 13 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions nistic vs. vector sp m • Vector space model: rank documents according to similarity to query. • The notion of similarity does not translate directly into an assessment of "is the document a good document to give to the user or not?" o The most similar document can be highly relevant or completely irrelevant. • Probability theory is arguably a cleaner formalization of what we really want an IR system to do: give relevant documents to the user. Sojka, MR Group: PV211: Probabilistic Information Retrieval 14 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions 9 For events A and B • Joint probability P(A D B) of both events occurring • Conditional probability P(A\B) of event A occurring given that event B has occurred • Chain rule gives fundamental relationship between joint and conditional probabilities: P(AB) = P(A n B) = P(A\B)P(B) = P(B\A)P(A) • Similarly for the complement of an event P(A)\ P(AB) = P(B\A)P(A) • Partition rule: if B can be divided into an exhaustive set of disjoint subcases, then P(B) is the sum of the probabilities of the subcases. 
A special case of this rule gives: P(B) = P(AB) + P(AB) Sojka, MR Group: PV211: Probabilistic Information Retrieval 16 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions Bayes' Rule for inverting conditional probabilities: p{a\b) = p(b\a)p(a) p(b) p(b\a) i^xe{A,Ä} p(b\x)p(x)\ p(a) Can be thought of as a way of updating probabilities: • Start off with prior probability p{a) (initial estimate of how likely event a is in the absence of any other information) • Derive a posterior probability p(a\b) after having seen the evidence b, based on the likelihood of b occurring in the two cases that a does or does not hold Odds of an event provide a kind of multiplier for how probabilities change: Odds: o(a)=m = pw p(a) 1 - p{a) Sojka, MR Group: PV211: Probabilistic Information Retrieval 17 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions nKing rr • Ranked retrieval setup: given a collection of documents, the user issues a query, and an ordered list of documents is returned • Assume binary notion of relevance: Rdq is a random dichotomous variable, such that d Rd,q = 1 if document d is relevant w.r.t query q • Rd,q = 0 otherwise • Probabilistic ranking orders documents decreasingly by their estimated probability of relevance w.r.t. query: P(R = l|c/, q) a Assume that the relevance of each document is independent of the relevance of other documents Sojka, MR Group: PV211: Probabilistic Information Retrieval 19 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions ar|d (ii) the odds of the term appearing if the document is irrelevant (ut/(l — ut)) • ct = 0: term has equal odds of appearing in relevant and irrelevant docs • ct positive: higher odds to appear in relevant documents • ct negative: higher odds to appear in irrelevant documents Sojka, MR Group: PV211: Probabilistic Information Retrieval 33 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions erm weig • ct — log ~~ l°g functions as a term weight. 9 Retrieval status value for document d\ RSVj = J2Xt=qt=i ct- 9 So BIM and vector space model are identical on an operational level . .. o .. . except that the term weights are different. • In particular: we can use the same data structures (inverted index etc) for the two models. Sojka, MR Group: PV211: Probabilistic Information Retrieval 34 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions mpu mtisit inty For each term t in a query, estimate ct in the whole collection using a contingency table of counts of documents in the collection, where dft is the number of documents that contain term t\ documents relevant irrelevant Total Term present xt = 1 Term absent xt = 0 s df t — s S-s (A/ - dft) - (S - s) dfŕ A/-dft Total S N-S N ut = ct = K(N,dít,S,s) = log Pt = s/S (dfŕ - s)/(N - S) s/(S - s) (dfŕ-s)/((/V-dft)-(S-s)) Sojka, MR Group: PV211: Probabilistic Information Retrieval 35 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions mg • If any of the counts is a zero, then the term weight is not well-defined. o Maximum likelihood estimates do not work for rare events. 
• To avoid zeros: add 0.5 to each count (expected likelihood estimation = ELE) • For example, use S — s + 0.5 in formula for S — s Sojka, MR Group: PV211: Probabilistic Information Retrieval 36 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Query: Obama health plan • Docl: Obama rejects allegations about his own bad health • Doc2: The plan is to visit Obama o Doc3: Obama raises concerns with US health plan reforms Estimate the probability that the above documents are relevant to the query. Use a contingency table. These are the only three documents in the collection Sojka, MR Group: PV211: Probabilistic Information Retrieval 37 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions g umption • Assuming that relevant documents are a very small percentage of the collection, approximate statistics for irrelevant documents by statistics from the whole collection • Hence, ut (the probability of term occurrence in irrelevant documents for a query) is dit/N and log[(l - ut)/ut] = log[(/V - dft)/dft] « log A//dft • This should look familiar to you ... o The above approximation cannot easily be extended to relevant documents Sojka, MR Group: PV211: Probabilistic Information Retrieval 38 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions in r • Statistics of relevant documents (pt) in relevance feedback can be estimated using maximum likelihood estimation or ELE (add 0.5). • Use the frequency of term occurrence in known relevant documents. • This is the basis of probabilistic approaches to relevance feedback weighting in a feedback loop • The exercise we just did was a probabilistic relevance feedback exercise since we were assuming the availability of relevance judgments. 
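Putting the contingency table and the add-0.5 smoothing together, a minimal sketch of the resulting term weight c_t (argument names follow the table above; the example numbers are invented):

from math import log

def c_t(N, df_t, S, s):
    # BIM term weight c_t = log [ (p_t/(1-p_t)) / (u_t/(1-u_t)) ],
    # estimated from a contingency table with 0.5 added to each count
    # (expected likelihood estimation) to avoid zeros.
    # N: documents in the collection, df_t: documents containing t,
    # S: known relevant documents, s: relevant documents containing t.
    relevant_odds   = (s + 0.5) / (S - s + 0.5)
    irrelevant_odds = (df_t - s + 0.5) / ((N - df_t) - (S - s) + 0.5)
    return log(relevant_odds / irrelevant_odds)

# A term that occurs in most relevant documents but few others gets a high weight.
print(c_t(N=1000, df_t=50, S=10, s=8))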
Sojka, MR Group: PV211: Probabilistic Information Retrieval 39 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Ad-hoc retrieval: no user-supplied relevance judgments available • In this case: assume that pt is constant over all terms xt in the query and that pt — 0.5 • Each term is equally likely to occur in a relevant document, and so the pt and (1 — pt) factors cancel out in the expression for RSV • Weak estimate, but doesn't disagree violently with expectation that query terms appear in many but not all relevant documents • Combining this method with the earlier approximation for ut, the document ranking is determined simply by which query terms occur in documents scaled by their idf weighting 9 For short documents (titles or abstracts) in one-pass retrieval situations, this estimate can be quite satisfactory Sojka, MR Group: PV211: Probabilistic Information Retrieval 40 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions ry an ummary 9 Among the oldest formal models in IR • Maron & Kuhns, 1960: Since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities • Assumptions for getting reasonable approximations of the needed probabilities (in the BIM): o Boolean representation of documents/queries/relevance o Term independence • Out-of-query terms do not affect retrieval • Document relevance values are independent Sojka, MR Group: PV211: Probabilistic Information Retrieval 42 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions re vector s p • They are not that different. • In either case you build an information retrieval scheme in the exact same way. • For probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory. • Next: how to add term frequency and length normalization to the probabilistic model. Sojka, MR Group: PV211: Probabilistic Information Retrieval 43 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Okapi BM25 is a probabilistic model that incorporates term frequency (i.e., it's nonbinary) and length normalization. • BIM was originally designed for short catalog records of fairly consistent length, and it works reasonably in these contexts 9 For modern full-text search collections, a model should pay attention to term frequency and document length 9 Best Match 25 (a.k.a BM25 or Okapi) is sensitive to these quantities • BM25 is one of the most widely used and robust retrieval models Sojka, MR Group: PV211: Probabilistic Information Retrieval 44 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • The simplest score for document d is just idf weighting of the query terms present in the document: N RSVd = £ log df, Sojka, MR Group: PV211: Probabilistic Information Retrieval 45 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions SftK ic weignting • Improve idf term [log N/df] by factoring in term frequency and document length. RSVd = ]T log teq N df t J (ki + l)titd /c1((l-/?) 
+ 6x(/.d//.ave)) + tfrd • tftcy: term frequency in document d • Lfj (Laye)\ length of document d (average document length in the whole collection) • ki\ tuning parameter controlling the document term frequency scaling o b\ tuning parameter controlling the scaling by document length Sojka, MR Group: PV211: Probabilistic Information Retrieval 46 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Interpret BM25 weighting formula for ki = 0 • Interpret BM25 weighting formula for ki = 1 and b = 0 • Interpret BM25 weighting formula for ki i—>► oo and b = 0 • Interpret BM25 weighting formula for ki oo and b — 1 Sojka, MR Group: PV211: Probabilistic Information Retrieval 47 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions P' weigr iting Tor ong queri es • For long queries, use similar weighting for query terms __(fr + l)tf td__(*3 + l)tf tq '/ci((l-b) + bx (Ld/Laye))+titd' k3+titq • titq\ term frequency in the query q o /C3: tuning parameter controlling term frequency scaling of the query • No length normalization of queries (because retrieval is being done with respect to a single fixed query) • The above tuning parameters should ideally be set to optimize performance on a development test collection. In the absence of such optimization, experiments have shown reasonable values are to set ki and k$ to a value between 1.2 and 2 and b = 0.75 RSVd = ]T log N ďft\ Sojka, MR Group: PV211: Probabilistic Information Retrieval 48 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions ranking m u u • I want something basic and simple —> use vector space with tf-idf weighting. o I want to use a state-of-the-art ranking model with excellent performance —>► use language models or BM25 with tuned parameters o In between: BM25 or language models with no or just one tuned parameter Sojka, MR Group: PV211: Probabilistic Information Retrieval 49 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Probabilistically grounded approach to IR • Probability Ranking Principle • Models: BIM, BM25 • Assumptions these models make Sojka, MR Group: PV211: Probabilistic Information Retrieval 50 / 51 Recap Probabilistic Approach to IR Basic Probability Theory Probability Ranking Principle Appraisal&Extensions • Chapter 11 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library Sojka, MR Group: PV211: Probabilistic Information Retrieval 51 / 51 Recap Feature selection Language models Language Models for IR Discussion PV211: Introduction to Information Retrieval http://www.f i.muni.cz/~ soj ka/PV211 MR 12: Language Models for IR Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2015-02-18 Sojka, MR Group: PV211: Language Models for IR 1/50 Recap Feature selection Language models Language Models for IR Discussion Q Recap Q Feature selection Q Language models Q Language Models for IR iscussion Sojka, MR Group: PV211: Language Models for IR 2/50 Recap Feature selection Language models Language Models for IR Discussion Qnap = argmax [log P(c) + Yl loS^(r/ 0. 
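Returning to the Okapi BM25 weighting introduced above, a minimal sketch of the document-side score (without the k_3 factor for long queries); the tf, df and length statistics are assumed to come from an index, and the example values are invented:

from math import log

def bm25_score(query_terms, tf_d, L_d, L_ave, df, N, k1=1.2, b=0.75):
    # RSV_d = sum over query terms t of
    #   log(N/df_t) * ((k1+1)*tf_td) / (k1*((1-b) + b*L_d/L_ave) + tf_td)
    # tf_d: term -> frequency in document d, L_d: length of d,
    # L_ave: average document length, df: term -> document frequency,
    # N: number of documents; k1 and b are the usual tuning parameters.
    score = 0.0
    for t in query_terms:
        if t not in tf_d or t not in df:
            continue
        idf = log(N / df[t])
        norm = k1 * ((1 - b) + b * L_d / L_ave)
        score += idf * (k1 + 1) * tf_d[t] / (norm + tf_d[t])
    return score

print(bm25_score(["revenue", "down"],
                 tf_d={"revenue": 2, "down": 1}, L_d=8, L_ave=10,
                 df={"revenue": 120, "down": 300}, N=10000))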
Sojka, MR Group: PV211: Language Models for IR 6/50 Recap Feature selection Language models Language Models for IR Discussion P(c| = S^jM o P{q) is the same for all documents, so ignore • P{d) is the prior - often treated as the same for all d o But we can give a higher prior to "high-quality" documents, e.g., those with high PageRank. • P{q\d) is the probability of q given d . • For uniform prior: ranking documents according according to P{q\d) and P{d\q) is equivalent. Sojka, MR Group: PV211: Language Models for IR 28 / 50 Recap Feature selection Language models Language Models for IR Discussion re w 9 In the LM approach to IR, we attempt to model the query generation process. o Then we rank documents by the probability that a query would be observed as a random sample from the respective document model. o That is, we rank according to P(q\d). • Next: how do we compute P(q\d)l Sojka, MR Group: PV211: Language Models for IR 29 / 50 Recap Feature selection Language models Language Models for IR Discussion • We will make the same conditional independence assumption as for Naive Bayes. P(q\Md) = P((t1,...^q\)\Md)= J] P(tk\Md) l d\ Sojka, MR Group: PV211: Language Models for IR 35 / 50 Recap Feature selection Language models Language Models for IR Discussion mpu mg • Collection: d\ and d2 • d\\ Xerox reports a profit but revenue is down • d2'. Lucene narrows quarter loss but revenue decreases furthe • Query q\ revenue down • Use mixture model with A = 1/2 o P(q\d1) = [(1/8 + 2/16)/2] • [(1/8 + l/16)/2] = 1/8 • 3/32 3/256 o P(q\d2) = [(1/8 + 2/16)/2] • [(0/8 + l/16)/2] = 1/8 • 1/32 1/256 • Ranking: d\ > d2 Sojka, MR Group: PV211: Language Models for IR 36 / 50 p(tld) = tft,d+ aP(t\Mc) Ld + a • The background distribution P(t\Mc) is the prior for P(t\d). • Intuition: Before having seen any part of the document we start with the background distribution as our estimate. • As we read the document and count terms we update the background distribution. • The weighting factor a determines how strong an effect the prior has. Sojka, MR Group: PV211: Language Models for IR 37 / 50 Recap Feature selection Language models Language Models for IR Discussion • Dirichlet performs better for keyword queries, Jelinek-Mercer performs better for verbose queries. • Both models are sensitive to the smoothing parameters - you shouldn't use these models without parameter tuning. Sojka, MR Group: PV211: Language Models for IR 38 / 50 Recap Feature selection Language models Language Models for IR Discussion fi is the Dirichlet smoothing parameter (called a on the previous slides) Sojka, MR Group: PV211: Language Models for IR Recap Feature selection Language models Language Models for IR Discussion nguage m Mi re g We have assumed that queries are generated by a probabilistic process that looks like this: (as in Naive Bayes) C= China 4=Beijing) C X2=and X3=Taipei X4=JOIN X5=WTO Sojka, MR Group: PV211: Language Models for IR 41 / 50 Recap Feature selection Language models Language Models for IR Discussion generative m 9 We want to classify document d. We want to classify a query q. o Classes: e.g., geographical regions like China, UK, Kenya. Each document in the collection is a different class. • Assume that d was generated by the generative model. Assume that q was generated by a generative model • Key question: Which of the classes is most likely to have generated the document? Which document (=class) is most likely to have generated the query ql o Or: for which class do we have the most evidence? 
For which document (as the source of the query) do we have the most evidence? Sojka, MR Group: PV211: Language Models for IR 42 / 50 Recap Feature selection Language models Language Models for IR Discussion uitinomi nguag< eis C=China Xi=Beijing X2=and X3=Taipei X4=join X5=WTO Sojka, MR Group: PV211: Language Models for IR 43 / 50 Recap Feature selection Language models Language Models for IR Discussion Sojka, MR Group: PV211: Language Models for IR 44 / 50 multinomial model / IR language model Bernoulli model / BIM event model random variable(s) doc. representation parameter estimation dec. rule: maximize multiple occurrences length of docs # features estimate for the generation of (multi)set of tokens X = t iff t occurs at given pos d=(t! tnd),tk g V P(X = t\c) Hc)U1 Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar cours I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm How would you write a program that would automatically detect and delete this type of message? Sojka, MR Group: PV211: Text Classification & Naive Bayes 5/52 Text classification Naive Bayes NB theory Evaluation of TC rm ining Given: • A document space X o Documents are represented in this space - typically some type of high-dimensional space. • A fixed set of classes C = {ci, c2,..., cj} o The classes are human-defined for the needs of an application (e.g., spam vs. nonspam). • A training set D of labeled documents. Each labeled document (d,c) e X x C Using a learning method or learning algorithm, we then wish to learn a classifier 7 that maps documents to classes: 7:X^C Sojka, MR Group: PV211: Text Classification & Naive Bayes 6/52 Text classification Naive Bayes NB theory Evaluation of TC Given: a description d e X of a document Determine: 7(d) e C, that is, the class that is most appropriate for d Sojka, MR Group: PV211: Text Classification & Naive Bayes 7/52 Text classification Naive Bayes NB theory Evaluation of TC classes: training set: SUBJECT AREAS UK China poultry^) (^ofFee^) (^hctions^) (^sports congestion London Olympics Beijing feed chicken roasting beans recount votes diamond baseball Parliament Big Ben tourism Ireat Wall pate ducks arabica robusta seat run-off forward soccer Windsor the Queen Mao communist bird flu turkey Kenya harvest TV ads campaign team captain 7(ď) =China ď test set: first private Chinese airline Sojka, MR Group: PV211: Text Classification & Naive Bayes 8/52 Text classification Naive Bayes NB theory Evaluation of TC • Find examples of uses of text classification in information retrieval Sojka, MR Group: PV211: Text Classification & Naive Bayes 9/52 Text classification Naive Bayes NB theory Evaluation of TC o Language identification (classes: English vs. French etc.) o The automatic detection of spam pages (spam vs. nonspam) • Sentiment detection: is a movie or product review positive or negative (positive vs. negative) • Topic-specific or vertical search - restrict search to a "vertical" like "related to health" (relevant to vertical vs. not) Sojka, MR Group: PV211: Text Classification & Naive Bayes 10 / 52 Text classification Naive Bayes NB theory Evaluation of TC • Manual classification was used by Yahoo in the beginning of the web. 
Also: MathSciNet (MSC), DMOZ - the Open Directory Project, PubMed • Very accurate if job is done by experts. • Consistent when the problem size and team is small. • Scaling manual classification is difficult and expensive. • —> We need automatic methods for classification. Sojka, MR Group: PV211: Text Classification & Naive Bayes 11 / 52 Text classification Naive Bayes NB theory Evaluation of TC • E.g., Google Alerts is rule-based classification. • There are IDE-type development environments for writing very complex rules efficiently, (e.g., Verity) • Often: Boolean combinations (as in Google Alerts). • Accuracy is very high if a rule has been carefully refined over time by a subject expert. • Building and maintaining rule-based classification systems is cumbersome and expensive. Sojka, MR Group: PV211: Text Classification & Naive Bayes 12 / 52 Text classification Naive Bayes NB theory Evaluation of TC mpi comment line top-le\eltop ic topic deinition modifiers ■ subtopi atopic eviden ectopic topicdeinrtion modifier evidencetopic topicdelnition modifier evidencetopic topic delnrtion modifier evidencetopic topic delnition modifier SubtOpC * Beginning of art topic definition art ACCRUE /author = "fsmith" /date = "30-Dec-01" /annotation = "Topic created by fsmith" * 0.70 performing-arts ACCRUE ** 0.50 WORD /wordtext = ballet ** 0.50 STEM /wordtext = dance ** 0.50 WORD /wordtext = opera ** 0.30 WORD /wordtext = symphony * 0.70 visual-arts ACCRUE ** 0.50 WORD /wordtext = painting ** 0.50 WORD /wordtext = sculpture subtopic subtope gjbto p c * 0.70 film ACCRUE ** 0.50 STEM /wordtext = f ilm ** 0.50 motion-picture PHRASE *** 1.00 WORD /wordtext = motion *** l. 00 WORD /wordte>:t - picture ** 0.50 STEM /wardtext = movie * 0 . 50 video ACCRUE ** 0.50 STEM /wordtext = video ** 0.50 STEM /wordtext = vcr * End of art topic Sojka, MR Group: PV211: Text Classification & Naive Bayes 13 / 52 Text classification Naive Bayes NB theory Evaluation of TC • This was our definition of the classification problem - text classification as a learning problem. • (i) Supervised learning of the classification function 7 and (ii) application of 7 to classifying new documents. • We will look at two methods for doing this: Naive Bayes and SVMs • No free lunch: requires hand-classified training data. • But this manual classification can be done by non-experts. Sojka, MR Group: PV211: Text Classification & Naive Bayes 14 / 52 Text classification Naive Bayes NB theory Evaluation of TC • The Naive Bayes classifier is a probabilistic classifier. 
• We compute the probability of a document d being in a class c as follows: p(c\d) We will get P(China\d) — 0 for any document that contains WTO Sojka, MR Group: PV211: Text Classification & Naive Bayes 22 / 52 Text classification Naive Bayes NB theory Evaluation of TC • Before: Ht\c) = ct • Now: Add one to each count to avoid zeros: Tct + 1 Tct + 1 P(t\c) =_,ct-r±_=_,ct-r±_ J2t'ev{Tct' + 1) (J2t'ev Tct1) + B • B is the number of bins - in this case the number of different words or the size of the vocabulary V = M Sojka, MR Group: PV211: Text Classification & Naive Bayes 23 / 52 Text classification Naive Bayes NB theory Evaluation of TC • Estimate parameters from the training corpus using add-one smoothing o For a new document, for each class, compute sum of (i) log of prior and (ii) logs of conditional probabilities of the terms • Assign the document to the class with the largest score Sojka, MR Group: PV211: Text Classification & Naive Bayes 24 / 52 Text classification Naive Bayes NB theory Evaluation of TC Naive Bayes: Training TrainMultinomialNB(C, D) 1 V <- ExtractVocabulary(D) 2 N <- CountDocs(D) 3 for each c e C 4 do Nc i— CountDocsInClass(D, c) 5 prior[c] News > Science > Article Go to a Section: U.S. International Business Markets Politics Entertainment Technology Sports Oddly Enouc Extreme conditions create rare Antarctic clouds Tue Aug 1. 2006 3:20am ET _Email This Article | Print This Article | Reprints [-] Text [+] SYDNEY (Reuters) - Rare, mother-of-pearl colored clouds caused by extreme weather conditions above Antarctica are a possible indication of global warming, Australian scientists said on Tuesday. Known as nacreous clouds: the spectacular formations showing delicate wisps of colors were photographed in the sky over an Australian Sojka, MR Group: PV211: Text Classification & Naive Bayes 45 / 52 Text classification Naive Bayes NB theory Evaluation of TC • Evaluation must be done on test data that are independent of the training data, i.e., training and test sets are disjoint. • It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set). • Measures: Precision, recall, F\, classification accuracy • Average measures over multiple training and test sets (splits of the overall data) for best results. Sojka, MR Group: PV211: Text Classification & Naive Bayes 46 / 52 Text classification Naive Bayes NB theory Evaluation of TC n in the class not in the class predicted to be in the class true positives (TP) false positives (FP) predicted to not be in the class false negatives (FN) true negatives (TN) TP, FP, FN, TN are counts of documents. The sum of these four counts is the total number of documents. precisions = TP/{TP + FP) recall:/? = TP/{TP + FN) Sojka, MR Group: PV211: Text Classification & Naive Bayes 47 / 52 Text classification Naive Bayes N B theory Evaluation of TC • Fi allows us to trade off precision against recall _ 1 2PR Fi = 11,11 p i o 2P"h2/? ^ This is the harmonic mean of P and R\ -j= = |(-^ + Sojka, MR Group: PV211: Text Classification & Naive Bayes 48 / 52 Text classification Naive Bayes NB theory Evaluation of TC A Micro vs. Macro Mvera ging: • We now have an evaluation measure (Fi) for one class. o But we also want a single number that measures the aggregate performance over all classes in the collection. 
• Macroaveraging • Compute Fi for each of the C classes o Average these C numbers • Microaveraging o Compute TP, FP, FN for each of the C classes o Sum these C numbers (e.g., all TP to get aggregate TP) o Compute Fi for aggregate TP, FP, FN Sojka, MR Group: PV211: Text Classification & Naive Bayes 49 / 52 Text classification Naive Bayes NB theory Evaluation of TC NB Rocchio kNN SVM micro-avg-L (90 classes) 80 85 86 89 macro-avg (90 classes) 47 59 60 60 NB Rocchio kNN trees SVM earn 96 93 97 98 98 acq 88 65 92 90 94 money-fx 57 47 78 66 75 grain 79 68 82 85 95 crude 80 70 86 85 89 trade 64 65 77 73 76 interest 65 63 74 67 78 ship 85 49 79 74 86 wheat 70 69 77 93 92 corn 65 48 78 92 90 micro-avg (top 10) 82 65 82 88 92 micro-avg-D (118 classes) 75 62 n/a n/a 87 Evaluation measure: F\ Naive Bayes does pretty well, but some methods beat it consistently (e.g., SVM). Sojka, MR Group: PV211: Text Classification & Naive Bayes 50 / 52 Text classification Naive Bayes NB theory Evaluation of TC • Text classification: definition & relevance to information retrieval • Naive Bayes: simple baseline text classifier o Theory: derivation of Naive Bayes classification rule & analysis • Evaluation of text classification: how do we know it worked / didn't work? Sojka, MR Group: PV211: Text Classification & Naive Bayes 51 / 52 Text classification Naive Bayes NB theory Evaluation of TC • Chapter 13 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library • Weka: A data mining software package that includes an implementation of Naive Bayes o Reuters-21578 - text classification evaluation set o Vulgarity classifier fail Sojka, MR Group: PV211: Text Classification & Naive Bayes 52 / 52 Intro vector space classification Rocchio kNN Linear classifiers > two classes PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 14: Vector Space Classification Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-04-04 Sojka, MR Group: PV211: Vector Space Classification 1/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Q Intro vector space classification Q Rocchio Q kNN Q Linear classifiers 0 > two classes Sojka, MR Group: PV211: Vector Space Classification 2/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Vector space classification: Basic idea of doing text classification for documents that are represented as vectors • Rocchio classifier: Rocchio relevance feedback idea applied to text classification o k nearest neighbor classification o Linear classifiers • More than two classes Sojka, MR Group: PV211: Vector Space Classification 3/63 Intro vecto r space classificatior i Rocchio kNN Linear classifiers > two classes map for ■ • Naive Bayes is simple and a good baseline. • Use it if you want to get a text classifier up and running in a hurry. • But other classification methods are more accurate. • Perhaps the simplest well performing alternative: kNN o kNN is a vector space classifier. 9 Plan for rest of today Q intro vector space classification 0 very simple vector space classification: Rocchio O kNN Q general properties of classifiers Sojka, MR Group: PV211: Vector Space Classification 5/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes V • Each document is a vector, one component for each term. • Terms are axes. 
• High dimensionality: 100,000s of dimensions • Normalize vectors (documents) to unit length • How can we do classification in this space? Sojka, MR Group: PV211: Vector Space Classification 6/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • As before, the training set is a set of documents, each labeled with its class. • In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space. • Premise 1: Documents in the same class form a contiguous region. • Premise 2: Documents from different classes don't overlap. • We define lines, surfaces, hyper-surfaces to divide regions. Sojka, MR Group: PV211: Vector Space Classification 7/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Should the document * be assigned to China, UK or Kenya? Find separators between the classes Based on these separators: * should be assigned to China How do we find separators that do a good job at classifying new documents like *? - Main topic of today Sojka, MR Group: PV211: Vector Space Classification 8/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes grapns can oe misieaain x2 x3 x4 Left: A projection of the 2D semicircle to ID. For the points xi, X2, X3, X4, X5 at x coordinates -0.9,-0.2,0,0.2,0.9 the distance |x2x3| « 0.201 only differs by 0.5% from IX2X3I = 0.2; but |xiX3|/|x{x3| = c/true/cfprojected ~ 1.06/0.9 « 1.18 is an example of a large distortion (18%) when projecting a large area. Right: The corresponding projection of the 3D hemisphere to 2D. Sojka, MR Group: PV211: Vector Space Classification 9/63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • In relevance feedback, the user marks documents as relevant/non-relevant. • Relevant/non-relevant can be viewed as classes or categories. • For each document, the user decides which of these two classes is correct. o The IR system then uses these class assignments to build a better query ("model") of the information need .. . • .. . and returns better documents. • Relevance feedback is a form of text classification. o The notion of text classification (TC) is very general and has many applications within and beyond information retrieval. Sojka, MR Group: PV211: Vector Space Classification 11 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes o The principal difference between relevance feedback and text classification: • The training set is given as part of the input in text classification, o It is interactively created in relevance feedback. Sojka, MR Group: PV211: Vector Space Classification 12 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Compute a centroid for each class o The centroid is the average of all documents in the class. • Assign each test document to the class of its closest centroid. Sojka, MR Group: PV211: Vector Space Classification 13 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes deDc where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d. 
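A minimal sketch of Rocchio classification as just described: one centroid per class, and each test document is assigned to the class of its closest centroid (this assumes documents are already represented as, e.g., normalized tf-idf vectors; the toy data are invented):

import numpy as np

def train_rocchio(X, y):
    # Compute one centroid per class: the mean of that class's document vectors.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def apply_rocchio(centroids, d):
    # Assign d to the class of the closest centroid (Euclidean distance).
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))

# Toy 2D example: two classes, each a single compact cluster.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array(["china", "china", "uk", "uk"])
centroids = train_rocchio(X, y)
print(apply_rocchio(centroids, np.array([0.8, 0.3])))   # -> 'china'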
Sojka, MR Group: PV211: Vector Space Classification 14 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes TrainRocchio(C,D) 1 for each q e C 2 do Dj two classes Sojka, MR Group: PV211: Vector Space Classification 16 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes io properti' • Rocchio forms a simple representation for each class: the centroid o We can interpret the centroid as the prototype of the class. • Classification is based on similarity to / distance from centroid / prototype. 9 Does not guarantee that classifications are consistent with the training data! Sojka, MR Group: PV211: Vector Space Classification 17 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Time complexity of Rocchio mode time complexity training testing e(|D|Lave + |c||\/|)«e(|D|LaVe) e(La + |C|Ma) « 0(|C|Ma) Sojka, MR Group: PV211: Vector Space Classification 18 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • In many cases, Rocchio performs worse than Naive Bayes. • One reason: Rocchio does not handle nonconvex, multimodal classes correctly. Sojka, MR Group: PV211: Vector Space Classification 19 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes nn m u in m Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here? • A is centroid of the a's, B is centroid of the b's. • The point o is closer to A than to B. • But o is a better fit for the b class. • A is a multimodal class with two prototypes. • But in Rocchio we only have one prototype. Sojka, MR Group: PV211: Vector Space Classification 20 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • kNN classification is another vector space classification method. • It also is very simple and easy to implement. a kNN is more accurate (in most cases) than Naive Bayes and Rocchio. • If you need to get a pretty accurate classifier up and running in a short time . .. o .. . and you don't care about efficiency that much . .. • ... use kNN. Sojka, MR Group: PV211: Vector Space Classification 22 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • kNN = k nearest neighbors 9 kNN classification rule for k — 1 (INN): Assign each test document to the class of its nearest neighbor in the training set. • INN is not very robust - one document can be mislabeled or atypical. • kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set. 9 Rationale of kNN: contiguity hypothesis • We expect a test document d to have the same label as the training documents located in the local region surrounding d. 
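A minimal sketch of the kNN rule just described: assign a test document to the majority class among its k nearest neighbors (with unit-length vectors, cosine similarity reduces to a dot product; the toy data are invented):

import numpy as np
from collections import Counter

def knn_classify(X, y, d, k=3):
    # X: row-wise training document vectors, y: their class labels.
    # Assign test vector d to the majority class of its k nearest neighbors.
    sims = X @ d                       # cosine similarity for unit-length rows
    nearest = np.argsort(-sims)[:k]    # indices of the k most similar documents
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize to unit length
y = np.array(["china", "china", "uk", "uk"])
print(knn_classify(X, y, np.array([0.7, 0.3]), k=3))   # -> 'china'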
Sojka, MR Group: PV211: Vector Space Classification 23 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
Probabilistic kNN
• Probabilistic version of kNN: P(c|d) = fraction of the k neighbors of d that are in c
• kNN classification rule for probabilistic kNN: Assign d to the class c with the highest P(c|d)
Sojka, MR Group: PV211: Vector Space Classification 24 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
Sojka, MR Group: PV211: Vector Space Classification 25 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
kNN algorithm
Train-kNN(C, D)
1 D' ← Preprocess(D)
2 k ← Select-k(C, D')
3 return D', k
Apply-kNN(D', k, d)
1 S_k ← ComputeNearestNeighbors(D', k, d)
2 for each c_j ∈ C(D')
3 do p_j ← |S_k ∩ c_j| / k
4 return arg max_j p_j
Exercise
[Figure: training points from two classes (o and x) and a test point ★.]
How is the star classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
Sojka, MR Group: PV211: Vector Space Classification 27 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
Time complexity of kNN
kNN with preprocessing of training set
training: Θ(|D| L_ave)
testing: Θ(L_a + |D| M_ave M_a) = Θ(|D| M_ave M_a)
• kNN test time is proportional to the size of the training set!
• The larger the training set, the longer it takes to classify a test document.
• kNN is inefficient for very large training sets.
• Question: Can we divide up the training set into regions, so that we only have to search in one region to do kNN classification for a given test document? (which perhaps would give us better than linear time complexity)
Sojka, MR Group: PV211: Vector Space Classification 28 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
• Our intuitions about space are based on the 3D world we live in.
• Intuition 1: some things are close by, some things are distant.
• Intuition 2: we can carve up space into areas such that: within an area things are close, distances between areas are large.
• These two intuitions don't necessarily hold for high dimensions.
• In particular: for a set of k uniformly distributed points, let d_min be the smallest distance between any two points and d_max be the largest distance between any two points.
• Then lim_{d -> infinity} (d_max - d_min)/d_min = 0
Sojka, MR Group: PV211: Vector Space Classification 29 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
• Simulate lim_{d -> infinity} (d_max - d_min)/d_min = 0
• Pick a dimensionality d
• Generate 10 random points in the d-dimensional hypercube (uniform distribution)
• Compute all 45 distances
• Compute (d_max - d_min)/d_min
• We see that intuition 1 (some things are close, others are distant) is not true for high dimensions.
Sojka, MR Group: PV211: Vector Space Classification 30 / 63
Intro vector space classification Rocchio kNN Linear classifiers > two classes
• Intuition 2: we can carve up space into areas such that: within an area things are close, distances between areas are large.
• If this is true, then we have a simple and efficient algorithm for kNN.
• To find the k closest neighbors of a data point ⟨x_1, x_2, ..., x_d⟩ do the following.
• Using binary search, find all data points whose first dimension is in [x_1 - ε, x_1 + ε]. This is O(log n) where n is the number of data points.
• Do this for each dimension, then intersect the d subsets.
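The simulation described above is only a few lines of code. The sketch below (illustrative names, assuming NumPy) draws 10 uniform random points in the d-dimensional hypercube, computes all 45 pairwise distances, and reports (d_max - d_min)/d_min:

import numpy as np
import itertools

def relative_spread(d, n_points=10, seed=0):
    # (d_max - d_min)/d_min for n_points uniform random points in the d-dimensional hypercube.
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, d))
    dists = [np.linalg.norm(p - q) for p, q in itertools.combinations(points, 2)]  # 45 pairs
    return (max(dists) - min(dists)) / min(dists)

for d in [1, 10, 100, 1000, 10000]:
    print(d, round(relative_spread(d), 3))
# The ratio shrinks toward 0 as d grows: all points become roughly equidistant.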
Sojka, MR Group: PV211: Vector Space Classification 31 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Size of data set n = 100 • Again, assume uniform distribution in hypercube • Set e = 0.05: we will look in an interval of length 0.1 for neighbors on each dimension. • What is the probability that the nearest neighbor of a new data point x is in this neighborhood in d = 1 dimension? • for d = 1: 1 - (1 - 0.1)100 « 0.99997 • In d = 2 dimensions? • for d = 2: 1 - (1 - 0.12)100 « 0.63 • In d = 3 dimensions? • for d = 3: 1 - (1 - 0.13)100 « 0.095 • In c/ = 4 dimensions? • for c/ = 4: 1 - (1 - 0.14)100 « 0.0095 • In d = 5 dimensions? • for c/ = 5: 1 - (1 - 0.15)100 w 0.0009995 Sojka, MR Group: PV211: Vector Space Classification 32 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • In d = 5 dimensions? o for d = 5: 1 - (1 - 0.15)100 « 0.0009995 • In other words: with enough dimensions, there is only one "local" region that will contain the nearest neighbor with hig certainty: the entire search space. • We cannot carve up high-dimensional space into neat neighborhoods .. . • .. . unless the "true" dimensionality is much lower than d. Sojka, MR Group: PV211: Vector Space Classification 33 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • No training necessary • But linear preprocessing of documents is as expensive as training Naive Bayes. • We always preprocess the training set, so in reality training time of kNN is linear. • kNN is very accurate if training set is large. • Optimality result: asymptotically zero error if Bayes rate is zero. • But kNN can be very inaccurate if training set is small. Sojka, MR Group: PV211: Vector Space Classification 34 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Definition: • A linear classifier computes a linear combination or weighted sum ^21 WjXj of the feature values, o Classification decision: J2i wixi > ^ 9 . . .where 9 (the threshold) is a parameter. • (First, we only consider binary classifiers.) o Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities), the separator. • We find this separator based on training set. o Methods for finding separator: Perceptron, Rocchio, Naive Bayes - as we will explain on the next slides • Assumption: The classes are linearly separable. Sojka, MR Group: PV211: Vector Space Classification 36 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • A linear classifier in ID is a point described by the equation w\d\ — 9 o The point at 9/w\ • Points (d\) with w\d\ > 9 < # >-^ are in the class c. • Points (di) with w\d\ < 9 are in the complement class c. Sojka, MR Group: PV211: Vector Space Classification 37 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • A linear classifier in 2D is a line described by the equation w\d\ + 1/1/2^2 = 9 • Example for a 2D linear classifier • Points (di c/2) with w\d\ + \N2d2 > 6 are in the class c. • Points {d\ c/2) with w\d\ + \N2d2 < 0 are in the complement class c. Sojka, MR Group: PV211: Vector Space Classification 38 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • A linear classifier in 3D is a plane described by the equation w\d\ + W2Ö2 + w^ds = 9 o Example for a 3D linear classifier • Points (di d2 ds) with widi + \N2_d2 + 1/1/3C/3 > 9 are in the class c. 
• Points (di d2 ds) with are in the complement class c. Sojka, MR Group: PV211: Vector Space Classification 39 / 63 Intro vector space classificatior Rocchio kNN Linear classifiers > two classes in • Rocchio is a linear classifier defined by: M Wjdj = wd = 9 i=i where w is the normal vector /2(ci) — /2(c2) and ^ = 0.5*(|/i(ci)|2-|/I(c2)|2). Sojka, MR Group: PV211: Vector Space Classification 40 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Multinomial Naive Bayes is a linear classifier (in log space) defined by: M widi = 9 i=i where Wj = log[P(t,-|c)/P(t/|c)]f d\ — number of occurrences of t\ in d, and 6 = - log[P(c)/P(c)j. Here, the index /, 1 < / < M, refers to terms of the vocabulary (not to positions in d as k did in our original definition of Naive Bayes) Sojka, MR Group: PV211: Vector Space Classification 41 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Classification decision based on majority of k nearest neighbors. • The decision boundaries between classes are piecewise linear . .. o ... but they are in general not linear classifiers that can be described as Y!f=1 Midi = 9. Sojka, MR Group: PV211: Vector Space Classification 42 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes ti i/i// du d2i ti i/i/,- du d2i prime 0.70 0 1 dlrs -0.71 1 1 rate 0.67 1 0 world -0.35 1 0 interest 0.63 0 0 sees -0.33 0 0 rates 0.60 0 0 year -0.25 0 0 discount 0.46 1 0 group -0.24 0 0 bundesbank 0.43 0 0 dlr -0.24 0 0 • This is for the class interest in Reuters-21578. • For simplicity: assume a simple 0/1 vector representation • d\. "rate discount dlrs world" • d2\ "prime dlrs" • (9 = 0 • Exercise: Which class is d\ assigned to? Which class is d2 assigned to? • We assign document d\ "rate discount dlrs world" to interest since wTdx = 0.67 • 1 + 0.46 • 1 + (-0.71) • 1 + (-0.35) • 1 = 0.07 > 0 = 0. 9 We assign d2 "prime dlrs" to the complement class (not in interest) since wTd2 -0.01 < 0. Sojka, MR Group: PV211: Vector Space Classification 43 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes n Sojka, MR Group: PV211: Vector Space Classification 44 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes rnmg aig< r v • In terms of actual computation, there are two types of learning algorithms. • (i) Simple learning algorithms that estimate the parameters of the classifier directly from the training data, often in one linear pass. • Naive Bayes, Rocchio, kNN are all examples of this. • (ii) Iterative algorithms • Support vector machines • Perceptron (example available as PDF on website: http://cislmu.org) o The best performing learning algorithms usually require iterative learning. 
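The classification decision in the "interest" example is just a weighted sum followed by a threshold test. Here is a short sketch that reproduces the two decisions above, using the weights from the table, 0/1 term presence, and θ = 0 (the helper function is illustrative, not part of the course code):

weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}
theta = 0.0

def classify(doc_terms):
    # Linear classifier: sum the weights of the terms present in the document (0/1 representation).
    score = sum(weights.get(t, 0.0) for t in set(doc_terms))
    return ("interest", score) if score > theta else ("not interest", score)

print(classify("rate discount dlrs world".split()))  # ('interest', ~0.07)
print(classify("prime dlrs".split()))                # ('not interest', ~-0.01)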
Sojka, MR Group: PV211: Vector Space Classification 45 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes srcepiron up< ru • Randomly initialize linear separator w • Do until convergence: o Pick data point x • If s\gn(wTx) is correct class (1 or -1): do nothing • Otherwise: w = w — s\gn(wTx)x Sojka, MR Group: PV211: Vector Space Classification 46 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Sojka, MR Group: PV211: Vector Space Classification 47 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Sojka, MR Group: PV211: Vector Space Classification 48 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Sojka, MR Group: PV211: Vector Space Classification 49 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Sojka, MR Group: PV211: Vector Space Classification 50 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes n Sojka, MR Group: PV211: Vector Space Classification 51 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes n • For linearly separable training sets: there are infinitely many separating hyperplanes. • They all separate the training set perfectly ... o .. . but they behave differently on test data. • Error rates on new data are low for some, high for others. • How do we find a low-error separator? • Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good Sojka, MR Group: PV211: Vector Space Classification 52 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc. • Each method has a different way of selecting the separating hyperplane o Huge differences in performance on test documents • Can we get better performance with more powerful nonlinear classifiers? • Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary. Sojka, MR Group: PV211: Vector Space Classification 53 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes A nonlinear pre • Linear classifier like Rocchio does badly on this task. • kNN will do well (assuming enough training data) Sojka, MR Group: PV211: Vector Space Classification 54 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes r a given u m • Is there a learning method that is optimal for all text classification problems? • No, because there is a trade-off between bias and variance. • Factors to take into account: o How much training data is available? o How simple/complex is the problem? (linear vs. nonlinear decision boundary) o How noisy is the problem? • How stable is the problem over time? o For an unstable problem, it's better to use a simple and robust classifier. 
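For reference, the perceptron update rule shown earlier in this section fits in a few lines. The sketch below is a toy illustration: it cycles through the data instead of picking random points, and it assumes the classes are linearly separable so the loop terminates; for a misclassified point the update w + y*x is the same as the slide's w - sign(w^T x)*x.

import numpy as np

def train_perceptron(X, y, max_epochs=100, seed=0):
    # y must contain labels +1/-1. Each misclassified point moves the separator toward it.
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # randomly initialize the linear separator
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:      # wrong side of the hyperplane
                w = w + y_i * x_i            # perceptron update
                errors += 1
        if errors == 0:                      # convergence: every training point is correct
            break
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w))  # -> [ 1.  1. -1. -1.]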
Sojka, MR Group: PV211: Vector Space Classification 55 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes mDin yperpian Sojka, MR Group: PV211: Vector Space Classification 57 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • One-of or multiclass classification o Classes are mutually exclusive, o Each document belongs to exactly one class, o Example: language of a document (assumption: no document contains multiple languages) Sojka, MR Group: PV211: Vector Space Classification 58 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes mcation wi • Combine two-class linear classifiers as follows for one-of classification: Run each classifier separately o Rank classifiers (e.g., according to score) o Pick the class with the highest score Sojka, MR Group: PV211: Vector Space Classification 59 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes y- 9 Any-of or multilabel classification o A document can be a member of 0, 1, or many classes, o A decision on one class leaves decisions open on all other classes. • A type of "independence" (but not statistical independence) o Example: topic classification o Usually: make decisions on the region, on the subject area, on the industry and so on "independently" Sojka, MR Group: PV211: Vector Space Classification 60 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Combine two-class linear classifiers as follows for any-of classification: o Simply run each two-class classifier separately on the test document and assign document accordingly Sojka, MR Group: PV211: Vector Space Classification 61 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Vector space classification: Basic idea of doing text classification for documents that are represented as vectors • Rocchio classifier: Rocchio relevance feedback idea applied to text classification o k nearest neighbor classification o Linear classifiers • More than two classes Sojka, MR Group: PV211: Vector Space Classification 62 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes • Chapter 13 of MR (feature selection) • Chapter 14 of MR • Resources at http://cislmu.org o Perceptron example • General overview of text classification: Sebastiani (2002) • Text classification chapter on decision tress and perceptrons: Manning & Schutze (1999) • One of the best machine learning textbooks: Hastie, Tibshirani & Friedman (2003) Sojka, MR Group: PV211: Vector Space Classification 63 / 63 Zone scoring Machine-learned scoring Ranking SVMs PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 15-2: Learning to rank Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-17 Sojka, MR Group: PV211: Learning to rank 1/46 Zone scoring Machine-learned scoring Ranking SVMs Q Zone scoring Q Mach ine-learned scoring Q Ranking SVMs Sojka, MR Group: PV211: Learning to rank 2/46 Zone scoring Machine-learned scoring Ranking SVMs o Basic idea of learning to rank (LTR): We use machine learning to learn the relevance score (retrieval status value) of a document with respect to a query. 
• Zone scoring: a particularly simple instance of LTR • Machine-learned scoring as a general approach to ranking • Ranking SVMs Sojka, MR Group: PV211: Learning to rank 3/46 Zone scoring Machine-learned scoring Ranking SVMs o The aim of term weights (e.g., tf-idf) is to measure term salience. o The sum of term weights is a measure of the relevance of a document to a query and the basis for ranking. • Now we view this ranking problem as a machine learning problem - we will learn the weighting or, more generally, the ranking. • Term weights can be learned using training examples that have been judged. o This methodology falls under a general class of approaches known as machine learned relevance or learning to rank. Sojka, MR Group: PV211: Learning to rank 5/46 Zone scoring Machine-learned scoring Ranking SVMs rning weigf Main methodology • Given a set of training examples, each of which is a tuple of: a query q, a document d, a relevance judgment for d on q o Simplest case: R(d, q) is either relevant (1) or non-relevant (0) • More sophisticated cases: graded relevance judgments • Learn weights from these examples, so that the learned scores approximate the relevance judgments in the training examples Sojka, MR Group: PV211: Learning to rank 6/46 Zone scoring Machine-learned scoring Ranking SVMs ry in Is what BIM does a form of learning to rank? Recap BIM: • Estimate classifier of probability of relevance on training • Apply to all documents o Rank documents according to probability of relevance Sojka, MR Group: PV211: Learning to rank 7/46 Zone scoring Machine-learned scoring Ranking SVMs • Both are machine learning approaches o Text classification, BIM and relevance feedback (if solved by text classification) are query-specific. 9 We need a query-specific training set to learn the ranker. We need to learn a new ranker for each query. • Learning to rank usually refers to query-independent ranking. • We learn a single classifier. • We can then rank documents for a query that we don't have any relevance judgments for. Sojka, MR Group: PV211: Learning to rank 8/46 Zone scoring Machine-learned scoring Ranking SVMs • One approach to learning to rank is to represent each query-document pair as a data point, represented as a vector. • We have two classes. o Class 1: the query is relevant to the document, o Class 2: the query is not relevant to the document. • This is a standard classification problem, except that the data points are query-document pairs (as opposed to documents). o Documents are ranked according to probability of relevance of corresponding document-query pairs. • What features/dimensions would you use to represent a query-document pair? Sojka, MR Group: PV211: Learning to rank 9/46 Zone scoring Machine-learned scoring Ranking SVMs • Given: a collection where documents have three zones (a.k.a. fields): author, title, body • Weighted zone scoring requires a separate weight for each zone, e.g. glt g2, g3 • Not all zones are equally important: e.g. author < title < body gi — 0.2, g2 = 0.3, g3 = 0.5 (so that they add up to 1) • Score for a zone = 1 if the query term occurs in that zone, 0 otherwise (Boolean) Example Query term appears in title and body only Document score: (0.3 • 1) + (0.5 • 1) = 0.8. 
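A small sketch of this weighted zone score (score = Σ_i g_i s_i with Boolean per-zone matches). The zone names and weights are the ones from the example above; the helper function and the toy document are hypothetical:

def weighted_zone_score(query_terms, doc_zones, weights):
    # doc_zones maps zone name -> text; s_i = 1 if any query term occurs in that zone.
    score = 0.0
    for zone, g in weights.items():
        s = 1 if any(t in doc_zones.get(zone, "").split() for t in query_terms) else 0
        score += g * s
    return score

weights = {"author": 0.2, "title": 0.3, "body": 0.5}   # g_1 + g_2 + g_3 = 1
doc = {"author": "j smith", "title": "linux kernel tuning", "body": "how to tune the linux kernel"}
print(weighted_zone_score(["linux"], doc, weights))    # 0.3 + 0.5 = 0.8 (title and body only)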
Sojka, MR Group: PV211: Learning to rank 10 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Weighted zone scoring
Given query q and document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0,1] by computing a linear combination of document zone scores, where each zone contributes a value.
• Consider a set of documents, each of which has l zones.
• Let g_1, ..., g_l ∈ [0,1], such that Σ_{i=1}^{l} g_i = 1.
• For 1 ≤ i ≤ l, let s_i be the Boolean score denoting a match (or non-match) between q and the i-th zone: s_i = 1 if a query term occurs in zone i, 0 otherwise.
Weighted zone score, a.k.a. ranked Boolean retrieval: rank documents according to Σ_{i=1}^{l} g_i s_i.
Sojka, MR Group: PV211: Learning to rank 11 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Learning weights
• Weighted zone scoring may be viewed as learning a linear function of the Boolean match scores contributed by the various zones.
• No free lunch: labor-intensive assembly of user-generated relevance judgments from which to learn the weights, especially in a dynamic collection (such as the Web).
• Major search engines put considerable resources into creating large training sets for learning to rank.
• Good news: once you have a large enough training set, the problem of learning the weights g_i reduces to a simple optimization problem.
Sojka, MR Group: PV211: Learning to rank 12 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Learning weights: simple case
• Let documents have two zones: title, body.
• The weighted zone scoring formula we saw before: score(d, q) = Σ_{i=1}^{l} g_i s_i(d, q)
• Given q, d: s_T(d, q) = 1 if a query term occurs in the title, 0 otherwise; s_B(d, q) = 1 if a query term occurs in the body, 0 otherwise.
• We compute a score between 0 and 1 for each (d, q) pair using s_T(d, q) and s_B(d, q) and a constant g ∈ [0,1]:
score(d, q) = g · s_T(d, q) + (1 - g) · s_B(d, q)
Sojka, MR Group: PV211: Learning to rank 13 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Learning weights: determine g from training examples
Example
Φ_j   DocID  Query    s_T  s_B  Judgment
Φ1    37     linux    1    1    Relevant
Φ2    37     penguin  0    1    Nonrelevant
Φ3    238    system   0    1    Relevant
Φ4    238    penguin  0    0    Nonrelevant
Φ5    1741   kernel   1    1    Relevant
Φ6    2094   driver   0    1    Relevant
Φ7    3194   driver   1    0    Nonrelevant
• Training examples: triples of the form Φ_j = (d_j, q_j, r(d_j, q_j))
• A given training document d_j and a given training query q_j are assessed by a human who decides r(d_j, q_j) (either relevant or nonrelevant).
Sojka, MR Group: PV211: Learning to rank 14 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Learning weights: determine g from training examples
• For each training example Φ_j (see the table above) we have Boolean values s_T(d_j, q_j) and s_B(d_j, q_j) that we use to compute a score:
score(d_j, q_j) = g · s_T(d_j, q_j) + (1 - g) · s_B(d_j, q_j)
Sojka, MR Group: PV211: Learning to rank 15 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Learning weights
• We compare this score score(d_j, q_j) with the human relevance judgment for the same document-query pair (d_j, q_j).
• We define the error of the scoring function with weight g as ε(g, Φ_j) = (r(d_j, q_j) - score(d_j, q_j))², where r(d_j, q_j) is the human relevance judgment (1 if relevant, 0 if nonrelevant).
Example (training set as above)
Φ_j   DocID  Query    s_T  s_B  Judgment
Φ1    37     linux    1    1    Relevant
Φ2    37     penguin  0    1    Nonrelevant
Φ3    238    system   0    1    Relevant
Φ4    238    penguin  0    0    Nonrelevant
Φ5    1741   kernel   1    1    Relevant
Φ6    2094   driver   0    1    Relevant
Φ7    3194   driver   1    0    Nonrelevant
• Compute score: score(d_j, q_j) = g · s_T(d_j, q_j) + (1 - g) · s_B(d_j, q_j)
• Compute total error: Σ_j ε(g, Φ_j), where ε(g, Φ_j) = (r(d_j, q_j) - score(d_j, q_j))²
• Pick the value of g that minimizes the total error
Sojka, MR Group: PV211: Learning to rank 17 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Minimizing the error ε: Example
Compute the score score(d_j, q_j):
score(d_1, q_1) = g · 1 + (1 - g) · 1 = g + 1 - g = 1
score(d_2, q_2) = g · 0 + (1 - g) · 1 = 0 + 1 - g = 1 - g
score(d_3, q_3) = g · 0 + (1 - g) · 1 = 1 - g
score(d_4, q_4) = g · 0 + (1 - g) · 0 = 0
score(d_5, q_5) = g · 1 + (1 - g) · 1 = 1
score(d_6, q_6) = g · 0 + (1 - g) · 1 = 1 - g
score(d_7, q_7) = g · 1 + (1 - g) · 0 = g
Compute the total error Σ_j ε(g, Φ_j):
(1-1)² + (0-(1-g))² + (1-(1-g))² + (0-0)² + (1-1)² + (1-(1-g))² + (0-g)² = 0 + (g-1)² + g² + 0 + 0 + g² + g² = 1 - 2g + 4g²
Pick the value of g that minimizes the total error: setting the derivative to 0 gives the minimum g = 1/4.
Sojka, MR Group: PV211: Learning to rank 18 / 46
Zone scoring Machine-learned scoring Ranking SVMs
• The n_... are the counts of rows of the training set table with the corresponding properties:
n_10r: s_T = 1, s_B = 0, document relevant
n_10n: s_T = 1, s_B = 0, document nonrelevant
n_01r: s_T = 0, s_B = 1, document relevant
n_01n: s_T = 0, s_B = 1, document nonrelevant
• The total error is minimized for g = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n).
• Derivation: see book
• Note that we ignore documents that have a 0 match score for both zones or a 1 match score for both zones: the value of g does not change their final score.
Sojka, MR Group: PV211: Learning to rank 19 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Compute g that minimizes the error
Φ_j   DocID  Query    s_T  s_B  Judgment
Φ1    37     linux    0    0    Relevant
Φ2    37     penguin  1    1    Nonrelevant
Φ3    238    system   1    0    Relevant
Φ4    238    penguin  1    1    Nonrelevant
Φ5    238    redmond  0    1    Nonrelevant
Φ6    1741   kernel   0    0    Relevant
Φ7    2094   driver   1    0    Relevant
Φ8    3194   driver   0    1    Nonrelevant
Φ9    3194   redmond  0    0    Nonrelevant
Sojka, MR Group: PV211: Learning to rank 20 / 46
• So far, we have considered a case where we combined match scores (Boolean indicators of relevance).
• Now consider more general factors that go beyond Boolean functions of query term presence in document zones.
Sojka, MR Group: PV211: Learning to rank 22 / 46
Zone scoring Machine-learned scoring Ranking SVMs
Two examples of features
• The vector space cosine similarity between query and document (denoted α)
• The minimum window width within which the query terms lie (denoted ω)
• Query term proximity is often indicative of topical relevance.
• Thus, we have one feature that captures overall query-document similarity and one feature that captures the proximity of the query terms in the document.
Sojka, MR Group: PV211: Learning to rank 23 / 46
Example
Φ_j   DocID  Query             α      ω  Judgment
Φ1    37     linux             0.032  3  relevant
Φ2    37     penguin           0.02   4  nonrelevant
Φ3    238    operating system  0.043  2  relevant
Φ4    238    runtime           0.004  2  nonrelevant
Φ5    1741   kernel layer      0.022  3  relevant
Φ6    2094   device driver     0.03   2  relevant
Φ7    3191   device driver     0.027  5  nonrelevant
α is the cosine score, ω is the window width. This is exactly the same setup as for zone scoring, except we now have more complex features that capture whether a document is relevant to a query.
Sojka, MR Group: PV211: Learning to rank 24 / 46
Zone scoring Machine-learned scoring Ranking SVMs
[Figure: the training examples plotted in the plane, with term proximity ω on the x-axis and cosine score α on the y-axis.] This should look familiar.
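Returning to the zone-scoring example for a moment: the optimization over g is easy to check numerically. The sketch below (illustrative, assuming NumPy) evaluates the total squared error on the 7-example training set above over a grid of g values and also applies the count-based formula:

import numpy as np

# (s_T, s_B, r) for the training examples Phi_1 ... Phi_7 above
examples = [(1, 1, 1), (0, 1, 0), (0, 1, 1), (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]

def total_error(g):
    return sum((r - (g * st + (1 - g) * sb)) ** 2 for st, sb, r in examples)

gs = np.linspace(0, 1, 1001)
best = gs[np.argmin([total_error(g) for g in gs])]
print(best, total_error(best))      # ~0.25 and 0.75: 1 - 2g + 4g^2 is minimal at g = 1/4

# Count-based solution: g = (n10r + n01n) / (n10r + n10n + n01r + n01n)
n10r = sum(1 for st, sb, r in examples if st == 1 and sb == 0 and r == 1)
n10n = sum(1 for st, sb, r in examples if st == 1 and sb == 0 and r == 0)
n01r = sum(1 for st, sb, r in examples if st == 0 and sb == 1 and r == 1)
n01n = sum(1 for st, sb, r in examples if st == 0 and sb == 1 and r == 0)
print((n10r + n01n) / (n10r + n10n + n01r + n01n))   # 0.25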
Sojka, MR Group: PV211: Learning to rank 25 / 46 Zone scoring Machine-learned scoring Ranking SVMs ppr iTier in A linear classifier in 2D is a line described by the equation w\d\ + w2d2 = 6 Example for a 2D linear classifier Points (d\ d2) with w\d\ + w2d2 > 6 are in the class c. Points (d\ d2) with w\d\ + w2d2 < 9 are in the complement class c. Sojka, MR Group: PV211: Learning to rank 26 / 46 Zone scoring Machine-learned scoring Ranking SVMs up • Again, two classes: relevant = 1 and nonrelevant = 0 • We now seek a scoring function that combines the values of the features to generate a value that is (close to) 0 or 1. • We wish this function to be in agreement with our set of training examples as much as possible. • A linear classifier is defined by an equation of the form: Score(d, q) — Score{a,u) — aa + bu + c, where we learn the coefficients a, b, c from training data • Regression vs. classification o We have only covered binary classification so far. • We can also cast the problem as a regression problem. • This is what we did for zone scoring just now. Sojka, MR Group: PV211: Learning to rank 27 / 46 Zone scoring Machine-learned scoring Ranking SVMs ric mterpr ppenmg The function Score(a,uo) represents a plane "hanging above" the figure. Ideally this plane assumes values close to 1 above the points marked R, and values close to 0 above the points marked N. 0.05 Si o o 9, we declare the document relevant, otherwise we declare it nonrelevant. • As before, all points that satisfy Score(a, uj) — 9 form a line (dashed here) —± linear classifier that separates relevant from nonrelevant instances. 0.05 Si o o (dj, dj, q) the class y-,jq = +1; otherwise —1. • This gives us a training set of pairs of vectors and "precedence indicators". Each of the vectors is computed as the difference of two document-query vectors. • We can then train an SVM on this training set with the goal of obtaining a classifier that returns wT(di, dj, q) > 0 iff di -< dj Sojka, MR Group: PV211: Learning to rank 38 / 46 Zone scoring Machine-learned scoring Ranking SVMs vamag • Documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness. • This often is an easier problem to solve since just a ranking is required rather than an absolute measure of relevance. • Especially germane in web search, where the ranking at the very top of the results list is exceedingly important. Sojka, MR Group: PV211: Learning to rank 39 / 46 Zone scoring Machine-learned scoring Ranking SVMs y simple rariKing • Ranking SVMs treat all ranking violations alike. • But some violations are minor problems, e.g., getting the order of two relevant documents wrong. • Other violations are big problems, e.g., ranking a nonrelevant document ahead of a relevant document. • Some queries have many relevant documents, others few. • Depending on the training regime, too much emphasis may be put on queries with many relevant documents. • In most IR settings, getting the order of the top documents right is key. • In the simple setting we have described, top and bottom ranks will not be treated differently. o —>► Learning-to-rank frameworks actually used in IR are more complicated than what we have presented here. Sojka, MR Group: PV211: Learning to rank 40 / 46 Zone scoring Machine-learned scoring Ranking SVMs »rm SVM algorithm that directly optimizes MAP (as opposed to ranking). Proposed by: Yue, Finley, Radlinski, Joachims, ACM SIGIR 2002. 
Performance compared to state-of-the-art models: cosine, tf-idf, BM25, language models (Dirichlet and Jelinek-Mercer) Model TREC 9 MAP W/L TREC 10 MAP W/L SVM^ Best Func. 2nd Best 3rd Best 0.242 0.204 39/11 ** 0.199 38/12 ** 0.188 34/16 ** 0.236 0.181 37/13 ** 0.174 43/7 ** 0.174 38/12 ** Learning-to-rank clearly better than non-machine-learning approaches Sojka, MR Group: PV211: Learning to rank 41 / 46 Zone scoring Machine-learned scoring Ranking SVMs ing/repr • Both of the methods that we've seen treat the features as given and do not attempt to modify the basic representation of the document-query pairs. • Much of traditional IR weighting involves nonlinear scaling of basic measurements (such as log-weighting of term frequency, or idf). • At the present time, machine learning is very good at producing optimal weights for features in a linear combination, but it is not good at coming up with good nonlinear scalings of basic measurements. • This area remains the domain of human feature engineering. Sojka, MR Group: PV211: Learning to rank 42 / 46 Zone scoring Machine-learned scoring Ranking SVMs • The idea of learning to rank is old. o Early work by Norbert Fuhr and William S. Cooper • But it is only very recently that sufficient machine learning knowledge, training document collections, and computational power have come together to make this method practical and exciting. • While skilled humans can do a very good job at defining ranking functions by hand, hand tuning is difficult, and it has to be done again for each new document collection and class of users. o The more features are used in ranking, the more difficult it is to manually integrate them into one ranking function. • Web search engines use a large number of features —>► web search engines need some form of learning to rank. Sojka, MR Group: PV211: Learning to rank 43 / 46 Zone scoring Machine-learned scoring Ranking SVMs Write down the training set from the last exercise as a training set for a ranking SVM. Sojka, MR Group: PV211: Learning to rank 44 / 46 Zone scoring Machine-learned scoring Ranking SVMs o Basic idea of learning to rank (LTR): We use machine learning to learn the relevance score (retrieval status value) of a document with respect to a query. • Zone scoring: a particularly simple instance of LTR • Machine-learned scoring as a general approach to ranking • Ranking SVMs Sojka, MR Group: PV211: Learning to rank 45 / 46 Zone scoring Machine-learned scoring Ranking SVMs • Chapter 15-2 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o References to ranking SVM results • Microsoft learning to rank datasets Sojka, MR Group: PV211: Learning to rank 46 / 46 SVM intro SVM details Classification in the real world PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 15-1: Support Vector Machines Handout version Petr Sojka, Hinrich Schütze et al. 
Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-03-17 Sojka, MR Group: PV211: Support Vector Machines 1/38 SVM intro SVM details Classification in the real world Q SVM intro O SVM details Q Classification in the real world Sojka, MR Group: PV211: Support Vector Machines 2/38 SVM intro SVM details Classification in the real world • Support vector machines: State-of-the-art text classification methods (linear and nonlinear) o Introduction to SVMs • Formalization • Soft margin case for nonseparable problems • Discussion: Which classifier should I use for my problem? Sojka, MR Group: PV211: Support Vector Machines 3/38 SVM intro SVM details Classification in the real world uppori vector m • Machine-learning research in the last two decades has improved classifier effectiveness. 9 New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, maximum entropy, neural networks, and random forests o As we saw in MR: Applications to IR problems, particularly text classification Sojka, MR Group: PV211: Support Vector Machines 5/38 SVM intro SVM details Classification in the real world upp • Vector space classification (similar to Rocchio, kNN, linear classifiers) • Difference from previous methods: large margin classifier • We aim to find a separating hyperplane (decision boundary) that is maximally far from any point in the training data • In case of non-linear-separability: We may have to discount some points as outliers or noise. Sojka, MR Group: PV211: Support Vector Machines 6/38 SVM intro SVM details Classification in the real world • binary classification problem • Decision boundary is linear separator. 9 criterion: being maximally far away from any data point —>► determines classifier margin • Vectors on margin lines are called support vectors o Set of support vectors are a complete specification of classifier Maximum margin decision hyperplane Support vectors \ \ Margin is maximized Sojka, MR Group: PV211: Support Vector Machines 7/38 SVM intro SVM details Classification in the real world y maximiz Points near the decision surface are uncertain classification decisions. A classifier with a large margin makes no low certainty classification decisions (on the training set). Gives classification safety margin with respect to errors and random variation Maximum margin decision hyperplane Ns^k Support vectors \ Margin is maximized Sojka, MR Group: PV211: Support Vector Machines 8/38 SVM intro SVM details Classification in the real world y maximiz o • SVM classification = large margin around decision boundary • We can think of the margin as a "fat separator" - a fatter version of our regular decision hyperplane. • unique solution • decreased memory capacity • increased ability to correctly generalize to test data Sojka, MR Group: PV211: Support Vector Machines 9/38 SVM intro SVM details Classification in the real world parating nyperpian Hyperplane An n-dimensional generalization of a plane (point in 1-D space, line in 2-D space, ordinary plane in 3-D space). Decision hyperplane Can be defined by: o intercept term b (we were calling this 9 before) o normal vector w (weight vector) which is perpendicular to the hyperplane All points x on the hyperplane satisfy: wJx + b = 0 Sojka, MR Group: PV211: Support Vector Machines 10 / 38 SVM intro SVM details Classification in the real world Draw the maximum margin separator. 
Which vectors are the support vectors? Coordinates of dots: (3,3), (-1,1). Coordinates of triangle: (3,0) Sojka, MR Group: PV211: Support Vector Machines 11 / 38 Training set Consider a binary classification problem: o ic; are the input vectors • y; are the labels For SVMs, the two classes are y; = +1 and y; = —1. The linear c assifier is then: f{x) = = sign(ivTx + b) A value of - -1 indicates one class, and a value of +1 the other class. Sojka, MR Group: PV211: Support Vector Machines 13 / 38 SVM intro SVM details Classification in the real world -unc ciona i ma rgin OT a SVM makes its decision based on the score wTx + b. Clearly, the larger |ivTx + b\ is, the more confidence we can have that the decision is correct. Functional margin o The functional margin of the vector x\ w.r.t the hyperplane (w, b) is: yi(wTXj + b) o The functional margin of a data set w.r.t a decision surface is twice the functional margin of any of the points in the data set with minimal functional margin o Factor 2 comes from measuring across the whole width of the margin. Problem: We can increase functional margin by scaling w and b. —> We need to place some constraint on the size of w. Sojka, MR Group: PV211: Support Vector Machines 14 / 38 SVM intro SVM details Classification in the real world metric margii Geometric margin of the classifier: maximum width of the band that can be drawn separating the support vectors of the two classes. To compute the geometric margin, we need to compute the distance of a vector x from the hyperplane: w x + b (why? we will see that this is so graphically in a few moments) Distance is of course invariant to scaling: if we replace w by 5w and b by 5b, then the distance is the same because it is normalized by the length of w. Sojka, MR Group: PV211: Support Vector Machines 15 / 38 SVM intro SVM details Classification in the real world Assume canonical "functional margin" distance Assume that every data point has at least distance 1 from the hyperplane, then: y/(tfTx/ + b)>l Since each example's distance from the hyperplane is H — y/(wTx; + b)/|iv|, the margin is p — 2/\w\. We want to maximize this margin. 
That is, we want to find w and b such that:
• For all (x_i, y_i) ∈ D, y_i(w^T x_i + b) ≥ 1
• ρ = 2/|w| is maximized
Sojka, MR Group: PV211: Support Vector Machines 16 / 38
SVM intro SVM details Classification in the real world
[Figure: the maximum margin decision hyperplane with the support vectors marked in red and the margin maximized. The three parallel lines are 0.5x + 0.5y - 2 = 1 (w^T x + b = 1), 0.5x + 0.5y - 2 = 0 (w^T x + b = 0), and 0.5x + 0.5y - 2 = -1 (w^T x + b = -1); one support vector x is labeled.]
Sojka, MR Group: PV211: Support Vector Machines 17 / 38
SVM intro SVM details Classification in the real world
Let w' be the point on the hyperplane closest to the origin (it lies along the direction of w). Then w^T w' + b = 0, so b = -w^T w' and |w'| = -b/|w|.
Distance of a support vector x from the separator = (length of the projection of x onto w) minus (length of w')
= w^T x / |w| - (-b)/|w|
= w^T x / |w| + b/|w|
= (w^T x + b) / |w|
Sojka, MR Group: PV211: Support Vector Machines 18 / 38
SVM intro SVM details Classification in the real world
Distance of the support vector x = (1, 5)^T from the separator = (length of the projection of x onto w) minus (length of w'):
w^T x / |w| - (-b)/|w| = (1 · 0.5 + 5 · 0.5)/(1/√2) - 2/(1/√2) = 3/(1/√2) - 2/(1/√2)
= w^T x / |w| + b/|w| = 3/(1/√2) + (-2)/(1/√2) = (3 - 2)/(1/√2) = √2
Sojka, MR Group: PV211: Support Vector Machines 19 / 38
SVM intro SVM details Classification in the real world
Maximizing 2/|w| is the same as minimizing |w|/2. This gives the final standard formulation of an SVM as a minimization problem:
Find w and b such that:
• (1/2) w^T w is minimized (because |w| = √(w^T w)), and
• for all {(x_i, y_i)}, y_i(w^T x_i + b) ≥ 1
We are now optimizing a quadratic function subject to linear constraints. Quadratic optimization problems are standard mathematical optimization problems, and many algorithms exist for solving them (e.g., Quadratic Programming libraries).
Sojka, MR Group: PV211: Support Vector Machines 20 / 38
• We start with a training set.
• The data set defines the maximum-margin separating hyperplane (if it is separable).
• We use quadratic optimization to find this plane.
• Given a new point x to classify, the classification function f(x) computes the functional margin of the point (= normalized distance).
• The sign of this function determines the class to assign to the point.
• If the point is within the margin of the classifier, the classifier can return "don't know" rather than one of the two classes.
• The value of f(x) may also be transformed into a probability of classification.
Sojka, MR Group: PV211: Support Vector Machines 21 / 38
SVM intro SVM details Classification in the real world
Exercise
[Figure: a small set of training points from two classes in the plane, axes from 0 to 3.]
Which vectors are the support vectors? Draw the maximum margin separator. What values of w_1, w_2 and b (for w_1 x + w_2 y + b = 0) describe this separator? Recall that we must have w_1 x + w_2 y + b ∈ {1, -1} for the support vectors.
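The distance computation in the example above (w = (0.5, 0.5), b = -2, support vector x = (1, 5)) can be checked numerically with a few lines; this is an illustrative sketch only:

import numpy as np

w = np.array([0.5, 0.5])
b = -2.0
x = np.array([1.0, 5.0])                    # the support vector from the example

print((w @ x + b) / np.linalg.norm(w))      # 1.414... = sqrt(2)
print((w @ x) / np.linalg.norm(w) - (-b) / np.linalg.norm(w))   # same value, computed as on the slide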
So the optimal hyperplane is given by w = (2/5,4/5) and b = -11/5. • The margin p is 2/\w\ = 2/^4/25 + 16/25 = 2/(2^5/5) = V5 = A/(1-2)2+ (l-3)2. 3 -r 2 -- 1 -- 0 0 A + 2 Sojka, MR Group: PV211: Support Vector Machines 24 / 38 What happens if data is not linearly separable? • Standard approach: allow the fat decision margin to make a few mistakes o some points, outliers, noisy examples are inside or on the wrong side of the margin • Pay cost for each misclassified example, depending on how far it is from meeting the margin requirement Slack variable A non-zero value for allows x\ to not meet the margin requirement at a cost proportional to the value of Optimization problem: trading off how fat it can make the margin vs. how many points have to be moved around to allow this margin. The sum of the gives an upper bound on the number of training errors. Soft-margin SVMs minimize training error traded off against margin. Sojka, MR Group: PV211: Support Vector Machines 25 / 38 SVM intro SVM details Classification in the real world r on • Recall how to use binary linear classifiers {k classes) for one-of: train and run k classifiers and then select the class with the highest confidence • Another strategy used with SVMs: build k{k — l)/2 one-versus-one classifiers, and choose the class that is selected by the most classifiers. While this involves building a very large number of classifiers, the time for training classifiers may actually decrease, since the training data set for each classifier is much smaller. • Yet another possibility: structured prediction. Generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them Sojka, MR Group: PV211: Support Vector Machines 26 / 38 SVM intro SVM details Classification in the real world • Many commercial applications • There are many applications of text classification for corporate Intranets, government departments, and Internet publishers. • Often greater performance gains from exploiting domain-specific text features than from changing from one machine learning method to another. • Understanding the data is one of the keys to successful categorization, yet this is an area in which many categorization tool vendors are weak. Sojka, MR Group: PV211: Support Vector Machines 28 / 38 SVM intro SVM details Classification in the real world When building a text classifier, first question: how much training data is there currently available? Practical challenge: creating or obtaining enough training data Hundreds or thousands of examples from each class are required to produce a high performance classifier and many real world contexts involve large sets of categories. • None? • Very little? • Quite a lot? • A huge amount, growing every day? Sojka, MR Group: PV211: Support Vector Machines 29 / 38 SVM intro SVM details Classification in the real world Use hand-written rules! Example IF (wheat OR grain) AND NOT (whole OR bread) THEN c = grain In practice, rules get a lot bigger than this, and can be phrased using more sophisticated query languages than just Boolean expressions, including the use of numeric scores. With careful crafting, the accuracy of such rules can become very high (high 90% precision, high 80% recall). Nevertheless the amount of work to create such well-tuned rules is very large. 
A reasonable estimate is 2 days per class, and extra time has to go into maintenance of rules, as the content of documents in classes drifts over time. Sojka, MR Group: PV211: Support Vector Machines 30 / 38 SVM intro SVM details Classification in the real world mpi comment line top-le\eltop ic topic deinition modifiers ■ subtopi atopic eviden ectopic topicdeinrtion modifier evidencetopic topicdelnition modifier evidencetopic topic delnrtion modifier evidencetopic topic delnition modifier SubtOpC * Beginning of art topic definition art ACCRUE /author = "fsmith" /date = "30-Dec-01" /annotation = "Topic created by fsmith" * 0.70 performing-arts ACCRUE ** 0.50 WORD /wordtext = ballet ** 0.50 STEM /wordtext = dance ** 0.50 WORD /wordtext = opera ** 0.30 WORD /wordtext = symphony * 0.70 visual-arts ACCRUE ** 0.50 WORD /wordtext = painting ** 0.50 WORD /wordtext = sculpture subtopic subtope gjbto p c * 0.70 film ACCRUE ** 0.50 STEM /wordtext = f ilm ** 0.50 motion-picture PHRASE *** 1.00 WORD /wordtext = motion *** l. 00 WORD /wordte>:t - picture ** 0.50 STEM /wardtext = movie * 0 . 50 video ACCRUE ** 0.50 STEM /wordtext = video ** 0.50 STEM /wordtext = vcr * End of art topic Sojka, MR Group: PV211: Support Vector Machines 31 / 38 SVM intro SVM details Classification in the real world Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company Query: "trade secret" /s disclos! /s prevent /s employe! Information need: Requirements for disabled people to be able to access a workplace Query: disab! /p access! /s work-site work-place (employment /3 place) Information need: Cases about a host's responsibility for drunk guests Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest Sojka, MR Group: PV211: Support Vector Machines 32 / 38 SVM intro SVM details Classification in the real world Work out how to get more labeled data as quickly as you can. • Best way: insert yourself into a process where humans will be willing to label data for you as part of their natural tasks. Example Often humans will sort or route email for their own purposes, and these actions give information about classes. Active Learning A system is built which decides which documents a human should label. Usually these are the ones on which a classifier is uncertain of the correct classification. Sojka, MR Group: PV211: Support Vector Machines 33 / 38 SVM intro SVM details Classification in the real world Good amount of labeled data, but not huge Use everything that we have presented about text classification. Consider hybrid approach (overlay Boolean classifier) Huge amount of labeled data Choice of classifier probably has little effect on your results. Choose classifier based on the scalability of training or runtime efficiency. Rule of thumb: each doubling of the training data size produces a linear increase in classifier performance, but with very large amounts of data, the improvement becomes sub-linear. Sojka, MR Group: PV211: Support Vector Machines 34 / 38 SVM intro SVM details Classification in the real world gory taxonomi If you have a small number of well-separated categories, then many classification algorithms are likely to work well. But often: very large number of very similar categories. Example Web directories (e.g. the Yahoo! 
Directory consists of over 200,000 categories or the Open Directory Project), library classification schemes (Dewey Decimal or Library of Congress), the classification schemes used in legal or medical applications. Accurate classification over large sets of closely related classes is inherently difficult. - No general high-accuracy solution. Sojka, MR Group: PV211: Support Vector Machines 35 / 38 SVM intro SVM details Classification in the real world • Is there a learning method that is optimal for all text classification problems? • No, because there is a trade-off between bias and variance. • Factors to take into account: o How much training data is available? o How simple/complex is the problem? (linear vs. nonlinear decision boundary) o How noisy is the problem? • How stable is the problem over time? o For an unstable problem, it's better to use a simple and robust classifier. Sojka, MR Group: PV211: Support Vector Machines 36 / 38 SVM intro SVM details Classification in the real world • Support vector machines: State-of-the-art text classification methods (linear and nonlinear) o Introduction to SVMs • Formalization • Soft margin case for nonseparable problems • Discussion: Which classifier should I use for my problem? Sojka, MR Group: PV211: Support Vector Machines 37 / 38 SVM intro SVM details Classification in the real world • Chapter 14 of MR (basic vector space classification) • Chapter 15-1 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library • Discussion of "how to select the right classifier for my problem" in Russell and Norvig Sojka, MR Group: PV211: Support Vector Machines 38 / 38 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 16: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-04-18 Sojka, MR Group: PV211: Flat Clustering 1/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? verview Q Recap Q Clustering: Introduction Q Clustering in IR -means Q Evaluation How many clusters? Sojka, MR Group: PV211: Flat Clustering 2/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? C ■ i 1 ■ bupport vector machines • Binary classification problem • Simple SVMs are linear classifiers. • criterion: being maximally far away from any data point —>> determines classifier margin • linear separator position defined by support vectors Maximum margin decision hyperplane Support vectors \ \ Margin is maximized Sojka, MR Group: PV211: Flat Clustering 4/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? Optimization problem solved by SVMs Find w and b such that: • \wJw is minimized (because |w o for all {(x/,y/)}, y;(wTx; + b) > 1 = V w1 w), and Sojka, MR Group: PV211: Flat Clustering 5/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? icn macmne learning metnoa to cnoose. • Is there a learning method that is optimal for all text classification problems? • No, because there is a tradeoff, a dilemma between bias and variance. • Factors to take into account: • How much training data is available? o How simple/complex is the problem? (linear vs. nonlinear decision boundary) o How noisy is the problem? 
• How stable is the problem over time? o For an unstable problem, it's better to use a simple and robust classifier. See Fig. 15 in Geman et al. Sojka, MR Group: PV211: Flat Clustering 6/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? aKe-away to a ay • What is clustering? • Applications of clustering in information retrieval • K-means algorithm • Evaluation of clustering • How many clusters? Sojka, MR Group: PV211: Flat Clustering 7/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? ustering: u ennition • (Document) clustering is the process of grouping a set of documents into clusters of similar documents. • Documents within a cluster should be similar. o Documents from different clusters should be dissimilar. • Clustering is the most common form of unsupervised learning. • Unsupervised = there are no labeled or annotated data. □ 9 Hard clustering vs. soft clustering. • Cardinality of clustering. Sojka, MR Group: PV211: Flat Clustering 9/83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? ata set wi cluster structure LO O LO LO Ö o Ö cV o ^ 0 o Oco 0 o o o o o o o oo° O oo O o ^8 o & ° 2 o o o 8 o o o 000 Propose algorithm for finding the cluster structure in this example 0.0 0.5 1.0 1.5 2.0 Sojka, MR Group: PV211: Flat Clustering 10 / 83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? assmcation vs. usterm • Classification: supervised learning • Clustering: unsupervised learning • Classification: Classes are human-defined and part of the input to the learning algorithm. • Clustering: Clusters are inferred from the data without human input. • However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . .. Sojka, MR Group: PV211: Flat Clustering 11 / 83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? e cluster hypothesis Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs. All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis. Van Rijsbergen's original wording (1979): "closely associated documents tend to be relevant to the same requests". Sojka, MR Group: PV211: Flat Clustering 13 / 83 Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters? a 1" j_" £ 1 j_ 1 R Applications ot clustering in 1 application what is clustered? benefit search result clustering search more effective infor- results mation presentation to user Scatter-Gather (subsets of) alternative user inter- collection face: "search without typing" collection clustering collection effective information presentation for ex- ploratory browsing cluster-based retrieval collection higher efficiency: faster search Sojka, MR Group: PV211: Flat Clustering 14 / 83 bearch result clustering Tor better navigation my clusters? Yy Vivisirmo Clustered Results jaguar (SDfl) ©■> Carsf?^ ^> Club {34} '?>> Catf£3) 0■> Animal {13) 0■> Restoration <10) © » Mac OS X (5) ©■> Jaguar Model fflj 0> Request^) 9> Mark Webbers !"'Maya {5) Find fn clustery Enter Keywords aguar the Web Advanced Search + Help Top 205 results of at least 20,373.974 retrieved for the query Jaguar (Details) 1. 
[Figure (continued): the individual web results returned for the query "jaguar" (jag-lovers.org, Jaguar Cars / jaguarcars.com, jaguar.com, Apple Mac OS X), each annotated with the search engines that returned it.]
Sojka, MR Group: PV211: Flat Clustering 15 / 83
Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters?
Scatter-Gather
[Figure: the Scatter-Gather interface, showing successive clusterings of a document collection with cluster labels such as Education, Domestic, Iraq, Arts, Sports, Oil, Germany, Legal; Deployment, Politics, Germany, Pakistan, Africa, Markets, Oil, Hostages; Trinidad, W. Africa, S. Africa, Security, International, Lebanon, Pakistan, Japan.]
Sojka, MR Group: PV211: Flat Clustering 16 / 83
Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters?
Global navigation: Yahoo
[Figure: the Yahoo! Directory page for Society and Culture, with subcategories such as Crime, Cultures and Groups, Environment and Nature, Families, Food and Drink, Holidays and Observances, Issues and Causes, Mythology and Folklore, People, Relationships, Religion and Spirituality, Sexuality, Weddings.]
Sojka, MR Group: PV211: Flat Clustering 17 / 83
Recap Clustering: Introduction Clustering in IR K-means Evaluation How many clusters?
Global navigation: MeSH
[Figure: MeSH Tree Structures 2008, top-level categories Anatomy [A], Organisms [B], Diseases [C] (with subcategories such as Bacterial Infections and Mycoses [C01], Virus Diseases [C02], Neoplasms [C04], ..., Pathological Conditions, Signs and Symptoms [C23]), Chemicals and Drugs [D], Analytical, Diagnostic and Therapeutic Techniques and Equipment [E], Psychiatry and Psychology [F], Biological Sciences [G], Natural Sciences [H], Anthropology, Education, Sociology and Social Phenomena [I], Technology, Industry, Agriculture [J], Humanities [K].]
Sojka, MR Group: PV211: Flat Clustering 18 / 83
Global navigation: MeSH (lower level)
[Screenshot: MeSH subtree under Neoplasms [C04] - Cysts, Hamartoma, Neoplasms by Histologic Type (Leukemia, Lymphoma, ...), Neoplasms by Site, Neoplasms, Experimental, ..., Tumor Virus Infections]

Navigational hierarchies: Manual vs. automatic creation
• Note: Yahoo/MeSH are not examples of clustering.
• But they are well-known examples of using a global hierarchy for navigation.
• Some examples of global navigation/exploration based on clustering:
  • Arxiv's LDAExplore: https://arxiv.lateral.io/
  • Cartia
  • Themescapes
  • Google News
[Two figure-only slides: screenshots of clustering-based navigation interfaces]

Clustering for navigation: Google News
http://news.google.com

Clustering for improving search recall
• To improve search recall:
  • Cluster docs in the collection a priori.
  • When a query matches a doc d, also return other docs in the cluster containing d.
• Hope: if we do this, the query "car" will also return docs containing "automobile",
  • because the clustering algorithm groups together docs containing "car" with those containing "automobile":
  • both types of documents contain words like "parts", "dealer", "mercedes", "road trip".
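The recall-oriented use of clustering just described can be sketched in a few lines of Python. This is an added illustrative sketch, not part of the original slides; the tiny collection, the precomputed cluster assignment, and the helper names (docs, cluster_of, matches) are all invented for the example.

    # Hypothetical sketch of cluster-based result expansion for higher recall.
    docs = {
        1: "car parts dealer road trip",
        2: "automobile dealer mercedes parts",
        3: "tree wood forest",
    }
    cluster_of = {1: 0, 2: 0, 3: 1}   # doc id -> cluster id, computed offline (e.g. by K-means)

    def matches(query, doc_id):
        # Crude Boolean matching: every query term must occur in the document.
        return all(t in docs[doc_id].split() for t in query.split())

    def search_with_cluster_expansion(query):
        hits = {d for d in docs if matches(query, d)}
        # Also return every document that shares a cluster with some hit.
        expanded = {d for d in docs if any(cluster_of[d] == cluster_of[h] for h in hits)}
        return hits, expanded

    print(search_with_cluster_expansion("car"))   # ({1}, {1, 2}): doc 2 is added via its cluster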
A data set with clear cluster structure
[Figure repeated: scatter plot of points forming well-separated groups; x-axis 0.0-2.0]
Exercise: Propose an algorithm for finding the cluster structure in this example.

Desiderata for clustering
• General goal: put related docs in the same cluster, put unrelated docs in different clusters.
  • We'll see different ways of formalizing this.
• The number of clusters should be appropriate for the data set we are clustering.
  • Initially, we will assume the number of clusters K is given.
  • Later: semiautomatic methods for determining K
• Secondary goals in clustering:
  • Avoid very small and very large clusters.
  • Define clusters that are easy to explain to the user.
  • Many others ...

Flat vs. hierarchical clustering
Flat algorithms
• Usually start with a random (partial) partitioning of docs into groups
• Refine iteratively
• Main algorithm: K-means
Hierarchical algorithms
• Create a hierarchy
• Bottom-up, agglomerative
• Top-down, divisive

Hard vs. soft clustering
• Hard clustering: Each document belongs to exactly one cluster.
  • More common and easier to do
• Soft clustering: A document can belong to more than one cluster.
  • Makes more sense for applications like creating browsable hierarchies.
  • You may want to put sneakers in two clusters: sports apparel and shoes.
  • You can only do that with a soft clustering approach.
• This class: flat, hard clustering; next: hierarchical, hard clustering; then: latent semantic indexing, a form of soft clustering.
• We won't have time for soft clustering. See MR 16.5, MR 18.
• Non-exhaustive clustering: some docs are not assigned to any cluster. See references in MR 16.

Flat algorithms
• Flat algorithms compute a partition of N documents into a set of K clusters.
• Given: a set of documents and the number K
• Find: a partition into K clusters that optimizes the chosen partitioning criterion
• Global optimization: exhaustively enumerate partitions, pick the optimal one
  • Not tractable
• Effective heuristic method: K-means algorithm

K-means
• Perhaps the best known clustering algorithm
• Simple, works well in many cases
• Use as default / baseline for clustering documents

Document representations in clustering
• Vector space model
• As in vector space classification, we measure relatedness between vectors by Euclidean distance ...
• ... which is almost equivalent to cosine similarity.
• Almost: centroids are not length-normalized.

K-means
• Each cluster in K-means is defined by a centroid.
• Objective/partitioning criterion: minimize the average squared difference from the centroid.
• Recall the definition of the centroid:
  \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
  where we use \omega to denote a cluster.
• We try to find the minimum average squared difference by iterating two steps:
  • reassignment: assign each vector to its closest centroid
  • recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

K-means pseudocode (μ_k is the centroid of ω_k)
K-means({x_1, ..., x_N}, K)
 1  (s_1, s_2, ..., s_K) <- SelectRandomSeeds({x_1, ..., x_N}, K)
 2  for k <- 1 to K
 3  do μ_k <- s_k
 4  while stopping criterion has not been met
 5  do for k <- 1 to K
 6     do ω_k <- {}
 7     for n <- 1 to N
 8     do j <- arg min_j' |μ_j' - x_n|
 9        ω_j <- ω_j ∪ {x_n}   (reassignment of vectors)
10     for k <- 1 to K
11     do μ_k <- (1/|ω_k|) Σ_{x ∈ ω_k} x   (recomputation of centroids)
12  return {μ_1, ..., μ_K}
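The pseudocode above translates almost line by line into Python. The following is an added illustrative sketch, not the course's reference implementation; it uses numpy, random seed selection, and "assignment unchanged" as the stopping criterion, and the toy data at the end is made up.

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        # X: (N, M) array of vectors to cluster; returns (centroids, assignment).
        rng = np.random.default_rng(seed)
        # SelectRandomSeeds: use K distinct vectors as initial centroids (lines 1-3).
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        assignment = None
        for _ in range(max_iter):               # stopping criterion: fixed point or max_iter
            # Reassignment step (lines 5-9): each vector goes to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_assignment = dists.argmin(axis=1)
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break
            assignment = new_assignment
            # Recomputation step (lines 10-11): centroid = mean of its assigned vectors.
            for k in range(K):
                members = X[assignment == k]
                if len(members):
                    centroids[k] = members.mean(axis=0)
        return centroids, assignment

    # Toy usage: two obvious clusters.
    X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [3.9, 4.2]])
    centroids, assignment = kmeans(X, K=2)
    print(assignment)   # e.g. [0 0 0 1 1 1] (cluster labels may be swapped)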
Worked Example: Set of points to be clustered
[Figure: a small set of 2-dimensional points]
Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters.

Worked Example (sequence of figure-only slides): random selection of two initial centroids; assign points to the closest centroid; recompute the cluster centroids; repeat assignment and recomputation over several iterations; centroids and assignments after convergence.

K-means is guaranteed to converge
• RSS (Residual Sum of Squares) = sum of all squared distances between document vector and closest centroid
• RSS decreases during each reassignment step,
  • because each vector is moved to a closer centroid.
• RSS decreases during each recomputation step,
  • see next slide.
• There is only a finite number of clusterings.
• Thus: We must reach a fixed point.
• Assumption: Ties are broken consistently.
• Finite set & monotonically decreasing → convergence

Recomputation decreases average distance
RSS = \sum_{k=1}^{K} RSS_k is the residual sum of squares (the "goodness" measure), with
RSS_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2
\frac{\partial RSS_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2(v_m - x_m) = 0
v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m
The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
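A quick numeric check of the recomputation argument above: replacing a cluster's old centroid with the componentwise mean can only lower RSS_k. This is an added illustration with made-up numbers, not from the slides.

    import numpy as np

    def rss_k(v, cluster):
        # Sum of squared distances of the cluster's vectors to a candidate centroid v.
        return float(((cluster - v) ** 2).sum())

    cluster = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 2.0]])   # made-up cluster ω_k
    old_centroid = np.array([0.0, 0.0])                         # centroid before recomputation
    new_centroid = cluster.mean(axis=0)                         # componentwise mean minimizes RSS_k

    print(rss_k(old_centroid, cluster))   # 19.0
    print(rss_k(new_centroid, cluster))   # 4.0  (strictly smaller, as the derivation predicts)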
K-means is guaranteed to converge
• But we don't know how long convergence will take!
• If we don't care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
• However, complete convergence can take many more iterations.

Optimality of K-means
• Convergence ≠ optimality
• Convergence does not mean that we converge to the optimal clustering!
• This is the great weakness of K-means.
• If we start with a bad set of seeds, the resulting clustering can be horrible.

Exercise: Suboptimal clustering
[Figure: six points d_1, d_2, d_3 (top row) and d_4, d_5, d_6 (bottom row); x-axis 0-4, y-axis 0-2]
• What is the optimal clustering for K = 2?
• Do we converge on this clustering for arbitrary seeds d_i, d_j?

Initialization of K-means
• Random seed selection is just one of many ways K-means can be initialized.
• Random seed selection is not very robust: It's easy to get a suboptimal clustering.
• Better ways of computing initial centroids:
  • Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space).
  • Use hierarchical clustering to find good seeds.
  • Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS.

Time complexity of K-means
• Computing one distance of two vectors is O(M).
• Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
• Recomputation step: O(NM) (we need to add each of the document's ≤ M values to one of the centroids)
• Assume the number of iterations is bounded by I.
• Overall complexity: O(IKNM) - linear in all important dimensions
• However: This is not a real worst-case analysis.
• In pathological cases, complexity can be worse than linear.

What is a good clustering?
• Internal criteria
  • Example of an internal criterion: RSS in K-means
• But an internal criterion often does not evaluate the actual utility of a clustering in the application.
• Alternative: External criteria
  • Evaluate with respect to a human-defined classification.

External criteria for clustering quality
• Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
• Goal: Clustering should reproduce the classes in the gold standard.
• (But we only want to reproduce how documents are divided into groups, not the class labels.)
• First measure for how well we were able to reproduce the classes: purity

External criterion: Purity
purity(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|
• \Omega = {\omega_1, \omega_2, ..., \omega_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes.
• For each cluster \omega_k: find the class c_j with the most members n_{kj} in \omega_k.
• Sum all n_{kj} and divide by the total number of points.

Example for computing purity
[Figure: 17 points in three clusters, labeled with the classes x, o, and ⋄]
To compute purity: 5 = max_j |\omega_1 \cap c_j| (class x, cluster 1); 4 = max_j |\omega_2 \cap c_j| (class o, cluster 2); and 3 = max_j |\omega_3 \cap c_j| (class ⋄, cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
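Purity is easy to compute directly from two parallel label lists. The function below is an added sketch; the toy labels at the end are made up. Applied to the counts on the slide above (majority sizes 5, 4, and 3 out of 17 points), it gives the same 0.71.

    from collections import Counter

    def purity(clusters, classes):
        # clusters, classes: parallel lists; clusters[i] is the cluster id of point i,
        # classes[i] its gold-standard class.
        N = len(clusters)
        total = 0
        for k in set(clusters):
            members = [classes[i] for i in range(N) if clusters[i] == k]
            total += Counter(members).most_common(1)[0][1]   # size of the majority class in cluster k
        return total / N

    # Tiny made-up example: two clusters, six points.
    clusters = [1, 1, 1, 2, 2, 2]
    classes  = ['x', 'x', 'o', 'o', 'o', 'x']
    print(purity(clusters, classes))   # (2 + 2) / 6 ≈ 0.67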
Another external criterion: Rand index
• Purity can be increased easily by increasing K.
• A measure that does not have this problem: the Rand index.
• Definition: RI = \frac{TP + TN}{TP + FP + FN + TN}
• Based on a 2 × 2 contingency table of all pairs of documents:
                      same cluster          different clusters
  same class          true positives (TP)   false negatives (FN)
  different classes   false positives (FP)  true negatives (TN)
• TP + FN + FP + TN is the total number of pairs.
• TP + FN + FP + TN = \binom{N}{2} for N documents.
• Example: \binom{17}{2} = 136 in the o/⋄/x example
• Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) ...
• ... and either "true" (correct) or "false" (incorrect): the clustering decision is correct or incorrect.

Rand index: Example
As an example, we compute RI for the o/⋄/x example. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of "positives", or pairs of documents that are in the same cluster, is:
TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40
Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:
TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20
Thus, FP = 40 - 20 = 20. FN and TN are computed similarly.

Rand measure for the o/⋄/x example
                      same cluster   different clusters
  same class          TP = 20        FN = 24
  different classes   FP = 20        TN = 72
RI is then (20 + 72)/(20 + 20 + 24 + 72) ≈ 0.68.

Two other external evaluation measures
• Normalized mutual information (NMI)
  • How much information does the clustering contain about the classification?
  • Singleton clusters (number of clusters = number of docs) have maximum MI.
  • Therefore: normalize by the entropy of clusters and classes.
• F measure
  • Like Rand, but "precision" and "recall" can be weighted.

Evaluation results for the o/⋄/x example
                      purity   NMI    RI     F_5
  lower bound         0.0      0.0    0.0    0.0
  maximum             1.0      1.0    1.0    1.0
  value for example   0.71     0.36   0.68   0.46
All four measures range from 0 (really bad clustering) to 1 (perfect clustering).
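The Rand index can be computed by classifying every document pair according to the contingency table above. This is an added sketch; the toy labels reuse the ones from the purity sketch and are not the slides' o/⋄/x example.

    from itertools import combinations

    def rand_index(clusters, classes):
        # Count document pairs according to the 2x2 contingency table above.
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(clusters)), 2):
            same_cluster = clusters[i] == clusters[j]
            same_class = classes[i] == classes[j]
            if same_cluster and same_class:
                tp += 1
            elif same_cluster:
                fp += 1
            elif same_class:
                fn += 1
            else:
                tn += 1
        return (tp + tn) / (tp + fp + fn + tn)

    clusters = [1, 1, 1, 2, 2, 2]
    classes  = ['x', 'x', 'o', 'o', 'o', 'x']
    print(rand_index(clusters, classes))   # 7/15 ≈ 0.47 for this toy labelling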
How many clusters?
• The number of clusters K is given in many applications.
  • E.g., there may be an external constraint on K. Example: In the case of Scatter-Gather, it was hard to show more than 10-20 clusters on a monitor in the 90s.
• What if there is no external constraint? Is there a "right" number of clusters?
• One way to go: define an optimization criterion.
  • Given docs, find the K for which the optimum is reached.
  • What optimization criterion can we use?
  • We can't use RSS or average squared distance from the centroid as the criterion: it always chooses K = N clusters.

Exercise
• Your job is to develop the clustering algorithms for a competitor to news.google.com.
• You want to use K-means clustering.
• How would you determine K?

Simple objective function for K: Basic idea
• Start with 1 cluster (K = 1).
• Keep adding clusters (= keep increasing K).
• Add a penalty for each new cluster.
• Then trade off cluster penalties against average squared distance from the centroid.
• Choose the value of K with the best tradeoff.

Simple objective function for K: Formalization
• Given a clustering, define the cost for a document as the (squared) distance to its centroid.
• Define the total distortion RSS(K) as the sum of all individual document costs (corresponds to average distance).
• Then: penalize each cluster with a cost \lambda.
  • Thus for a clustering with K clusters, the total cluster penalty is K\lambda.
• Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + K\lambda.
• Select the K that minimizes RSS(K) + K\lambda.
• Still need to determine a good value for \lambda ...

Finding the "knee" in the curve
[Figure: residual sum of squares as a function of the number of clusters]
Pick the number of clusters where the curve "flattens". Here: 4 or 9.
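The K-selection rule just formalized can be sketched as a small search over K. This is an added illustration, not from the slides; it uses scikit-learn's KMeans (whose inertia_ attribute is the within-cluster sum of squared distances, i.e. RSS(K)), and the penalty \lambda and toy data are made up.

    import numpy as np
    from sklearn.cluster import KMeans

    def choose_k(X, k_max=10, lam=1.0):
        # Evaluate total cost RSS(K) + K*lambda for K = 1..k_max and return the minimizer.
        costs = []
        for K in range(1, k_max + 1):
            km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
            costs.append(km.inertia_ + K * lam)   # inertia_ = RSS(K)
        return int(np.argmin(costs)) + 1

    # Toy data with two obvious clusters; with a small penalty lambda, K = 2 should win.
    X = np.vstack([np.random.default_rng(0).normal(0, 0.1, (20, 2)),
                   np.random.default_rng(1).normal(3, 0.1, (20, 2))])
    print(choose_k(X, k_max=6, lam=0.5))   # expected: 2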
Take-away today
• What is clustering?
• Applications of clustering in information retrieval
• K-means algorithm
• Evaluation of clustering
• How many clusters?

Resources
• Chapter 16 of MR
• Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
• Keith van Rijsbergen on the cluster hypothesis (he was one of the originators)
• Bing/Carrot2/Clusty: search result clustering systems
• Stirling number: the number of distinct k-clusterings of n items

PV211: Introduction to Information Retrieval
http://www.fi.muni.cz/~sojka/PV211
MR 17: Hierarchical clustering
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2017-04-18

Overview
1. Introduction
2. Single-link/Complete-link
3. Centroid/GAAC
4. Labeling clusters
5. Variants

Take-away today
• Introduction to hierarchical clustering
• Single-link and complete-link clustering
• Centroid and group-average agglomerative clustering (GAAC)
• Bisecting K-means
• How to label clusters automatically

We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.

Hierarchical agglomerative clustering (HAC)
• HAC creates a hierarchy in the form of a binary tree.
• Assumes a similarity measure for determining the similarity of two clusters.
• Up to now, our similarity measures were for documents.
• We will look at four different cluster similarity measures.

Basic algorithm
• Start with each document in a separate cluster.
• Then repeatedly merge the two clusters that are most similar ...
• ... until there is only one cluster.
• The history of merging is a hierarchy in the form of a binary tree.
• The standard way of depicting this history is a dendrogram.
[Figure: dendrogram of a clustering of 30 news documents; document titles ("Ag trade reform.", "Back-to-school spending is up", ..., "Fed keeps interest rates steady") on the left, combination similarity from 1.0 down to 0.0 on the horizontal axis]

Divisive clustering
• Divisive clustering is top-down.
• Alternative to HAC (which is bottom-up).
• Divisive clustering:
  • Start with all docs in one big cluster.
  • Then recursively split clusters.
  • Eventually each node forms a cluster on its own.
• → Bisecting K-means at the end
• For now: HAC (= bottom-up)

Naive HAC algorithm
SimpleHAC(d_1, ..., d_N)
 1  for n <- 1 to N
 2  do for i <- 1 to N
 3     do C[n][i] <- SIM(d_n, d_i)
 4     I[n] <- 1   (keeps track of active clusters)
 5  A <- []        (collects clustering as a sequence of merges)
 6  for k <- 1 to N - 1
 7  do <i, m> <- arg max_{<i,m>: i ≠ m, I[i] = 1, I[m] = 1} C[i][m]
 8     A.Append(<i, m>)   (store merge)
 9     for j <- 1 to N
10     do   (use i as the representative of the merged cluster <i, m>)
11        C[i][j] <- SIM(<i, m>, j)
12        C[j][i] <- SIM(<i, m>, j)
13     I[m] <- 0   (deactivate cluster)
14  return A

Computational complexity of the naive algorithm
• First, we compute the similarity of all N × N pairs of documents.
• Then, in each of N iterations:
  • We scan the O(N × N) similarities to find the maximum similarity.
  • We merge the two clusters with maximum similarity.
  • We compute the similarity of the new cluster with all other (surviving) clusters.
• There are O(N) iterations, each performing an O(N × N) "scan" operation.
• Overall complexity is O(N^3).
• We'll look at more efficient algorithms later.
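The naive algorithm above can be sketched directly in Python. This is an added illustration, not the course's implementation; it uses cosine similarity between documents and a pluggable update rule for merged clusters (max gives single-link, min complete-link, both discussed below), and the toy vectors are made up.

    import numpy as np

    def simple_hac(X, sim_update=max):
        # Naive O(N^3) HAC following the SimpleHAC pseudocode above.
        # X: (N, M) array of document vectors; sim_update combines similarities on merges.
        N = len(X)
        norm = X / np.linalg.norm(X, axis=1, keepdims=True)
        C = norm @ norm.T                      # C[n][i] = cosine similarity of d_n and d_i
        np.fill_diagonal(C, -np.inf)           # never merge a cluster with itself
        I = np.ones(N, dtype=bool)             # active clusters
        A = []                                 # sequence of merges
        for _ in range(N - 1):
            # Scan all active pairs for the maximum similarity (line 7).
            best, (i, m) = -np.inf, (None, None)
            for a in range(N):
                for b in range(N):
                    if a != b and I[a] and I[b] and C[a][b] > best:
                        best, (i, m) = C[a][b], (a, b)
            A.append((i, m))
            for j in range(N):                 # i now represents the merged cluster (lines 9-12)
                if j != i:
                    C[i][j] = C[j][i] = sim_update(C[i][j], C[m][j])
            I[m] = False                       # deactivate cluster m (line 13)
        return A

    X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    print(simple_hac(X))   # e.g. [(0, 1), (2, 3), (0, 2)]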
Sojka, MR Group: PV211: Hierarchical clustering 11 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ey question: now to aerme cluster similarity • Single-link: Maximum similarity o Maximum similarity of any two documents • Complete-link: Minimum similarity o Minimum similarity of any two documents • Centroid: Average "intersimilarity" • Average similarity of all document pairs (but excluding pairs of docs in the same cluster) o This is equivalent to the similarity of the centroids. • Group-average: Average "intrasimilarity" a Average similary of all document pairs, including pairs of docs in the same cluster □ Sojka, MR Group: PV211: Hierarchical clustering 12 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants uster similarity: txampie Sojka, MR Group: PV211: Hierarchical clustering 13 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ingie-nn aximum a Sojka, MR Group: PV211: Hierarchical clustering 14 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ompiete-iin a inimum Sojka, MR Group: PV211: Hierarchical clustering 15 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants : Average intersimi arity intersimilarity = similarity of two documents in different clusters 4 T- 3 + 2 + 1 + 0 0 7 Sojka, MR Group: PV211: Hierarchical clustering 16 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants p average: average intrasimi arity intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster Sojka, MR Group: PV211: Hierarchical clustering 17 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants uster similarity: Larger txampie Sojka, MR Group: PV211: Hierarchical clustering 18 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ingie-nn aximum a Sojka, MR Group: PV211: Hierarchical clustering 19 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ompiete-iin a inimum Sojka, MR Group: PV211: Hierarchical clustering 20 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants verage intersimnarity Sojka, MR Group: PV211: Hierarchical clustering 21 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants p average: Average intrasimnarity Sojka, MR Group: PV211: Hierarchical clustering 22 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ingle 11n • The similarity of two clusters is the maximum intersimilarity -the maximum similarity of a document from the first cluster and a document from the second cluster. • Once we have merged two clusters, how do we update the similarity matrix? • This is simple for single link: SIM(CJ/, (cj/q UCJ/c2)) = max(SIM(cJ/, CJ/q), SIM(CJ/, UJk2)) □ Sojka, MR Group: PV211: Hierarchical clustering 24 / 62 1.0 l_ Ag trade reform. Back-to-school spending is up Lloyd's CEO questioned Lloyd's chief / U.S. 
grilling Viag stays positive Chrysler7 Latin America Ohio Blue Cross Japanese prime minister / Mexico CompuServe reports loss Sprint / Internet access service Planet Hollywood Trocadero: tripling of revenues German unions split War hero Colin Powell War hero Colin Powell Oil prices slip Chains may raise prices Clinton signs law Lawsuit against tobacco companies suits against tobacco firms Indiana tobacco lawsuit Most active stocks Mexican markets Hog prices tumble NYSE closing averages British FTSE index Fed holds interest rates steady Fed to keep interest rates steady Fed keeps interest rates steady Fed keeps interest rates steady 0.8 I 0.6 l 0.4 I 0.2 _J 0.0 I I h-- \- —.1 -1 1 1 1 1 Q_ Q_ CD CD Q_ <' CD O Q_ O" QJ 3 n r+ r+ r+ 3" □ CD Q_ to r+ CD QJ r+ n QJ CD 1 n_ to r+ CD co 1 n_ to r+ CD o —I r+ 5 o CD r+ CD to o QJ ^- E QJ to n S Q_ 3 Q-Z H) C O 3 » £■ fl> (/) ■ ■ tO ^—v -J 3" Qq QJ Q_ Q_ CD Q_ QJ O 13 3 QJ Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ompiete nn • The similarity of two clusters is the minimum intersimilarity the minimum similarity of a document from the first cluster and a document from the second cluster. • Once we have merged two clusters, how do we update the similarity matrix? • Again, this is simple: SIM(CJ/, (cj/q UCJ/c2)) = min(SIM(cJ/, CJ/q), SIM(CJ/, UJk2)) • We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them. Sojka, MR Group: PV211: Hierarchical clustering 26 / 62 1.0 l_ 0.8 _1_ 0.6 _l_ 0.4 _1_ 0.2 _l_ 0.0 _J NYSE closing averages Hog prices tumble Oil prices slip Ag trade reform. Chrysler / Latin America Japanese prime minister / Mexico Fed holds interest rates steady Fed to keep interest rates steady Fed keeps interest rates steady Fed keeps interest rates steady Mexican markets British FTSE index War hero Colin Powell War hero Colin Powell Lloyd's CEO questioned Lloyd's chief / U.S. grilling Ohio Blue Cross Lawsuit against tobacco companies suits against tobacco firms Indiana tobacco lawsuit Viag stays positive Most active stocks CompuServe reports loss Sprint / Internet access service Planet Hollywood Trocadero: tripling of revenues Back-to-school spending is up German unions split Chains may raise prices Clinton signs law to QJ n' CD o ■ r+ r+ 3" CD to QJ 3 CD i r+ n_ r+ to < r+ < CD o —\ n_ n_ to to r+ r+ CD CD —\ to O CD CD QJ n 3 qq CD 1 — — CD — -3 3 c- r- £ =■ r\ 3 ft) QJ 3 O ft) 3 °- o s cr 0 qj Oq qj qj n Q_ QJ 3 to n n QJ to □ Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ompute single ana complete imK clusterings di d2 d-* d. 3 "4 3 + 2 1 XX XX č/5 c/ô c/7 ds XX XX o 0 12 3 4 Sojka, MR Group: PV211: Hierarchical clustering 28 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ingle-link cluster! Sojka, MR Group: PV211: Hierarchical clustering 29 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ompiete link clustering Sojka, MR Group: PV211: Hierarchical clustering 30 / 62 Introduction Si ngle-link/Com plete- ink Centroid/GAAC Labeling clusters Variants c ■ 1 1" 1 r~* 1 j_ 1" 1 1 _i_ bingle- link vs. Lor nplete link clustering Sojka, MR Group: PV211: Hierarchical clustering 31 / 62 Introduction Single-link/Complete-link Centroid / GAAC Labeling clusters Variants oingie- 2 1 0 xxxxxxxxxxxx "x" X X X X X X X X x x~~^> 1 1 1 2 1 3 4 5 6 7 8 9 1 1 1 10 11 12 Single-link clustering often produces long, straggly clusters. 
For most applications, these are undesirable. Sojka, MR Group: PV211: Hierarchical clustering 32 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants -cluster clustering will compiete-nnk producer a di d2 d3 d4 c/5 0 1 2 3 4 5 6 7 Coordinates: 1 + 2 x e, 4, 5 + 2 x e, 6, 7 — e. Sojka, MR Group: PV211: Hierarchical clustering 33 / 62 Introduction Single-link/Complete-link Centroidy ;GAAC Labeling clusters Variants ^ompiete-iinK: oensn uvity 10 OU tners 02 C/3 c/4 c/5 x} (x X X. 0 1 2 3 4 5 6 7 • The complete-link clustering of this set splits c/2 from its right neighbors - clearly undesirable. o The reason is the outlier d\. 9 This shows that a single outlier can negatively affect the outcome of complete-link clustering. • Single-link clustering does better in this case. Sojka, MR Group: PV211: Hierarchical clustering 34 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants • The similarity of two clusters is the average intersimilarity -the average similarity of documents from the first cluster with documents from the second cluster. • A naive implementation of this definition is inefficient (0(A/2)), but the definition is equivalent to computing the similarity of the centroids: SIM-CENT(cj/, UJj) = /2(CJ/) • jl{uJj) • Hence the name: centroid HAC • Note: this is the dot product, not cosine similarity! □ Sojka, MR Group: PV211: Hierarchical clustering 36 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ompute centroia clustering 5 4 + + X d1 3 2 --1 --0 X d3 + x d2 X cV4 d5 X X d6 01234567 Sojka, MR Group: PV211: Hierarchical clustering 37 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants 1 ^entroia clustering 5 4- X c/i X d3 4 — O 3 -- X d2 2 — 1 --0 0 X d4 ^0 d5 X--0--X d6 Hi 15 6 7 Sojka, MR Group: PV211: Hierarchical clustering 38 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants I nversion in cen •nil clustering • In an inversion, the similarity increases during a merge sequence. Results in an "inverted" dendrogram. • Below: Similarity of the first merger (di U d2) is -4.0, similarity of second merger ((di U d2) U c/3) is ~ —3.5. 5 - 4 — 3 -- 2 -- cŕi 1 -- d3 X X O d2 X 0 j-4 —3 —2 — 1 -- 0 0 1 2 3 4 5 Sojka, MR Group: PV211: Hierarchical clustering 39 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants 1 nversions • Hierarchical clustering algorithms that allow inversions are inferior. • The rationale for hierarchical clustering is that at any given point, we've found the most coherent clustering for a given K. • Intuitively: smaller clusterings should be more coherent than larger clusterings. • An inversion contradicts this intuition: we have a large cluster that is more coherent than one of its subclusters. • The fact that inversions can occur in centroid clustering is a reason not to use it. Sojka, MR Group: PV211: Hierarchical clustering 40 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants p-average aggiomerative clustering • GAAC also has an "average-similarity" criterion, but does not have inversions. • The similarity of two clusters is the average intrasimilarity -the average similarity of all document pairs (including those from the same cluster). • But we exclude self-similarities. 
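The group-average criterion just defined (average pairwise similarity over the merged cluster, excluding self-similarities, using dot products as the slides note) can be computed naively as follows. This is an added sketch with made-up vectors; the next slide gives the equivalent, more efficient centroid-based formula.

    import numpy as np

    def group_average_sim(cluster_i, cluster_j):
        # Naive O(N^2) group-average similarity of the merger of two clusters:
        # average pairwise dot product over all documents in omega_i ∪ omega_j,
        # excluding self-similarities.
        docs = np.vstack([cluster_i, cluster_j])
        n = len(docs)
        sims = docs @ docs.T                  # all pairwise dot products
        total = sims.sum() - np.trace(sims)   # drop the self-similarities on the diagonal
        return total / (n * (n - 1))

    omega_i = np.array([[1.0, 0.0], [0.9, 0.1]])
    omega_j = np.array([[0.8, 0.2]])
    print(group_average_sim(omega_i, omega_j))   # average over the 6 ordered non-self pairs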
Sojka, MR Group: PV211: Hierarchical clustering 41 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants p-average aggiomerative clustering • Again, a naive implementation is inefficient (0(A/2)) and there is an equivalent, more efficient, centroid-based definition: sim-ga(cj/,cj/) = (W, + W,)(W, + W,-1)K S 3„)» - («,+ W;)] • Again, this is the dot product, not cosine similarity. Sojka, MR Group: PV211: Hierarchical clustering 42 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants clustering snou use • Don't use centroid HAC because of inversions. • In most cases: GAAC is best since it isn't subject to chaining and sensitivity to outliers. • However, we can only use GAAC for vector representations. • For other types of document representations (or if only pairwise similarities for documents are available): use complete-link. • There are also some applications for single-link (e.g., duplicate detection in web search). □ Sojka, MR Group: PV211: Hierarchical clustering 43 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants at or hierarchical clustering.' • For high efficiency, use flat clustering (or perhaps bisecting /c-means) • For deterministic results: HAC • When a hierarchical structure is desired: hierarchical algorithm • HAC also can be applied if K cannot be predetermined (can start without knowing K) □ Sojka, MR Group: PV211: Hierarchical clustering 44 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ajor issue in clustering - laoenng • After a clustering algorithm finds a set of clusters: how can they be useful to the end user? • We need a pithy label for each cluster. 9 For example, in search result clustering for "jaguar", The labels of the three clusters could be "animal", "car", and "operating system". o Topic of this section: How can we automatically find good labels for clusters? Sojka, MR Group: PV211: Hierarchical clustering 46 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants 9 Come up with an algorithm for labeling clusters • Input: a set of documents, partitioned into K clusters (flat clustering) • Output: A label for each cluster • Part of the exercise: What types of labels should we consider? Words? □ Sojka, MR Group: PV211: Hierarchical clustering 47 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants E iscriminative labeling • To label cluster cj, compare uj with all other clusters • Find terms or phrases that distinguish uj from the other clusters • We can use any of the feature selection criteria we introduced in text classification to identify discriminating terms: mutual information, x2 and frequency. • (but the latter is actually not discriminative) □ Sojka, MR Group: PV211: Hierarchical clustering 48 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants on- sc ve laDenng • Select terms or phrases based solely on information from the cluster itself o E.g., select terms with high weights in the centroid (if we are using a vector space model) • Non-discriminative methods sometimes select frequent terms that do not distinguish clusters. • For example, Monday, Tuesday, ... 
in newspaper text □ Sojka, MR Group: PV211: Hierarchical clustering 49 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants sing tines Tor laoenng clusters • Terms and phrases are hard to scan and condense into a holistic idea of what the cluster is about. • Alternative: titles • For example, the titles of two or three documents that are closest to the centroid. • Titles are easier to scan than a list of phrases. Sojka, MR Group: PV211: Hierarchical clustering 50 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants uster laoenng: txampie 9 10 # docs 622 1017 1259 labeling method centroid oil plant mexico production crude power 000 refinery gas bpd police security rus-sian people military peace killed told grozny court 00 000 tonnes traders futures wheat prices cents September tonne mutual information plant oil production barrels crude bpd mexico dolly capacity petroleum police killed military security peace told troops forces rebels people delivery traders tu-tures tonne tonnes desk wheat prices title MEXICO: Hurricane Dolly heads for Mexico coast RUSSIA: Russia's Lebed meets rebel chief in Chechnya USA: Export Business - Grain/oilseeds complex 000 00 • Three methods: most prominent terms in centroid, differential labeling using Ml, title of doc closest to centroid • All three methods do a pretty good job. Sojka, MR Group: PV211: Hierarchical clustering 51 / 62 Introduction Sing le- link/Complete-link Centroid/GAAC Labeling clusters Variants 3i sea cing , i-means: > :op-a own ai • Start with all documents in one cluster • Split the cluster into 2 using K-means • Of the clusters produced so far, select one to split (e.g. select the largest one) • Repeat until we have produced the desired number of clusters □ Sojka, MR Group: PV211: Hierarchical clustering 53 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants Bisecting K-means BisectingKMeans(c/i, ..., d^) 1 cjo <- {d\,..., d/\i} 2 leaves <— {ujo} 3 for k ™)'(/W (A/m+/v,)(!vm+A/,-i) IK + nf (Nm + N£)] □ Sojka, MR Group: PV211: Hierarchical clustering 58 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants I omparison o aigoritnm method combination similarity time compl. optimal? comment single-link max intersimilarity of any 2 docs 0(A/2) yes chaining effect complete-link min intersimilarity of any 2 docs 0(A/2 log N) no sensitive to outliers group-average average of all sims 0(A/2log N) no best choice for most applications centroid average intersimilarity 0(A/2log N) no inversions can occur Sojka, MR Group: PV211: Hierarchical clustering 59 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants e hierarchy • Use as is (e.g., for browsing as in Yahoo hierarchy) • Cut at a predetermined threshold • Cut to get a predetermined number of clusters K • Ignores hierarchy below and above cutting line. 
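The two ways of cutting the hierarchy just listed (cut at a similarity threshold, or cut to get exactly K clusters) are directly supported by SciPy's hierarchical clustering routines. This is an added sketch, not part of the slides; the document vectors and the cut parameters are made up, and complete-link on cosine distances is just one possible choice.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy document vectors (made up for illustration).
    X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])

    # Build the merge history (here: complete-link on cosine distances).
    Z = linkage(X, method='complete', metric='cosine')

    # Cut at a predetermined distance threshold ...
    labels_by_threshold = fcluster(Z, t=0.5, criterion='distance')
    # ... or cut to get a predetermined number of clusters K.
    labels_by_k = fcluster(Z, t=2, criterion='maxclust')

    print(labels_by_threshold, labels_by_k)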
Sojka, MR Group: PV211: Hierarchical clustering 60 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants ake-away today o Introduction to hierarchical clustering • Single-link and complete-link clustering • Centroid and group-average agglomerative clustering (GAAC) • Bisecting K-means • How to label clusters automatically □ Sojka, MR Group: PV211: Hierarchical clustering 61 / 62 Introduction Single-link/Complete-link Centroid/GAAC Labeling clusters Variants lesources • Chapter 17 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o Columbia Newsblaster (a precursor of Google News): McKeown et al. (2002) • Bisecting K-means clustering: Steinbach et al. (2000) • PDDP (similar to bisecting K-means; deterministic, but also less efficient): Saravesi and Boley (2004) Sojka, MR Group: PV211: Hierarchical clustering 62 / 62 PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-04-13 Sojka, MR Group: PV211: Latent Semantic Indexing 1/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering verview Q Recap Q Latent semantic indexing Q Dimensionality reduction O LSI in information retrieval 0 CI ustermg Sojka, MR Group: PV211: Latent Semantic Indexing 2/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering naexing ancnor • Anchor text is often a better description of a page's content than the page itself. • Anchor text can be weighted more highly than the text on the page. • A Google bomb is a search with "bad" results due to maliciously manipulated anchor text. [dangerous cult] on Google, Bing, Yahoo Sojka, MR Group: PV211: Latent Semantic Indexing 4/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering -age Kan • Model: a web surfer doing a random walk on the web • Formalization: Markov chain • PageRank is the long-term visit rate of the random surfer or the steady-state distribution. • Need teleportation to ensure well-defined PageRank • Power method to compute PageRank • PageRank is the principal left eigenvector of the transition probability matrix. Sojka, MR Group: PV211: Latent Semantic Indexing 5/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering L-omputing v 'ager ^ariK: rower metnoa Xl X2 Pt(d2) Pn = o.i Pi2 = 0.9 P2i = 0.3 P22 = 0.7 to 0 1 0.3 0.7 = xP ti 0.3 0.7 0.24 0.76 = xP2 0.24 0.76 0.252 0.748 = xP3 t3 0.252 0.748 0.2496 0.7504 = xP4 too 0.25 0.75 0.25 0.75 = xP°° PageRank vector = Ťr = (it\,it2) = (0.25,0.75) Pt(c/i) = Pt_i(c/i) * Pn + Pt-i(cfe) * P21 Pt(cfe) = Pt_i(i/i) * P12 + Pt-i(cfe) * P22 Sojka, MR Group: PV211: Latent Semantic Indexing 6/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering UDS ana a hubs authorities www. b es t f a res. 
co m www.airlinesquality.com blogs.usatoday.com/sky aviationblog.dallasnews.com www.delta.com www.united.com Sojka, MR Group: PV211: Latent Semantic Indexing 7/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering upaate ruies • A: link matrix • h: vector of hub scores • a: vector of authority scores • HITS algorithm: • Compute h = Aa 9 Compute a = ATh • Iterate until convergence o Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score Sojka, MR Group: PV211: Latent Semantic Indexing 8/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering E ai Ke-away i coa ay • Latent Semantic Indexing (LSI) / Singular Value Decomposition: The math • SVD used for dimensionality reduction • LSI: SVD in information retrieval • LSI as clustering 9 gensim: Topic modelling for humans (practical use of LSI etal.) Sojka, MR Group: PV211: Latent Semantic Indexing 9/44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval eca ocument matrix Clustering Anthony Julius The Hamlet Othello and Caesar Tempest Cleopatra anthony 5.25 3.18 0.0 0.0 0.0 brutus 1.21 6.10 0.0 1.0 0.0 caesar 8.59 2.54 0.0 1.51 0.25 calpurnia 0.0 1.54 0.0 0.0 0.0 cleopatra 2.85 0.0 0.0 0.0 0.0 mercy 1.51 0.0 1.90 0.12 5.25 worser 1.37 0.0 0.11 4.15 0.25 This matrix is the basis for computing the similarity between documents and queries. Today: Can we transform this matrix, so that we get a better measure of similarity between documents and queries? Sojka, MR Group: PV211: Latent Semantic Indexing 11 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering atent semantic in • We will decompose the term-document matrix into a product of matrices. • The particular decomposition we'll use: singular value decomposition (SVD). 9 SVD: C = UHVT (where C = term-document matrix) • We will then use the SVD to compute a new, improved term-document matrix C. • We'll get better similarity values out of C (compared to C). • Using SVD for this purpose is called latent semantic indexing or LSI. □ Sojka, MR Group: PV211: Latent Semantic Indexing 12 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering c di d2 d3 d4 ds de ship 1 0 1 0 0 0 boat 0 1 0 0 0 0 ocean 1 1 0 0 0 0 wood 1 0 0 1 1 0 tree 0 0 0 1 0 1 This is a standard term-document matrix. Actually, we use a non-weighted matrix here to simplify example. Sojka, MR Group: PV211: Latent Semantic Indexing 13 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering u 1 2 3 4 5 ship -0.44 -0.30 0.57 0.58 0.25 boat -0.13 -0.33 -0.59 0.00 0.73 ocean -0.48 -0.51 -0.37 0.00 -0.61 wood -0.70 0.35 0.15 -0.58 0.16 tree -0.26 0.65 -0.41 0.58 -0.09 One row per term, one column per min(M, N) where M is the number of terms and N is the number of documents. This is an orthonormal matrix: (i) Row vectors have unit length, (ii) Any two distinct row vectors are orthogonal to each other. Think of the dimensions as "semantic" dimensions that capture distinct topics like politics, sports, economics. 2 = land/water Each number u,j in the matrix indicates how strongly related term / is to the topic represented by semantic dimension j. 
Sojka, MR Group: PV211: Latent Semantic Indexing 14 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering E 1 2 3 4 5 1 2.16 0.00 0.00 0.00 0.00 2 0.00 1.59 0.00 0.00 0.00 3 0.00 0.00 1.28 0.00 0.00 4 0.00 0.00 0.00 1.00 0.00 5 0.00 0.00 0.00 0.00 0.39 This is a square, diagonal matrix of dimensionality min(M, N) x min(M, A/). The diagonal consists of the singular values of C. The magnitude of the singular value measures the importance of the corresponding semantic dimension. We'll make use of this by omitting unimportant dimensions. Sojka, MR Group: PV211: Latent Semantic Indexing 15 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering VT di d2 d3 1 -0.75 -0.28 -0.20 -0.45 -0.33 -0.12 2 -0.29 -0.53 -0.19 0.63 0.22 0.41 3 0.28 -0.75 0.45 -0.20 0.12 -0.33 4 0.00 0.00 0.58 0.00 -0.58 0.58 5 -0.53 0.29 0.63 0.19 0.41 -0.22 One column per document, one row per min(M, N) where M is the number of terms and N is the number of documents. Again: This is an orthonormal matrix: (i) Column vectors have unit length, (ii) Any two distinct column vectors are orthogonal to each other. These are again the semantic dimensions from matrices U and Z that capture distinct topics like politics, sports, economics. Each number v,j in the matrix indicates how strongly related document / is to the topic represented by semantic dimension Sojka, MR Group: PV211: Latent Semantic Indexing 16 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering c di d2 d3 dA ds d6 ship 1 0 1 0 0 0 boat 0 1 0 0 0 0 ocean 1 1 0 0 0 0 wood 1 0 0 1 1 0 tree 0 0 0 1 0 1 U 1 2 3 4 5 Z 1 2 3 4 5 ship 0.44 -0.30 0.57 0.58 0.25 1 2.16 0.00 0.00 0.00 0.00 boat 0.13 -0.33 -0.59 0.00 0.73 2 0.00 1.59 0.00 0.00 0.00 ocean 0.48 -0.51 -0.37 0.00 -0.61 * 3 0.00 0.00 1.28 0.00 0.00 wood 0.70 0.35 0.15 -0.58 0.16 4 0.00 0.00 0.00 1.00 0.00 tree 0.26 0.65 -0.41 0.58 -0.09 5 0.00 0.00 0.00 0.00 0.39 VT di d2 ds ds de 1 2 3 4 5 -0.75 -0.29 0.28 0.00 -0.53 0.28 0.53 0.75 0.00 0.29 0.20 0.19 0.45 0.58 0.63 -0.45 0.63 -0.20 0.00 0.19 -0.33 0.22 0.12 -0.58 0.41 -0.12 0.41 0.33 0.58 0.22 LSI is decomposition of C into a representation of the terms, a representation of the documents and a representation of the importance of the "semantic" dimensions. □ Sojka, MR Group: PV211: Latent Semantic Indexing 17 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering ummary • We've decomposed the term-document matrix C into a product of three matrices: UHVJ. & The term matrix U - consists of one (row) vector for each term • The document matrix VT - consists of one (column) vector for each document • The singular value matrix Z - diagonal matrix with singular values, reflecting importance of each dimension • Next: Why are we doing this? □ Sojka, MR Group: PV211: Latent Semantic Indexing 18 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering VT di d2 d3 1 -0.75 -0.28 -0.20 -0.45 -0.33 -0.12 2 -0.29 -0.53 -0.19 0.63 0.22 0.41 3 0.28 -0.75 0.45 -0.20 0.12 -0.33 4 0.00 0.00 0.58 0.00 -0.58 0.58 5 -0.53 0.29 0.63 0.19 0.41 -0.22 Verify that the first document has unit length. Verify that the first two documents are orthogonal. 
0.752 + 0.292 + 0.282 + 0.002 + 0.532 = 1.0059 -0.75 * -0.28 + -0.29 * -0.53 + 0.28 * -0.75 + 0.00 * 0.00 + -0.53*0.29 = 0 Sojka, MR Group: PV211: Latent Semantic Indexing 19 / 44 • Key property: Each singular value tells us how important its dimension is. • By setting less important dimensions to zero, we keep the important information, but get rid of the "details". • These details may o be noise - in that case, reduced LSI is a better representation because it is less noisy, o make things dissimilar that should be similar - again, the reduced LSI representation is a better representation because it represents similarity better. • Analogy for "fewer details is better" • Image of a blue flower o Image of a yellow flower a Omitting color makes is easier to see the similarity □ Sojka, MR Group: PV211: Latent Semantic Indexing 21 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering imensionaiity to u 1 2 3 4 5 ship -0.44 -0.30 0.00 0.00 0.00 boat -0.13 -0.33 0.00 0.00 0.00 ocean -0.48 -0.51 0.00 0.00 0.00 wood -0.70 0.35 0.00 0.00 0.00 tree -0.26 0.65 0.00 0.00 0.00 £2 1 2 3 4 5 1 2.16 0.00 0.00 0.00 0.00 2 0.00 1.59 0.00 0.00 0.00 3 0.00 0.00 0.00 0.00 0.00 4 0.00 0.00 0.00 0.00 0.00 5 0.00 0.00 0.00 0.00 0.00 VT di d2 ^3 de 1 -0.75 -0.28 - 0.20 -0.45 -0.33 -0.12 2 -0.29 -0.53 - 0.19 0.63 0.22 0.41 3 0.00 0.00 0.00 0.00 0.00 0.00 4 0.00 0.00 0.00 0.00 0.00 0.00 5 0.00 0.00 0.00 0.00 0.00 0.00 Actually, we only zero out singular values in Z. This has the effect of setting the corresponding dimensions in U and VT to zero when computing the product C = UTVT. □ Sojka, MR Group: PV211: Latent Semantic Indexing 22 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering imensionanty to Kl c2 di d2 ds d* ds ship 0.85 0.52 0.28 0.13 0.21 -0.08 boat 0.36 0.36 0.16 -0.20 -0.02 -0.18 ocean 1.01 0.72 0.36 -0.04 0.16 -0.21- wood 0.97 0.12 0.20 1.03 0.62 0.41 tree 0.12 -0.39 -0.08 0.90 0.41 0.49 U 1 2 3 4 5 1 2 3 4 5 ship -0.44 -0.30 0.57 0.58 0.25 1 2.16 0.00 0.00 0.00 0.00 boat -0.13 -0.33 -0.59 0.00 0.73 2 X 3 0.00 1.59 0.00 0.00 0.00 ocean -0.48 -0.51 -0.37 0.00 -0.61 0.00 0.00 0.00 0.00 0.00 wood -0.70 0.35 0.15 -0.58 0.16 4 0.00 0.00 0.00 0.00 0.00 tree -0.26 0.65 -0.41 0.58 -0.09 5 0.00 0.00 0.00 0.00 0.00 VT di d2 ds dA ds 1 -0.75 -0.28 -0.20 -0.45 -0.33 -0.12 2 -0.29 -0.53 -0.19 0.63 0.22 0.41 3 0.28 -0.75 0.45 -0.20 0.12 -0.33 4 0.00 0.00 0.58 0.00 -0.58 0.58 5 -0.53 0.29 0.63 0.19 0.41 -0.22 X □ Sojka, MR Group: PV211: Latent Semantic Indexing 23 / 44 bxample ot C = U2-V : AM tour matrices Kecall unreduced decomposition C = UHVT Exercise: Why can this be viewed as soft clustering? C d i d2 d3 d4 ds de ship 1 0 1 0 0 0 boat 0 1 0 0 0 0 ocean 1 1 0 0 0 0 wood 1 0 0 1 1 0 tree 0 0 0 1 0 1 U 1 2 3 4 5 Z 1 2 3 4 5 ship 0.44 -0.30 0.57 0.58 0.25 1 2.16 0.00 0.00 0.00 0.00 boat 0.13 -0.33 -0.59 0.00 0.73 -0.61 2 0.00 1.59 0.00 0.00 0.00 ocean 0.48 -0.51 -0.37 0.00 3 0.00 0.00 1.28 0.00 0.00 wood 0.70 0.35 0.15 -0.58 0.16 4 0.00 0.00 0.00 1.00 0.00 tree 0.26 0.65 -0.41 0.58 -0.09 5 0.00 0.00 0.00 0.00 0.39 VT di d2 d3 c/4 ds de 1 2 3 4 5 0.75 0.29 0.28 0.00 0.53 0.28 0.53 0.75 0.00 0.29 0.20 0.19 0.45 0.58 0.63 0.45 0.63 0.20 0.00 0.19 0.33 0.22 0.12 0.58 0.41 0.12 0.41 0.33 0.58 0.22 LSI is decomposition of C into a representation of the terms, a representation of the documents and a representation of the importance of the "semantic" dimensions. 
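The decomposition shown above can be reproduced with an off-the-shelf SVD routine. This is an added sketch using numpy, not part of the slides; note that the signs of the singular-vector entries may differ from the tables above, since the SVD is unique only up to sign.

    import numpy as np

    # The example term-document matrix C from the slides
    # (terms: ship, boat, ocean, wood, tree; documents d1..d6).
    C = np.array([[1, 0, 1, 0, 0, 0],
                  [0, 1, 0, 0, 0, 0],
                  [1, 1, 0, 0, 0, 0],
                  [1, 0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 0, 1]], dtype=float)

    U, sigma, VT = np.linalg.svd(C, full_matrices=False)
    print(np.round(sigma, 2))                        # should match the diagonal of Σ above (up to rounding)
    print(np.allclose(U @ np.diag(sigma) @ VT, C))   # True: C = U Σ V^T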
□ Sojka, MR Group: PV211: Latent Semantic Indexing 24 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering riginai matrix l vs. reauce c di d2 ^3 d* ship 1 0 1 0 0 0 boat 0 1 0 0 0 0 ocean 1 1 0 0 0 0 wood 1 0 0 1 1 0 tree 0 0 0 1 0 1 c2 di d2 ^3 d* de ship 0.85 0.52 0.28 0.13 0.21 -0.08 boat 0.36 0.36 0.16 -0.20 -0.02 -0.18 ocean 1.01 0.72 0.36 -0.04 0.16 -0.21 wood 0.97 0.12 0.20 1.03 0.62 0.41 tree 0.12 -0.39 0.08 0.90 0.41 0.49 We can view C2 as a two-dimensional representation of the matrix C. We have performed a dimensionality reduction to two dimensions. □ Sojka, MR Group: PV211: Latent Semantic Indexing 25 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering c di d2 ^3 d4 de Compute the ship 1 0 1 0 0 0 similarity between boat 0 1 0 0 0 0 d2 and 0(3 for the ocean 1 1 0 0 0 0 original matrix wood 1 0 0 1 1 0 and for the tree 0 0 0 1 0 1 reduced matrix. c2 di d2 ^3 d* de ship 0.85 0.52 0.28 0.13 0.21 -0.08 boat 0.36 0.36 0.16 -0.20 -0.02 -0.18 ocean 1.01 0.72 0.36 -0.04 0.16 -0.21 wood 0.97 0.12 0.20 1.03 0.62 0.41 tree 0.12 -0.39 0.08 0.90 0.41 0.49 Sojka, MR Group: PV211: Latent Semantic Indexing 26 / 44 Recap Latent semantic indexing Dimensionality reduction mt tne re auced ma" C di d2 ^3 d* de ship 1 0 1 0 0 0 boat 0 1 0 0 0 0 ocean 1 1 0 0 0 0 wood 1 0 0 1 1 0 tree 0 0 0 1 0 1 c2 di d2 ^3 d* de ship 0.85 0.52 0.28 0.13 0.21 -0.08 boat 0.36 0.36 0.16 -0.20 -0.02 -0.18 ocean 1.01 0.72 0.36 -0.04 0.16 -0.21 wood 0.97 0.12 0.20 1.03 0.62 0.41 tree 0.12 ■0.39 0.08 0.90 0.41 0.49 LSI in information retrieval an Clustering Similarity of d2 and d$ in the original space: 0. Similarity of d2 and d$ in the reduced space: 0.52 *0.28 + 0.36 * 0.16 + 0.72 *0.36 + 0.12 * 0.20 + -0.39 * -0.08 « 0.52 "boat" and "ship" are semantically similar. The "reduced" similarity measure reflects this. What property of the SVD reduction is responsible for improved similarity? □ Sojka, MR Group: PV211: Latent Semantic Indexing 27 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering ompute matrix proauc c2 di d2 cJ3 c/4 ds ds ship 0.09 0.16 0.06 -0.19 -0.07 -0.12 boat ocean 0.10 0.15 0.17 0.27 0.06 -0.21 -0.07 -0.14 0.10 -0.32 -0.11 -0.21 ???????: wood -0.10 -0.19 -0.07 0.22 0.08 0.14 tree -0.19 -0.34 -0.12 0.41 0.14 0.27 U 1 2 3 4 5 1 2 3 4 5 ship -0.44 -0.30 0.57 0.58 0.25 1 0.00 0.00 0.00 0.00 0.00 boat -0.13 -0.33 -0.59 0.00 0.73 -0.37 0.00 -0.61 X 2 0.00 1.59 0.00 0.00 0.00 ocean -0.48 -0.51 3 0.00 0.00 0.00 0.00 0.00 wood -0.70 0.35 0.15 -0.58 0.16 4 0.00 0.00 0.00 0.00 0.00 tree -0.26 0.65 -0.41 0.58 -0.09 5 0.00 0.00 0.00 0.00 0.00 VT di d2 cfe c/4 ds d6 1 -0.75 -0.28 -0.20 -0.45 -0.33 - 0.12 2 -0.29 -0.53 -0.19 0.63 0.22 0.41 3 0.28 -0.75 0.45 -0.20 0.12 - 0.33 4 0.00 0.00 0.58 0.00 -0.58 0.58 5 -0.53 0.29 0.63 0.19 0.41 - 0.22 X Sojka, MR Group: PV211: Latent Semantic Indexing 28 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering y we use inTormation retrieva • LSI takes documents that are semantically similar (= talk about the same topics), . .. • . .. but are not similar in the vector space (because they use different words) . .. • . .. and re-represents them in a reduced vector space . .. • ... in which they have higher similarity. • Thus, LSI addresses the problems of synonymy and semantic related ness. • Standard vector space: Synonyms contribute nothing to document similarity. 
• Desired effect of LSI: Synonyms contribute strongly to document similarity. Sojka, MR Group: PV211: Latent Semantic Indexing 30 / 44 Recap 1 _atent semantic indexing Dimensionality reduction LSI in information retrieval Clustering ow i aaaresses synonymy an lantic reiatedness • The dimensionality reduction forces us to omit a lot of "detail". • We have to map differents words (= different dimensions of the full space) to the same dimension in the reduced space. o The "cost" of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words. • SVD selects the "least costly" mapping (see below), o Thus, it will map synonyms to the same dimension. • But it will avoid doing that for unrelated words. Sojka, MR Group: PV211: Latent Semantic Indexing 31 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering omparison to otner • Recap: Relevance feedback and query expansion are used to increase recall in information retrieval - if query and documents have no terms in common. o (or, more commonly, too few terms in common for a high similarity score) • LSI increases recall and hurts precision. • Thus, it addresses the same problems as (pseudo) relevance feedback and query expansion ... • . .. and it has the same problems. Sojka, MR Group: PV211: Latent Semantic Indexing 32 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering mpiementation • Compute SVD of term-document matrix • Reduce the space and compute reduced document representations • Map the query into the reduced space qk = T^U^q. • This follows from: Ck = UkTkVj T^UTC = • Compute similarity of qk with all reduced documents in Vk. 9 Output ranked list of documents as usual 9 Exercise: What is the fundamental problem with this approach? Sojka, MR Group: PV211: Latent Semantic Indexing 33 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering prima ity • SVD is optimal in the following sense. • Keeping the k largest singular values and setting all others to zero gives you the optimal approximation of the original matrix C. Eckart-Young theorem • Optimal: no other matrix of the same rank (= with the same underlying dimensionality) approximates C better. o Measure of approximation is Frobenius norm: • So LSI uses the "best possible" matrix. • There is only one best possible matrix - unique solution (modulo signs). • Caveat: There is only a tenuous relationship between the Frobenius norm and cosine similarity between documents. 
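A minimal sketch of the retrieval procedure from the implementation slide above: fold the query into the reduced space via q_k = Σ_k^{-1} U_k^T q and rank documents by cosine similarity against the reduced document vectors (numpy, k = 2, and the example query are illustrative choices):

```python
import numpy as np

def lsi_rank(C, q, k):
    """Rank documents for query vector q in a k-dimensional LSI space.
    C: term-document matrix (terms x docs), q: query term vector (terms,)."""
    U, sigma, VT = np.linalg.svd(C, full_matrices=False)
    Uk, Sk, docs_k = U[:, :k], np.diag(sigma[:k]), VT[:k, :]   # truncate to k dimensions
    q_k = np.linalg.inv(Sk) @ Uk.T @ q                         # q_k = Sigma_k^{-1} U_k^T q
    # cosine similarity between the reduced query and each reduced document
    sims = docs_k.T @ q_k / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims), sims

C = np.array([[1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1]], dtype=float)
q = np.array([1, 1, 0, 0, 0], dtype=float)   # query containing "ship" and "boat"
ranking, sims = lsi_rank(C, q, k=2)
print(ranking, np.round(sims, 2))            # ranked list of documents, as usual
```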
Sojka, MR Group: PV211: Latent Semantic Indexing 34 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering ata tor graphical illustration o ci Human machine interface for lab abc computer applications C2 A survey of user opinion of computer system response time C3 The EPS user interface management system C4 System and human system engineering testing of EPS C5 Relation of user perceived response time to error measurement mi The generation of random binary unordered trees /t?2 The intersection graph of paths in trees /T73 Graph minors IV Widths of trees and well quasi ordering /t?4 Graph minors A survey The matrix C cl c2 c3 c4 c5 ml m2 m3 m4 human 1 0 0 1 0 0 0 0 0 interface 1 0 1 0 0 0 0 0 0 computer 1 1 0 0 0 0 0 0 0 user 0 1 1 0 1 0 0 0 0 system 0 1 1 2 0 0 0 0 0 response 0 1 0 0 1 0 0 0 0 time 0 1 0 0 1 0 0 0 0 EPS 0 0 1 1 0 0 0 0 0 survey 0 1 0 0 0 0 0 0 1 trees 0 0 0 0 0 1 1 1 0 graph 0 0 0 0 0 0 1 1 1 minors 0 0 0 0 0 0 0 1 1 Sojka, MR Group: PV211: Latent Semantic Indexing 35 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering icai illustration o 11 graph c*m3(10,11,12) □ m4(9,11,12) # 10tree 12 minor □•m2(10,11) • 9 survey ^ □ m1(10) „ c#6,7) 6 repsonse • 3 computer* 4 user 02(3,4,5,6,7,9) s nq(1,3) \ \ \ \ 5 d (1,2,3) 2 interface • 1 humanEps □ 03(2,4,5,8) • 5 system 2-dimensional plot of C2 (scaled dimensions). Circles = terms. Open squares = documents (component terms in parentheses), q = query "human computer interaction". The dotted cone represents the region whose points are within a cosine of .9 from q . All documents about human-computer documents (cl-c5) are near q, even c3/c5 although they share no terms. None of the graph theory documents (ml-m4) are near q. Sojka, MR Group: PV211: Latent Semantic Indexing 36 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering What happens when we rank documents according to cosine similarity in the original vector space? What happens when we rank documents according to cosine similarity in the reduced vector space? Sojka, MR Group: PV211: Latent Semantic Indexing 37 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering m perTorms Detter tnan vector space on 0 to 1 00 o CD Ö d CM d o d 0.0 MED: Precision-Recall Curves Means across Queries - LS1-100 --- SMART ....... TERM ---- VO LSI-10C nTERM SMART 0.2 0.4 0.6 0.8 1.0 LSI-100 = LSI reduced to 100 dimensions; SMART = SMART implementation of vector space model □ Sojka, MR Group: PV211: Latent Semantic Indexing 38 / 44 bxample ot C = U2-V : AM tour matrices Kecall unreduced decomposition C = UHVT Exercise: Why can this be viewed as soft clustering? 
C d i d2 d3 d4 ds de ship 1 0 1 0 0 0 boat 0 1 0 0 0 0 ocean 1 1 0 0 0 0 wood 1 0 0 1 1 0 tree 0 0 0 1 0 1 U 1 2 3 4 5 Z 1 2 3 4 5 ship 0.44 -0.30 0.57 0.58 0.25 1 2.16 0.00 0.00 0.00 0.00 boat 0.13 -0.33 -0.59 0.00 0.73 -0.61 2 0.00 1.59 0.00 0.00 0.00 ocean 0.48 -0.51 -0.37 0.00 3 0.00 0.00 1.28 0.00 0.00 wood 0.70 0.35 0.15 -0.58 0.16 4 0.00 0.00 0.00 1.00 0.00 tree 0.26 0.65 -0.41 0.58 -0.09 5 0.00 0.00 0.00 0.00 0.39 VT di d2 d3 dA ds de 1 2 3 4 5 0.75 0.29 0.28 0.00 0.53 0.28 0.53 0.75 0.00 0.29 0.20 0.19 0.45 0.58 0.63 0.45 0.63 0.20 0.00 0.19 0.33 0.22 0.12 0.58 0.41 0.12 0.41 0.33 0.58 0.22 LSI is decomposition of C into a representation of the terms, a representation of the documents and a representation of the importance of the "semantic" dimensions. □ Sojka, MR Group: PV211: Latent Semantic Indexing 40 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering m can be viewed as soft clustering • Each of the k dimensions of the reduced space is one cluster. • If the value of the LSI representation of document d on dimension k is x, then x is the soft membership of d in topic k. 9 This soft membership can be positive or negative. • Example: Dimension 2 in our SVD decomposition • This dimension/cluster corresponds to the water/earth dichotomy. • "ship", "boat", "ocean" have negative values. • "wood", "tree" have positive values. • di, d2, ds have negative values (most of their terms are water terms). • c/4, c/5, c/g have positive values (all of their terms are earth terms). □ Sojka, MR Group: PV211: Latent Semantic Indexing 41 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering semantic ma exmg ana Clustering wr ensim Gensim: an open-source vector space modeling and topic modeling toolkit, implemented in the Python programming language Tutorial examples of topic modelling for humans (LSI): http://radimrehurek.com/gensim/tut2.html DML-CZ similarity example: http://dml.cz/handle/10338.dmlcz/500114/SimilarArticles cf. papers similar to famous Otakar Boruvka's paper Go forth and create masterpieces for semantic indexing applications (by gensim, similarly as others already did ;-)! Sojka, MR Group: PV211: Latent Semantic Indexing 42 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering E ai Ke-away i coa ay • Latent Semantic Indexing (LSI) / Singular Value Decomposition: The math • SVD used for dimensionality reduction • LSI: SVD in information retrieval • LSI as clustering 9 gensim: Topic modelling for humans (practical use of LSI etal.) Sojka, MR Group: PV211: Latent Semantic Indexing 43 / 44 Recap Latent semantic indexing Dimensionality reduction LSI in information retrieval Clustering lesources o Chapter 18 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library • Original paper on latent semantic indexing by Deerwester et al. • Paper on probabilistic LSI by Thomas Hofmann • Word space: LSI for words Sojka, MR Group: PV211: Latent Semantic Indexing 44 / 44 Big picture Ads Duplicate detection Spam Web IR Size of the web PV211: Introduction to Information Retrieval http://www.f i.muni.cz/~ s oj ka/PV211 MR 19: Web search Handout version Petr Sojka, Hinrich Schütze et al. 
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2017-04-25
Sojka, MR Group: PV211: Web search 1 / 117

Overview
• Big picture
• Ads
• Duplicate detection
• Spam
• Web IR: queries, links, context, users, documents, size
• Size of the web
Sojka, MR Group: PV211: Web search 2 / 117

Web search overview
Sojka, MR Group: PV211: Web search 4 / 117

[Figure: bar chart of survey responses to "How often do you use search engines on the Internet?", from "Four or more times each day" down to "Never"; visible labels include 35.1%, 21.2%, and 1.2% of responses.]
Sojka, MR Group: PV211: Web search 5 / 117

• Without search, content is hard to find.
o → Without search, there is no incentive to create content.
o Why publish something if nobody will read it?
o Why publish something if I don't get ad revenue from it?
• Somebody needs to pay for the web.
o Servers, web infrastructure, content creation
o A large part today is paid by search ads.
o Search pays for the web.
Sojka, MR Group: PV211: Web search 6 / 117

Interest aggregation
• Unique feature of the web: A small number of geographically dispersed people with similar interests can find each other.
o Elementary school kids with hemophilia
o People interested in translating R5 Scheme into relatively portable C (open source project)
• Search engines are a key enabler for interest aggregation.
Sojka, MR Group: PV211: Web search 7 / 117

• On the web, search is not just a nice feature.
o Search is a key enabler of the web: financing, content creation, interest aggregation, etc. → look at search ads
• The web is a chaotic and uncoordinated collection. → lots of duplicates - need to detect duplicates
• No control / restrictions on who can author content. → lots of spam - need to detect spam
• The web is very large. → need to know how big it is
Sojka, MR Group: PV211: Web search 8 / 117

• Big picture
• Ads - they pay for the web
• Duplicate detection - addresses one aspect of chaotic content creation
• Spam detection - addresses one aspect of lack of central access control
• Probably won't get to today:
o Web information retrieval
o Size of the web
Sojka, MR Group: PV211: Web search 9 / 117

[Screenshot: Goto result list for a Wilmington, NC real-estate query - paid listings only, e.g. "Wilmington Real Estate - Buddy Blake, Wilmington's information and real estate guide" and "Coldwell Banker ..., Wilmington's number one real estate company".]
Sojka, MR Group: PV211: Web search 11 / 117

[Screenshot: the same Goto result list, repeated.]
• Buddy Blake bid the maximum ($0.38) for this search.
• He paid $0.38 to Goto every time somebody clicked on the link.
o Pages were simply ranked according to bid - revenue maximization for Goto.
• No separation of ads/docs. Only one result list!
• Upfront and honest. No relevance ranking, ...
o ... but Goto did not pretend there was any.
Sojka, MR Group: PV211: Web search 12 / 117

• Strict separation of search results and search ads
Sojka, MR Group: PV211: Web search 13 / 117

[Screenshot: Google results page for the query "discount broker" - organic results in the main column, clearly labeled "Sponsored Links" ads in a separate column on the right; SogoTrade appears in both.]
SogoTrade appears in search results. SogoTrade appears in ads.
Do search engines rank advertisers higher than non-advertisers? All major search engines claim no.
Sojka, MR Group: PV211: Web search 14 / 117

• Similar problem at newspapers / TV channels
• A newspaper is reluctant to publish harsh criticism of its major advertisers.
o The line often gets blurred at newspapers / on TV.
• No known case of this happening with search engines yet?
Sojka, MR Group: PV211: Web search 15 / 117

[Screenshot: the same Google results page for "discount broker", without annotations: organic results in the main column, "Sponsored Links" in the ads column.]
Sojka, MR Group: PV211: Web search 16 / 117

• Advertisers bid for keywords - sale by auction.
• Open system: Anybody can participate and bid on keywords.
• Advertisers are only charged when somebody clicks on their ad.
• How does the auction determine an ad's rank and the price paid for the ad?
o Basis is a second price auction, but with twists.
• For the bottom line, this is perhaps the most important research area for search engines - computational advertising.
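To make the mechanism concrete, a small Python sketch of ranking ads by bid x CTR and charging second price, using the numbers from the worked example on the next slides (each advertiser pays just enough to keep its position; the slides then add 1 cent):

```python
def rank_and_price(ads):
    """ads: list of (name, bid, ctr). Rank by ad rank = bid * CTR;
    second price: price_i = bid_{i+1} * CTR_{i+1} / CTR_i (next-ranked advertiser)."""
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    result = []
    for pos, (name, bid, ctr) in enumerate(ranked):
        if pos + 1 < len(ranked):
            _, nxt_bid, nxt_ctr = ranked[pos + 1]
            price = round(nxt_bid * nxt_ctr / ctr, 2)
        else:
            price = None  # last advertiser pays the minimum price
        result.append((name, round(bid * ctr, 2), price))
    return result

ads = [("A", 4.00, 0.01), ("B", 3.00, 0.03), ("C", 2.00, 0.06), ("D", 1.00, 0.08)]
for name, ad_rank, price in rank_and_price(ads):
    print(name, ad_rank, price)
# C 0.12 1.5, B 0.09 2.67, D 0.08 0.5, A 0.04 None (pays minimum)
# adding 1 cent gives the $1.51, $2.68, $0.51 shown in the table below.
```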
• Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
Sojka, MR Group: PV211: Web search 17 / 117

Ranking ads
• Selecting the ads to show for a query and ranking them is a ranking problem ...
• ... similar to the document ranking problem.
• Key difference: The bid price of each ad is a factor in ranking that we didn't have in document ranking.
• First cut: rank advertisers according to bid price.
Sojka, MR Group: PV211: Web search 18 / 117

• First cut: according to bid price, a la Goto
o Bad idea: open to abuse
o Example: query [treatment for cancer] → how to write your last will
o We don't want to show nonrelevant or offensive ads.
• Instead: rank based on bid price and relevance
• Key measure of ad relevance: clickthrough rate
o clickthrough rate = CTR = clicks per impression
• Result: A nonrelevant ad will be ranked low.
o Even if this decreases search engine revenue short-term
o Hope: Overall acceptance of the system and overall revenue is maximized if users get useful information.
• Other ranking factors: location, time of day, quality and loading speed of landing page
• The main ranking factor: the query
Sojka, MR Group: PV211: Web search 19 / 117

advertiser   bid     CTR    ad rank   rank   paid
A            $4.00   0.01   0.04      4      (minimum)
B            $3.00   0.03   0.09      2      $2.68
C            $2.00   0.06   0.12      1      $1.51
D            $1.00   0.08   0.08      3      $0.51

• bid: maximum bid for a click by the advertiser
• CTR: click-through rate: when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
• ad rank: bid x CTR: this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
• rank: rank in the auction
• paid: second price auction price paid by the advertiser
Sojka, MR Group: PV211: Web search 20 / 117

(same table as on the previous slide)
Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent).
price_1 x CTR_1 = bid_2 x CTR_2 (this will result in rank_1 = rank_2)
price_1 = bid_2 x CTR_2 / CTR_1
Here the subscript is the position in the auction, so bid_2 and CTR_2 belong to the advertiser ranked directly below.
p_1 = bid_2 x CTR_2 / CTR_1 = 3.00 x 0.03 / 0.06 = 1.50
p_2 = bid_3 x CTR_3 / CTR_2 = 1.00 x 0.08 / 0.03 = 2.67
p_3 = bid_4 x CTR_4 / CTR_3 = 4.00 x 0.01 / 0.08 = 0.50
Sojka, MR Group: PV211: Web search 21 / 117

Keywords with high bids
According to https://web.archive.org/web/20080928175127/http://www.cwire
$69.1 mesothelioma treatment options
$65.9 personal injury lawyer michigan
$62.6 student loans consolidation
$61.4 car accident attorney los angeles
$59.4 online car insurance quotes
$59.4 arizona dui lawyer
$46.4 asbestos cancer
$40.1 home equity line of credit
$39.8 life insurance quotes
$39.2 refinancing
$38.7 equity line of credit
$38.0 lasik eye surgery new york city
$37.0 2nd mortgage
$35.9 free car insurance quote
Sojka, MR Group: PV211: Web search 22 / 117

Search ads: a win-win-win?
• The search engine company gets revenue every time somebody clicks on an ad.
o The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads.
o Asa result, users are often satisfied with what they find after clicking on an ad. • The advertiser finds new customers in a cost-effective way. Sojka, MR Group: PV211: Web search 23 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Why is web search potentially more attractive for advertisers than TV spots, newspaper ads or radio spots? • The advertiser pays for all this. How can the advertiser be cheated? • Any way this could be bad for the user? 9 Any way this could be bad for the search engine? □ Sojka, MR Group: PV211: Web search 24 / 117 Big picture Duplicate detection Spam Web IR Size of the web win-win-win: rage • Buy a keyword on Google a Then redirect traffic to a third party that is paying much than you are paying Google. o E.g., redirect to a page full of ads • This rarely makes sense for the user. • Ad spammers keep inventing new tricks. o The search engines need time to catch up with them. Sojka, MR Group: PV211: Web search 25 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web win-win-win: • Example: geico • During part of 2005: The search term "geico" on Google was bought by competitors. • Geico lost this case in the United States. • Louis Vuitton lost similar case in Europe. • See https://web.archive.org/web/20050702015704/www.google.c • It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site. Sojka, MR Group: PV211: Web search 26 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • The web is full of duplicated content. • More so than many other collections • Exact duplicates o Easy to eliminate o E.g., use hash/fingerprint • Near-duplicates o Abundant on the web Difficult to eliminate • For the user, it's annoying to get a search result with near-identical documents. • Marginal relevance is zero: even a highly relevant document becomes non-relevant if it appears below a (near-)duplicate. • We need to eliminate near-duplicates. Sojka, MR Group: PV211: Web search 28 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web Google c. Li Flight div W Micha... O WlKIPEDlA The Free Encyclopedia navigation_ ■ Main page ■ Contents ■ Featured content ■ Cuaent events ■ Random article search (Čo) f Search ) interaction ■ About Wikipedia ■ Commun ity portal ■ Recent changes I |-r,nt-m> tl»,|,i„-.,< - Michael Jackson i * Michael Jackson From Wikipedia, the free encyclopedia For other persons named Michael Jackson, see Michael Jackson (disambiguation). Michael Joseph Jackson [August 29, 195A- June 25, 2009) was an American recording artist, entertainer and businessman. The seventh child of the Jackson family, he made his debut as an entertainer in 1968 as a member of The. wapedia. Wiki: Michael Jackson (1/6) For other persons named Michael Jackson, see Michael Jackson (disambiguation). Michael Joseph Jackson (August 29,1958 - June 25,2009) was an American recording artist, entertainer and businessman. The seventh child of the Jackson family, he made his debut as an entertainer in 1968 as a member of The Jackson 5. He then began a solo O Find: (Q pric (_r4ext I Previous) (o Highlight ajl j Q Mjtch case j Q Find: (q" i ( Next Previous 1 Hlg Sojka, MR Group: PV211: Web search 29 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web How would you eliminate near-duplicates on the web? 
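Before the near-duplicate machinery developed on the next slides, the exact-duplicate case mentioned above really is this simple: fingerprint the (normalized) content and keep one document per fingerprint. A minimal sketch; the whitespace/lowercase normalization and SHA-1 are illustrative choices, not prescribed by the lecture:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the normalized document content."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def drop_exact_duplicates(docs):
    seen, unique = set(), []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:          # keep only the first document with this fingerprint
            seen.add(fp)
            unique.append(doc)
    return unique

docs = ["Jack London traveled to Oakland",
        "Jack  London traveled to Oakland ",      # same content, different whitespace
        "Jack traveled from Oakland to London"]
print(len(drop_exact_duplicates(docs)))           # 2 - near-duplicates need more machinery
```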
Sojka, MR Group: PV211: Web search 30 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web mg near-aupn • Compute similarity with an edit-distance measure • We want "syntactic" (as opposed to semantic) similarity. • True semantic similarity (similarity in content) is too difficult to compute. • We do not consider documents near-duplicates if they have the same content, but express it with different words. 9 Use similarity threshold 6 to make the call "is/isn't a near-duplicate". • E.g., two documents are near-duplicates if similarity >6 = 80%. □ Sojka, MR Group: PV211: Web search 31 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • A shingle is simply a word n-gram. • Shingles are used as features to measure syntactic similarity of documents. • For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: • { a-rose-is, rose-is-a, is-a-rose } • We can map shingles to 1..2m (e.g., m = 64) by fingerprinting. • From now on: Sk refers to the shingle's fingerprint in 1..2m. • We define the similarity of two documents as the Jaccard coefficient of their shingle sets. □ Sojka, MR Group: PV211: Web search 32 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • A commonly used measure of overlap of two sets • Let A and B be two sets • Jaccard coefficient: jaccard(^, B) = AnB AuB (A ^ 0 or B ^ 0) o jaccard(^,^) = 1 o jaccard(ae) = oif^ne = o • A and B don't have to be the same size. • Always assigns a number between 0 and 1 □ Sojka, MR Group: PV211: Web search 33 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Three documents: d\\ "Jack London traveled to Oakland" d2\ "Jack London traveled to the city of Oakland" c/3: "Jack traveled from Oakland to London" o Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d\,d2) and J(d\^d^)l • J{dud2) = 3/8 = 0.375 o J(cfi,cf3) = 0 • Note: very sensitive to dissimilarity □ Sojka, MR Group: PV211: Web search 34 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • The number of shingles per document is large. • To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. o The size of a sketch is, say, n = 200 .. . • .. . and is defined by a set of permutations 7Ti... 7T2oo- • Each 7T; is a random permutation on 1..2m • The sketch of d is defined as: < minsGcy 7Ti(s), minsGcy tt2(s), ... , minsGcy 7r20o(s) > (a vector of 200 numbers). □ Sojka, MR Group: PV211: Web search 35 / 117 Big picture Duplicate detection Spam Web IR Size of the web na minimum document 1: {sk} document 2: {sk} Si S2 S3 S4 xk = 7r(sk) Si S5 S3 S4 Xk = ^(s/c) o »»0 o • • o X3 Xi x4 x2 O o »o O_. om X3 Xi X4 *5 O *3 O O Xi x4 mm Sk n(sk) O *3 O *2 o *3 O O Xi x5 mm Sk K(sk) O *3 O *2 i/t7 i/t7 We use minsGcy17r(s) = minsGcy2 7r(s) as a test for: are d\ and c/2 near-duplicates? In this case: permutation 71 says: d\ « c/2 Sojka, MR Group: PV211: Web search 36 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web mputmg • Sketches: Each document is now a vector of n = 200 numbers. o Much easier to deal with than the very high-dimensional space of shingles • But how do we compute Jaccard? □ Sojka, MR Group: PV211: Web search 37 / 117 • How do we compute Jaccard? • Let U be the union of the set of shingles of d\ and d2 and / the intersection. • There are \ U\\ permutations on U. 
o For sf e /, for how many permutations 7r do we have arg minsGcyi tt(s) = s' = arg minsGcy2 tt(s)? • Answer: (|(V| — 1)! • There is a set of (\U\ — 1)! different permutations for each s in /. ^> |/|(|^| — 1)! permutations make arg minsGcyi tt(s) = arg minsGcy2 tt(s) true • Thus, the proportion of permutations that make minsGcy17r(s) = minsGcy2 7r(s) true is: u -1)! u = -L=j(dl,d2) □ Sojka, MR Group: PV211: Web search 38 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Thus, the proportion of successful permutations is the Jaccard coefficient. • Permutation tt is successful iff mins(Ec/i tt(s) = minsGc/2 tt(s) • Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. • Estimator of probability of success: proportion of successes in n Bernoulli trials, (n = 200) • Our sketch is based on a random selection of permutations. o Thus, to compute Jaccard, count the number k of successful permutations for < cfi, d2 > and divide by n = 200. • k/n = /c/200 estimates J(di, d2). □ Sojka, MR Group: PV211: Web search 39 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web mp • We use hash functions as an efficient type of permutation: hi : {1..2m} -> {1..2m} • Scan all shingles Sk in union of two sets in arbitrary order o For each hash function h, and documents d±, c/2,...: keep slot for minimum value found so far • If hj(sk) is lower than minimum found so far: update slot □ Sojka, MR Group: PV211: Web search 40 / 117 Big picture Duplicate detection Spam Web IR Size of the web di d2 Sl 1 0 S2 0 1 S3 1 1 S4 1 0 0 1 h(x) = x mod 5 g(x) = (2x + 1) mod 5 min(/7(c/i)) = 1 Ý 0 = min(/7(c/2)) m\n(g(d1)) m\n(g(d2)) 2/0 di slot CÍ2 Slot h oo OO g oo oo h(l) = l 1 1 - oo fir(i) = 3 3 3 - oo /i(2) = 2 — 1 2 2 S(2) = 0 — 3 0 0 /?(3) = 3 3 1 3 2 fir(3) = 2 2 2 2 0 /?(4) = 4 4 1 - 2 ár(4) = 4 4 2 - 0 /j(5) = 0 — 1 0 0 ár(5) = 1 — 2 1 0 J{di,d2) = _ o+o _ = 0 final sketches Sojka, MR Group: PV211: Web search 41 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web di d2 d3 Sl 0 1 1 S2 1 0 1 S3 0 1 0 S4 1 0 0 h(x) = 5x + 5 mod 4 g(x) = (3x + 1) mod 4 /V /V /V Estimate J(d\, c/2), J{di, d$), J(d2, c/3) Sojka, MR Group: PV211: Web search 42 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web d\ slot d2 slot c/3 slot 00 00 00 di d2 c/3 00 00 00 Oil h(l) = 2 — 00 2 2 2 2 S2 10 1 = 0 — 00 0 0 0 0 S3 0 1 0 h(2) = 3 3 3 — 2 3 2 S4 1 0 0 g{2) = 3 3 3 — 0 3 0 h(3) = 0 — 3 0 0 - 2 *(3) = 2 — 3 2 0 - 0 h(x) = 5x + 5 mod 4 /?(4) = 1 1 1 — 0 - 2 g(x) = (3x + 1) mod 4 ff(4) = 1 1 1 — 0 - 0 final sketches Sojka, MR Group: PV211: Web search 43 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web Sojka, MR Group: PV211: Web search 44 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Input: N documents • Choose n-gram size for shingling, e.g., n = 5 • Pick 200 random permutations, represented as hash functions • Compute N sketches: 200 x N matrix shown on previous slide, one row per permutation, one column per document • Compute —^2— pairwise similarities o Transitive closure of documents with similarity > 9 • Index only one document from each equivalence class Sojka, MR Group: PV211: Web search 45 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web icient near-aupii • Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of two documents. 
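A compact sketch of that single-pair estimate, reusing the Jack London documents from the Jaccard exercise above. Following the implementation slide, each random permutation π_i is simulated by a seeded hash function; the exact Jaccard coefficient is printed for comparison:

```python
import hashlib

def shingles(text, n=2):
    """Set of word n-grams (bigrams here, as in the Jaccard exercise)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_sketch(sh, n_perm=200):
    """One minimum per 'permutation'; permutation i is a hash function seeded with i."""
    def h(i, s):
        return int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
    return [min(h(i, s) for s in sh) for i in range(n_perm)]

def estimated_jaccard(sk1, sk2):
    # fraction of 'successful permutations', i.e. positions where the minima agree
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
s1, s2 = shingles(d1), shingles(d2)
print(jaccard(s1, s2))                                              # 3/8 = 0.375 exactly
print(estimated_jaccard(minhash_sketch(s1), minhash_sketch(s2)))    # close to 0.375
```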
• But we still have to estimate O(N²) coefficients, where N is the number of web pages.
• Still intractable
• One solution: locality sensitive hashing (LSH)
• Another solution: sorting (Henzinger 2006)
Sojka, MR Group: PV211: Web search 46 / 117

• You have a page that will generate lots of revenue for you if people visit it.
o Therefore, you would like to direct visitors to this page.
• One way of doing this: get your page ranked highly in search results.
• Exercise: How can I get my page ranked highly?
Sojka, MR Group: PV211: Web search 48 / 117

Spam technique: keyword stuffing
• Misleading meta-tags, excessive repetition
• Hidden text with colors, style sheet tricks, etc.
• Used to be very effective; most search engines now catch these.
Sojka, MR Group: PV211: Web search 49 / 117

[Screenshot: a keyword-stuffed spam page consisting of hundreds of spelling variants of "tax deferred" interleaved with life-insurance and annuity terms; the text is meaningless and exists only to match queries.]
Sojka, MR Group: PV211: Web search 50 / 117

Doorway and lander pages
• Doorway page: optimized for a single keyword, redirects to the real target page
• Lander page: optimized for a single keyword or a misspelled domain name, designed to attract surfers who will then click on ads
Sojka, MR Group: PV211: Web search 51 / 117

Lander page example
[Screenshot: lander page at composita.com - a list of unrelated keyword links ("Wild Yam Root", "Mexican Appetizers", "Witch Hazel Cream", ...) surrounded by sponsored ad links, with a German-language search box.]
• Number one hit on Google for the search "composita"
• The only purpose of this page: get people to click on the ads and make money for the page owner
Sojka, MR Group: PV211: Web search 52 / 117

Spam technique: duplication
• Get good content from somewhere (steal it or produce it yourself).
• Publish a large number of slight variations of it.
• For example, publish the answer to a tax question with the spelling variations of "tax deferred" on the previous slide.
Sojka, MR Group: PV211: Web search 53 / 117

• Serve fake content to the search engine spider.
• So do we just penalize this always?
• No: legitimate uses (e.g., different content to US vs.
European users) Sojka, MR Group: PV211: Web search 54 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web niqu • Create lots of links pointing to the page you want to promote o Put these links on pages with high (or at least non-zero) PageRank j Newly registered domains (domain flooding) o A set of pages that all point to each other to boost each other's PageRank (mutual admiration society) o Pay somebody to put your link on their highly ranked page ("schuetze horoskop" example) o Leave comments that include the link on blogs Sojka, MR Group: PV211: Web search 55 / 117 Big picture Duplicate detection Spam Web IR Size of the web engine optimiz, • Promoting a page in the search rankings is not necessarily spam. • It can also be a legitimate business - which is called SEO. • You can hire an SEO firm to get your page highly ranked. o There are many legitimate reasons for doing this. • For example, Google bombs like Who is a failure? • And there are many legitimate ways of achieving this: o Restructure your content in a way that makes it easy to index o Talk with influential bloggers and have them link to your site • Add more interesting and original content Sojka, MR Group: PV211: Web search 56 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Quality indicators • Links, statistically analyzed (PageRank etc) • Usage (users visiting a page) o No adult content (e.g., no pictures with flesh-tone) • Distribution and structure of text (e.g., no keyword stuffing) • Combine all of these indicators and use machine learning • Editorial intervention o Blacklists o Top queries audited o Complaints addressed o Suspect patterns detected Sojka, MR Group: PV211: Web search 57 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web o Major search engines have guidelines for webmasters. o These guidelines tell you what is legitimate SEO and what is spamming. • Ignore these guidelines at your own risk • Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely). o There is often a fine line between spam and legitimate SEO. • Scientific study of fighting spam on the web: adversarial information retrieval Sojka, MR Group: PV211: Web search 58 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Links: The web is a hyperlinked document collection. • Queries: Web queries are different, more varied and there are a lot of them. How many? « 109 o Users: Users are different, more varied and there are a lot of them. How many? « 109 • Documents: Documents are different, more varied and there are a lot of them. How many? « 1011 • Context: Context is more important on the web than in many other IR applications. • Ads and spam Sojka, MR Group: PV211: Web search 60 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web uery aistriDunon Most frequent queries on a large search engine on 2002.10.26. 
1 sex 16 crack 31 juegos 46 Caramail 2 (artifact) 17 games 32 nude 47 msn 3 (artifact) 18 pussy 33 music 48 Jennifer lopez 4 porno 19 cracks 34 musica 49 tits 5 mp3 20 lolita 35 anal 50 free porn 6 Halloween 21 britney spears 36 free6 51 cheats 7 sexo 22 ebay 37 avril lavigne 52 yahoo.com 8 chat 23 sexe 38 hotmail.com 53 eminem 9 porn 24 Pamela Anderson 39 winzip 54 Christina Aguilera 10 yahoo 25 warez 40 fuck 55 incest 11 KaZaA 26 divx 41 wallpaper 56 letras de canciones 12 xxx 27 gay 42 hotmail.com 57 hardcore 13 Hentai 28 harry potter 43 postales 58 weather 14 lyrics 29 playboy 44 shakira 59 wallpapers 15 hotmail 30 lolitas 45 traductor 60 lingerie More than 1/3 of these are queries for adult content. Exercise: Does this mean that most people are looking for adult content? Sojka, MR Group: PV211: Web search 62 / 117 Big picture Duplicate detection Spam Web IR Size of the web uery aisinouiion • Queries have a power law distribution. • Recall Zipf's law: a few very frequent words, a large number of very rare words • Same here: a few very frequent queries, a large number of very rare queries • Examples of rare queries: search for names, towns, books etc o The proportion of adult queries is much lower than 1/3 Sojka, MR Group: PV211: Web search 63 / 117 Big picture Duplicate detection E yp Spam Web IR Size of the web in w • Informational user needs: I need information on something, "low hemoglobin" • We called this "information need" earlier in the class. o On the web, information needs proper are only a subclass of user needs. • Other user needs: Navigational and transactional • Navigational user needs: I want to go to this web site, "hotmail", "myspace", "United Airlines" • Transactional user needs: I want to make a transaction. o Buy something: "MacBook Air" o Download something: "Acrobat Reader" o Chat with someone: "live soccer chat" • Difficult problem: How can the search engine tell what the user need or intent for a particular query is? Sojka, MR Group: PV211: Web search 64 / 117 Big picture Duplicate detection Spam Web IR Size of the web ypernn • Web search in most cases is interleaved with navigation • .. . i.e., with following links. • Different from most other IR collections Sojka, MR Group: PV211: Web search 66 / 117 Kinds of behaviors we see in the data Short /Nav Topic exploration Topic switch Methodical results exploration Query reform Go #Ie Big picture Ads Duplicate detection Spam Web IR Size of the web • Strongly connected component (SCC) in the center • Lots of pages that get linked to, but don't link (OUT) • Lots of pages that link to other pages, but don't get linked to (IN) • Tendrils, tubes, islands Sojka, MR Group: PV211: Web search 68 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • What can we do to guess user intent? 
• Guess user intent independent of context: o Spell correction o Precomputed "typing" of queries (next slide) • Better: Guess user intent based on context: o Geographic context (slide after next) • Context of user in this session (e.g., previous query) • Context provided by personal profile (Yahoo/MSN do this, Google claims it does not) Sojka, MR Group: PV211: Web search 70 / 117 Big picture Duplicate detection Spam Web IR :ypm Size of the web • Calculation: 5+4 • Unit conversion: 1 kg in pounds • Currency conversion: 1 euro in kronor • Tracking number: 8167 2278 6764 • Flight info: LH 454 • Area code: 650 • Map: columbus oh • Stock price: msft • Albums/movies etc: coldplay Sojka, MR Group: PV211: Web search 71 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Three relevant locations • Server (nytimes.com —>» New York) o Web page (nytimes.com article about Albania) • User (located in Palo Alto) • Locating the user • IP address • Information provided by user (e.g., in user profile) o Mobile phone • Geo-tagging: Parse text and identify the coordinates of the geographic entities • Example: East Palo Alto CA —>► Latitude: 37.47 N, Longitude: 122.14 W o Important NLP problem Sojka, MR Group: PV211: Web search 72 / 117 Big picture Duplicate detection Spam Web IR Size of the web iTy • Result restriction: Don't consider inappropriate results o For user on google.fr . .. o . . . only show .fr results • Ranking modulation: use a rough generic ranking, rerank based on personal context • Contextualization / personalization is an area of search wi lot of potential for improvement. Sojka, MR Group: PV211: Web search 73 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web o Use short queries (average < 3) • Rarely use operators • Do not want to spend a lot of time on composing a query • Only look at the first couple of results • Want a simple Ul, not a search engine start page overloaded with graphics • Extreme variability in terms of user needs, user expectations, experience, knowledge, .. . o Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class • One interface for hugely divergent needs Sojka, MR Group: PV211: Web search 75 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Classic IR relevance (as measured by F) can also be used for web IR. • Equally important: Trust, duplicate elimination, readability, loads fast, no pop-ups • On the web, precision is more important than recall. • Precision at 1, precision at 10, precision on the first 2-3 pages a But there is a subset of queries where recall matters. Sojka, MR Group: PV211: Web search 76 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Has this idea been patented? 
• Searching for info on a prospective financial advisor
• Searching for info on a prospective employee
• Searching for info on a date
Sojka, MR Group: PV211: Web search 77 / 117

• Distributed content creation: no design, no coordination
o "Democratization of publishing"
• Result: extreme heterogeneity of documents on the web
• Unstructured (text, html), semistructured (html, xml), structured/relational (databases)
• Dynamically generated content
Sojka, MR Group: PV211: Web search 79 / 117

[Figure: dynamic page generation - the browser sends a request (e.g., for flight AA129) to an application server, which queries a back-end database to build the page.]
• Dynamic pages are generated from scratch when the user requests them - usually from underlying data in a database.
• Example: current status of flight LH 454
Sojka, MR Group: PV211: Web search 80 / 117

• Most (truly) dynamic content is ignored by web spiders.
o It's too much to index it all.
• Actually, a lot of "static" content is also assembled on the fly (asp, php, etc.: headers, date, ads, etc.).
Sojka, MR Group: PV211: Web search 81 / 117

Web page change frequency
[Figure: for each top-level domain (.com, .de, .uk, .jp, .org, .gov, .edu, ...), the proportion of pages with complete, large, medium, small, or no text change.]
Sojka, MR Group: PV211: Web search 82 / 117

Multilinguality
• Documents in a large number of languages
• Queries in a large number of languages
• First cut: Don't return English results for a Japanese query.
• However: Frequent mismatches between query and document languages
o Many people can understand, but not query in, a language.
o Translation is important.
• Google example: "Beaujolais Nouveau -wine"
Sojka, MR Group: PV211: Web search 83 / 117

• Significant duplication - 30%-40% duplicates in some studies
• Duplicates in the search results were common in the early days of the web.
o Today's search engines eliminate duplicates very effectively.
o Key for high user satisfaction
Sojka, MR Group: PV211: Web search 84 / 117

• For many collections, it is easy to assess the trustworthiness of a document.
o A collection of Reuters newswire articles
o A collection of TASS (Telegraph Agency of the Soviet Union) newswire articles from the 1980s
o Your Outlook email from the last three years
• Web documents are different: In many cases, we don't know how to evaluate the information.
• Hoaxes abound.
Sojka, MR Group: PV211: Web search 85 / 117

Growth of the web
[Figure: number of web sites over time since the mid-1990s; the vertical axis runs from 0 to about 174,000,000.]
• The web keeps growing.
o But growth is no longer exponential?
Sojka, MR Group: PV211: Web search 87 / 117

• What is size? Number of web servers? Number of pages? Terabytes of data available?
• Some servers are seldom connected. • Example: Your laptop running a web server o Is it part of the web? o The "dynamic" web is infinite. o Any sum of two numbers is its own dynamic page on Google. (Example: "2+4") Sojka, MR Group: PV211: Web search 88 / 117 Big picture Duplicate detection Spam Web IR Size of the web ngin Can I claim a page is in the index if I only index the first 4,000 bytes? Can I claim a page is in the index if I only index anchor text pointing to the page? • There used to be (and still are?) billions of pages that are only indexed by anchor text. Sojka, MR Group: PV211: Web search 89 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web impie m wer un • OR-query of frequent words in a number of languages • According to this query: Size of web > 21,450,000,000 on 2007.07.07 and > 25,350,000,000 on 2008.07.03 • But page counts of Google search results are only rough estimates. Sojka, MR Group: PV211: Web search 90 / 117 Big picture Ads Du plicate detection Spam Web IR Size of the web IZ e • Media • Users o They may switch to the search engine that has the best coverage of the web. o Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall. • Search engine designers (how many pages do I need to be able to handle?) • Crawler designers (which policy will crawl close to N pages?) Sojka, MR Group: PV211: Web search 92 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web Sojka, MR Group: PV211: Web search 93 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web impie m wer un • OR-query of frequent words in a number of languages • According to this query: Size of web > 21,450,000,000 on 2007.07.07 • Big if: Page counts of Google search results are correct. (Generally, they are just rough estimates.) • But this is just a lower bound, based on one search engine. • How can we do better? Sojka, MR Group: PV211: Web search 94 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web o The "dynamic" web is infinite. o Any sum of two numbers is its own dynamic page on Google. (Example: "2+4") o Many other dynamic sites generating infinite number of pages • The static web contains duplicates - each "equivalence class" should only be counted once. • Some servers are seldom connected. o Example: Your laptop • Is it part of the web? Sojka, MR Group: PV211: Web search 95 / 117 Big picture Duplicate detection Spam Web IR Size of the web ngin Can I claim a page is in the index if I only index the first 4,000 bytes? Can I claim a page is in the index if I only index anchor text pointing to the page? • There used to be (and still are?) billions of pages that are only indexed by anchor text. Sojka, MR Group: PV211: Web search 96 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web How can we estimate the size of the web? Sojka, MR Group: PV211: Web search 97 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web tripling m • Random queries • Random searches • Random IP addresses • Random walks Sojka, MR Group: PV211: Web search 98 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • There are significant differences between indexes of different search engines. • Different engines have different preferences. • max URL depth, max count/host, anti-spam rules, priority rules etc. o Different engines index different things under the same URL. o anchor text, frames, meta-keywords, size of prefix etc. 
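Looping back to the OR-query lower bound from the slides above, a sketch of the idea in Python; `reported_result_count` is a hypothetical stand-in for sending the query to a real engine and reading its (rough) hit-count estimate, so this is only a skeleton:

```python
FREQUENT_WORDS = {"en": ["the", "of", "and"], "de": ["der", "die", "und"], "cs": ["a", "je", "na"]}

def or_query(words):
    return " OR ".join(words)

def reported_result_count(query: str) -> int:
    """Hypothetical: send `query` to a search engine and return its estimated hit count."""
    raise NotImplementedError("replace with a call to an actual search engine")

def lower_bound_on_index_size():
    # The hit count for an OR of frequent words in several languages is a lower bound
    # on the number of indexed pages - and only a rough one, since reported page
    # counts are themselves rough estimates.
    all_words = [w for words in FREQUENT_WORDS.values() for w in words]
    return reported_result_count(or_query(all_words))
```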
Sojka, MR Group: PV211: Web search 99 / 117 Relative Size from Overlap [Bharat & Broder, 98] IB A 1 i i —'— Sample URLs randomly from A Check if contained in B and vice versa Ar\B = (1/2) * Size A b = (1/6) * Size b (l/2)*Size A = (l/6)*Size b .-. Size A / Size b = (1/6)/(1/2) = 1/3 Each test involves: (i) Sampling (ii) Checking Big picture Ads Duplicate detection Spam Web IR Size of the web mpiing • Ideal strategy: Generate a random URL • Problem: Random URLs are hard to find (and sampling distribution should reflect "user interest") • Approach 1: Random walks / IP addresses • In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes) • Approach 2: Generate a random URL contained in a given engine o Suffices for accurate estimation of relative size Sojka, MR Group: PV211: Web search 101 / 117 Big picture naom Duplicate detection Spam Web IR Size of the web rom • Idea: Use vocabulary of the web for query generation • Vocabulary can be generated from web crawl • Use conjunctive queries w\ AND 1/1/2 o Example: vocalists AND rsi • Get result set of one hundred URLs from the source engine • Choose a random URL from the result set o This sampling method induces a weight l/l/(p) for each page p. • Method was used by Bharat and Broder (1998). Sojka, MR Group: PV211: Web search 102 / 117 Big picture Duplicate detection Spam page is in Web IR Size of the web • Either: Search for URL if the engine supports this • Or: Create a query that will find doc d with high probability Download doc, extract words o Use 8 low frequency word as AND query • Call this a strong query for d j Run query • Check if d is in result set • Problems o Near duplicates o Redirects o Engine time-outs Sojka, MR Group: PV211: Web search 103 / 117 Computing Relative Sizes and Total Coverage [BB98] a = AltaVista, e = Excite, h f xy = fraction of xiny ■ Six pair-wise overlaps f * ■"■ah a - f * ■"-ha h El f * ai a - f. * la ■ 1 = e2 f * ae a - f * ea e = e3 f * h - f * ■ i = e4 f * h - f * e = e5 f ■* ei e - f - * le ■ = e6 ■ Arbitrarily, let a = 1. = HotBot, i = Infoseek ■ We have 6 equations and 3 unknowns. ■ Solve for e, h and i to minimize 2 e±2 ■ Compute engine overlaps. ■ Re-normalize so that the total joint coverage is 100% Advantages & disadvantages ■ Statistically sound under the induced weight. ■ Biases induced by random query ■ Query Bias: Favors content-rich pages in the language(s) of the lexicon ■ Ranking Bias: Solution: Use conjunctive queries & fetch all ■ Checking Bias: Duplicates, impoverished pages omitted ■ Document or query restriction bias: engine might not deal properly with 8 words conjunctive query ■ Malicious Bias: Sabotage by engine ■ Operational Problems: Time-outs, failures, engine inconsistencies, index modification. 
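To make the overlap arithmetic on the Bharat & Broder slide concrete, here is a minimal Python sketch (the function name is ours and the sample fractions are only illustrative; 1/2 and 1/6 are the values from the example above). With more than two engines, the pairwise overlap fractions give the over-determined system of equations mentioned on the [BB98] slide, which is then solved by least squares.

# Relative index size from overlap fractions, in the style of Bharat & Broder (1998).
# f_ab: fraction of URLs sampled from engine A that are also found in B
# f_ba: fraction of URLs sampled from engine B that are also found in A
# Since |A ∩ B| ≈ f_ab * |A| ≈ f_ba * |B|, we get |A| / |B| ≈ f_ba / f_ab.

def relative_size(f_ab, f_ba):
    """Estimated ratio |A| / |B| of the two index sizes."""
    return f_ba / f_ab

# The example above: half of A's sample is in B, one sixth of B's sample is in A,
# so A is estimated to be one third the size of B.
print(relative_size(f_ab=1/2, f_ba=1/6))   # -> 0.333...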
Big picture Ads Duplicate detection Spam Web IR Size of the web • Choose random searches extracted from a search engine log (Lawrence & Giles 97) • Use only queries with small result sets • For each random query: compute ratio size(ri)/size(r2) of the two result sets • Average over random searches Sojka, MR Group: PV211: Web search 106 / 117 Big picture Duplicate detection Spam Web IR Size of the web vaniag vaniag • Advantage • Might be a better reflection of the human perception of coverage • Issues • Samples are correlated with source of log (unfair advantage for originating search engine) • Duplicates o Technical statistical problems (must have non-zero results, ratio average not statistically sound) Sojka, MR Group: PV211: Web search 107 / 117 Random searches [Lawr98, Lawr99] ■ 575 & 1050 queries from the NEC Rl employee logs . 6 Engines in 1998, 11 in 1999 ■ Implementation: ■ Restricted to queries with < 600 results in total ■ Counted URLs from each engine after verifying query match ■ Computed size ratio & overlap for individual queries ■ Estimated index size ratio & overlap by averaging over all queries Queries from Lawrence and Giles study 1 H ■ adaptive access control ■ softmax activation function ■ neighborhood preservation ■ bose multidimensional topographic system theory ■ hamiltonian structures ■ gamma mlp ■ right linear grammar ■ dvi2pdf ■ pulse width modulation ■ John oliensis neural ■ rieke spikes exploring neural ■ unbalanced prior probabilities ■ ranked assignment method ■ internet explorer favourites importing ■ karvel thornber ■ video watermarking ■ counterpropagation network ■ fat shattering dimension ■ abelson amorphous computing ■ zili liu Random IP addresses [Lawrence & Giles '991 ■ Generate random IP addresses ■ Find a web server at the given address ■ If there's one ■ Collect all pages from server. ■ Method first used by O'Neill, McClain, & Lavoie, "A Methodology for Sampling the World Wide Web", 1997. http : //digitalsrchive , oclc , org/da/ViewObject.jsp?objid^0000 003447 Big picture Ads Duplicate detection Spam Web IR Size of the web • [Lawr99] exhaustively crawled 2,500 servers and extrapolated • Estimated size of the web to be 800 million Sojka, MR Group: PV211: Web search 111 / 117 Big picture vaniag Duplicate detection Spam Web IR Size of the web 9 9 Advantages 9 Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Disadvantages Many hosts share one IP (—>► oversampling) Hosts with large web sites don't get more weight than hosts with small web sites (—>» possible undersampling) Sensitive to spam (multiple IPs for same spam server) Again, duplicates 9 9 9 9 Sojka, MR Group: PV211: Web search 112 / 117 Random walks [Henzinger et a/WWW9] ■ View the Web as a directed graph ■ Build a random walk on this graph ■ Includes various "jump" rules back to visited sites ■ Does not get stuck in spider traps! ■ Can follow all links! ■ Converges to a stationary distribution ■ Must assume graph is finite and independent of the walk. ■ Conditions are not satisfied (cookie crumbs, flooding) ■ Time to convergence not really known ■ Sample from stationary distribution of walk ■ Use the "strong query11 method to check coverage by SE Dependence on seed list ■ How well connected is the graph? [Broder et al., WWW9] Disconnected com do nenta Advantages & disadvantages ■ Advantages ■ "Statistically clean" method at least in theory! 
■ Could work even for infinite web (assuming convergence) under certain metrics. ■ Disadvantages ■ List of seeds is a problem. ■ Practical approximation might not be valid. ■ Non-uniform distribution ■ Subject to link spamming a Many different approaches to web size estimation. o None is perfect. 9 The problem has gotten much harder. a There has not been a good study for a couple of years. o Great topic for a thesis! Sojka, MR Group: PV211: Web search 116 / 117 Big picture Ads Duplicate detection Spam Web IR Size of the web • Chapter 19 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o Hal Varian explains Google second price auction: http://www.youtube.com/watch?v=K710a2PVhPQ • Size of the web queries • Trademark issues (Geico and Vuitton cases) o How ads are priced o Henzinger, Finding near-duplicate web pages: A large-scale evaluation of algorithms, ACM SIGIR 2006. Sojka, MR Group: PV211: Web search 117 / 117 Recap A simple crawler A real crawler PV211: Introduction to Information Retrieval http://www.f i.muni.cz/~ s oj ka/PV211 MR 20: Crawling Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-04-25 Sojka, MR Group: PV211: Crawling 1/32 Recap A simple crawler A real crawler Pag Web Images Maps News Shopping Gmail more Sign ir Google 'discount broker Search Advanced Search Preferences Web Discount Broker Reviews Information on online discount brokers emphasizing rates, charges, and customer comments and complaints. www.broker-reviews.us/ - 94k - Cached - Similar pages Results 1 -10 of about 807,000 for discount broker [definition], (0.12 seconds) Sponsored Links Rated #1 Online Broker No Minimums. No Inactivity Fee Discount Broker Rankings (2008 Broker Survey) at SmartMoney.com Discount Brokers. Rank/ Brokerage/ Minimum to Open Account, Comments, Standard Commis-sion*, Reduced Commission, Account Fee Per Year (How to Avoid), Avg.... www.smartmoney .com/brokers/index. cfm?story=2004-discount-table - 121k - Cached - Similar pages Stock Brokers | Discount Brokers | Online Brokers Most Recommended. Top 5 Brokers headlines. 10. Don't Pay Your Broker for Free Funds May 15 at 3:39 PM. 5. Don't Discount the Discounters Apr 18 at 2:41 PM ... www.fool.com/investing/brokers/index.aspx - 44k - Cached - Similar pages Discount Broker Discount Broker - Definition of Discount Broker on Investopedia - A stockbroker who carries out buy and sell orders at a reduced commission compared to a ... www.investopedia.com/terms/d/discountbroker.asp -31k- Cached - Similar pages Discount Brokerage and Online Trading for Smart Stock Market... Online stock broker SogoTrade offers the best in discount brokerage investing. Get stock market quotes from this internet stock trading company. www.sogotrade.com/ - 39k - Cached - Similar pages 15 questions to ask discount brokers - MSN Money Jan 11, 2004 ... If you're not big on hand-holding when it comes to investing, a discount broker can be an economical way to go. Just be sure to ask these ... moneycentral.msn.com/content/lnvesting/Startinvesting/P66171 .asp - 34k - Cached - Similar pages Transfer to Firstrade for Free! www.firstrade.com Discount Broker Commission free trades for 30 days. No maintenance fees. Sign up now. 
TDAMERITRADE.com TradeKing - Online Broker $4.95 per Trade, Market or Limit SmartMoney Top Discount Broker 200i www.TradeKing.com Scottrade Brokerage $7 Trades, No Share Limit. In-Depth Research. Start Trading Online Now! www.Scottrade.com Stock trades £1.50 - £3 100 free trades, up to $100 back for transfer costs, $500 minimum www.sogotrade.com £3.95 Online Stock Trades Market/Limit Orders, No Share Limit and No Inactivity Fees www.Marsco.com INGDIRECT I ShareBuilder Sojka, MR Group: PV211: Crawling 3/32 Recap A simple crawler A real crawler advertiser bid CTR ad rank rank paid A $4.00 O01 O04 4 (minimum) B $3.00 0.03 0.09 2 $2.68 C $2.00 0.06 0.12 1 $1.51 D $1.00 0.08 0.08 3 $0.51 • bid: maximum bid for a click by advertiser • CTR: click-through rate: when an ad is displayed, what percentage of time do users click on it? CTR is a measure of relevance. • ad rank: bid x CTR: this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is • paid: Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent). Sojka, MR Group: PV211: Crawling 4/32 Recap A simple crawler A real crawler g o Users only click if they are interested. o The advertiser only pays when a user clicks on an ad. • Searching for something indicates that you are more likely to buy it .. . • ... in contrast to radio and newspaper ads. Sojka, MR Group: PV211: Crawling 5/32 Recap A simple crawler A real crawler document 1: {sk} inimum ot permu document 2: {sk} Si s2 s3 s4 xk = 7r(sk) O • • O O ♦ ♦ O *3 O Xi x4 o o *2 O X3 Xi x4 minS/c7r(s/c) 1—Q- *2 *3 L Si s5 S3 s4 -p- 1 ° • 0 «o • • 0 p L Xi x4 L 0 O O 0 p *3 Xi x5 minS/c 1 0 - *3 Roughly: We use minsGcy17r(s) = minsGcy2 7r(s) as a test for: are and c/2 near-duplicates? Sojka, MR Group: PV211: Crawling 6/32 Recap A simple crawler A real crawler W 1 n crawling • Web search engines must crawl their documents. • Getting the content of the documents is easier for many other IR systems. o E.g., indexing all files on your hard disk: just do a recursive descent on your file system • Ok: for web IR, getting the content of the documents takes longer . .. o .. . because of latency. • But is that really a design/systems challenge? Sojka, MR Group: PV211: Crawling 8/32 Recap A simple crawler A real crawler • Initialize queue with URLs of known seed pages • Repeat o Take URL from queue o Fetch and parse page • Extract URLs from page o Add URLs to queue • Fundamental assumption: The web is well linked. Sojka, MR Group: PV211: Crawling 9/32 Recap A simple crawler A real crawler s wrong wi urlqueue := (some carefully selected set of seed urls) while urlqueue is not empty: myurl := urlqueue.getlastanddeleteO mypage := myurl.fetch() fetchedurls.add(myurl) newurls := mypage.extracturIs() for myurl in newurls: if myurl not in fetchedurls and not in urlqueue: urlqueue.add(myurl) addtoinvertedindex(mypage) Sojka, MR Group: PV211: Crawling 10 / 32 Recap A simple crawler A real crawler • Scale: we need to distribute. • We can't index everything: we need to subselect. How? • Duplicates: need to integrate duplicate detection • Spam and spider traps: need to integrate spam detection • Politeness: we need to be "nice" and space out all requests for a site over a longer period (hours, days) • Freshness: we need to recrawl periodically. o Because of the size of the web, we can do frequent recrawls only for a small subset. 
Again, a subselection or prioritization problem. Sojka, MR Group: PV211: Crawling 11 / 32
Magnitude of the crawling problem
• To fetch 20,000,000,000 pages in one month ...
• ... we need to fetch almost 8,000 pages per second!
• Actually many more, since many of the pages we attempt to crawl will be duplicates, unfetchable, spam, etc.
Sojka, MR Group: PV211: Crawling 12 / 32
What a crawler must do
• Be polite
  • Don't hit a site too often.
  • Only crawl pages you are allowed to crawl: robots.txt
• Be robust
  • Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.
Sojka, MR Group: PV211: Crawling 13 / 32
robots.txt
• Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
• Examples:
  • User-agent: *
    Disallow: /yoursite/temp/
  • User-agent: searchengine
    Disallow: /
• Important: cache the robots.txt file of each site we are crawling.
Sojka, MR Group: PV211: Crawling 14 / 32
robots.txt example (nih.gov)
User-agent: PicoSearch/1.0
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
User-agent: *
Disallow: /news/information/knight/
Disallow: /nidcd/
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
Disallow: /ddir/
Disallow: /sdminutes/
Sojka, MR Group: PV211: Crawling 15 / 32
What any crawler should do
• Be capable of distributed operation
• Be scalable: need to be able to increase crawl rate by adding more machines
• Fetch pages of higher quality first
• Continuous operation: get fresh versions of already crawled pages
Sojka, MR Group: PV211: Crawling 16 / 32
[Figure: the crawler's view of the web - URLs crawled and parsed, the URL frontier (found, but not yet crawled), and unseen URLs]
Sojka, MR Group: PV211: Crawling 18 / 32
URL frontier
• The URL frontier is the data structure that holds and manages URLs we've seen, but that have not been crawled yet.
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must keep all crawling threads busy
Sojka, MR Group: PV211: Crawling 19 / 32
[Figure: basic crawl architecture - fetch, parse, "content seen?" check against document fingerprints, URL filter (robots.txt templates), duplicate URL elimination against the URL set, and the URL frontier]
Sojka, MR Group: PV211: Crawling 20 / 32
URL normalization
• Some URLs extracted from a document are relative URLs.
• E.g., at http://www.fi.muni.cz/~sojka/PV211/, we may have p20crawl.pdf
• This is the same as the URL http://www.fi.muni.cz/~sojka/PV211/p20crawl.pdf
• During parsing, we must normalize (expand) all relative URLs.
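The following is a single-threaded Python sketch that ties together the simple crawler pseudocode from the earlier slide with two points made here: it normalizes (expands) relative URLs with urljoin during parsing and consults a cached robots.txt before fetching. It is only an illustration, not the distributed, prioritizing architecture discussed on the next slides; the user-agent string, the page limit, and the regex-based link extraction are ad-hoc choices for the example.

import re
import urllib.robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

robots_cache = {}                       # one parsed robots.txt per host

def allowed(url, agent="PV211-demo-crawler"):
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    rp = robots_cache.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
        try:
            rp.read()                   # a 404 for robots.txt means everything is allowed
        except OSError:
            pass                        # host unreachable: can_fetch() stays conservative
        robots_cache[host] = rp
    return rp.can_fetch(agent, url)

def crawl(seed_urls, max_pages=50):
    urlqueue = list(seed_urls)          # the (unprioritized) URL frontier
    seen = set(seed_urls)
    fetched = 0
    while urlqueue and fetched < max_pages:
        url = urlqueue.pop(0)
        if not allowed(url):            # politeness: obey robots.txt
            continue
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                    # unfetchable page: skip it
        fetched += 1
        # Crude link extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"]+)"', page):
            newurl = urljoin(url, href)     # normalize (expand) relative URLs
            if newurl.startswith("http") and newurl not in seen:
                seen.add(newurl)
                urlqueue.append(newurl)
        # Here the page would be handed to "content seen?" and to the indexer.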
Sojka, MR Group: PV211: Crawling 21 / 32 Recap A simple crawler A real crawler o For each page fetched: check if the content is already in the index • Check this using document fingerprints or shingles • Skip documents whose content has already been indexed Sojka, MR Group: PV211: Crawling 22 / 32 Recap A simple crawler A real crawler riDutmg • Run multiple crawl threads, potentially at different nodes o Usually geographically distributed nodes • Partition hosts being crawled into nodes Sojka, MR Group: PV211: Crawling Recap A simple crawler A real crawler enters ^wayTarmg.c Map DetalFi Trackers B tog My horn es í ay created by Plngcfom Waypoints Q] Berlin, Germany zoom 22 Frankfurt, Germany zoom pf] Munlch, Germany zown Zürich, Swttzerland ^ Groningen, Netneriandi zoom FT1 Moni, Beiglum zoom £2 Eamahaven, Netherlandi zoom FH Paria zoom France llŕííf'Htůť1 Lirriu'gei Sojka, MR Group: PV211: Crawling 24 / 32 Recap A simple crawler A real crawler URL frontier from other nodes Sojka, MR Group: PV211: Crawling 25 / 32 Recap A simple crawler A real crawler wo main • Politeness: Don't hit a web server too frequently o E.g., insert a time gap between successive requests to the same server • Freshness: Crawl some pages (e.g., news sites) more often than others • Not an easy problem: simple priority queue fails. Sojka, MR Group: PV211: Crawling 26 / 32 F front queues f. queue selector & b. queue router B back queues: single host on each b. queue selector heap • URLs flow in from the top into the frontier. • Front queues manage prioritization. • Back queues enforce politeness. • Each queue is FIFO. Sojka, MR Group: PV211: Crawling Recap A simple crawler A real crawler F front queues f. queue selector & b. queue router • Prioritizer assigns to URL an integer priority between 1 and F. • Then appends URL to corresponding queue • Heuristics for assigning priority: refresh rate, PageRank etc Sojka, MR Group: PV211: Crawling o Selection from front queues is initiated 28 / 32 B back queues Single host on each b. queue selector T • Invariant 1. Each back queue is kept non-empty while the crawl is in progress. • Invariant 2. Each back queue only contains URLs from a single host. • Maintain a table from hosts to back queues. • In the heap: • One entry for each back queue • The entry is the Sojka, MR Group: PV211: Crawling F front queues f. queue selector & b. queue router B back queues: single host on each b. queue selector heap • URLs flow in from the top into the frontier. • Front queues manage prioritization. • Back queues enforce politeness. • Each queue is FIFO. Sojka, MR Group: PV211: Crawling Recap A simple crawler A real crawler c ■ J i bpid er trap o Malicious server that generates an infinite sequence of linked pages. • Sophisticated spider traps generate pages that are not easily identified as dynamic. Sojka, MR Group: PV211: Crawling 31 / 32 Recap A simple crawler A real crawler • Chapter 20 of MR • Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and Fl MU library o Papers by NLP centre people crawling data for Sketch Engine • Paper on Mercator by Heydon et al. • Robot exclusion standard Sojka, MR Group: PV211: Crawling 32 / 32 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/PV211 MR 21: Link analysis Handout version Petr Sojka, Hinrich Schütze et al. 
Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2017-04-13 Sojka, MR Group: PV211: Link analysis 1/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities verv iew Q Recap Q Anchor text Q Citation analysis Q PageRank O HITS: Hubs & Authorities Sojka, MR Group: PV211: Link analysis 2/82 bearcn engines rank content pages ana ac Web Images Maps News Shopping Gmail more Sign ir Google 'discount broker Search Advanced Search Preferences Web Discount Broker Reviews Information on online discount brokers emphasizing rates, charges, and customer comments and complaints. www.broker-reviews.us/ - 94k - Cached - Similar pages Results 1 -10 of about 807,000 for discount broker [definition]. (0.12 seconds) Sponsored Links Rated #1 Online Broker No Minimums. No Inactivity Fee Discount Broker Rankings (2008 Broker Survey) at SmartMoney.com Discount Brokers. Rank/ Brokerage/ Minimum to Open Account, Comments, Standard Commis- sion*, Reduced Commission, Account Fee Per Year (How to Avoid), Avg.... www.smartmoney.com/brokers/index.cfm?story=2004-discount-table -121k- Cached - Similar pages Stock Brokers | Discount Brokers | Online Brokers Most Recommended. Top 5 Brokers headlines. 10. Don't Pay YourBrokerfor Free Funds May 15 at 3:39 PM. 5. Don't Discount the Discounters Apr 18 at 2:41 PM ... www.fool.com/investing/brokers/index.aspx - 44k - Cached - Similar pages Discount Broker Discount Broker-Definition of Discount Broker on Investopedia -A stockbroker who carries out buy and sell orders at a reduced commission compared to a ... www.investopedia.com/terms/d/discountbroker.asp -31k- Cached - Similar pages Discount Brokerage and Online Trading for Smart Stock Market... Online stock broker SogoTrade offers the best in discount brokerage investing. Get stock market quotes from this internet stock trading company. www.sogotrade.com/ - 39k - Cached - Similar pages 15 questions to ask discount brokers - MSN Money Jan 11, 2004 ... If you're not big on hand-holding when it comes to investing, a discount broker can be an economical way to go. Just be sure to ask these ... moneycentral.msn.com/content/lnvesting/Startinvesting/P66171 .asp - 34k - Cached - Similar pages Transfer to Firstrade for Free! www.firstrade.com Discount Broker Commission free trades for 30 days. No maintenance fees. Sign up now. TDAMERITRADE.com TradeKing - Online Broker $4.95 per Trade, Market or Limit SmartMoney Top Discount Broker 2007 www.TradeKing.com Scottrade Brokerage $7 Trades, No Share Limit. In-Depth Research. Start Trading Online Now! www.Scottrade.com Stock trades £1.50 - $3 100 free trades, up to $100 back for transfer costs, $500 minimum www.sogotrade.com £3.95 Online Stock Trades Market/Limit Orders, No Share Limit and No Inactivity Fees www.Marsco.com IIMGDIRECT I ShareBuilder Sojka, MR Group: PV211: Link analysis 4/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities I oogie s second price auction advertiser bid CTR ad rank rank paid A B C D $4.00 $3.00 $2.00 $1.00 0.01 0.03 0.06 0.08 0.04 0.09 0.12 0.08 4 2 1 3 (minimum) $2.68 $1.51 $0.51 • bid: maximum bid for a click by advertiser • CTR: click-through rate: when an ad is displayed, what percentage of time do users click on it? CTR is a measure of relevance. 
• ad rank: bid x CTR: this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is • paid: Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent). Sojka, MR Group: PV211: Link analysis 5/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities at s great aoout search aas o Users only click if they are interested. o The advertiser only pays when a user clicks on an ad. • Searching for something indicates that you are more likely to buy it ... • ... in contrast to radio and newspaper ads. Sojka, MR Group: PV211: Link analysis 6/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ear duplicate detection: Minimum or permutation document 1: {sk} document 2: {sk} Si S2 s3 s4 xk = 7r(sk) O • • O O ♦ ♦ O *3 O Xi x4 o o *2 O X3 Xi x4 mins/c7ľ(s/c) 1—0- *2 *3 L Si S5 S3 S4 - Xk = ^(s/c) L 0 • 0 «o • • 0 > *3 Xi x4 *5 L 0 O O 0 * *3 Xi x5 mins/c 7t(S/c) 1 0 -►- *3 Roughly: We use minsGcy17r(s) = minsGcy2 7r(s) as a test for: are and c/2 near-duplicates? Sojka, MR Group: PV211: Link analysis 7/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities di d2 Sl 1 0 S2 0 1 S3 1 1 S4 1 0 0 1 h(x) = X mod 5 g(x) = (2x + 1) mod 5 min(/?(c/i)) = 1^0 = m\n(h(d2)) min(g(c/i)) min(^(£/2)) = 2^0 = di slot c/2 slot oo 00 oo 00 h(l) = l 1 1 - 00 fir(i) = 3 3 3 - 00 /i(2) = 2 — 1 2 2 fir(2) = 0 — 3 0 0 /?(3) = 3 3 1 3 2 S(3) = 2 2 2 2 0 /?(4) = 4 4 1 - 2 ár(4) = 4 4 2 - 0 /?(5) = 0 — 1 0 0 ár(5) = 1 — 2 1 0 j{di,d2) 0+0 0 final sketches Sojka, MR Group: PV211: Link analysis 8/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ake-away today • Anchor text: What exactly are links on the web and why are they important for IR? • Citation analysis: the mathematical foundation of PageRank and link-based ranking • PageRank: the original algorithm that was used for link-based ranking on the web • Hubs & Authorities: an alternative link-based ranking algorithm Sojka, MR Group: PV211: Link analysis 9/82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities • Assumption 1: A hyperlink is a quality signal. • The hyperlink di —>► c/2 indicates that c/i's author deems c/2 high-quality and relevant. • Assumption 2: The anchor text describes the content of c/2. o We use anchor text somewhat loosely here for: the text surrounding the hyperlink, o Example: "You can find cheap cars here." • Anchor text: "You can find cheap cars here" Sojka, MR Group: PV211: Link analysis 11 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities exi M oniy vs. itexi anc :ex • Searching on [text of c/2] + [anchor text —>► c/2] is often more effective than searching on [text of c/2] only. • Example: Query IBM • Matches IBM's copyright page • Matches many spam pages • Matches IBM Wikipedia article • May not match IBM home page! • ... if IBM home page is mostly graphics • Searching on [anchor text —> c/2] is better for the query IBM. o In this representation, the page with the most occurrences of IBM is www.ibm.com. 
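As a toy illustration of indexing anchor text together with page text (the boost value and the mini-collection below are invented for the example, they are not what any real engine uses): every inlink contributes its anchor text to the target page's term counts, so a graphics-heavy page like the IBM home page can still score highly for the query IBM.

from collections import Counter, defaultdict

ANCHOR_BOOST = 3      # assumed weight for anchor terms; real systems tune this

def term_weights(pages, links):
    """pages: {url: body text}. links: (source, target, anchor text) triples."""
    weights = defaultdict(Counter)
    for url, body in pages.items():
        weights[url].update(body.lower().split())
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            weights[target][term] += ANCHOR_BOOST
    return weights

# Hypothetical mini-collection in the spirit of the IBM example:
pages = {"www.ibm.com": "welcome"}                   # mostly graphics, little text
links = [("www.nytimes.com", "www.ibm.com", "IBM acquires Webify"),
         ("www.slashdot.org", "www.ibm.com", "New IBM optical chip")]
print(term_weights(pages, links)["www.ibm.com"]["ibm"])   # -> 6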
Sojka, MR Group: PV211: Link analysis 12 / 82
Anchor text pointing to www.ibm.com
[Figure: pages linking to www.ibm.com with their anchor text - www.nytimes.com: "IBM acquires Webify", www.slashdot.org: "New IBM optical chip", www.stanford.edu: "IBM faculty award recipients"]
Sojka, MR Group: PV211: Link analysis 13 / 82
Indexing anchor text
• Thus: Anchor text is often a better description of a page's content than the page itself.
• Anchor text can be weighted more highly than document text (based on Assumptions 1&2).
Sojka, MR Group: PV211: Link analysis 14 / 82
Exercise: Assumptions underlying PageRank
• Assumption 1: A link on the web is a quality signal - the author of the link thinks that the linked-to page is high-quality.
• Assumption 2: The anchor text describes the content of the linked-to page.
• Is assumption 1 true in general?
• Is assumption 2 true in general?
Sojka, MR Group: PV211: Link analysis 15 / 82
Google bombs
• A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
• Google introduced a new weighting function in 2007 that fixed many Google bombs.
• Still some remnants: [dangerous cult] on Google, Bing, Yahoo
  • Coordinated link creation by those who dislike the Church of Scientology
• Defused Google bombs: [dumb motherf...], [who is a failure?], [evil empire]
Sojka, MR Group: PV211: Link analysis 16 / 82
Origins of PageRank: Citation analysis (1)
• Citation analysis: analysis of citations in the scientific literature
• Example citation: "Miller (2001) has shown that physical activity alters the metabolism of estrogens."
• We can view "Miller (2001)" as a hyperlink linking two scientific articles.
• One application of these "hyperlinks" in the scientific literature:
  • Measure the similarity of two articles by the overlap of other articles citing them.
  • This is called cocitation similarity.
  • Cocitation similarity on the web: Google's "related:" operator, e.g. [related:www.ford.com]
Sojka, MR Group: PV211: Link analysis 18 / 82
Origins of PageRank: Citation analysis (2)
• Another application: Citation frequency can be used to measure the impact of a scientific article.
  • Simplest measure: Each citation gets one vote.
  • On the web: citation frequency = inlink count
• However: A high inlink count does not necessarily mean high quality ...
  • ... mainly because of link spam.
• Better measure: weighted citation frequency or citation rank
  • A citation's vote is weighted according to its citation impact.
  • Circular? No: can be formalized in a well-defined way.
Sojka, MR Group: PV211: Link analysis 19 / 82
Origins of PageRank: Citation analysis (3)
• Better measure: weighted citation frequency or citation rank
• This is basically PageRank.
• PageRank was invented in the context of citation analysis by Pinsker and Narin in the 1960s.
• Citation analysis is a big deal: The budget and salary of this lecturer are / will be determined by the impact of his publications!
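The cocitation similarity mentioned on the first citation-analysis slide above can be read directly off the citation (or link) graph; a small sketch with invented citing sets:

from collections import Counter
from itertools import combinations

# citations[p] = the set of documents that p cites (or links to); toy data
citations = {"paper A": {"X", "Y", "Z"},
             "paper B": {"X", "Y"},
             "paper C": {"Y", "Z"}}

def cocitation_counts(citations):
    """For each pair of documents, how many sources cite both of them."""
    counts = Counter()
    for cited in citations.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

print(cocitation_counts(citations))
# ('X', 'Y') and ('Y', 'Z') are each cocited twice, ('X', 'Z') once.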
□ Sojka, MR Group: PV211: Link analysis 20 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities rigins or ragercan ummary • We can use the same formal representation for o citations in the scientific literature o hyperlinks on the web • Appropriately weighted citation frequency is an excellent measure of quality . .. • . .. both for web pages and for scientific publications. • Next: PageRank algorithm for computing weighted citation frequency on the web Sojka, MR Group: PV211: Link analysis 21 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities agercan anaom wa • Imagine a web surfer doing a random walk on the web o Start at a random page • At each step, go out of the current page along one of the lin on that page, equiprobably o In the steady state, each page has a long-term visit rate. • This long-term visit rate is the page's PageRank. • PageRank = long-term visit rate = steady state probability Sojka, MR Group: PV211: Link analysis 23 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ormaiization or random wa arkov cnams • A Markov chain consists of N states, plus an N x N transition probability matrix P. • state = page • At each step, we are on exactly one of the pages. • For 1 < ij < A/, the matrix entry P,j tells us the probability of j being the next page, given we are currently on page /. a Clearly, for all i, £j=i Pl} = 1 Pa ^ Sojka, MR Group: PV211: Link analysis 24 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities zxampie we d grapi car benz honda ford jaguar jaguar PageRank d2 d0 0.05 di 0.04 d3 0.25 dA 0.21 d6 0.31 PageRank(d2)< PageRank(d6): why? 0.11 0.04 leopard speed a h do 0.10 0.03 di 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 dA 0.16 0.04 ds 0.01 0.04 de 0.13 0.35 highest in-degree: d2, d3, afe highest out-degree: d2, afe highest PageRank: afe highest hub score: d& (close: d2) highest authority score: d$ Sojka, MR Group: PV211: Link analysis 25 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ink matrix tor example do di d2 do 0 0 1 di 0 1 1 d2 1 0 1 d3 0 0 0 d4 0 0 0 d5 0 0 0 de 0 0 0 d4 de 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 1 Sojka, MR Group: PV211: Link analysis 26 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ransition proDaDinty matrix r tor examp do d0 0.00 dx 0.00 eh. 0.33 0*3 0.00 dA 0.00 d5 0.00 d6 0.00 di d2 0.00 1.00 0.50 0.50 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ds d4 0.00 0.00 0.00 0.00 0.33 0.00 0.50 0.50 0.00 0.00 0.00 0.00 0.33 0.33 ds d§ 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.50 0.50 0.00 0.33 Sojka, MR Group: PV211: Link analysis 27 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ong-term visit rate • Recall: PageRank = long-term visit rate • Long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time. • Next: what properties must hold of the web graph for the long-term visit rate to be well defined? o The web graph must correspond to an ergodic Markov chain. • First a special case: The web graph must not contain dead ends. □ Sojka, MR Group: PV211: Link analysis 28 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities eaa enas • The web is full of dead ends. • Random walk can get stuck in dead ends. • If there are dead ends, long-term visit rates are not well-defined (or non-sensical). 
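The step from the link matrix to the transition probability matrix P is just a row normalization, as the two matrices above show; a small numpy sketch for the seven-page example graph (a dead end, as just discussed, would leave an all-zero row, which teleporting on the next slides takes care of):

import numpy as np

# Link (adjacency) matrix of the example web graph, rows and columns d0..d6.
A = np.array([[0, 0, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 0, 1]], dtype=float)

out_degree = A.sum(axis=1, keepdims=True)   # this graph has no dead ends
P = A / out_degree                          # each row now sums to 1
print(np.round(P, 2))                       # matches the transition matrix on the slide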
Sojka, MR Group: PV211: Link analysis 29 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities E exporting - to gei us oui o • At a dead end, jump to a random web page with prob. 1//V. • At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1//V). 9 With remaining probability (90%), go out on a random hyperlink. • For example, if the page has 4 outgoing links: randomly choose one with probability (l-0.10)/4=0.225 • 10% is a parameter, the teleportation rate. • Note: "jumping" from dead end is independent of teleportation rate. □ Sojka, MR Group: PV211: Link analysis 30 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities esuit OT teieportin • With teleporting, we cannot get stuck in a dead end. 9 But even without dead ends, a graph may not have well-defined long-term visit rates. • More generally, we require that the Markov chain be ergodic. Sojka, MR Group: PV211: Link analysis 31 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities rgoaic Markov cnams • A Markov chain is ergodic iff it is irreducible and aperiodic. o Irreducibility. Roughly: there is a path from any page to any other page. • Aperiodicity. Roughly: The pages cannot be partitioned such that the random walker visits the partitions sequentially. • A non-ergodic Markov chain: 1.0 1.0 Sojka, MR Group: PV211: Link analysis 32 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities rgo< arKov cnains • Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state. o This is the steady-state probability distribution. • Over a long time period, we visit each state in proportion to this rate. • It doesn't matter where we start. o Teleporting makes the web graph ergodic. • =4> Web-graph+teleporting has a steady-state probability distribution. • =4> Each page in the web-graph+teleporting has a PageRank. Sojka, MR Group: PV211: Link analysis 33 / 82 • We now know what to do to make sure we have a well-defined PageRank for each page. • Next: how to compute PageRank Sojka, MR Group: PV211: Link analysis 34 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ormaiization o a Di nty vector • Example: • A probability (row) vector x — (xi,... , x/v) tells us where the random walk is at any point. ( 0 0 0 ... 1 ... 0 0 0 ) 1 2 3 ... / ... N-2 N-l N • More generally: the random walk is on page / with probability x/. • Example: ( 0.05 0.01 0.0 ... 0.2 ... 1 . . . / . . . • £x, = l □ 0.01 0.05 0.03 ) N-2 N-l N Sojka, MR Group: PV211: Link analysis 35 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ange in proDaDiiity vecto • If the probability vector is x = (xi,... , x/v) at this step, what is it at the next step? o Recall that row / of the transition probability matrix P tells us where we go next from state /. • So from x, our next state is distributed as xP. Sojka, MR Group: PV211: Link analysis 36 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities eaay state in vector notation o The steady state in vector notation is simply a vector 7r = (7ri, 7T2,..., tt/v) of probabilities. • (We use 7r to distinguish it from the notation for the probability vector x.) • 717 is the long-term visit rate (or PageRank) of page /. • So we can think of PageRank as a very long vector -entry per page. 
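Going back to the teleporting rule described above, building the teleporting-adjusted matrix is a one-line change per row. A sketch, assuming the 10% teleportation rate used for illustration on the slide; the example matrix shown a few slides later appears to correspond to a rate of 0.14:

import numpy as np

def with_teleporting(A, alpha=0.10):
    """A: 0/1 link matrix; alpha: teleportation rate (10% in the slide's example)."""
    N = A.shape[0]
    P = np.empty((N, N))
    for i in range(N):
        out = A[i].sum()
        if out == 0:                           # dead end: jump uniformly at random
            P[i] = np.full(N, 1.0 / N)
        else:                                  # teleport w.p. alpha, else follow a link
            P[i] = alpha / N + (1 - alpha) * A[i] / out
    return P

# with_teleporting(A, alpha=0.14) reproduces, up to rounding, the "transition
# matrix with teleporting" shown a few slides below for the 7-page example.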
Sojka, MR Group: PV211: Link analysis 37 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities eaay-si isiNDUTion: • What is the PageRank / steady state in this example? n 0J5 n ro On u 0.25 u LO Sojka, MR Group: PV211: Link analysis 38 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities eaay-state aistriDution: Xl X2 Pt(di) Pt(d2) Pu- = 0.25 P12 = 0.75 P21-- = 0.25 P22 = 0.75 to 0.25 0.75 0.25 0.75 tl 0.25 0.75 (convergence) PageRank vector = 7? = (711,712) = (0.25,0.75) pt(c/i) = pt_i(c/i) * P11 + Pt-i{d2) * P21 Pt{d2) = Pt-i{di) * P12 + Pt-i{d2) * P22 □ Sojka, MR Group: PV211: Link analysis 39 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ow a o we compute tr ie steaa y state vector r • In other words: how do we compute PageRank? o Recall: tv = (7ri, 7T2,..., tt/v) is the PageRank vector, the vector of steady-state probabilities ... • . .. and if the distribution in this step is x, then the distribution in the next step is xP. • But 7r is the steady state! • So: 7r = ttP 9 Solving this matrix equation gives us jf. 9 7F is the principal left eigenvector for P .. . • . . .that is, 7r is the left eigenvector with the largest eigenvalue. • All transition probability matrices have largest eigenvalue 1. Sojka, MR Group: PV211: Link analysis 40 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ne way ot computing agercan • Start with any distribution x, e.g., uniform distribution • After one step, we're at xP. • After two steps, we're at xP2. • After k steps, we're at xPk. • Algorithm: multiply x by increasing powers of P until convergence. • This is called the power method. • Recall: regardless of where we start, we eventually reach the steady state jf. 9 Thus: we will eventually (in asymptotia) reach the steady state. Sojka, MR Group: PV211: Link analysis 41 / 82 • What is the PageRank / steady state in this example? n 09 n u 03 u Ö • The steady state distribution (= the PageRanks) in this example are 0.25 for d\ and 0.75 for a^- Sojka, MR Group: PV211: Link analysis 42 / 82 Recap Anchor text Citation analysis PageRank HITS: Hu os & Authorities 1 computing rage Kan ■"ower metr 10 a Xl X2 Pt(d2) Pn = o.i P12 = 0.9 P2i = 0.3 P22 = 0.7 to 0 1 0.3 0.7 = x P tl 0.3 0.7 0.24 0.76 = xP2 0.24 0.76 0.252 0.748 = xP3 t3 0.252 0.748 0.2496 0.7504 = xPA too 0.25 0.75 0.25 0.75 = xP°° PageRank vector = ťt = (iti,it2) = (0.25,0.75) Pt(c/i) = Pt_i(c/i) * Pn + Pt-i{d2) * P2i Pt{d2) = Pt_i(c/i) * Pi2 + Pt-i(c/2) * P22 □ Sojka, MR Group: PV211: Link analysis 43 / 82 • What is the PageRank / steady state in this example? n 09 n u 03 u Ö • The steady state distribution (= the PageRanks) in this example are 0.25 for d\ and 0.75 for d2. 
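The power method described above takes only a few lines of numpy. The sketch below uses the small two-page chain from the slide (P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7); the tolerance and iteration cap are arbitrary choices:

import numpy as np

def pagerank_power_method(P, tol=1e-10, max_iter=1000):
    """Multiply a start distribution by increasing powers of P until convergence."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)        # any start distribution reaches the same steady state
    for _ in range(max_iter):
        x_next = x @ P             # distribution after one more step of the walk
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])
print(pagerank_power_method(P))    # -> approximately [0.25, 0.75], as on the slide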
Sojka, MR Group: PV211: Link analysis 44 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ompute rageKank using power metno Sojka, MR Group: PV211: Link analysis 45 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities Xl X2 Pt(d2) Pn = 0.7 Pl2 = 0.3 P21 = 0.2 P22 = 0.8 to 0 1 0.2 0.8 tl 0.2 0.8 0.3 0.7 0.3 0.7 0.35 0.65 t3 0.35 0.65 0.375 0.625 too 0.4 0.6 0.4 0.6 PageRank vector = ťt = (íti,^) = (0.4,0.6) Pt(c/i) = Pt_i(í/i) * Pii + Pt-i(cfe) * P21 Pt(cfe) = Pt-M) * P12 + Pt-i(cfe) * P22 □ Sojka, MR Group: PV211: Link analysis 46 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities agercanK summary • Preprocessing o Given graph of links, build matrix P • Apply teleportation o From modified matrix, compute vr • 7?/ is the PageRank of page /. • Query processing o Retrieve pages satisfying the query o Rank them by their PageRank • Return reranked list to the user Sojka, MR Group: PV211: Link analysis 47 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities -age Kani k issues o Real surfers are not random surfers. • Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories - and search! o —>» Markov model is not a good model of surfing. • But it's good enough as a model for our purposes. • Simple PageRank ranking (as described on previous slide) produces bad results for many pages. • Consider the query [video service] • The Yahoo home page (i) has a very high PageRank and (ii) contains both video and service. o If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked. • Clearly not desirable • In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors • —> see lecture on Learning to Rank □ Sojka, MR Group: PV211: Link analysis 48 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities zxampie we d grapi car benz honda ford jaguar jaguar PageRank d2 ds d0 0.05 di 0.04 d3 0.25 dA 0.21 d6 0.31 PageRank(d2)< PageRank(d6): why? 0.11 0.04 leopard speed a h do 0.10 0.03 di 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 dA 0.16 0.04 ds 0.01 0.04 de 0.13 0.35 highest in-degree: d2, d3, afe highest out-degree: d2, afe highest PageRank: afe highest hub score: d& (close: d2) highest authority score: d3 Sojka, MR Group: PV211: Link analysis 49 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ransition (proDaDintyj matrix do d0 0.00 c/i 0.00 eh. 0.33 d3 0.00 dA 0.00 d5 0.00 d6 0.00 oři oř2 0.00 1.00 0.50 0.50 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ds d4 0.00 0.00 0.00 0.00 0.33 0.00 0.50 0.50 0.00 0.00 0.00 0.00 0.33 0.33 ds d§ 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.50 0.50 0.00 0.33 Sojka, MR Group: PV211: Link analysis 50 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ransition matrix Mil eieporting do d0 0.02 dx 0.02 eh. 
0.31 d3 0.02 dA 0.02 d5 0.02 cfe 0.02 oři d2 0.02 0.88 0.45 0.45 0.02 0.31 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ds d4 0.02 0.02 0.02 0.02 0.31 0.02 0.45 0.45 0.02 0.02 0.02 0.02 0.31 0.31 ds d§ 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.88 0.45 0.45 0.02 0.31 Sojka, MR Group: PV211: Link analysis 51 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities "ower me- c n oa veci tors x X xP1 xP2 xP3 xP4 xP5 xP6 xP7 xP8 xP9 xP10 xP11 xP12 xP13 do 0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05 di 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 d2 0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11 ds 0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25 dA 0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21 d5 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 de 0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31 Sojka, MR Group: PV211: Link analysis 52 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities zxampie we d grapi car benz honda ford jaguar jaguar PageRank d2 d0 0.05 di 0.04 d3 0.25 dA 0.21 d6 0.31 PageRank(d2)< PageRank(d6): why? 0.11 0.04 leopard speed a h do 0.10 0.03 di 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 dA 0.16 0.04 ds 0.01 0.04 de 0.13 0.35 highest in-degree: d2, d3, afe highest out-degree: d2, afe highest PageRank: afe highest hub score: d& (close: d2) highest authority score: d$ Sojka, MR Group: PV211: Link analysis 53 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ow important is rage Kan • Frequent claim: PageRank is the most important component of web ranking. • The reality: • There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes ... o Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking! • However, variants of a page's PageRank are still an essential part of ranking. • Adressing link spam is difficult and crucial. Sojka, MR Group: PV211: Link analysis 54 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ypernnK-inaucea lopic _>earc • Premise: there are two different types of relevance on the web. • Relevance type 1: Hubs. A hub page is a good list of [links to pages answering the information need]. o E.g., for query [chicago bulls]: Bob's list of recommended resources on the Chicago Bulls sports team • Relevance type 2: Authorities. An authority page is a direct answer to the information need. o The home page of the Chicago Bulls sports team o By definition: Links to authority pages occur repeatedly on hub pages. • Most approaches to search (including PageRank ranking) don't make the distinction between these two very different types of relevance. □ Sojka, MR Group: PV211: Link analysis 56 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities U os an a a orities: U ennition • A good hub page for a topic links to many authority pages for that topic. 9 A good authority page for a topic is linked to by many hub pages for that topic. • Circular definition - we will turn this into an iterative computation. Sojka, MR Group: PV211: Link analysis 57 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities txampie ror rums ana autnorn cies hubs authorities www. b es t f a res. 
co m aviationblog.dallasnews.com Sojka, MR Group: PV211: Link analysis 58 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ow io compuie r 1U D ana auir ionty scores • Do a regular web search first o Call the search result the root set o Find all pages that are linked to or link to pages in the root set • Call this larger set the base set • Finally, compute hubs and authorities for the base set (which we'll view as a small web graph) □ Sojka, MR Group: PV211: Link analysis 59 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities The root set Nodes that root set nodes link to Nodes that link to root set nodes The base set Sojka, MR Group: PV211: Link analysis 60 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities oot set an • Root set typically has 200-1,000 nodes. • Base set may have up to 5,000 nodes. o Computation of base set, as shown on previous slide: o Follow outlinks by parsing the pages in the root set o Find c/'s inlinks by searching for all pages containing a link to d □ Sojka, MR Group: PV211: Link analysis 61 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities ud ana autnority scores • Compute for each page d in the base set a hub score h(d) and an authority score a{d) • Initialization: for all d\ h(d) — 1, a{d) — 1 • Iteratively update all h(d),a(d) 9 After convergence: • Output pages with highest h scores as top hubs o Output pages with highest a scores as top authorities • So we output two ranked lists Sojka, MR Group: PV211: Link analysis 62 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities eraiive upaaie • For all d: h(d) = J2d^v 3(y) 9 For all d: a(d) = h{y) 9 Iterate these two steps until convergence Sojka, MR Group: PV211: Link analysis 63 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities • Scaling o To prevent the a() and /?() values from getting too big, can scale down after each iteration • Scaling factor doesn't really matter. o We care about the relative (as opposed to absolute) values of the scores. • In most cases, the algorithm converges after a few iterations. Sojka, MR Group: PV211: Link analysis 64 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities Autnorities ror query ncago duns 0.85 www.nba.com/bulls 0.25 www.essexl.com/people/jmiller/bulls.htm "da Bulls" 0.20 www.nando.net/SportServer/basketball/nba/chi.html "The Chicago Bulls" 0.15 users.aol.com/rynocub/bulls.htm "The Chicago Bulls Home Page" 0.13 www.geocities.com / Colosseum/6095 "Chicago Bulls" (Ben-Shaul et al, WWW8) Sojka, MR Group: PV211: Link analysis 65 / 82 s authority page tor 3ageRank HITS: Hubs & Authorities icago duns • Teckels tar tne Chicago BulfeAferizion Whtiess ara now on saisr. Join Sums' personflirtas inc uúing currant prayers, co acnes, ecc-nds. broadcaslars and entertalrHnant teams cm August 17 al Via White Pnss Golf Club In Bensarwflla, III. • MŮS.1D: • h I I I I I ., Cfticago Hulls i DYaTt c&nlral Z0O9 * Pre-diaft Ask Sam meJSbag special 4 Pre-draft interview: Wake^ JsflTeagus * Pre-chafi interview: VGU's Eric Mayrar * Pre-dtaJt interview: Wake's Jamas Jotma j Pra-diaft interview: UNCs Wayne HMngk Sojka, MR Group: PV211: Link analysis 66 / 82 Recap Anchor text Citation ans i lysis PageRank HITS: Hubs & Authorities u ds tor query ncago dui is 1.62 www.geocities.com / Colosseum/1778 "UnbelieveabullsMM!" 
1.24 www.webring.org/cgi-bin/webri ng?ring=ch bulls "Erin's Chicago Bulls Page" 0.74 www.geocities.com/Hollywood/Lot/3330/Bulls.html "Chicago Bulls" 0.52 www.nobull.net/web_position/kw-search-15-M2.htm "Excite Search Results: bulls" 0.52 www.halcyon.com/wordsltd/bbal l/bulls. htm "Chicago Bulls Links" (Ben-Shaul et al, WWW8) Sojka, MR Group: PV211: Link analysis 67 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities uď page Tor iL.nicago duns COAST TO COAST TIC KETS great tickets from nice people Minnesota Timberwolves rackets Naw Jersey Nets Tickets Maw Orleans Homata Tickets New York Knicks Tickets Oidahama City Thunder Tickets Orlando Mag-« Tickets Philadelphia ?6e^E T.ckais PKraen Suns Tickets Portland T'ai' 8la2e-B Tewels Sacramanto Kings Tickets San Antonia Ss^rs TckeSs Toronto Haslets Tickete Utah Jazz Tickets Waahinglori Wizards Tckets NBA All-Star Weekend NBA Finals Tkkflls NBA Playoffs Tickets All NBA Tickets Event Selections Sporting Events MLS Baseball Tickets NFL Football Tickets NBA Basketball Tickets NHL Hockey Tickets NASCAR Racing Tickets PGA Golf Tickets Tennis Tickets NCAA Football Tickets ming cm City Guide I \ Official Website Links: Chicago BuNs {official site) http://www.nba .com/bulIs/ Fan Club - Fan Site Links: Chicago Bulls Chicago Bulls Fan Sire with Bulls Blog. News, Buíľs Forum, Wallpapers ana all your basic Chicago Bulls essentials! I http://Www_bu.llscentraLcom Chicago Bulls Blog The päace to be for news and views on the Chicago Buläs and NBA Basketball! h tip i/Zchi-buil s, blog spoicom News and Information Links: Chicago Sun-Times (local newspaper) http://www.sunu mas, ram/sporls/basketball/bulrs/1 ndex. html Chicago Tribune {local newspaper) http://www.chľcagotrl burce .com/spons/basketball/buUar Wlklpedla - Chicago Bulls All about the Chicago Bulls from Wlklpedia. the free online encyclopedia, h ftp J/en.wl klpedia. org/wikl/Chicago_Bulls Merchandise Links: Chicago Bulls watches http://ww w.sportl me watches. com/NB A_watches/Ch Icagc-Bul Is-watches. html Sojka, MR Group: PV211: Link analysis 68 / 82 HITS can pull together good pages regardless of page content. Once the base set is assembled, we only do link analysis, no text matching. Pages in the base set often do not contain any of the query words. In theory, an English query can retrieve Japanese-language pages! • If supported by the link structure between English and Japanese pages Danger: topic drift - the pages found by following links may not be related to the original query. Sojka, MR Group: PV211: Link analysis 69 / 82 Recap Anchor text Citation analysi 5 PageRank HITS: Hubs & Authorities : of convergenc • We define an N x N adjacency matrix A. (We called this the link matrix earlier. • For 1 < ij < N, the matrix entry A,j tells us whether there is a link from page / to page j {A,j — 1) or not (Ay = 0). • Example: n ® di c/i 0 d2 1 d3 1 d2 d3 1 0 1 1 0 0 Sojka, MR Group: PV211: Link analysis 70 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities \ a / "j_ J i i . ■ Write upd ate ruies as matrix operations • Define the hub vector h = ..., h/\i) as the vector of hub scores. /?,- is the hub score of page cf,\ • Similarly for a, the vector of authority scores • Now we can write h(d) = J2d^y a(y) as a matrix operation: h = Aa .. . • . .. 
and we can write a(d) = I^y^ h(y) as a = /4T/7 • HITS algorithm in matrix notation: • Compute h = Aa 9 Compute a = ATh d Iterate until convergence Sojka, MR Group: PV211: Link analysis 71 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities as eigenvector prooiem o HITS algorithm in matrix notation. Iterate: • Compute h = Aa • Compute a = ATh • By substitution we get: h = AATh and a = A7 Aa • Thus, h is an eigenvector of AAT and a is an eigenvector of ATA. • So the HITS algorithm is actually a special case of the powe method and hub and authority scores are eigenvector values. • HITS and PageRank both formalize link analysis as eigenvector problems. Sojka, MR Group: PV211: Link analysis 72 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities zxampie we d grapi car benz honda ford jaguar jaguar PageRank d2 d0 0.05 di 0.04 d3 0.25 dA 0.21 d6 0.31 PageRank(d2)< PageRank(d6): why? 0.11 0.04 leopard speed a h do 0.10 0.03 di 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 dA 0.16 0.04 ds 0.01 0.04 de 0.13 0.35 highest in-degree: d2, ds, afe highest out-degree: d2, afe highest PageRank: afe highest hub score: afe (close: d2) highest authority score: ds Sojka, MR Group: PV211: Link analysis 73 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities do di d2 d3 d5 cfe do 0 0 1 0 0 0 0 di 0 1 1 0 0 0 0 d2 1 0 1 2 0 0 0 d3 0 0 0 1 1 0 0 d4 0 0 0 0 0 0 1 d5 0 0 0 0 0 1 1 d6 0 0 0 2 1 0 1 Sojka, MR Group: PV211: Link analysis 74 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities 1 1 1 j_ II í A H u b vectors ho, h j = jA i • a,-, / > 1 ho hi h2 h3 h4 do 0.14 0.06 0.04 0.04 0.03 0.03 di 0.14 0.08 0.05 0.04 0.04 0.04 d2 0.14 0.28 0.32 0.33 0.33 0.33 d3 0.14 0.14 0.17 0.18 0.18 0.18 dA 0.14 0.06 0.04 0.04 0.04 0.04 d5 0.14 0.08 0.05 0.04 0.04 0.04 de 0.14 0.30 0.33 0.34 0.35 0.35 Sojka, MR Group: PV211: Link analysis 75 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities a j_ I ■ , , —> 1 a T 1 " \ 1 Aut hority vectors a-, — 31 d0 0.06 di 0.06 d2 0.19 d3 0.31 dA 0.13 ds 0.06 d6 0.19 32 a3 0.09 0.10 0.03 0.01 0.14 0.13 0.43 0.46 0.14 0.16 0.03 0.02 0.14 0.13 34 35 0.10 0.10 0.01 0.01 0.12 0.12 0.46 0.46 0.16 0.16 0.01 0.01 0.13 0.13 36 3*7 0.10 0.10 0.01 0.01 0.12 0.12 0.47 0.47 0.16 0.16 0.01 0.01 0.13 0.13 Sojka, MR Group: PV211: Link analysis 76 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities zxampie we d grapi car benz honda ford jaguar jaguar PageRank d2 d0 0.05 di 0.04 d3 0.25 dA 0.21 d6 0.31 PageRank(d2)< PageRank(d6): why? 0.11 0.04 leopard speed a h do 0.10 0.03 di 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 dA 0.16 0.04 ds 0.01 0.04 de 0.13 0.35 highest in-degree: d2, ds, afe highest out-degree: d2, afe highest PageRank: afe highest hub score: afe (close: d2) highest authority score: ds Sojka, MR Group: PV211: Link analysis 77 / 82 Recap Anchor text Citation ana lysis PageRank HITS: Hubs & Authorities zxampie we d grapi car benz honda ford jaguar jaguar PageRank d2 d0 0.05 di 0.04 d3 0.25 dA 0.21 d6 0.31 PageRank(d2)< PageRank(d6): why? 0.11 0.04 leopard speed a h do 0.10 0.03 di 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 dA 0.16 0.04 ds 0.01 0.04 de 0.13 0.35 highest in-degree: d2, ds, afe highest out-degree: d2, afe highest PageRank: afe highest hub score: d& (close: d2) highest authority score: d$ Sojka, MR Group: PV211: Link analysis 78 / 82 Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities agercanK vs. 
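Putting the pieces together, here is a numpy sketch of the HITS computation on the example adjacency matrix above: the iterative update h = Aa, a = A^T h with scaling after each step, plus the eigenvector view of the same scores. The iteration count is an arbitrary choice; up to rounding, the printed vectors should agree with the converged columns of the hub and authority tables above (highest hub score for d6, highest authority score for d3).

import numpy as np

# Example adjacency matrix from the slides above (entry = number of links i -> j).
A = np.array([[0, 0, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 0, 1, 2, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 2, 1, 0, 1]], dtype=float)

def hits(A, iterations=50):
    """Iterative update h = A a, a = A^T h, scaled to sum 1 after each step."""
    h = np.ones(A.shape[0])            # initialization: all hub scores 1
    a = np.ones(A.shape[0])            # initialization: all authority scores 1
    for _ in range(iterations):
        h = A @ a                      # h(d) = sum of a(y) over all d -> y
        a = A.T @ h                    # a(d) = sum of h(y) over all y -> d
        h /= h.sum()                   # scaling factor does not matter,
        a /= a.sum()                   # only relative values do
    return h, a

def principal_eigenvector(M):
    """Eigenvector for the largest eigenvalue of a symmetric matrix, sum-normalized."""
    values, vectors = np.linalg.eigh(M)
    v = np.abs(vectors[:, np.argmax(values)])
    return v / v.sum()

h, a = hits(A)
print(np.round(h, 2))                               # hub scores, cf. the h table above
print(np.round(a, 2))                               # authority scores, cf. the a table
print(np.round(principal_eigenvector(A @ A.T), 2))  # same h via the eigenvector view
print(np.round(principal_eigenvector(A.T @ A), 2))  # same a via the eigenvector view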
PageRank vs. HITS: Discussion
• PageRank can be precomputed, HITS has to be computed at query time.
  • HITS is too expensive in most application scenarios.
• PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
  • These two are orthogonal.
  • We could also apply HITS to the entire web and PageRank to a small base set.
• Claim: On the web, a good hub almost always is also a good authority.
• The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect.
Sojka, MR Group: PV211: Link analysis 79 / 82
Exercise
• Why is a good hub almost always also a good authority?
Sojka, MR Group: PV211: Link analysis 80 / 82
Take-away today
• Anchor text: What exactly are links on the web and why are they important for IR?
• Citation analysis: the mathematical foundation of PageRank and link-based ranking
• PageRank: the original algorithm that was used for link-based ranking on the web
• Hubs & Authorities: an alternative link-based ranking algorithm
Sojka, MR Group: PV211: Link analysis 81 / 82
Resources
• Chapter 21 of MR
• Resources at http://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
  • American Mathematical Society article on PageRank (popular science style)
  • Jon Kleinberg's home page (main person behind HITS)
  • A Google bomb and its defusing
  • Google's official description of PageRank: "PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results."
Sojka, MR Group: PV211: Link analysis 82 / 82