Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/PV211 IIR 1: Boolean Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2023-02-15 (compiled on 2023-02-27 19:09:36) Sojka, IIR Group: PV211: Boolean Retrieval 1 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Take-away Basic information about the course, teachers, evaluation, exercises Boolean Retrieval: Design and data structures of a simple information retrieval system What topics will be covered in this class (overview)? Sojka, IIR Group: PV211: Boolean Retrieval 2 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Overview 1 Introduction 2 History of information retrieval 3 Boolean model 4 Inverted index 5 Processing queries 6 Query optimization 7 Course overview and agenda Sojka, IIR Group: PV211: Boolean Retrieval 3 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Start with why (Simon Sinek) Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Why important? Why you? Why now? Information handling on Faculty of informatics in information age. . . Sojka, IIR Group: PV211: Boolean Retrieval 5 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Prerequisites Curiosity about how Information Retrieval works. But seriously, based on Manning et al. IIR textbook (available in MU libraries): Chapters 1–5 benefit from basic course on algorithms and data structures. Chapters 6–7 need in addition linear algebra, vectors and dot products. For Chapters 11–13 basic probability notions are needed. Chapters 18–21 demand course in linear algebra, notions of matrix rank, eigenvalues and eigenvectors. Sojka, IIR Group: PV211: Boolean Retrieval 6 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours PV211 course design I proactive rather than reactive learning, diversity is stability, welcomed, learning by doing/programming, skillful rather than bag of facts, Stanford (TEX, Google) inspired Sojka, IIR Group: PV211: Boolean Retrieval 7 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours PV211 course design II Mentoring rather than ‘ex cathedra’ lectures: “The flipped classroom is a pedagogical model in which the typical lecture and homework elements of a course are reversed.” Questions are welcome—on PV211 IS discussion forum before lectures, and also during lectures. Respect to the individual learning speed and knowledge. Student [soft skills and programming] activities (answering in discussion forums) are explicitly welcomed. Richness of materials available in advance: MOOC (Massive open online course) becoming widespread, parts of IIR Stanford courses being available, together with other freely available teaching materials, including the whole IIR book, Google Colab notebooks,. . . . Sojka, IIR Group: PV211: Boolean Retrieval 8 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Teachers Petr Sojka, sojka@fi.muni.cz Consulting hours Spring 2023: Wednesday 14:00–14:50 after the Wednesday lecture or by appointment by email. Room C523 (or C522), fifth floor, Botanická 68a. Course web page: https://www.fi.muni.cz/~sojka/PV211/ Teaching assistants (TA): Vít Novotný, witiko@mail.muni.cz, Michal Štefánik, stefanik.m@mail.muni.cz Vojtěch Kalivoda, 527350@mail.muni.cz Šárka Ščavnická, 527352@mail.muni.cz All TAs are ready for consultations after their teaching hours or by appointment. Sojka, IIR Group: PV211: Boolean Retrieval 9 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Evaluation of students Classification is based on points you could get a) 48 pts during the term: 36 pts for the two term programming projects, 12 pts for term projects peer reviews, and b) 52 pts for the final exam (multiple-choice test): 20 pts for exercises, similar to those practiced at seminars, 32 pts for classical multiple-choice test. In addition, one can get additional premium points based on activities during lectures, exercises (good answers), in IS discussion forum, or negotiated related projects. Classification scale lower bounds for passing z/k are 48/53 points and E–A grading will be adjusted based on ECTS suggestion in IS (E/D/C/B/A ≈ 58/66/74/82/90 pts . Dates of [final] exams will be announced via IS.muni.cz (at least three terms, probably four). Sojka, IIR Group: PV211: Boolean Retrieval 10 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Two term projects and their student peer reviews I Until 13. 3. 23:59 (resp. 1. 5. 23:59), your tasks awarded up to 10 pts resp. 26 pts will be the following: 1 Individually implement a ranked unsupervised (resp. supervised) retrieval system for Cranfield (resp. ARQMath3) collection. 2 Document your code and stick to an organized, consistent, human-readable coding style. 3 For first task, reach at least 22% (resp. TBA) mean average precision (MAP) and record it in your Jupyter notebook or in the public leaderboard. 4 Upload an .ipynb file with your Jupyter notebook to the homework vault in IS MU. Sojka, IIR Group: PV211: Boolean Retrieval 11 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Two term projects and their student peer reviews II For detailed instructions and an example solution, see the Google Colaboratory document linked from the interactive course syllabus in IS MU. Between 14. 3. and 20. 3., (resp. 2. 5. and 8. 5.) your task awarded up to 3 × 2 = 6 pts will be to review the term projects of three of your colleagues. 0.5 pts will be awarded for handing in a review of your colleague’s term project. 1.5 pts will be awarded for reviewing the completion of tasks in your colleague’s term project. Sojka, IIR Group: PV211: Boolean Retrieval 12 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Two term projects and their student peer reviews III You will be instructed on the first practical on which institutional computational resources (Jupyter Hub, Google Colab, Deepnote,. . . ) you will have at your disposal for solving projects. You can get up to extra 40/20/10/9/8/7/6/5/4/3/2/1 point(s) for the 1st/2nd/3rd/4th/. . . /12th place in the competition. Final leaderboards will be increasingly ordered by sum of weighted positions gained in both tasks, and by sum of two scores in the case of tie. Sojka, IIR Group: PV211: Boolean Retrieval 13 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Summary of course grading week/ deadline pts description 1–4/ 13. 3. 23:59 10 first assignment project (Cranfield) 5/ 20. 3. 23:59 6 peer review of Cranfield TFIDF 6–11/ 1. 5. 23:59 20 second assignment project (ARQMath3) 6–11/ 1. 5. 23:59 6 for justification and explanation of your solution and code of second assignement 12/ 8. 5. 23:59 6 peer review of explained second project 14+/ exam part 1 20 open exercises, similar to those practiced at seminars 14+/ exam part 2 32 classical multiple-choice test testing understanding of topics taught 1–14+/ extra points X extra activities during term or negotiated related projects 14+/ total points 100+X points for ECTS gradings Sojka, IIR Group: PV211: Boolean Retrieval 14 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Can we proceed [Y/N]? Questions? Python? Jupyter Notebook, Jupyter Hub? Google Colab? Deepnote project? Bc./Mgr./Ph.D.? Mandatory course z/k/zk? Erasmus? Nationalities: CZ?, SK?, EN=C2 (mother tongue)?, other? 2 programming projects?/challenges? Student peer reviews? Mikolov? Řehůřek? Materna? Jurových? Presentation style? traditional? or agile/interactive [warm ups, Kahoot])? Piazza? Discord discussion forum with anonymous posts?! Sojka, IIR Group: PV211: Boolean Retrieval 15 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours History of information retrieval: gradual changes of channels Sojka, IIR Group: PV211: Boolean Retrieval 17 / 86 Gradual speedup of changes in IR Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours 1998: google.stanford.edu collaborative project with Stanford faculty (‘flipped IS’ :-) on collected disks Google 1998 ‘Anatomy paper’ (Page, Brin) Sojka, IIR Group: PV211: Boolean Retrieval 21 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Unstructured (text) versus structured (database) data in 2016 ? in 2026 ? Sojka, IIR Group: PV211: Boolean Retrieval 26 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Boolean retrieval The Boolean model is arguably the simplest model to base an information retrieval system on. Queries are Boolean expressions, e.g., Caesar and Brutus The search engine returns all documents that satisfy the Boolean expression. Does Google use the Boolean model? Sojka, IIR Group: PV211: Boolean Retrieval 27 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Does Google use the Boolean model? On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn Cases where you get hits that do not contain one of the wi : anchor text page contains variant of wi (morphology, spelling correction, synonym) long queries (n large) boolean expression generates very few hits Simple Boolean vs. Ranking of result set Simple Boolean retrieval returns matching documents in no particular order. Google (and most well designed Boolean engines) rank the result set – they rank good hits (according to some estimator of relevance) higher than bad hits. Sojka, IIR Group: PV211: Boolean Retrieval 28 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Unstructured data in 1650: collective works of Shakespeare Sojka, IIR Group: PV211: Boolean Retrieval 30 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Unstructured data in 1650 Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is grep not the solution? Slow (for large collections) grep is line-oriented, IR is document-oriented “not Calpurnia” is non-trivial Other operations (e.g., find the word Romans near countryman) not feasible Ranked retrieval (best documents to return) – focus of later lectures, but not this one Sojka, IIR Group: PV211: Boolean Retrieval 31 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Term-document incidence matrix Anthony Julius The Hamlet Othello Macbeth . . . and Caesar Tempest Cleopatra Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 . . . Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar. Entry is 0 if term doesn’t occur. Example: Calpurnia doesn’t occur in The tempest. Sojka, IIR Group: PV211: Boolean Retrieval 32 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Incidence vectors So we have a 0/1 vector for each term. To answer the query Brutus and Caesar and not Calpurnia: Take the vectors for Brutus, Caesar, and Calpurnia Complement the vector of Calpurnia Do a (bitwise) and on the three vectors 110100 and 110111 and 101111 = 100100 Sojka, IIR Group: PV211: Boolean Retrieval 33 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours 0/1 vector for Brutus Anthony Julius The Hamlet Othello Macbeth . . . and Caesar Tempest Cleopatra Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 . . . result: 1 0 0 1 0 0 Sojka, IIR Group: PV211: Boolean Retrieval 34 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Answers to query Anthony and Cleopatra, Act III, Scene ii Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. Sojka, IIR Group: PV211: Boolean Retrieval 35 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Bigger collections Consider N = 106 documents, each with about 1000 tokens ⇒ total of 109 tokens On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 · 109 = 6 GB Assume there are M = 500,000 distinct terms in the collection (Notice that we are making a term/token distinction.) Sojka, IIR Group: PV211: Boolean Retrieval 36 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Can’t build the incidence matrix M = 500,000 × 106 = half a trillion 0s and 1s. But the matrix has no more than one billion 1s. Matrix is extremely sparse. What is a better representations? We only record the 1s: inverted index! Sojka, IIR Group: PV211: Boolean Retrieval 37 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Inverted index For each term t, we store a list of all documents that contain t. Brutus −→ 1 2 4 11 31 45 173 174 Caesar −→ 1 2 4 5 6 16 57 132 . . . Calpurnia −→ 2 31 54 101 ... dictionary postings Sojka, IIR Group: PV211: Boolean Retrieval 38 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Inverted index construction 1 Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar . . . 2 Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So . . . 3 Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so . . . 4 Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings. Sojka, IIR Group: PV211: Boolean Retrieval 39 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Tokenization and preprocessing Doc 1. I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious: =⇒ Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious Sojka, IIR Group: PV211: Boolean Retrieval 40 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Generate postings Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious =⇒ term docID i 1 did 1 enact 1 julius 1 caesar 1 i 1 was 1 killed 1 i’ 1 the 1 capitol 1 brutus 1 killed 1 me 1 so 2 let 2 it 2 be 2 with 2 caesar 2 the 2 noble 2 brutus 2 hath 2 told 2 you 2 caesar 2 was 2 ambitious 2 Sojka, IIR Group: PV211: Boolean Retrieval 41 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Sort postings term docID i 1 did 1 enact 1 julius 1 caesar 1 i 1 was 1 killed 1 i’ 1 the 1 capitol 1 brutus 1 killed 1 me 1 so 2 let 2 it 2 be 2 with 2 caesar 2 the 2 noble 2 brutus 2 hath 2 told 2 you 2 caesar 2 was 2 ambitious 2 =⇒ term docID ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 hath 1 i 1 i 1 i’ 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 Sojka, IIR Group: PV211: Boolean Retrieval 42 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Create postings lists, determine document frequency term docID ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 hath 1 i 1 i 1 i’ 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 =⇒ term doc. freq. → postings lists ambitious 1 → 2 be 1 → 2 brutus 2 → 1 → 2 capitol 1 → 1 caesar 2 → 1 → 2 did 1 → 1 enact 1 → 1 hath 1 → 2 i 1 → 1 i’ 1 → 1 it 1 → 2 julius 1 → 1 killed 1 → 1 let 1 → 2 me 1 → 1 noble 1 → 2 so 1 → 2 the 2 → 1 → 2 told 1 → 2 you 1 → 2 was 2 → 1 → 2 with 1 → 2 Sojka, IIR Group: PV211: Boolean Retrieval 43 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Split the result into dictionary and postings file Brutus −→ 1 2 4 11 31 45 173 174 Caesar −→ 1 2 4 5 6 16 57 132 . . . Calpurnia −→ 2 31 54 101 ... dictionary postings file Sojka, IIR Group: PV211: Boolean Retrieval 44 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Later in this course Index construction: how can we create inverted indexes for large collections? How much space do we need for dictionary and index? Index compression: how can we efficiently store and process indexes for large collections? Ranked retrieval: what does the inverted index look like when we want the “best” answer? Sojka, IIR Group: PV211: Boolean Retrieval 45 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Simple conjunctive query (two terms) Consider the query: Brutus AND Calpurnia To find all matching documents using inverted index: 1 Locate Brutus in the dictionary 2 Retrieve its postings list from the postings file 3 Locate Calpurnia in the dictionary 4 Retrieve its postings list from the postings file 5 Intersect the two postings lists 6 Return intersection to user Sojka, IIR Group: PV211: Boolean Retrieval 47 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Intersecting two postings lists Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174 Calpurnia −→ 2 → 31 → 54 → 101 Intersection =⇒ 2 → 31 This is linear in the length of the postings lists. Note: This only works if postings lists are sorted. Sojka, IIR Group: PV211: Boolean Retrieval 48 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Intersecting two postings lists Intersect(p1, p2) 1 answer ← 2 while p1 = nil and p2 = nil 3 do if docID(p1) = docID(p2) 4 then Add(answer, docID(p1)) 5 p1 ← next(p1) 6 p2 ← next(p2) 7 else if docID(p1) < docID(p2) 8 then p1 ← next(p1) 9 else p2 ← next(p2) 10 return answer Sojka, IIR Group: PV211: Boolean Retrieval 49 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Query processing: Exercise france −→ 1 → 2 → 3 → 4 → 5 → 7 → 8 → 9 → 11 → 12 → 13 → 14 → 15 paris −→ 2 → 6 → 10 → 12 → 14 lear −→ 12 → 15 Compute hit list for ((paris AND NOT france) OR lear) Sojka, IIR Group: PV211: Boolean Retrieval 50 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Boolean queries The Boolean retrieval model can answer any query that is a Boolean expression. Boolean queries are queries that use and, or and not to join query terms. Views each document as a set of terms. Is precise: Document matches condition or not. Primary commercial retrieval tool for 3 decades Many professional searchers (e.g., lawyers) still like Boolean queries. You know exactly what you are getting. Many search systems you use are also Boolean: spotlight, email, intranet, etc. Sojka, IIR Group: PV211: Boolean Retrieval 51 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Commercially successful Boolean retrieval: Westlaw Largest commercial legal search service in terms of the number of paying subscribers Over half a million subscribers performing millions of searches a day over tens of terabytes of text data The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users . . . . . . although ranked retrieval has been available since 1992. Sojka, IIR Group: PV211: Boolean Retrieval 52 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Westlaw: Example queries Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company Query: “trade secret” /s disclos! /s prevent /s employe! Information need: Requirements for disabled people to be able to access a workplace Query: disab! /p access! /s work-site work-place (employment /3 place) Information need: Cases about a host’s responsibility for drunk guests Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest Sojka, IIR Group: PV211: Boolean Retrieval 53 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Westlaw: Comments Proximity operators: /3 = within 3 words, /s = within a sentence, /p = within a paragraph Space is disjunction, not conjunction! (This was the default in search pre-Google.) Long, precise queries: incrementally developed, not like web search Why professional searchers often like Boolean search: precision, transparency, control When are Boolean queries the best way of searching? Depends on: information need, searcher, document collection,. . . Sojka, IIR Group: PV211: Boolean Retrieval 54 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Query optimization Consider a query that is an and of n terms, n > 2 For each of the terms, get its postings list, then and them together Example query: Brutus AND Calpurnia AND Caesar What is the best order for processing this query? Sojka, IIR Group: PV211: Boolean Retrieval 56 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Query optimization Example query: Brutus AND Calpurnia AND Caesar Simple and effective optimization: Process in order of increasing frequency Start with the shortest postings list, then keep cutting further In this example, first Caesar, then Calpurnia, then Brutus Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174 Calpurnia −→ 2 → 31 → 54 → 101 Caesar −→ 5 → 31 Sojka, IIR Group: PV211: Boolean Retrieval 57 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Optimized intersection algorithm for conjunctive queries Intersect( t1, . . . , tn ) 1 terms ← SortByIncreasingFrequency( t1, . . . , tn ) 2 result ← postings(first(terms)) 3 terms ← rest(terms) 4 while terms = nil and result = nil 5 do result ← Intersect(result, postings(first(terms))) 6 terms ← rest(terms) 7 return result Sojka, IIR Group: PV211: Boolean Retrieval 58 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours More general optimization Example query: (madding or crowd) and (ignoble or strife) Get frequencies for all terms Estimate the size of each or by the sum of its frequencies (conservative) Process in increasing order of or sizes Sojka, IIR Group: PV211: Boolean Retrieval 59 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Exercise Recommend a query processing order for: (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) Sojka, IIR Group: PV211: Boolean Retrieval 60 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Course overview and agenda We are done with Chapter 1 of IIR (IIR 01). Plan for the rest of the semester: some 14 of the 21 chapters of IIR (cf. slides from previous years and those planned for this year – comments welcome). In what follows: teasers for most chapters – to give you a sense of what will be covered. One or two bonus invited lecture(s), and lecture(s) on IR topics researched in my research group MIR.fi.muni.cz and on state-of-the art achievements in the area (vector space embeddings, transformers, Neural AI 4 IR, etc.). Sojka, IIR Group: PV211: Boolean Retrieval 62 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 2 – IIR 02: The term vocabulary and postings lists Phrase queries: “Stanford University” Proximity queries: Gates near Microsoft We need an index that captures position information for phrase queries and proximity queries. Sojka, IIR Group: PV211: Boolean Retrieval 63 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 2 – IIR 03: Dictionaries and tolerant retrieval rd aboard ardent boardroom border or border lord morbid sordid bo aboard about boardroom border ✲ ✲ ✲ ✲ ✲ ✲ ✲ ✲ ✲ ✲ ✲ ✲ Sojka, IIR Group: PV211: Boolean Retrieval 64 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 3 – IIR 04: Index construction masterassign map phase reduce phase assign parser splits parser parser inverter postings inverter inverter a-f g-p q-z a-f g-p q-z a-f g-p q-z a-f segment files g-p q-z Sojka, IIR Group: PV211: Boolean Retrieval 65 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 4 – IIR 05: Index compression 0 1 2 3 4 5 6 01234567 log10 rank 7 log10cf Zipf’s law Sojka, IIR Group: PV211: Boolean Retrieval 66 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 4 – IIR 06: Scoring, term weighting and the vector space model Ranking search results Boolean queries only give inclusion or exclusion of documents. For ranked retrieval, we measure the proximity between the query and each document. One formalism for doing this: the vector space model Key challenge in ranked retrieval: evidence accumulation for a term in a document 1 vs. 0 occurrence of a query term in the document 3 vs. 2 occurrences of a query term in the document Usually: more is better But by how much? Need a scoring function that translates frequency into score or weight Sojka, IIR Group: PV211: Boolean Retrieval 67 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 5 – IIR 07: Scoring in a complete search system Documents Document cache Indexes k-gram Scoring parameters MLR training set Results page Indexers Parsing Linguistics user query Free text query parser Spell correction Scoring and ranking Tiered inverted positional index Inexact top K retrieval Metadata in zone and field indexes Sojka, IIR Group: PV211: Boolean Retrieval 68 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 5 – IIR 08: Evaluation and dynamic summaries Sojka, IIR Group: PV211: Boolean Retrieval 69 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 6 – Anatomy of the web-scale IR system Challenges in Building Large-Scale Information Retrieval Systems by Jeff Dean, Google Senior Fellow, jeff@google.com Sojka, IIR Group: PV211: Boolean Retrieval 70 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 7 – IIR 18: Latent Semantic Indexing Sojka, IIR Group: PV211: Boolean Retrieval 71 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 7 – CS276 14: Distributed Word Representations for Information retrieval Sojka, IIR Group: PV211: Boolean Retrieval 72 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 8 – IIR 09: Relevance feedback & query expansion Sojka, IIR Group: PV211: Boolean Retrieval 73 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours IIR 12: Language models q1 w P(w|q1) w P(w|q1) STOP 0.2 toad 0.01 the 0.2 said 0.03 a 0.1 likes 0.02 frog 0.01 that 0.04 . . . . . . This is a one-state probabilistic finite-state automaton – a unigram language model – and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops. frog said that toad likes frog STOP P(string) = 0.01 ·0.03 ·0.04 ·0.01 ·0.02 ·0.01 ·0.2 = 0.0000000000048 Sojka, IIR Group: PV211: Boolean Retrieval 74 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 8 – IIR 13: Text classification & Naive Bayes Text classification = assigning documents automatically to predefined classes Examples: Language (English vs. French) Adult content Region Sojka, IIR Group: PV211: Boolean Retrieval 75 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours NO week this term IIR 11: Probabilistic information retrieval document relevant (R = 1) nonrelevant (R = 0) Term present xt = 1 pt ut Term absent xt = 0 1 − pt 1 − ut O(R|q, x) = O(R|q) · t:xt=qt =1 pt ut · t:xt=0,qt =1 1 − pt 1 − ut (1) Sojka, IIR Group: PV211: Boolean Retrieval 76 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 9 – IIR 14: Vector classification, kNN search X X XX X X X X X X X ∗ Sojka, IIR Group: PV211: Boolean Retrieval 77 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 10 – IIR 15: Support vector machines, Learning to rank Sojka, IIR Group: PV211: Boolean Retrieval 78 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 11 – IIR 16: Flat clustering Sojka, IIR Group: PV211: Boolean Retrieval 79 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours NO week this term – IIR 17: Hierarchical clustering http://news.google.com Sojka, IIR Group: PV211: Boolean Retrieval 80 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 12 – IIR 19: The web and its challenges Unusual and diverse documents Unusual and diverse users and information needs Beyond terms and text: exploit link analysis, user data How do web search engines work? How can we make them better? Sojka, IIR Group: PV211: Boolean Retrieval 81 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 12 – IIR 20: Crawling www Fetch DNS Parse URL Frontier Content Seen? ✓ ✒ ✏ ✑ ✒✑ Doc FP’s ✓ ✒ ✏ ✑ ✒✑ URL set URL Filter Host splitter To other nodes From other nodes Dup URL Elim✲ ✛ ✲ ✻ ✛✲ ❄ ✻ ✲ ✲ ✲ ✲ ✛ ✻❄ ✻❄✻✻✻ ✲✲ ✲ Sojka, IIR Group: PV211: Boolean Retrieval 82 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 13 – IIR 21: Link analysis / PageRank Sojka, IIR Group: PV211: Boolean Retrieval 83 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Week 1 – Week 13: Related research seminars and courses Semianr FI:PV212 of LEMMA/MIR labs. MIR group’s solution for ARQMath 2022 @ CLEF2022 tasks: Math information Retrieval Question Answering and Formula searching Talks and brainstormings of TA’s and FI MU alumni’s talks (Řehůřek, Materna, Jurových,. . . )? Informatics colloquium related talk(s): Tomáš Mikolov 2019, or in 2017. MU on Coursera Sojka, IIR Group: PV211: Boolean Retrieval 84 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Take-away Basic information about the course, teachers, evaluation, exercises Boolean Retrieval: Design and data structures of a simple information retrieval system What topics will be covered in this class (overview)? Sojka, IIR Group: PV211: Boolean Retrieval 85 / 86 Introduction History Boolean model Inverted index Processing Boolean queries Query optimization Cours Resources Chapter 1 of IIR Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library course schedule and overview IIR textbook and other books (Baeta-Yates et al: Modern Information Retrieval, and other passed on during the lecture) Jupyter Hub/ Google Colab/ Deepnote environments with examples Shakespeare search engine https://www.rhymezone.com/shakespeare/ Sojka, IIR Group: PV211: Boolean Retrieval 86 / 86