Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 6: Scoring, Term Weighting and the Vector Space Model

This lecture; IIR Sections 6.2-6.4.3
§ Ranked retrieval
§ Scoring documents
§ Term frequency
§ Collection statistics
§ Weighting schemes
§ Vector space scoring

Ranked retrieval
§ Thus far, our queries have all been Boolean.
§ Documents either match or don't.
§ Good for expert users with a precise understanding of their needs and the collection.
§ Also good for applications: applications can easily consume 1000s of results.
§ Not good for the majority of users.
§ Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
§ Most users don't want to wade through 1000s of results.
§ This is particularly true of web search.

Problem with Boolean search: feast or famine
§ Boolean queries often result in either too few (=0) or too many (1000s) results.
§ Query 1: "standard user dlink 650" → 200,000 hits
§ Query 2: "standard user dlink 650 no card found" → 0 hits
§ It takes a lot of skill to come up with a query that produces a manageable number of hits.
§ AND gives too few; OR gives too many.

Ranked retrieval models
§ Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query.
§ Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
§ In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries, and vice versa.

Feast or famine: not a problem in ranked retrieval
§ When a system produces a ranked result set, large result sets are not an issue.
§ Indeed, the size of the result set is not an issue.
§ We just show the top k (≈ 10) results.
§ We don't overwhelm the user.
§ Premise: the ranking algorithm works.

Scoring as the basis of ranked retrieval
§ We wish to return, in order, the documents most likely to be useful to the searcher.
§ How can we rank-order the documents in the collection with respect to a query?
§ Assign a score – say in [0, 1] – to each document.
§ This score measures how well document and query "match".

Take 1: Jaccard coefficient
§ A commonly used measure of the overlap of two sets A and B:
§ jaccard(A,B) = |A ∩ B| / |A ∪ B|
§ jaccard(A,A) = 1
§ jaccard(A,B) = 0 if A ∩ B = ∅
§ A and B don't have to be the same size.
§ Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example
§ What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
§ Query: ides of march
§ Document 1: caesar died in march
§ Document 2: the long march

Issues with Jaccard for scoring
§ It doesn't consider term frequency (how many times a term occurs in a document).
§ Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.
§ We need a more sophisticated way of normalizing for length.
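As a quick illustration of the Jaccard scoring example above, here is a minimal Python sketch (not part of the original slides) that treats the query and each document as sets of words:

import math

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; define the empty/empty case as 0 to avoid division by zero
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1  = set("caesar died in march".split())
doc2  = set("the long march".split())

print(jaccard(query, doc1))   # 1/6 ≈ 0.17 (only "march" is shared)
print(jaccard(query, doc2))   # 1/5 = 0.2  (only "march" is shared)

Both documents overlap with the query only in "march", so Document 2 scores slightly higher simply because it is shorter, which illustrates why a more principled treatment of term frequency and length is needed.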
Query-document matching scores
§ We need a way of assigning a score to a query/document pair.
§ Let's start with a one-term query.
§ If the query term does not occur in the document: score should be 0.
§ The more frequent the query term in the document, the higher the score (should be).
§ We will look at a number of alternatives for this.

Recall (Lecture 2): Binary term-document incidence matrix

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0

Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-document count matrices
§ Consider the number of occurrences of a term in a document:
§ Each document is a count vector in ℕ^|V|: a column below.

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0

Bag of words model
§ The vector representation doesn't consider the ordering of words in a document.
§ John is quicker than Mary and Mary is quicker than John have the same vectors.
§ This is called the bag of words model.
§ In a sense, this is a step back: the positional index was able to distinguish these two documents.

Term frequency tf
§ The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
§ Note: frequency means count in IR.
§ We want to use tf when computing query-document match scores. But how?
§ Raw term frequency is not what we want:
§ A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
§ But not 10 times more relevant.
§ Relevance does not increase proportionally with term frequency.

Log-frequency weighting
§ The log frequency weight of term t in d is:
  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise
§ 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
§ Score for a document-query pair: sum over terms t in both q and d:
  score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})
§ The score is 0 if none of the query terms is present in the document.

Rare terms are more informative
§ Rare terms are more informative than frequent terms.
§ Recall stop words.
§ Consider a term in the query that is rare in the collection (e.g., arachnocentric).
§ A document containing this term is very likely to be relevant to the query arachnocentric.
§ → We want a high weight for rare terms like arachnocentric.

Collection vs. Document frequency
§ Collection frequency of t is the number of occurrences of t in the collection.
§ Document frequency of t is the number of documents in which t occurs.
§ Example:

  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760

§ Which word is the better search term (and should get the higher weight)?

idf weight
§ df_t is the document frequency of t: the number of documents that contain t.
§ df_t is an inverse measure of the informativeness of t.
§ df_t ≤ N.
§ We define the idf (inverse document frequency) of t by
  idf_t = log10(N / df_t)
§ We use log10(N/df_t) instead of N/df_t to "dampen" the effect of idf.
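A minimal Python sketch (not from the slides; names are illustrative) of the two quantities defined above, the log-frequency weight w_{t,d} and the inverse document frequency idf_t, which the following slides combine into tf-idf:

import math

def log_tf_weight(tf):
    # w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(N, df):
    # idf_t = log10(N / df_t)
    return math.log10(N / df)

# Reproduces the mapping on the log-frequency slide: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4
print([round(log_tf_weight(tf), 1) for tf in (0, 1, 2, 10, 1000)])

# With N = 1,000,000 documents, a term that occurs in 1,000 of them gets idf = 3
print(idf(1_000_000, 1_000))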
idf example, suppose N = 1 million

  term        df_t        idf_t
  calpurnia   1           6
  animal      100         4
  sunday      1,000       3
  fly         10,000      2
  under       100,000     1
  the         1,000,000   0

idf_t = log10(N / df_t)
There is one idf value for each term t in a collection.

Effect of idf on ranking
§ Does idf have an effect on ranking for one-term queries, like
§ iPhone
§ idf has no effect on ranking one-term queries.
§ idf affects the ranking of documents for queries with at least two terms.
§ For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

tf-idf weighting
§ The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
§ Best known weighting scheme in information retrieval.
§ Note: the "-" in tf-idf is a hyphen, not a minus sign!
§ Alternative names: tf.idf, tf x idf.
§ Increases with the number of occurrences within a document.
§ Increases with the rarity of the term in the collection.

Score for a document given a query
  Score(q,d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
§ There are many variants:
§ How "tf" is computed (with/without logs)
§ Whether the terms in the query are also weighted
§ …

Binary → count → weight matrix

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      5.25                   3.18            0             0        0         0.35
Brutus      1.21                   6.1             0             1        0         0
Caesar      8.59                   2.54            0             1.51     0.25      0
Calpurnia   0                      1.54            0             0        0         0
Cleopatra   2.85                   0               0             0        0         0
mercy       1.51                   0               1.9           0.12     5.25      0.88
worser      1.37                   0               0.11          4.15     0.25      1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as vectors
§ So we have a |V|-dimensional vector space.
§ Terms are axes of the space.
§ Documents are points or vectors in this space.
§ Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
§ These are very sparse vectors – most entries are zero.

Queries as vectors
§ Key idea 1: do the same for queries: represent them as vectors in the space.
§ Key idea 2: rank documents according to their proximity to the query in this space.
§ proximity = similarity of vectors
§ proximity ≈ inverse of distance

Formalizing vector space proximity
§ First cut: distance between two points
§ ( = distance between the end points of the two vectors)
§ Euclidean distance?
§ Euclidean distance is a bad idea . . .
§ . . . because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea
[Figure omitted.] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance
§ Thought experiment: take a document d and append it to itself. Call this document dʹ.
§ "Semantically" d and dʹ have the same content.
§ The Euclidean distance between the two documents can be quite large.
§ The angle between the two documents is 0, corresponding to maximal similarity.
§ Key idea: rank documents according to angle with query.
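To make the thought experiment concrete, here is a small sketch (not from the slides) using raw count vectors, where appending d to itself doubles every component:

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

d       = [3.0, 1.0, 0.0]        # an illustrative count vector for document d
d_prime = [2 * x for x in d]     # d appended to itself: every count doubles

print(euclidean(d, d_prime))     # ≈ 3.16: large, although the content is "the same"
print(cosine(d, d_prime))        # 1.0: the angle is 0, maximal similarity

The cosine of the angle, rather than the distance between the endpoints, is what the next slides turn into the scoring function.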
From angles to cosines
§ The following two notions are equivalent:
§ Rank documents in increasing order of the angle between query and document.
§ Rank documents in decreasing order of cosine(query, document).
§ Cosine is a monotonically decreasing function on the interval [0°, 180°].

From angles to cosines
§ But how should we be computing cosines?

Length normalization
§ A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:
  ||x||_2 = √( Σ_i x_i² )
§ Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
§ Effect on the two documents d and dʹ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
§ Long and short documents now have comparable weights.

cosine(query, document)
  cos(q, d) = (q · d) / (||q|| ||d||) = (q/||q||) · (d/||d||) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) √(Σ_{i=1}^{|V|} d_i²) )
§ q · d is the dot product; q/||q|| and d/||d|| are unit vectors.
§ q_i is the weight of term i in the query; d_i is the weight of term i in the document.
§ cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors
§ For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
  cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i
  for q, d length-normalized.

Cosine similarity illustrated
[Figure omitted.]

Cosine similarity amongst 3 documents
How similar are the novels
SaS: Sense and Sensibility,
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts):

  term        SaS   PaP   WH
  affection   115   58    20
  jealous     10    7     11
  gossip      2     0     6
  wuthering   0     0     38

Note: to simplify this example, we don't do idf weighting.

3 documents example contd.
Log frequency weighting:

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

After length normalization:

  term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Dot products of the un-normalized log-weighted vectors:
dot(SaS,PaP) ≈ 12.1
dot(SaS,WH) ≈ 13.4
dot(PaP,WH) ≈ 10.1

Computing cosine scores
[This slide gives pseudocode for computing cosine scores term-at-a-time over the postings lists; a sketch of the idea appears after the SMART-notation slide below.]

Computing cosine scores
§ The previous algorithm scores term-at-a-time (TAAT).
§ The algorithm can be adapted to scoring document-at-a-time (DAAT).
§ Storing w_{t,d} in each posting could be expensive
§ … because we'd have to store a floating point number.
§ For tf-idf scoring, it suffices to store tf_{t,d} in the posting and idf_t in the head of the postings list.
§ Extracting the top K items can be done with a priority queue (e.g., a heap).

tf-idf weighting has many variants
[Table of SMART weighting components (term frequency, document frequency, and normalization variants) not reproduced here.]

Weighting may differ in queries vs documents
§ Many search engines allow different weightings for queries vs. documents.
§ SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
§ A very standard weighting scheme is: lnc.ltc
§ Document: logarithmic tf (l as first character), no idf, and cosine normalization.
§ Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …
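The pseudocode referred to above is not reproduced in these notes; the following Python sketch (illustrative names and a hypothetical toy index, not the slides' exact algorithm) shows the term-at-a-time idea: walk each query term's postings list, accumulate w_{t,q} · w_{t,d} per document, normalize by document length, and pull the top K with a heap:

import heapq
import math
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, k=10):
    # Term-at-a-time: accumulate partial dot products per document.
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td
    # Normalize by document length; the query's own norm is the same for
    # every document, so omitting it does not change the ranking.
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    # Top K via a priority queue (heap), as on the slide.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical toy index: term -> list of (doc_id, w_{t,d}) pairs
postings = {"car": [(1, 1.0), (2, 0.5)], "insurance": [(1, 1.3)]}
doc_lengths = {1: math.sqrt(1.0**2 + 1.3**2), 2: 0.5}   # precomputed L2 norms
print(cosine_scores({"car": 2.0, "insurance": 3.0}, postings, doc_lengths, k=2))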
tf-idf example: lnc.ltc
Document: car insurance auto insurance
Query: best car insurance

Query columns (ltc): tf-raw, tf-wt, df, idf, wt, n'lize. Document columns (lnc): tf-raw, tf-wt, wt, n'lize.
Prod = normalized query weight × normalized document weight.

  Term        Q:tf-raw  Q:tf-wt  df      idf  Q:wt  Q:n'lize  D:tf-raw  D:tf-wt  D:wt  D:n'lize  Prod
  auto        0         0        5000    2.3  0     0         1         1        1     0.52      0
  best        1         1        50000   1.3  1.3   0.34      0         0        0     0         0
  car         1         1        10000   2.0  2.0   0.52      1         1        1     0.52      0.27
  insurance   1         1        1000    3.0  3.0   0.78      2         1.3      1.3   0.68      0.53

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
(A short code sketch reproducing this computation appears at the end of these notes.)

Summary – vector space ranking
§ Represent the query as a weighted tf-idf vector.
§ Represent each document as a weighted tf-idf vector.
§ Compute the cosine similarity score for the query vector and each document vector.
§ Rank documents with respect to the query by score.
§ Return the top K (e.g., K = 10) to the user.

Resources for today's lecture
§ IIR 6.2 – 6.4.3
§ http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
§ Term weighting and cosine similarity tutorial for SEO folk!
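As promised above, a minimal sketch (not part of the slides) that reproduces the lnc.ltc worked example, using the idf values given on that slide:

import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    # Divide each component by the vector's L2 norm (cosine normalization).
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()}

idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}   # as on the slide
query_tf = {"best": 1, "car": 1, "insurance": 1}                 # "best car insurance"
doc_tf   = {"auto": 1, "car": 1, "insurance": 2}                 # "car insurance auto insurance"

q = normalize({t: log_tf(tf) * idf[t] for t, tf in query_tf.items()})   # ltc: log tf, idf, cosine norm
d = normalize({t: log_tf(tf) for t, tf in doc_tf.items()})              # lnc: log tf, no idf, cosine norm

score = sum(w * d.get(t, 0.0) for t, w in q.items())
print(round(score, 1))   # 0.8, matching the slide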