Why rank? More on cosine The complete search system Implementation of ranking
PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 7: Scores in a complete search system
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2024-03-21
(compiled on 2024-02-20 00:50:46)
Sojka, IIR Group: PV211: Scores in a complete search system 1 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Overview
1 Why rank?
2 More on cosine
3 The complete search system
4 Implementation of ranking
Sojka, IIR Group: PV211: Scores in a complete search system 2 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Take-away today
The importance of ranking: User studies at Google
Length normalization: Pivot normalization
The complete search system
Implementation of ranking
Sojka, IIR Group: PV211: Scores in a complete search system 3 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Why is ranking so important?
Last lecture: Problems with unranked retrieval
Users want to look at a few results – not thousands.
It’s very hard to write queries that produce a few results.
Even for expert searchers
→ Ranking is important because it eﬀectively reduces a large
set of results to a very small one.
Next: More data on “users only look at a few results”
Actually, in the vast majority of cases they only examine 1, 2,
or 3 results.
Sojka, IIR Group: PV211: Scores in a complete search system 5 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Empirical investigation of the eﬀect of ranking
The following slides are from Dan Russell’s JCDL talk
Dan Russell was the “Über Tech Lead for Search Quality &
User Happiness” at Google.
How can we measure how important ranking is?
Observe what searchers do when they are searching in a
controlled setting
Videotape them
Ask them to “think aloud”
Interview them
Eye-track them
Time them
Record and count their clicks
Sojka, IIR Group: PV211: Scores in a complete search system 6 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Importance of ranking: Summary
Viewing abstracts: Users are a lot more likely to read the
abstracts of the top-ranked pages (1, 2, 3, 4) than the
abstracts of the lower ranked pages (7, 8, 9, 10).
Clicking: Distribution is even more skewed for clicking
In 1 out of 2 cases, users click on the top-ranked page.
Even if the top-ranked page is not relevant, 30% of users will
click on it.
→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.
Sojka, IIR Group: PV211: Scores in a complete search system 13 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Exercise
Ranking is also one of the high barriers to entry for
competitors to established players in the search engine market.
Why?
Sojka, IIR Group: PV211: Scores in a complete search system 14 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Why distance is a bad idea
0 1
0
1
rich
poor
q:[rich poor]
d1:Ranks of starving poets swell
d2:Rich poor gap grows
d3:Record baseball salaries in 2010
The Euclidean distance of q and d2 is large although the
distribution of terms in the query q and the distribution of terms in
the document d2 are very similar.
That’s why we do length normalization or, equivalently, use cosine
to compute query-document matching scores.
Sojka, IIR Group: PV211: Scores in a complete search system 16 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Exercise: A problem for cosine normalization
Query q: “anti-doping rules Beijing 2008 olympics”
Compare three documents
d1: a short document on anti-doping rules at 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other
news stories, all on topics diﬀerent from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens
Olympics
What ranking do we expect in the vector space model?
What can we do about this?
Sojka, IIR Group: PV211: Scores in a complete search system 17 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Pivot normalization
Cosine normalization produces weights that are too large for
short documents and too small for long documents (on
average).
Adjust cosine normalization by linear adjustment: “turning”
the average normalization on the pivot
Eﬀect: Similarities of short documents with query decrease;
similarities of long documents with query increase.
This removes the unfair advantage that short documents have.
Sojka, IIR Group: PV211: Scores in a complete search system 18 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Predicted and true probability of relevance
source:
Lillian Lee
Sojka, IIR Group: PV211: Scores in a complete search system 19 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Pivot normalization
source:
Lillian Lee
Sojka, IIR Group: PV211: Scores in a complete search system 20 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Pivoted normalization: Amit Singhal’s experiments
(relevant documents retrieved and (change in) average precision)
Sojka, IIR Group: PV211: Scores in a complete search system 21 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Complete search system
Sojka, IIR Group: PV211: Scores in a complete search system 23 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Tiered indexes
Basic idea:
Create several tiers of indexes, corresponding to importance of
indexing terms
During query processing, start with highest-tier index
If highest-tier index returns at least k (e.g., k = 100) results:
stop and return results to user
If we’ve only found < k hits: repeat for next index in tier
cascade
Example: two-tier system
Tier 1: Index of all titles
Tier 2: Index of the rest of documents
Pages containing the search words in the title are better hits
than pages containing the search words in the body of the text.
Sojka, IIR Group: PV211: Scores in a complete search system 24 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Tiered index
Tier 1
Tier 2
Tier 3
auto
best
car
insurance
auto
auto
best
car
car
insurance
insurance
best
Doc2
Doc1
Doc2
Doc1
Doc3
Doc3
Doc3
Doc1
Doc2
Sojka, IIR Group: PV211: Scores in a complete search system 25 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Tiered indexes
The use of tiered indexes is believed to be one of the reasons
that Google search quality was signiﬁcantly higher initially
(2000/01) than that of competitors.
(along with PageRank, use of anchor text and proximity
constraints)
Sojka, IIR Group: PV211: Scores in a complete search system 26 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Complete search system
Sojka, IIR Group: PV211: Scores in a complete search system 27 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Components we have introduced thus far
Document preprocessing (linguistic and otherwise)
Positional indexes
Tiered indexes
Spelling correction
k-gram indexes for wildcard queries and spelling correction
Query processing
Document scoring
Sojka, IIR Group: PV211: Scores in a complete search system 28 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Components we haven’t covered yet
Document cache: we need this for generating snippets (=
dynamic summaries)
Zone indexes: They separate the indexes for diﬀerent zones:
the body of the document, all highlighted text in the
document, anchor text, text in metadata ﬁelds,. . .
Machine-learned ranking functions
Proximity ranking (e.g., rank documents in which the query
terms occur in the same local window higher than documents
in which the query terms occur far from each other)
Query parser
Sojka, IIR Group: PV211: Scores in a complete search system 29 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Components we haven’t covered yet: Query parser
IR systems often guess what the user intended.
The two-term query London tower (without quotes) may be
interpreted as the phrase query “London tower”.
The query 100 Madison Avenue, New York may be interpreted
as a request for a map.
How do we “parse” the query and translate it into a formal
speciﬁcation containing phrase operators, proximity operators,
indexes to search, etc.?
Sojka, IIR Group: PV211: Scores in a complete search system 30 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Vector space retrieval: Interactions
How do we combine phrase retrieval with vector space
retrieval?
We do not want to compute document frequency / idf for
every possible phrase. Why?
How do we combine Boolean retrieval with vector space
retrieval?
For example: “+”-constraints and “−”-constraints
Postﬁltering is simple, but can be very ineﬃcient – no easy
answer.
How do we combine wild cards with vector space retrieval?
Again, no easy answer.
Sojka, IIR Group: PV211: Scores in a complete search system 31 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Exercise
Design criteria for tiered system
Each tier should be an order of magnitude smaller than the
next tier.
The top 100 hits for most queries should be in tier 1, the top
100 hits for most of the remaining queries in tier 2, etc.
We need a simple test for “can I stop at this tier or do I have
to go to the next one?”
There is no advantage to tiering if we have to hit most tiers
for most queries anyway.
Consider a two-tier system where the ﬁrst tier indexes titles
and the second tier everything.
Question: Can you think of a better way of setting up a
multitier system? Which “zones” of a document should be
indexed in the diﬀerent tiers (title, body of document,
others?)? What criterion do you want to use for including a
document in tier 1?
Sojka, IIR Group: PV211: Scores in a complete search system 32 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Now we also need term frequencies in the index
Brutus −→ 1,2 7,3 83,1 87,2 . . .
Caesar −→ 1,1 5,1 13,1 17,1 . . .
Calpurnia −→ 7,1 8,2 40,1 97,3
term frequencies
We also need positions. Not shown here.
Sojka, IIR Group: PV211: Scores in a complete search system 34 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Term frequencies in the inverted index
Thus: In each posting, store tft,d in addition to docID d.
As an integer frequency, not as a (log-)weighted real number
. . .
. . . because real numbers are diﬃcult to compress.
Overall, additional space requirements are small: a byte per
posting or less
Sojka, IIR Group: PV211: Scores in a complete search system 35 / 54
Why rank? More on cosine The complete search system Implementation of ranking
How do we compute the top k in ranking?
We usually do not need a complete ranking.
We just need the top k for a small k (e.g., k = 100).
If we don’t need a complete ranking, is there an eﬃcient way
of computing just the top k?
Naïve:
Compute scores for all N documents
Sort
Return the top k
Not very eﬃcient
Alternative: min heap
Sojka, IIR Group: PV211: Scores in a complete search system 36 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Use min heap for selecting top k ouf of N
A binary min heap is a binary tree in which each node’s value
is less than the values of its children.
Takes O(N log k) operations to construct (where N is the
number of documents) . . .
. . . then read oﬀ k winners in O(k log k) steps
Sojka, IIR Group: PV211: Scores in a complete search system 37 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Binary min heap
0.6
0.85 0.7
0.9 0.97 0.8 0.95
Sojka, IIR Group: PV211: Scores in a complete search system 38 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Selecting top k scoring documents in O(N log k)
Goal: Keep the top k documents seen so far
Use a binary min heap
To process a new document d′ with score s′:
Get current minimum hm of heap (O(1))
If s′
≤ hm skip to next document
If s′
> hm heap-delete-root (O(log k))
Heap-add d′
/s′
(O(log k))
Sojka, IIR Group: PV211: Scores in a complete search system 39 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Even more eﬃcient computation of top k?
Ranking has time complexity O(N) where N is the number of
documents.
Optimizations reduce the constant factor, but they are still
O(N), N > 1010
Are there sublinear algorithms?
What we’re doing in eﬀect: solving the k-nearest neighbor
(kNN) problem for the query vector (= query point).
There are no general solutions to this problem that are
sublinear.
Sojka, IIR Group: PV211: Scores in a complete search system 40 / 54
Why rank? More on cosine The complete search system Implementation of ranking
More eﬃcient computation of top k: Heuristics
Idea 1: Reorder postings lists
Instead of ordering according to docID . . .
. . . order according to some measure of “expected relevance”.
Idea 2: Heuristics to prune the search space
Not guaranteed to be correct . . .
. . . but fails rarely.
In practice, close to constant time.
For this, we’ll need the concepts of document-at-a-time
processing and term-at-a-time processing.
Sojka, IIR Group: PV211: Scores in a complete search system 41 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Non-docID ordering of postings lists
So far: postings lists have been ordered according to docID.
Alternative: a query-independent measure of “goodness”
(credibility) of a page
Example: PageRank g(d) of page d, a measure of how many
“good” pages hyperlink to d (chapter 21)
Order documents in postings lists according to PageRank:
g(d1) > g(d2) > g(d3) > . . .
Deﬁne composite score of a document:
net-score(q, d) = g(d) + cos(q, d)
This scheme supports early termination: We do not have to
process postings lists in their entirety to ﬁnd top k.
Sojka, IIR Group: PV211: Scores in a complete search system 42 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Non-docID ordering of postings lists (2)
Order documents in postings lists according to PageRank:
g(d1) > g(d2) > g(d3) > . . .
Deﬁne composite score of a document:
net-score(q, d) = g(d) + cos(q, d)
Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d
we’re currently processing; (iii) smallest top k score we’ve
found so far is 1.2
Then all subsequent scores will be < 1.1.
So we’ve already found the top k and can stop processing the
remainder of postings lists.
Questions?
Sojka, IIR Group: PV211: Scores in a complete search system 43 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Document-at-a-time processing
Both docID-ordering and PageRank-ordering impose a
consistent ordering on documents in postings lists.
Computing cosines in this scheme is document-at-a-time.
We complete computation of the query-document similarity
score of document di before starting to compute the
query-document similarity score of di+1.
Alternative: term-at-a-time processing
Sojka, IIR Group: PV211: Scores in a complete search system 44 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Weight-sorted postings lists
Idea: don’t process postings that contribute little to ﬁnal score
Order documents in postings list according to weight
Simplest case: normalized tf-idf weight (rarely done: hard to
compress)
Documents in the top k are likely to occur early in these
ordered lists.
→ Early termination while processing postings lists is unlikely
to change the top k.
But:
We no longer have a consistent ordering of documents in
postings lists.
We no longer can employ document-at-a-time processing.
Sojka, IIR Group: PV211: Scores in a complete search system 45 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Term-at-a-time processing
Simplest case: completely process the postings list of the ﬁrst
query term
Create an accumulator for each docID you encounter
Then completely process the postings list of the second query
term
. . . and so forth
Sojka, IIR Group: PV211: Scores in a complete search system 46 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Term-at-a-time processing
CosineScore(q)
1 ﬂoat Scores[N] = 0
2 ﬂoat Length[N]
3 for each query term t
4 do calculate wt,q and fetch postings list for t
5 for each pair(d, tft,d ) in postings list
6 do Scores[d]+ = wt,d × wt,q
7 Read the array Length
8 for each d
9 do Scores[d] = Scores[d]/Length[d]
10 return Top k components of Scores[]
The elements of the array “Scores” are called accumulators.
Sojka, IIR Group: PV211: Scores in a complete search system 47 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Computing cosine scores
Use inverted index
At query time use an array of accumulators A to store sum (=
the cosine score)
Aj =
k
wqk · wdj k
(for document dj)
“Accumulate” scores as postings lists are being processed.
Sojka, IIR Group: PV211: Scores in a complete search system 48 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Accumulators
For the web (20 billion documents), an array of
accumulators A in memory is infeasible.
Thus: Only create accumulators for docs occurring in postings
lists
This is equivalent to: Do not create accumulators for docs
with zero scores (i.e., docs that do not contain any of the
query terms)
Sojka, IIR Group: PV211: Scores in a complete search system 49 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Accumulators: Example
Brutus −→ 1,2 7,3 83,1 87,2 . . .
Caesar −→ 1,1 5,1 13,1 17,1 . . .
Calpurnia −→ 7,1 8,2 40,1 97,3
For query: [Brutus Caesar]:
Only need accumulators for 1, 5, 7, 13, 17, 83, 87
Don’t need accumulators for 3, 8, etc.
Sojka, IIR Group: PV211: Scores in a complete search system 50 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Enforcing conjunctive search
We can enforce conjunctive search (à la Google): only
consider documents (and create accumulators) if all terms
occur.
Example: just one accumulator for [Brutus Caesar] in the
example above . . .
. . . because only d1 contains both words.
Sojka, IIR Group: PV211: Scores in a complete search system 51 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Implementation of ranking: Summary
Ranking is very expensive in applications where we have to
compute similarity scores for all documents in the collection.
In most applications, the vast majority of documents have
similarity score 0 for a given query → lots of potential for
speeding things up.
However, there is no fast nearest neighbor algorithm that is
guaranteed to be correct even in this scenario.
In practice: use heuristics to prune search space – usually
works very well.
Sojka, IIR Group: PV211: Scores in a complete search system 52 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Take-away today
The importance of ranking: User studies at Google
Length normalization: Pivot normalization
The complete search system
Implementation of ranking
Sojka, IIR Group: PV211: Scores in a complete search system 53 / 54
Why rank? More on cosine The complete search system Implementation of ranking
Resources
Chapter 7 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/
and http://cislmu.org, materials in MU IS and FI MU
library
How Google tweaks its ranking function?
Interview with Google search guru Udi Manber
Amit Singhal on Google ranking
SEO perspective: ranking factors
Yahoo Search BOSS: Opens up the search engine to
developers. For example, you can rerank search results.
How Google uses eye tracking for improving search.
Sojka, IIR Group: PV211: Scores in a complete search system 54 / 54