Big picture Ads Duplicate detection Spam Web IR Size of the web
PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 19: Web search
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2024-05-08
(compiled on 2024-06-19 22:01:56)
Sojka, IIR Group: PV211: Web search 1 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Overview
1 Big picture
2 Ads
3 Duplicate detection
4 Spam
5 Web IR
Queries
Links
Context
Users
Documents
Size
6 Size of the web
Sojka, IIR Group: PV211: Web search 2 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Web search overview
Sojka, IIR Group: PV211: Web search 4 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Search is a top activity on the web
2008:
2024: https://fitsmallbusiness.com/google-search-statistics/
Sojka, IIR Group: PV211: Web search 5 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Without search engines, the web wouldn’t work
Without search, content is hard to ﬁnd.
→ Without search, there is no incentive to create content.
Why publish something if nobody will read it?
Why publish something if I don’t get ad revenue from it?
Somebody needs to pay for the web.
Servers, web infrastructure, content creation
A large part today is paid by search ads.
Search pays for the web.
Sojka, IIR Group: PV211: Web search 6 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Interest aggregation
Unique feature of the web: A small number of geographically
dispersed people with similar interests can ﬁnd each other.
Elementary school kids with hemophilia
People interested in translating R5 Scheme into relatively
portable C (open source project)
Search engines are a key enabler for interest aggregation.
Sojka, IIR Group: PV211: Web search 7 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
IR on the web vs. IR in general
On the web, search is not just a nice feature.
Search is a key enabler of the web: . . .
. . . ﬁnancing, content creation, interest aggregation, etc.
→ look at search ads
The web is a chaotic and uncoordinated collection. → lots of
duplicates – need to detect duplicates
No control / restrictions on who can author content → lots of
spam – need to detect spam
The web is very large. → need to know how big it is
Sojka, IIR Group: PV211: Web search 8 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Take-away today
Big picture
Ads – they pay for the web
Duplicate detection – addresses one aspect of chaotic content
creation
Spam detection – addresses one aspect of lack of central
access control
Probably won’t get to today
Web information retrieval
Size of the web
Sojka, IIR Group: PV211: Web search 9 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
First generation of search ads: Goto (1996)
Sojka, IIR Group: PV211: Web search 11 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
First generation of search ads: Goto (1996)
Buddy Blake bid the maximum ($0.38) for this search.
He paid $0.38 to Goto every time somebody clicked on the
link.
Pages were simply ranked according to bid – revenue
maximization for Goto.
No separation of ads/docs. Only one result list!
Upfront and honest. No relevance ranking, . . .
. . . but Goto did not pretend there was any.
Sojka, IIR Group: PV211: Web search 12 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Second generation of search ads: Google (2000/2001)
Strict separation of search results and search ads
Sojka, IIR Group: PV211: Web search 13 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Two ranked lists: web pages (left) and ads (right) (2001)
SogoTrade appears
in search
results.
SogoTrade appears
in ads.
Do search engines
rank advertisers
higher than non-
advertisers?
All major search
engines claim no.
Sojka, IIR Group: PV211: Web search 14 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Do ads inﬂuence editorial content?
Similar problem at newspapers / TV channels
A newspaper is reluctant to publish harsh criticism of its
major advertisers.
The line often gets blurred at newspapers / on TV.
No known case of this happening with search engines yet?
Sojka, IIR Group: PV211: Web search 15 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
How are the ads on the right ranked?
Sojka, IIR Group: PV211: Web search 16 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
How are ads ranked?
Advertisers bid for keywords – sale by auction.
Open system: Anybody can participate and bid on keywords.
Advertisers are only charged when somebody clicks on your ad.
How does the auction determine an ad’s rank and the price
paid for the ad?
Basis is a second price auction, but with twists
For the bottom line, this is perhaps the most important
research area for search engines – computational advertising.
Squeezing an additional fraction of a cent from each ad means
billions of additional revenue for the search engine.
Sojka, IIR Group: PV211: Web search 17 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Ranking ads
Selecting the ads to show for a query and ranking them is a
ranking problem . . .
. . . similar to the document ranking problem.
Key diﬀerence: The bid price of each ad is a factor in ranking
that we didn’t have in document ranking.
First cut: rank advertisers according to bid price
Sojka, IIR Group: PV211: Web search 18 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
How are ads ranked?
First cut: according to bid price à la Goto
Bad idea: open to abuse
Example: query [treatment for cancer?] → how to write your
last will
We don’t want to show nonrelevant or oﬀensive ads.
Instead: rank based on bid price and relevance
Key measure of ad relevance: clickthrough rate
clickthrough rate = CTR = clicks per impressions
Result: A nonrelevant ad will be ranked low.
Even if this decreases search engine revenue short-term
Hope: Overall acceptance of the system and overall revenue is
maximized if users get useful information.
Other ranking factors: location, time of day, quality and
loading speed of landing page
The main ranking factor: the query
Sojka, IIR Group: PV211: Web search 19 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Google’s second price auction
advertiser bid CTR ad rank rank paid
A $4.00 0.01 0.04 4 (minimum)
B $3.00 0.03 0.09 2 $2.68
C $2.00 0.06 0.12 1 $1.51
D $1.00 0.08 0.08 3 $0.51
bid: maximum bid for a click by advertiser
CTR: click-through rate: when an ad is displayed, what
percentage of time do users click on it? CTR is a measure of
relevance.
ad rank: bid × CTR: this trades oﬀ (i) how much money the
advertiser is willing to pay against (ii) how relevant the ad is
rank: rank in auction
paid: second price auction price paid by advertiser
Sojka, IIR Group: PV211: Web search 20 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Google’s second price auction (cont.)
advertiser bid CTR ad rank rank paid
A $4.00 0.01 0.04 4 (minimum)
B $3.00 0.03 0.09 2 $2.68
C $2.00 0.06 0.12 1 $1.51
D $1.00 0.08 0.08 3 $0.51
Second price auction: The advertiser pays the minimum amount
necessary to maintain their position in the auction (plus 1 cent).
price1 × CTR1 = bid2 × CTR2 (this will result in rank1=rank2)
price1 = bid2 × CTR2 / CTR1
p1 = bid2 × CTR2/CTR1 = 3.00 × 0.03/0.06 = 1.50
p2 = bid3 × CTR3/CTR2 = 1.00 × 0.08/0.03 = 2.67
p3 = bid4 × CTR4/CTR3 = 4.00 × 0.01/0.08 = 0.50
Sojka, IIR Group: PV211: Web search 21 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Keywords with high bids
In 2008, on http://www.cwire.org:80/highest-paying-search-terms/
highest paying search terms were
$69.1 mesothelioma treatment options
$65.9 personal injury lawyer michigan
$62.6 student loans consolidation
$61.4 car accident attorney los angeles
$59.4 online car insurance quotes
$59.4 arizona dui lawyer
$46.4 asbestos cancer
$40.1 home equity line of credit
$39.8 life insurance quotes
$39.2 reﬁnancing
$38.7 equity line of credit
$38.0 lasik eye surgery new york city
$37.0 2nd mortgage
$35.9 free car insurance quote
Compare today’s ones on
https://www.wordstream.com/articles/most-expensive-keywords
Sojka, IIR Group: PV211: Web search 22 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Search ads: A win-win-win?
The search engine company gets revenue every time
somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad.
Search engines punish misleading and nonrelevant ads.
As a result, users are often satisﬁed with what they ﬁnd after
clicking on an ad.
The advertiser ﬁnds new customers in a cost-eﬀective way.
Sojka, IIR Group: PV211: Web search 23 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Exercise
Why is web search potentially more attractive for advertisers
than TV spots, newspaper ads or radio spots?
The advertiser pays for all this. How can the advertiser be
cheated?
Any way this could be bad for the user?
Any way this could be bad for the search engine?
Sojka, IIR Group: PV211: Web search 24 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Not a win-win-win: Keyword arbitrage
Buy a keyword on Google
Then redirect traﬃc to a third party that is paying much more
than you are paying Google.
E.g., redirect to a page full of ads
This rarely makes sense for the user.
Ad spammers keep inventing new tricks.
The search engines need time to catch up with them.
Sojka, IIR Group: PV211: Web search 25 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Not a win-win-win: Violation of trademarks
Example: geico
During part of 2005: The search term “geico” on Google was
bought by competitors.
Geico lost this case in the United States.
Louis Vuitton lost similar case in Europe in 2005: see archival
webpage www.google.com/tm_complaint_adwords.html
It’s potentially misleading to users to trigger an ad oﬀ of a
trademark if the user can’t buy the product on the site.
Sojka, IIR Group: PV211: Web search 26 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Duplicate detection
The web is full of duplicated content.
More so than many other collections
Exact duplicates
Easy to eliminate
E.g., use hash/ﬁngerprint
Near-duplicates
Abundant on the web
Diﬃcult to eliminate
For the user, it’s annoying to get a search result with
near-identical documents.
Marginal relevance is zero: even a highly relevant document
becomes non-relevant if it appears below a (near-)duplicate.
We need to eliminate near-duplicates.
Sojka, IIR Group: PV211: Web search 28 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Near-duplicates: Example
Sojka, IIR Group: PV211: Web search 29 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Exercise
How would you eliminate near-duplicates on the web?
Sojka, IIR Group: PV211: Web search 30 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Detecting near-duplicates
Compute similarity with an edit-distance measure
We want “syntactic” (as opposed to semantic) similarity.
True semantic similarity (similarity in content) is too diﬃcult
to compute.
We do not consider documents near-duplicates if they have
the same content, but express it with diﬀerent words.
Use similarity threshold θ to make the call “is/isn’t a
near-duplicate”.
E.g., two documents are near-duplicates if similarity
> θ = 80%.
Sojka, IIR Group: PV211: Web search 31 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Represent each document as set of shingles
A shingle is simply a word n-gram.
Shingles are used as features to measure syntactic similarity of
documents.
For example, for n = 3, “a rose is a rose is a rose” would be
represented as this set of shingles:
{ a-rose-is, rose-is-a, is-a-rose }
We can map shingles to 1..2m (e.g., m = 64) by ﬁngerprinting.
From now on: sk refers to the shingle’s ﬁngerprint in 1..2m.
We deﬁne the similarity of two documents as the Jaccard
coeﬃcient of their shingle sets.
Sojka, IIR Group: PV211: Web search 32 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Recall: Jaccard coeﬃcient
A commonly used measure of overlap of two sets
Let A and B be two sets
Jaccard coeﬃcient:
jaccard(A, B) =
|A ∩ B|
|A ∪ B|
(A = ∅ or B = ∅)
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = 0
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
Sojka, IIR Group: PV211: Web search 33 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Jaccard coeﬃcient: Example
Three documents:
d1: “Jack London traveled to Oakland”
d2: “Jack London traveled to the city of Oakland”
d3: “Jack traveled from Oakland to London”
Based on shingles of size 2 (2-grams or bigrams), what are the
Jaccard coeﬃcients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375
J(d1, d3) = 0
Note: very sensitive to dissimilarity
Sojka, IIR Group: PV211: Web search 34 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Represent each document as a sketch
The number of shingles per document is large.
To increase eﬃciency, we will use a sketch, a cleverly chosen
subset of the shingles of a document.
The size of a sketch is, say, n = 200 . . .
. . . and is deﬁned by a set of permutations π1 . . . π200.
Each πi is a random permutation on 1..2m
The sketch of d is deﬁned as:
< mins∈d π1(s), mins∈d π2(s), . . . , mins∈d π200(s) >
(a vector of 200 numbers).
Sojka, IIR Group: PV211: Web search 35 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Permutation and minimum: Example
document 1: {sk } document 2: {sk}
✲
✲
✲
✲
✲
✲
✲
✲
1
1
1
1
1
1
1
1
2m
2m
2m
2m
2m
2m
2m
2ms
s1
s
s1
s
s2
s
s5
s
s3
s
s3
s
s4
s
s4
xk = π(sk ) xk = π(sk)
s ss ss ss s
x3
❝
x3
❝
x1
❝
x1
❝
x4
❝
x4
❝
x2
❝
x5
❝
x3
❝
x3
❝
x1
❝
x1
❝
x4
❝
x5
❝
x2
❝
x2
❝
xk xk
x3
❝
x3
❝
minsk
π(sk) minsk
π(sk)
We use mins∈d1
π(s) = mins∈d2
π(s) as a test for: are d1 and d2
near-duplicates? In this case: permutation π says: d1 ≈ d2
Sojka, IIR Group: PV211: Web search 36 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Computing Jaccard for sketches
Sketches: Each document is now a vector of n = 200
numbers.
Much easier to deal with than the very high-dimensional space
of shingles
But how do we compute Jaccard?
Sojka, IIR Group: PV211: Web search 37 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Computing Jaccard for sketches (2)
How do we compute Jaccard?
Let U be the union of the set of shingles of d1 and d2 and I
the intersection.
There are |U|! permutations on U.
For s′ ∈ I, for how many permutations π do we have
arg mins∈d1
π(s) = s′ = arg mins∈d2
π(s)?
Answer: (|U| − 1)!
There is a set of (|U| − 1)! diﬀerent permutations for each s
in I. ⇒ |I|(|U| − 1)! permutations make
arg mins∈d1
π(s) = arg mins∈d2
π(s) true
Thus, the proportion of permutations that make
mins∈d1
π(s) = mins∈d2
π(s) true is:
|I|(|U| − 1)!
|U|!
=
|I|
|U|
= J(d1, d2)
Sojka, IIR Group: PV211: Web search 38 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard
coeﬃcient.
Permutation π is successful iﬀ mins∈d1 π(s) = mins∈d2 π(s)
Picking a permutation at random and outputting 1
(successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of probability of success: proportion of successes in
n Bernoulli trials. (n = 200)
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful
permutations for < d1, d2 > and divide by n = 200.
k/n = k/200 estimates J(d1, d2).
Sojka, IIR Group: PV211: Web search 39 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Implementation
We use hash functions as an eﬃcient type of permutation:
hi : {1..2m} → {1..2m}
Scan all shingles sk in union of two sets in arbitrary order
For each hash function hi and documents d1, d2, . . .: keep slot
for minimum value found so far
If hi (sk ) is lower than minimum found so far: update slot
Sojka, IIR Group: PV211: Web search 40 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Example
d1 d2
s1 1 0
s2 0 1
s3 1 1
s4 1 0
s5 0 1
h(x) = x mod 5
g(x) = (2x + 1) mod 5
min(h(d1)) = 1 = 0 =
min(h(d2))
min(g(d1)) = 2 = 0 =
min(g(d2))
ˆJ(d1, d2) = 0+0
2 = 0
d1 slot d2 slot
h ∞ ∞
g ∞ ∞
h(1) = 1 1 1 – ∞
g(1) = 3 3 3 – ∞
h(2) = 2 – 1 2 2
g(2) = 0 – 3 0 0
h(3) = 3 3 1 3 2
g(3) = 2 2 2 2 0
h(4) = 4 4 1 – 2
g(4) = 4 4 2 – 0
h(5) = 0 – 1 0 0
g(5) = 1 – 2 1 0
ﬁnal sketches
Sojka, IIR Group: PV211: Web search 41 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Exercise
d1 d2 d3
s1 0 1 1
s2 1 0 1
s3 0 1 0
s4 1 0 0
h(x) = 5x + 5 mod 4
g(x) = (3x + 1) mod 4
Estimate ˆJ(d1, d2), ˆJ(d1, d3), ˆJ(d2, d3)
Sojka, IIR Group: PV211: Web search 42 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Solution (1)
d1 d2 d3
s1 0 1 1
s2 1 0 1
s3 0 1 0
s4 1 0 0
h(x) = 5x + 5 mod 4
g(x) = (3x + 1) mod 4
d1 slot d2 slot d3 slot
∞ ∞ ∞
∞ ∞ ∞
h(1) = 2 – ∞ 2 2 2 2
g(1) = 0 – ∞ 0 0 0 0
h(2) = 3 3 3 – 2 3 2
g(2) = 3 3 3 – 0 3 0
h(3) = 0 – 3 0 0 – 2
g(3) = 2 – 3 2 0 – 0
h(4) = 1 1 1 – 0 – 2
g(4) = 1 1 1 – 0 – 0
ﬁnal sketches
Sojka, IIR Group: PV211: Web search 43 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Solution (2)
ˆJ(d1, d2) =
0 + 0
2
= 0
ˆJ(d1, d3) =
0 + 0
2
= 0
ˆJ(d2, d3) =
0 + 1
2
= 1/2
Sojka, IIR Group: PV211: Web search 44 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Shingling: Summary
Input: N documents
Choose n-gram size for shingling, e.g., n = 5
Pick 200 random permutations, represented as hash functions
Compute N sketches: 200 × N matrix shown on previous
slide, one row per permutation, one column per document
Compute N·(N−1)
2 pairwise similarities
Transitive closure of documents with similarity > θ
Index only one document from each equivalence class
Sojka, IIR Group: PV211: Web search 45 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Eﬃcient near-duplicate detection
Now we have an extremely eﬃcient method for estimating a
Jaccard coeﬃcient for a single pair of two documents.
But we still have to estimate O(N2) coeﬃcients where N is
the number of web pages.
Still intractable
One solution: locality sensitive hashing (LSH)
Another solution: sorting (Henzinger 2006)
Sojka, IIR Group: PV211: Web search 46 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
The goal of spamming on the web
You have a page that will generate lots of revenue for you if
people visit it.
Therefore, you would like to direct visitors to this page.
One way of doing this: get your page ranked highly in search
results.
Exercise: How can I get my page ranked highly?
Sojka, IIR Group: PV211: Web search 48 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Spam technique: Keyword stuﬃng / Hidden text
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc.
Used to be very eﬀective, most search engines now catch these
Sojka, IIR Group: PV211: Web search 49 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Keyword stuﬃng
Sojka, IIR Group: PV211: Web search 50 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Spam technique: Doorway and lander pages
Doorway page: optimized for a single keyword, redirects to
the real target page
Lander page: optimized for a single keyword or a misspelled
domain name, designed to attract surfers who will then click
on ads
Sojka, IIR Group: PV211: Web search 51 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Lander page
Number one hit on Google for the search “composita”
The only purpose of this page: get people to click on the ads
and make money for the page owner
Sojka, IIR Group: PV211: Web search 52 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Spam technique: Duplication
Get good content from somewhere (steal it or produce it
yourself)
Publish a large number of slight variations of it
For example, publish the answer to a tax question with the
spelling variations of “tax deferred” on the previous slide
Sojka, IIR Group: PV211: Web search 53 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Spam technique: Cloaking
Serve fake content to search engine spider
So do we just penalize this always?
No: legitimate uses (e.g., diﬀerent content to US vs.
European users)
Sojka, IIR Group: PV211: Web search 54 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Spam technique: Link spam
Create lots of links pointing to the page you want to promote
Put these links on pages with high (or at least non-zero)
PageRank
Newly registered domains (domain ﬂooding)
A set of pages that all point to each other to boost each
other’s PageRank (mutual admiration society)
Pay somebody to put your link on their highly ranked page
(“schuetze horoskop” example)
Leave comments that include the link on blogs
Sojka, IIR Group: PV211: Web search 55 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
SEO: Search engine optimization
Promoting a page in the search rankings is not necessarily
spam.
It can also be a legitimate business – which is called SEO.
You can hire an SEO ﬁrm to get your page highly ranked.
There are many legitimate reasons for doing this.
For example, Google bombs like Who is a failure?
And there are many legitimate ways of achieving this:
Restructure your content in a way that makes it easy to index
Talk with inﬂuential bloggers and have them link to your site
Add more interesting and original content
Sojka, IIR Group: PV211: Web search 56 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
The war against spam
Quality indicators
Links, statistically analyzed (PageRank, etc.)
Usage (users visiting a page)
No adult content (e.g., no pictures with ﬂesh-tone)
Distribution and structure of text (e.g., no keyword stuﬃng)
Combine all of these indicators and use machine learning
Editorial intervention
Blacklists
Top queries audited
Complaints addressed
Suspect patterns detected
Sojka, IIR Group: PV211: Web search 57 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Webmaster guidelines
Major search engines have guidelines for webmasters.
These guidelines tell you what is legitimate SEO and what is
spamming.
Ignore these guidelines at your own risk
Once a search engine identiﬁes you as a spammer, all pages
on your site may get low ranks (or disappear from the index
entirely).
There is often a ﬁne line between spam and legitimate SEO.
Scientiﬁc study of ﬁghting spam on the web: adversarial
information retrieval
Sojka, IIR Group: PV211: Web search 58 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Web IR: Diﬀerences from traditional IR
Links: The web is a hyperlinked document collection.
Queries: Web queries are diﬀerent, more varied and there are
a lot of them. How many? ≈ 109
Users: Users are diﬀerent, more varied and there are a lot of
them. How many? ≈ 109
Documents: Documents are diﬀerent, more varied and there
are a lot of them. How many? ≈ 1011
Context: Context is more important on the web than in many
other IR applications.
Ads and spam
Sojka, IIR Group: PV211: Web search 60 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Query distribution (1)
Most frequent queries on a large search engine on 2002.10.26.
1 sex 16 crack 31 juegos 46 Caramail
2 (artifact) 17 games 32 nude 47 msn
3 (artifact) 18 pussy 33 music 48 jennifer lopez
4 porno 19 cracks 34 musica 49 tits
5 mp3 20 lolita 35 anal 50 free porn
6 Halloween 21 britney spears 36 free6 51 cheats
7 sexo 22 ebay 37 avril lavigne 52 yahoo.com
8 chat 23 sexe 38 hotmail.com 53 eminem
9 porn 24 Pamela Anderson 39 winzip 54 Christina Aguilera
10 yahoo 25 warez 40 fuck 55 incest
11 KaZaA 26 divx 41 wallpaper 56 letras de canciones
12 xxx 27 gay 42 hotmail.com 57 hardcore
13 Hentai 28 harry potter 43 postales 58 weather
14 lyrics 29 playboy 44 shakira 59 wallpapers
15 hotmail 30 lolitas 45 traductor 60 lingerie
More than 1/3 of these are queries for adult content. Exercise: Does this
mean that most people are looking for adult content?
Sojka, IIR Group: PV211: Web search 62 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Query distribution (2)
Queries have a power law distribution.
Recall Zipf’s law: a few very frequent words, a large number
of very rare words
Same here: a few very frequent queries, a large number of
very rare queries
Examples of rare queries: search for names, towns, books, etc.
The proportion of adult queries is much lower than 1/3
Sojka, IIR Group: PV211: Web search 63 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Types of queries / user needs in web search
Informational user needs: I need information on something.
“low hemoglobin”
We called this “information need” earlier in the class.
On the web, information needs proper are only a subclass of
user needs.
Other user needs: Navigational and transactional
Navigational user needs: I want to go to this web site.
“hotmail”, “myspace”, “United Airlines”
Transactional user needs: I want to make a transaction.
Buy something: “MacBook Air”
Download something: “Acrobat Reader”
Chat with someone: “live soccer chat”
Diﬃcult problem: How can the search engine tell what the
user need or intent for a particular query is?
Sojka, IIR Group: PV211: Web search 64 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Search in a hyperlinked collection
Web search in most cases is interleaved with navigation . . .
. . . i.e., with following links.
Diﬀerent from most other IR collections
Sojka, IIR Group: PV211: Web search 66 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Bowtie structure of the web
Strongly connected component (SCC) in the center
Lots of pages that get linked to, but don’t link (OUT)
Lots of pages that link to other pages, but don’t get linked to (IN)
Tendrils, tubes, islands
Sojka, IIR Group: PV211: Web search 68 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
User intent: Answering the need behind the query
What can we do to guess user intent?
Guess user intent independent of context:
Spell correction
Precomputed “typing” of queries (next slide)
Better: Guess user intent based on context:
Geographic context (slide after next)
Context of user in this session (e.g., previous query)
Context provided by personal proﬁle (Yahoo/MSN do this,
Google claims it does not)
Sojka, IIR Group: PV211: Web search 70 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Guessing of user intent by “typing” queries
Calculation: 5+4
Unit conversion: 1 kg in pounds
Currency conversion: 1 euro in kronor
Tracking number: 8167 2278 6764
Flight info: LH 454
Area code: 650
Map: columbus oh
Stock price: msft
Albums/movies, etc.: coldplay
Sojka, IIR Group: PV211: Web search 71 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
The spatial context: Geo-search
Three relevant locations
Server (nytimes.com → New York)
Web page (nytimes.com article about Albania)
User (located in Palo Alto)
Locating the user
IP address
Information provided by user (e.g., in user proﬁle)
Mobile phone
Geo-tagging: Parse text and identify the coordinates of the
geographic entities
Example: East Palo Alto CA → Latitude: 37.47 N, Longitude:
122.14 W
Important NLP problem
Sojka, IIR Group: PV211: Web search 72 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
How do we use context to modify query results?
Result restriction: Don’t consider inappropriate results
For user on google.fr . . .
. . . only show .fr results
Ranking modulation: use a rough generic ranking, rerank
based on personal context
Contextualization / personalization is an area of search with a
lot of potential for improvement.
Sojka, IIR Group: PV211: Web search 73 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Users of web search
Use short queries (average < 3)
Rarely use operators
Do not want to spend a lot of time on composing a query
Only look at the ﬁrst couple of results
Want a simple UI, not a search engine start page overloaded
with graphics
Extreme variability in terms of user needs, user expectations,
experience, knowledge, . . .
Industrial/developing world, English/Estonian, old/young,
rich/poor, diﬀerences in culture and class
One interface for hugely divergent needs
Sojka, IIR Group: PV211: Web search 75 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
How do users evaluate search engines?
Classic IR relevance (as measured by F) can also be used for
web IR.
Equally important: Trust, duplicate elimination, readability,
loads fast, no pop-ups
On the web, precision is more important than recall.
Precision at 1, precision at 10, precision on the ﬁrst 2–3 pages
But there is a subset of queries where recall matters.
Sojka, IIR Group: PV211: Web search 76 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Web information needs that require high recall
Has this idea been patented?
Searching for info on a prospective ﬁnancial advisor
Searching for info on a prospective employee
Searching for info on a date
Sojka, IIR Group: PV211: Web search 77 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Web documents: diﬀerent from other IR collections
Distributed content creation: no design, no coordination
“Democratization of publishing”
Result: extreme heterogeneity of documents on the web
Unstructured (text, html), semistructured (html, xml),
structured/relational (databases)
Dynamically generated content
Sojka, IIR Group: PV211: Web search 79 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Dynamic content
Dynamic pages are generated from scratch when the user
requests them – usually from underlying data in a database.
Example: current status of ﬂight LH 454
Sojka, IIR Group: PV211: Web search 80 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Dynamic content (2)
Most (truly) dynamic content is ignored by web spiders.
It’s too much to index it all.
Actually, a lot of “static” content is also assembled on the ﬂy
(asp, php, etc.: headers, date, ads, etc.)
Sojka, IIR Group: PV211: Web search 81 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Web pages change frequently (Fetterly 1997)
Sojka, IIR Group: PV211: Web search 82 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Multilinguality
Documents in a large number of languages
Queries in a large number of languages
First cut: Don’t return English results for a Japanese query
However: Frequent mismatches query/document languages
Many people can understand, but not query in a language.
Translation is important.
Google example: “Beaujolais Nouveau -wine”
Sojka, IIR Group: PV211: Web search 83 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Duplicate documents
Signiﬁcant duplication – 30%–40% duplicates in some studies.
Duplicates in the search results were common in the early
days of the web.
Today’s search engines eliminate duplicates very eﬀectively.
Key for high user satisfaction.
Sojka, IIR Group: PV211: Web search 84 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Trust
For many collections, it is easy to assess the trustworthiness of
a document.
A collection of Reuters newswire articles
A collection of TASS (Telegraph Agency of the Soviet Union)
newswire articles from the 1980s
Your Outlook email from the last three years
Web documents are diﬀerent: In many cases, we don’t know
how to evaluate the information.
Hoaxes abound.
Sojka, IIR Group: PV211: Web search 85 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Growth of the web
The web keeps growing.
But growth is no longer exponential?
Sojka, IIR Group: PV211: Web search 87 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Size of the web: Issues
What is size? Number of web servers? Number of pages?
Terabytes of data available?
Some servers are seldom connected.
Example: Your laptop running a web server
Is it part of the web?
The “dynamic” web is inﬁnite.
Any sum of two numbers is its own dynamic page on Google.
(Example: “2+4”)
Sojka, IIR Group: PV211: Web search 88 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
“Search engine index contains N pages”: Issues
Can I claim a page is in the index if I only index the ﬁrst
4,000 bytes?
Can I claim a page is in the index if I only index anchor text
pointing to the page?
There used to be (and still are?) billions of pages that are only
indexed by anchor text.
Sojka, IIR Group: PV211: Web search 89 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Simple method for determining a lower bound
OR-query of frequent words in a number of languages
According to this query: Size of web ≥ 21,450,000,000 on
2007.07.07 and ≥ 25,350,000,000 on 2008.07.03
But page counts of Google search results are only rough
estimates.
Sojka, IIR Group: PV211: Web search 90 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Size of the web: Who cares?
Media
Users
They may switch to the search engine that has the best
coverage of the web.
Users (sometimes) care about recall. If we underestimate the
size of the web, search engine results may have low recall.
Search engine designers (how many pages do I need to be able
to handle?)
Crawler designers (which policy will crawl close to N pages?)
Sojka, IIR Group: PV211: Web search 92 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
What is the size of the web? Any guesses?
Sojka, IIR Group: PV211: Web search 93 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Simple method for determining a lower bound
OR-query of frequent words in a number of languages
According to this query: Size of web ≥ 21,450,000,000 on
2007.07.07
Big if: Page counts of Google search results are correct.
(Generally, they are just rough estimates.)
But this is just a lower bound, based on one search engine.
How can we do better?
Sojka, IIR Group: PV211: Web search 94 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Size of the web: Issues
The “dynamic” web is inﬁnite.
Any sum of two numbers is its own dynamic page on Google.
(Example: “2+4”)
Many other dynamic sites generating inﬁnite number of pages
The static web contains duplicates – each “equivalence class”
should only be counted once.
Some servers are seldom connected.
Example: Your laptop
Is it part of the web?
Sojka, IIR Group: PV211: Web search 95 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
“Search engine index contains N pages”: Issues
Can I claim a page is in the index if I only index the ﬁrst
4,000 bytes?
Can I claim a page is in the index if I only index anchor text
pointing to the page?
There used to be (and still are?) billions of pages that are only
indexed by anchor text.
Sojka, IIR Group: PV211: Web search 96 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
How can we estimate the size of the web?
Sojka, IIR Group: PV211: Web search 97 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Sampling methods
Random queries
Random searches
Random IP addresses
Random walks
Sojka, IIR Group: PV211: Web search 98 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Variant: Estimate relative sizes of indexes
There are signiﬁcant diﬀerences between indexes of diﬀerent
search engines.
Diﬀerent engines have diﬀerent preferences.
max URL depth, max count/host, anti-spam rules, priority
rules etc.
Diﬀerent engines index diﬀerent things under the same URL.
anchor text, frames, meta-keywords, size of preﬁx etc.
Sojka, IIR Group: PV211: Web search 99 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Sampling URLs
Ideal strategy: Generate a random URL
Problem: Random URLs are hard to ﬁnd (and sampling
distribution should reﬂect “user interest”)
Approach 1: Random walks / IP addresses
In theory: might give us a true estimate of the size of the web
(as opposed to just relative sizes of indexes)
Approach 2: Generate a random URL contained in a given
engine
Suﬃces for accurate estimation of relative size
Sojka, IIR Group: PV211: Web search 101 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Random URLs from random queries
Idea: Use vocabulary of the web for query generation
Vocabulary can be generated from web crawl
Use conjunctive queries w1 AND w2
Example: vocalists AND rsi
Get result set of one hundred URLs from the source engine
Choose a random URL from the result set
This sampling method induces a weight W (p) for each
page p.
Method was used by Bharat and Broder (1998).
Sojka, IIR Group: PV211: Web search 102 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Checking if a page is in the index
Either: Search for URL if the engine supports this
Or: Create a query that will ﬁnd doc d with high probability
Download doc, extract words
Use 8 low frequency word as AND query
Call this a strong query for d
Run query
Check if d is in result set
Problems
Near duplicates
Redirects
Engine time-outs
Sojka, IIR Group: PV211: Web search 103 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Random searches
Choose random searches extracted from a search engine log
(Lawrence & Giles 97)
Use only queries with small result sets
For each random query: compute ratio size(r1)/size(r2) of the
two result sets
Average over random searches
Sojka, IIR Group: PV211: Web search 106 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Advantages & disadvantages
Advantage
Might be a better reﬂection of the human perception of
coverage
Issues
Samples are correlated with source of log (unfair advantage for
originating search engine)
Duplicates
Technical statistical problems (must have non-zero results,
ratio average not statistically sound)
Sojka, IIR Group: PV211: Web search 107 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Random IP addresses [ONei97,Lawr99]
[Lawr99] exhaustively crawled 2,500 servers and extrapolated
Estimated size of the web to be 800 million
Sojka, IIR Group: PV211: Web search 111 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Advantages and disadvantages
Advantages
Can, in theory, estimate the size of the accessible web (as
opposed to the (relative) size of an index)
Clean statistics
Independent of crawling strategies
Disadvantages
Many hosts share one IP (→ oversampling)
Hosts with large web sites don’t get more weight than hosts
with small web sites (→ possible undersampling)
Sensitive to spam (multiple IPs for same spam server)
Again, duplicates
Sojka, IIR Group: PV211: Web search 112 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Conclusion
Many diﬀerent approaches to web size estimation.
None is perfect.
The problem has gotten much harder.
There has not been a good study for a couple of years.
Great topic for a thesis!
Sojka, IIR Group: PV211: Web search 116 / 117
Big picture Ads Duplicate detection Spam Web IR Size of the web
Resources
Chapter 19 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/
and http://cislmu.org, materials in MU IS and FI MU
library
Hal Varian explains Google second price auction:
https://youtube.com/watch?v=K7l0a2PVhPQ
Size of the web queries
Trademark issues (Geico and Vuitton cases)
How ads are priced
Henzinger, Finding near-duplicate web pages: A large-scale
evaluation of algorithms, ACM SIGIR 2006.
Sojka, IIR Group: PV211: Web search 117 / 117