A simple crawler A real crawler
PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 20: Crawling
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2024-05-15
(compiled on 2024-05-06 13:41:14)
Sojka, IIR Group: PV211: Crawling 1 / 27
A simple crawler A real crawler
How hard can crawling be?
Web search engines must crawl their documents.
Getting the content of the documents is easier for many other
IR systems.
E.g., indexing all ﬁles on your hard disk: just do a recursive
descent on your ﬁle system
Ok: for web IR, getting the content of the documents takes
longer . . .
. . . because of latency.
But is that really a design/systems challenge?
Sojka, IIR Group: PV211: Crawling 3 / 27
A simple crawler A real crawler
Basic crawler operation
Initialize queue with URLs of known seed pages
Repeat
Take URL from queue
Fetch and parse page
Extract URLs from page
Add URLs to queue
Fundamental assumption: The web is well linked.
Sojka, IIR Group: PV211: Crawling 4 / 27
A simple crawler A real crawler
Exercise: What’s wrong with this crawler?
urlqueue := (some carefully selected set of seed urls)
while urlqueue is not empty:
myurl := urlqueue.getlastanddelete()
mypage := myurl.fetch()
fetchedurls.add(myurl)
newurls := mypage.extracturls()
for myurl in newurls:
if myurl not in fetchedurls and not in urlqueue:
urlqueue.add(myurl)
addtoinvertedindex(mypage)
Sojka, IIR Group: PV211: Crawling 5 / 27
A simple crawler A real crawler
What’s wrong with the simple crawler
Scale: we need to distribute.
We can’t index everything: we need to subselect. How?
Duplicates: need to integrate duplicate detection
Spam and spider traps: need to integrate spam detection
Politeness: we need to be “nice” and space out all requests
for a site over a longer period (hours, days)
Freshness: we need to recrawl periodically.
Because of the size of the web, we can do frequent recrawls
only for a small subset.
Again, subselection problem or prioritization
Sojka, IIR Group: PV211: Crawling 6 / 27
A simple crawler A real crawler
Magnitude of the crawling problem
To fetch 20,000,000,000 pages in one month . . .
. . . we need to fetch almost 8,000 pages per second!
Actually: many more since many of the pages we attempt to
crawl will be duplicates, unfetchable, spam, etc.
Sojka, IIR Group: PV211: Crawling 7 / 27
A simple crawler A real crawler
What a crawler must do
Be polite
Don’t hit a site too often
Only crawl pages you are allowed to crawl: robots.txt
Be robust
Be immune to spider traps, duplicates, very large pages, very
large websites, dynamic pages, etc.
Sojka, IIR Group: PV211: Crawling 8 / 27
A simple crawler A real crawler
robots.txt
Protocol for giving crawlers (“robots”) limited access to a
website, originally from 1994
Examples:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow: /
Important: cache the robots.txt ﬁle of each site we are
crawling
Sojka, IIR Group: PV211: Crawling 9 / 27
A simple crawler A real crawler
Example of a robots.txt (nih.gov)
User-agent: PicoSearch/1.0
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
User-agent: *
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
Disallow: /ddir/
Disallow: /sdminutes/
Sojka, IIR Group: PV211: Crawling 10 / 27
A simple crawler A real crawler
What any crawler should do
Be capable of distributed operation
Be scalable: need to be able to increase crawl rate by adding
more machines
Fetch pages of higher quality ﬁrst
Continuous operation: get fresh version of already crawled
pages
Sojka, IIR Group: PV211: Crawling 11 / 27
A simple crawler A real crawler
URL frontier
URLs crawled
and parsed
URL frontier:
found, but
not yet crawled
unseen URLs
Sojka, IIR Group: PV211: Crawling 13 / 27
A simple crawler A real crawler
URL frontier
The URL frontier is the data structure that holds and manages
URLs we’ve seen, but that have not been crawled yet.
Can include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must keep all crawling threads busy
Sojka, IIR Group: PV211: Crawling 14 / 27
A simple crawler A real crawler
Basic crawl architecture
www
fetch
DNS
parse
URL frontier
content
seen?
✓
✒
✏
✑
✒✑
doc
FPs
✓
✒
✏
✑
✒✑
robots
templates
✓
✒
✏
✑
✒✑
URL
set
URL
ﬁlter
dup
URL
elim✲
✛
✲
✻
✛ ✲
❄
✻
✲ ✲ ✲
✛
✻
❄
✻
❄
✻
❄
Sojka, IIR Group: PV211: Crawling 15 / 27
A simple crawler A real crawler
URL normalization
Some URLs extracted from a document are relative URLs.
E.g., at http://www.fi.muni.cz/˜sojka/PV211/, we may
have p20crawl.pdf
This is the same as URL:
http://www.fi.muni.cz/˜sojka/PV211/p20crawl.pdf
During parsing, we must normalize (expand) all relative URLs.
Sojka, IIR Group: PV211: Crawling 16 / 27
A simple crawler A real crawler
Content seen
For each page fetched: check if the content is already in the
index
Check this using document ﬁngerprints or shingles
Skip documents whose content has already been indexed
Sojka, IIR Group: PV211: Crawling 17 / 27
A simple crawler A real crawler
Distributing the crawler
Run multiple crawl threads, potentially at diﬀerent nodes
Usually geographically distributed nodes
Partition hosts being crawled into nodes
Sojka, IIR Group: PV211: Crawling 18 / 27
A simple crawler A real crawler
Google data centers (wayfaring.com)
Sojka, IIR Group: PV211: Crawling 19 / 27
A simple crawler A real crawler
Distributed crawler
www
fetch
DNS
parse
URL frontier
content
seen?
✓
✒
✏
✑
✍ ✌
doc
FPs
✓
✒
✏
✑
✍ ✌
URL
set
URL
ﬁlter
host
splitter
to
other
nodes
from
other
nodes
dup
URL
elim✲
✛
✲
✻
✛✲
❄
✻
✲ ✲ ✲ ✲
✛
✻❄ ✻❄✻✻✻
✲✲
✲
Sojka, IIR Group: PV211: Crawling 20 / 27
A simple crawler A real crawler
URL frontier: Two main considerations
Politeness: Don’t hit a web server too frequently
E.g., insert a time gap between successive requests to the
same server
Freshness: Crawl some pages (e.g., news sites) more often
than others
Not an easy problem: simple priority queue fails.
Sojka, IIR Group: PV211: Crawling 21 / 27
A simple crawler A real crawler
Mercator URL frontier
b. queue selector
f. queue selector & b. queue router
prioritizer
♣ ♣ ♣ ♣
B back queues:
single host on each
♣ ♣ ♣
♣
♣F front queues
1 F
1 B
❳❳❳❳❳❳③
❳❳❳❳❳❳③
✘✘✘✘✘✘✾
✘✘✘✘✘✘✾
✘✘✘✘✘✘✾
❳❳❳❳❳❳③
✏✏✏✏✏✏✮
✏✏✏✏✏✏✮
q
❍❍
❍❍❍❥
❍❍
❍❍❍❥
✟✟
✟✟✟✙
❄
❄
✲✛ heap
URLs ﬂow in from the top into
the frontier.
Front queues manage
prioritization.
Back queues enforce politeness.
Each queue is FIFO.
Sojka, IIR Group: PV211: Crawling 22 / 27
A simple crawler A real crawler
Mercator URL frontier: Front queues
Prioritizer assigns
to URL an integer
priority between 1
and F.
Then appends URL
to corresponding
queue
Heuristics for
assigning priority:
refresh rate,
PageRank, etc.
Selection from front
queues is initiated
by back queues
f. queue selector & b. queue router
prioritizer
q q q qF front queues
1 F
✏✏✏✏✏✏✏✮
✏✏✏✏✏✏✏✮
q
❍
❍❍
❍❍
❍❍❥
❍
❍❍
❍❍
❍❍❥
✟
✟✟
✟✟
✟✟✙
❄
Sojka, IIR Group: PV211: Crawling 23 / 27
A simple crawler A real crawler
Mercator URL frontier: Back queues
b. queue selector
f. queue selector & b. queue router
q q q q
B back queues
Single host on each
1 B
❳❳❳❳❳❳❳③
❳❳❳❳❳❳❳③
✘✘✘✘✘✘✘✾
✘✘✘✘✘✘✘✘✾
✘✘✘✘✘✘✘✘✾
❳❳❳❳❳❳❳❳③
❄
✲
✛ heap
Invariant 1. Each
back queue is kept
non-empty while the
crawl is in progress.
Invariant 2. Each
back queue only
contains URLs from a
single host.
Maintain a table from
hosts to back queues.
In the heap:
One entry for each
back queue
The entry is the
earliest time te at
Sojka, IIR Group: PV211: Crawling 24 / 27
A simple crawler A real crawler
Mercator URL frontier
b. queue selector
f. queue selector & b. queue router
prioritizer
♣ ♣ ♣ ♣
B back queues:
single host on each
♣ ♣ ♣
♣
♣F front queues
1 F
1 B
❳❳❳❳❳❳③
❳❳❳❳❳❳③
✘✘✘✘✘✘✾
✘✘✘✘✘✘✾
✘✘✘✘✘✘✾
❳❳❳❳❳❳③
✏✏✏✏✏✏✮
✏✏✏✏✏✏✮
q
❍❍
❍❍❍❥
❍❍
❍❍❍❥
✟✟
✟✟✟✙
❄
❄
✲✛ heap
URLs ﬂow in from the top into
the frontier.
Front queues manage
prioritization.
Back queues enforce politeness.
Each queue is FIFO.
Sojka, IIR Group: PV211: Crawling 25 / 27
A simple crawler A real crawler
Spider trap
Malicious server that generates an inﬁnite sequence of linked
pages.
Sophisticated spider traps generate pages that are not easily
identiﬁed as dynamic.
Sojka, IIR Group: PV211: Crawling 26 / 27
A simple crawler A real crawler
Resources
Chapter 20 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/
and http://cislmu.org, materials in MU IS and FI MU
library
Papers by NLP centre people crawling data for Sketch Engine
Paper on Mercator by Heydon et al.
Robot exclusion standard
Sojka, IIR Group: PV211: Crawling 27 / 27