Anchor text Citation analysis PageRank HITS: Hubs & Authorities PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/PV211 IIR 21: Link analysis Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2023-05-10 (compiled on 2023-03-22 18:16:36) Sojka, IIR Group: PV211: Link analysis 1 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Overview 1 Anchor text 2 Citation analysis 3 PageRank 4 HITS: Hubs & Authorities Sojka, IIR Group: PV211: Link analysis 2 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Take-away today Anchor text: What exactly are links on the web and why are they important for IR? Citation analysis: the mathematical foundation of PageRank and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the web Hubs & Authorities: an alternative link-based ranking algorithm Sojka, IIR Group: PV211: Link analysis 3 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities The web as a directed graph page d1 anchor text page d2 hyperlink Assumption 1: A hyperlink is a quality signal. The hyperlink d1 → d2 indicates that d1’s author deems d2 high-quality and relevant. Assumption 2: The anchor text describes the content of d2. We use anchor text somewhat loosely here for: the text surrounding the hyperlink. Example: “You can find cheap cars here.” Anchor text: “You can find cheap cars here” Sojka, IIR Group: PV211: Link analysis 5 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities [text of d2] only vs. [text of d2] + [anchor text → d2] Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only. Example: Query IBM Matches IBM’s copyright page Matches many spam pages Matches IBM Wikipedia article May not match IBM home page! . . . if IBM home page is mostly graphics Searching on [anchor text → d2] is better for the query IBM. In this representation, the page with the most occurrences of IBM is www.ibm.com. Sojka, IIR Group: PV211: Link analysis 6 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Anchor text containing IBM pointing to www.ibm.com www.nytimes.com: “IBM acquires Webify” www.slashdot.org: “New IBM optical chip” www.stanford.edu: “IBM faculty award recipients” www.ibm.com Sojka, IIR Group: PV211: Link analysis 7 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Indexing anchor text Thus: Anchor text is often a better description of a page’s content than the page itself. Anchor text can be weighted more highly than document text. (based on Assumptions 1&2) Sojka, IIR Group: PV211: Link analysis 8 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Exercise: Assumptions underlying PageRank Assumption 1: A link on the web is a quality signal – the author of the link thinks that the linked-to page is high-quality. Assumption 2: The anchor text describes the content of the linked-to page. Is assumption 1 true in general? Is assumption 2 true in general? Sojka, IIR Group: PV211: Link analysis 9 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Google bombs A Google bomb is a search with “bad” results due to maliciously manipulated anchor text. Google introduced a new weighting function in 2007 that fixed many Google bombs. Still some remnants: [dangerous cult] on Google, Bing, Yahoo Coordinated link creation by those who dislike the Church of Scientology Defused Google bombs: [dumb motherf. . . ], [who is a failure?], [evil empire] Sojka, IIR Group: PV211: Link analysis 10 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Origins of PageRank: Citation analysis (1) Citation analysis: analysis of citations in the scientific literature Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.” We can view “Miller (2001)” as a hyperlink linking two scientific articles. One application of these “hyperlinks” in the scientific literature: Measure the similarity of two articles by the overlap of other articles citing them. This is called cocitation similarity. Cocitation similarity on the web: Google’s “related:” operator, e.g. [related:www.ford.com] Sojka, IIR Group: PV211: Link analysis 12 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Origins of PageRank: Citation analysis (2) Another application: Citation frequency can be used to measure the impact of a scientific article. Simplest measure: Each citation gets one vote. On the web: citation frequency = inlink count However: A high inlink count does not necessarily mean high quality . . . . . . mainly because of link spam. Better measure: weighted citation frequency or citation rank An citation’s vote is weighted according to its citation impact. Circular? No: can be formalized in a well-defined way. Sojka, IIR Group: PV211: Link analysis 13 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Origins of PageRank: Citation analysis (3) Better measure: weighted citation frequency or citation rank This is basically PageRank. PageRank was invented in the context of citation analysis by Pinsker and Narin in the 1960s. Citation analysis is a big deal: The budget and salary of this lecturer are / will be determined by the impact of his publications! Sojka, IIR Group: PV211: Link analysis 14 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Origins of PageRank: Summary We can use the same formal representation for citations in the scientific literature hyperlinks on the web Appropriately weighted citation frequency is an excellent measure of quality . . . . . . both for web pages and for scientific publications. Next: PageRank algorithm for computing weighted citation frequency on the web Sojka, IIR Group: PV211: Link analysis 15 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Model behind PageRank: Random walk Imagine a web surfer doing a random walk on the web Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably In the steady state, each page has a long-term visit rate. This long-term visit rate is the page’s PageRank. PageRank = long-term visit rate = steady state probability Sojka, IIR Group: PV211: Link analysis 17 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Formalization of random walk: Markov chains A Markov chain consists of N states, plus an N × N transition probability matrix P. state = page At each step, we are on exactly one of the pages. For 1 ≤ i, j ≤ N, the matrix entry Pij tells us the probability of j being the next page, given we are currently on page i. Clearly, for all i, N j=1 Pij = 1 di dj Pij Sojka, IIR Group: PV211: Link analysis 18 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Example web graph d0 d2 d1 d5 d3 d6 d4 car benz ford gm honda jaguar jag cat leopard tiger jaguar lion cheetah speed PageRank d0 0.05 d1 0.04 d2 0.11 d3 0.25 d4 0.21 d5 0.04 d6 0.31 PageRank(d2)< PageRank(d6): why? a h d0 0.10 0.03 d1 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 d4 0.16 0.04 d5 0.01 0.04 d6 0.13 0.35 highest in-degree: d2, d3, d6 highest out-degree: d2, d6 highest PageRank: d6 highest hub score: d6 (close: d2) highest authority score: d3 Sojka, IIR Group: PV211: Link analysis 19 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Link matrix for example d0 d1 d2 d3 d4 d5 d6 d0 0 0 1 0 0 0 0 d1 0 1 1 0 0 0 0 d2 1 0 1 1 0 0 0 d3 0 0 0 1 1 0 0 d4 0 0 0 0 0 0 1 d5 0 0 0 0 0 1 1 d6 0 0 0 1 1 0 1 Sojka, IIR Group: PV211: Link analysis 20 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Transition probability matrix P for example d0 d1 d2 d3 d4 d5 d6 d0 0.00 0.00 1.00 0.00 0.00 0.00 0.00 d1 0.00 0.50 0.50 0.00 0.00 0.00 0.00 d2 0.33 0.00 0.33 0.33 0.00 0.00 0.00 d3 0.00 0.00 0.00 0.50 0.50 0.00 0.00 d4 0.00 0.00 0.00 0.00 0.00 0.00 1.00 d5 0.00 0.00 0.00 0.00 0.00 0.50 0.50 d6 0.00 0.00 0.00 0.33 0.33 0.00 0.33 Sojka, IIR Group: PV211: Link analysis 21 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Long-term visit rate Recall: PageRank = long-term visit rate Long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time. Next: what properties must hold of the web graph for the long-term visit rate to be well defined? The web graph must correspond to an ergodic Markov chain. First a special case: The web graph must not contain dead ends. Sojka, IIR Group: PV211: Link analysis 22 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Dead ends ?? The web is full of dead ends. Random walk can get stuck in dead ends. If there are dead ends, long-term visit rates are not well-defined (or non-sensical). Sojka, IIR Group: PV211: Link analysis 23 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Teleporting – to get us out of dead ends At a dead end, jump to a random web page with prob. 1/N. At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N). With remaining probability (90%), go out on a random hyperlink. For example, if the page has 4 outgoing links: randomly choose one with probability (1-0.10)/4=0.225 10% is a parameter, the teleportation rate. Note: “jumping” from dead end is independent of teleportation rate. Sojka, IIR Group: PV211: Link analysis 24 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Result of teleporting With teleporting, we cannot get stuck in a dead end. But even without dead ends, a graph may not have well-defined long-term visit rates. More generally, we require that the Markov chain be ergodic. Sojka, IIR Group: PV211: Link analysis 25 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Ergodic Markov chains A Markov chain is ergodic iff it is irreducible and aperiodic. Irreducibility. Roughly: there is a path from any page to any other page. Aperiodicity. Roughly: The pages cannot be partitioned such that the random walker visits the partitions sequentially. A non-ergodic Markov chain: 1.0 1.0 Sojka, IIR Group: PV211: Link analysis 26 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Ergodic Markov chains Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state. This is the steady-state probability distribution. Over a long time period, we visit each state in proportion to this rate. It doesn’t matter where we start. Teleporting makes the web graph ergodic. ⇒ Web-graph+teleporting has a steady-state probability distribution. ⇒ Each page in the web-graph+teleporting has a PageRank. Sojka, IIR Group: PV211: Link analysis 27 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Where we are We now know what to do to make sure we have a well-defined PageRank for each page. Next: how to compute PageRank Sojka, IIR Group: PV211: Link analysis 28 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Formalization of “visit”: Probability vector A probability (row) vector x = (x1, . . . , xN) tells us where the random walk is at any point. Example: ( 0 0 0 . . . 1 . . . 0 0 0 ) 1 2 3 . . . i . . . N-2 N-1 N More generally: the random walk is on page i with probability xi . Example: ( 0.05 0.01 0.0 . . . 0.2 . . . 0.01 0.05 0.03 ) 1 2 3 . . . i . . . N-2 N-1 N xi = 1 Sojka, IIR Group: PV211: Link analysis 29 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Change in probability vector If the probability vector is x = (x1, . . . , xN) at this step, what is it at the next step? Recall that row i of the transition probability matrix P tells us where we go next from state i. So from x, our next state is distributed as xP. Sojka, IIR Group: PV211: Link analysis 30 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Steady state in vector notation The steady state in vector notation is simply a vector π = (π1, π2, . . . , πN) of probabilities. (We use π to distinguish it from the notation for the probability vector x.) πi is the long-term visit rate (or PageRank) of page i. So we can think of PageRank as a very long vector – one entry per page. Sojka, IIR Group: PV211: Link analysis 31 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Steady-state distribution: Example What is the PageRank / steady state in this example? d1 d2 0.75 0.25 0.25 0.75 Sojka, IIR Group: PV211: Link analysis 32 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Steady-state distribution: Example x1 x2 Pt(d1) Pt(d2) P11 = 0.25 P12 = 0.75 P21 = 0.25 P22 = 0.75 t0 0.25 0.75 0.25 0.75 t1 0.25 0.75 (convergence) PageRank vector = π = (π1, π2) = (0.25, 0.75) Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21 Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22 Sojka, IIR Group: PV211: Link analysis 33 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities How do we compute the steady state vector? In other words: how do we compute PageRank? Recall: π = (π1, π2, . . . , πN) is the PageRank vector, the vector of steady-state probabilities . . . . . . and if the distribution in this step is x, then the distribution in the next step is xP. But π is the steady state! So: π = πP Solving this matrix equation gives us π. π is the principal left eigenvector for P . . . . . . that is, π is the left eigenvector with the largest eigenvalue. All transition probability matrices have largest eigenvalue 1. Sojka, IIR Group: PV211: Link analysis 34 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities One way of computing the PageRank π Start with any distribution x, e.g., uniform distribution After one step, we’re at xP. After two steps, we’re at xP2. After k steps, we’re at xPk. Algorithm: multiply x by increasing powers of P until convergence. This is called the power method. Recall: regardless of where we start, we eventually reach the steady state π. Thus: we will eventually (in asymptotia) reach the steady state. Sojka, IIR Group: PV211: Link analysis 35 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Power method: Example What is the PageRank / steady state in this example? d1 d2 0.9 0.3 0.1 0.7 The steady state distribution (= the PageRanks) in this example are 0.25 for d1 and 0.75 for d2. Sojka, IIR Group: PV211: Link analysis 36 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Computing PageRank: Power method x1 x2 Pt(d1) Pt(d2) P11 = 0.1 P12 = 0.9 P21 = 0.3 P22 = 0.7 t0 0 1 0.3 0.7 = xP t1 0.3 0.7 0.24 0.76 = xP2 t2 0.24 0.76 0.252 0.748 = xP3 t3 0.252 0.748 0.2496 0.7504 = xP4 . . . . . . t∞ 0.25 0.75 0.25 0.75 = xP∞ PageRank vector = π = (π1, π2) = (0.25, 0.75) Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21 Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22 Sojka, IIR Group: PV211: Link analysis 37 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Power method: Example What is the PageRank / steady state in this example? d1 d2 0.9 0.3 0.1 0.7 The steady state distribution (= the PageRanks) in this example are 0.25 for d1 and 0.75 for d2. Sojka, IIR Group: PV211: Link analysis 38 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Exercise: Compute PageRank using power method d1 d2 0.3 0.2 0.7 0.8 Sojka, IIR Group: PV211: Link analysis 39 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Solution x1 x2 Pt(d1) Pt(d2) P11 = 0.7 P12 = 0.3 P21 = 0.2 P22 = 0.8 t0 0 1 0.2 0.8 t1 0.2 0.8 0.3 0.7 t2 0.3 0.7 0.35 0.65 t3 0.35 0.65 0.375 0.625 . . . t∞ 0.4 0.6 0.4 0.6 PageRank vector = π = (π1, π2) = (0.4, 0.6) Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21 Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22 Sojka, IIR Group: PV211: Link analysis 40 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities PageRank summary Preprocessing Given graph of links, build matrix P Apply teleportation From modified matrix, compute π πi is the PageRank of page i. Query processing Retrieve pages satisfying the query Rank them by their PageRank Return reranked list to the user Sojka, IIR Group: PV211: Link analysis 41 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities PageRank issues Real surfers are not random surfers. Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search! → Markov model is not a good model of surfing. But it’s good enough as a model for our purposes. Simple PageRank ranking (as described on previous slide) produces bad results for many pages. Consider the query [video service] The Yahoo home page (i) has a very high PageRank and (ii) contains both video and service. If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked. Clearly not desirable In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors → see lecture on Learning to Rank Sojka, IIR Group: PV211: Link analysis 42 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Example web graph d0 d2 d1 d5 d3 d6 d4 car benz ford gm honda jaguar jag cat leopard tiger jaguar lion cheetah speed PageRank d0 0.05 d1 0.04 d2 0.11 d3 0.25 d4 0.21 d5 0.04 d6 0.31 PageRank(d2)< PageRank(d6): why? a h d0 0.10 0.03 d1 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 d4 0.16 0.04 d5 0.01 0.04 d6 0.13 0.35 highest in-degree: d2, d3, d6 highest out-degree: d2, d6 highest PageRank: d6 highest hub score: d6 (close: d2) highest authority score: d3 Sojka, IIR Group: PV211: Link analysis 43 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Transition (probability) matrix d0 d1 d2 d3 d4 d5 d6 d0 0.00 0.00 1.00 0.00 0.00 0.00 0.00 d1 0.00 0.50 0.50 0.00 0.00 0.00 0.00 d2 0.33 0.00 0.33 0.33 0.00 0.00 0.00 d3 0.00 0.00 0.00 0.50 0.50 0.00 0.00 d4 0.00 0.00 0.00 0.00 0.00 0.00 1.00 d5 0.00 0.00 0.00 0.00 0.00 0.50 0.50 d6 0.00 0.00 0.00 0.33 0.33 0.00 0.33 Sojka, IIR Group: PV211: Link analysis 44 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Transition matrix with teleporting, teleportation rate=0.14 d0 d1 d2 d3 d4 d5 d6 d0 0.02 0.02 0.88 0.02 0.02 0.02 0.02 d1 0.02 0.45 0.45 0.02 0.02 0.02 0.02 d2 0.31 0.02 0.31 0.31 0.02 0.02 0.02 d3 0.02 0.02 0.02 0.45 0.45 0.02 0.02 d4 0.02 0.02 0.02 0.02 0.02 0.02 0.88 d5 0.02 0.02 0.02 0.02 0.02 0.45 0.45 d6 0.02 0.02 0.02 0.31 0.31 0.02 0.31 Sojka, IIR Group: PV211: Link analysis 45 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Power method vectors xPk x xP1 xP2 xP3 xP4 xP5 xP6 xP7 xP8 xP9 xP10 xP11 xP12 xP13 d0 0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05 d1 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 d2 0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11 d3 0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25 d4 0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21 d5 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 d6 0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31 Sojka, IIR Group: PV211: Link analysis 46 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Example web graph d0 d2 d1 d5 d3 d6 d4 car benz ford gm honda jaguar jag cat leopard tiger jaguar lion cheetah speed PageRank d0 0.05 d1 0.04 d2 0.11 d3 0.25 d4 0.21 d5 0.04 d6 0.31 PageRank(d2)< PageRank(d6): why? a h d0 0.10 0.03 d1 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 d4 0.16 0.04 d5 0.01 0.04 d6 0.13 0.35 highest in-degree: d2, d3, d6 highest out-degree: d2, d6 highest PageRank: d6 highest hub score: d6 (close: d2) highest authority score: d3 Sojka, IIR Group: PV211: Link analysis 47 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities How important is PageRank? Frequent claim: PageRank is the most important component of web ranking. The reality: There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes . . . Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking! However, variants of a page’s PageRank are still an essential part of ranking. Adressing link spam is difficult and crucial. Sojka, IIR Group: PV211: Link analysis 48 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities HITS – Hyperlink-Induced Topic Search Premise: there are two different types of relevance on the web. Relevance type 1: Hubs. A hub page is a good list of [links to pages answering the information need]. E.g., for query [chicago bulls]: Bob’s list of recommended resources on the Chicago Bulls sports team Relevance type 2: Authorities. An authority page is a direct answer to the information need. The home page of the Chicago Bulls sports team By definition: Links to authority pages occur repeatedly on hub pages. Most approaches to search (including PageRank ranking) don’t make the distinction between these two very different types of relevance. Sojka, IIR Group: PV211: Link analysis 50 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Hubs and authorities: Definition A good hub page for a topic links to many authority pages for that topic. A good authority page for a topic is linked to by many hub pages for that topic. Circular definition – we will turn this into an iterative computation. Sojka, IIR Group: PV211: Link analysis 51 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Example for hubs and authorities hubs authorities www.bestfares.com www.airlinesquality.com blogs.usatoday.com/sky aviationblog.dallasnews.com www.aa.com www.delta.com www.united.com Sojka, IIR Group: PV211: Link analysis 52 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities How to compute hub and authority scores Do a regular web search first Call the search result the root set Find all pages that are linked to or link to pages in the root set Call this larger set the base set Finally, compute hubs and authorities for the base set (which we’ll view as a small web graph) Sojka, IIR Group: PV211: Link analysis 53 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Root set and base set (1) base set root set 1) The root set 2) Nodes that root set nodes link to 3) Nodes that link to root set nodes 4) The base set Sojka, IIR Group: PV211: Link analysis 54 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Root set and base set (2) Root set typically has 200–1,000 nodes. Base set may have up to 5,000 nodes. Computation of base set, as shown on previous slide: Follow outlinks by parsing the pages in the root set Find d’s inlinks by searching for all pages containing a link to d Sojka, IIR Group: PV211: Link analysis 55 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d) Initialization: for all d: h(d) = 1, a(d) = 1 Iteratively update all h(d), a(d) After convergence: Output pages with highest h scores as top hubs Output pages with highest a scores as top authorities So we output two ranked lists Sojka, IIR Group: PV211: Link analysis 56 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Iterative update For all d: h(d) = d→y a(y) d y1 y2 y3 For all d: a(d) = y→d h(y) d y1 y2 y3 Iterate these two steps until convergence Sojka, IIR Group: PV211: Link analysis 57 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Details Scaling To prevent the a() and h() values from getting too big, can scale down after each iteration Scaling factor doesn’t really matter. We care about the relative (as opposed to absolute) values of the scores. In most cases, the algorithm converges after a few iterations. Sojka, IIR Group: PV211: Link analysis 58 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Authorities for query [Chicago Bulls] 0.85 www.nba.com/bulls 0.25 www.essex1.com/people/jmiller/bulls.htm “da Bulls” 0.20 www.nando.net/SportServer/basketball/nba/chi.html “The Chicago Bulls” 0.15 users.aol.com/rynocub/bulls.htm “The Chicago Bulls Home Page” 0.13 www.geocities.com/Colosseum/6095 “Chicago Bulls” (Ben-Shaul et al, WWW8) Sojka, IIR Group: PV211: Link analysis 59 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities The authority page for [Chicago Bulls] Sojka, IIR Group: PV211: Link analysis 60 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Hubs for query [Chicago Bulls] 1.62 www.geocities.com/Colosseum/1778 “Unbelieveabulls!!!!!” 1.24 www.webring.org/cgi-bin/webring?ring=chbulls “Erin’s Chicago Bulls Page” 0.74 www.geocities.com/Hollywood/Lot/3330/Bulls.html “Chicago Bulls” 0.52 www.nobull.net/web_position/kw-search-15-M2.htm “Excite Search Results: bulls” 0.52 www.halcyon.com/wordsltd/bball/bulls.htm “Chicago Bulls Links” (Ben-Shaul et al, WWW8) Sojka, IIR Group: PV211: Link analysis 61 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities A hub page for [Chicago Bulls] Sojka, IIR Group: PV211: Link analysis 62 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Hubs & Authorities: Comments HITS can pull together good pages regardless of page content. Once the base set is assembled, we only do link analysis, no text matching. Pages in the base set often do not contain any of the query words. In theory, an English query can retrieve Japanese-language pages! If supported by the link structure between English and Japanese pages Danger: topic drift – the pages found by following links may not be related to the original query. Sojka, IIR Group: PV211: Link analysis 63 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Proof of convergence We define an N × N adjacency matrix A. (We called this the link matrix earlier.) For 1 ≤ i, j ≤ N, the matrix entry Aij tells us whether there is a link from page i to page j (Aij = 1) or not (Aij = 0). Example: d3 d1 d2 d1 d2 d3 d1 0 1 0 d2 1 1 1 d3 1 0 0 Sojka, IIR Group: PV211: Link analysis 64 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Write update rules as matrix operations Define the hub vector h = (h1, . . . , hN) as the vector of hub scores. hi is the hub score of page di . Similarly for a, the vector of authority scores Now we can write h(d) = d→y a(y) as a matrix operation: h = Aa . . . . . . and we can write a(d) = y→d h(y) as a = AT h HITS algorithm in matrix notation: Compute h = Aa Compute a = AT h Iterate until convergence Sojka, IIR Group: PV211: Link analysis 65 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities HITS as eigenvector problem HITS algorithm in matrix notation. Iterate: Compute h = Aa Compute a = AT h By substitution we get: h = AAT h and a = AT Aa Thus, h is an eigenvector of AAT and a is an eigenvector of AT A. So the HITS algorithm is actually a special case of the power method and hub and authority scores are eigenvector values. HITS and PageRank both formalize link analysis as eigenvector problems. Sojka, IIR Group: PV211: Link analysis 66 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Example web graph d0 d2 d1 d5 d3 d6 d4 car benz ford gm honda jaguar jag cat leopard tiger jaguar lion cheetah speed PageRank d0 0.05 d1 0.04 d2 0.11 d3 0.25 d4 0.21 d5 0.04 d6 0.31 PageRank(d2)< PageRank(d6): why? a h d0 0.10 0.03 d1 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 d4 0.16 0.04 d5 0.01 0.04 d6 0.13 0.35 highest in-degree: d2, d3, d6 highest out-degree: d2, d6 highest PageRank: d6 highest hub score: d6 (close: d2) highest authority score: d3 Sojka, IIR Group: PV211: Link analysis 67 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Raw matrix A for HITS We double-weight links whose anchors contain query word: d0 d1 d2 d3 d4 d5 d6 d0 0 0 1 0 0 0 0 d1 0 1 1 0 0 0 0 d2 1 0 1 2 0 0 0 d3 0 0 0 1 1 0 0 d4 0 0 0 0 0 0 1 d5 0 0 0 0 0 1 1 d6 0 0 0 2 1 0 1 Sojka, IIR Group: PV211: Link analysis 68 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Hub vectors h0,hi = 1 di A · ai, i ≥ 1 h0 h1 h2 h3 h4 h5 d0 0.14 0.06 0.04 0.04 0.03 0.03 d1 0.14 0.08 0.05 0.04 0.04 0.04 d2 0.14 0.28 0.32 0.33 0.33 0.33 d3 0.14 0.14 0.17 0.18 0.18 0.18 d4 0.14 0.06 0.04 0.04 0.04 0.04 d5 0.14 0.08 0.05 0.04 0.04 0.04 d6 0.14 0.30 0.33 0.34 0.35 0.35 Sojka, IIR Group: PV211: Link analysis 69 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Authority vectors ai = 1 ci AT · hi−1, i ≥ 1 a1 a2 a3 a4 a5 a6 a7 d0 0.06 0.09 0.10 0.10 0.10 0.10 0.10 d1 0.06 0.03 0.01 0.01 0.01 0.01 0.01 d2 0.19 0.14 0.13 0.12 0.12 0.12 0.12 d3 0.31 0.43 0.46 0.46 0.46 0.47 0.47 d4 0.13 0.14 0.16 0.16 0.16 0.16 0.16 d5 0.06 0.03 0.02 0.01 0.01 0.01 0.01 d6 0.19 0.14 0.13 0.13 0.13 0.13 0.13 Sojka, IIR Group: PV211: Link analysis 70 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Example web graph d0 d2 d1 d5 d3 d6 d4 car benz ford gm honda jaguar jag cat leopard tiger jaguar lion cheetah speed PageRank d0 0.05 d1 0.04 d2 0.11 d3 0.25 d4 0.21 d5 0.04 d6 0.31 PageRank(d2)< PageRank(d6): why? a h d0 0.10 0.03 d1 0.01 0.04 d2 0.12 0.33 d3 0.47 0.18 d4 0.16 0.04 d5 0.01 0.04 d6 0.13 0.35 highest in-degree: d2, d3, d6 highest out-degree: d2, d6 highest PageRank: d6 highest hub score: d6 (close: d2) highest authority score: d3 Sojka, IIR Group: PV211: Link analysis 71 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities PageRank vs. HITS: Discussion PageRank can be precomputed, HITS has to be computed at query time. HITS is too expensive in most application scenarios. PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization (ii) the set of pages to apply the formalization to. These two are orthogonal. We could also apply HITS to the entire web and PageRank to a small base set. Claim: On the web, a good hub almost always is also a good authority. The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect. Sojka, IIR Group: PV211: Link analysis 72 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Exercise Why is a good hub almost always also a good authority? Sojka, IIR Group: PV211: Link analysis 73 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Take-away today Anchor text: What exactly are links on the web and why are they important for IR? Citation analysis: the mathematical foundation of PageRank and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the web Hubs & Authorities: an alternative link-based ranking algorithm Sojka, IIR Group: PV211: Link analysis 74 / 75 Anchor text Citation analysis PageRank HITS: Hubs & Authorities Resources Chapter 21 of IIR Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library American Mathematical Society article on PageRank (popular science style) Jon Kleinberg’s home page (main person behind HITS) A Google bomb and its defusing Google’s official description of PageRank: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. Sojka, IIR Group: PV211: Link analysis 75 / 75