Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz ¡For the following graph ¡ ¡ ¡ ¡ ¡ §Compute the PageRank of each page, assuming no taxation Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 B A C 3 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) j k i rj/3 rj/3 rj/3 rj = ri/3+rk/4 ri/3 rk/4 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4 ¡Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks ¡Power iteration: a simple iterative scheme §Suppose there are N web pages §Initialize: r(0) = [1/N,….,1/N]T §Iterate: r(t+1) = M ∙ r(t) §Stop when |r(t+1) – r(t)|1 < e § 5 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) di …. out-degree of node i |x|1 = å1≤i≤N|xi| is the L1 norm Can use any other vector norm, e.g., Euclidean ¡The transition matrix for the graph is: ¡ ¡ ¡By equation method, we get the result: ¡ ¡ ¡ ¡ ¡ ¡By power-iteration method, we get the following list: Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6 B A C ¡For the following graph ¡ ¡ ¡ ¡ ¡ 1)Set up the PageRank equations, assuming β = 0.8 2)Order nodes by PageRank from highest to lowest Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7 4 6 3 2 1 5 ¡The Google solution for spider traps: At each time step, the random surfer has two options §With prob. b, follow a link at random §With prob. 1-b, jump to some random page §Common values for b are in the range 0.8 to 0.9 ¡Surfer will teleport out of spider trap within a few time steps 8 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 di … out-degree of node i Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 10 [1/N]NxN…N by N matrix where all entries are 1/N Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 11 4 6 3 2 1 5 ¡For the following graph ¡ ¡ ¡ ¡ ¡ §Assuming β = 0.8, compute the topic-sensitive PageRank for the following teleport sets: 1){A} 2){A, C} Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 12 B A D C ¡Random walker has a small probability of teleporting at any step ¡Teleport can go to: §Standard PageRank: Any page with equal probability §To avoid dead-end and spider-trap problems §Topic Specific PageRank: A topic-specific set of “relevant” pages (teleport set) ¡Idea: Bias the random walk §When walker teleports, she pick a page from a set S §S contains only pages that are relevant to the topic §E.g., Open Directory (DMOZ) pages for a given topic/query §For each teleport set S, we get a different vector rS 13 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 14 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 15 ¡The transition matrix for the graph is: ¡ ¡ ¡ 1)Computing PageRank for teleport set {A} using equations: ¡ ¡ ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 B A D C β = 0.8 = 4/5 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 17 ¡The transition matrix for the graph is: ¡ ¡ ¡ 2)Computing PageRank for teleport set {A,C} using equations: ¡ ¡ ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 19 ¡Suppose the BALANCE algorithm with bids of 0 or 1 only, to a situation where advertiser §A bids on query words x and y §B bids on query words x and z §Both have a budget of $2. Decide whether the following sequences of queries are certainly handled optimally by the algorithm: 1)yzyy 2)xyyz 3)xyzx Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 20 ¡BALANCE Algorithm by Mehta, Saberi, Vazirani, and Vazirani §For each query, pick the advertiser with the largest unspent budget §Break ties arbitrarily (but in a deterministic way) 21 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 22 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) §A bids on x and y B bids on x and z budget: $2 1) 1)Input sequence: yzyy §Balance choice: yzy ($3) Optimal: yzy ($3) Yes 2)Input sequence: xyyz §If the x is assigned to A, then the second y cannot be satisfied, while the optimum assigns all four queries §Balance choice: xyz ($3) Optimal: xyyz ($4) No 3)Input sequence: xyzx §Whichever advertiser is assigned the first x, the other will be assigned the second x, thus using all four queries §Balance choice: xyzx ($4) Optimal: xyzx ($4) Yes Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 23 ¡Bookstore has enough ratings to use a more advanced recommendation system §Suppose the mean rating of books is 3.4 stars §Alice has rated 350 books and her average rating is 0.4 stars higher than average users' ratings §Animals Farm, is a book title in the bookstore with 250,000 ratings whose average rating is 0.7 higher than global average §What is a baseline estimate of Alice's rating for Animals Farms? Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 24 ¡Baseline estimate of Alice's rating for Animals Farms: ¡ r = 3.4 + 0.7 + 0.4 = 4.5 ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 25 MCBS01705_0000[1] likes Item profiles Red Circles Triangles User profile match recommend build 26 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Item profile = vector of item features (e.g., one dimension = 0/1 whether it is red color, or not) User profile = e.g., average over item profiles ¡Computers A, B and C have the following features: ¡ ¡ ¡ ¡ ¡ §Assuming features as a vector for each computer, e.g., A’s vector is [3.06, 500, 6], we can compute the cosine distance between any two vectors §Scaling dimensions can prefer some components §Assume 1 as the scale factor for processor speed, α for the disk size, and β for the main memory size and compute: §The cosines of angles between pairs of vectors (in terms of α and β) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 27 Feature A B C Processor speed 3.06 2.68 2.92 Disk size 500 320 640 Main-memory size 6 4 6 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 28 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 29 Feature A B C Processor speed 3.06 2.68 2.92 Disk size 500 320 640 Main-memory size 6 4 6 ¡A user has rated the three computers as follows: §A: 4 stars, B: 2 stars, C: 5 stars ¡Tasks: 1)Normalize the ratings for this user 2)Compute a user profile for the user, with the following features Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 30 Feature A B C Processor speed 3.06 2.68 2.92 Disk size 500 320 640 Main-memory size 6 4 6 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 31 Feature A B C Processor speed 3.06 2.68 2.92 Disk size 500 320 640 Main-memory size 6 4 6