Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz 2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 3 4 a b c d Boys Girls 4 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Nodes: Boys and Girls; Edges: Preferences Goal: Match boys to girls so that maximum number of preferences is satisfied M = {(1,a),(2,b),(3,d)} is a matching Cardinality of matching = |M| = 3 1 2 3 4 a b c d Boys Girls 5 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 3 4 a b c d Boys Girls M = {(1,c),(2,b),(3,d),(4,a)} is a perfect matching 6 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Perfect matching … all vertices of the graph are matched Maximum matching … a matching that contains the largest possible number of matches 7 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 1 2 3 4 a b c d (1,a) (2,b) (3,d) 10 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡For input I, suppose greedy produces matching Mgreedy while an optimal matching is Mopt ¡ ¡Competitive ratio = ¡ minall possible inputs I (|Mgreedy|/|Mopt|) ¡ ¡(what is greedy’s worst performance over all possible inputs I) 11 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Consider a case: Mgreedy≠ Mopt ¡Consider the set G of girls matched in Mopt but not in Mgreedy ¡Then every boy B adjacent to girls in G is already matched in Mgreedy: §If there would exist such non-matched (by Mgreedy) boy adjacent to a non-matched girl then greedy would have matched them ¡Since boys B are already matched in Mgreedy then (1) |Mgreedy|≥ |B| ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 12 a b c d G={ } B={ } Mopt 1 2 3 4 ¡Summary so far: §Girls G matched in Mopt but not in Mgreedy §(1) |Mgreedy|≥ |B| ¡There are at least |G| such boys (|G| £ |B|) otherwise the optimal algorithm couldn’t have matched all girls in G §So: |G| £ |B| £ |Mgreedy| ¡By definition of G also: |Mopt| £ |Mgreedy| + |G| §Worst case is when |G| = |B| = |Mgreedy| ¡|Mopt| £ 2|Mgreedy| then |Mgreedy|/|Mopt| ³ 1/2 ¡ ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 13 a b c d G={ } B={ } Mopt 1 2 3 4 This is a general proof. We make no assumptions about the structure of the bipartite graph. 1 2 3 4 a b c (1,a) (2,b) d 14 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Banner ads (1995-2001) §Initial form of web advertising §Popular websites charged X$ for every 1,000 “impressions” of the ad §Called “CPM” rate (Cost per thousand impressions) §Modeled similar to TV, magazine ads §From untargeted to demographically targeted §Low click-through rates §Low ROI for advertisers Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 CPM…cost per mille Mille…thousand in Latin 17 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) google-results google-ads 18 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 19 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Given: §1. A set of bids by advertisers for search queries §2. A click-through rate for each advertiser-query pair §3. A budget for each advertiser (say for 1 month) §4. A limit on the number of ads to be displayed with each search query ¡Respond to each search query with a set of advertisers such that: §1. The size of the set is no larger than the limit on the number of ads per query §2. Each advertiser has bid on the search query §3. Each advertiser has enough budget left to pay for the ad if it is clicked upon Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 20 21 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 22 Advertiser Bid CTR Bid * CTR A B C $1.00 $0.75 $0.50 1% 2% 2.5% 1 cent 1.5 cents 1.125 cents Click through rate Expected revenue ¡Two complications: §Budget §CTR of an ad is unknown ¡ ¡Each advertiser has a limited budget §Search engine guarantees that the advertiser will not be charged more than their daily budget Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 23 ¡CTR: Each ad has a different likelihood of being clicked §Advertiser 1 bids $2, click probability = 0.1 §Advertiser 2 bids $1, click probability = 0.5 §Clickthrough rate (CTR) is measured historically §Very hard problem: Exploration vs. exploitation Exploit: Should we keep showing an ad for which we have good estimates of click-through rate or Explore: Shall we show a brand new ad to get a better sense of its click-through rate Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 24 25 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Two advertisers A and B §A bids on query x, B bids on x and y §Both have budgets of $4 ¡Query stream: x x x x y y y y §Worst case greedy choice: B B B B _ _ _ _ §Optimal: A A A A B B B B §Competitive ratio = ½ ¡This is the worst case! §Note: Greedy algorithm is deterministic – it always resolves draws in the same way 26 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Is greedy deterministic or randomized (it is deterministic!!) ¡BALANCE Algorithm by Mehta, Saberi, Vazirani, and Vazirani §For each query, pick the advertiser with the largest unspent budget §Break ties arbitrarily (but in a deterministic way) 27 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 28 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 29 Why exhausts budget: -- remember, optimal exhausts both budgets GENERAL PROOF A1 A2 B x y B A1 A2 x Optimal revenue = 2B Assume Balance gives revenue = 2B-x = B+y Queries allocated to A1 in the optimal solution Queries allocated to A2 in the optimal solution Not used 30 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) BALANCE exhausts A2’s budget x y B A1 A2 x Not used Whatever is not used should be assigned to A2 (since if we could assign to A1 we would since we still have the budget) CASE 1) Less than half of A1’s queries got assigned to A2. So y>B/2 CASE 2) If more than half of A1’s queries got assigned to A2. Consider last of A1s queries assigned to A2. At that time B2 > B1. At that time B2 < ½. Since more than half of A1’s queries got assigned to A2 (this means the remaining budget had to be <1/2). So B1<1/2. ¡In the general case with N advertisers, worst competitive ratio of BALANCE is 1–1/e = approx. 0.63 §Interestingly, no online algorithm has a better competitive ratio! 31 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Customer X §Buys Metallica CD §Buys Megadeth CD ¡Customer Y §Does search on Metallica §Recommender system suggests Megadeth from data collected about customer X classic alive Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 33 http://blog.hubspot.com/Portals/249/images/amazon_logo.gif http://4.bp.blogspot.com/_zmoEeqomXD4/SjftFPB6UTI/AAAAAAAACZE/gxQm5CcUp_k/s400/del.icio.us-logo.jpg http://upload.moldova.org/IT/logos/youtube_logo.gif Items MCBS01705_0000[1] Search Recommendations Products, web sites, blogs, news items, … 34 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pandora Logo http://scrapetv.com/News/News%20Pages/Business/images-5/netflix-logo.jpg http://www.growyourwritingbusiness.com/images/stumbleupon_logo.bmp http://admintell.napco.com/ee/images/uploads/appletell/618px-Last.fm_logo_.svg_.png Examples: http://upload.wikimedia.org/wikipedia/commons/5/52/Movielens-helping.gif http://blog.ithenticate.com/wp-content/uploads/2010/11/google-news-logo.png http://beefjack.com/files/2010/04/xbox-live-arcade.thumbnail.jpg Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 35 In 1988, a British mountain climber named Joe Simpson wrote a book called Touching the Void, a harrowing account of near death in the Peruvian Andes. It got good reviews but, only a modest success, it was soon forgotten. Then, a decade later, a strange thing happened. Jon Krakauer wroteInto Thin Air, another book about a mountain-climbing tragedy, which became a publishing sensation. SuddenlyTouching the Void started to sell again. Random House rushed out a new edition to keep up with demand. Booksellers began to promote it next to their Into Thin Air displays, and sales rose further. A revised paperback edition, which came out in January, spent 14 weeks on theNew York Times bestseller list. That same month, IFC Films released a docudrama of the story to critical acclaim. NowTouching the Void outsells Into Thin Air more than two to one. What happened? In short, Amazon.com recommendations. The online bookseller's software noted patterns in buying behavior and suggested that readers who liked Into Thin Airwould also like Touching the Void. People took the suggestion, agreed wholeheartedly, wrote rhapsodic reviews. More sales, more algorithm-fueled recommendations, and the positive feedback loop kicked in. Anatomy edit Source: Chris Anderson (2004) Full-size image 36 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 37 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 38 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Avatar LOTR Matrix Pirates Alice Bob Carol David 39 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 40 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 41 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 42 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Latent factor based – predict recommendations that the user has not seen yet by building a system working well on known ratings and hope it predicts unknown ratings also well ¡Main idea: Recommend items to customer x similar to previous items rated highly by x ¡ ¡Example: ¡Movie recommendations §Recommend movies with same actor(s), director, genre, … ¡Websites, blogs, news §Recommend other sites with “similar” content § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 44 MCBS01705_0000[1] likes Item profiles Red Circles Triangles User profile match recommend build 45 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org Item profile = vector of item features (e.g., one dimension = 0/1 whether it is red color, or not) User profile = e.g., average over item profiles J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 46 Harnessing quality judgments of other users 48 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Figure x N