Recommender Systems @ Seznam.cz
Ondřej Javornický - Product Manager
Matěj Jakimov - Researcher/ML Engineer

What?
● Who?
● Product side
● Engineering and research side
● Questions (after each section)

Why product and engineering side?
● RecSys
○ context is important
○ implementation and infrastructure are important
● bad infrastructure → expensive/impossible experiments and analysis
● good engineering → good research
● engineering + research → ability to do things right
● good product management → ability to do the right things

Who?
● Ondřej Javornický
○ at Seznam.cz since 2011
○ 2013-2016 responsible for the advertising system and ad targeting
○ in 2014 he set up the Recommender Systems Department
● Matěj Jakimov
○ completed this course in 2014
○ research and engineering on recommender systems at GaussAlgo → Seznam

Product side

About Seznam.cz
The most visited website on the Czech internet
• 35 million PV / day
• 3.2 million RU / day
In November 2017, 6.2 million real users visited Seznam.cz. That same month, the entire Czech internet population was 7.22 million RU.

Eminent publisher
Daily we publish over
• 300 own articles / videos
• 400 external articles

Pre-recommender era
Based on real-time statistics for each article and manual positioning of articles on the homepage.
• teams of editors
• real-time statistics
• 24/7 operation
• manually selecting each article
• manually positioning each article

1st recommender
Our start with recommendation
Performance
• global popularity
Personalization
• hiding of already-read content
• field of interest - based on advertising targeting

(timeline: editor's choice → 1st recommender, 20 % CTR → Collaborative Filtering, 20 % CTR → UX, 20 % CTR → Vowpal Wabbit, 8 % CTR)

Evaluation
AB testing
● no offline evaluation will tell you what one AB test can
● you have to know what exactly you want to test and how you can measure it
● at Seznam we run about 10 AB tests of RecSys at the same time, plus 2-3 tests of the UI

Evaluation
User experience
● ask users
○ use forms and user testing

Current goals
● cookie → registered user
○ for user identification
○ GDPR
● more interactive elements in the UI
○ like / dislike, comments, following…
● new metrics
○ oriented towards user satisfaction

Time for questions

Engineering and research

Technical point of view
● The primary problem is not the algorithm, but:
○ the system must respond
○ ~1000 req/sec/box
○ ~50 ms avg latency (100 ms max latency)
○ all components - tens of physical machines - High Availability
○ data are noisy and big

Our stack
● Python, Docker, Kubernetes, Git, CI/CD, Spark, Scala
○ the ability to quickly develop something is the most important
● MariaDB Galera, Couchbase, Kafka, Elasticsearch, Memcache, Swift Object Store
● monitoring and alerting
○ Prometheus, Grafana, InfluxDB
● → a lot of engineering :)
● 8 developers + 2 researchers + admins + many other teams
● Scrum … constantly under construction

Architecture
● very simplified, ignoring some important subsystems
● forked for each content service (e.g. Novinky, Stream)
● no precomputed recommendations
(diagram: Item-source, RSS feed, Feedserver, Index, Contentserver, Web, Kafka, Event storage, Model trainers, Features storage)

Time for questions

Architecture
● (almost) everything quickly reconfigurable (AB testable)
● Controlserver - the key ingredient
(diagram: same components as above)

Data collection
● clean data, get to know your data
○ unit tests, integration tests
○ post-deployment analysis - do not trust (your) code
○ beware of JavaScript
○ beware of bots (good, harmless, evil)
○ garbage in, garbage out - misleading measurements

Deployed algorithms
● Matrix Factorization
○ Alternating Least Squares
■ faster than SGD
■ parallelizes better
○ every 10-50 minutes - one run of ALS from scratch → Couchbase
○ now - single-machine computation
■ 10 min instead of 50 min
■ ~30-100 GB RAM, 32 CPUs
■ online, more flexible for experiments, needs some additional infrastructure
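Below is a minimal single-machine sketch of the ALS idea from the slide above: alternate closed-form least-squares solves for user and item factors. This is an illustrative unweighted variant on a dense toy click matrix (numpy only); the production implicit-feedback setup would differ (e.g. per-interaction confidence weights and sparse data), and all names here are hypothetical, not Seznam's code.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Closed-form solve for one side while the other side's factors are fixed.

    R     : (n, m) interaction matrix, rows = the side being solved
    fixed : (m, k) latent factors of the fixed side
    reg   : L2 regularization strength
    """
    k = fixed.shape[1]
    A = fixed.T @ fixed + reg * np.eye(k)       # (k, k) normal-equations matrix
    return np.linalg.solve(A, fixed.T @ R.T).T  # (n, k) = R @ fixed @ inv(A)

def train_als(R, k=32, reg=0.1, n_iters=10, seed=0):
    """Alternate user-side and item-side solves; each sweep is embarrassingly
    parallel over rows, which is why ALS parallelizes better than SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_iters):
        U = als_step(R, V, reg)    # user factors given item factors
        V = als_step(R.T, U, reg)  # item factors given user factors
    return U, V

# toy usage: 5 users x 4 items click matrix (hypothetical data)
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 1]], dtype=float)
U, V = train_als(R, k=2)
scores = U @ V.T  # predicted affinity, e.g. to rank candidates per user
```

Each full sweep recomputes all factors from scratch, which matches the "one run of ALS from scratch every 10-50 minutes" cadence described above.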
Deployed algorithms
● not every user has a vector
● new items do not have vectors
● Cold start
○ Beta distribution - CTR
○ Thompson sampling
○ explore-exploit dilemma
○ clustering on CF vectors (demo)
● cold start - logistic regression
○ next slide
● logistic regression - Vowpal Wabbit

Deployed algorithms
● Logistic regression - Vowpal Wabbit
○ reranking of the top few hundred candidates from CF/CTR-clu
○ many signals/features (ML terminology)
○ user ID, item ID, content tags, vector of interests from the targeting department (~hundreds of categories), estimation of age/sex, item IDs the user clicked in the past … + combinations of those
○ score(u, i) ~ user CTR + item CTR + user-tags CTR + item-age CTR + itemH-item CTR + …
○ not enough time to explain in detail
○ intro: Wide & Deep Learning for Recommender Systems
○ for those who know ML: log features at request time - add labels later

Time for questions

Diversity / calibration
● Intra-List Similarity (ILS)
○ LDA vectors
○ similar articles have similar vectors
○ titles + perex (lead paragraph) from the RSS feed
○ content - from the fulltext robot department
○ curse of dimensionality
● most papers (sketch after the next slide)
○ for each position p:
■ arg max {relevancy(u, i) - λ·avg(similarity(i, j) for j on positions < p)}
○ O(num_positions^2 * num_candidates)
○ infinite feed - no research

Diversity / calibration
● product view
○ the feed was originally supposed to substitute for missing boxes on the homepage
○ → some degree of low diversity is intended
○ → diversify only if diversity becomes an issue → ILS threshold
○ simple formula: score(pos) = 1 / topics_seen(pos) / articles_topic_seen(topic, pos)
○ topics - clusters of LDA vectors
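A small sketch of the greedy formula quoted above from the literature: at each position, pick the candidate maximizing relevancy minus λ times the average similarity to already-placed items. Cosine similarity over topic vectors (e.g. LDA) is assumed; the function and variable names are illustrative, not Seznam's code.

```python
import numpy as np

def mmr_rerank(candidates, relevancy, vectors, lam=0.5, n_positions=10):
    """Greedy diversity-aware reranking.

    candidates : list of item ids
    relevancy  : dict item -> relevancy score for this user
    vectors    : dict item -> topic vector (e.g. LDA)
    lam        : trade-off between relevance and redundancy
    """
    def sim(a, b):  # cosine similarity between two topic vectors
        va, vb = vectors[a], vectors[b]
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

    selected, remaining = [], list(candidates)
    for _ in range(min(n_positions, len(remaining))):
        def score(i):
            penalty = np.mean([sim(i, j) for j in selected]) if selected else 0.0
            return relevancy[i] - lam * penalty
        best = max(remaining, key=score)  # inner loop over all candidates
        selected.append(best)             # -> O(num_positions^2 * num_candidates)
        remaining.remove(best)
    return selected

# toy usage (hypothetical data): with a high lambda the off-topic "c"
# outranks the near-duplicate "b" at the second position
rel = {"a": 0.9, "b": 0.85, "c": 0.2}
vec = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]), "c": np.array([0.0, 1.0])}
print(mmr_rerank(["a", "b", "c"], rel, vec, lam=1.0, n_positions=3))  # ['a', 'c', 'b']
```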
Offline evaluation
● always use a time-based train-test split
● for regression problems (price prediction)
○ RMSE
● for logistic regression problems (CTR prediction)
○ use LogLoss or RMSE
● for ranking problems
○ use ranking metrics: nDCG, MRR
○ clicks should be high in the recommendation list
○ using RMSE for a ranking problem is wrong
■ just multiply the scores by some constant: the ranking stays the same, but RMSE changes
■ the creators of MovieLens say so too

Offline evaluation
● offline results - weak results
● infinite feed and ranking metrics
● strong position bias
● many other biases
● Inverse Propensity Score (multi-armed bandits) - see the sketch at the end of the deck
○ promising solution
○ many assumptions, limited usability so far
● offline evaluation of diversity
○ alpha-nDCG = nonsense in practice
○ nDCG-ILS curve?

Online evaluation
● long-term effects are hard to measure
○ the "too good recommender" example
○ complex feedback loops - research needed (simulations)
● isolation of variants
○ AB variants affect each other through training data
○ FB, Netflix, … - ignoring this issue
○ preparing infrastructure to make such experiments possible (log the AB variant for each interaction)
○ evaluation of explore strategies will then be possible
■ some portion of randomness is beneficial for the algorithms
■ how big a portion?
○ simulations - a promising area of research

Time for questions

Research aspects in industry
● "To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren't." - Rules of ML, Google

Research aspects in industry
● performance vs. simplicity vs. explainability
● common scenario:
○ a researcher suggests a new algorithm
○ 3 days of work = 100 lines of code + 1000 lines of code for a proof of concept (offline evaluation)
○ 6 months of work for 4 developers
○ many thousands of lines of production code, 10 components that require care (monitoring, maintenance)
○ a black box for others
● every nonstationary hyperparameter is problematic
○ a never-ending story

RecSys research
● RecSys research is not easily portable
○ different use case - different results for the same solution
○ many use cases (catalog size, time budget):
■ Spotify/Netflix - millions of songs, a week
■ Seznam.cz - 2000 articles, 50 ms
■ Booking.com - lots of domain constraints

RecSys research
● almost no academic work is useful in practice:
○ RMSE does not count in practice, yet a lot of research is still based on it
○ item-based CF - Amazon
○ Matrix Factorization - Netflix Prize (Simon Funk's blog!)
○ ALS - people from Yahoo
○ Factorization Machines - winners of Kaggle competitions
■ Criteo
■ Avazu
■ Outbrain
○ neural nets, Wide & Deep - Google, Gravity, YouTube
○ Facebook … - logistic regression
○ Where is academia? … a practical field indeed

Last questions

Thanks
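As referenced from the offline-evaluation slide: a minimal sketch of a self-normalized Inverse Propensity Score estimate of a new policy's expected reward from logged interactions. It assumes each log row records the probability with which the logging policy showed the item; the log format and all names here are hypothetical.

```python
def snips_estimate(logs, new_policy_prob):
    """Self-normalized IPS estimate of a new policy's expected reward.

    logs            : iterable of (context, shown_item, reward, logging_prob)
    new_policy_prob : fn(context, item) -> probability that the new policy
                      would have shown `item` in `context`
    """
    num, den = 0.0, 0.0
    for context, item, reward, p_log in logs:
        # importance weight: how much more/less likely the new policy
        # was to show this item than the logging policy was
        w = new_policy_prob(context, item) / max(p_log, 1e-6)
        num += w * reward
        den += w
    return num / den if den > 0 else 0.0
```

Self-normalizing (dividing by the sum of weights instead of the log count) reduces variance, but the estimate is only unbiased under strong assumptions, e.g. that the logging policy explored every item the new policy can show; this matches the "many assumptions, limited usability so far" caveat on the slide.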