Recommender Systems @ Seznam.cz
Ondřej Javornický - Product Manager
Matěj Jakimov - Researcher/ML Engineer

What?
● Who?
● Product side
● Engineering and research side
● Questions (after each section)

Why product and engineering side?
● RecSys
○ context is important
○ implementation and infrastructure are important
● bad infrastructure → expensive/impossible experiments and analysis
● good engineering → good research
● engineering + research → ability to do things right
● good product management → ability to do the right things

Who?
● Ondřej Javornický
○ at Seznam.cz since 2011
○ 2013-2016 responsible for the advertising system and ad targeting
○ in 2014 he set up the Recommender Systems Department
● Matěj Jakimov
○ completed this course in 2014
○ research and engineering on recommender systems at GaussAlgo → Seznam

Product side

About Seznam.cz
The most visited website on the Czech internet
• 35 million PV / day
• 3.2 million RU / day
In November 2017, 6.2 million real users visited Seznam.cz. That same month, the entire Czech internet population was 7.22 million RU.

Eminent publisher
Daily we publish over
• 300 own articles / videos
• 400 external articles

Pre-recommender era
Based on real-time statistics for each article and manual positioning of articles on the homepage.
• teams of editors
• real-time statistics
• 24/7 operation
• manually selecting each article
• manually positioning each article

1st recommender
Our start with recommendation
Performance
• global popularity
Personalization
• hiding of already-read content
• field of interest - based on advertising targeting

(timeline: editor's choice → 1st recommender, 20 % CTR → Collaborative Filtering, 20 % CTR → UX, 20 % CTR → Vowpal Wabbit, 8 % CTR)

Evaluation
AB testing
● no offline evaluation will tell you what one AB test can
● you have to know what exactly you want to test and how you can measure it
● at Seznam we run about 10 AB tests of RecSys at the same time, plus 2-3 tests of the UI

Evaluation
User experience
● ask users
○ use forms and user testing

Current goals
● cookie → registered user
○ for user identification
○ GDPR
● more interactive elements in the UI
○ like / dislike, comments, following…
● new metrics
○ oriented towards user satisfaction

Time for questions

Engineering and research

Technical point of view
● The primary problem is not the algorithm, but:
○ the system must respond
○ ~1000 req/sec/box
○ ~50 ms avg latency (100 ms max latency)
○ all components - tens of physical machines - High Availability
○ data are noisy and big

Our stack
● Python, Docker, Kubernetes, Git, CI/CD, Spark, Scala
○ the ability to quickly develop something is the most important
● MariaDB Galera, Couchbase, Kafka, Elasticsearch, Memcache, Swift Object Store
● monitoring and alerting
○ Prometheus, Grafana, InfluxDB
● → a lot of engineering :)
● 8 developers + 2 researchers + admins + many other teams
● Scrum … constantly under construction

Architecture
● very simplified, ignoring some important subsystems
● forked for each content service (e.g. Novinky, Stream)
● no precomputed recommendations
(diagram: Item-source, RSS feed, Feedserver, Index, Contentserver, Web, Kafka, Event storage, Model trainers, Features storage)

Time for questions

Architecture
● (almost) everything quickly reconfigurable (AB testable)
● Controlserver - the key ingredient
(diagram: same components as above)

Data collection
● clean data, get to know your data
○ unit tests, integration tests
○ post-deployment analysis - do not trust (your) code
○ beware of JavaScript
○ beware of bots (good, harmless, evil)
○ garbage in, garbage out - misleading measurements

Deployed algorithms
● Matrix Factorization
○ Alternating Least Squares
■ faster than SGD
■ parallelizes better
○ every 10-50 minutes - one run of ALS from scratch → Couchbase
○ now - single-machine computation
■ 10 min instead of 50 min
■ ~30-100 GB RAM, 32 CPUs
■ online, more flexible for experiments, needs some additional infrastructure
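Below is a minimal single-machine sketch of the ALS idea from the slide above: alternate closed-form least-squares solves for user and item factors. This is an illustrative unweighted variant on a dense toy click matrix (numpy only); the production implicit-feedback setup would differ (e.g. per-interaction confidence weights and sparse data), and all names here are hypothetical, not Seznam's code.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Closed-form solve for one side while the other side's factors are fixed.

    R     : (n, m) interaction matrix, rows = the side being solved
    fixed : (m, k) latent factors of the fixed side
    reg   : L2 regularization strength
    """
    k = fixed.shape[1]
    A = fixed.T @ fixed + reg * np.eye(k)       # (k, k) normal-equations matrix
    return np.linalg.solve(A, fixed.T @ R.T).T  # (n, k) = R @ fixed @ inv(A)

def train_als(R, k=32, reg=0.1, n_iters=10, seed=0):
    """Alternate user-side and item-side solves; each sweep is embarrassingly
    parallel over rows, which is why ALS parallelizes better than SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_iters):
        U = als_step(R, V, reg)    # user factors given item factors
        V = als_step(R.T, U, reg)  # item factors given user factors
    return U, V

# toy usage: 5 users x 4 items click matrix (hypothetical data)
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 1]], dtype=float)
U, V = train_als(R, k=2)
scores = U @ V.T  # predicted affinity, e.g. to rank candidates per user
```

Each full sweep recomputes all factors from scratch, which matches the "one run of ALS from scratch every 10-50 minutes" cadence described above.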
Deployed algorithms
● not every user has a vector
● new items do not have vectors
● Cold start
○ Beta distribution - CTR
○ Thompson sampling
○ explore-exploit dilemma
○ clustering on CF vectors (demo)
● cold start - logistic regression
○ next slide
● logistic regression - Vowpal Wabbit

Deployed algorithms
● Logistic regression - Vowpal Wabbit
○ reranking of the top few hundred candidates from CF/CTR-clu
○ many signals/features (ML terminology)
○ user ID, item ID, content tags, vector of interests from the targeting department (~hundreds of categories), estimation of age/sex, item IDs the user clicked in the past … + combinations of those
○ score(u, i) ~ user CTR + item CTR + user-tags CTR + item-age CTR + itemH-item CTR + …
○ not enough time to explain in detail
○ intro: Wide & Deep Learning for Recommender Systems
○ for those who know ML: log features at request time - add labels later

Time for questions

Diversity / calibration
● Intra-List Similarity (ILS)
○ LDA vectors
○ similar articles have similar vectors
○ titles + perex (lead paragraph) from the RSS feed
○ content - from the fulltext robot department
○ curse of dimensionality
● most papers (sketch after the next slide)
○ for each position p:
■ arg max {relevancy(u, i) - λ·avg(similarity(i, j) for j on positions < p)}
○ O(num_positions^2 * num_candidates)
○ infinite feed - no research

Diversity / calibration
● product view
○ the feed was originally supposed to substitute for missing boxes on the homepage
○ → some degree of low diversity is intended
○ → diversify only if diversity becomes an issue → ILS threshold
○ simple formula: score(pos) = 1 / topics_seen(pos) / articles_topic_seen(topic, pos)
○ topics - clusters of LDA vectors
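A small sketch of the greedy formula quoted above from the literature: at each position, pick the candidate maximizing relevancy minus λ times the average similarity to already-placed items. Cosine similarity over topic vectors (e.g. LDA) is assumed; the function and variable names are illustrative, not Seznam's code.

```python
import numpy as np

def mmr_rerank(candidates, relevancy, vectors, lam=0.5, n_positions=10):
    """Greedy diversity-aware reranking.

    candidates : list of item ids
    relevancy  : dict item -> relevancy score for this user
    vectors    : dict item -> topic vector (e.g. LDA)
    lam        : trade-off between relevance and redundancy
    """
    def sim(a, b):  # cosine similarity between two topic vectors
        va, vb = vectors[a], vectors[b]
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

    selected, remaining = [], list(candidates)
    for _ in range(min(n_positions, len(remaining))):
        def score(i):
            penalty = np.mean([sim(i, j) for j in selected]) if selected else 0.0
            return relevancy[i] - lam * penalty
        best = max(remaining, key=score)  # inner loop over all candidates
        selected.append(best)             # -> O(num_positions^2 * num_candidates)
        remaining.remove(best)
    return selected

# toy usage (hypothetical data): with a high lambda the off-topic "c"
# outranks the near-duplicate "b" at the second position
rel = {"a": 0.9, "b": 0.85, "c": 0.2}
vec = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]), "c": np.array([0.0, 1.0])}
print(mmr_rerank(["a", "b", "c"], rel, vec, lam=1.0, n_positions=3))  # ['a', 'c', 'b']
```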
Offline evaluation
● always use a time-based train-test split
● for regression problems (price prediction)
○ RMSE
● for logistic regression problems (CTR prediction)
○ use LogLoss or RMSE
● for ranking problems
○ use ranking metrics: nDCG, MRR
○ clicks should be high in the recommendation list
○ using RMSE for a ranking problem is wrong
■ just multiply the scores by some constant: the ranking stays the same, but RMSE changes
■ the creators of MovieLens say so too

Offline evaluation
● offline results - weak results
● infinite feed and ranking metrics
● strong position bias
● many other biases
● Inverse Propensity Score (multi-armed bandits) - see the sketch at the end of the deck
○ promising solution
○ many assumptions, limited usability so far
● offline evaluation of diversity
○ alpha-nDCG = nonsense in practice
○ nDCG-ILS curve?

Online evaluation
● long-term effects are hard to measure
○ the "too good recommender" example
○ complex feedback loops - research needed (simulations)
● isolation of variants
○ AB variants affect each other through training data
○ FB, Netflix, … - ignoring this issue
○ preparing infrastructure to make such experiments possible (log the AB variant for each interaction)
○ evaluation of explore strategies will then be possible
■ some portion of randomness is beneficial for the algorithms
■ how big a portion?
○ simulations - a promising area of research

Time for questions

Research aspects in industry
● "To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren't." - Rules of ML, Google

Research aspects in industry
● performance vs. simplicity vs. explainability
● common scenario:
○ a researcher suggests a new algorithm
○ 3 days of work = 100 lines of code + 1000 lines of code for a proof of concept (offline evaluation)
○ 6 months of work for 4 developers
○ many thousands of lines of production code, 10 components that require care (monitoring, maintenance)
○ a black box for others
● every nonstationary hyperparameter is problematic
○ a never-ending story

RecSys research
● RecSys research is not easily portable
○ different use case - different results for the same solution
○ many use cases (catalog size, time budget):
■ Spotify/Netflix - millions of songs, a week
■ Seznam.cz - 2000 articles, 50 ms
■ Booking.com - lots of domain constraints

RecSys research
● almost no academic work is useful in practice:
○ RMSE does not count in practice, yet a lot of research is still based on it
○ item-based CF - Amazon
○ Matrix Factorization - Netflix Prize (Simon Funk's blog!)
○ ALS - people from Yahoo
○ Factorization Machines - winners of Kaggle competitions
■ Criteo
■ Avazu
■ Outbrain
○ neural nets, Wide & Deep - Google, Gravity, YouTube
○ Facebook … - logistic regression
○ Where is academia? … a practical field indeed

Last questions

Thanks
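As referenced from the offline-evaluation slide: a minimal sketch of a self-normalized Inverse Propensity Score estimate of a new policy's expected reward from logged interactions. It assumes each log row records the probability with which the logging policy showed the item; the log format and all names here are hypothetical.

```python
def snips_estimate(logs, new_policy_prob):
    """Self-normalized IPS estimate of a new policy's expected reward.

    logs            : iterable of (context, shown_item, reward, logging_prob)
    new_policy_prob : fn(context, item) -> probability that the new policy
                      would have shown `item` in `context`
    """
    num, den = 0.0, 0.0
    for context, item, reward, p_log in logs:
        # importance weight: how much more/less likely the new policy
        # was to show this item than the logging policy was
        w = new_policy_prob(context, item) / max(p_log, 1e-6)
        num += w * reward
        den += w
    return num / den if den > 0 else 0.0
```

Self-normalizing (dividing by the sum of weights instead of the log count) reduces variance, but the estimate is only unbiased under strong assumptions, e.g. that the logging policy explored every item the new policy can show; this matches the "many assumptions, limited usability so far" caveat on the slide.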