word embeddings: what, how and whither
Yoav Goldberg, Bar Ilan University

understanding word2vec

word2vec seems magical. "Neural computation, just like in the brain!"
How does this actually work?

How does word2vec work?
word2vec implements several different algorithms:
• Two training methods: Negative Sampling and Hierarchical Softmax.
• Two context representations: Continuous Bag of Words (CBOW) and Skip-grams.
We'll focus on skip-grams with negative sampling (SGNS); the intuitions apply to the other models as well.

How does word2vec work?
• Represent each word as a d-dimensional vector.
• Represent each context as a d-dimensional vector.
• Initialize all vectors to random weights.
• Arrange the vectors in two matrices, W and C.

How does word2vec work?
While more text:
• Extract a word window:
  A springer is [ a cow or heifer close to calving ] .
  Here w = heifer is the focus word (a row in W), and c1..c6 = a, cow, or, close, to, calving are the context words (rows in C).
• Try setting the vector values such that
  σ(w·c1) + σ(w·c2) + σ(w·c3) + σ(w·c4) + σ(w·c5) + σ(w·c6)
  is high.
• Create a corrupt example by replacing the focus word with a random word w':
  [ a cow or comet close to calving ]
  and try setting the vector values such that
  σ(w'·c1) + σ(w'·c2) + σ(w'·c3) + σ(w'·c4) + σ(w'·c5) + σ(w'·c6)
  is low.

How does word2vec work?
The training procedure results in:
• w·c for good word-context pairs is high.
• w·c for bad word-context pairs is low.
• w·c for ok-ish word-context pairs is neither high nor low.
As a result:
• Words that share many contexts get close to each other.
• Contexts that share many words get close to each other.
At the end, word2vec throws away C and returns W.

Reinterpretation
Imagine we didn't throw away C, and consider the product M = W·Cᵀ.
The result is a matrix M in which:
• Each row corresponds to a word.
• Each column corresponds to a context.
• Each cell holds w·c, an association measure between a word and a context.
Does this remind you of something? It is very similar to SVD over a distributional representation.

What is SGNS learning?
• An implicit |V_W| × |V_C| matrix, in which each cell describes the relation between a specific word-context pair: w·c = ?
• We prove that, for a large enough d and enough iterations, SGNS recovers the word-context PMI matrix, shifted by a global constant:
  w·c = PMI(w, c) - log k
  where k is the number of negative samples.
("Neural Word Embeddings as Implicit Matrix Factorization", Levy & Goldberg, NIPS 2014)

What is SGNS learning?
• SGNS is doing something very similar to the older distributional approaches.
• SGNS is implicitly factorizing the traditional word-context PMI matrix -- and so does SVD! (A small numeric sketch of this matrix follows below.)
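To make the factorization claim concrete, here is a minimal numpy sketch (my own illustration, not code from the paper; the toy counts are made up) of the shifted PMI matrix that SGNS implicitly factorizes:

```python
import numpy as np

# Toy word-context co-occurrence counts: counts[i, j] = #(word_i, context_j),
# collected from the same word windows that SGNS trains on.
counts = np.array([[10., 0., 3.],
                   [ 2., 8., 1.],
                   [ 0., 1., 9.]])
k = 5  # SGNS's number of negative samples

total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total  # P(w)
p_c = counts.sum(axis=0, keepdims=True) / total  # P(c)
p_wc = counts / total                            # P(w, c)

with np.errstate(divide="ignore"):               # unseen pairs get PMI = -inf
    pmi = np.log(p_wc / (p_w * p_c))

shifted_pmi = pmi - np.log(k)  # the matrix SGNS implicitly factorizes
print(shifted_pmi)
```

The -inf cells of unseen pairs cannot be represented by a dense low-rank product, which is one reason the paper also considers the positive part max(PMI(w, c) - log k, 0), i.e. shifted PPMI, when working with this matrix directly.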
• But do they capture the same similarity function?

SGNS vs SVD
Nearest neighbours under SGNS and under SVD -- the two lists largely coincide:
• dog: rabbit, cat(s), pet, poodle, monkey, pig
• wines: grape, wine, grapes, winemaking, varietal, tasting, vintages
• October: December, November, April, January, June, July, March

But word2vec is still better, isn't it?
• There is plenty of evidence that word2vec outperforms the traditional methods.
• In particular: "Don't count, predict!" (Baroni et al., 2014).
• How does this fit with our story?

The Big Impact of "Small" Hyperparameters
• word2vec is more than just an algorithm...
• It introduces many engineering tweaks and hyperparameter settings.
• These may seem minor, but they make a big difference in practice.
• Their impact is often more significant than that of the embedding algorithm itself.
• And these modifications can be ported to the distributional methods!
(Levy, Goldberg & Dagan, in submission)

the magic of cbow
• Represent a sentence / paragraph / document as the (weighted) average of the vectors of its words.
• Now we have a single, 100-dimensional representation of the text.
• Similar texts have similar vectors!
• Isn't this magical? (no)

the math of cbow
The dot product of two averaged texts decomposes into pairwise word similarities:
  (1/|A| Σi ai) · (1/|B| Σj bj) = 1/(|A||B|) Σi Σj (ai · bj)

the magic of cbow
• It's all about (weighted) all-pairs similarity...
• ...computed in an efficient manner.
• That's it. No more, no less. (A small numeric demonstration follows this part.)
• I'm amazed by how few people realize this. (The math is so simple... even I could do it.)

this also explains king - man + woman
Finding the x that maximizes cos(x, king) - cos(x, man) + cos(x, woman) is, once again, just adding and subtracting pairwise similarities.

and once we understand, we can improve
math > magic
can we improve analogies even further? (See the analogy sketch below.)
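Backing up "the math of cbow", a tiny numeric check (my own sketch; the vectors are random stand-ins, nothing here depends on trained embeddings): the dot product of two averaged text vectors equals the average over all pairwise word-word dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 100))   # "text A": 4 word vectors
B = rng.normal(size=(6, 100))   # "text B": 6 word vectors

# Similarity of the two averaged (CBOW-style) text vectors...
avg_sim = A.mean(axis=0) @ B.mean(axis=0)

# ...equals the average of all 4*6 pairwise word-word dot products.
all_pairs_sim = (A @ B.T).mean()

assert np.isclose(avg_sim, all_pairs_sim)
print(avg_sim, all_pairs_sim)
```

The identity is exact for dot products; with cosine similarity the same picture holds up to the norms of the averaged vectors.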
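And a sketch of the analogy side (again my own illustration; that the alluded-to improvement is the multiplicative 3CosMul objective of Levy & Goldberg, 2014, is my reading, and the toy vocabulary and vectors are placeholders):

```python
import numpy as np

def three_cos_add(W, vocab, a, a_star, b, exclude):
    """3CosAdd: argmax_x cos(x,b) - cos(x,a) + cos(x,a*).
    This is exactly the 'king - man + woman' computation."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit rows: dot = cosine
    sims = Wn @ Wn[vocab[b]] - Wn @ Wn[vocab[a]] + Wn @ Wn[vocab[a_star]]
    for w in exclude:                                  # never return a query word
        sims[vocab[w]] = -np.inf
    return list(vocab)[int(sims.argmax())]

def three_cos_mul(W, vocab, a, a_star, b, exclude, eps=1e-3):
    """3CosMul: argmax_x cos(x,b) * cos(x,a*) / (cos(x,a) + eps),
    with cosines mapped into [0, 1] so the product is well-behaved."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = lambda w: (Wn @ Wn[vocab[w]] + 1) / 2
    sims = cos(b) * cos(a_star) / (cos(a) + eps)
    for w in exclude:
        sims[vocab[w]] = -np.inf
    return list(vocab)[int(sims.argmax())]

# man : woman :: king : ?  With trained embeddings this tends to return
# "queen"; with the random stand-in vectors below the output is arbitrary.
vocab = {w: i for i, w in enumerate(["man", "woman", "king", "queen", "royal"])}
W = np.random.default_rng(1).normal(size=(len(vocab), 50))
query = dict(a="man", a_star="woman", b="king", exclude={"man", "woman", "king"})
print(three_cos_add(W, vocab, **query))
print(three_cos_mul(W, vocab, **query))
```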
which brings me to:
• Yes. Please stop evaluating on word analogies.
• It is an artificial and useless task.
• Worse, it is just a proxy for (a very particular kind of) word similarity.
• Unless you have a good use case, don't do it.
• Alternatively: show that it correlates well with a real and useful task.

let's take a step back
• We don't really care about the vectors; we care about the similarity function they induce.
• (Or maybe we want to use them in an external task.)
• We want similar words to have similar vectors, so evaluating on word-similarity tasks is great.
• But what does "similar" mean?

many faces of similarity
• dog -- cat
• dog -- poodle
• dog -- animal
• dog -- bark
• dog -- leash
• dog -- chair (same POS)
• dog -- dig (edit distance)
• dog -- god (same letters)
• dog -- fog (rhyme)
• dog -- 6op (shape)

some forms of similarity look more useful than they really are
• Almost every algorithm you come up with will be good at capturing countries, cities, months and person names.
• That is useful for tagging / parsing / NER.
• But do we really want "John went to China in June" to be similar to "Carl went to Italy in February"?

there is no single downstream task
• Different tasks require different kinds of similarity.
• Different vector-inducing algorithms produce different similarity functions.
• There is no single representation for all tasks.
• If your vectors do great on task X, I don't care that they suck on task Y.

"but my algorithm works great on all these different word-similarity datasets! doesn't it mean something?"
• Sure it does.
• It means these datasets are not diverse enough. They should have been a single dataset.
• (Alternatively: our evaluation metrics are not discriminating enough.)

which brings us back to: what does it mean for two texts to be similar?
• This is really, really ill-defined.
• What does it mean for legal contracts to be similar?
• What does it mean for newspaper articles to be similar?
• Think about this before running off to design your next super-LSTM-recursive-autoencoding-document-embedder.
• Start from the use case!!!!

so how to evaluate?
• Define the similarity / task you care about.
• Score on this particular similarity / task.
• Design your vectors to match this similarity.
• ...and since the methods we use are distributional and unsupervised...
• ...design has less to do with the fancy math (= objective function, optimization procedure) and more with what you feed it.

context matters

What's in a Context?
• Importing ideas from embeddings improves distributional methods.
• Can distributional ideas also improve embeddings?
• Idea: change SGNS's default bag-of-words contexts into dependency contexts.
("Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014)

Example: Australian scientist discovers star with telescope
• Target word: discovers.
• Bag-of-words (BoW) contexts: the surrounding words -- Australian, scientist, star, with, telescope.
• Syntactic dependency contexts: scientist/nsubj, star/dobj, telescope/prep_with.
(A context-extraction sketch follows below.)
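A minimal sketch of the two context types for this example (my own illustration; the dependency triples are hard-coded here, whereas the paper obtains them from a syntactic parser with prepositions collapsed):

```python
sentence = "Australian scientist discovers star with telescope".split()
target = "discovers"

# Bag-of-words contexts: the other words in the window.
bow_contexts = [w for w in sentence if w != target]

# Dependency contexts: (head, relation, modifier) triples from a parse.
parse = [("scientist", "amod", "Australian"),
         ("discovers", "nsubj", "scientist"),
         ("discovers", "dobj", "star"),
         ("discovers", "prep_with", "telescope")]
dep_contexts = [f"{mod}/{rel}" for head, rel, mod in parse if head == target]

print(bow_contexts)  # ['Australian', 'scientist', 'star', 'with', 'telescope']
print(dep_contexts)  # ['scientist/nsubj', 'star/dobj', 'telescope/prep_with']
```

In the paper each modifier additionally receives an inverse context (e.g. discovers/nsubj⁻¹ for scientist), so the relation is visible from both sides.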
Embedding Similarity with Different Contexts
("Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014)

Target word: Hogwarts (Harry Potter's school)
• Bag of Words (k=5): Dumbledore, hallows, half-blood, Malfoy, Snape -- related to Harry Potter.
• Dependencies: Sunnydale, Collinwood, Calarts, Greendale, Millfield -- schools.

Target word: Turing (computer scientist)
• Bag of Words (k=5): nondeterministic, non-deterministic, computability, deterministic, finite-state -- related to computability.
• Dependencies: Pauling, Hotelling, Heting, Lessing, Hamming -- scientists.

Target word: dancing (dance gerund)
• Bag of Words (k=5): singing, dance, dances, dancers, tap-dancing -- related to dance.
• Dependencies: singing, rapping, breakdancing, miming, busking -- gerunds.

What is the effect of different context types?
• Thoroughly studied in distributional methods: Lin (1998), Padó and Lapata (2007), and many others...
General conclusion (and it holds for embeddings as well):
• Bag-of-words contexts induce topical similarities.
• Dependency contexts induce functional similarities: words that share the same semantic type, cohyponyms.

• Same algorithm, different inputs -- very different kinds of similarity.
• Inputs matter much more than the algorithm.
• Think about your inputs.

what's left to do?
• Pretty much nothing, and pretty much everything.
• Word embeddings are just a small step on top of distributional lexical semantics.
• All of the previous open questions remain open, including: composition, multiple senses, multi-word units.

looking beyond words
• word2vec will easily identify that "hotfix" is similar to "hf", "hot-fix" and "patch".
• But what about "hot fix"?
• How do we know that "New York" is a single entity?
• Sure, we can use a collocation-extraction method (a sketch follows below), but is it really the best we can do? Can't it be integrated into the model?
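For reference, a minimal sketch of such a collocation-extraction heuristic, in the spirit of the word2phrase tool distributed with word2vec (my own simplification; the discount and threshold values are arbitrary):

```python
from collections import Counter

def score_bigrams(tokens, delta=5, threshold=1.0):
    """Score adjacent word pairs: pairs that co-occur far more often than
    their unigram frequencies predict (e.g. "new york") score high.
    The discount delta suppresses rare, unreliable pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - delta) * total / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases

tokens = "i live in new york new york is big".split()
print(score_bigrams(tokens, delta=1))  # {('new', 'york'): 2.25}
```

Pairs above the threshold are then merged into single tokens ("new_york") before training, which is exactly the kind of preprocessing pipeline the question above is poking at: it works, but it lives outside the model.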
what happens when we look outside of English?
• Things don't work nearly as well.
• Known problems from English become more extreme.
• We get some new problems as well.

a quick look at Hebrew

word senses
• ספר -- book (N), barber (N), counted (V), tell! (V), told (V).
• חומה -- brown (feminine, singular), wall (noun), her fever (possessed noun).

multi-word units
• עורך דין (lawyer)
• בית ספר (school)
• שומר ראש (bodyguard)
• יושב ראש (chairman)
• ראש עיר (mayor)
• בית שימוש (toilet)

words vs. tokens
• וכשמהבית -- a single token meaning "and when from the house".
• בצל -- a single token reading either "in shadow" or "onion".

and of course: inflections
• Nouns, pronouns and adjectives are inflected for number and gender.
• Verbs are inflected for number, gender, tense and person.
• Syntax requires agreement between nouns and adjectives, and between verbs and subjects.

and of course: inflections
she saw a brown fox --> היא ראתה שועל חום (fox and brown both [masc])
he saw a brown fence --> הוא ראה גדר חומה (fence and brown both [fem])

inflections and dist-sim
• More word forms -- more sparsity.
• But more importantly: agreement patterns affect the resulting similarities.

adjectives
• green [m,sg] (ירוק): blue [m,sg], orange [m,sg], yellow [m,sg], red [m,sg]
• green [f,sg] (ירוקה): gray [f,sg], orange [f,sg], yellow [f,sg], magical [f,sg]
• green [m,pl] (ירוקים): gray [m,pl], blue [m,pl], black [m,pl], heavenly [m,pl]

verbs
• (he) walked (הלך): (they) walked, (he) is walking, (he) turned, (he) came closer
• (she) thought (חשבה): (she) is thinking, (she) felt, (she) is convinced, (she) insisted
• (they) ate (אכלו): (they) will eat, (they) are eating, (he) ate, (they) drank

nouns
• doctor [m,sg] (רופא): psychiatrist [m,sg], psychologist [m,sg], neurologist [m,sg], engineer [m,sg]
• doctor [f,sg] (רופאה): student [f,sg], nun [f,sg], waitress [f,sg], photographer [f,sg]

nouns
• sweater (סוודר): jacket, down, overall, turban -- all masculine.
• shirt (חולצה): suit, robe, dress, helmet -- all feminine.
• The masculine / feminine split is grammatical and completely arbitrary.

inflections and dist-sim
• Inflections and agreement really influence the results.
• We get a mix of syntax and semantics.
• Which aspect of the similarity do we care about?

what does it mean to be similar?
• We need better control of the different aspects.

inflections and dist-sim
• Work with lemmas instead of words!!
• Sure, but where do you get the lemmas?
• ...for unknown words?
• And what should you lemmatize? Everything? Some things? Context-dependent?
• Ongoing work in my lab -- but still much to do.

to summarize
• Magic is bad. Understanding is good. Once you understand, you can control and improve.
• Word embeddings are just distributional semantics in disguise.
• Think about what you actually want to solve --> focus on a specific task!
• Inputs >> fancy math.
• Look beyond just words.
• Look beyond just English.