Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers

Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (2 mins)
2. Finish looking at word vectors and word2vec (10 mins)
3. Optimization basics (8 mins)
4. Can we capture the essence of word meaning more effectively by counting? (8 mins)
5. The GloVe model of word vectors (8 mins)
6. Evaluating word vectors (12 mins)
7. Word senses (6 mins)
8. Review of classification and how neural nets differ (8 mins)
9. Introducing neural networks (14 mins)
Key Goal: To be able to read word embedding papers by the end of class

1. Course Organization
• Come to office hours/help sessions!
  • Come to discuss final project ideas as well as the assignments
  • Try to come early, often, and off-cycle
• TA office hours: a 3-hour block on Mon, Tue, Wed, Thu, & Sat, with multiple TAs
  • Just show up on Nooks and go to the Welcome room!
  • Our friendly course staff will be on hand to assist you
• Chris's office hours:
  • Mon 2–4pm. Book a slot on Calendly. You can still come along this coming Monday!
• Week 10:
  • We lose a week this year with no exam week, so we need to do final projects sooner
  • No lectures in week 10; projects due Tue of week 10; extra project help sessions

2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word in the whole corpus
• Try to predict surrounding words using the word vectors:
  P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)
• Learning: Update the vectors so they can predict actual surrounding words better
• Doing no more than this, this algorithm learns word vectors that capture word similarity and meaningful directions in the word space well!
[Figure: a window sliding over "… problems turning into banking crises as …", predicting P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), and P(w_{t+2} | w_t) from the center word w_t]

Word2vec parameters and computations
[Figure: the matrix of outside-word vectors U and the matrix of center-word vectors V; for a center word c, the dot products U v_c are turned into probabilities with softmax(U v_c)]
• The model makes the same predictions at each position
• We want a model that gives a reasonably high probability estimate to all words that occur in the context (at all often)
• A "bag of words" model!

Word2vec maximizes its objective function by putting similar words nearby in space

3. Optimization: Gradient Descent
• To learn good word vectors: we have a cost function J(θ) that we want to minimize
• Gradient Descent is an algorithm to minimize J(θ) by changing θ
• Idea: from the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat.
• Note: Our objectives may not be convex like this :( But life turns out to be okay :)
• Update equation (in matrix notation): θ^new = θ^old − α ∇_θ J(θ)
• Update equation (for a single parameter): θ_j^new = θ_j^old − α ∂J(θ)/∂θ_j
• Algorithm: gradient descent, i.e., repeatedly apply the update above, where α = step size or learning rate

Stochastic Gradient Descent
• Problem: J(θ) is a function of all windows in the corpus (often, billions!)
  • So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!
• Very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one, or after each small batch
• Algorithm: repeatedly sample a window (or mini-batch of windows), compute the gradient on just that sample, and update θ (see the sketch below)

Stochastic gradients with word vectors! [Aside]
• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words, so the gradient ∇_θ J_t(θ) is very sparse!
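Before turning to sparsity, here is a small Python/NumPy sketch tying together the naïve-softmax prediction rule P(o | c) from Section 2 and the (stochastic) gradient-descent update from Section 3. It is a minimal illustration, not the HW2 starter code: the function names, toy dimensions, and word indices (prob_outside_given_center, center_id, outside_id, alpha) are invented for this example.

import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 6, 4                              # toy sizes, chosen arbitrarily
U = 0.1 * rng.standard_normal((vocab_size, dim))    # "outside" vectors u_w (rows)
V = 0.1 * rng.standard_normal((vocab_size, dim))    # "center"  vectors v_w (rows)

def prob_outside_given_center(U, v_c):
    """Naive-softmax P(o | c) for every o: softmax over the dot products U v_c."""
    scores = U @ v_c                        # u_w^T v_c for all words w
    exp = np.exp(scores - scores.max())     # subtract max for numerical stability
    return exp / exp.sum()

def loss_and_grads(U, V, center_id, outside_id):
    """J = -log P(o | c) for one (center, outside) pair, plus its gradients."""
    v_c = V[center_id]
    p = prob_outside_given_center(U, v_c)   # shape (vocab_size,)
    loss = -np.log(p[outside_id])
    # dJ/dv_c = -u_o + sum_w P(w|c) u_w
    grad_v_c = -U[outside_id] + p @ U
    # dJ/du_w = (P(w|c) - 1[w == o]) v_c
    grad_U = np.outer(p, v_c)
    grad_U[outside_id] -= v_c
    return loss, grad_U, grad_v_c

# One stochastic-gradient-descent step on a single (center, outside) pair
alpha = 0.05                                # step size / learning rate
center_id, outside_id = 2, 4                # pretend these came from a sampled window
loss, grad_U, grad_v_c = loss_and_grads(U, V, center_id, outside_id)
U -= alpha * grad_U                         # dense for naive softmax: every u_w gets touched
V[center_id] -= alpha * grad_v_c            # only the sampled center row changes
print(f"loss = {loss:.3f}")

Note that with the naïve softmax the gradient for U is dense (every outside vector appears in the normalization term); this is exactly the cost that negative sampling, introduced below, avoids.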
Stochastic gradients with word vectors!
• We might want to only update the word vectors that actually appear!
• Solution: either you need sparse matrix update operations that only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors
• If you have millions of word vectors and do distributed computing, it is important not to have to send gigantic updates around!
[Figure: the embedding matrix as a |V| × d array in which only a handful of rows need updating. (Rows, not columns, in actual deep learning packages!)]

2b. Word2vec algorithm family: More details
Why two vectors? → Easier optimization. Average both at the end.
• But you can implement the algorithm with just one vector per word … and it helps
Two model variants:
1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words
We presented: the Skip-gram model
Additional efficiency in training:
1. Negative sampling
So far: focus on the naïve softmax (a simpler, but more expensive, training method)

The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive:
  P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)
• Hence, in standard word2vec and in HW2 you implement the skip-gram model with negative sampling
• Main idea: train binary logistic regressions to distinguish a true pair (a center word and a word in its context window) from several noise pairs (the center word paired with a random word)

The skip-gram model with negative sampling (HW2)
• From the paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize): J_t(θ) = log σ(u_o^T v_c) + Σ_{i=1..k} E_{j ~ P(w)} [log σ(−u_j^T v_c)]
• The logistic/sigmoid function (we'll become good friends soon): σ(x) = 1 / (1 + e^(−x))
• We maximize the probability of two words co-occurring in the first log, and minimize the probability of noise words

The skip-gram model with negative sampling (HW2)
• In notation more similar to class and HW2:
  J_neg-sample(u_o, v_c, U) = −log σ(u_o^T v_c) − Σ_{k ∈ K sampled indices} log σ(−u_k^T v_c)
• We take K negative samples (using word probabilities)
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word
• Sample with P(w) = U(w)^(3/4) / Z, the unigram distribution U(w) raised to the 3/4 power (we provide this function in the starter code)
• The power makes less frequent words be sampled more often
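To make the negative-sampling objective above concrete, here is a minimal Python/NumPy sketch of J_neg-sample for one (center, outside) pair, together with the U(w)^(3/4) sampling distribution. The variable names, toy vocabulary size, and unigram counts are invented for illustration; this is a sketch of the idea, not the HW2 starter code.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_distribution(unigram_counts):
    """P(w) proportional to U(w)^(3/4): flattens the unigram distribution so
    rarer words are sampled more often than their raw frequency would suggest."""
    p = np.asarray(unigram_counts, dtype=float) ** 0.75
    return p / p.sum()

def neg_sampling_loss(U, V, center_id, outside_id, neg_ids):
    """J_neg-sample(u_o, v_c, U) = -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)."""
    v_c = V[center_id]
    u_o = U[outside_id]
    loss = -np.log(sigmoid(u_o @ v_c))          # push the true pair's score up
    for k in neg_ids:                           # K sampled negative indices
        loss -= np.log(sigmoid(-U[k] @ v_c))    # push each noise pair's score down
    return loss

# Toy setup: 6-word vocabulary, 4-dimensional vectors, made-up counts
vocab_size, dim, K = 6, 4, 3
U = 0.1 * rng.standard_normal((vocab_size, dim))    # outside vectors
V = 0.1 * rng.standard_normal((vocab_size, dim))    # center vectors
counts = [1000, 500, 200, 50, 20, 5]                # hypothetical unigram counts

p_neg = neg_sample_distribution(counts)
neg_ids = rng.choice(vocab_size, size=K, replace=False, p=p_neg)
print("negative sample ids:", neg_ids)
print("loss:", neg_sampling_loss(U, V, center_id=2, outside_id=4, neg_ids=neg_ids))

Because only u_o, v_c, and the K sampled negative vectors appear in the loss, each update touches just a few rows of U and V, which is what makes training cheap.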
4. Why not capture co-occurrence counts directly?
Building a co-occurrence matrix X
• Two options: windows vs. full documents
• Window: similar to word2vec, use a window around each word → captures some syntactic and semantic information
• A word–document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"

Example: Window-based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
  • I like deep learning
  • I like NLP
  • I enjoy flying

counts     I   like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0     0         0    0       0
like       2   0     0      1     0         1    0       0
enjoy      1   0     0      0     0         0    1       0
deep       0   1     0      0     1         0    0       0
learning   0   0     0      1     0         0    0       1
NLP        0   1     0      0     0         0    0       1
flying     0   0     1      0     0         0    0       1
.          0   0     0      0     1         1    1       0

Co-occurrence vectors
• Simple count co-occurrence vectors
  • Vectors increase in size with the vocabulary
  • Very high dimensional: require a lot of storage (though sparse)
  • Subsequent classification models have sparsity issues → models are less robust
• Low-dimensional vectors
  • Idea: store "most" of the important information in a fixed, small number of dimensions: a dense vector
  • Usually 25–1000 dimensions, similar to word2vec
  • How do we reduce the dimensionality?

Classic Method: Dimensionality Reduction on X (HW1)
• Singular Value Decomposition of the co-occurrence matrix X: factorize X into UΣV^T, where U and V are orthonormal
• Retain only the top k singular values, in order to generalize
• X̂ is the best rank-k approximation to X in terms of least squares (a classic linear algebra result)
• Expensive to compute for large matrices

Hacks to X (several used in Rohde et al. 2005 in COALS)
• Running an SVD on raw counts doesn't work well
• Scaling the counts in the cells can help a lot
• Problem: function words (the, he, has) are too frequent → syntax has too much impact. Some fixes:
  • log the frequencies
  • min(X, t), with t ≈ 100
  • Ignore the function words
• Ramped windows that count closer words more than words further away
• Use Pearson correlations instead of counts, then set negative values to 0
• Etc.
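As a rough Python/NumPy sketch of this classic pipeline, the snippet below builds the window-1, symmetric co-occurrence matrix for the toy corpus above and then takes a truncated SVD to obtain dense low-dimensional word vectors. The whitespace tokenization, variable names, and choice of k are improvised for illustration (this is not the HW1 starter code).

import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]

# Build the vocabulary and a window-1, symmetric co-occurrence matrix X
tokens = [sent.split() for sent in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):            # one word to the left and one to the right
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular values/vectors.
# U_k * diag(s_k) gives the best rank-k approximation in a least-squares sense.
k = 2
U, s, Vt = np.linalg.svd(X)
word_vectors = U[:, :k] * s[:k]             # k-dimensional dense word vectors

for w in vocab:
    print(f"{w:>9}: {np.round(word_vectors[idx[w]], 2)}")

On real corpora you would apply the scaling hacks listed above (log counts, clipping at t, down-weighting function words) before the SVD, since raw counts give a poor result.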
Interesting semantic patterns emerge in the scaled vectors
[Figure 13 and Table 10 from Rohde, Gonnerman, and Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence" (the COALS model, ms. 2005): a multidimensional scaling plot in which nouns sit near their associated verbs (DRIVE/DRIVER, LEARN/STUDENT, TEACH/TEACHER, TREAT/DOCTOR, CLEAN/JANITOR, PRAY/PRIEST, MARRY/BRIDE, SWIM/SWIMMER), and the 10 nearest neighbors with their percent correlation similarities for the nouns gun, point, mind, monopoly, cardboard, lipstick, leningrad, and feet under the COALS-14K model]

5. Towards GloVe: Count-based vs. direct prediction
Count-based:
• LSA, HAL (Lund & Burgess); COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to large counts
Direct prediction:
• Skip-gram/CBOW (Mikolov et al.); NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)
• Scales with corpus size
• Inefficient usage of statistics
• Can capture complex patterns beyond word similarity
• Generate improved performance on other tasks

Encoding meaning components in vector differences [Pennington, Socher, and Manning, EMNLP 2014]
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.

                           x = solid   x = gas   x = water   x = random
P(x | ice)                 large       small     large       small
P(x | steam)               small       large     large       small
P(x | ice) / P(x | steam)  large       small     ~1          ~1

With actual co-occurrence statistics (x = fashion as the "random" word):

                           x = solid    x = gas     x = water   x = fashion
P(x | ice)                 1.9 × 10⁻⁴   6.6 × 10⁻⁵  3.0 × 10⁻³  1.7 × 10⁻⁵
P(x | steam)               2.2 × 10⁻⁵   7.8 × 10⁻⁴  2.2 × 10⁻³  1.8 × 10⁻⁵
P(x | ice) / P(x | steam)  8.9          8.5 × 10⁻²  1.36        0.96

Encoding meaning in vector differences
Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A: With a log-bilinear model, w_i · w_j = log P(i | j), so that vector differences give w_x · (w_a − w_b) = log ( P(x | a) / P(x | b) )

Combining the best of both worlds: GloVe [Pennington, Socher, and Manning, EMNLP 2014]
• Fast training
• Scalable to huge corpora
• Good performance even with a small corpus and small vectors

GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

6. How to evaluate word vectors?
• Related to general evaluation in NLP: intrinsic vs. extrinsic
• Intrinsic:
  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless a correlation to a real task is established
• Extrinsic:
  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear whether the subsystem is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy → winning!

Intrinsic word vector evaluation
• Word vector analogies: a : b :: c : ?   (e.g., man : woman :: king : ?)
• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?

GloVe Visualizations
GloVe Visualizations: Company - CEO
GloVe Visualizations: Comparatives and Superlatives

Analogy evaluation and hyperparameters
GloVe word vector evaluation on the word analogy task (percent accuracy; Sem. = semantic, Syn. = syntactic, Tot. = total). Skip-gram (SG) and CBOW results are from Mikolov et al. (2013a,b); SG† and CBOW† were trained with the word2vec tool. See the GloVe paper for details and a description of the SVD models.

Model   Dim.  Size  Sem.  Syn.  Tot.
ivLBL   100   1.5B  55.9  50.1  53.2
HPCA    100   1.6B   4.2  16.4  10.8
GloVe   100   1.6B  67.5  54.3  60.3
SG      300   1B    61    61    61
CBOW    300   1.6B  16.1  52.6  36.1
vLBL    300   1.5B  54.2  64.8  60.0
ivLBL   300   1.5B  65.2  63.0  64.0
GloVe   300   1.6B  80.8  61.5  70.3
SVD     300   6B     6.3   8.1   7.3
SVD-S   300   6B    36.7  46.6  42.1
SVD-L   300   6B    56.6  63.0  60.1
CBOW†   300   6B    63.6  67.4  65.7
SG†     300   6B    73.0  66.0  69.1
GloVe   300   6B    77.4  67.0  71.7
CBOW    1000  6B    57.3  68.9  63.7
SG      1000  6B    66.1  65.1  65.6
SVD-L   300   42B   38.4  58.2  49.2

Analogy evaluation and hyperparameters
• More data helps
  • Wikipedia is better than news text!
• Dimensionality
  • A good dimension is ~300
[Figures from the GloVe paper: accuracy on the analogy task for 300-dimensional vectors trained on Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Gigaword5 + Wiki2014 (6B tokens), and Common Crawl (42B tokens); and semantic/syntactic/overall accuracy as a function of vector dimension (symmetric context)]

Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353  http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Word 1     Word 2    Human (mean)
tiger      cat        7.35
tiger      tiger     10
book       paper      7.46
computer   internet   7.58
plane      car        5.77
professor  doctor     6.62
stock      phone      1.62
stock      CD         1.31
stock      jaguar     0.92

Correlation evaluation
• Word vector distances and their correlation with human judgments
• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model as well (e.g., averaging both vectors)

Spearman rank correlation on word similarity tasks (all vectors are 300-dimensional; the CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors):

Model   Size  WS353  MC    RG    SCWS  RW
SVD     6B    35.3   35.1  42.5  38.3  25.6
SVD-S   6B    56.5   71.5  71.0  53.6  34.7
SVD-L   6B    65.7   72.7  75.1  56.5  37.0
CBOW†   6B    57.2   65.6  68.2  57.0  32.5
SG†     6B    62.8   65.2  69.7  58.1  37.2
GloVe   6B    65.8   72.7  77.8  53.9  38.1
SVD-L   42B   74.0   76.4  74.1  58.3  39.9
GloVe   42B   75.9   83.6  82.9  59.6  47.8
CBOW*   100B  68.4   79.6  75.4  59.4  45.5
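As a rough Python sketch of how these intrinsic evaluations are computed, the snippet below scores word pairs by cosine similarity and compares the ranking to human judgments with Spearman correlation, and answers one analogy query by cosine similarity after vector addition (discarding the input words). The tiny random "embeddings", word list, and helper names are placeholders; with real vectors (e.g., pretrained GloVe) you would load them from a file instead.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice you would load pretrained vectors here.
words = ["tiger", "cat", "book", "paper", "stock", "jaguar", "king", "man", "woman", "queen"]
emb = {w: rng.standard_normal(50) for w in words}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# 1) Word similarity: correlate model similarities with human judgments
pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("stock", "jaguar", 0.92)]
model_scores = [cos(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")

# 2) Analogy a:b :: c:?  ->  argmax cosine to (x_b - x_a + x_c), excluding a, b, c
def analogy(a, b, c):
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in words if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print("man:woman :: king:?", analogy("man", "woman", "king"))

With random placeholder vectors the outputs are of course meaningless; the point is only the shape of the two evaluation procedures.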
Extrinsic word vector evaluation
• Extrinsic evaluation of word vectors: all subsequent NLP tasks in this class. More examples soon.
• One example where good word vectors should help directly: named entity recognition, i.e., identifying references to a person, organization, or location

F1 score on the NER task with 50-dimensional vectors (Discrete is the baseline without word vectors; publicly available vectors are used for HPCA, HSMN, and CW):

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5
HPCA      92.6  88.7  81.7  80.7
HSMN      90.5  85.7  78.7  74.7
CW        92.2  87.4  81.7  80.2
CBOW      93.1  88.2  82.2  81.1
GloVe     93.2  88.3  82.9  82.2

7. Word senses and word sense ambiguity
• Most words have lots of meanings!
  • Especially common words
  • Especially words that have existed for a long time
• Example: pike
• Does one vector capture all these meanings, or do we have a mess?

pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one's way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could have climbed that cliff, but he piked!

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
• Idea: cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

Linear Algebraic Structure of Word Senses, with Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec
• v_pike = α₁ v_pike1 + α₂ v_pike2 + α₃ v_pike3
• where α₁ = f₁ / (f₁ + f₂ + f₃), etc., for sense frequencies f
• Surprising result:
  • Because of ideas from sparse coding, you can actually separate out the senses (provided they are relatively common)!
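A tiny Python numeric illustration of the superposition formula above: given sense vectors and sense frequencies for "pike", the single observed embedding is their frequency-weighted average. The sense vectors and frequencies here are made up purely to show the arithmetic.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sense vectors for "pike" (weapon, fish, road) and their corpus frequencies
sense_vectors = rng.standard_normal((3, 5))     # three senses, 5-dimensional for readability
frequencies = np.array([1000.0, 3000.0, 6000.0])

# alpha_i = f_i / (f_1 + f_2 + f_3);  v_pike = sum_i alpha_i * v_pike_i
alphas = frequencies / frequencies.sum()
v_pike = alphas @ sense_vectors

print("mixture weights alpha:", np.round(alphas, 2))
print("v_pike:", np.round(v_pike, 2))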
8. Classification review and notation
• Generally, we have a training dataset consisting of samples {x_i, y_i}, i = 1, …, N
• x_i are inputs, e.g., words (indices or vectors!), sentences, documents, etc.
  • Dimension d
• y_i are labels (one of C classes) that we try to predict, for example:
  • classes: sentiment (+/–), named entities, buy/sell decision
  • other words
  • later: multi-word sequences

Classification intuition
• Training data: {x_i, y_i}, i = 1, …, N
• Simple illustration case:
  • Fixed 2D word vectors as the inputs to classify
  • Using softmax/logistic regression
  • A linear decision boundary
• Traditional ML/statistics approach: assume the x_i are fixed, and train (i.e., set) softmax/logistic regression weights W ∈ ℝ^(C×d) to determine a decision boundary (hyperplane), as in the picture
• Method: for each fixed x, predict:
  p(y | x) = exp(W_y · x) / Σ_{c=1..C} exp(W_c · x)
(Visualizations with ConvNetJS by Andrej Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Softmax classifier
Again, we can tease apart the prediction function into three steps:
1. For each row W_y of W, calculate the dot product with x: f_y = W_y · x
2. Apply the softmax function to get the normalized probability: p(y | x) = exp(f_y) / Σ_{c=1..C} exp(f_c) = softmax(f_y)
3. Choose the y with maximum probability
• For each training example (x, y), our objective is to maximize the probability of the correct class y, or equivalently we can minimize the negative log probability of that class: −log p(y | x)

Training with "cross entropy loss"
• The concept of "cross entropy" comes from information theory
• Let the true probability distribution be p, and let our computed model probability be q
• The cross entropy is: H(p, q) = −Σ_{c=1..C} p(c) log q(c)
• Assume a ground truth (or true or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0]
• Then, because of the one-hot p, the only term left is the negative log probability of the true class: −log p(y_i | x_i)

Classification over a full dataset
• Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:
  J(θ) = (1/N) Σ_{i=1..N} −log ( exp(f_{y_i}) / Σ_{c=1..C} exp(f_c) )
• Instead of writing f_y = f_y(x) = W_y · x componentwise, we will write f in matrix notation: f = Wx

Traditional ML optimization
• For statistical machine learning, θ usually consists only of the elements of W, i.e., θ ∈ ℝ^(C·d)
• So we update the decision boundary only by updating W
(Visualizations with ConvNetJS by Karpathy)
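Here is a compact Python/NumPy sketch of the softmax classifier and cross-entropy loss described above, for fixed low-dimensional inputs. The weights, toy data sizes, and function names are invented for illustration; a real setup would train W with gradient descent rather than just evaluating the loss.

import numpy as np

rng = np.random.default_rng(0)

C, d, N = 3, 2, 5                      # classes, input dimension, examples (toy sizes)
W = 0.1 * rng.standard_normal((C, d))  # softmax weights, W in R^{C x d}
X = rng.standard_normal((N, d))        # fixed 2D inputs x_i
y = rng.integers(0, C, size=N)         # gold class labels y_i

def softmax(f):
    """Row-wise softmax over the class scores f = x W^T."""
    e = np.exp(f - f.max(axis=1, keepdims=True))   # stabilize before exponentiating
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """J(theta) = (1/N) sum_i -log p(y_i | x_i), using a one-hot true distribution."""
    probs = softmax(X @ W.T)                       # p(c | x_i) for every class c
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

print("predicted classes:", softmax(X @ W.T).argmax(axis=1))
print("cross-entropy loss:", round(cross_entropy_loss(W, X, y), 3))

In the traditional setting, θ is exactly the entries of W here, and only W moves the (linear) decision boundary; the neural approaches below also learn the input representations.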
9. Neural Network Classifiers
• Softmax (≈ logistic regression) alone is not very powerful
• A softmax classifier only gives linear decision boundaries
  • This can be quite limiting → unhelpful when a problem is complex
• Wouldn't it be cool to get these correct?

Neural Nets for the Win!
• Neural networks can learn much more complex functions with nonlinear decision boundaries!
  • Non-linear in the original space

Classification difference with word vectors #1
• Commonly in NLP deep learning:
  • We learn both W and the word vectors x
  • We learn both conventional parameters and (distributed!) representations
  • The word vectors re-represent one-hot vectors (they move them around in an intermediate-layer vector space) for easy classification with a (linear) softmax classifier, conceptually via an embedding layer: x = Le
• Very large number of parameters!

Neural computation

A neuron can be modeled as a binary logistic regression unit
  h_{w,b}(x) = f(w^T x + b)
  f(z) = 1 / (1 + e^(−z))
• f = nonlinear activation function (e.g., sigmoid), w = weights, b = bias, h = hidden, x = inputs
• w, b are the parameters of this neuron, i.e., of this logistic regression model
• b: we can have an "always on" bias feature, which gives a class prior, or separate it out as a bias term

Difference #2: A neural network = running several logistic regressions at the same time
• If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
• But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Difference #2: A neural network = running several logistic regressions at the same time
• … which we can feed into another logistic regression function
• It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Difference #2: A neural network = running several logistic regressions at the same time
• Before we know it, we have a multilayer neural network….

Matrix notation for a layer
We have
  a₁ = f(W₁₁x₁ + W₁₂x₂ + W₁₃x₃ + b₁)
  a₂ = f(W₂₁x₁ + W₂₂x₂ + W₂₃x₃ + b₂)
  etc.
In matrix notation:
  z = Wx + b
  a = f(z)
where the activation f is applied element-wise:
  f([z₁, z₂, z₃]) = [f(z₁), f(z₂), f(z₃)]

Non-linearities (aka "f"): Why they're needed
• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform: W₁W₂x = Wx
• With more layers, they can approximate more complex functions!
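To tie the last two slides together, here is a brief Python/NumPy sketch of a layer computed as a = f(Wx + b), plus a check that stacking two layers without a non-linearity collapses into a single linear map. The layer sizes and random weights are arbitrary placeholders for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(W, b, x, f=sigmoid):
    """One neural-network layer: z = Wx + b, a = f(z), with f applied element-wise."""
    return f(W @ x + b)

x = rng.standard_normal(3)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

# With a non-linearity: a genuinely non-linear function of x
h = layer(W1, b1, x)                 # hidden layer (several logistic regressions at once)
out = layer(W2, b2, h)               # another layer fed by the hidden variables

# Without non-linearities, two layers compile down to one linear transform:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
linear_two_layers = W2 @ (W1 @ x + b1) + b2
single_linear = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(linear_two_layers, single_linear))   # True: no extra expressive power
print("with non-linearity:", np.round(out, 3))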