Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers

2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word position in the whole corpus
• Try to predict surrounding words using word vectors:
  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)
• Learning: update the vectors so they predict the actual surrounding words better
• Doing no more than this, the algorithm learns word vectors that capture word similarity and meaningful directions in the word space well!
• Example window: "… problems turning into banking crises as …", where the center word w_t predicts its neighbors via P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Word2vec parameters … and computations
• [Figure: the outside-word matrix U and center-word matrix V; the dot products U·v_c; and softmax(U·v_c), the probabilities over outside words]
• The model makes the same predictions at each position
• We want a model that gives a reasonably high probability estimate to all words that occur in the context (at all often)
• A "bag of words" model!

Word2vec maximizes its objective function by putting similar words nearby in space

The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive (when there are many output classes): in
  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)
  the denominator is a big sum over words
• Hence, in standard word2vec and in HW2 you implement the skip-gram model with negative sampling
• Main idea: train binary logistic regressions to differentiate a true pair (center word and a word in its context window) from several "noise" pairs (the center word paired with a random word)

Word2vec algorithm family (Mikolov et al. 2013): More details
• Why two vectors? → Easier optimization. Average both at the end
  • But you can implement the algorithm with just one vector per word … and it helps a bit
• Two model variants:
  1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
  2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words
  We presented: the skip-gram model
• Loss functions for training:
  1. Naïve softmax (a simple but expensive loss function when there are many output classes)
  2. More optimized variants like hierarchical softmax
  3. Negative sampling
  So far, we explained the naïve softmax

The skip-gram model with negative sampling (HW2)
• Introduced in: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize), for a center word c, observed outside word o, and k sampled noise words:
  J_t(θ) = log σ(u_o^T v_c) + Σ_{i=1}^{k} E_{j∼P(w)} [ log σ(−u_j^T v_c) ]
• The logistic/sigmoid function (we'll become good friends soon): σ(x) = 1 / (1 + e^{−x})
• We maximize the probability of the two words co-occurring in the first log, and minimize the probability of the noise words in the second part

Stochastic gradients with negative sampling [aside]
• We iteratively take gradients at each window for SGD
• In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so ∇_θ J_t(θ) is very sparse!
• We might only update the word vectors that actually appear!
• Solution: either you need sparse matrix update operations to update only certain rows of the full |V| × d embedding matrices U and V (rows, not columns, in actual DL packages!), or you need to keep around a hash for word vectors
• If you have millions of word vectors and do distributed computing, it is important not to have to send gigantic updates around!
• This is also a particular issue with more advanced optimization methods in the Adagrad family
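Putting the last few slides together, here is a minimal NumPy sketch (not the HW2 starter code) of one negative-sampling SGD step that only touches the embedding rows that actually appear in the window. The matrix names U and Vc, the sizes, the learning rate, and the way noise words are sampled are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary of 10,000 words, 100-dimensional vectors.
V_SIZE, D, K = 10_000, 100, 5                   # K = number of negative samples
U = rng.normal(scale=0.01, size=(V_SIZE, D))    # outside-word vectors (one row per word)
Vc = rng.normal(scale=0.01, size=(V_SIZE, D))   # center-word vectors (one row per word)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, outside, neg_ids, lr=0.05):
    """One SGD step on the negative-sampling objective
    J_t = log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_k . v_c),
    updating only the rows involved (assumes no duplicate ids, for simplicity)."""
    v_c = Vc[center]
    ids = np.concatenate(([outside], neg_ids))       # 1 true + K noise outside words
    labels = np.array([1.0] + [0.0] * len(neg_ids))  # 1 for the true pair, 0 for noise
    scores = sigmoid(U[ids] @ v_c)                   # predicted "true pair" probabilities
    # Gradients of the negated objective w.r.t. only the rows that appear
    grad_u = np.outer(scores - labels, v_c)          # shape (K+1, D): rows of U
    grad_v = (scores - labels) @ U[ids]              # shape (D,):    one row of Vc
    U[ids] -= lr * grad_u                            # sparse row updates only
    Vc[center] -= lr * grad_v

# Example: center word 42, observed outside word 7, five sampled noise words
sgns_step(center=42, outside=7, neg_ids=rng.integers(0, V_SIZE, size=K))
```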
Interesting semantic patterns emerge in the scaled vectors
• COALS model from Rohde et al. ms., 2005, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence" (Rohde, Gonnerman, and Plaut, Modeling Word Meaning Using Lexical Co-Occurrence)
• [Figure 13: Multidimensional scaling for nouns and their associated verbs: drive, driver, learn, student, teach, teacher, doctor, treat, clean, janitor, pray, priest, marry, bride, swim, swimmer]
• [Table 10: The 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet), under the COALS-14K model]

4. How to evaluate word vectors?
• Related to general evaluation in NLP: intrinsic vs. extrinsic
• Intrinsic:
  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if it is really helpful unless a correlation with a real task is established
• Extrinsic:
  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear whether the subsystem is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy → winning!

Intrinsic word vector evaluation
• Word vector analogies: man : woman :: king : ?   (a : b :: c : ?)
• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search (!)
• Problem: what if the information is there but not linear?
• [Figure: the analogy man : woman :: king : ? as vector offsets in word vector space]

GloVe visualization [figures]

Meaning similarity: Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

  Word 1     Word 2     Human (mean)
  tiger      cat         7.35
  tiger      tiger      10
  book       paper       7.46
  computer   internet    7.58
  plane      car         5.77
  professor  doctor      6.62
  stock      phone       1.62
  stock      CD          1.31
  stock      jaguar      0.92

Correlation evaluation
• Word vector distances and their correlation with human judgments
• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model also (e.g., averaging both vectors)
• Table 3 (GloVe paper): Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

  Model   Size   WS353  MC    RG    SCWS  RW
  SVD     6B     35.3   35.1  42.5  38.3  25.6
  SVD-S   6B     56.5   71.5  71.0  53.6  34.7
  SVD-L   6B     65.7   72.7  75.1  56.5  37.0
  CBOW†   6B     57.2   65.6  68.2  57.0  32.5
  SG†     6B     62.8   65.2  69.7  58.1  37.2
  GloVe   6B     65.8   72.7  77.8  53.9  38.1
  SVD-L   42B    74.0   76.4  74.1  58.3  39.9
  GloVe   42B    75.9   83.6  82.9  59.6  47.8
  CBOW*   100B   68.4   79.6  75.4  59.4  45.5
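Below is a minimal sketch of both intrinsic evaluations just described: the analogy test and Spearman correlation against WordSim353-style human judgments. A toy random embedding table stands in for real pretrained vectors (in practice you would load GloVe or word2vec vectors and use the full 353 pairs); the vocabulary, dimensionality, and helper names are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
vocab = ["king", "queen", "man", "woman", "tiger", "cat",
         "stock", "jaguar", "professor", "doctor"]

def unit(v):
    return v / np.linalg.norm(v)

# Toy embedding table: word -> 50-dimensional unit vector (random here).
E = {w: unit(rng.normal(size=50)) for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """a : b :: c : ?  Return the vocab word closest (by cosine) to x_b - x_a + x_c,
    discarding the three input words from the search."""
    target = E[b] - E[a] + E[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(E[w], target))

# Intrinsic evaluation 1: analogies (man : woman :: king : ?)
print(analogy("man", "woman", "king"))

# Intrinsic evaluation 2: correlation with human similarity judgments
pairs = [("tiger", "cat", 7.35), ("professor", "doctor", 6.62), ("stock", "jaguar", 0.92)]
model_scores = [cosine(E[w1], E[w2]) for w1, w2, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _pval = spearmanr(model_scores, human_scores)   # Spearman rank correlation
print(rho)
```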
Extrinsic word vector evaluation
• One example where good word vectors should help directly: named entity recognition, i.e., identifying references to a person, organization, or location: "Chris Manning lives in Palo Alto."
• Subsequent NLP tasks in this class are other examples, so more examples soon.
• Table 4 (GloVe paper): F1 score on the NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details.

  Model     Dev   Test  ACE   MUC7
  Discrete  91.0  85.4  77.4  73.4
  SVD       90.8  85.7  77.3  73.7
  SVD-S     91.0  85.5  77.6  74.3
  SVD-L     90.5  84.8  73.6  71.5
  HPCA      92.6  88.7  81.7  80.7
  HSMN      90.5  85.7  78.7  74.7
  CW        92.2  87.4  81.7  80.2
  CBOW      93.1  88.2  82.2  81.1
  GloVe     93.2  88.3  82.9  82.2

• [Figure 3 from the GloVe paper: semantic and syntactic analogy accuracy (%) for vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), …]

5. Word senses and word sense ambiguity
• Most words have lots of meanings!
  • Especially common words
  • Especially words that have existed for a long time
• Example: pike
• Does one vector capture all these meanings, or do we have a mess?

pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one's way (pike along)
• In Australian English, pike means to pull out of doing something: "I reckon he could have climbed that cliff, but he piked!"

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
• Idea: cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

Linear Algebraic Structure of Word Senses, with Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec:
  v_pike = α1 v_pike1 + α2 v_pike2 + α3 v_pike3
  where α1 = f1 / (f1 + f2 + f3), etc., for frequency f
• Surprising result: because of ideas from sparse coding, you can actually separate out the senses (provided they are relatively common)!

6. Deep Learning Classification: Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for example:
  Last night , [Paris Hilton]PER wowed in a sequin gown .
  [Samuel Quinn]PER was arrested in the [Hilton Hotel]LOC in [Paris]LOC in [April 1989]DATE .
• Possible uses:
  • Tracking mentions of particular entities in documents
  • For question answering, answers are usually named entities
  • Relating sentiment analysis to the entity under discussion
• Often followed by Entity Linking/Canonicalization into a Knowledge Base such as Wikidata

Simple NER: Window classification using a binary logistic classifier
• Idea: classify each word in its context window of neighboring words
• Train a logistic classifier on hand-labeled data to classify the center word {yes/no} for each class, based on a concatenation of the word vectors in a window
  • Really, we usually use a multi-class softmax, but we're trying to keep it simple :)
• Example: classify "Paris" as +/– location in the context of a sentence with window length 2:
  the   museums   in   Paris   are   amazing   to   see   .
  x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]^T
• The resulting vector is x_window = x ∈ R^{5d}
• To classify all words: run the classifier for each class on the vector centered on each word in the sentence
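A minimal sketch of this window classifier, with toy random word vectors and an untrained logistic unit standing in for the classifier you would actually train on hand-labeled data; the embedding table, weight names, and dimensions are assumptions.

```python
import numpy as np

# Toy setup: d-dimensional word vectors in a lookup table (random here).
d, m = 4, 2                                  # vector size; window of +/-2 words
rng = np.random.default_rng(2)
sent = ["the", "museums", "in", "Paris", "are", "amazing", "to", "see", "."]
emb = {w: rng.normal(size=d) for w in set(sent)}

def window_rep(sentence, i):
    """Concatenate the vectors of words i-m .. i+m into x_window in R^{(2m+1)d}.
    Assumes position i is not too close to the sentence boundary."""
    return np.concatenate([emb[w] for w in sentence[i - m : i + m + 1]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single binary logistic "is the center word a LOCATION?" classifier.
# w and b would be learned from labeled data; they are random for illustration.
w = rng.normal(size=(2 * m + 1) * d)
b = 0.0

x = window_rep(sent, sent.index("Paris"))    # window centered on "Paris"
p_location = sigmoid(w @ x + b)              # predicted probability of LOC
print(x.shape, p_location)                   # ((2m+1)*d,) = (20,), a value in (0, 1)
```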
Classification review and notation
• Supervised learning: we have a training dataset consisting of samples {x_i, y_i}, i = 1, …, N
• x_i are inputs, e.g., words (indices or vectors!), sentences, documents, etc.
  • Dimension d
• y_i are labels (one of C classes) we try to predict, for example:
  • classes: sentiment (+/−), named entities, buy/sell decision
  • other words
  • later: multi-word sequences

Neural classification
• A typical ML/stats softmax classifier:
  p(y | x) = exp(W_y · x) / Σ_{c=1}^{C} exp(W_c · x)
• The learned parameters θ are just the elements of W (not the input representation x, which has sparse symbolic features)
• The classifier gives a linear decision boundary, which can be limiting
• A neural network classifier differs in that:
  • We learn both W and (distributed!) representations for the words
  • The word vectors x re-represent one-hot vectors, moving them around in an intermediate-layer vector space, for easy classification with a (linear) softmax classifier
  • Conceptually, we have an embedding layer: x = Le
  • We use deep networks (more layers) that let us re-represent and compose our data multiple times, giving a non-linear classifier (but typically it is linear relative to the pre-final-layer representation)

Softmax classifier
Again, we can tease apart the prediction function into three steps:
1. For each row W_y of W, calculate the dot product with x: f_y = W_y · x
2. Apply the softmax function to get a normalized probability: p(y | x) = exp(f_y) / Σ_c exp(f_c) = softmax(f)_y
3. Choose the y with maximum probability
• For each training example (x, y), our objective is to maximize the probability of the correct class y, or equivalently to minimize the negative log probability of that class: − log p(y | x)

NER: Binary classification for the center word being a location
• We do supervised training and want a high score if it's a location
• The predicted model probability of the class: J_t(θ) = σ(s) = 1 / (1 + e^{−s})
• x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
• f = some element-wise non-linear function, e.g., logistic, tanh, ReLU

Training with "cross entropy loss" – you use this in PyTorch!
• Until now, our objective was stated as: maximize the probability of the correct class y, or equivalently, minimize the negative log probability of that class
• Now we restate it in terms of cross entropy, a concept from information theory
• Let the true probability distribution be p; let our computed model probability be q
• The cross entropy is: H(p, q) = − Σ_c p(c) log q(c)
• Assume a ground-truth (or true, gold, target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0]
• Because of the one-hot p, the only term left is the negative log probability of the true class y_i: − log p(y_i | x_i)
• Cross entropy can be used in other ways with a more interesting p, but for now just know that you'll want to use it as the loss in PyTorch

Classification over a full dataset
• Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:
  J(θ) = (1/N) Σ_{i=1}^{N} − log ( exp(f_{y_i}) / Σ_{c=1}^{C} exp(f_c) )

Remember: Stochastic Gradient Descent
• Update equation: θ^{new} = θ^{old} − α ∇_θ J(θ), i.e., for each parameter: θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j^{old}
• α = step size or learning rate
• In deep learning, θ includes the data representation (e.g., word vectors) too!
• How can we compute ∇_θ J(θ)?
  1. By hand
  2. Algorithmically: the backpropagation algorithm (next lecture!)
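A minimal PyTorch sketch tying these last few slides together: a linear (softmax) classifier trained with nn.CrossEntropyLoss and plain SGD. The sizes and data are made up; in the NER setting, x would be the concatenated window vectors and y the entity class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy softmax classifier over d-dimensional inputs and C classes (illustrative sizes).
d, C, N = 20, 4, 32
model = nn.Linear(d, C)                   # computes the scores f = Wx + b
loss_fn = nn.CrossEntropyLoss()           # softmax + negative log likelihood of the true class
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # lr = alpha, the learning rate

x = torch.randn(N, d)                     # fake inputs (e.g., window vectors)
y = torch.randint(0, C, (N,))             # fake gold class labels

for step in range(100):                   # a few SGD steps
    opt.zero_grad()
    scores = model(x)                     # unnormalized class scores f
    loss = loss_fn(scores, y)             # mean of -log softmax(f)[y_i] over the batch
    loss.backward()                       # backpropagation computes grad_theta J(theta)
    opt.step()                            # theta <- theta - alpha * grad
```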
7. Neural computation

A binary logistic regression unit is a bit similar to a neuron
• h_{w,b}(x) = f(w^T x + b), with f(z) = 1 / (1 + e^{−z})
• f = nonlinear activation function (e.g., sigmoid), w = weights, b = bias, h = hidden, x = inputs
• w and b are the parameters of this neuron, i.e., of this logistic regression model
• b: we can have an "always on" bias feature, which gives a class prior, or separate it out as a bias term

A neural network = running several logistic regressions at the same time
• If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
• But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
• … and we can feed that vector into another logistic regression function, giving composed functions
• It is the loss function that directs what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
• Before we know it, we have a multilayer neural network….
• This allows us to re-represent and compose our data multiple times and to learn a classifier that is highly non-linear in terms of the original inputs (but typically is linear in terms of the pre-final-layer representations)

Matrix notation for a layer
• We have:
  a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
  a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
  etc.
• In matrix notation:
  z = W x + b
  a = f(z)
• The activation f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]

Non-linearities (like f or sigmoid): Why they're needed
• Neural networks do function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = W x
• But with more layers that include non-linearities, they can approximate more complex functions!
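A small NumPy illustration of these last two slides: the layer computation z = Wx + b, a = f(z), and the fact that stacking layers without a non-linearity just compiles down to a single linear map. The layer sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer: 4 -> 2

def f(z):
    # Element-wise non-linearity (sigmoid here)
    return 1.0 / (1.0 + np.exp(-z))

# One layer in matrix notation: z = Wx + b, a = f(z); then compose a second layer.
a1 = f(W1 @ x + b1)
a2 = f(W2 @ a1 + b2)

# Without the non-linearity, two layers collapse into one linear transform:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W = W2 @ W1
b = W2 @ b1 + b2
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)
```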