Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers

Lecture Plan
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (2 mins)
2. Finish looking at word vectors and word2vec (10 mins)
3. Optimization basics (8 mins)
4. Can we capture the essence of word meaning more effectively by counting? (8 mins)
5. The GloVe model of word vectors (8 mins)
6. Evaluating word vectors (12 mins)
7. Word senses (6 mins)
8. Review of classification and how neural nets differ (8 mins)
9. Introducing neural networks (14 mins)
Key Goal: To be able to read word embedding papers by the end of class

1. Course Organization
• Come to office hours/help sessions!
  • Come to discuss final project ideas as well as the assignments
  • Try to come early, often, and off-cycle
• TA office hours: a 3-hour block on Mon, Tue, Wed, Thu, & Sat, with multiple TAs
  • Just show up on Nooks and go to the Welcome room!
  • Our friendly course staff will be on hand to assist you
• Chris's office hours:
  • Mon 2–4pm. Book a slot on Calendly. You can still come along this coming Monday!
• Week 10:
  • We lose a week this year with no exam week, so we need to do final projects sooner
  • No lectures in week 10; projects due Tue of week 10; extra project help sessions

2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word in the whole corpus
• Try to predict surrounding words using the word vectors:
  P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)
• Learning: Update the vectors so they can predict actual surrounding words better
• Doing no more than this, this algorithm learns word vectors that capture word similarity and meaningful directions in the word space well!
[Figure: a window sliding over "… problems turning into banking crises as …", predicting P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), and P(w_{t+2} | w_t) from the center word w_t]

Word2vec parameters and computations
[Figure: the matrix of outside-word vectors U and the matrix of center-word vectors V; for a center word c, the dot products U v_c are turned into probabilities with softmax(U v_c)]
• The model makes the same predictions at each position
• We want a model that gives a reasonably high probability estimate to all words that occur in the context (at all often)
• A "bag of words" model!

Word2vec maximizes its objective function by putting similar words nearby in space

3. Optimization: Gradient Descent
• To learn good word vectors: we have a cost function J(θ) that we want to minimize
• Gradient Descent is an algorithm to minimize J(θ) by changing θ
• Idea: from the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat.
• Note: Our objectives may not be convex like this :( But life turns out to be okay :)
• Update equation (in matrix notation): θ^new = θ^old − α ∇_θ J(θ)
• Update equation (for a single parameter): θ_j^new = θ_j^old − α ∂J(θ)/∂θ_j
• Algorithm: gradient descent, i.e., repeatedly apply the update above, where α = step size or learning rate

Stochastic Gradient Descent
• Problem: J(θ) is a function of all windows in the corpus (often, billions!)
  • So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!
• Very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one, or after each small batch
• Algorithm: repeatedly sample a window (or mini-batch of windows), compute the gradient on just that sample, and update θ (see the sketch below)

Stochastic gradients with word vectors! [Aside]
• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words, so the gradient ∇_θ J_t(θ) is very sparse!
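Before turning to sparsity, here is a small Python/NumPy sketch tying together the naïve-softmax prediction rule P(o | c) from Section 2 and the (stochastic) gradient-descent update from Section 3. It is a minimal illustration, not the HW2 starter code: the function names, toy dimensions, and word indices (prob_outside_given_center, center_id, outside_id, alpha) are invented for this example.

import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 6, 4                              # toy sizes, chosen arbitrarily
U = 0.1 * rng.standard_normal((vocab_size, dim))    # "outside" vectors u_w (rows)
V = 0.1 * rng.standard_normal((vocab_size, dim))    # "center"  vectors v_w (rows)

def prob_outside_given_center(U, v_c):
    """Naive-softmax P(o | c) for every o: softmax over the dot products U v_c."""
    scores = U @ v_c                        # u_w^T v_c for all words w
    exp = np.exp(scores - scores.max())     # subtract max for numerical stability
    return exp / exp.sum()

def loss_and_grads(U, V, center_id, outside_id):
    """J = -log P(o | c) for one (center, outside) pair, plus its gradients."""
    v_c = V[center_id]
    p = prob_outside_given_center(U, v_c)   # shape (vocab_size,)
    loss = -np.log(p[outside_id])
    # dJ/dv_c = -u_o + sum_w P(w|c) u_w
    grad_v_c = -U[outside_id] + p @ U
    # dJ/du_w = (P(w|c) - 1[w == o]) v_c
    grad_U = np.outer(p, v_c)
    grad_U[outside_id] -= v_c
    return loss, grad_U, grad_v_c

# One stochastic-gradient-descent step on a single (center, outside) pair
alpha = 0.05                                # step size / learning rate
center_id, outside_id = 2, 4                # pretend these came from a sampled window
loss, grad_U, grad_v_c = loss_and_grads(U, V, center_id, outside_id)
U -= alpha * grad_U                         # dense for naive softmax: every u_w gets touched
V[center_id] -= alpha * grad_v_c            # only the sampled center row changes
print(f"loss = {loss:.3f}")

Note that with the naïve softmax the gradient for U is dense (every outside vector appears in the normalization term); this is exactly the cost that negative sampling, introduced below, avoids.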
Stochastic gradients with word vectors!
• We might want to only update the word vectors that actually appear!
• Solution: either you need sparse matrix update operations that only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors
• If you have millions of word vectors and do distributed computing, it is important not to have to send gigantic updates around!
[Figure: the embedding matrix as a |V| × d array in which only a handful of rows need updating. (Rows, not columns, in actual deep learning packages!)]

2b. Word2vec algorithm family: More details
Why two vectors? → Easier optimization. Average both at the end.
• But you can implement the algorithm with just one vector per word … and it helps
Two model variants:
1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words
We presented: the Skip-gram model
Additional efficiency in training:
1. Negative sampling
So far: focus on the naïve softmax (a simpler, but more expensive, training method)

The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive:
  P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)
• Hence, in standard word2vec and in HW2 you implement the skip-gram model with negative sampling
• Main idea: train binary logistic regressions to distinguish a true pair (a center word and a word in its context window) from several noise pairs (the center word paired with a random word)

The skip-gram model with negative sampling (HW2)
• From the paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize): J_t(θ) = log σ(u_o^T v_c) + Σ_{i=1..k} E_{j ~ P(w)} [log σ(−u_j^T v_c)]
• The logistic/sigmoid function (we'll become good friends soon): σ(x) = 1 / (1 + e^(−x))
• We maximize the probability of two words co-occurring in the first log, and minimize the probability of noise words

The skip-gram model with negative sampling (HW2)
• In notation more similar to class and HW2:
  J_neg-sample(u_o, v_c, U) = −log σ(u_o^T v_c) − Σ_{k ∈ K sampled indices} log σ(−u_k^T v_c)
• We take K negative samples (using word probabilities)
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word
• Sample with P(w) = U(w)^(3/4) / Z, the unigram distribution U(w) raised to the 3/4 power (we provide this function in the starter code)
• The power makes less frequent words be sampled more often
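To make the negative-sampling objective above concrete, here is a minimal Python/NumPy sketch of J_neg-sample for one (center, outside) pair, together with the U(w)^(3/4) sampling distribution. The variable names, toy vocabulary size, and unigram counts are invented for illustration; this is a sketch of the idea, not the HW2 starter code.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_distribution(unigram_counts):
    """P(w) proportional to U(w)^(3/4): flattens the unigram distribution so
    rarer words are sampled more often than their raw frequency would suggest."""
    p = np.asarray(unigram_counts, dtype=float) ** 0.75
    return p / p.sum()

def neg_sampling_loss(U, V, center_id, outside_id, neg_ids):
    """J_neg-sample(u_o, v_c, U) = -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)."""
    v_c = V[center_id]
    u_o = U[outside_id]
    loss = -np.log(sigmoid(u_o @ v_c))          # push the true pair's score up
    for k in neg_ids:                           # K sampled negative indices
        loss -= np.log(sigmoid(-U[k] @ v_c))    # push each noise pair's score down
    return loss

# Toy setup: 6-word vocabulary, 4-dimensional vectors, made-up counts
vocab_size, dim, K = 6, 4, 3
U = 0.1 * rng.standard_normal((vocab_size, dim))    # outside vectors
V = 0.1 * rng.standard_normal((vocab_size, dim))    # center vectors
counts = [1000, 500, 200, 50, 20, 5]                # hypothetical unigram counts

p_neg = neg_sample_distribution(counts)
neg_ids = rng.choice(vocab_size, size=K, replace=False, p=p_neg)
print("negative sample ids:", neg_ids)
print("loss:", neg_sampling_loss(U, V, center_id=2, outside_id=4, neg_ids=neg_ids))

Because only u_o, v_c, and the K sampled negative vectors appear in the loss, each update touches just a few rows of U and V, which is what makes training cheap.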
4. Why not capture co-occurrence counts directly?
Building a co-occurrence matrix X
• Two options: windows vs. full documents
• Window: similar to word2vec, use a window around each word → captures some syntactic and semantic information
• A word–document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"

Example: Window-based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
  • I like deep learning
  • I like NLP
  • I enjoy flying

counts     I   like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0     0         0    0       0
like       2   0     0      1     0         1    0       0
enjoy      1   0     0      0     0         0    1       0
deep       0   1     0      0     1         0    0       0
learning   0   0     0      1     0         0    0       1
NLP        0   1     0      0     0         0    0       1
flying     0   0     1      0     0         0    0       1
.          0   0     0      0     1         1    1       0

Co-occurrence vectors
• Simple count co-occurrence vectors
  • Vectors increase in size with the vocabulary
  • Very high dimensional: require a lot of storage (though sparse)
  • Subsequent classification models have sparsity issues → models are less robust
• Low-dimensional vectors
  • Idea: store "most" of the important information in a fixed, small number of dimensions: a dense vector
  • Usually 25–1000 dimensions, similar to word2vec
  • How do we reduce the dimensionality?

Classic Method: Dimensionality Reduction on X (HW1)
• Singular Value Decomposition of the co-occurrence matrix X: factorize X into UΣV^T, where U and V are orthonormal
• Retain only the top k singular values, in order to generalize
• X̂ is the best rank-k approximation to X in terms of least squares (a classic linear algebra result)
• Expensive to compute for large matrices

Hacks to X (several used in Rohde et al. 2005 in COALS)
• Running an SVD on raw counts doesn't work well
• Scaling the counts in the cells can help a lot
• Problem: function words (the, he, has) are too frequent → syntax has too much impact. Some fixes:
  • log the frequencies
  • min(X, t), with t ≈ 100
  • Ignore the function words
• Ramped windows that count closer words more than words further away
• Use Pearson correlations instead of counts, then set negative values to 0
• Etc.
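As a rough Python/NumPy sketch of this classic pipeline, the snippet below builds the window-1, symmetric co-occurrence matrix for the toy corpus above and then takes a truncated SVD to obtain dense low-dimensional word vectors. The whitespace tokenization, variable names, and choice of k are improvised for illustration (this is not the HW1 starter code).

import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]

# Build the vocabulary and a window-1, symmetric co-occurrence matrix X
tokens = [sent.split() for sent in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):            # one word to the left and one to the right
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular values/vectors.
# U_k * diag(s_k) gives the best rank-k approximation in a least-squares sense.
k = 2
U, s, Vt = np.linalg.svd(X)
word_vectors = U[:, :k] * s[:k]             # k-dimensional dense word vectors

for w in vocab:
    print(f"{w:>9}: {np.round(word_vectors[idx[w]], 2)}")

On real corpora you would apply the scaling hacks listed above (log counts, clipping at t, down-weighting function words) before the SVD, since raw counts give a poor result.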
Interesting semantic patterns emerge in the scaled vectors
[Figure 13 and Table 10 from Rohde, Gonnerman, and Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence" (the COALS model, ms. 2005): a multidimensional scaling plot in which nouns sit near their associated verbs (DRIVE/DRIVER, LEARN/STUDENT, TEACH/TEACHER, TREAT/DOCTOR, CLEAN/JANITOR, PRAY/PRIEST, MARRY/BRIDE, SWIM/SWIMMER), and the 10 nearest neighbors with their percent correlation similarities for the nouns gun, point, mind, monopoly, cardboard, lipstick, leningrad, and feet under the COALS-14K model]

5. Towards GloVe: Count-based vs. direct prediction
Count-based:
• LSA, HAL (Lund & Burgess); COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to large counts
Direct prediction:
• Skip-gram/CBOW (Mikolov et al.); NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)
• Scales with corpus size
• Inefficient usage of statistics
• Can capture complex patterns beyond word similarity
• Generate improved performance on other tasks

Encoding meaning components in vector differences [Pennington, Socher, and Manning, EMNLP 2014]
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.

                           x = solid   x = gas   x = water   x = random
P(x | ice)                 large       small     large       small
P(x | steam)               small       large     large       small
P(x | ice) / P(x | steam)  large       small     ~1          ~1

With actual co-occurrence statistics (x = fashion as the "random" word):

                           x = solid    x = gas     x = water   x = fashion
P(x | ice)                 1.9 × 10⁻⁴   6.6 × 10⁻⁵  3.0 × 10⁻³  1.7 × 10⁻⁵
P(x | steam)               2.2 × 10⁻⁵   7.8 × 10⁻⁴  2.2 × 10⁻³  1.8 × 10⁻⁵
P(x | ice) / P(x | steam)  8.9          8.5 × 10⁻²  1.36        0.96

Encoding meaning in vector differences
Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A: With a log-bilinear model, w_i · w_j = log P(i | j), so that vector differences give w_x · (w_a − w_b) = log ( P(x | a) / P(x | b) )

Combining the best of both worlds: GloVe [Pennington, Socher, and Manning, EMNLP 2014]
• Fast training
• Scalable to huge corpora
• Good performance even with a small corpus and small vectors

GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

6. How to evaluate word vectors?
• Related to general evaluation in NLP: intrinsic vs. extrinsic
• Intrinsic:
  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless a correlation to a real task is established
• Extrinsic:
  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear whether the subsystem is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy → winning!

Intrinsic word vector evaluation
• Word vector analogies: a : b :: c : ?   (e.g., man : woman :: king : ?)
• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?

GloVe Visualizations
GloVe Visualizations: Company - CEO
GloVe Visualizations: Comparatives and Superlatives

Analogy evaluation and hyperparameters
GloVe word vector evaluation on the word analogy task (percent accuracy; Sem. = semantic, Syn. = syntactic, Tot. = total). Skip-gram (SG) and CBOW results are from Mikolov et al. (2013a,b); SG† and CBOW† were trained with the word2vec tool. See the GloVe paper for details and a description of the SVD models.

Model   Dim.  Size  Sem.  Syn.  Tot.
ivLBL   100   1.5B  55.9  50.1  53.2
HPCA    100   1.6B   4.2  16.4  10.8
GloVe   100   1.6B  67.5  54.3  60.3
SG      300   1B    61    61    61
CBOW    300   1.6B  16.1  52.6  36.1
vLBL    300   1.5B  54.2  64.8  60.0
ivLBL   300   1.5B  65.2  63.0  64.0
GloVe   300   1.6B  80.8  61.5  70.3
SVD     300   6B     6.3   8.1   7.3
SVD-S   300   6B    36.7  46.6  42.1
SVD-L   300   6B    56.6  63.0  60.1
CBOW†   300   6B    63.6  67.4  65.7
SG†     300   6B    73.0  66.0  69.1
GloVe   300   6B    77.4  67.0  71.7
CBOW    1000  6B    57.3  68.9  63.7
SG      1000  6B    66.1  65.1  65.6
SVD-L   300   42B   38.4  58.2  49.2

Analogy evaluation and hyperparameters
• More data helps
  • Wikipedia is better than news text!
• Dimensionality
  • A good dimension is ~300
[Figures from the GloVe paper: accuracy on the analogy task for 300-dimensional vectors trained on Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Gigaword5 + Wiki2014 (6B tokens), and Common Crawl (42B tokens); and semantic/syntactic/overall accuracy as a function of vector dimension (symmetric context)]

Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353  http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Word 1     Word 2    Human (mean)
tiger      cat        7.35
tiger      tiger     10
book       paper      7.46
computer   internet   7.58
plane      car        5.77
professor  doctor     6.62
stock      phone      1.62
stock      CD         1.31
stock      jaguar     0.92

Correlation evaluation
• Word vector distances and their correlation with human judgments
• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model as well (e.g., averaging both vectors)

Spearman rank correlation on word similarity tasks (all vectors are 300-dimensional; the CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors):

Model   Size  WS353  MC    RG    SCWS  RW
SVD     6B    35.3   35.1  42.5  38.3  25.6
SVD-S   6B    56.5   71.5  71.0  53.6  34.7
SVD-L   6B    65.7   72.7  75.1  56.5  37.0
CBOW†   6B    57.2   65.6  68.2  57.0  32.5
SG†     6B    62.8   65.2  69.7  58.1  37.2
GloVe   6B    65.8   72.7  77.8  53.9  38.1
SVD-L   42B   74.0   76.4  74.1  58.3  39.9
GloVe   42B   75.9   83.6  82.9  59.6  47.8
CBOW*   100B  68.4   79.6  75.4  59.4  45.5
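As a rough Python sketch of how these intrinsic evaluations are computed, the snippet below scores word pairs by cosine similarity and compares the ranking to human judgments with Spearman correlation, and answers one analogy query by cosine similarity after vector addition (discarding the input words). The tiny random "embeddings", word list, and helper names are placeholders; with real vectors (e.g., pretrained GloVe) you would load them from a file instead.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice you would load pretrained vectors here.
words = ["tiger", "cat", "book", "paper", "stock", "jaguar", "king", "man", "woman", "queen"]
emb = {w: rng.standard_normal(50) for w in words}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# 1) Word similarity: correlate model similarities with human judgments
pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("stock", "jaguar", 0.92)]
model_scores = [cos(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")

# 2) Analogy a:b :: c:?  ->  argmax cosine to (x_b - x_a + x_c), excluding a, b, c
def analogy(a, b, c):
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in words if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print("man:woman :: king:?", analogy("man", "woman", "king"))

With random placeholder vectors the outputs are of course meaningless; the point is only the shape of the two evaluation procedures.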
Extrinsic word vector evaluation
• Extrinsic evaluation of word vectors: all subsequent NLP tasks in this class. More examples soon.
• One example where good word vectors should help directly: named entity recognition, i.e., identifying references to a person, organization, or location

F1 score on the NER task with 50-dimensional vectors (Discrete is the baseline without word vectors; publicly available vectors are used for HPCA, HSMN, and CW):

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5
HPCA      92.6  88.7  81.7  80.7
HSMN      90.5  85.7  78.7  74.7
CW        92.2  87.4  81.7  80.2
CBOW      93.1  88.2  82.2  81.1
GloVe     93.2  88.3  82.9  82.2

7. Word senses and word sense ambiguity
• Most words have lots of meanings!
  • Especially common words
  • Especially words that have existed for a long time
• Example: pike
• Does one vector capture all these meanings, or do we have a mess?

pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one's way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could have climbed that cliff, but he piked!

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
• Idea: cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

Linear Algebraic Structure of Word Senses, with Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec
• v_pike = α₁ v_pike1 + α₂ v_pike2 + α₃ v_pike3
• where α₁ = f₁ / (f₁ + f₂ + f₃), etc., for sense frequencies f
• Surprising result:
  • Because of ideas from sparse coding, you can actually separate out the senses (provided they are relatively common)!
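A tiny Python numeric illustration of the superposition formula above: given sense vectors and sense frequencies for "pike", the single observed embedding is their frequency-weighted average. The sense vectors and frequencies here are made up purely to show the arithmetic.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sense vectors for "pike" (weapon, fish, road) and their corpus frequencies
sense_vectors = rng.standard_normal((3, 5))     # three senses, 5-dimensional for readability
frequencies = np.array([1000.0, 3000.0, 6000.0])

# alpha_i = f_i / (f_1 + f_2 + f_3);  v_pike = sum_i alpha_i * v_pike_i
alphas = frequencies / frequencies.sum()
v_pike = alphas @ sense_vectors

print("mixture weights alpha:", np.round(alphas, 2))
print("v_pike:", np.round(v_pike, 2))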
8. Classification review and notation
• Generally, we have a training dataset consisting of samples {x_i, y_i}, i = 1, …, N
• x_i are inputs, e.g., words (indices or vectors!), sentences, documents, etc.
  • Dimension d
• y_i are labels (one of C classes) that we try to predict, for example:
  • classes: sentiment (+/–), named entities, buy/sell decision
  • other words
  • later: multi-word sequences

Classification intuition
• Training data: {x_i, y_i}, i = 1, …, N
• Simple illustration case:
  • Fixed 2D word vectors as the inputs to classify
  • Using softmax/logistic regression
  • A linear decision boundary
• Traditional ML/statistics approach: assume the x_i are fixed, and train (i.e., set) softmax/logistic regression weights W ∈ ℝ^(C×d) to determine a decision boundary (hyperplane), as in the picture
• Method: for each fixed x, predict:
  p(y | x) = exp(W_y · x) / Σ_{c=1..C} exp(W_c · x)
(Visualizations with ConvNetJS by Andrej Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Softmax classifier
Again, we can tease apart the prediction function into three steps:
1. For each row W_y of W, calculate the dot product with x: f_y = W_y · x
2. Apply the softmax function to get the normalized probability: p(y | x) = exp(f_y) / Σ_{c=1..C} exp(f_c) = softmax(f_y)
3. Choose the y with maximum probability
• For each training example (x, y), our objective is to maximize the probability of the correct class y, or equivalently we can minimize the negative log probability of that class: −log p(y | x)

Training with "cross entropy loss"
• The concept of "cross entropy" comes from information theory
• Let the true probability distribution be p, and let our computed model probability be q
• The cross entropy is: H(p, q) = −Σ_{c=1..C} p(c) log q(c)
• Assume a ground truth (or true or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0]
• Then, because of the one-hot p, the only term left is the negative log probability of the true class: −log p(y_i | x_i)

Classification over a full dataset
• Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:
  J(θ) = (1/N) Σ_{i=1..N} −log ( exp(f_{y_i}) / Σ_{c=1..C} exp(f_c) )
• Instead of writing f_y = f_y(x) = W_y · x componentwise, we will write f in matrix notation: f = Wx

Traditional ML optimization
• For statistical machine learning, θ usually consists only of the elements of W, i.e., θ ∈ ℝ^(C·d)
• So we update the decision boundary only by updating W
(Visualizations with ConvNetJS by Karpathy)
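Here is a compact Python/NumPy sketch of the softmax classifier and cross-entropy loss described above, for fixed low-dimensional inputs. The weights, toy data sizes, and function names are invented for illustration; a real setup would train W with gradient descent rather than just evaluating the loss.

import numpy as np

rng = np.random.default_rng(0)

C, d, N = 3, 2, 5                      # classes, input dimension, examples (toy sizes)
W = 0.1 * rng.standard_normal((C, d))  # softmax weights, W in R^{C x d}
X = rng.standard_normal((N, d))        # fixed 2D inputs x_i
y = rng.integers(0, C, size=N)         # gold class labels y_i

def softmax(f):
    """Row-wise softmax over the class scores f = x W^T."""
    e = np.exp(f - f.max(axis=1, keepdims=True))   # stabilize before exponentiating
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """J(theta) = (1/N) sum_i -log p(y_i | x_i), using a one-hot true distribution."""
    probs = softmax(X @ W.T)                       # p(c | x_i) for every class c
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

print("predicted classes:", softmax(X @ W.T).argmax(axis=1))
print("cross-entropy loss:", round(cross_entropy_loss(W, X, y), 3))

In the traditional setting, θ is exactly the entries of W here, and only W moves the (linear) decision boundary; the neural approaches below also learn the input representations.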
9. Neural Network Classifiers
• Softmax (≈ logistic regression) alone is not very powerful
• A softmax classifier only gives linear decision boundaries
  • This can be quite limiting → unhelpful when a problem is complex
• Wouldn't it be cool to get these correct?

Neural Nets for the Win!
• Neural networks can learn much more complex functions with nonlinear decision boundaries!
  • Non-linear in the original space

Classification difference with word vectors #1
• Commonly in NLP deep learning:
  • We learn both W and the word vectors x
  • We learn both conventional parameters and (distributed!) representations
  • The word vectors re-represent one-hot vectors (they move them around in an intermediate-layer vector space) for easy classification with a (linear) softmax classifier, conceptually via an embedding layer: x = Le
• Very large number of parameters!

Neural computation

A neuron can be modeled as a binary logistic regression unit
  h_{w,b}(x) = f(w^T x + b)
  f(z) = 1 / (1 + e^(−z))
• f = nonlinear activation function (e.g., sigmoid), w = weights, b = bias, h = hidden, x = inputs
• w, b are the parameters of this neuron, i.e., of this logistic regression model
• b: we can have an "always on" bias feature, which gives a class prior, or separate it out as a bias term

Difference #2: A neural network = running several logistic regressions at the same time
• If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
• But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Difference #2: A neural network = running several logistic regressions at the same time
• … which we can feed into another logistic regression function
• It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Difference #2: A neural network = running several logistic regressions at the same time
• Before we know it, we have a multilayer neural network….

Matrix notation for a layer
We have
  a₁ = f(W₁₁x₁ + W₁₂x₂ + W₁₃x₃ + b₁)
  a₂ = f(W₂₁x₁ + W₂₂x₂ + W₂₃x₃ + b₂)
  etc.
In matrix notation:
  z = Wx + b
  a = f(z)
where the activation f is applied element-wise:
  f([z₁, z₂, z₃]) = [f(z₁), f(z₂), f(z₃)]

Non-linearities (aka "f"): Why they're needed
• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform: W₁W₂x = Wx
• With more layers, they can approximate more complex functions!
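To tie the last two slides together, here is a brief Python/NumPy sketch of a layer computed as a = f(Wx + b), plus a check that stacking two layers without a non-linearity collapses into a single linear map. The layer sizes and random weights are arbitrary placeholders for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(W, b, x, f=sigmoid):
    """One neural-network layer: z = Wx + b, a = f(z), with f applied element-wise."""
    return f(W @ x + b)

x = rng.standard_normal(3)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

# With a non-linearity: a genuinely non-linear function of x
h = layer(W1, b1, x)                 # hidden layer (several logistic regressions at once)
out = layer(W2, b2, h)               # another layer fed by the hidden variables

# Without non-linearities, two layers compile down to one linear transform:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
linear_two_layers = W2 @ (W1 @ x + b1) + b2
single_linear = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(linear_two_layers, single_linear))   # True: no extra expressive power
print("with non-linearity:", np.round(out, 3))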