Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers

2. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word position in the whole corpus
• Try to predict surrounding words using word vectors:
  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)
• Learning: update the vectors so they predict the actual surrounding words better
• Doing no more than this, the algorithm learns word vectors that capture word similarity and meaningful directions in the word space well!
• Example window: "… problems turning into banking crises as …", where the center word w_t predicts its neighbors via P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Word2vec parameters … and computations
• [Figure: the outside-word matrix U and center-word matrix V; the dot products U·v_c; and softmax(U·v_c), the probabilities over outside words]
• The model makes the same predictions at each position
• We want a model that gives a reasonably high probability estimate to all words that occur in the context (at all often)
• A "bag of words" model!

Word2vec maximizes its objective function by putting similar words nearby in space

The skip-gram model with negative sampling (HW2)
• The normalization term is computationally expensive (when there are many output classes): in
  P(o | c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)
  the denominator is a big sum over words
• Hence, in standard word2vec and in HW2 you implement the skip-gram model with negative sampling
• Main idea: train binary logistic regressions to differentiate a true pair (center word and a word in its context window) from several "noise" pairs (the center word paired with a random word)

Word2vec algorithm family (Mikolov et al. 2013): More details
• Why two vectors? → Easier optimization. Average both at the end
  • But you can implement the algorithm with just one vector per word … and it helps a bit
• Two model variants:
  1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
  2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words
  We presented: the skip-gram model
• Loss functions for training:
  1. Naïve softmax (a simple but expensive loss function when there are many output classes)
  2. More optimized variants like hierarchical softmax
  3. Negative sampling
  So far, we explained the naïve softmax

The skip-gram model with negative sampling (HW2)
• Introduced in: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize), for a center word c, observed outside word o, and k sampled noise words:
  J_t(θ) = log σ(u_o^T v_c) + Σ_{i=1}^{k} E_{j∼P(w)} [ log σ(−u_j^T v_c) ]
• The logistic/sigmoid function (we'll become good friends soon): σ(x) = 1 / (1 + e^{−x})
• We maximize the probability of the two words co-occurring in the first log, and minimize the probability of the noise words in the second part

Stochastic gradients with negative sampling [aside]
• We iteratively take gradients at each window for SGD
• In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so ∇_θ J_t(θ) is very sparse!
• We might only update the word vectors that actually appear!
• Solution: either you need sparse matrix update operations to update only certain rows of the full |V| × d embedding matrices U and V (rows, not columns, in actual DL packages!), or you need to keep around a hash for word vectors
• If you have millions of word vectors and do distributed computing, it is important not to have to send gigantic updates around!
• This is also a particular issue with more advanced optimization methods in the Adagrad family
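Putting the last few slides together, here is a minimal NumPy sketch (not the HW2 starter code) of one negative-sampling SGD step that only touches the embedding rows that actually appear in the window. The matrix names U and Vc, the sizes, the learning rate, and the way noise words are sampled are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary of 10,000 words, 100-dimensional vectors.
V_SIZE, D, K = 10_000, 100, 5                   # K = number of negative samples
U = rng.normal(scale=0.01, size=(V_SIZE, D))    # outside-word vectors (one row per word)
Vc = rng.normal(scale=0.01, size=(V_SIZE, D))   # center-word vectors (one row per word)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, outside, neg_ids, lr=0.05):
    """One SGD step on the negative-sampling objective
    J_t = log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_k . v_c),
    updating only the rows involved (assumes no duplicate ids, for simplicity)."""
    v_c = Vc[center]
    ids = np.concatenate(([outside], neg_ids))       # 1 true + K noise outside words
    labels = np.array([1.0] + [0.0] * len(neg_ids))  # 1 for the true pair, 0 for noise
    scores = sigmoid(U[ids] @ v_c)                   # predicted "true pair" probabilities
    # Gradients of the negated objective w.r.t. only the rows that appear
    grad_u = np.outer(scores - labels, v_c)          # shape (K+1, D): rows of U
    grad_v = (scores - labels) @ U[ids]              # shape (D,):    one row of Vc
    U[ids] -= lr * grad_u                            # sparse row updates only
    Vc[center] -= lr * grad_v

# Example: center word 42, observed outside word 7, five sampled noise words
sgns_step(center=42, outside=7, neg_ids=rng.integers(0, V_SIZE, size=K))
```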
Interesting semantic patterns emerge in the scaled vectors
• COALS model from Rohde et al. ms., 2005, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence" (Rohde, Gonnerman, and Plaut, Modeling Word Meaning Using Lexical Co-Occurrence)
• [Figure 13: Multidimensional scaling for nouns and their associated verbs: drive, driver, learn, student, teach, teacher, doctor, treat, clean, janitor, pray, priest, marry, bride, swim, swimmer]
• [Table 10: The 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet), under the COALS-14K model]

4. How to evaluate word vectors?
• Related to general evaluation in NLP: intrinsic vs. extrinsic
• Intrinsic:
  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if it is really helpful unless a correlation with a real task is established
• Extrinsic:
  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear whether the subsystem is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy → winning!

Intrinsic word vector evaluation
• Word vector analogies: man : woman :: king : ?   (a : b :: c : ?)
• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search (!)
• Problem: what if the information is there but not linear?
• [Figure: the analogy man : woman :: king : ? as vector offsets in word vector space]

GloVe visualization [figures]

Meaning similarity: Another intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

  Word 1     Word 2     Human (mean)
  tiger      cat         7.35
  tiger      tiger      10
  book       paper       7.46
  computer   internet    7.58
  plane      car         5.77
  professor  doctor      6.62
  stock      phone       1.62
  stock      CD          1.31
  stock      jaguar      0.92

Correlation evaluation
• Word vector distances and their correlation with human judgments
• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model also (e.g., averaging both vectors)
• Table 3 (GloVe paper): Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

  Model   Size   WS353  MC    RG    SCWS  RW
  SVD     6B     35.3   35.1  42.5  38.3  25.6
  SVD-S   6B     56.5   71.5  71.0  53.6  34.7
  SVD-L   6B     65.7   72.7  75.1  56.5  37.0
  CBOW†   6B     57.2   65.6  68.2  57.0  32.5
  SG†     6B     62.8   65.2  69.7  58.1  37.2
  GloVe   6B     65.8   72.7  77.8  53.9  38.1
  SVD-L   42B    74.0   76.4  74.1  58.3  39.9
  GloVe   42B    75.9   83.6  82.9  59.6  47.8
  CBOW*   100B   68.4   79.6  75.4  59.4  45.5
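Below is a minimal sketch of both intrinsic evaluations just described: the analogy test and Spearman correlation against WordSim353-style human judgments. A toy random embedding table stands in for real pretrained vectors (in practice you would load GloVe or word2vec vectors and use the full 353 pairs); the vocabulary, dimensionality, and helper names are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
vocab = ["king", "queen", "man", "woman", "tiger", "cat",
         "stock", "jaguar", "professor", "doctor"]

def unit(v):
    return v / np.linalg.norm(v)

# Toy embedding table: word -> 50-dimensional unit vector (random here).
E = {w: unit(rng.normal(size=50)) for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """a : b :: c : ?  Return the vocab word closest (by cosine) to x_b - x_a + x_c,
    discarding the three input words from the search."""
    target = E[b] - E[a] + E[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(E[w], target))

# Intrinsic evaluation 1: analogies (man : woman :: king : ?)
print(analogy("man", "woman", "king"))

# Intrinsic evaluation 2: correlation with human similarity judgments
pairs = [("tiger", "cat", 7.35), ("professor", "doctor", 6.62), ("stock", "jaguar", 0.92)]
model_scores = [cosine(E[w1], E[w2]) for w1, w2, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _pval = spearmanr(model_scores, human_scores)   # Spearman rank correlation
print(rho)
```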
Extrinsic word vector evaluation
• One example where good word vectors should help directly: named entity recognition, i.e., identifying references to a person, organization, or location: "Chris Manning lives in Palo Alto."
• Subsequent NLP tasks in this class are other examples, so more examples soon.
• Table 4 (GloVe paper): F1 score on the NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details.

  Model     Dev   Test  ACE   MUC7
  Discrete  91.0  85.4  77.4  73.4
  SVD       90.8  85.7  77.3  73.7
  SVD-S     91.0  85.5  77.6  74.3
  SVD-L     90.5  84.8  73.6  71.5
  HPCA      92.6  88.7  81.7  80.7
  HSMN      90.5  85.7  78.7  74.7
  CW        92.2  87.4  81.7  80.2
  CBOW      93.1  88.2  82.2  81.1
  GloVe     93.2  88.3  82.9  82.2

• [Figure 3 from the GloVe paper: semantic and syntactic analogy accuracy (%) for vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), …]

5. Word senses and word sense ambiguity
• Most words have lots of meanings!
  • Especially common words
  • Especially words that have existed for a long time
• Example: pike
• Does one vector capture all these meanings, or do we have a mess?

pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one's way (pike along)
• In Australian English, pike means to pull out of doing something: "I reckon he could have climbed that cliff, but he piked!"

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
• Idea: cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

Linear Algebraic Structure of Word Senses, with Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec:
  v_pike = α1 v_pike1 + α2 v_pike2 + α3 v_pike3
  where α1 = f1 / (f1 + f2 + f3), etc., for frequency f
• Surprising result: because of ideas from sparse coding, you can actually separate out the senses (provided they are relatively common)!

6. Deep Learning Classification: Named Entity Recognition (NER)
• The task: find and classify names in text, by labeling word tokens, for example:
  Last night , [Paris Hilton]PER wowed in a sequin gown .
  [Samuel Quinn]PER was arrested in the [Hilton Hotel]LOC in [Paris]LOC in [April 1989]DATE .
• Possible uses:
  • Tracking mentions of particular entities in documents
  • For question answering, answers are usually named entities
  • Relating sentiment analysis to the entity under discussion
• Often followed by Entity Linking/Canonicalization into a Knowledge Base such as Wikidata

Simple NER: Window classification using a binary logistic classifier
• Idea: classify each word in its context window of neighboring words
• Train a logistic classifier on hand-labeled data to classify the center word {yes/no} for each class, based on a concatenation of the word vectors in a window
  • Really, we usually use a multi-class softmax, but we're trying to keep it simple :)
• Example: classify "Paris" as +/– location in the context of a sentence with window length 2:
  the   museums   in   Paris   are   amazing   to   see   .
  x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]^T
• The resulting vector is x_window = x ∈ R^{5d}
• To classify all words: run the classifier for each class on the vector centered on each word in the sentence
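A minimal sketch of this window classifier, with toy random word vectors and an untrained logistic unit standing in for the classifier you would actually train on hand-labeled data; the embedding table, weight names, and dimensions are assumptions.

```python
import numpy as np

# Toy setup: d-dimensional word vectors in a lookup table (random here).
d, m = 4, 2                                  # vector size; window of +/-2 words
rng = np.random.default_rng(2)
sent = ["the", "museums", "in", "Paris", "are", "amazing", "to", "see", "."]
emb = {w: rng.normal(size=d) for w in set(sent)}

def window_rep(sentence, i):
    """Concatenate the vectors of words i-m .. i+m into x_window in R^{(2m+1)d}.
    Assumes position i is not too close to the sentence boundary."""
    return np.concatenate([emb[w] for w in sentence[i - m : i + m + 1]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single binary logistic "is the center word a LOCATION?" classifier.
# w and b would be learned from labeled data; they are random for illustration.
w = rng.normal(size=(2 * m + 1) * d)
b = 0.0

x = window_rep(sent, sent.index("Paris"))    # window centered on "Paris"
p_location = sigmoid(w @ x + b)              # predicted probability of LOC
print(x.shape, p_location)                   # ((2m+1)*d,) = (20,), a value in (0, 1)
```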
Classification review and notation
• Supervised learning: we have a training dataset consisting of samples {x_i, y_i}, i = 1, …, N
• x_i are inputs, e.g., words (indices or vectors!), sentences, documents, etc.
  • Dimension d
• y_i are labels (one of C classes) we try to predict, for example:
  • classes: sentiment (+/−), named entities, buy/sell decision
  • other words
  • later: multi-word sequences

Neural classification
• A typical ML/stats softmax classifier:
  p(y | x) = exp(W_y · x) / Σ_{c=1}^{C} exp(W_c · x)
• The learned parameters θ are just the elements of W (not the input representation x, which has sparse symbolic features)
• The classifier gives a linear decision boundary, which can be limiting
• A neural network classifier differs in that:
  • We learn both W and (distributed!) representations for the words
  • The word vectors x re-represent one-hot vectors, moving them around in an intermediate-layer vector space, for easy classification with a (linear) softmax classifier
  • Conceptually, we have an embedding layer: x = Le
  • We use deep networks (more layers) that let us re-represent and compose our data multiple times, giving a non-linear classifier (but typically it is linear relative to the pre-final-layer representation)

Softmax classifier
Again, we can tease apart the prediction function into three steps:
1. For each row W_y of W, calculate the dot product with x: f_y = W_y · x
2. Apply the softmax function to get a normalized probability: p(y | x) = exp(f_y) / Σ_c exp(f_c) = softmax(f)_y
3. Choose the y with maximum probability
• For each training example (x, y), our objective is to maximize the probability of the correct class y, or equivalently to minimize the negative log probability of that class: − log p(y | x)

NER: Binary classification for the center word being a location
• We do supervised training and want a high score if it's a location
• The predicted model probability of the class: J_t(θ) = σ(s) = 1 / (1 + e^{−s})
• x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
• f = some element-wise non-linear function, e.g., logistic, tanh, ReLU

Training with "cross entropy loss" – you use this in PyTorch!
• Until now, our objective was stated as: maximize the probability of the correct class y, or equivalently, minimize the negative log probability of that class
• Now we restate it in terms of cross entropy, a concept from information theory
• Let the true probability distribution be p; let our computed model probability be q
• The cross entropy is: H(p, q) = − Σ_c p(c) log q(c)
• Assume a ground-truth (or true, gold, target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0]
• Because of the one-hot p, the only term left is the negative log probability of the true class y_i: − log p(y_i | x_i)
• Cross entropy can be used in other ways with a more interesting p, but for now just know that you'll want to use it as the loss in PyTorch

Classification over a full dataset
• Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:
  J(θ) = (1/N) Σ_{i=1}^{N} − log ( exp(f_{y_i}) / Σ_{c=1}^{C} exp(f_c) )

Remember: Stochastic Gradient Descent
• Update equation: θ^{new} = θ^{old} − α ∇_θ J(θ), i.e., for each parameter: θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j^{old}
• α = step size or learning rate
• In deep learning, θ includes the data representation (e.g., word vectors) too!
• How can we compute ∇_θ J(θ)?
  1. By hand
  2. Algorithmically: the backpropagation algorithm (next lecture!)
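A minimal PyTorch sketch tying these last few slides together: a linear (softmax) classifier trained with nn.CrossEntropyLoss and plain SGD. The sizes and data are made up; in the NER setting, x would be the concatenated window vectors and y the entity class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy softmax classifier over d-dimensional inputs and C classes (illustrative sizes).
d, C, N = 20, 4, 32
model = nn.Linear(d, C)                   # computes the scores f = Wx + b
loss_fn = nn.CrossEntropyLoss()           # softmax + negative log likelihood of the true class
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # lr = alpha, the learning rate

x = torch.randn(N, d)                     # fake inputs (e.g., window vectors)
y = torch.randint(0, C, (N,))             # fake gold class labels

for step in range(100):                   # a few SGD steps
    opt.zero_grad()
    scores = model(x)                     # unnormalized class scores f
    loss = loss_fn(scores, y)             # mean of -log softmax(f)[y_i] over the batch
    loss.backward()                       # backpropagation computes grad_theta J(theta)
    opt.step()                            # theta <- theta - alpha * grad
```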
7. Neural computation

A binary logistic regression unit is a bit similar to a neuron
• h_{w,b}(x) = f(w^T x + b), with f(z) = 1 / (1 + e^{−z})
• f = nonlinear activation function (e.g., sigmoid), w = weights, b = bias, h = hidden, x = inputs
• w and b are the parameters of this neuron, i.e., of this logistic regression model
• b: we can have an "always on" bias feature, which gives a class prior, or separate it out as a bias term

A neural network = running several logistic regressions at the same time
• If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
• But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
• … and we can feed that vector into another logistic regression function, giving composed functions
• It is the loss function that directs what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
• Before we know it, we have a multilayer neural network….
• This allows us to re-represent and compose our data multiple times and to learn a classifier that is highly non-linear in terms of the original inputs (but typically is linear in terms of the pre-final-layer representations)

Matrix notation for a layer
• We have:
  a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
  a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
  etc.
• In matrix notation:
  z = W x + b
  a = f(z)
• The activation f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]

Non-linearities (like f or sigmoid): Why they're needed
• Neural networks do function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = W x
• But with more layers that include non-linearities, they can approximate more complex functions!
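A small NumPy illustration of these last two slides: the layer computation z = Wx + b, a = f(z), and the fact that stacking layers without a non-linearity just compiles down to a single linear map. The layer sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer: 4 -> 2

def f(z):
    # Element-wise non-linearity (sigmoid here)
    return 1.0 / (1.0 + np.exp(-z))

# One layer in matrix notation: z = Wx + b, a = f(z); then compose a second layer.
a1 = f(W1 @ x + b1)
a2 = f(W2 @ a1 + b2)

# Without the non-linearity, two layers collapse into one linear transform:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W = W2 @ W1
b = W2 @ b1 + b2
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)
```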