Continuous Space Representation
PA153
Pavel Rychlý
18 Sep 2023

Problems with statistical NLP
- many distinct words (items) (a consequence of Zipf's law)
- zero counts: the maximum-likelihood estimate (MLE) gives zero probability
  p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
- similarities are not handled: some words share some (important) features
  driver, teacher, butcher
  small, little, tiny

Many distinct words
How to solve:
- use only the most frequent words (ignore the outliers)
- use smaller units (subwords): prefixes and suffixes such as -er, -less, pre-
But:
- we want to add more words: "black hole" is neither black nor a hole
- even less frequent words are important: deagrofertizace, from "The deagrofertization of the state must come."
- humans process them easily

Zero counts
How to solve:
- complicated smoothing strategies: Good-Turing, Kneser–Ney, back-off, ...
- bigger corpora: more data = better estimation
But:
- sometimes there is no more data: Shakespeare, a new research field, Maltese (MaCoCu Web/EUR-Lex: 400M tokens)
- no size is ever big enough
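To make the zero-count problem concrete, here is a minimal sketch of the MLE trigram estimate from the first slide; the toy corpus and its counts are my own illustration, not data from the lecture:

    from collections import Counter

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count trigrams and their bigram prefixes.
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_mle(w1, w2, w3):
        """MLE estimate p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)."""
        if bigrams[(w1, w2)] == 0:
            return 0.0  # no evidence at all for this context
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    print(p_mle("sat", "on", "the"))  # 1.0: every "sat on" here is followed by "the"
    print(p_mle("sat", "on", "a"))    # 0.0, although "sat on a ..." is perfectly good English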
How big a corpus?
Noun test
- British National Corpus: 15,789 hits, rank 918
- word sketches from the Sketch Engine:
  object-of: pass, undergo, satisfy, fail, devise, conduct, administer, perform, apply, boycott
  modifier: blood, driving, fitness, beta, nuclear, pregnancy
- can we freely combine any two items from those lists?

How big a corpus?
Collocations of the noun test
- blood test in the BNC, object-of: order (3), take (12)
- blood test in enClueWeb16 (16 billion tokens), object-of: order (708), perform (959), undergo (174), administer (123), conduct (229), require (676), repeat (80), run (347), request (105), take (1215)

How big a corpus?
[word sketch screenshot: phrase "pregnancy test" in the 16-billion-token corpus]

How big a corpus?
[word sketch screenshot: phrase "black hole" in the 16-billion-token corpus]

Similarities of words
Distinct words?
- supermassive, super-massive, Supermassive
- small, little, tiny
- black hole, star
- apple, banana, orange
- red, green, orange
- auburn, burgundy, mahogany, ruby

Continuous space representation
- words are not distinct
- each word is represented by a vector of numbers
- similar words are closer to each other
- more dimensions = more features: tens to hundreds, up to 1000

Words as vectors
continue = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349]

Word features
- grammatical: part of speech, number, gender
- syntactic: used with "in"/"at", always with a particle
- semantic: positive sentiment, movement meaning, fruits
- style: formal, colloquial
- domain: math, biology
- form: starting with "a", in capital letters

Word features
- features are not independent: math – scientific; used with "in" – noun; in capital form – proper noun
- features are not discrete
- each feature corresponds to a (set of) dimensions
- most features are valid for only a small set of words
- most words have (almost) 0 for most features
- multiple meanings = union of features

How to create a vector representation
From co-occurrence counts:
- Singular value decomposition (SVD): each word starts as one dimension; select/combine the important dimensions; a factorization of the co-occurrence matrix (see the sketch after this slide)
- Principal component analysis (PCA)
- Latent Dirichlet Allocation (LDA): learning the probabilities of hidden variables
Neural networks
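A minimal sketch of the SVD route; the tiny word list and co-occurrence counts below are invented for illustration, not taken from the lecture:

    import numpy as np

    vocab = ["apple", "banana", "orange", "red", "green", "eat"]
    # Invented word-by-word co-occurrence counts (rows = columns = vocab).
    C = np.array([
        [0, 4, 3, 1, 1, 5],
        [4, 0, 3, 0, 0, 4],
        [3, 3, 0, 2, 0, 4],
        [1, 0, 2, 0, 3, 0],
        [1, 0, 0, 3, 0, 0],
        [5, 4, 4, 0, 0, 0],
    ], dtype=float)

    # Factorize the matrix and keep only the k strongest dimensions.
    U, S, Vt = np.linalg.svd(C)
    k = 2
    emb = U[:, :k] * S[:k]  # one k-dimensional vector per word

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    ia, ib, ir = vocab.index("apple"), vocab.index("banana"), vocab.index("red")
    print(cos(emb[ia], emb[ib]))  # fruits share contexts, so they should be close
    print(cos(emb[ia], emb[ir]))  # apple vs. red: expected to be further apart

PCA differs from this mainly in centering the matrix before the factorization; LDA instead fits a probabilistic model of hidden topics.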
Neural networks
- training from examples = supervised training
- sometimes with negative examples
- examples are generated from texts
- from very simple ones (one layer) to deep ones (many layers)

NN training method
- one training example = (input, expected output) = (x, y)
- random initialization of the parameters
- for each example:
  get the output for the input: y' = NN(x)
  compute the loss, the difference between the expected and the actual output: loss = |y − y'|
  update the parameters to decrease the loss

Are vectors better than IDs?
- even one hit can provide useful information
- Little Prince corpus (21,000 tokens), modifiers of "planet": seventh, stately, sixth, wrong, tiny, fifth, ordinary, next, little, whole, each with 1 hit
- many of them are close together, i.e. they share a feature

Simple vector learning
- each word has two vectors: a node vector (node_w) and a context vector (ctx_w)
- generate (node, context) pairs from the text, for example from bigrams w1, w2: w1 is the context, w2 is the node
- move ctx_w1 and node_w2 closer together

Simple vector learning

    import numpy as np
    from scipy.special import expit  # the logistic (sigmoid) function

    # Node vectors start random in (-1, 1); context vectors start at zero.
    node_vec = np.random.rand(len(vocab), dim) * 2 - 1
    ctx_vec = np.zeros((len(vocab), dim))

    def train_pair(nodeid, ctxid, alpha):
        Nd = node_vec[nodeid]  # a view into node_vec, updated in place
        Ct = ctx_vec[ctxid]    # a view into ctx_vec, updated in place
        loss = 1 - expit(np.dot(Nd, Ct))
        corr = loss * alpha
        Nd += corr * (Ct - Nd)  # pull the node vector towards the context vector
        Ct += corr * (Nd - Ct)  # and the context vector towards the node vector

Expit (sigmoid) function
expit(x) = 1 / (1 + exp(−x)) = 1 / (1 + e^(−x))
- limits the range: the output is in (0, 1)

Simple vector learning

    for e in range(epochs):
        last = tokIDs[0]  # tokIDs: the training text as a list of word IDs
        for wid in tokIDs[1:]:
            train_pair(wid, last, alpha)  # previous word = context, current word = node
            last = wid
        # update alpha here (e.g. slowly decrease the learning rate)

Embeddings advantages
- no problem with the number of parameters
- similarity in many different directions
- good estimation of scores
- generalization: learning for some words generalizes to similar words

Embeddings of other items
- lemmata
- parts of speech
- topics
- any list of items with some structure

Summary
- numeric vectors provide a continuous space representation of words
- similar words are closer
- similarity in many different directions (features):
  morphology (number, gender)
  domain/style
  word formation
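To tie the pieces together, here is a self-contained end-to-end run of the simple vector learning scheme from the slides above. The toy corpus, the hyperparameter values, and the nearest() helper are my own illustration; real training would use a large tokenized text.

    import numpy as np
    from scipy.special import expit

    text = ("the red apple . the green apple . the red orange . "
            "the green orange . a small planet . a little planet .").split()
    vocab = sorted(set(text))
    word2id = {w: i for i, w in enumerate(vocab)}
    tokIDs = [word2id[w] for w in text]

    dim, epochs, alpha = 10, 200, 0.05
    node_vec = np.random.rand(len(vocab), dim) * 2 - 1
    ctx_vec = np.zeros((len(vocab), dim))

    def train_pair(nodeid, ctxid, alpha):  # as on the slides above
        Nd, Ct = node_vec[nodeid], ctx_vec[ctxid]
        corr = (1 - expit(np.dot(Nd, Ct))) * alpha
        Nd += corr * (Ct - Nd)
        Ct += corr * (Nd - Ct)

    for e in range(epochs):
        last = tokIDs[0]
        for wid in tokIDs[1:]:
            train_pair(wid, last, alpha)
            last = wid
        alpha *= 0.99  # slowly decrease the learning rate

    def nearest(word, n=3):
        """Words whose node vectors have the highest cosine similarity."""
        v = node_vec[word2id[word]]
        sims = node_vec @ v / (np.linalg.norm(node_vec, axis=1) * np.linalg.norm(v))
        return [vocab[i] for i in np.argsort(-sims) if vocab[i] != word][:n]

    print(nearest("apple"))  # "orange" shares the contexts red/green, so it should rank high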