Continuous Space Representation
PA153
Pavel Rychlý
18 Sep 2023

Problems with statistical NLP

- many distinct words (items) (follows from Zipf's law)
- zero counts
  - MLE gives zero probability:
    p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
- not handling similarities
  - some words share some (important) features:
    driver, teacher, butcher
    small, little, tiny

Many distinct words

How to solve:
- use only the most frequent words (ignore outliers)
- use smaller units (subwords)
  - prefixes, suffixes: -er, -less, pre-

But:
- we want to add more words
  - black hole is not black or a hole (the meaning is not compositional)
- even less frequent words are important
  - deagrofertizace, from "The deagrofertization of the state must come."
  - humans process them easily

Zero counts

How to solve:
- complicated smoothing strategies
  - Good-Turing, Kneser–Ney, back-off, ...
- bigger corpora
  - more data = better estimation

But:
- sometimes there is no more data
  - Shakespeare, a new research field
- no size is big enough

How big a corpus?

Noun test
- British National Corpus: 15,789 hits, rank 918
- word sketches from the Sketch Engine
  - object-of: pass, undergo, satisfy, fail, devise, conduct, administer, perform, apply, boycott
  - modifier: blood, driving, fitness, beta, nuclear, pregnancy
- can we freely combine any two items from those lists?

How big a corpus?

Collocations of the noun test
- blood test in the BNC
  - object-of: order (3), take (12)
- blood test in enClueWeb16 (16 billion tokens)
  - object-of: order (708), perform (959), undergo (174), administer (123), conduct (229), require (676), repeat (80), run (347), request (105), take (1215)
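To make the zero counts concrete: a minimal sketch of the MLE estimate from the first slide, on a toy corpus. The sentences and the helper p_mle are invented for illustration; the point is that any trigram never seen in training gets probability zero, no matter how plausible it is.

from collections import Counter

tokens = "doctors order a blood test and doctors take a blood test".split()

bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_mle(w1, w2, w3):
    # p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_mle("a", "blood", "test"))      # 1.0: every "a blood" continues with "test"
print(p_mle("blood", "test", "fails"))  # 0.0: unseen, so MLE calls it impossible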
How big a corpus?

Phrase pregnancy test in the 16-billion-token corpus

How big a corpus?

Phrase black hole in the 16-billion-token corpus

Similarities of words

Distinct words?
- supermassive, super-massive, Supermassive
- small, little, tiny
- black hole, star
- apple, banana, orange
- red, green, orange
- auburn, burgundy, mahogany, ruby

Continuous space representation

- words are not distinct
- each word is represented by a vector of numbers
- similar words are closer to each other
- more dimensions = more features
  - tens to hundreds, up to 1000

Words as vectors

continue = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349]

How to create a vector representation

From co-occurrence counts:
- Singular value decomposition (SVD)
  - each word is one dimension
  - select/combine the important dimensions
  - factorization of the co-occurrence matrix
- Principal component analysis (PCA)
- Latent Dirichlet Allocation (LDA)
  - learning probabilities of hidden variables
- Neural Networks

Neural Networks

- training from examples = supervised training
- sometimes negative examples
- generating examples from texts
- from very simple (one layer) to deep ones (many layers)

NN training method

- one training example = (input, expected output) = (x, y)
- random initialization of parameters
- for each example:
  - get the output for the input: y′ = NN(x)
  - compute the loss = difference between expected and real output: loss = |y − y′|
  - update parameters to decrease the loss

Are vectors better than IDs?

- even one hit can provide useful information
- Little Prince corpus (21,000 tokens)
- modifiers of "planet":
  seventh, stately, sixth, wrong, tiny, fifth, ordinary, next, little, whole
  - each with 1 hit
- many are close together, i.e. they share a feature

Simple vector learning

- each word has two vectors
  - node vector (node_w)
  - context vector (ctx_w)
- generate (node, context) pairs from text
- for example from bigrams w1, w2:
  - w1 is the context, w2 is the node
  - move ctx_w1 and node_w2 closer together

Simple vector learning

import numpy as np
from scipy.special import expit

# node vectors start random in [-1, 1), context vectors start at zero
node_vec = np.random.rand(len(vocab), dim) * 2 - 1
ctx_vec = np.zeros((len(vocab), dim))

def train_pair(nodeid, ctxid, alpha):
    global node_vec, ctx_vec
    Nd = node_vec[nodeid]              # view into node_vec
    Ct = ctx_vec[ctxid]                # view into ctx_vec
    loss = 1 - expit(np.dot(Nd, Ct))   # near 0 when the pair already fits
    corr = loss * alpha
    Nd += corr * (Ct - Nd)             # move the node vector towards the context
    Ct += corr * (Nd - Ct)             # and the context vector towards the node

Expit (sigmoid) function

expit(x) = 1 / (1 + exp(−x))

- limits the range: output in (0, 1)

Simple vector learning

for e in range(epochs):
    last = tokIDs[0]
    for wid in tokIDs[1:]:             # bigram (last, wid)
        train_pair(wid, last, alpha)   # previous token is the context
        last = wid
    # update alpha here (e.g. decay it once per epoch)

Embeddings advantages

- no problem with the number of parameters
- similarity in many different directions
- good estimates of scores
- generalization
  - learning for some words generalizes to similar words

Embeddings of other items

- lemmata
- parts of speech
- topics
- any list of items with some structure
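Putting the pieces of the simple vector learning together: a minimal runnable sketch. The slides leave vocab, tokIDs, dim, epochs and alpha undefined, so the toy corpus, the hyperparameter values and the per-epoch alpha decay below are assumptions for illustration, not part of the original method.

import numpy as np
from scipy.special import expit

text = ("the little planet and the tiny planet and the whole planet " * 50).split()

vocab = sorted(set(text))                      # word types
word2id = {w: i for i, w in enumerate(vocab)}
tokIDs = [word2id[w] for w in text]            # the corpus as a list of word IDs

dim, epochs, alpha = 10, 5, 0.1                # assumed values, not from the slides
node_vec = np.random.rand(len(vocab), dim) * 2 - 1
ctx_vec = np.zeros((len(vocab), dim))

def train_pair(nodeid, ctxid, alpha):
    Nd, Ct = node_vec[nodeid], ctx_vec[ctxid]  # views into the vector tables
    corr = (1 - expit(np.dot(Nd, Ct))) * alpha
    Nd += corr * (Ct - Nd)                     # in-place update of node_vec
    Ct += corr * (Nd - Ct)                     # in-place update of ctx_vec

for e in range(epochs):
    last = tokIDs[0]
    for wid in tokIDs[1:]:                     # bigram (last, wid)
        train_pair(wid, last, alpha)           # previous token is the context
        last = wid
    alpha *= 0.9                               # one simple way to update alpha

print(node_vec[word2id["planet"]])             # the learned vector for "planet"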
Summary

- numeric vectors provide a continuous space representation of words
- similar words are closer
- similarity in many different directions (features)
  - morphology (number, gender)
  - domain/style
  - word formation
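The slides say that similar words are closer but do not fix a distance; cosine similarity is the usual choice for word vectors (an assumption here, not stated above). A small sketch with invented 3-dimensional toy vectors:

import numpy as np

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| |v|), in [-1, 1]; 1 means same direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors, invented for illustration only
small = np.array([0.9, 0.1, 0.0])
tiny  = np.array([0.8, 0.2, 0.1])
test  = np.array([0.0, 0.9, 0.4])

print(cosine(small, tiny))   # high: the words share features
print(cosine(small, test))   # low: unrelated words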