Continuous Space Representation (PA153)
Pavel Rychlý

Problems with statistical NLP
- many distinct words (items) (from Zipf)
- zero counts
  - MLE gives zero probability
- not handling similarities
  - some words share some (important) features
  - driver, teacher, butcher
  - small, little, tiny

Many distinct words
How to solve:
- use only the most frequent ones (ignore outliers)
- use smaller units (subwords)
  - prefixes, suffixes: -er, -less, pre-
But:
- we want to add more words
  - "black hole" is not "black" or "hole"
- even less frequent words are important

Zero counts
How to solve:
- bigger corpora
  - more data = better estimation
But:
- sometimes there is no more data
  - Shakespeare, new research field
- no size is big enough

How big a corpus?
Noun "test"
- British National Corpus
  - 15789 hits, rank 918
- word sketches from the Sketch Engine
  - object-of: pass, undergo, satisfy, fail, devise, conduct, administer, perform, apply, boycott
  - modifier: blood, driving, fitness, beta, nuclear, pregnancy
- can we freely combine any two from those lists?

How big a corpus?
Collocations of noun "test"
- "blood test" in BNC
  - object-of: order (3), take (12)
- "blood test" in enClueWeb16 (16 billion tokens)
  - object-of: order (708), perform (959), undergo (174), administer (123), conduct (229), require (676), repeat (80), run (347), request (105), take (1215)

How big a corpus?
Phrase "pregnancy test" in a 16-billion-token corpus
Figure 1: pregnancy test word sketch

How big a corpus?
Phrase "black hole" in a 16-billion-token corpus
Figure 2: black hole word sketch

Similarities of words
Distinct words?
- supermassive, super-massive, Supermassive
- small, little, tiny
- black hole, star
- apple, banana, orange
- red, green, orange
- auburn, burgundy, mahogany, ruby

Continuous space representation
- words are not distinct
- represented by a vector of numbers
- similar words are closer to each other
- more dimensions = more features
- tens to hundreds, up to 1000

Words as vectors
continue = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349]

How to create a vector representation
From co-occurrence counts:
- Singular value decomposition (SVD)
  - each word one dimension
  - select/combine important dimensions
  - factorization of co-occurrence matrix
- Principal component analysis (PCA)
- Latent Dirichlet Allocation (LDA)
  - learning probabilities of hidden variables
- Neural Networks

Neural Networks
- training from examples = supervised training
- sometimes negative examples
- generating examples from texts
- from very simple (one layer) to deep ones (many layers)

Are vectors better than IDs?
- even one hit could provide useful information
- Little Prince corpus (21,000 tokens)
- modifiers of “planet”
  - seventh, stately, sixth, wrong, tine, fifth, ordinary, next, little, whole
  - each with 1 hit
  - many are close together, share a feature

Simple vector learning
- each word has two vectors
  - node vector (node_w)
  - context vector (ctx_w)
- generate (node, context) pairs from text
  - for example from bigrams: w1, w2
  - w1 is context, w2 is node
- move ctx_w1 and node_w2 closer together

Simple vector learning

    import numpy as np
    from scipy.special import expit  # logistic sigmoid

    # vocab, dim, tokIDs, epochs and alpha are assumed to be defined elsewhere.
    node_vec = np.random.rand(len(vocab), dim) * 2 - 1  # uniform in [-1, 1)
    ctx_vec = np.zeros((len(vocab), dim))

    def train_pair(nodeid, ctxid, alpha):
        global node_vec, ctx_vec
        L1 = node_vec[nodeid]  # row views: in-place updates modify the matrices
        L2 = ctx_vec[ctxid]
        # the step gets smaller as the dot product of the pair grows
        corr = (1 - expit(np.dot(L2, L1))) * alpha
        L1 += corr * (L2 - L1)  # move the node vector towards the context vector
        L2 += corr * (L1 - L2)  # move the context vector towards the updated node vector

Simple vector learning

    for e in range(epochs):
        last = tokIDs[0]
        for wid in tokIDs[1:]:
            train_pair(wid, last, alpha)  # the previous token is the context
            last = wid
        # update alpha
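
Simple vector learning: toy run

The update rule and training loop above can be tried end to end on a tiny text. The following is a minimal sketch under stated assumptions, not the lecture's reference implementation: the toy sentence, the helper names (`init_vectors`, `train`, `cosine`), the seed, `dim=8`, and the multiplicative `decay` of alpha are all made up for illustration (the slides only say "update alpha"). The sigmoid is defined locally so that the block depends only on NumPy, and the vectors are passed as arguments instead of globals.

```python
import numpy as np

def expit(x):
    """Logistic sigmoid; a local stand-in for scipy.special.expit."""
    return 1.0 / (1.0 + np.exp(-x))

def init_vectors(vocab_size, dim, seed=0):
    """Random node vectors in [-1, 1), zero context vectors (as on the slide)."""
    rng = np.random.default_rng(seed)
    node_vec = rng.random((vocab_size, dim)) * 2 - 1
    ctx_vec = np.zeros((vocab_size, dim))
    return node_vec, ctx_vec

def train_pair(node_vec, ctx_vec, node_id, ctx_id, alpha):
    """Move one node vector and one context vector closer to each other."""
    node = node_vec[node_id]  # row views: in-place updates change the matrices
    ctx = ctx_vec[ctx_id]
    corr = (1 - expit(np.dot(ctx, node))) * alpha
    node += corr * (ctx - node)
    ctx += corr * (node - ctx)

def train(tok_ids, node_vec, ctx_vec, epochs=20, alpha=0.1, decay=0.95):
    """Bigram training: the previous token is the context of the current one."""
    for _ in range(epochs):
        last = tok_ids[0]
        for wid in tok_ids[1:]:
            train_pair(node_vec, ctx_vec, wid, last, alpha)
            last = wid
        alpha *= decay  # hypothetical schedule, an assumption of this sketch

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy text, invented for this example (not data from the lecture).
text = ("the little prince visited the sixth planet "
        "the little prince visited the seventh planet").split()
vocab = sorted(set(text))
word2id = {w: i for i, w in enumerate(vocab)}
tok_ids = [word2id[w] for w in text]

node_vec, ctx_vec = init_vectors(len(vocab), dim=8)
train(tok_ids, node_vec, ctx_vec)

# "sixth" and "seventh" occur only as context of "planet".
print(cosine(ctx_vec[word2id["sixth"]], ctx_vec[word2id["seventh"]]))
```

Because "sixth" and "seventh" are both pulled towards the node vector of "planet", one would expect their context vectors to end up close to each other, even though each occurs only once. Note that this rule uses positive pairs only; without negative examples, repeated training tends to pull all connected vectors together.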