Continuous Space Representation
PA153
Pavel Rychlý
18 Sep 2023

Problems with statistical NLP

- many distinct words (items) (follows from Zipf's law)
- zero counts
  - MLE gives zero probability:
    p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
- not handling similarities
  - some words share some (important) features:
    driver, teacher, butcher
    small, little, tiny

Many distinct words

How to solve:
- use only the most frequent words (ignore outliers)
- use smaller units (subwords)
  - prefixes, suffixes: -er, -less, pre-

But:
- we want to add more words
  - black hole is not black or a hole (the meaning is not compositional)
- even less frequent words are important
  - deagrofertizace, from "The deagrofertization of the state must come."
  - humans process them easily

Zero counts

How to solve:
- complicated smoothing strategies
  - Good-Turing, Kneser–Ney, back-off, ...
- bigger corpora
  - more data = better estimation

But:
- sometimes there is no more data
  - Shakespeare, a new research field
- no size is big enough

How big a corpus?

Noun test
- British National Corpus: 15,789 hits, rank 918
- word sketches from the Sketch Engine
  - object-of: pass, undergo, satisfy, fail, devise, conduct, administer, perform, apply, boycott
  - modifier: blood, driving, fitness, beta, nuclear, pregnancy
- can we freely combine any two items from those lists?

How big a corpus?

Collocations of the noun test
- blood test in the BNC
  - object-of: order (3), take (12)
- blood test in enClueWeb16 (16 billion tokens)
  - object-of: order (708), perform (959), undergo (174), administer (123), conduct (229), require (676), repeat (80), run (347), request (105), take (1215)
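To make the zero counts concrete: a minimal sketch of the MLE estimate from the first slide, on a toy corpus. The sentences and the helper p_mle are invented for illustration; the point is that any trigram never seen in training gets probability zero, no matter how plausible it is.

from collections import Counter

tokens = "doctors order a blood test and doctors take a blood test".split()

bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_mle(w1, w2, w3):
    # p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_mle("a", "blood", "test"))      # 1.0: every "a blood" continues with "test"
print(p_mle("blood", "test", "fails"))  # 0.0: unseen, so MLE calls it impossible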
How big a corpus?

Phrase pregnancy test in the 16-billion-token corpus

How big a corpus?

Phrase black hole in the 16-billion-token corpus

Similarities of words

Distinct words?
- supermassive, super-massive, Supermassive
- small, little, tiny
- black hole, star
- apple, banana, orange
- red, green, orange
- auburn, burgundy, mahogany, ruby

Continuous space representation

- words are not distinct
- each word is represented by a vector of numbers
- similar words are closer to each other
- more dimensions = more features
  - tens to hundreds, up to 1000

Words as vectors

continue = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349]

How to create a vector representation

From co-occurrence counts:
- Singular value decomposition (SVD)
  - each word is one dimension
  - select/combine the important dimensions
  - factorization of the co-occurrence matrix
- Principal component analysis (PCA)
- Latent Dirichlet Allocation (LDA)
  - learning probabilities of hidden variables
- Neural Networks

Neural Networks

- training from examples = supervised training
- sometimes negative examples
- generating examples from texts
- from very simple (one layer) to deep ones (many layers)

NN training method

- one training example = (input, expected output) = (x, y)
- random initialization of parameters
- for each example:
  - get the output for the input: y′ = NN(x)
  - compute the loss = difference between expected and real output: loss = |y − y′|
  - update parameters to decrease the loss

Are vectors better than IDs?

- even one hit can provide useful information
- Little Prince corpus (21,000 tokens)
- modifiers of "planet":
  seventh, stately, sixth, wrong, tiny, fifth, ordinary, next, little, whole
  - each with 1 hit
- many are close together, i.e. they share a feature

Simple vector learning

- each word has two vectors
  - node vector (node_w)
  - context vector (ctx_w)
- generate (node, context) pairs from text
- for example from bigrams w1, w2:
  - w1 is the context, w2 is the node
  - move ctx_w1 and node_w2 closer together

Simple vector learning

import numpy as np
from scipy.special import expit

# node vectors start random in [-1, 1), context vectors start at zero
node_vec = np.random.rand(len(vocab), dim) * 2 - 1
ctx_vec = np.zeros((len(vocab), dim))

def train_pair(nodeid, ctxid, alpha):
    global node_vec, ctx_vec
    Nd = node_vec[nodeid]              # view into node_vec
    Ct = ctx_vec[ctxid]                # view into ctx_vec
    loss = 1 - expit(np.dot(Nd, Ct))   # near 0 when the pair already fits
    corr = loss * alpha
    Nd += corr * (Ct - Nd)             # move the node vector towards the context
    Ct += corr * (Nd - Ct)             # and the context vector towards the node

Expit (sigmoid) function

expit(x) = 1 / (1 + exp(−x))

- limits the range: output in (0, 1)

Simple vector learning

for e in range(epochs):
    last = tokIDs[0]
    for wid in tokIDs[1:]:             # bigram (last, wid)
        train_pair(wid, last, alpha)   # previous token is the context
        last = wid
    # update alpha here (e.g. decay it once per epoch)

Embeddings advantages

- no problem with the number of parameters
- similarity in many different directions
- good estimates of scores
- generalization
  - learning for some words generalizes to similar words

Embeddings of other items

- lemmata
- parts of speech
- topics
- any list of items with some structure
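Putting the pieces of the simple vector learning together: a minimal runnable sketch. The slides leave vocab, tokIDs, dim, epochs and alpha undefined, so the toy corpus, the hyperparameter values and the per-epoch alpha decay below are assumptions for illustration, not part of the original method.

import numpy as np
from scipy.special import expit

text = ("the little planet and the tiny planet and the whole planet " * 50).split()

vocab = sorted(set(text))                      # word types
word2id = {w: i for i, w in enumerate(vocab)}
tokIDs = [word2id[w] for w in text]            # the corpus as a list of word IDs

dim, epochs, alpha = 10, 5, 0.1                # assumed values, not from the slides
node_vec = np.random.rand(len(vocab), dim) * 2 - 1
ctx_vec = np.zeros((len(vocab), dim))

def train_pair(nodeid, ctxid, alpha):
    Nd, Ct = node_vec[nodeid], ctx_vec[ctxid]  # views into the vector tables
    corr = (1 - expit(np.dot(Nd, Ct))) * alpha
    Nd += corr * (Ct - Nd)                     # in-place update of node_vec
    Ct += corr * (Nd - Ct)                     # in-place update of ctx_vec

for e in range(epochs):
    last = tokIDs[0]
    for wid in tokIDs[1:]:                     # bigram (last, wid)
        train_pair(wid, last, alpha)           # previous token is the context
        last = wid
    alpha *= 0.9                               # one simple way to update alpha

print(node_vec[word2id["planet"]])             # the learned vector for "planet"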
Summary

- numeric vectors provide a continuous space representation of words
- similar words are closer
- similarity in many different directions (features)
  - morphology (number, gender)
  - domain/style
  - word formation
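The slides say that similar words are closer but do not fix a distance; cosine similarity is the usual choice for word vectors (an assumption here, not stated above). A small sketch with invented 3-dimensional toy vectors:

import numpy as np

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| |v|), in [-1, 1]; 1 means same direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors, invented for illustration only
small = np.array([0.9, 0.1, 0.0])
tiny  = np.array([0.8, 0.2, 0.1])
test  = np.array([0.0, 0.9, 0.4])

print(cosine(small, tiny))   # high: the words share features
print(cosine(small, test))   # low: unrelated words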