MUNI FI
Evaluation of Word Embeddings
PA154 Language Modeling (8.2)
Pavel Rychlý
Natural Language Processing Centre
Faculty of Informatics, Masaryk University
April 4, 2023

Word Embeddings
■ many hyperparameters, different training data
■ different results even for the same parameters and data
■ what is better?
■ how to compare the quality of vectors?
■ evaluate a direct outcome: word similarities

Sketch Engine Thesaurus
queen (noun), British National Corpus (BNC), freq = 7,872 (70.10 per million)

Lemma      Score   Freq
king       0.242   16,899
prince     0.213   6,355
charles    0.189   8,952
elizabeth  0.177   3,567
edward     0.176   6,484
mary       0.173   6,870
gentleman  0.171   6,274
lady       0.170   11,905
husband    0.167   11,669
sister     0.167   8,062
mother     0.164   27,536
princess   0.160   2,944
father     0.159   23,824
wife       0.157   18,308
brother    0.155   11,049
henry      0.151   6,699
daughter   0.150   11,216
anne       0.149   4,386

[Figure: word cloud of the most similar words to "queen" in the BNC]

Thesaurus evaluation
Gold standard

Source              Most similar words to queen
serelex             king, brooklyn, bowie, prime minister, mary, bronx, rolling stone, elton john, royal family, princess
Thesaurus.com       monarch, ruler, consort, empress, regent, female ruler, female sovereign, queen consort, queen dowager
SkE on BNC          king, prince, charles, elizabeth, edward, mary, gentleman, lady, husband, sister, mother, princess, father
SkE on enTenTen08   princess, prince, king, emperor, monarch, lord, lady, sister, lover, ruler, goddess, hero, mistress, warrior
word2vec on BNC     princess, prince, Princess, king, Diana, Queen, duke, palace, Buckingham, duchess, lady-in-waiting, Prince
powerthesaurus.org  empress, sovereign, monarch, ruler, czarina, queen consort, king, queen regnant, princess, rani, queen regent

Thesaurus evaluation
Gold standard
■ very low inter-annotator agreement
■ there are many directions of similarities
■ existing gold standards are not usable
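Word similarities of this kind can be read directly off trained vectors: a word's thesaurus entry is its set of nearest neighbours under cosine similarity. Below is a minimal Python sketch, assuming a hypothetical `embeddings` dictionary (word to numpy vector, e.g. trained by word2vec); note that the Sketch Engine thesaurus itself is computed from word sketches rather than embeddings, so this only illustrates the embedding side of the comparison.

# Minimal sketch: a thesaurus entry as nearest neighbours by cosine
# similarity. The `embeddings` dict is a hypothetical stand-in for
# vectors trained over a full vocabulary.
import numpy as np

embeddings = {
    "queen":    np.array([0.8, 0.9, 0.1]),
    "king":     np.array([0.9, 0.8, 0.1]),
    "princess": np.array([0.7, 0.9, 0.2]),
    "radio":    np.array([0.1, 0.2, 0.9]),
}

def cosine(x, y):
    # cos(x, y) = x . y / (|x| |y|)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def most_similar(word, topn=10):
    # rank all other words by cosine similarity to `word`
    target = embeddings[word]
    scores = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("queen"))   # e.g. [('king', ...), ('princess', ...), ...]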
Analogy queries
■ evaluation of word embeddings (word2vec)
■ "a is to a* as b is to b*", where b* is hidden
■ syntactic: good is to best as smart is to smartest
■ semantic: Paris is to France as Tokyo is to Japan
■ agreement by humans:
   Berlin - Germany
   London - England / Britain / UK ?
■ best match for a linear combination of vectors: $\arg\max_{b^* \in V} \cos(b^*, a^* - a + b)$

Analogy queries
Alternatives to cosine similarity
■ $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
■ $\arg\max_{b^* \in V} \cos(b^*, a^* - a + b) = \arg\max_{b^* \in V} \bigl(\cos(b^*, a^*) - \cos(b^*, a) + \cos(b^*, b)\bigr)$ (CosAdd)
■ $\arg\max_{b^* \in V} \frac{\cos(b^*, a^*) \, \cos(b^*, b)}{\cos(b^*, a) + \varepsilon}$ (CosMul)
■ SkE uses Jaccard similarity instead of cosine similarity: JacAdd, JacMul

Thesaurus Evaluation
Results on the capital-common-countries question set (462 queries)

           BNC               SkELL
           count   percent   count   percent
CosAdd     58      12.6      183     39.6
CosMul     99      21.4      203     43.9
JacAdd     32      6.9       319     69.0
JacMul     57      12.3      443     95.9
word2vec   159     34.4      366     79.2

Results depend not only on the data but also on the evaluation method.

Results on other corpora
More English corpora, using JacMul

Corpus                        size (M)   correct
BNC                           112        57
SkELL                         1,520      443
araneum maius (LCL sketches)  1,200      224
enclueweb16                   16,398     448
enTenTen08                    3,268      0
enTenTen12                    12,968     0
enTenTen13                    22,878     439
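A sketch of the two scoring functions above in Python, over the same kind of hypothetical `embeddings` mapping. For CosMul, similarities are shifted from [-1, 1] to [0, 1], following Levy and Goldberg's formulation, so that the product and quotient stay well behaved; ε guards against division by zero.

# Sketch of the CosAdd and CosMul analogy objectives from the slide above.
# `embeddings` is a hypothetical word -> numpy vector mapping; the query
# words themselves are excluded from the candidate set, as is usual.
import numpy as np

def cos(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cos01(x, y):
    # shift cosine from [-1, 1] to [0, 1] (Levy & Goldberg, 2014)
    return (cos(x, y) + 1) / 2

def cos_add(a, a_star, b, embeddings):
    # argmax_{b*} cos(b*, a*) - cos(b*, a) + cos(b*, b)
    cands = [w for w in embeddings if w not in (a, a_star, b)]
    return max(cands, key=lambda w: cos(embeddings[w], embeddings[a_star])
                                    - cos(embeddings[w], embeddings[a])
                                    + cos(embeddings[w], embeddings[b]))

def cos_mul(a, a_star, b, embeddings, eps=1e-3):
    # argmax_{b*} cos(b*, a*) * cos(b*, b) / (cos(b*, a) + eps)
    cands = [w for w in embeddings if w not in (a, a_star, b)]
    return max(cands, key=lambda w: cos01(embeddings[w], embeddings[a_star])
                                    * cos01(embeddings[w], embeddings[b])
                                    / (cos01(embeddings[w], embeddings[a]) + eps))

# e.g. cos_add("Paris", "France", "Tokyo", embeddings) should yield "Japan"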
Problems of analogy queries
■ a pair of words does not define an exact relation
   ■ Berlin - Germany: capital, biggest city
   ■ in what time?
   ■ Canberra, Rome
■ rare words/phrases
   ■ Baltimore - Baltimore Sun: Cincinnati - Cincinnati Enquirer

Outlier detection
■ a list of words
■ find the one which is not part of the cluster
■ examples:
   ■ red, blue, green, dark, yellow, purple, pink, orange, brown
   ■ t-shirt, sheet, dress, trousers, shorts, jumper, skirt, shirt, coat

Evaluating Outlier Detection
■ original data set by Camacho-Collados and Navigli (the 8-8-8 data set)
■ 8 sets, each with 8 words in a cluster and 8 outliers
■ 8 × 8 = 64 queries
■ Accuracy - the percentage of successfully answered queries
■ Outlier Position Percentage (OPP) score - the average percentage position of the right answer (the outlier) in the list of candidates ordered by their compactness

Problems of the original data set
■ English only
■ needs extra knowledge
   ■ Mercedes Benz, BMW, Michelin, Audi, Opel, Volkswagen, Porsche, Alpina, Smart
   ■ (Bridgestone, Boeing, Samsung, Michael Schumacher, Angela Merkel, Capri, pineapple)
■ Peter, Andrew, James, John, Thaddaeus, Bartholomew, Thomas, Noah, Matthew
■ January, March, May, July, Wednesday, September, November, February, June
■ tiger, dog, lion, cougar, jaguar, leopard, cheetah, wildcat, lynx
■ mostly proper names (7 out of 8)

New data set: HAMOD
■ 7 languages: Czech, Slovak, English, German, French, Italian, Estonian
■ 128 clusters (8 words + 8 outliers)
■ https://github.com/lexicalcomputing/hamod
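A sketch of how both metrics can be computed, under one common simplification: a word's compactness is taken as its mean cosine similarity to the rest of the set, and the predicted outlier is the least compact word (the original paper uses a pseudo-inverted compactness score; this variant and the helper names are assumptions).

# Sketch of compactness-based outlier detection plus the Accuracy and
# OPP metrics defined above. `embeddings` maps words to numpy vectors.
import numpy as np

def cos(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def compactness(word, words, embeddings):
    # mean similarity of `word` to the other words in the set
    others = [w for w in words if w != word]
    return np.mean([cos(embeddings[word], embeddings[o]) for o in others])

def outlier_position(cluster, outlier, embeddings):
    # position of the true outlier when the 9 words are ordered from most
    # to least compact; len(cluster) (= last place) means it was detected
    words = cluster + [outlier]
    ranked = sorted(words, key=lambda w: compactness(w, words, embeddings),
                    reverse=True)
    return ranked.index(outlier)

def evaluate(queries, embeddings):
    # queries: list of (cluster, outlier) pairs; returns (Accuracy %, OPP %)
    positions = [outlier_position(c, o, embeddings) for c, o in queries]
    n = len(queries[0][0])   # cluster size, e.g. 8
    accuracy = 100 * sum(p == n for p in positions) / len(positions)
    opp = 100 * sum(p / n for p in positions) / len(positions)
    return accuracy, opp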
New data set - example
(first 8 rows: cluster words; last 8 rows: outliers)

Colors                      Electronics
Czech         English       Czech           English
červená       red           televize        television
modrá         blue          reproduktor     speaker
zelená        green         notebook        laptop
žlutá         yellow        tablet          tablet
fialová       purple        mp3 přehrávač   mp3 player
růžová        pink          mobil           phone
oranžová      orange        rádio           radio
hnědá         brown         Playstation     Playstation
dřevěná       wooden        blok            notebook
skleněná      glass         sešit           workbook
temná         dark          kniha           book
zářivá        bright        CD              CD
pruhovaný     striped       energie         energy
puntíkovaný   dotted        světlo          light
smutná        sad           papír           paper
nízká         low           ráno            morning

Evaluation
■ 9 clusters only, 72 queries

                        OPP    Accuracy
Czes2                   92.2   70.8
czTenTen12              93.4   79.2
csTenTen17              94.3   81.9
czTenTen12 (fasttext)   97.7   87.5
Czech Common Crawl      98.1   95.8

Construction
■ each human evaluator goes through all the sets (only once) for their native language
■ 1 exercise: 8 inliers + 1 outlier (randomly chosen from the list of outliers for each set)
■ in each turn, the evaluator selects the outlier
■ simple web interface for the exercise
■ Inter-Annotator Agreement: Estonian 0.93, Czech 0.97

[Screenshot: web interface with one exercise - DRUMS, PIANO, HEADPHONES, HARP, DOUBLE BASS, FLUTE, GUITAR, VIOLIN, SAXOPHONE, plus "I'M NOT SURE" and "QUIT" buttons]
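For illustration, a minimal sketch of one annotation turn as described above: 8 inliers plus one outlier drawn at random from the set's outlier list, shuffled before being shown. The word lists echo the instrument exercise in the screenshot; which words are inliers and which are outliers is an assumption.

# Hypothetical sketch of generating one exercise from a HAMOD-style set:
# 8 inliers + 1 randomly chosen outlier, shuffled for display.
import random

cluster = ["drums", "piano", "harp", "double bass", "flute",
           "guitar", "violin", "saxophone"]            # 8 inliers (assumed)
outliers = ["headphones", "radio", "song", "concert",
            "conductor", "note", "loud", "dance"]      # 8 outliers (assumed)

def make_exercise(cluster, outliers):
    # one turn: pick an outlier at random, mix it among the inliers
    outlier = random.choice(outliers)
    words = cluster + [outlier]
    random.shuffle(words)
    return words, outlier

words, gold = make_exercise(cluster, outliers)
print(words)   # shown to the evaluator, who then selects the outlier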