Evaluation of Czech Distributional Thesauri

Pavel Rychlý
Natural Language Processing Centre
Faculty of Informatics, Masaryk University
December 7, 2019


Sketch Engine Thesaurus

queen: British National Corpus (BNC), freq = 7,872 (70.10 per million)

Lemma      Score  Freq
king       0.242  16,899
prince     0.213   6,355
Charles    0.189   8,952
elizabeth  0.177   3,567
edward     0.176   6,484
mary       0.173   6,870
gentleman  0.171   6,274
lady       0.170  11,905
husband    0.167  11,669
sister     0.167   8,062
mother     0.164  27,536
princess   0.160   2,944
father     0.159  23,824
wife       0.157  18,308
brother    0.155  11,049
henry      0.151   6,699
daughter   0.150  11,216
anne       0.149   4,386

[Figure: word cloud of the words most similar to "queen"]


Thesaurus evaluation

Gold standard

Source              Most similar words to queen
serelex             king, brooklyn, bowie, prime minister, mary, bronx, rolling stone, elton John, royal family, princess
Thesaurus.com       monarch, ruler, consort, empress, regent, female ruler, female sovereign, queen consort, queen dowager
SkE on BNC          king, prince, charles, elizabeth, edward, mary, gentleman, lady, husband, sister, mother, princess, father
SkE on enTenTen08   princess, prince, king, emperor, monarch, lord, lady, sister, lover, ruler, goddess, hero, mistress, warrior
word2vec on BNC     princess, prince, Princess, king, Diana, Queen, duke, palace, Buckingham, duchess, lady-in-waiting, Prince
powerthesaurus.org  empress, sovereign, monarch, ruler, czarina, queen consort, king, queen regnant, princess, rani, queen regent


Analogy queries

■ evaluation of word embeddings (word2vec)
■ "a is to a* as b is to b*", where b* is hidden
■ syntactic: good is to best as smart is to smartest
■ semantic: Paris is to France as Tokyo is to Japan
■ agreement by humans:
  ■ Berlin - Germany
  ■ London - England / Britain / UK ?
■ best match for linear combination of vectors: $\arg\max_{b^* \in V} \cos(b^*, a^* - a + b)$


Problems of analogy queries

■ Pair of words does not define an exact relation
  ■ Berlin - Germany: capital, biggest city
  ■ in what time?
  ■ Canberra, Rome
■ rare words/phrases
  ■ Baltimore - Baltimore Sun: Cincinnati - Cincinnati Enquirer


Outlier detection

■ list of words
■ find the one which is not part of the cluster
■ examples:
  ■ red, blue, green, dark, yellow, purple, pink, orange, brown
  ■ t-shirt, sheet, dress, trousers, shorts, jumper, skirt, shirt, coat


Evaluating Outlier Detection

■ original data set by Camacho-Collados, Navigli
■ 8 sets, each pairing a cluster of 8 words with 8 outliers
■ 8 x 8 = 64 queries
■ Accuracy - the percentage of successfully answered queries
■ Outlier Position Percentage (OPP) Score - average percentage of the right answer (Outlier Position) in the list of possible clusters ordered by their compactness


Problems of original data set

■ English only
■ needs extra knowledge
  ■ Mercedes Benz, BMW, Michelin, Audi, Opel, Volkswagen, Porsche, Alpina, Smart
  ■ (Bridgestone, Boeing, Samsung, Michael Schumacher, Angela Merkel, Capri, pineapple)
  ■ Peter, Andrew, James, John, Thaddaeus, Bartholomew, Thomas, Noah, Matthew
  ■ January, March, May, July, Wednesday, September, November, February, June
  ■ tiger, dog, lion, cougar, jaguar, leopard, cheetah, wildcat, lynx
■ mostly proper names (7 out of 8)


New data set

■ 5 languages: Czech, Slovak, English, German, French
■ 48 clusters (8 words + 8 outliers)


New data set - example

Colors                     Electronics
Czech        English       Czech          English

cluster words:
červená      red           televize       television
modrá        blue          reproduktor    speaker
zelená       green         notebook       laptop
žlutá        yellow        tablet         tablet
fialová      purple        mp3 přehrávač  mp3 player
růžová       pink          mobil          phone
oranžová     orange        rádio          radio
hnědá        brown         Playstation    Playstation

outliers:
dřevěná      wooden        blok           notebook
skleněná     glass         sešit          workbook
temná        dark          kniha          book
zářivá       bright        CD             CD
pruhovaný    striped       energie        energy
puntíkovaný  dotted        světlo         light
smutná       sad           papír          paper
nízká        low           ráno           morning


■ 9 clusters only, 72 queries

Corpus                 OPP   Accuracy
Czes2                  92.2  70.8
czTenTen12             93.4  79.2
csTenTen17             94.3  81.9
czTenTen12 (fasttext)  97.7  87.5
Czech Common Crawl     98.1  95.8
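

Sketch: computing Accuracy and OPP

The two measures from the "Evaluating Outlier Detection" slide can be illustrated with a small Python sketch. It is not the evaluation script behind the numbers in the talk. It assumes a simplified compactness reading in which every word is scored by its mean cosine similarity to the other words of the query, and the outlier position is the rank of the true outlier when words are sorted from most to least cluster-like. The names `outlier_position`, `evaluate` and `TOY_VECTORS`, and all vector values, are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def outlier_position(vectors, cluster, outlier):
    """Rank of the true outlier: 0 = most cluster-like, len(cluster) = detected."""
    words = list(cluster) + [outlier]

    def mean_similarity(word):
        # Assumption: compactness is approximated by the mean similarity
        # of a word to all other words in the query.
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    ranked = sorted(words, key=mean_similarity, reverse=True)
    return ranked.index(outlier)


def evaluate(vectors, queries):
    """Return (accuracy, OPP), both in percent, for (cluster, outlier) queries."""
    positions = [(outlier_position(vectors, c, o), len(c)) for c, o in queries]
    accuracy = 100.0 * sum(pos == n for pos, n in positions) / len(positions)
    opp = 100.0 * sum(pos / n for pos, n in positions) / len(positions)
    return accuracy, opp


# Hypothetical toy vectors; "purple" is deliberately placed far from the
# other colours so that the second query is answered incorrectly.
TOY_VECTORS = {
    "red": np.array([1.0, 0.0, 0.0]),
    "blue": np.array([1.0, 0.2, 0.0]),
    "green": np.array([0.9, 0.1, 0.0]),
    "purple": np.array([0.1, 1.0, 0.0]),
    "wooden": np.array([0.5, 0.0, 1.0]),
}

QUERIES = [
    (("red", "blue", "green"), "wooden"),   # outlier ranked last: detected
    (("red", "blue", "purple"), "wooden"),  # "purple" ranks below the outlier
]

accuracy, opp = evaluate(TOY_VECTORS, QUERIES)
print(f"Accuracy: {accuracy:.1f}  OPP: {opp:.1f}")
```

On this toy input only one of the two queries is answered correctly, yet the outlier is still ranked near the bottom in the failed query, so OPP stays well above Accuracy. The results table shows the same relation between the two columns.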
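

Sketch: answering an analogy query

The best-match rule from the "Analogy queries" slide, the argmax over b* of cos(b*, a* - a + b), can be sketched in a few lines of Python. This is a minimal illustration, not the word2vec implementation. The function name `solve_analogy` and the toy vectors are hypothetical, and excluding the three query words from the candidates is an assumption that follows common practice rather than anything stated on the slide.

```python
import numpy as np


def solve_analogy(vectors, a, a_star, b):
    """Return the word b* that maximises cos(b*, a* - a + b)."""
    target = vectors[a_star] - vectors[a] + vectors[b]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        # Assumption: the query words themselves are not valid answers.
        if word in (a, a_star, b):
            continue
        score = float(np.dot(vec, target) / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical toy embeddings: the first dimension marks "is a country",
# the remaining three mark association with France, Japan and Germany.
TOY_VECTORS = {
    "Paris": np.array([0.0, 1.0, 0.0, 0.0]),
    "France": np.array([1.0, 1.0, 0.0, 0.0]),
    "Tokyo": np.array([0.0, 0.0, 1.0, 0.0]),
    "Japan": np.array([1.0, 0.0, 1.0, 0.0]),
    "Berlin": np.array([0.0, 0.0, 0.0, 1.0]),
    "Germany": np.array([1.0, 0.0, 0.0, 1.0]),
}

print(solve_analogy(TOY_VECTORS, "Paris", "France", "Tokyo"))
```

With real embeddings the relation is never this clean, which is where the issues listed on the "Problems of analogy queries" slide come in: a single word pair such as Berlin - Germany does not pin down which relation is meant.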