MUNI FI Meanings Context similarity in huge corpora PA154 Language Modeling (7.2) The word (and some of its parts) are the basic carriers of meaning ■ word without context - no meaning, many meaning potentials ■ the same word in different contexts - different meanings ■ word in similar contexts - same meaning ■ what is context? Pavel Rychlý pary@fi.muni.cz March 30,2023 ^avel Rychlý ■ Context similarity in huge corpora ■ March 30,2023 What is context? Context is the words around the keyword. ■ What surroundings?: ■ the following word ■ previous word ■ window: +1 to +5 ■ window: -5 to -1 ■ Not all words around are important. ■ How do we determine importance? ■ the most common coLLocation - but that's "the" ■ (statistically) most significant - what formuLa? ^avel Rychlý ■ Context similarity in huge corpora ■ March 30,2023 Word Sketch One-page summary of word behaviour research is noun 25,537ä t usually In plurals [99.I'M . percentile 21 9) n x ^* =•: U X *=* =•: U modifier modifies subject_of scientific grant aim recent project focus cancer laboratory investigate empirical ... institute show market ... finding examine further ... contract indicate Cray programme suggest medical council reveal historical ... fellow explore applied centre concentrate Context si Hilarity Tň hu a^A^ra11- March 30,2023 "' involve 4/17 Word Sketch How to create it Grammatical Relations Definition ■ Large balanced corpus ■ Find grammatical realtions (subjects, objects, heads, modifiers etc) ■ List of collocations for each grammatical session ■ Statistics for sorting each list We can create a thesaurus from Word Sketch. plain text file a set of queries for each GR queries contain labels for keyword and collocate processing options ^avel Rychlý ■ Context similarity in huge corpora ■ March 30,2023 5/17 ^avel Rychlý ■ Context similarity in huge corpora ■ March 30,2023 6/17 Grammar relation definitions Association score # 'modifier' and 'modify' gramrels definition *DUAL ^modifier/modify 2:"AD." 1:"N. ." # 'and/or' gramrel definition =and/or *SYMMETRIC 1:[] [word="and"|word="or"] 2:[] & l.tag = 2.tag # 'adverb' gramrel definition =adverb 1:[] 2:"AV." 2:"AV." 1:[] number of occurrences (wordlt gramrel, word2) AScore(w1, R, w2) = 14 + log2 Dice 2-||w1)/?,W2| 14 + log2 ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 Similarity coefficient Data Sizes comparison of word sketches wi and w2 only important (significant) contexts what is the common counts (wordx, (gramrel, word,)) and (word2, (gramrel, word,)) Z)(tup;,tupy)e{tup1»1ntups^} + ASj - (AS, - ASj)2/S0 Sim(w1, w2) E tup, € {tupWl UtupW2 } AS; Corpus sizes, their vocabularies and word counts in contexts Corpus Size Words Lemat Different ctx All ctx BNC 111m 776k 722k 23m 63m SYN2000 114m 1.65m 776k 19m 58m OEC 1.12g 3.67m 3.12m 84m 569m Itwac 1.92g 6.32m 4.76m 67m 587m Vocabulary sizes and the number of different contexts grow sublinearly with the size of the corpus. ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 ^avel Rychlý ■ Context similarity in huge corpora ■ March 30,2023 Matrix size Practical data sizes Similarity of all pairs of lemmas Matrix of size N2, where N is 700k - 5m Number of elements in orders of tera (1012) Matrix is fortunately very sparse Most values are 0 or "almost" 0 Even most of the whole rows/columns are empty Computation only for words with minimum frequency Better to limit the number of contexts than the number of occurrences Take only statistically significant contexts Corpus MIN Lemmat KWIC CTX BNC 1 152k 5.7m 608 k BNC 20 68 k 5.6m 588k OEC 2 269k 27.5 m 994k OEC 20 128k 27.3m 981k OEC 200 48 k 26.7m 965k Itwac 20 137k 24.8 m 1.1m ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 11/17 ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 12/17 Practical data sizes ■ Matrix of size N2, where N is 50k - 200k ■ Number of elements in orders of giga (1010) ■ The value of each element is created by applying the similarity function to vectors of length K = 500k - lm. ■ The straightforward algorithm for computing the whole matrix has a time complexity 0(N2K). ■ The complexity is polynomial, but the algorithm is practically unusable for given ranges of values. ■ Estimated calculation times are in months or years. ■ Heuristics reduce the sizes of N and K at the expense of accuracy the resulting values. ■ The calculation time is then in the order of days with an error of 1-4%. Efficient algorithm Even the smaller matrix is very sparse No need to calculate similarity for words that have nothing together, they have no common context. The main loop of the algorithm is not through words, but through contexts. ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 Efficient algorithm Optimization ■ Input: list of all possible words in contexts, (w, r, w'), with frequencies of occurrences in the corpus ■ Output: word similarity matrix sim(wi, w2) for (r, w') in CONTEXTS: WLIST = set of all w where (w, r, w') exists for wi in WLIST: for w2 in WLIST: sim(w1,W2)+ = /(frequencies) If | WLIST\ > 10000, skip the context. We do not keep the matrix sim(wi, w2) in memory during the ca Iculation. Repeated runs of the main loop for the limited range w^. Instead of sim(wi, w2)+ = x we generate (wlt w2,x) to the output. We then sort the output list and add the individual xs. Use of TPMMS (Two Phase Multi-way Merge Sort) with continuous by summation. Instead of several hundred GB, we sort a few GB. ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 ^avel Rychlý ■ Context similarity in huge corpora ■ March 30,2023 Results Algorithm is orders of magnitude faster than straightforward algorithm. (18 days x 2 hours) Corpus MIN Lemmat KWIC CTX Time BNC 1 152k 5.7m 608k 13m 9s BNC 20 68k 5.6m 588k 9m 30s OEC 2 269k 27.5 m 994k lh40m OEC 20 128k 27.3m 981k lh 27m OEC 200 48 k 26.7m 965k lh 10m Itwac 20 137k 24.8m 1.1m lh 16m Without changes in precision Possibilities of easy parallelization. ^avel Rychly ■ Context similarity in huge corpora ■ March 30,2023 17/17