Context similarity in huge corpora PA154 Language Modeling (8.2) Pavel Rychlý pary@fi.muni.cz April 16, 2024 Meanings The word (and some of its parts) are the basic carriers of meaning ■ word without context - no meaning, many meaning potentials ■ the same word in different contexts - different meanings ■ word in similar contexts - same meaning ■ what is context? Pavel Rychly • Context similarity in huge corpora • April 16, 2024 2/17 What is context? Context is the words around the keyword. ■ What surroundings?: ■ the following word ■ previous word ■ window: +1 to +5 ■ window: -5 to -1 ■ Not all words around are important. ■ How do we determine importance? ■ the most common collocation - but that's "the" ■ (statistically) most significant - what formula? Pavel Rychly • Context similarity in huge corpora • April 16, 2024 3/17 Word Sketch One-page summary of word behaviour G333EE9 research as noun 25,537x ▼ €^ ± usually in plurals (99.1%. percentile 21.9) :•: n X *=• =•: U X *=> =•= U X modifier modifies subject_of scientific grant aim recent project focus cancer laboratory investigate empirical institute show market finding examine further contract indicate Cray programme suggest medical council reveal historical applied fellow explore hu centre concentrate 7 a\^£^y?:T^e» Context similarity in gj^T&rS1. April 16, 2024 involve Word Sketch How to create it ■ Large balanced corpus ■ Find grammatical reaLtions (subjects, objects, heads, modifiers etc) ■ List of collocations for each grammatical session ■ Statistics for sorting each List We can create a thesaurus from Word Sketch. Pavel Rychly • Context similarity in huge corpora • April 16, 2024 5/17 Grammatical Relations Definition ■ plain text file ■ a set of queries for each GR ■ queries contain Labels for keyword and collocate ■ processing options Pavel Rychly • Context similarity in huge corpora • April 16, 2024 6/17 Grammar relation definitions # 'modifier' and 'modify' gramrels definition *DUAL =modifier/modify 2:"AD." 1:"N.." # 'and/or' gramrel definition =and/or ^SYMMETRIC 1:[] [word="and"|word="or"] 2:[] & l.tag = 2.tag # 'adverb' gramrel definition =adverb 1:[] 2:"AV." 2:"AV." 1:[] Pavel Rychly • Context similarity in huge corpora • April 16, 2024 7/17 Association score number of occurrences (wordi,gramret, wordj) AScore(Wl, R, w2) = 14 + log2 Dice{^^, 14 + |oe,_2'IKi.^2ll_ 1 &^ Wi,/?,* + *,*,W2 Pavel Rychly • Context similarity in huge corpora • April 16, 2024 Similarity coefficient ■ comparison of word sketches w\ and wj ■ only important (significant) contexts ■ what is the common ■ counts {word{gramrel, wordf)) and {wordj, {gramrel, wordf)) n. , , ^2(tuphtupj)e{tupw, ntupsW7} A$i + A$j ~ (A$i ~ A$j)2 Sim{w1, w2) =---^-2--—- l^tupie{tupWl utupW2} Pavel Rychly • Context similarity in huge corpora • April 16, 2024 9/17 Data Sizes Corpus sizes, their vocabularies and word counts in contexts Corpus Size Words Lemat Different ctx ALL ctx BNC 111m 776k 722k 23m 63m SYN2000 114m 1.65m 776k 19m 58m OEC 1.12g 3.67m 3.12m 84m 569m Itwac 1.92g 6.32m 4.76m 67m 587m Vocabulary sizes and the number of different contexts grow subLinearLy with the size of the corpus. Pavel Rychly • Context similarity in huge corpora • April 16, 2024 10/17 Matrix size ■ Similarity of all pairs of Lemmas ■ Matrix of size A/2, where N is 700k - 5m ■ Number of elements in orders of tera (1012) ■ Matrix is fortunately very sparse ■ Most values are 0 or "almost" 0 ■ Even most of the whole rows/columns are empty Pavel Rychly • Context similarity in huge corpora • April 16, 2024 11/17 Practical data sizes ■ Computation only for words with minimum frequency ■ Better to Limit the number of contexts than the number of occurrences ■ Take onLy statistically significant contexts Corpus MIN Lemmat KWIC CTX BNC 1 152k 5.7m 608k BNC 20 68k 5.6m 588k OEC 2 269k 27.5m 994k OEC 20 128k 27.3m 981k OEC 200 48k 26.7m 965k Itwac 20 137k 24.8m 1.1m Pavel Rychly • Context similarity in huge corpora • April 16, 2024 12/17 Practical data sizes ■ Matrix of size A/2, where N is 50k - 200k ■ Number of elements in orders of giga (1010) ■ The value of each element is created by applying the similarity function to vectors of Length K = 500k - lm. ■ The straightforward algorithm for computing the whole matrix has a time complexity 0(N2K). ■ The complexity is polynomial, but the algorithm is practically unusable for given ranges of values. ■ Estimated calculation times are in months or years. ■ Heuristics reduce the sizes of N and K at the expense of accuracy the resulting values. ■ The calculation time is then in the order of days with an error of 1-4%. Pavel Rychly • Context similarity in huge corpora • April 16, 2024 13/17 Efficient algorithm ■ Even the smaller matrix is very sparse ■ No need to calculate similarity for words that have nothing together, ■ they have no common context. ■ The main Loop of the algorithm is not through words, but through contexts. Pavel Rychly • Context similarity in huge corpora • April 16, 2024 14/17 Efficient algorithm ■ Input: List of aLL possible words in contexts, (w, r, w'), with frequencies of occurrences in the corpus ■ Output: word similarity matrix sim{w\, wj) for (r, w') in CONTEXTS: WLIST = set of aLL w where (w, r, w1) exists for in WLIST: for w2 in WLIST: sim(wi, wj)+ = /{frequencies) Pavel Rychly • Context similarity in huge corpora • April 16, 2024 15/17 Optimization If \WLIST\ > 10000, skip the context. We do not keep the matrix sim(wi, wj) in memory during the calculation. Repeated runs of the main Loop for the Limited range w\. Instead of sim(wi, wj)+ = x we generate (wi, wj,x) to the output. We then sort the output List and add the individual xs. Use of TPMMS (Two Phase MuLti-way Merge Sort) with continuous by summation. Instead of several hundred GB, we sort a few GB. Pavel Rychly • Context similarity in huge corpora • April 16, 2024 16/17 Results ■ Algorithm is orders of magnitude faster than straightforward algorithm. (18 days x 2 hours) Corpus MIN Lemmat KWIC CTX Time BNC 1 152k 5.7m 608k 13m 9s BNC 20 68k 5.6m 588k 9m 30s OEC 2 269k 27.5m 994k lh 40m OEC 20 128k 27.3m 981k lh 27m OEC 200 48k 26.7m 965k lh 10m Itwac 20 137k 24.8m 1.1m lh 16m ■ Without changes in precision ■ Possibilities of easy paraUeLization. Pavel Rychly • Context similarity in huge corpora • April 16, 2024 17/17