Context similarity in huge corpora
PA154 Language Modeling (7.2)
Pavel Rychly
pary@fi.muni.cz
March 30, 2023

Meanings
The word (and some of its parts) is the basic carrier of meaning.
■ word without context - no meaning, only many meaning potentials
■ the same word in different contexts - different meanings
■ a word in similar contexts - the same meaning
■ what is context?
Pavel Rychly • Context similarity in huge corpora • March 30, 2023 2/17

What is context?
Context is the words around the keyword.
■ Which surroundings?
  ■ the following word
  ■ the previous word
  ■ window: +1 to +5
  ■ window: -5 to -1
■ Not all words around are important.
■ How do we determine importance?
  ■ the most common collocation - but that is "the"
  ■ the (statistically) most significant - which formula?

Word Sketch
One-page summary of word behaviour.
[Word Sketch screenshot: collocates of "research" as noun (25,537 occurrences), grouped by grammatical relation - modifier: scientific, recent, cancer, empirical, market, further, medical, historical, applied; modifies: grant, project, laboratory, institute, finding, contract, programme, council, fellow, centre; subject_of: aim, focus, investigate, show, examine, indicate, suggest, reveal, explore, concentrate, involve]

Word Sketch - How to create it
■ Large balanced corpus
■ Find grammatical relations (subjects, objects, heads, modifiers etc.)
■ List of collocations for each grammatical relation
■ Statistics for sorting each list
We can create a thesaurus from Word Sketches.
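The window contexts listed above can be sketched in a few lines of Python. This is a toy illustration over raw tokens (the helper name and tokenization are my own); a real word sketch works over a lemmatized, tagged corpus and uses grammatical relations instead of raw offsets.

```python
from collections import Counter

def window_contexts(tokens, lo=-5, hi=5):
    """Count (keyword, offset, collocate) triples in a -5..+5 window.

    Toy stand-in for the 'window' contexts from the slides; offsets
    -1 and +1 correspond to the previous/following word.
    """
    counts = Counter()
    for i, w in enumerate(tokens):
        for off in range(lo, hi + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens):
                counts[(w, off, tokens[j])] += 1
    return counts

ctx = window_contexts("the cat sat on the mat".split())
# e.g. ctx[("cat", 1, "sat")] == 1
```

Counting raw frequencies like this immediately shows the problem named on the slide: function words such as "the" dominate every window, which is why a significance score is needed next.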
Grammatical Relations Definition
■ plain text file
■ a set of queries for each GR
■ queries contain labels for the keyword and the collocate
■ processing options

Grammar relation definitions
# 'modifier' and 'modify' gramrels definition
*DUAL
=modifier/modify
2:"AD." 1:"N.."

# 'and/or' gramrel definition
=and/or
*SYMMETRIC
1:[] [word="and"|word="or"] 2:[] & 1.tag = 2.tag

# 'adverb' gramrel definition
=adverb
1:[] 2:"AV."
2:"AV." 1:[]

Association score
|w1,R,w2| = number of occurrences of (word1, gramrel, word2)

AScore(w1, R, w2) = 14 + log2 Dice
                  = 14 + log2 ( 2·|w1,R,w2| / ( |w1,R,*| + |*,*,w2| ) )

Similarity coefficient
■ comparison of the word sketches of w1 and w2
■ only important (significant) contexts
■ what is the common part?
■ counts (word1, (gramrel, word')) and (word2, (gramrel, word'))

               Σ_{(tup_i, tup_j) ∈ tup_w1 ∩ tup_w2} ( AS_i + AS_j − (AS_i − AS_j)²/50 )
Sim(w1, w2) = ──────────────────────────────────────────────────────────────────────
                            Σ_{tup ∈ tup_w1 ∪ tup_w2} AS_tup

Data Sizes
Corpus sizes, their vocabularies and word counts in contexts:

Corpus    Size    Words   Lemmas  Different ctx  ALL ctx
BNC       111m    776k    722k    23m            63m
SYN2000   114m    1.65m   776k    19m            58m
OEC       1.12g   3.67m   3.12m   84m            569m
Itwac     1.92g   6.32m   4.76m   67m            587m

Vocabulary sizes and the number of different contexts grow sublinearly with the size of the corpus.
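The association score and similarity coefficient above can be written directly in Python. A minimal sketch, assuming word sketches are given as dicts mapping a (gramrel, collocate) tuple to its AScore (the function names are mine):

```python
import math

def ascore(f_pair, f_w1, f_w2):
    """logDice association score from the slides:
    AScore = 14 + log2(2*|w1,R,w2| / (|w1,R,*| + |*,*,w2|)).
    Its maximum is 14, reached when the two words always co-occur."""
    return 14 + math.log2(2 * f_pair / (f_w1 + f_w2))

def sim(sketch1, sketch2):
    """Similarity coefficient over two word sketches given as
    {(gramrel, collocate): AScore} dicts: shared contexts contribute
    AS_i + AS_j minus a penalty for differing scores, normalized by
    the total score mass of both sketches."""
    shared = sketch1.keys() & sketch2.keys()
    num = sum(sketch1[t] + sketch2[t] - (sketch1[t] - sketch2[t]) ** 2 / 50
              for t in shared)
    den = sum(sketch1.get(t, 0.0) + sketch2.get(t, 0.0)
              for t in sketch1.keys() | sketch2.keys())
    return num / den

print(ascore(10, 10, 10))  # -> 14.0
```

With identical sketches the penalty term vanishes and `sim` returns 1.0; with no shared contexts it returns 0.0, matching the intuition that the score measures overlap of significant contexts.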
Matrix size
■ Similarity of all pairs of lemmas
■ Matrix of size N², where N is 700k - 5m
■ Number of elements on the order of tera (10^12)
■ Fortunately, the matrix is very sparse
■ Most values are 0 or "almost" 0
■ Even most of the whole rows/columns are empty

Practical data sizes
■ Computation only for words with a minimum frequency
■ Better to limit the number of contexts than the number of occurrences
■ Take only statistically significant contexts

Corpus   MIN   Lemmas  KWIC    CTX
BNC      1     152k    5.7m    608k
BNC      20    68k     5.6m    588k
OEC      2     269k    27.5m   994k
OEC      20    128k    27.3m   981k
OEC      200   48k     26.7m   965k
Itwac    20    137k    24.8m   1.1m

Practical data sizes
■ Matrix of size N², where N is 50k - 200k
■ Number of elements on the order of giga (10^10)
■ The value of each element is computed by applying the similarity function to vectors of length K = 500k - 1m.
■ The straightforward algorithm for computing the whole matrix has time complexity O(N²K).
■ The complexity is polynomial, but the algorithm is practically unusable for the given ranges of values.
■ Estimated calculation times are in months or years.
■ Heuristics reduce the sizes of N and K at the expense of the accuracy of the resulting values.
■ The calculation time is then on the order of days, with an error of 1-4%.

Efficient algorithm
■ Even the smaller matrix is very sparse.
■ No need to calculate similarity for words that have nothing in common:
  they share no context.
■ The main loop of the algorithm is not over words but over contexts.
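The context-major loop just described can be sketched in Python. This is an illustrative skeleton, not the real implementation: the per-pair contribution `min(f1, f2)` is a placeholder where the actual system plugs in the AScore-based terms of the similarity coefficient.

```python
from collections import defaultdict

def similarity_matrix(triples):
    """Context-major loop: `triples` maps (w, r, w') -> frequency.

    Group words by the context (r, w') they occur in, then update
    only the pairs of words that actually share that context, so
    pairs with no common context are never touched.
    """
    by_ctx = defaultdict(list)
    for (w, r, w2), f in triples.items():
        by_ctx[(r, w2)].append((w, f))

    sim = defaultdict(float)
    for wlist in by_ctx.values():
        for w1, f1 in wlist:
            for w2, f2 in wlist:
                if w1 < w2:                      # each unordered pair once
                    sim[(w1, w2)] += min(f1, f2)  # placeholder contribution
    return dict(sim)
```

Note that the output only contains entries for word pairs with at least one shared context, which is exactly the sparsity the slide exploits.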
Efficient algorithm
■ Input: list of all words in contexts, (w, r, w'), with frequencies of occurrences in the corpus
■ Output: word similarity matrix sim(w1, w2)

for (r, w') in CONTEXTS:
    WLIST = set of all w where (w, r, w') exists
    for w1 in WLIST:
        for w2 in WLIST:
            sim(w1, w2) += f(frequencies)

Optimization
■ If |WLIST| > 10000, skip the context.
■ We do not keep the matrix sim(w1, w2) in memory during the calculation.
■ Repeated runs of the main loop for a limited range of w1.
■ Instead of sim(w1, w2) += x we emit (w1, w2, x) to the output.
■ We then sort the output list and add up the individual x values.
■ Use of TPMMS (Two-Phase Multi-way Merge Sort) with continuous summation.
■ Instead of several hundred GB, we sort a few GB.

Results
■ The algorithm is orders of magnitude faster than the straightforward one (18 days vs 2 hours).

Corpus   MIN   Lemmas  KWIC    CTX    Time
BNC      1     152k    5.7m    608k   13m 9s
BNC      20    68k     5.6m    588k   9m 30s
OEC      2     269k    27.5m   994k   1h 40m
OEC      20    128k    27.3m   981k   1h 27m
OEC      200   48k     26.7m   965k   1h 10m
Itwac    20    137k    24.8m   1.1m   1h 16m

■ Without changes in precision
■ Easy to parallelize
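The sort-then-sum step of the optimization can be emulated in memory. A minimal sketch, assuming the main loop has emitted (w1, w2, x) records; the real system uses an external Two-Phase Multi-way Merge Sort because the records do not fit in RAM, but the merging logic is the same:

```python
from itertools import groupby

def sum_records(records):
    """Sort (w1, w2, x) records by pair and sum the partial
    contributions x for each pair, as the TPMMS pass does
    externally; summing during the merge keeps only one record
    per pair instead of many."""
    records = sorted(records, key=lambda r: (r[0], r[1]))
    return {key: sum(x for _, _, x in grp)
            for key, grp in groupby(records, key=lambda r: (r[0], r[1]))}
```

For example, `sum_records([("a", "b", 1.0), ("a", "c", 2.0), ("a", "b", 0.5)])` merges the two `("a", "b")` contributions into a single entry with value 1.5.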