Context similarity in huge corpora
PA154 Language Modeling (8.2)
Pavel Rychlý
pary@fi.muni.cz April 16, 2024
Meanings
The word (and some of its parts) are the basic carriers of meaning
■ word without context - no meaning, many meaning potentials
■ the same word in different contexts - different meanings
■ word in similar contexts - same meaning
■ what is context?
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
2/17
What is context?
Context is the words around the keyword.
■ What surroundings?:
■ the following word
■ previous word
■ window: +1 to +5
■ window: -5 to -1
■ Not all words around are important.
■ How do we determine importance?
■ the most common collocation - but that's "the"
■ (statistically) most significant - what formula?
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
3/17
Word Sketch
One-page summary of word behaviour G333EE9
research as noun 25,537x ▼ €^ ±
usually in plurals (99.1%. percentile 21.9)
:•: n X    *=• =•: U X    *=> =•= U X
modifier		modifies		subject_of	
scientific		grant		aim	
recent		project		focus	
cancer		laboratory		investigate	
empirical		institute		show	
market		finding		examine	
further		contract		indicate	
Cray		programme		suggest	
medical		council		reveal	
historical applied		fellow		explore	
	hu	centre		concentrate	7
a\^£^y?:T^e» Context similarity in		gj^T&rS1. April 16, 2024		involve	
Word Sketch
How to create it
■ Large balanced corpus
■ Find grammatical reaLtions (subjects, objects, heads, modifiers etc)
■ List of collocations for each grammatical session
■ Statistics for sorting each List
We can create a thesaurus from Word Sketch.
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
5/17
Grammatical Relations Definition
■ plain text file
■ a set of queries for each GR
■ queries contain Labels for keyword and collocate
■ processing options
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
6/17
Grammar relation definitions
# 'modifier' and 'modify' gramrels definition *DUAL
=modifier/modify 2:"AD." 1:"N.."
# 'and/or' gramrel definition =and/or
^SYMMETRIC
1:[] [word="and"|word="or"] 2:[] & l.tag = 2.tag
# 'adverb' gramrel definition =adverb
1:[] 2:"AV." 2:"AV." 1:[]
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
7/17
Association score
number of occurrences (wordi,gramret, wordj)
AScore(Wl, R, w2) = 14 + log2 Dice{^^, 14 + |oe,_2'IKi.^2ll_
1       &^    Wi,/?,*  + *,*,W2
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
Similarity coefficient
■ comparison of word sketches w\ and wj
■ only important (significant) contexts
■ what is the common
■ counts {word{gramrel, wordf)) and {wordj, {gramrel, wordf))
n.  ,        ,     ^2(tuphtupj)e{tupw, ntupsW7} A$i + A$j ~ (A$i ~ A$j)2 Sim{w1, w2) =---^-2--—-
l^tupie{tupWl utupW2}
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
9/17
Data Sizes
Corpus sizes, their vocabularies and word counts in contexts
Corpus	Size	Words	Lemat	Different ctx	ALL ctx
BNC	111m	776k	722k	23m	63m
SYN2000	114m	1.65m	776k	19m	58m
OEC	1.12g	3.67m	3.12m	84m	569m
Itwac	1.92g	6.32m	4.76m	67m	587m
Vocabulary sizes and the number of different contexts grow subLinearLy with the size of the corpus.
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
10/17
Matrix size
■ Similarity of all pairs of Lemmas
■ Matrix of size A/2, where N is 700k - 5m
■ Number of elements in orders of tera (1012)
■ Matrix is fortunately very sparse
■ Most values are 0 or "almost" 0
■ Even most of the whole rows/columns are empty
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
11/17
Practical data sizes
■ Computation only for words with minimum frequency
■ Better to Limit the number of contexts than the number of occurrences
■ Take onLy statistically significant contexts
Corpus	MIN	Lemmat	KWIC	CTX
BNC	1	152k	5.7m	608k
BNC	20	68k	5.6m	588k
OEC	2	269k	27.5m	994k
OEC	20	128k	27.3m	981k
OEC	200	48k	26.7m	965k
Itwac	20	137k	24.8m	1.1m
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
12/17
Practical data sizes
■ Matrix of size A/2, where N is 50k - 200k
■ Number of elements in orders of giga (1010)
■ The value of each element is created by applying the similarity function to vectors of Length K = 500k - lm.
■ The straightforward algorithm for computing the whole matrix has a time complexity 0(N2K).
■ The complexity is polynomial, but the algorithm is practically unusable for given ranges of values.
■ Estimated calculation times are in months or years.
■ Heuristics reduce the sizes of N and K at the expense of accuracy the resulting values.
■ The calculation time is then in the order of days with an error of 1-4%.
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
13/17
Efficient algorithm
■ Even the smaller matrix is very sparse
■ No need to calculate similarity for words that have nothing together,
■ they have no common context.
■ The main Loop of the algorithm is not through words, but through contexts.
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
14/17
Efficient algorithm
■ Input: List of aLL possible words in contexts, (w, r, w'), with frequencies of occurrences in the corpus
■ Output: word similarity matrix sim{w\, wj)
for (r, w') in CONTEXTS:
WLIST = set of aLL w where (w, r, w1) exists for    in WLIST: for w2 in WLIST:
sim(wi, wj)+ = /{frequencies)
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
15/17
Optimization
If \WLIST\ > 10000, skip the context.
We do not keep the matrix sim(wi, wj) in memory during the calculation.
Repeated runs of the main Loop for the Limited range w\.
Instead of sim(wi, wj)+ = x we generate (wi, wj,x) to the output.
We then sort the output List and add the individual xs.
Use of TPMMS (Two Phase MuLti-way Merge Sort) with continuous by summation.
Instead of several hundred GB, we sort a few GB.
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
16/17
Results
■ Algorithm is orders of magnitude faster than straightforward
algorithm. (18 days x 2 hours)
Corpus	MIN	Lemmat	KWIC	CTX	Time
BNC	1	152k	5.7m	608k	13m 9s
BNC	20	68k	5.6m	588k	9m 30s
OEC	2	269k	27.5m	994k	lh 40m
OEC	20	128k	27.3m	981k	lh 27m
OEC	200	48k	26.7m	965k	lh 10m
Itwac	20	137k	24.8m	1.1m	lh 16m
■ Without changes in precision
■ Possibilities of easy paraUeLization.
Pavel Rychly • Context similarity in huge corpora • April 16, 2024
17/17