Rapid Ukrainian-English Dictionary Creation Using Post-Edited Corpus Data

Marek Blahuš1, Michal Cukr1, Ondřej Herman1,2, Miloš Jakubíček1,2, Vojtěch Kovář1,2, Jan Kraus1, Marek Medveď1,2, Vlasta Ohlídalová1,2, Vít Suchomel1,2
1 Lexical Computing, Brno, Czechia
2 Faculty of Informatics, Masaryk University, Brno, Czechia
E-mail: firstname.lastname@sketchengine.eu

Abstract

This paper describes the development of a new corpus-based Ukrainian-English dictionary. The dictionary was built from scratch; no pre-existing dictionary data were used. A rapid dictionary development method was applied which consists of generating dictionary parts directly from a large corpus and of post-editing the automatically generated data by native speakers of Ukrainian (not professional lexicographers). The method builds on Baisa et al. (2019), which we improved and updated, and we used a different data management model. As the data source, a 3-billion-word Ukrainian web corpus from the TenTen series (Jakubíček et al., 2013) was used. The paper briefly describes the corpus; then we thoroughly explain the individual steps of the workflow, including the volume of manual work needed for the particular phases in terms of person-days. We also present details about the newly created dictionary and discuss directions for its further development.

Keywords: Ukrainian; post-editing; dictionary; lexicography

1. Introduction

For decades, language corpora have served as source data for dictionary building. In recent years, corpora have also been used for automatic generation of various dictionary parts (Rundell & Kilgarriff, 2011; Kosem et al., 2018; Gantar et al., 2016; Kallas et al., 2015). These automatic outputs were then post-edited by professional lexicographers to ensure the data quality of the resulting dictionary.

With the advancement of technology, it is now possible to create whole dictionaries using this scenario of automatic generation and post-editing by native speakers (not necessarily professional lexicographers). The methodology was used before (Baisa et al., 2019); we have improved the process and used it in a new project aimed at creating a Ukrainian-English dictionary using a 3-billion-word Ukrainian corpus.

This paper covers all our work on this particular project. We describe building, cleaning and tagging the new multi-billion-word web corpus of Ukrainian. Then, we discuss the rapid dictionary creation method and our particular implementation, which differs from Baisa et al. (2019) especially in the data management approach. In the last part, we describe the resulting dictionary, which contains more than 55,000 verified headwords; due to time and budget constraints, we were able to fully complete only about 10,000 entries, so there is still considerable room for improvement.

2. New Ukrainian Web Corpus

We were able to identify three Ukrainian corpora the new dictionary could be based on:

• General Regionally Annotated Corpus of Ukrainian (GRAC) (Shvedova, 2020; Starko, 2021)
• UberText Ukrainian corpus by Lang-uk, a web corpus of 665 million tokens
• ukTenTen14 web corpus from 2014, consisting of 2.73 billion tokens

Of these corpora, the first one is not available for download. The second one is a rather small, topic-specific corpus (mostly news). It is distributed in the form of shuffled sentences, which prevents the selection of headwords by document frequency. For our dictionary work, we took the third one, enlarged it and updated it into a new Ukrainian web corpus.
In this stage we followed the methodology of the TenTen corpora family (Jakubíček et al., 2013). The crawler (Suchomel & Pomikálek, 2012) was instructed to download from Ukrainian top-level domains and from generic top-level domains. A character-trigram-based model trained on a 200 kB sample of manually checked Ukrainian plaintext was used to stop crawling websites that did not contain text in Ukrainian. The crawl was initialized by nearly 6 million unique seed URLs:

• 194 manually identified news sites
• 94,000 websites from web directories
• 336,000 URLs of web pages found by the Bing search engine when searching for Ukrainian words
• 5,410,000 URLs found in ukTenTen14

Table 1: Number of documents by TLD in the final merged and cleaned data from 2014 and 2022

Table 2: Websites contributing the most tokens to the final merged and cleaned data from 2014 and 2022

Data obtained by the crawler were converted to UTF-8 with the help of the Chared tool (Pomikálek & Suchomel, 2011) and cleaned by jusText (Pomikálek, 2011). The result was merged with the old ukTenTen14 and with 1,040,000 articles from Ukrainian Wikipedia downloaded by the Wiki2corpus tool. Duplicate paragraphs were removed by Onion (Pomikálek, 2011) and manual cleaning was performed according to Suchomel & Kraus (2021). The final size of the merged Ukrainian corpus is 3,280 million tokens and 2,593 million words in 7.2 million documents, with 52% of the texts downloaded in 2014 and 48% downloaded in 2020. Sizes of the parts of the corpus coming from selected TLDs and websites are given in Table 1 and Table 2, respectively. As can be seen there, the most contributing sites are encyclopedias, technology sites, news sites and law-related sites. The distribution of genres and topics, assigned using the method described in Suchomel & Kraus (2022), can be found in Table 3 and Table 4, respectively.

Table 3: Subcorpus sizes by genre

Table 4: Subcorpus sizes by topic

The corpus was then tagged using RFTagger (Schmid & Laws, 2008) and lemmatized using the CST lemmatiser (Jongejan & Dalianis, 2009). The RFTagger model was trained on the Universal Dependencies corpus for Ukrainian and the Brown corpus of the Ukrainian language (Starko & Rysin, 2023). Training was also supplemented by an additional morphological database generated from the Ukrainian Brown dictionary (Starko & Rysin, 2020). The model for the CST lemmatiser was trained on the Ukrainian Brown dictionary using Affixtrain. As the last step, heuristic postprocessing of the tagging and lemmatization was applied, based on manual inspection of the corpus data.

3. Rapid Dictionary Development by Post-editing

The post-editing methodology we are building on assumes that all lexicographic content is automatically generated from an annotated corpus and step-by-step post-edited, re-informing the corpus to maximize the mutual completion between the data and the editors, thereby minimizing the editorial effort. Central to this process are two databases: the corpus and the dictionary draft, which get mutually updated. The entry components are generated separately according to their dependencies, as illustrated in Figure 1. After an entry component is generated and post-edited by human editors, the edited data are incorporated into the corpus annotation and used for generating further entry components.
For example, having word senses post-edited leads to the introduction of sense identifiers in the corpus, which in turn yields sense-based analysis for a distributional thesaurus or example sentences (which would not be very reliable otherwise).

Figure 1: A high-level workflow overview of the post-editing process

In the next sections we explain in detail how we developed a large-scale dictionary with a fraction of the human effort required in the standard setting, in which the lexicographers themselves interrogate the corpus. We show that the method can rely on existing (imperfect) NLP tools, but requires a radical change to the typical lexicographic workflow and a robust data management process between the corpus, the dictionary and the editors.

3.1 Training the Native Speakers

Annotators should be native speakers of the source language, but they are not expected to have any previous lexicographic training. For tasks that involve translation, written capacity in the target language (English) is required. English was also the prevailing language of instruction.

Good training helps annotators understand their tasks well and leads to high-quality output. Each step in the dictionary creation process needs to be clearly explained: the instructions must contain all relevant information, give illustrative examples, describe potential conflicts or marginal cases, and mention the recommended amount of time per entry in each particular task. Therefore, the training for each task consists of three parts:

1. e-learning describing the task in general, providing English examples, explaining the underlying linguistic concepts, and including test questions to verify that the annotator understood the essence of the task
2. half-day face-to-face training where we explain the whole task with real Ukrainian examples and language-specific issues
3. a manual of 2–3 pages with the necessary instructions

Most of the time, annotators work using the Lexonomy on-line dictionary building tool (Měchura, 2017; Jakubíček et al., 2018). We have developed a dedicated user interface (customized entry editor) in Lexonomy for each task.

3.2 Headwords

The annotator sees a list of headword candidates (i.e. combinations of lemma and part of speech) and their task is to assign a flag to each according to its perceived correctness. Flagging can be performed with the mouse, but using keyboard shortcuts is preferred. The available flags are given short English names and color codes. The key to attributing flags to headword candidates, reproduced here as Figure 2, is shown to the editor all the time. After familiarizing themselves with the concepts of lemma and part of speech and having learned about the specifics of handling them in Ukrainian and in the applied tagger, annotators train by using the key to flag headword candidates.

Figure 2: Key to attributing flags to headword candidates, color-coded and with keyboard shortcuts

In this project, a total of 119,615 headword candidates were evaluated, 87% of which received at least two annotations and 24% were annotated at least three times. Multiple annotations are taken to create a margin for detecting errors and conflicts of opinion. Eight annotators took part in the annotation effort, the work was split into 289 batches and in total 285,177 annotations were made. The most frequently assigned flag was "ok" (38.4%), followed by "not a lemma" (25.9%) and "wrong POS" (21.2%), then came "proper name" (5.1%) and "I don't know" (5.0%), followed by "non-standard (register or spelling)" (2.7%), and finally "not Ukrainian" (1.6%). The total time annotators spent on this task was 2,114 hours, i.e. one annotation took on average 27 seconds.
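A final decision per candidate is later derived from these multiple annotations (Section 4.2.1 notes that further annotations were collected until at least 50% agreement was reached). A minimal sketch of one possible resolution scheme; the flag labels and the handling of borderline cases are assumptions rather than the project's exact rules:

```python
from collections import Counter

def resolve_flag(annotations, threshold=0.5):
    """Return the majority flag for one headword candidate, or None.

    `annotations` is a list of flag labels (e.g. "ok", "not a lemma",
    "wrong POS") assigned by different annotators. None signals that
    agreement is below the threshold and another annotation should be
    collected; ties and other borderline cases would need extra rules.
    """
    flag, count = Counter(annotations).most_common(1)[0]
    return flag if count / len(annotations) >= threshold else None

print(resolve_flag(["ok", "ok"]))                        # -> ok
print(resolve_flag(["ok", "wrong POS", "not a lemma"]))  # -> None
```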
Speed varied greatly between annotators, ranging from 12 to 64 seconds per annotation, influenced by factors such as the annotator's self-confidence, computer skills (use of clicking vs. pressing keys), reliance on external resources, work habits or tiredness.

Out of the presented headword candidates, 49,131 (41%) were eventually accepted as correct headwords into the final dictionary. A major contributor of noise in the input data was inter-POS homonymy, produced by early versions of the tagger, before we managed to reduce it by integrating a larger morphological database. If only lemmas are counted, 66% of the candidates made it into the dictionary. The lempos-to-lemma ratio decreased from 1.45 among the headword candidates to only 1.02 among the accepted headwords. Low homonymy between parts of speech was expected, since it is a strong property of Slavic languages.

3.3 Headword Revision

In Headword Revision, annotators get the chance to review headword candidates that were rejected in the Headwords task but could be turned into correct headwords. For each such rejected headword, Lexonomy displays a form in the right-hand pane (see Figure 3), whose exact content varies depending on what is signalled to be an issue with the headword (e.g. not a lemma, wrong part of speech, non-standard spelling). For instance, if only the part of speech is believed to be wrong, then the lemma field is pre-filled and the annotator is asked to select a different part of speech from a dropdown. However, they still have the option to modify the lemma as well, at their own discretion. For cases of ambiguity, it is possible to enter multiple revisions for a headword. The annotator can also decide that the headword be ignored (without revision), or accepted as is (i.e. declared correct). Due to the decisive character of this task, it should be commissioned with priority to annotators with high proficiency in the language and good performance in the Headwords task.

Figure 3: Interface for the Headword Revision task

In this project, 54,503 headword candidates were sent for revision. Some of them eventually underwent revision more than once (in order to explore inter-annotator agreement), which resulted in 5,820 duplicate entries (though with possibly differing annotations). Four annotators contributed to this task, which was split into 66 batches. To make an annotation, the user clicks a radio button. If the headword is to be corrected, then they also enter the correct lemma, pick the correct part of speech and indicate whether it is a proper name.

In 94.9% of cases, a revision was resolved by providing an alternative headword. In 3.2%, annotators said that the displayed headword was in fact correct. The remaining 1.9% were cases of unrecognized words or words considered non-Ukrainian. In the typical situation when correct headwords were provided to replace an incorrect headword, in 91.6% of cases there was just one replacement headword, in 7.4% there were two and in 1.0% three or more (up to six). The total time annotators spent on this task was 722 hours, i.e. annotating one entry took on average 43 seconds. Speed fluctuated a lot across annotators in this task too, with the fastest person taking just 28 seconds per entry and the slowest one needing 77 seconds.

3.4 Word Forms
The Forms task is concerned with inflection. Ukrainian is an inflected language and we want to collect as many inflected forms of each headword (lemma) in the corpus as possible. Annotators are first trained to distinguish inflection from derivation. Then, in Lexonomy, their task is to tell apart correct and incorrect items in a list of possible inflected forms for each headword. A link to a concordance is available for cases of doubt, but in practice, most items are resolved swiftly. "Correct" is the default, so the annotator needs to act only in the case of incorrect forms. This task has an entry threshold only slightly higher than the Headwords task, it can be introduced quite early in the process and no later tasks depend on it, which makes it a universal fallback task for times of delay.

Figure 4: Interface for the Forms task

In this project, word forms were sought for 42,694 headwords, for which there was a total of 578,327 form candidates (i.e. an average of 13.5 form candidates per headword). Among the form candidates, nearly all (99.2%) appeared with a single headword only. This means that the task was not so much about checking the form-lemma relationship as about checking the correctness and acceptability of the form itself (the tagger used is permissive and accepts even some archaic and corrupted word forms). All seven annotators available at the time were assigned to this task. The work was divided into 43 batches.

The observed ratio of reported incorrect forms was 21.6%. In almost four out of five such cases (79.4%), the rejected form candidates started with a capital letter; and, for lemmas which start with a lowercase letter themselves, such word forms differing in letter case are highly unlikely in Ukrainian. In fact, 77.4% of all capitalized word forms ended up marked incorrect.

Annotators spent 1,269 hours checking the word forms, which means that they took on average 107 seconds per headword, or 8 seconds per word form. The fastest annotator needed only 2.5 seconds per word form (or 35 seconds per headword), while the slowest required 22 seconds per word form (or 328 seconds per headword). The explanation for these inter-annotator differences lies in the same factors as mentioned for the Headwords task.

We did not make any automatic judgments on the correctness of word forms, but we benefited from the large corpus to extract a rather satisfying list of them, both in terms of precision (we have shown above that the majority of the presented candidates were correct) and recall (although we did not attempt to quantify it, as we explicitly did not aim at acquiring a "full" word form list, whatever that would mean). The average number of unique word forms per lemma was 18.0 for verbs, 13.1 for adjectives, 9.4 for pronouns, 7.3 for nouns, 5.4 for numerals, and below 1.3 for other (uninflected) parts of speech. Depending on details of the processing pipeline used, orthographic or phonetically motivated variants of words may have been represented either as "inflected forms" or as separate headwords.
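Such per-headword lists of form candidates can be collected directly from the tagged corpus (see also 4.2.3). A minimal sketch, assuming a simple three-column vertical format (word form, lemma, tag) and a lemma-plus-one-letter-POS lempos convention, which is not necessarily the project's exact pipeline:

```python
from collections import defaultdict

def collect_form_candidates(vertical_path):
    """Group attested word forms under their lempos (lemma + POS).

    Assumes a vertical file with one token per line and tab-separated
    columns: word form, lemma, part-of-speech tag. Structural lines
    such as <s> or <doc ...> are skipped.
    """
    forms = defaultdict(set)
    with open(vertical_path, encoding="utf-8") as vert:
        for line in vert:
            if line.startswith("<"):          # structural markup
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 3:
                continue
            word, lemma, pos = cols[0], cols[1], cols[2]
            forms[f"{lemma}-{pos[0].lower()}"].add(word)
    return forms

# Each headword (lempos) then receives its set of attested forms,
# which the annotators confirm or reject in Lexonomy.
```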
3.5 Audio Recordings

Instead of relying on phonetic transcriptions to indicate pronunciation or on the traditional stress marks to indicate word stress, we make an audio recording of the headword's pronunciation by a native speaker and store it as a part of the dictionary entry. This is the only part of the entry creation process that is done fully manually, since we want to be in control of the quality of the result and automatic text-to-speech output could not be post-edited. However, apart from a few challenges such as preserving a steady loudness or maintaining a low noise level, it turned out to be also one of the simplest tasks. This is also the only step that does not use Lexonomy but specially developed audio-recording software, and the only step which necessitates the physical presence of the annotator in dedicated premises (a soundproof audio booth with high-quality recording hardware) during the whole working time.

In this project, we recorded audio pronunciation for all the 55,632 headwords in the final dictionary. Some of the headwords were recorded multiple times and, because the recording ran in parallel with the rest of the dictionary building, we also made recordings of some headwords which eventually did not make it into the final dictionary. In total, 57,800 audio files were created (i.e. a 3.9% overhead). The work was divided into 60 batches, 59 of which were assigned to the same annotator so that the same voice is used throughout the dictionary. Only the last batch (1.3% of headwords) was assigned to a different person because the original speaker was no longer available.

The recording station in the audio booth was controlled with a special small 6-key keyboard (the available keys were marked with pictograms meaning YES, NO, SKIP, DOWN, UP and QUIT, respectively). This was done to save desk space otherwise occupied by a regular keyboard, to concentrate all controls in a single location, to reduce the chance of typos, to limit the noise generated by keystrokes and to improve user comfort for the annotator. The processing of each headword consists of seeing it displayed on the screen, recording its pronunciation, then listening to the recording to check its quality, and possibly re-recording it if the quality is not sufficient.

It took the annotators 553 hours, or about 14 weeks (of 40 work hours each), to make the recordings. That means an average of 36 seconds per headword. This time, however, includes regular break time, because it is demanding, if not impossible, for an untrained person to stay concentrated in a small booth and keep speaking in a fresh voice of stable strength for the whole day. In fact, in most of the cases when a headword had to be recorded repeatedly, the reason was the software stepping in with an automatic low-volume alert.

3.6 Word Senses

Identification of word senses for each headword is an important step in the dictionary building process, because all subsequent tasks are performed at the sense level instead of the headword level, and are therefore dependent on the word-sense distinctions made here. After annotators learn that there is not a single perfect solution to the problem (Kilgarriff, 1997), reaching common ground with regard to the granularity of sense distinctions is attempted by means of joint practice and discussions of each other's proposed solutions. The annotators' starting material is effectively limited to automatically induced word sense data (read more in 4.2.5), represented in Lexonomy as collocations (each including a longest-commonest match (Kilgarriff et al., 2015)) grouped into clusters, each of which could be considered a word sense candidate.
Having reviewed these data, however, the annotator has the freedom to establish any number of senses of their choice, to distribute the collocations among them freely, to leave a collocation unassigned to any particular sense (by marking it either as "mixed sense" or "error"), and even to come up with a sense not linked to any of the collocations (the latter is allowed so that no important word senses are lost due to possible deficiencies of the word sense induction algorithm). Each sense is also given a disambiguating gloss (in the language of the dictionary), one or more English translations, and a mark saying whether it is offensive in meaning. The Word Senses task might be the most difficult task to train properly, and the quality of its outcome directly influences the quality of data in all upcoming tasks.

In this project, 10,098 post-edited word sense disambiguations were performed in this way, for a total of 10,016 distinct headwords. In each processed entry, there were on average 43 collocations, divided into 9 clusters. Four annotators were chosen and trained for this task and the work was divided into 55 batches. In terms of part of speech, 45.2% of the annotated headwords were nouns, 24.7% adjectives, 21.6% verbs and 5.8% adverbs; the remaining 2.7% were other parts of speech, for which word sense disambiguation is not always applicable.

Figure 5 shows an example of one cluster, with three collocations. Annotators assign collocations to senses by clicking numbered buttons. The available buttons multiply as soon as more senses are declared; declaring a new sense is done by providing a disambiguating gloss, English translation(s) and possibly switching a toggle to mark offensiveness. To reflect real-world conditions, English translations can be shared by multiple senses, again by means of numbered buttons, thus reducing the need for typing. And when a collocate is not self-explanatory, the annotator has the option to view a corresponding concordance in the corpus.

Figure 5: Part of the interface for the Word Senses task

For 60.1% of headwords, a single sense was identified; 18.5% were split into two senses; 10.5% into three senses; 5.0% into four senses; and the remaining 5.8% into five senses or more. The overall average number of senses identified for the processed headwords was 1.84. Among annotators, the average number of identified senses ranged from 1.38 to 2.31. The highest number of senses (not necessarily an ideal) was routinely found by an annotator who happened to have some formal education in the field of linguistics. The same annotator was also the only one who would, exceptionally, go to great lengths by establishing more than 10 senses for a headword.

Annotators spent a total of 1,203 hours on the Word Senses task, i.e. about 7 minutes per headword. Three of the annotators had very close averages (from 7.3 to 9.1 minutes per headword); only the fourth annotator (the one with linguistic education) differed substantially, with a much lower average of 4.5 minutes per headword.

Of the listed collocations, only 1.8% were declared incorrect or incomprehensible, and 1.5% could not be conclusively attributed to a particular sense (when there were more senses to choose from). The remaining collocations were either all attributed to a single sense, or distributed among two senses (in the average ratio of 79:21) or three senses (67:23:10).
Even with four senses, the least frequent one still corresponded to approximately three collocations, which indicates that even in highly competitive situations, all senses were solidly backed up by corpus data (in contrast with senses defined without any corpus evidence, which are disregarded in this computation).

Annotators entered a total of 26,715 English translations (usually single words, but sometimes multi-word expressions, and exceptionally even descriptions of concepts that lack a direct English translation), which means an average of 2.65 translations per headword. This is close to the average number of pre-generated machine translations of the headword into English, which was 2.45. Only a tiny fraction (25, i.e. 0.1%) of the identified senses have been marked as offensive, although the annotators were aware of this possibility and each of them used it at least once. We believe that more of the headwords could be used in an offensive or derogatory way and suspect that the annotators may have under-annotated them under the influence of the previous tasks, in which we had to repeatedly stress that offensive words are also to be included in the dictionary and that they should be treated like any other words.

3.7 Thesaurus

In the Thesaurus task, annotators are trained to evaluate thesaurus candidates (i.e. selected related headwords, read more in 4.2.6) for a given headword in a given sense (this subdivision into senses is maintained across the rest of the dictionary building). Each thesaurus item can be put into one of three categories: Synonym, Antonym and Similar word (i.e. not a synonym or antonym, but still somehow related). A fourth option, named Other, is the default choice and results in the candidate being discarded.

Figure 6: Interface for the Thesaurus task

In this project, two annotators were assigned to the Thesaurus task and, at the time of writing, they had processed a total of 10,377 entries (headwords in individual senses), divided into 12 batches. Each entry contained exactly 20 thesaurus candidates. Out of all thesaurus candidates, three quarters (75.5%) were discarded (marked as Other), while 15.0% were accepted as Similar, 8.2% were classified as Synonyms and 1.3% as Antonyms. In the training phase, we realized that one annotator had developed a preference towards marking many words as Similar, while the other preferred Synonym in these cases. During data inspection, we found that the Similar:Synonym ratio was 83:17 for the first annotator and 61:39 for the second one. However, we could not find solid grounds on which to convince either of them to change their preference. The percentage of identified antonyms was consistently low with both annotators.

Work on the Thesaurus took 364 hours, with one of the annotators being significantly slower (527 seconds per entry) than the other (74 seconds). On average, 4.9 thesaurus candidates were accepted for each headword. Since the candidates were scored by Sketch Engine and shown in that order, we would expect items higher in the list to have a higher chance of being accepted as thesaurus items. And indeed, the probability that a candidate item had been accepted was found to decrease with the item's rank: it was 48.9% for the first item in the list, 38.1% for the second one, 32.3% for the third one, 19.3% for position ten, and 15.6% (the minimum) for position fifteen. Positions 16–20 were exceptions, because they had been reserved for top-scored thesaurus candidates for the headword, regardless of sense.
These items had a higher chance of being accepted (21.2–27.3%), comparable to that of the (sense-specific) positions 5–9.

3.8 Usage Examples

Choosing a good, easy-to-understand, illustrative dictionary example for a headword (in one of its senses) is a challenging task. So although GDEX (Rychlý et al., 2008) is used to pre-select candidate sentences (read more in 4.2.7), annotators need to be well trained to choose the best one of the five pre-selected sentences and to edit it when necessary (shorten it or remove controversial information). In rare cases, annotators may even be forced to come up with an example sentence of their own (for this purpose, they have at hand a link to the first one hundred GDEX-scored collocation lines from the corpus as a source of inspiration), although writing example sentences anew is strongly discouraged for reasons of time expense and authenticity.

In the user interface, the annotator selects their preferred sentence by clicking on a button next to it. Clicking directly on the sentence activates a text input field in which its text can be modified as needed. After an example sentence is selected, it changes color from red to green and another text field opens below it, pre-filled with a machine translation of the original sentence into English. It is the annotator's responsibility to check and fix the English translation as needed and to make sure that the sentences in the two languages stay in sync.

Figure 7: Interface for the Examples task

Four annotators were trained in this activity and 20 batches had been finished at the time of writing this paper. In those batches, the annotators processed a total of 14,474 entries, each containing five pre-selected and pre-translated example sentences. The work took them 693 hours, which averages to 2.9 minutes per entry. The average time spent on an entry varied greatly across the annotators (0.8 minutes, 4.3 minutes, 6.0 minutes, 14.7 minutes). The differences are likely to have been caused by each annotator's differently strict criteria for a good example. Slower annotators edited their chosen examples more heavily, often fully rewriting them because they thought it necessary.

It seems that the position of the five offered sentences in the list (they were ordered by decreasing GDEX score) correctly reflected their quality, or at least that sentences closer to the top attracted the annotators' attention more and were more likely to be chosen when multiple comparably good candidates were present. The chance of the sentence in position one being chosen was 34.7%; position two 18.9%; position three 15.0%; position four 12.6%; position five 12.3%.

The average length of the chosen example in its original (from the corpus) and accepted (possibly modified) form was 63.1 and 56.5 characters (8.7 and 7.8 words), respectively, which suggests a welcome tendency of the annotators to produce shorter examples. The same tendency was found also with regard to the length of the sentence's English translation (a decrease from 67.6 to 60.8 characters; from 11.3 to 10.2 words). Evaluation of the Levenshtein distance (the minimum number of insertions, deletions, and substitutions) between the generated and post-edited Ukrainian sentences reveals that 67.3% of the 13,449 studied sentences did not need any modifications at all, and the average edit distance was 12.6 on the whole set (and 38.5 just on those sentences which needed modification). The pre-generated machine translations of the original Ukrainian sentences into English and their final forms (often updated both for language and for linguistic deficiencies in the Ukrainian originals) differed more, as expected, but not substantially: the edit distance was 15.9 (and 34.6 on just the modified sentences, which is even a decrease). Also, surprisingly, 54.1% of the machine-translated sentences were considered good enough by the annotators to be left intact. This seems to suggest that the machine translation is reliable and saves time during annotation. Indeed, in cases when the Ukrainian sentence was left unmodified, 76.0% of the machine translations were also not modified, and another 7.7% only required up to 5 edits (insertions, deletions or substitutions) to fix the English sentence. The average edit distance of the English sentences in these cases of unmodified Ukrainian sentences was 2.6 (or 10.6 just on the modified sentences).
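The edit distances reported above follow the standard Levenshtein definition; a minimal character-level sketch of how such numbers can be computed (any existing library implementation would do equally well):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[len(b)]

# A distance of 0 means the annotator left the generated sentence intact.
```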
3.9 Images

The Images task was not yet administered at the time of writing this paper, but we foresee using an interface similar to the one depicted in Baisa et al. (2019). Freely licensed images relevant to the headwords will be identified and a top list will be offered to annotators to choose from.

3.10 Final Review

Final Review is the last phase of the dictionary building process. In it, a complete dictionary entry is composed out of the collected components (see the entry structure in 5.1) and visualized for the first time. The annotator's task is to fix any typos and mistakes and to check the overall coherence of the entry. For instance, senses (however well defined) are perceived differently across annotators, who may produce translations, usage examples and images that are not fully compatible with each other. In Final Review, a skilled annotator has the last say and can modify or delete entry components to achieve coherence. Addition of information, however, is discouraged at this step. Final-reviewed entries have their definitive form (in terms of data management, not visualization), in which they will appear in the final dictionary.

4. Data Management

Baisa et al. (2019) reported on issues with data management. Although the paper itself is not very specific about this issue, we have learned from the authors that the issues were connected to the fact that the XML annotations from all the phases described above were exported from and imported into one centralized database. Once an annotation was imported into the database, it could not be easily changed and re-imported, because the entries could have been changed by subsequent imports. The approach would probably have worked fine if all the annotations and import/export processes had been perfect and consistent; however, that was not the case. Every inconsistency in annotation and every small bug in the automatic import/export procedures propagated and resulted in a considerable number of entries containing inconsistent information, which had to be corrected manually, generating delays and additional costs. Moreover, as new versions of the source corpus were produced (e.g. due to improvements of lemmatization and tagging), some parts of the data became inconsistent with the corpus.

Therefore, our approach to data management is different. We take the source corpus and the native speaker annotations as source data for a fully automated procedure that creates the respective dictionary parts, merges them into the complete dictionary and generates new data for annotation. The procedure is implemented as a Makefile, which makes it easy to define dependencies among the individual components; it is illustrated in Figure 8.
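A heavily simplified sketch of what such a Makefile can look like; all target, file and script names below are hypothetical and only illustrate the dependency-driven regeneration, not the project's actual build (recipe lines must be indented with a tab):

```make
# The final dictionary is re-merged whenever any partial dictionary changes.
dictionary.nvh: headwords.nvh forms.nvh audio.nvh senses.nvh thesaurus.nvh examples.nvh
	./merge_partials.py $^ > $@

# Partial dictionary of headwords, rebuilt from the post-edited annotations.
headwords.nvh: headword_annotations.xml
	./export_headwords.py $< > $@

# Corpus with corrected lemmas and POS, rebuilt whenever revisions change,
# so that corpus fixes propagate to everything derived from it.
revised_corpus.vert: corpus.vert revisions.nvh
	./apply_revisions.py revisions.nvh < corpus.vert > $@

# The sense partial dictionary depends on both the sense annotations
# and the revised corpus.
senses.nvh: sense_annotations.xml revised_corpus.vert
	./export_senses.py sense_annotations.xml > $@
```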
In case of any change (new annotations available, a new version of the source corpus, etc.), all the data are re-processed, new versions of the partial dictionaries and derived corpora are created, and a new version of the dictionary is automatically generated. Also, new data for annotation, if needed, are created.

This approach gives us the flexibility to fix any problem or inconsistency in the source data, or in the manual annotations, that previously passed unnoticed, and to re-generate the whole dictionary easily. The fully automated procedure therefore enforces consistency across all the pieces of data involved in the process. Also, it can be used instantly for a new corpus and a new language.

Figure 8: Illustration of the data management process

4.1 Formats

For the partial dictionaries (green rectangles in Figure 8) as well as for the resulting dictionary, we used the NVH (name-value hierarchy) format (Jakubíček et al., 2022), a text format easily readable both for humans and for simple automatic text processing tools, which is suitable for dictionary data and significantly less complex than XML. For the manual annotations (blue "documents"), XML was used as the internal format of the Lexonomy software where the annotators worked.

4.2 Generating the Dictionary

In this part we describe the automatic procedure in more detail. In Figure 8, every shape represents a target in a Makefile, and the arrows represent the dependencies among the particular targets. Typically, there is one Python script (or a few calls to standard UNIX tools) for each of the targets, which generates the target contents from its dependencies. For clarity, we have split the description into parts, but please bear in mind that all the content of this section is one fully automatic process that runs as a whole over the partial data, and can be repeated as many times as needed.

4.2.1 Headwords

At the very beginning, there is a source corpus, tagged and lemmatized automatically using available software tools. The first step of the procedure takes the word list of lemposes (lemmas with a one-letter part-of-speech suffix) from the corpus and generates annotation batches ("headword annotations" in Figure 8) for the N most frequent words (by document frequency). In this project, a total of 102,323 lemposes received 2 annotations from different annotators, and if these were not in agreement, further annotations were collected until there was at least 50% agreement. From the headword annotations, a partial dictionary is generated, containing, for each of the headwords, the lempos, its annotations, the final decision and the percentage of agreement.

4.2.2 Headword Revision

If the final decision about a headword was "not a lemma", "wrong POS" or "non-standard", a revision annotation was generated: a further step whose purpose was to fix mistakes of the automatic lemmatizer and tagger and to find the correct (or standard, respectively) lemmas and parts of speech of the words. Most of the items sent to revision were revised to a word that we already had in the dictionary, but we also obtained 6,177 words that we had not seen before; the dictionary would have missed them if this step had not been incorporated. The outputs of the revision annotations are merged into a partial dictionary which records the corrections.
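As a concrete illustration of the batch generation described in 4.2.1 above, a minimal sketch that takes a document-frequency-ranked lempos list and cuts it into annotation batches; the input format and the batch size are assumptions, not the project's actual parameters:

```python
def make_batches(lempos_docfreq, n_top, batch_size=500):
    """Cut the N most frequent lemposes into annotation batches.

    `lempos_docfreq` maps a lempos (lemma with a one-letter POS
    suffix) to its document frequency in the corpus; candidates are
    ranked by that frequency and each batch later becomes one
    Lexonomy annotation job.
    """
    ranked = sorted(lempos_docfreq, key=lempos_docfreq.get, reverse=True)[:n_top]
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]

# Toy example (real input would be the corpus-wide lempos word list):
print(make_batches({"і-c": 150000, "бути-v": 120000, "зуб-n": 800}, n_top=3, batch_size=2))
# -> [['і-c', 'бути-v'], ['зуб-n']]
```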
The partial dictionary of revisions is then used in two ways:

• Using the recorded revisions and the original corpus, we create a revised corpus that contains correct lemmas and parts of speech and is used as a basis for further processing, namely word senses. (If we did not take this step, the word sense model would not contain the 6,177 new words at all, and the data for other words would be incomplete.)
• We merge it with the partial dictionary of headwords to create a final headword list for the dictionary, together with frequencies and frequency ranks generated from the revised corpus.

The next phases do not add more words to the dictionary; they just add more information to the words that are already present.

4.2.3 Word Forms

For each of the valid words in the final headword list, we generate a list of word forms present in the revised corpus and put them into the word form annotation batches. The annotators mark them as correct or wrong, and the correct word forms are then exported into a partial dictionary of word forms which is later merged into the final dictionary.

4.2.4 Audio Recordings

Audio batches for recording are generated for all the valid words in the final headword list. After recording, the audio files are kept separately, and the metadata containing the location of each particular audio file are compiled into a partial dictionary which is then merged into the final dictionary.

4.2.5 Word Senses

From the revised corpus, we generate an automatic model of word senses for all words in the corpus. At first, we used the traditional collocation-based approach described in Herman et al. (2019), but the result would frequently miss high-frequency senses. The overall quality of the result was not sufficient and significant post-editing effort was necessary to extract useful information. For this reason, we switched to a word sense induction model based on Bartunov et al. (2016), which represents the senses of a word as word embeddings. Then, we map the senses from the model onto (some of) the collocations from the word sketch, clustering the collocations. Each cluster of collocations is then considered a candidate sense.

From these clusters of collocations, we generate sense annotation batches and ask the annotators to name, fix and translate the automatically identified senses, as discussed in 3.6. These annotations are then processed into another partial dictionary that records the division of each word into senses, the collocations assigned to the particular senses, and the names and translations of the senses.

Apart from being an input for the final dictionary, this partial dictionary is used to generate a sense-tagged corpus from the revised corpus, where the basic unit of analysis is no longer a lemma (lempos) but a sense. In our particular implementation, a sense is a lempos concatenated with the sense name; the exact string is not important, as long as different senses of the same lempos are kept apart. The important point is that we can now work with separate senses instead of lemmas (lemposes): namely, compile word sketches and a thesaurus for senses, so that the word sketch and thesaurus for one sense of a lemma are different from the word sketch and thesaurus for its other senses.

4.2.6 Thesaurus

For each sense recorded in the sense partial dictionary, a list of similar words (and similar senses) is pulled from the sense-tagged corpus using Sketch Engine's thesaurus function (Rychlý & Kilgarriff, 2007) and put into thesaurus annotation batches. Because not all of the occurrences are clustered into senses, we merge the thesaurus for the sense with the thesaurus of the (more general) lemma to obtain higher-quality data. The results of the annotation are again compiled into a partial dictionary.
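To make the sense-based corpus described in 4.2.5 more concrete, here is a minimal sketch of rewriting a vertical file so that the unit of analysis becomes the lempos concatenated with a sense identifier; the column layout, the separator and the way senses are assigned are assumptions, not the project's actual implementation:

```python
def sense_tag_vertical(in_path, out_path, sense_of):
    """Rewrite the lempos column of a vertical file to lempos+sense.

    `sense_of` is a callable that, given a lempos and the token's
    columns, returns a sense identifier or None when the occurrence
    cannot be assigned to any sense; unassigned tokens keep the
    plain lempos.
    """
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith("<"):            # structural markup (<s>, <doc>, ...)
                dst.write(line)
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                dst.write(line)
                continue
            lempos = cols[1]
            sense = sense_of(lempos, cols)
            if sense is not None:
                cols[1] = f"{lempos}__{sense}"   # e.g. a hypothetical "зуб-n__1"
            dst.write("\t".join(cols) + "\n")
```

Word sketches and the distributional thesaurus computed over the rewritten corpus then distinguish between senses of the same lemma, which is what the thesaurus and usage example steps rely on.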
4.2.7 Usage Examples

For each sense recorded in the sense partial dictionary, we generate a set of the 5 best candidate example sentences from the corpus with the GDEX tool (Rychlý et al., 2008). For this purpose, a new Ukrainian-specific GDEX configuration was created. The candidate sentences are then automatically translated into English by the DeepL API, and annotation batches are created from the extracted sentences and their automatic translations. The annotators are then asked to read all the sentences, select the best example, edit it (but only if needed) and check and edit (again, only if needed) its automatic translation into English. The annotations are then processed into a partial dictionary which is then merged into the final dictionary.

4.2.8 Images

The images phase of the project is not yet finished at the time of writing this paper, but we intend to implement it within the same framework as the previous phases: automatically search for copyright-free images in several databases based on the English translations of each sense, let the annotators select the best image out of 10, and record the selections in a partial dictionary.

5. About the Dictionary

So far, we have discussed the process of compiling the Ukrainian dictionary. This section summarizes some basic information about the resulting dictionary itself.

5.1 Entry structure

The entry structure of the dictionary may be clear from the description of the methods above; however, the following description shows it explicitly:

• Headword (lemma + part of speech) is the basic identification of every entry. It is also the primary key of the dictionary in the database sense: we do not allow multiple entries with the same lemma and part of speech.
• Flag specifies the type of the entry: in the final dictionary, we have only "ok", "proper name" and "non-standard", but we also keep all the rejected words with the other flags in a separate database. "Proper name" and "non-standard" entries do not contain senses, and a "non-standard" entry also contains a link to the standard form of the headword.
• Frequency of the word, retrieved automatically from the document frequency in the corpus, i.e. the number of documents the headword occurred in.
• Rank of the headword according to the frequency (computed automatically from the frequency).
• Pronunciation, or more precisely the location of the audio recording of the pronunciation, the output of the audio recording phase.
• List of word forms, the output of the word forms post-editing phase.
• List of senses identified in the sense annotation phase. Only words marked "ok" have senses and translations. Every sense then contains:
  – Sense identifier or disambiguator, which tells the senses apart and may explain them to an extent (but it is neither a definition nor a full explanation of the sense). It may be empty if the word is considered monosemous (has only a single sense recorded in the dictionary).
  – One or more translations into English, as recorded in the sense annotation phase.
  – Collocations sorted by grammatical relations, as recorded in the sense annotation phase. Each collocation also contains a short example (typically 3–5 words) automatically extracted from the corpus.
  – List of synonyms, antonyms and similar words, as identified in the thesaurus annotation phase.
  – One usage example and its translation into English, both results of the example annotation phase.
  – Image, if appropriate, selected in the image selection step (not implemented yet). Every image consists of its location, source and license.

The structure is rather shallow, but we believe it contains the most important elements for a decent dictionary entry.
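For illustration, such an entry could be serialized in the NVH format roughly as follows; the element names and all values are invented for this example (loosely modelled on the entry for зуб 'tooth' shown in Figure 9) and do not necessarily match the project's actual schema:

```
headword: зуб
  pos: noun
  flag: ok
  frequency: 41253
  rank: 1870
  audio: audio/zub-n.mp3
  form: зуба
  form: зуби
  form: зубів
  sense: частина щелепи
    translation: tooth
    collocation: гострий зуб
      relation: modifier
    similar: ікло
    example: Лікар оглянув хворий зуб.
      translation: The doctor examined the aching tooth.
```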
Also, the modular nature of the process makes it possible to add further steps easily, such as definitions/explanations or translations into more languages. In this dictionary, we did not take multi-word expressions into account, but there are already tools available to identify multi-word expressions from corpus n-grams and collocations that would make it relatively easy to enrich the dictionary in this direction.

5.2 Basic statistics

For organizational and budget reasons, we did not complete all the entries all the way through; some of them are "more complete" than others. A relatively long list of valid frequent headwords and word forms is a valuable multi-purpose resource, so we aimed at having a really long list of headwords first, and then continued with the other phases step by step, always starting with the most frequent headwords.

At the time of writing this paper, the project is still not finished; mainly for budget reasons we have slowed it down, and the work now continues with only one remaining Ukrainian editor. This means the numbers below are not final, but they reflect the state after less than one year of intensive work, during which 6,918 hours of manual post-editing work (or approximately 3.5 full-time person-years) were consumed. See Figure 10 for a breakdown by task.

Figure 9: Dictionary entry example for the word зуб (tooth). There are 7 senses of the word in total; here we show only the first three of them.

Figure 10: Workload by task (100% = 6,918 hours): Headwords 30.6%, Revision 10.4%, Forms 18.3%, Audio 8.0%, Senses 17.4%, Thesaurus 5.3%, Examples 10.0%

Overall, our database contains 123,574 annotated headword candidates (i.e. all the headwords from the corpus seen by at least one annotator). This figure includes the revised headwords that were originally not present in the corpus; without them, it is 117,397. Of these, 14,141 were only seen by one annotator (better than nothing, but not reliable enough for the dictionary), which leaves us with 109,433 headword candidates with reliable annotation. Of these, 55,632 ended up with flags suitable for the final dictionary, namely:

• 46,987 common words (marked "ok")
• 8,252 proper names
• 393 non-standard words

So we can say that the size of our dictionary is 55,632 entries. All of these entries contain an audio recording of the pronunciation, as well as a frequency and rank derived automatically from the corpus. 42,639 of these headwords contain a list of their word forms, amounting to 453,010 validated word forms in total.

The size of the dictionary in terms of complete entries, i.e. entries with verified senses, translations, thesaurus and usage example, is 9,785. (We still plan to add images in the near future.) Of these, 3,901 entries are polysemous and 5,884 are monosemous; 1,057 words have more than three senses. The total number of senses in the dictionary is 17,973.

In all the process phases, we always proceeded according to document frequency. In other words, we went through the 109,433 most frequent words in the corpus, the dictionary contains the 55,632 most frequent Ukrainian words (according to the corpus), and we have complete entries with senses for the 9,785 most frequent words.

6. Conclusions

We have reported on the rapid corpus-based development of a new Ukrainian-English dictionary using a new process of automatic generation and step-by-step post-editing of the dictionary.
We described building the source corpus, then went through all phases of the process in detail and explained our approach to dictionary data management during the process. The resulting dictionary contains ca. 10,000 finished entries; another 45,000 entries for less frequent headwords are partly finished. Overall, the process consumed less than 7,000 hours of paid editors' time, which is a fraction of both the time and the money needed to build a similar dictionary in the traditional way with professional lexicographers.

In the future, we will continue working on the dictionary (2,000–5,000 more finished entries, adding images), and since we made the workflow setup really easy within this project, we are looking forward to running similar projects with new languages soon.

7. Acknowledgements

We cordially thank the Institute for Ukrainian (https://mova.institute) for permission to use their manually annotated corpus available through the Universal Dependencies project (https://github.com/UniversalDependencies/UD_Ukrainian-IU). We cordially thank Andriy Rysin, Vasyl Starko and the BrUK team for permission to use the Ukrainian morphological database they developed and made available at https://github.com/brown-uk/dict_uk. This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-CLARIAH-CZ project LM2023062. This publication was written with the support of the Specific University Research provided by the Ministry of Education, Youth and Sports of the Czech Republic.

8. References

Baisa, V., Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., Medveď, M., Měchura, M., Rychlý, P. & Suchomel, V. (2019). Automating dictionary production: a Tagalog-English-Korean dictionary from scratch. In Electronic lexicography in the 21st century: Proceedings of the eLex 2019 conference. Brno, Czech Republic: Lexical Computing CZ s.r.o., pp. 805–818. URL https://elex.link/elex2019/wp-content/uploads/2019/10/eLex-2019_Proceedings.pdf.

Bartunov, S., Kondrashkin, D., Osokin, A. & Vetrov, D. (2016). Breaking sticks and ambiguities with adaptive skip-gram. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, pp. 130–138.

Gantar, P., Kosem, I. & Krek, S. (2016). Discovering Automated Lexicography: The Case of the Slovene Lexical Database. International Journal of Lexicography, 29(2), pp. 200–225. URL https://doi.org/10.1093/ijl/ecw014.

Herman, O., Jakubíček, M., Rychlý, P. & Kovář, V. (2019). Word Sense Induction Using Word Sketches. Springer, pp. 83–91.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P. & Suchomel, V. (2013). The TenTen Corpus Family. In Proceedings of the 7th International Corpus Linguistics Conference (CL 2013). Lancaster, pp. 125–127. URL http://ucrel.lancs.ac.uk/cl2013/.

Jakubíček, M., Kovář, V., Měchura, M. & Rambousek, A. (2022). Using NVH as a Backbone Format in the Lexonomy Dictionary Editor. In A. Horák, P. Rychlý & A. Rambousek (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing (RASLAN 2022). Brno: Tribun EU, pp. 55–61. URL https://raslan2022.nlp-consulting.net/.

Jakubíček, M., Měchura, M., Kovář, V. & Rychlý, P. (2018). Practical Post-editing Lexicography with Lexonomy and Sketch Engine. p. 65.

Jongejan, B. & Dalianis, H. (2009). Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of ACL-IJCNLP 2009. Suntec, Singapore: Association for Computational Linguistics, pp. 145–153. URL https://aclanthology.org/P09-1017.

Kallas, J., Kilgarriff, A., Koppel, K., Kudritski, E., Langemets, M., Michelfeit, J., Tuulik, M. & Viks, Ü. (2015). Automatic generation of the Estonian Collocations Dictionary database. In Electronic lexicography in the 21st century: Proceedings of the eLex 2015 conference. Trojina, Institute for Applied Slovene Studies, pp. 1–20.

Kilgarriff, A. (1997). I don't believe in word senses. Computers and the Humanities, 31, pp. 91–113.
Kilgarriff, A., Baisa, V., Rychlý, P. & Jakubíček, M. (2015). Longest-commonest Match. In Electronic lexicography in the 21st century: Proceedings of the eLex 2015 conference. Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., pp. 11–13.

Kosem, I., Koppel, K., Zingano Kuhn, T., Michelfeit, J. & Tiberius, C. (2018). Identification and automatic extraction of good dictionary examples: the case(s) of GDEX. International Journal of Lexicography, 32(2), pp. 119–137. URL https://doi.org/10.1093/ijl/ecy014.

Měchura, M.B. (2017). Introducing Lexonomy: an open-source dictionary writing and publishing system. In Electronic lexicography in the 21st century: Proceedings of the eLex 2017 conference. pp. 19–21.

Pomikálek, J. (2011). Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. thesis, Masaryk University.

Pomikálek, J. & Suchomel, V. (2011). chared: Character Encoding Detection with a Known Language. pp. 125–129.

Rundell, M. & Kilgarriff, A. (2011). Automating the creation of dictionaries: where will it all end? pp. 257–282.

Rychlý, P., Husák, M., Kilgarriff, A., Rundell, M. & McAdam, K. (2008). GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX International Congress. Barcelona: Institut Universitari de Lingüística Aplicada, pp. 425–432.

Rychlý, P. & Kilgarriff, A. (2007). An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In Proceedings of the 45th Annual Meeting of the ACL: Demo and Poster Sessions. pp. 41–44.

Schmid, H. & Laws, F. (2008). Estimation of Conditional Probabilities With Decision Trees and an Application to Fine-Grained POS Tagging. In Proceedings of COLING 2008. Manchester, UK: Coling 2008 Organizing Committee, pp. 777–784. URL https://aclanthology.org/C08-1098.

Shvedova, M. (2020). The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality. In V. Lytvyn, V. Vysotska, T. Hamon, N. Grabar, N. Sharonova, O. Cherednichenko & O. Kanishcheva (eds.) Proceedings of the International Conference on Computational Linguistics and Intelligent Systems (COLINS 2020), volume 2604 of CEUR Workshop Proceedings. CEUR-WS.org, pp. 489–506. URL https://ceur-ws.org/Vol-2604/paper36.pdf.

Starko, V. (2021). Implementing Semantic Annotation in a Ukrainian Corpus. In N. Sharonova et al. (eds.) Proceedings of the International Conference on Computational Linguistics and Intelligent Systems (COLINS 2021), volume 2870 of CEUR Workshop Proceedings. CEUR-WS.org, pp. 435–447. URL https://ceur-ws.org/Vol-2870/paper32.pdf.

Starko, V. & Rysin, A. (2020). Velykyi elektronnyi slovnyk ukrainskoi movy (VESUM) yak zasib NLP dlia ukrainskoi movy [The large electronic dictionary of Ukrainian (VESUM) as an NLP tool for Ukrainian]. Seriya "Ne vse splyva rikoyu chasu...". Vydavnychyj dim Dmytra Buraho. URL https://www.researchgate.net/profile/Vasyl-Starko/publication/344842033_Velikij_elektronnij_slovnik_ukrainskoi_movi_VESUM_ak_zasib_NLP_dla_ukrainskoi_movi_Galaktika_Slova_Galini_Makarivni_Gnatuk/links/5fa110cd458515b7cfb5cc97/Velikij-elektronnij-slovnik-ukrainskoi-movi-VESUM-ak-zasib-NLP-dla-ukrainskoi-movi-Galaktika-Slova-Galini-Makarivni-Gnatuk.pdf.

Starko, V. & Rysin, A. (2023). Creating a POS Gold Standard Corpus of Modern Ukrainian. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP). Dubrovnik, Croatia: Association for Computational Linguistics, pp. 91–95. URL https://aclanthology.org/2023.unlp-1.11.

Suchomel, V. & Kraus, J. (2021). Website Properties in Relation to the Quality of Text Extracted for Web Corpora. In Proceedings of Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). pp. 167–175. URL https://nlp.fi.muni.cz/raslan/2021/paper19.pdf.

Suchomel, V. & Kraus, J. (2022). Semi-Manual Annotation of Topics and Genres in Web Corpora, The Cheap and Fast Way. In Proceedings of Recent Advances in Slavonic Natural Language Processing (RASLAN). pp. 141–148. URL https://nlp.fi.muni.cz/raslan/2021/paper22.pdf.

Suchomel, V. & Pomikálek, J. (2012). Efficient Web Crawling for Large Text Corpora. In A. Kilgarriff & S. Sharoff (eds.) Proceedings of the 7th Web as Corpus Workshop (WAC7). Lyon, pp. 39–43. URL http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf.