DeriNet 2.0: Towards an All-in-One Word-Formation Resource
Jonáš Vidra, Zdeněk Žabokrtský, Magda Ševčíková, Lukáš Kyjánek
Charles University, Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics
Malostranské náměstí 25, 118 00 Prague 1, Czech Republic
{vidra,zabokrtsky,sevcikova,kyjanek}@ufal.mff.cuni.cz
Abstract
DeriNet is a large linguistic resource containing over 1 million lexemes of Czech connected by
almost 810 thousand links that correspond to derivational relations. The previous version,
DeriNet 1.7, contained only very sparse annotations of features other than derivations: it listed
the lemma and part-of-speech category of each lexeme and, since version 1.5, a true/false flag
marking lexemes created by compounding.
The paper presents an extended version of this network, labelled DeriNet 2.0, which adds a number
of features, namely annotation of morphological categories (aspect, gender and animacy) with all
lexemes in the database, identiﬁcation of root morphemes in 250 thousand lexemes, annotation
of ﬁve semantic labels (diminutive, possessive, female, iterative, and aspect) with 150 thousand
derivational relations, a pilot annotation of parents of compounds, and another pilot annotation
of so-called fictitious lexemes, which connect related derivational families without a common
synchronic parent. The new pieces of annotation could be added thanks to a new file format
for storing the network, which aims to be general and extensible, and therefore possibly usable by
other similar projects.
1 Motivation
The paper deals with extending DeriNet, a lexical database developed for Czech, which contains around
1 million lexemes connected with approx. 810 thousand edges representing morphological derivations
(Ševčíková and Žabokrtský, 2014), forming approx. 220 thousand tree-shaped derivational families. The
resulting version is labelled DeriNet 2.0 (Vidra et al., 2019) and it is available for download under a free
non-commercial license. The extension is mostly qualitative: we extended the expressive power of the
underlying data structure (and of the associated ﬁle format) substantially and thus enabled capturing
language phenomena which were impossible to handle in the previous versions of DeriNet. More
speciﬁcally, there are ﬁve newly supported annotation components in the DeriNet annotation scheme:
• morphological categories: lexemes are assigned morphological categories that remain constant
under inﬂection, such as gender with nouns or aspect with verbs,
• morpheme segmentation: lexemes belonging to the largest derivational families have their root
morphemes identiﬁed,
• semantic labels: derivational relations are assigned labels capturing the change that the meaning of
the base word undergoes by attaching the aﬃx (in aﬃxation),
• compounds: lexemes with two (or more) roots are linked with both (or all) of their base words.
The linking of compounds with their base words has not been possible so far due to the highly
constrained data structure used in DeriNet 1.7 and older versions,
Proceedings of the 2nd Int. Workshop on Resources and Tools for Derivational Morphology (DeriMo 2019), pages 81–89,
Prague, Czechia, 19-20 September 2019.
• ﬁctitious lexemes: lexemes that are attested neither in the corpora nor in the dictionaries but, based
on structural analogies, ﬁll a paradigm gap in the derivational family are newly added into the
database.
Feature                     1.7      2.0
Derivational relations      yes      yes
Part-of-speech category     yes      yes
Morphological categories    no       yes
Compounding relations       no [a]   yes
Semantic labels             no       yes
Morpheme segmentation       no       yes [b]
Fictitious lexemes          no       yes
[a] A yes/no flag marking compounds was encoded in the POS category.
[b] In the present version, only root morphs of a subset of lexemes are annotated.
The format allows for marking affixes and allomorph resolution as well, but these
annotations are not currently available.
Table 1: Comparison of features available in DeriNet 1.7 and 2.0.
The annotations present in DeriNet 2.0 are compared to the previous versions in Table 1.
The actual recall of the newly added annotations is rather limited, but even the incomplete annotations
serve as a proof of concept and show the viability of the new annotation scheme. However, the main
ambition of our efforts does not lie in adding several new annotation components; it is more strategic:
in the long term we attempt to accumulate virtually all information related to word-formation in a single
data resource (similarly to various kinds of syntactic and semantic phenomena being annotated), and thus
hopefully profit from new synergies due to combining different possible perspectives on word-formation.
Some of the features are already available in existing data resources, so from this viewpoint DeriNet 2.0
is rather eclectic. For instance, detailed information on morphological categories of lexemes is captured in
MorfFlex CZ (Hajič and Hlaváčová, 2013), morpheme segmentation is available in the MorphoChallenge
dataset (Kurimo et al., 2009), semantic labels of derivations can be found in Démonette (Hathout and
Namer, 2014), compounds are identiﬁed in CELEX (Baayen et al., 1995), and ﬁctitious lexemes are
introduced in Word Formation Latin (Litta Modignani Picozzi et al., 2016). However, none of these
resources, to the best of our knowledge, integrate all the features in one data set.
In addition, we believe that the extended annotation scheme is ﬂexible enough to be sustainable for
a longer period of time without major changes. At the same time, we plan to apply the scheme to dozens
of other languages, so the scheme is designed to be as language agnostic as possible.
2 New features
2.1 Morphological categories
Lexemes were provided with selected morphological categories in DeriNet 2.0, namely with the category
of gender and animacy (with nouns) and the category of grammatical aspect (with verbs), in addition to
the part-of-speech category already available in the previous versions of the data. These categories do
not change in inﬂection, and are characteristics associated with lexemes as wholes.
The morphological categories to assign were extracted from the MorfFlex CZ dictionary (Hajič and
Hlaváčová, 2013), which enumerates all possible word forms and positional part-of-speech tags for each
lexeme. The set of part-of-speech tags of a particular lexeme was merged into a single string, tentatively
called a tag mask, by comparing individual positions of the diﬀerent tags. If all tags of the lexeme share
the same value at a position, it is copied to the tag mask, otherwise it is replaced by the question mark
(“?”). For example, exploiting the part-of-speech tags assigned to the individual forms of the noun chata
‘cottage’ (15 tags in total, including e.g. “NNFS1-----A----”, “NNFP3-----A----” or “NNFP7-----A---6”),
the tag mask “NNF??-----A---?” was compiled, which encodes that the lexeme is a noun (NN), feminine
Features compared                         Unique combinations   Lexemes
Lemma                                     2,599                 5,342
Lemma + POS category                      2,137                 4,353
Lemma + POS category + morph. features    518                   1,039
Table 2: Counts of homonymous combinations of various lexeme features with the counts of aﬀected
lexemes. By deﬁnition, the number of lexemes must be at least twice the number of homonymous
combinations, since a feature combination that is not shared by at least two lexemes is not a homonym.
The number of lexemes is slightly larger, because some lemmas are shared by up to four lexemes: e.g. stát,
which can mean either ‘a country’, ‘to stand’, ‘to stop’ or ‘to melt down’.
gender (F), aﬃrmative polarity (A). The categories associated with the other positions either vary (cf. the
question marks in the positions associated with the categories of number, case, and register), or are not
applicable to Czech nouns (such as tense, cf. the positions with ‘-’).
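The merging step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build DeriNet; the function name `tag_mask` is our own, and only three of the 15 tags of chata are used, which happen to be enough to produce the mask quoted in the text.

```python
def tag_mask(tags):
    """Merge equally long positional tags into a single tag mask."""
    tags = list(tags)
    if not tags:
        raise ValueError("at least one tag is required")
    length = len(tags[0])
    if any(len(tag) != length for tag in tags):
        raise ValueError("all positional tags must have the same length")
    # zip(*tags) walks over the tags position by position; a position
    # keeps its value only if every tag agrees on it.
    return "".join(
        column[0] if len(set(column)) == 1 else "?"
        for column in zip(*tags)
    )


chata_tags = ["NNFS1-----A----", "NNFP3-----A----", "NNFP7-----A---6"]
print(tag_mask(chata_tags))  # NNF??-----A---?
```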
In addition to the tag mask format, the morphological categories listed above were extracted from the
masks and stored in DeriNet 2.0 using the Universal Features annotation scheme (Nivre et al., 2016).
This approach to extracting the morphological categories has a very high precision: we were unable to
find any errors in the grammatical category of gender in a uniformly random sample of 100
nouns, and we found two errors in the category of aspect in a sample of 100 verbs.
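As an illustration of the second step, the sketch below reads gender and animacy off the third position of a noun's tag mask. The mapping for F, N and M mirrors the masks and feature values visible in the DeriNet 2.0 excerpt (Table 5); the row for I (masculine inanimate) and the names `GENDER_CODES` and `noun_features` are our assumptions. Aspect is left out because the text does not say which mask position encodes it.

```python
# Gender codes found at the third position of a noun's tag mask.
# F, N and M follow the excerpt in Table 5; the I row is an assumption.
GENDER_CODES = {
    "F": {"Gender": "Fem"},
    "N": {"Gender": "Neut"},
    "M": {"Gender": "Masc", "Animacy": "Anim"},
    "I": {"Gender": "Masc", "Animacy": "Inan"},
}


def noun_features(mask):
    """Return gender/animacy features of a noun mask, or {} if unavailable."""
    if not mask.startswith("N"):
        return {}
    # A "?" at this position means the forms disagree, so nothing is assigned.
    return dict(GENDER_CODES.get(mask[2:3], {}))


print(noun_features("NNM??-----A---?"))
```

A mask such as "NN???-----A---?" yields no features, which corresponds to the nouns with missing gender annotation discussed in the text.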
The recall of the annotation is also high, with 99.6% nouns being assigned a gender category and
93.2% of verbs being assigned an aspect category. The nouns with missing gender annotation are mostly
foreign words with unclear or varying gender (such as image ‘image’, which can be masculine inanimate,
feminine or neuter, depending on the speaker’s preference) and words which can be used to denote both
male and female persons (such as šereda ‘ugly (person), gorgon’). The dictionary we use as the lexeme
source, MorfFlex CZ, usually handles these cases by having a separate lexeme for each gender (such that
all forms of any one lexeme have identical gender), but some lexemes have forms with diﬀerent genders,
resulting in missing gender annotation after extraction. Verbs with missing aspect annotation are mostly
missing the aspect category in the source dictionary, but some (about one in six) are marked as biaspectual
– we chose to exclude the annotation of these for the time being due to low precision of this part of the
annotation.
The morphological categories can in some cases also be used to distinguish homonymous lexemes.
Just as there are pairs of lexemes with identical lemmas, but diﬀerent part-of-speech categories, there
are also pairs of lexemes with identical lemmas and part-of-speech categories, but with a diﬀerent
aspect or gender. Using tag masks combined with lemmas, we are able to uniquely identify 3,314
out of 4,353 lexemes with homonymous lemma-POS combinations in DeriNet 2.0; see Table 2 for
detailed counts. Therefore, the tag masks serve as auto-generated readable identiﬁers (distinguishing
e.g. masculine inanimate mol#NNI??-----A---? ‘mole (unit)’ and masculine animate mol#NNM??-----A---?
‘mill moth’), as opposed to e.g. using opaque numerical indices (‘mol#1’ and ‘mol#2’) or manually
created descriptions (‘mol#grammolecule’ and ‘mol#butterﬂy’) to distinguish homonyms, which are the
methods used by the underlying MorfFlex CZ dictionary.
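The counting behind Table 2 can be sketched as follows, assuming each lexeme is available as a (lemma, POS, tag mask) triple. The toy data are illustrative only: the two masks of mol come from the text, while the masks given for růst are placeholders and not values taken from DeriNet.

```python
from collections import Counter


def homonym_counts(lexemes, key):
    """Return (shared combinations, affected lexemes) for a feature key."""
    groups = Counter(key(lexeme) for lexeme in lexemes)
    shared = [size for size in groups.values() if size >= 2]
    return len(shared), sum(shared)


# (lemma, POS, tag mask); the masks of "růst" are hypothetical placeholders.
lexemes = [
    ("mol", "N", "NNI??-----A---?"),
    ("mol", "N", "NNM??-----A---?"),
    ("růst", "N", "NNI??-----A---?"),
    ("růst", "V", "V??????????????"),
]

print(homonym_counts(lexemes, key=lambda lex: lex[0]))       # lemma only
print(homonym_counts(lexemes, key=lambda lex: lex[:2]))      # lemma + POS
print(homonym_counts(lexemes, key=lambda lex: lex))          # lemma + POS + mask
```

Each additional feature in the key can only split groups, so both counts shrink from row to row, as they do in Table 2.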
The homonymous lexemes may or may not be parts of the same derivational family. For instance,
the noun růst ‘growth’ and the verb růst ‘to grow’, distinguished by the part-of-speech category, are
related derivationally, the former one being converted from the latter one. Compared to that, the noun
tulení ‘hugging’ and the adjective tulení ‘seal’ are identical in spelling due to truly random coincidence;
they belong to diﬀerent derivational families (with the root lexemes tulit (se) ‘to hug’ and tuleň ‘seal’,
respectively).
Morphological categories captured by the tag masks have been exploited also within the semantic
labelling task (Section 2.3).
[téc]i V (to ﬂow)
s[téc]i V (to ﬂow together)
[tek]oucí A (ﬂowing)
[tok] N (a ﬂow)
u[téc]i V (to run away)
sou[tok] N (a conﬂuence)
s[ték]at V (to be ﬂowing together)
u[tík]at V (to be running away)
Figure 1: An excerpt from the derivation family of téci ‘to ﬂow’ in DeriNet 2.0, with root morphemes
marked by square brackets. Other morphemes are not delimited yet.
2.2 Morpheme segmentation and allomorphy
In DeriNet 2.0, root morphemes of selected lexemes were identified as another new type of annotation.
This annotation is currently limited to approx. 250 thousand lexemes and is intended as a pilot
for large-coverage morpheme segmentation in future versions of the data. See Figure 1
for a small sample of the annotation.
Morpheme segmentation, i.e. the task of dividing a word into a sequence of segments corresponding to
morphemes as the smallest meaning-bearing language units, is extremely challenging when dealing with
Czech. The main reason is the frequent allomorphy of roots and aﬃxes. For instance, in the lexemes that
are derivationally related with the verb jíst ‘to eat’ in our data, eight root allomorphs are attested (jís, jíd,
jed, níd, nís, něd, jez, and něz). Notice that there is not a single grapheme shared by all of the allomorphs.
In our first experiment, which aimed at identifying all morphemes in the lexeme structure, we
implemented a decision-tree-based lemma segmenter that employed letter n-gram features and was trained
on a set of 750 hand-segmented lexemes sampled uniformly at random from DeriNet. However, the
evaluation on an independent dataset showed that the precision of the predicted segmentations (95% of
identified morphs were correct, resulting in only 85% of words being segmented correctly) is below the
quality standards usually applied to released versions of DeriNet.¹
In our second experiment we thus limited the problem to identiﬁcation of root morphemes and made
more intensive use of existing derivational trees. For the 760 biggest trees (in terms of number of nodes),
we applied the previously trained segmenter on all lexemes in these trees and tried to distinguish the
substring corresponding to the root morpheme in each lexeme using a simple heuristic: for each word,
mark its rarest morpheme (measured by the number of occurrences in the whole dataset) as the root;
break ties by marking the longer or ﬁrst such morpheme. We obtained a set of allomorphs of the root
morpheme for each tree. The quality of such allomorph sets was relatively low, so the sets were cleaned
manually. Then we identiﬁed the position of a root allomorph in each lexeme. In case there were
multiple matching allomorphs, we preferred the longest one. This process was iterated several times, as
applying the allomorph sets to the whole derivational trees uncovered several errors in the annotation
of derivational relations. Finally, we added such detected root morpheme boundaries into DeriNet 2.0,
which resulted in 243,793 lexemes with identiﬁed boundaries of their root morphemes.
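The two string-level steps of this procedure can be sketched as below. Both functions are illustrations under our own assumptions: the tie-breaking order (longer morph first, then the earlier one) is one reading of "the longer or first such morpheme", only the first occurrence of each allomorph is considered, and the frequency counts in the example are invented for demonstration. The allomorph set is read off Figure 1.

```python
def pick_root(morphs, frequency):
    """Index of the rarest morph; ties go to the longer, then the earlier one."""
    return min(
        range(len(morphs)),
        key=lambda i: (frequency.get(morphs[i], 0), -len(morphs[i]), i),
    )


def locate_root(lemma, allomorphs):
    """(start, end) of the longest root allomorph found in the lemma, or None."""
    best = None
    for allomorph in allomorphs:
        start = lemma.find(allomorph)  # first occurrence only
        if start == -1:
            continue
        # Prefer the longest match; among equally long ones, the leftmost.
        if best is None or (len(allomorph), -start) > (best[1] - best[0], -best[0]):
            best = (start, start + len(allomorph))
    return best


# Invented corpus-wide morph counts, for illustration only.
frequency = {"s": 5000, "téc": 12, "i": 40000}
print(pick_root(["s", "téc", "i"], frequency))

téci_allomorphs = {"téc", "tek", "tok", "ték", "tík"}
print(locate_root("soutok", téci_allomorphs))
print(locate_root("utíkat", téci_allomorphs))
```

The returned spans correspond to the bracketed roots sou[tok] and u[tík]at in Figure 1.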
There was an interesting side eﬀect of the allomorphy annotations. Some sets of allomorphs for diﬀerent
derivational trees were surprisingly similar. In some cases the string similarity was only due to a random
coincidence of etymologically unrelated clusters (such as the derivational family of řídký ‘sparse’ with
root allomorphs řid, říd, řed and řeď, from which three allomorphs overlap with the family of řídit ‘to
direct, to drive’ with allomorphs řid, říz, řed, řiz and říd), or due to a diachronic etymological relation
(since DeriNet focuses on synchronic view of the language, diachronic relations which are opaciﬁed in
modern language are not included; e.g. medvěd ‘a bear’, which is etymologically a compound with bases
med ‘honey’ and jíst ‘to eat’, is not connected to any parents in DeriNet) but sometimes we really revealed
a missing relation in DeriNet 1.7; such relations were added into DeriNet 2.0.
¹ One of our design decisions is that when adding new pieces of information into DeriNet, we prefer precision to recall.
Label        Count
Possessive   88,718
Female       29,023
Aspect       15,439
Iterative    11,886
Diminutive   5,939
Table 3: Counts of the semantic labels in DeriNet 2.0 data.
2.3 Semantic labels
Semantic labels, which capture the change in the meaning of the base word imposed by aﬃxation, were
assigned with relations in DeriNet as another new type of annotation.
Derivation in Czech is characterized by homonymy (polyfunctionality)² of affixes and, at the same time,
by their synonymy. Many aﬃxes convey more than one meaning, cf. the suﬃx -ka deriving the diminutive
noun vlnka ‘small wave’ from vlna ‘wave’, the female noun hráčka ‘female player’ derived from hráč
‘player’, the agent noun mluvka ‘talker’ from mluvit ‘to talk’, or the location noun skládka ‘dump’ from
skládat ‘to dump’. From the opposite perspective, a particular meaning is usually expressed by several
formally diﬀerent aﬃxes, cf. the suﬃxes -ka in stavitelka ‘female builder’ derived from stavitel ‘builder’,
-yně in kolegyně ‘female colleague’ from kolega ‘colleague’, -ice in lékárnice ‘female pharmacist’ from
lékárník ‘pharmacist’, and -ová in švagrová ‘sister-in-law’ from švagr ‘brother-in-law’ for female nouns.
The size of the DeriNet data as well as the fact that the database is still under construction were
the main reasons why semantic labels were not assigned manually; instead, a machine learning experiment
was designed for this task. Five semantic labels were included in this pilot experiment, namely
DIMINUTIVE, POSSESSIVE, FEMALE, ITERATIVE, and ASPECT. While the first four labels correspond to
semantic concepts proposed for comparative research into affixation (Bagasheva, 2017), the last label
(ASPECT) was introduced to apply to suffixation of verbs that does not affect the lexical meaning but
changes the category of aspect (from imperfective to perfective, or the other way round).³
Training and test data for the machine learning experiment, containing both positive and negative
examples of the five labels to assign, were compiled by exploiting several language resources and
reference grammars of Czech (cf. Ševčíková and Kyjánek in press for details).
The classifier used morphological categories and character n-grams of both the base words and the
derivatives as features and multinomial logistic regression as the method. The precision and recall
achieved (each above 96 %) indicate that the derivational families organized into rooted trees and the
features included provide a sufficient basis for resolving the homonymy and synonymy of affixes in most
cases. An analysis of incorrectly labelled relations pointed, for example, to feminine nouns incorrectly
assigned the FEMALE label such as profesura ‘professorship’ (derived from profesor ‘professor’) and krejčovna
‘tailor’s workshop’ (from krejčí ‘tailor’); this particular problem could be solved by introducing
the animacy feature to feminine nouns because the label is intended to be assigned only to female
counterparts of masculines. The resulting annotation of approx. 150 thousand labels was included in
the DeriNet 2.0 data. See Table 3 for a breakdown of the counts of the different categories.
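To make the feature set more concrete, the sketch below builds a sparse feature dictionary for one base-derivative pair from character n-grams and morphological categories. The n-gram orders (1 to 3), the boundary markers "^" and "$", and the feature naming are our assumptions; the text does not specify them. Dictionaries of this kind could then be vectorized and passed to any multinomial logistic regression implementation.

```python
def char_ngrams(word, n_max=3):
    """All character n-grams of orders 1..n_max, with word boundaries marked."""
    padded = f"^{word}$"
    return [
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def relation_features(base, derivative, base_cats, derivative_cats):
    """Binary features describing one derivational relation."""
    features = {}
    for role, word, cats in (
        ("base", base, base_cats),
        ("der", derivative, derivative_cats),
    ):
        for gram in char_ngrams(word):
            features[f"{role}:ngram={gram}"] = 1
        for name, value in cats.items():
            features[f"{role}:{name}={value}"] = 1
    return features


features = relation_features(
    "hráč", "hráčka",
    {"Gender": "Masc", "Animacy": "Anim"},
    {"Gender": "Fem"},
)
print(sorted(features)[:5])
```

The word-final n-gram "ka$" of the derivative together with the animacy of the base is the kind of evidence that could separate hráčka (FEMALE) from vlnka (DIMINUTIVE).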
2.4 Compounds
In the previous versions of DeriNet, compounding could not be adequately modelled due to the highly
constrained data structure, which allowed only a single base word to be specified for each derivative. In
DeriNet 2.0, we introduce the notion of multi-node relations, which allow specifying any number of
parent and child lexemes. Compounding is then annotated as a relation with multiple parent lexemes. For
technical reasons, a single parent and a single child must always be marked as the main ones. For example,
the adjective jihoruský ‘south-Russian’ points to the adjective ruský ‘Russian’ by the main-parent link
² The terms “homonymy” / “polyfunctionality” are preferred to “polysemy” in the recent accounts (Karlík et al., 2012;
Šimandl, 2016).
³ As formation of aspectual pairs exploits derivational affixes in Czech, the decision has been made to model this process as
deverbal derivation in the DeriNet database (Ševčíková et al., 2017).
hedvábí N hedvábný A
umělý A uměle D
umělost N
hedvábíčko N
hedvábně D
hedvábník N
hedvábnost N
umělohedvábný A
hedvábnice N
hedvábnický A
hedvábníkův A
umělohedvábnost N
umělohedvábně D
hedvábničin A
hedvábnickost N
hedvábnicky D
hedvábnictví N
Figure 2: The derivational family of the lexeme hedvábí ‘silk’ and a tiny excerpt from the family of the
lexeme umělý ‘artiﬁcial’.
and to the noun jih ‘south’ by a non-main-parent link. The interﬁx -o- is often added between the bases
in compounds in Czech.
DeriNet 2.0 contains only a small sample of such compound annotations, serving, again, rather as
a proof of concept. Out of around 33 thousand lexemes that were labelled as compounds in DeriNet 1.7
(just by a value of a binary ﬂag, without their compositional parents being identiﬁed), we extracted 723
lexemes whose parents can be guessed automatically with relatively high reliability using just a set of
string-based heuristics. Subsequently we checked the list manually, which resulted in 600 compounds for
which both compositional parents are captured in DeriNet 2.0.
The procedure for guessing the parents works as follows: First, decompose the lemma of a known
compound by ﬁnding an ‘o’ in it and extracting the substrings preceding and following it. The ﬁrst
substring is looked up in the dictionary as-is or amended by appending ‘ý’, ‘í’, ‘y’, ‘i’, ‘o’ or ‘a’ (these
are common inﬂectional suﬃxes and word-ﬁnal characters in Czech). The second substring is looked
up in the dictionary verbatim. If these lookups result in ﬁnding only a single pair of candidate parent
lemmas, output them, otherwise (if there are no matches or several) end the procedure without producing
any output. This selection process is highly biased, as it selects only lexemes whose parents can be
conclusively detected by simple string manipulation and ignores ambiguous cases.
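The procedure can be sketched as follows. The sketch tries every ‘o’ in the lemma as a potential interfix, which is our reading of "finding an ‘o’ in it"; the function and constant names are ours, and the lexicon is a toy set rather than the actual dictionary.

```python
# "" stands for looking up the first substring as-is.
FIRST_PART_ENDINGS = ("", "ý", "í", "y", "i", "o", "a")


def guess_compound_parents(lemma, lexicon):
    """Return the single (first parent, second parent) candidate, or None."""
    candidates = set()
    for position, char in enumerate(lemma):
        if char != "o":
            continue
        first, second = lemma[:position], lemma[position + 1:]
        # The second substring must be found verbatim.
        if not first or second not in lexicon:
            continue
        for ending in FIRST_PART_ENDINGS:
            if first + ending in lexicon:
                candidates.add((first + ending, second))
    # Output only unambiguous analyses; otherwise give up.
    if len(candidates) == 1:
        return candidates.pop()
    return None


lexicon = {"jih", "ruský", "umělý", "hedvábný"}
print(guess_compound_parents("jihoruský", lexicon))
print(guess_compound_parents("umělohedvábný", lexicon))
```

Adding a competing first-part candidate to the lexicon (for example a hypothetical entry "uměla") makes the second call return None, which illustrates why the selection is biased towards unambiguous cases.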
2.5 Fictitious lexemes
When climbing from a derived word up to its base parent and continuing upwards, we should ideally end
up in a tree root whose lemma is unmotivated (in the synchronic sense, i.e. there is no parent in the
contemporary language). However, in some cases there is a strong intuition that a virtual node (unattested
in both corpora and dictionaries) would be helpful, as it would complete a certain analogy pattern. For instance, one
is tempted to add a non-existent lemma bízet, as it would naturally serve as a derivational base for nabízet
‘to oﬀer’, vybízet ‘to prompt’, pobízet ‘to urge’ and others. In other conﬁgurations, a virtual lemma such
as tmívat could serve as an intermediate node connecting a (corpus-attested) lemma stmívat se ‘to get
dark’ with its (corpus-attested) grand-parent tma ‘darkness’, as the derivation is (again, by analogy to
other derivational clusters) perceived as two-phase. We call such artiﬁcially added lexemes ﬁctitious
lexemes. As a proof of concept, we added 13 such lexemes into DeriNet 2.0, which allowed adding 41
derivations for preﬁxed verbs that should clearly not remain in tree root positions.
Our approach to ﬁctitious lexemes is related to the linguistic discussion on cranberry morphemes
(Aronoﬀ, 1976) and, more recently, on paradigm gaps (e.g. Stump 2019). However, the basic building
unit of DeriNet is still a lexeme, not a morpheme, and thus there is no technical means e.g. for expressing
that a set of preﬁxed verbs makes use of the same morpheme.
3 New data format
Previous versions of DeriNet were published in a simple tab-separated-values text database ﬁle, which
contained a lemma, part of speech and an optional link to the derivational parent on each line; see Table 4
for an excerpt from DeriNet 1.7. None of the new features can be represented in the old format, and so
a new one was required. The old format cannot be easily extended in a backwards-compatible way, as
there is no reserved ﬁeld that identiﬁes the version and the only possible simple extension – adding new
columns to the end of each line – is not compatible with existing tooling that uses several extra columns
ID      Lemma            Dictionary ID            POS  Parent ID
205205  hedvábíčko       hedvábíčko               N    205206
205206  hedvábí          hedvábí                  N
205207  hedvábně         hedvábně_(*1ý)           D    205219
205208  hedvábnice       hedvábnice_(*3ík)        N    205215
205209  hedvábničin      hedvábničin_(*3ce)       A    205208
205211  hedvábnickost    hedvábnickost_(*3ý)      N    205213
205212  hedvábnicky      hedvábnicky_(*1ý)        D    205213
205213  hedvábnický      hedvábnický              A    205215
205214  hedvábnictví     hedvábnictví             N    205213
205215  hedvábník        hedvábník                N    205219
205216  hedvábníkův      hedvábníkův_(*2)         A    205215
205218  hedvábnost       hedvábnost_(*3ý)         N    205219
205219  hedvábný         hedvábný                 A    205206
...
768083  umělohedvábně    umělohedvábně_^(*1ý)     D    768085
768084  umělohedvábnost  umělohedvábnost_^(*3ý)   N    768085
768085  umělohedvábný    umělohedvábný            AC
...
768106  umělý            umělý                    A    768197
768020  uměle            uměle_(*1ý)              D    768106
Table 4: The tree below the word “hedvábí” (silk) and excerpts of two related trees in DeriNet 1.7. Since
compounding cannot be annotated in this format, the word umělohedvábný ‘made of artiﬁcial silk’ is
marked as a compound using the ‘C’ mark in the part-of-speech category (fourth) column, but it is not
connected to its parents umělý ‘artiﬁcial’ and hedvábný ‘made of silk’. The Dictionary ID column lists the
lemma together with technical suﬃxes as used by the MorfFlex dictionary – these are stored in DeriNet
to allow interlinking the two resources.
for debugging information. Therefore, as compatibility with existing tools has to be broken anyway, we
decided to create the new format from the ground up. When designing it, we drew inspiration from the
CoNLL-U format (Nivre et al., 2016), which recently became a widely used representation of syntactic
annotation.
The new format is still textual and lexeme-based, but it allows for a wider range of annotations. In
addition to the lemma and part-of-speech tag, each lexeme can be annotated by key-value pairs specifying
its properties (e.g. the morphological categories), a list of its morphemes together with their properties,
and by any number of directed word-formation relations. Each relation can connect multiple parents with
multiple children, and so the format can express one-to-one relation such as derivation or conversion,
as well as many-to-one relations such as compounding. The relations are stored together with their
children, connecting them to their parents, but otherwise behave like separate entities, and they can also
be annotated with arbitrary key-value pairs (e.g. the semantic labels). Furthermore, there is space for
custom (possibly language-speciﬁc) extensions of the format in the form of JSON-encoded data (Bray,
2017) stored in the last column. See Table 5 for an excerpt from DeriNet 2.0 showing the new format and
Figure 2 for a visualization of this data.
The key-value pairs are serialized into textual form by joining each pair by an equals sign and concatenating
all such pairs describing a single entity with ampersands: key1=value1&key2=value2.
If the ﬁeld in question describes multiple entities, such as the segmentation, the diﬀerent entities are
concatenated with vertical bars: key1=value1&key2=value2|keyA=valueA.
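A minimal reader and writer for such fields might look as follows. The sketch assumes that keys and values never contain the characters "=", "&" or "|" (the text does not describe any escaping), and it writes keys in alphabetical order, which matches the excerpt in Table 5 but is otherwise our assumption. The function names are ours.

```python
def parse_entity(text):
    """Parse 'key1=value1&key2=value2' into a dictionary."""
    if not text:
        return {}
    return dict(pair.split("=", 1) for pair in text.split("&"))


def parse_field(text):
    """Parse a field that may describe several entities separated by '|'."""
    if not text:
        return []
    return [parse_entity(part) for part in text.split("|")]


def serialize_field(entities):
    """Inverse of parse_field; keys are written in alphabetical order."""
    return "|".join(
        "&".join(f"{key}={value}" for key, value in sorted(entity.items()))
        for entity in entities
    )


print(parse_field("key1=value1&key2=value2|keyA=valueA"))
relation = parse_entity("Sources=195833.258,144293.1&Type=Compounding")
print(relation["Sources"].split(","))
```

The second call shows how the parents of a compounding relation, stored as a comma-separated list under Sources in the excerpt, can be recovered from the parsed dictionary.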
To simplify processing of the data, which has the form of a general graph, we explicitly select tree-shaped
substructures from the graph and store the corresponding “main parent” IDs in a dedicated column.
The lexemes in the ﬁle are grouped according to these trees, which correspond to derivational families,
with compounds added to the family of one of its parents. This enables e.g. performing a depth-ﬁrst
search over the structure of the derivational families without having to explicitly avoid cycles by marking
ID          Language-specific ID             Lemma            POS  Morphological features    Morpheme segmentation             Main parent ID  Parent relation
144293.0    hedvábí#NNN??-----A---?          hedvábí          N    Gender=Neut
144293.1    hedvábný#AA???----??---?         hedvábný         A                                                                144293.0        Type=Derivation
144293.2    hedvábně#Dg-------??---?         hedvábně         D                                                                144293.1        Type=Derivation
144293.3    hedvábník#NNM??-----A---?        hedvábník        N    Animacy=Anim&Gender=Masc                                    144293.1        Type=Derivation
144293.4    hedvábnice#NNF??-----A---?       hedvábnice       N    Gender=Fem                                                  144293.3        SemanticLabel=Female&Type=Derivation
144293.5    hedvábničin#AU????--------?      hedvábničin      A    Poss=Yes                                                    144293.4        SemanticLabel=Possessive&Type=Derivation
144293.6    hedvábnický#AA???----??---?      hedvábnický      A                                                                144293.3        Type=Derivation
144293.7    hedvábnickost#NNF??-----?---?    hedvábnickost    N    Gender=Fem                                                  144293.6        Type=Derivation
144293.8    hedvábnicky#Dg-------??---?      hedvábnicky      D                                                                144293.6        Type=Derivation
144293.9    hedvábnictví#NNN??-----A---?     hedvábnictví     N    Gender=Neut                                                 144293.6        Type=Derivation
144293.10   hedvábníkův#AU???M--------?      hedvábníkův      A    Poss=Yes                                                    144293.3        SemanticLabel=Possessive&Type=Derivation
144293.11   hedvábnost#NNF??-----?---?       hedvábnost       N    Gender=Fem                                                  144293.1        Type=Derivation
144293.12   umělohedvábný#AA???----??---?    umělohedvábný    A                                                                144293.1        Sources=195833.258,144293.1&Type=Compounding
144293.13   umělohedvábnost#NNF??-----?---?  umělohedvábnost  N    Gender=Fem                                                  144293.12       Type=Derivation
144293.14   umělohedvábně#Dg-------??---?    umělohedvábně    D                                                                144293.12       Type=Derivation
144293.15   hedvábíčko#NNN??-----A---?       hedvábíčko       N    Gender=Neut                                                 144293.0        SemanticLabel=Diminutive&Type=Derivation
...
195833.258  umělý#AA???----??---?            umělý            A                              End=2&Morph=um&Start=0&Type=Root  195833.4        Type=Derivation
195833.259  uměle#Dg-------??---?            uměle            D                              End=2&Morph=um&Start=0&Type=Root  195833.258      Type=Derivation
Table 5: The lexeme hedvábí ‘silk’ and derivationally related lexemes (i.e. a derivational family represented
as a tree) in DeriNet 2.0. The last column containing language- and resource-speciﬁc data has been
omitted; in Czech DeriNet 2.0, it contains the technical dictionary ID for linking with MorfFlex and the
“compound yes/no” ﬂag from previous versions of DeriNet. The line with dots divides the derivational
family of hedvábí ‘silk’ from that of umělý ‘artiﬁcial’, which is the second base word for the compound
umělohedvábný ‘made of artiﬁcial silk’.
The family containing the lexeme umělý is large enough to have been included in the annotation of root
morphemes. This annotation is present in the sixth column. The family of hedvábí is not annotated yet
and its sixth column is therefore empty.
visited lexemes, as it guarantees that a search starting from the base lexeme of the family will visit every
lexeme in it exactly once. There are no restrictions on the relations not participating in the tree-shaped
substructure, so it is possible to annotate double motivation and other general word-formation structures.
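The guarantee can be illustrated by a depth-first traversal that keeps no record of visited lexemes. The sketch assumes the main-parent column has already been read into a dictionary mapping each lexeme ID to its main parent ID (None for the root); the function name is ours, and the data are a subset of the rows shown in Table 5.

```python
from collections import defaultdict


def family_order(main_parent):
    """Depth-first order of lexemes, following main-parent links only."""
    children = defaultdict(list)
    roots = []
    for lexeme, parent in main_parent.items():
        if parent is None:
            roots.append(lexeme)
        else:
            children[parent].append(lexeme)
    order = []
    stack = list(reversed(roots))
    while stack:
        lexeme = stack.pop()
        order.append(lexeme)
        # No "visited" set is needed: main-parent links form a tree.
        stack.extend(reversed(children[lexeme]))
    return order


main_parent = {
    "144293.0": None,
    "144293.1": "144293.0",
    "144293.2": "144293.1",
    "144293.3": "144293.1",
    "144293.12": "144293.1",
    "144293.15": "144293.0",
}
print(family_order(main_parent))
```

The compound 144293.12 is reached exactly once, through its main parent, even though its compounding relation also lists a second parent in another family.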
Inside the database, all lexemes are unambiguously speciﬁed using an ID. The IDs are hierarchical:
they are composed of the number of the tree they are in, followed by the number of the lexeme in the
tree. These IDs are used to specify the endpoints of relations. Because the hierarchical numerical IDs
are opaque and they change when a lexeme is reconnected, a more permanent identiﬁcation of a lexeme
is possible using a field reserved for this purpose. In the Czech data, this field contains the lemma and
the tag mask introduced above.
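One practical consequence of the hierarchical IDs is that they should be handled as pairs of integers rather than as decimal numbers: 144293.1 and 144293.10 are different lexemes. The sketch below shows this, together with splitting the Czech language-specific identifier at "#" as seen in Table 5; the function names are ours.

```python
def parse_id(lexeme_id):
    """Split a hierarchical ID into (tree number, lexeme number in the tree)."""
    tree, index = lexeme_id.split(".")
    return int(tree), int(index)


def split_language_id(language_id):
    """Split a Czech language-specific ID into (lemma, tag mask)."""
    lemma, mask = language_id.rsplit("#", 1)
    return lemma, mask


print(parse_id("144293.1"), parse_id("144293.10"))
print(float("144293.1") == float("144293.10"))  # True: floats would conflate them
print(sorted(["144293.10", "144293.2", "144293.1"], key=parse_id))
print(split_language_id("mol#NNI??-----A---?"))
```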
Detailed documentation of the ﬁle format and the tools created to process it is available in the doc/
directory of the DeriNet repository at https://github.com/vidraj/derinet.
4 Conclusions
The DeriNet database was enriched with several diﬀerent kinds of information about the lexemes and
relations contained therein, which were previously missing. The newly added annotation is useful or
even required for many tasks, e.g. the availability of morphological categories was vital to annotating
the relations with semantic labels, and the annotation of root morphemes allowed us to cross-check the
already present derivational relations with another source of information.
The format we developed for storing and distributing the resulting network is supposed to be general,
extensible and language-agnostic enough to be usable by other projects as well. By using a common
format, the diﬀerent networks can beneﬁt from a shared set of tools and services and their users can more
easily compare their properties, and through that hopefully also the properties of diﬀerent languages.
Acknowledgments
This work was supported by Grant No. GA19-14534S of the Czech Science Foundation, by the
Charles University Grant Agency (project No. 1176219) and by the SVV project number 260 453. It
used language resources developed, stored, and distributed by the LINDAT/CLARIAH CZ project
(LM2015071, LM2018101).
References
Mark Aronoﬀ. 1976. Word Formation in Generative Grammar, volume 1 of Linguistic inquiry monographs. MIT
Press, Cambridge, Massachusetts, USA.
R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. CELEX2. Linguistic Data Consortium,
Catalogue No. LDC96L14. https://catalog.ldc.upenn.edu/LDC96L14.
Alexandra Bagasheva. 2017. Comparative Semantic Concepts in Aﬃxation. In Competing Patterns in English
Aﬃxation. Peter Lang, Bern, Switzerland, pages 33–65.
Tim Bray. 2017. The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259.
Jan Hajič and Jaroslava Hlaváčová. 2013. MorfFlex CZ. http://hdl.handle.net/11858/00-097C-0000-0015-A780-9.
Nabil Hathout and Fiammetta Namer. 2014. Démonette, A French Derivational Morpho-Semantic Network.
Linguistic Issues in Language Technology 11:125–162.
Petr Karlík et al. 2012. Příruční mluvnice češtiny. NLN, Prague, Czech Republic.
Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. 2009. Overview
and results of Morpho Challenge 2009. In Workshop of the Cross-Language Evaluation Forum for European
Languages. Springer, pages 578–597.
Eleonora Maria Gabriella Litta Modignani Picozzi, Marco Carlo Passarotti, and Chris Culy. 2016. Formatio
Formosa est. Building a Word Formation Lexicon for Latin. In Proceedings of the 3rd Italian Conference on
Computational Linguistics. pages 185–189.
Joakim Nivre et al. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the
10th International Conference on Language Resources and Evaluation. ELRA, pages 1659–1666.
Gregory Stump. 2019. Some sources of apparent gaps in derivational paradigms. Morphology 29:271–292.
Jonáš Vidra, Zdeněk Žabokrtský, Lukáš Kyjánek, Magda Ševčíková, and Šárka Dohnalová. 2019. DeriNet 2.0.
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics
and Physics, Charles University. http://hdl.handle.net/11234/1-2995.
Magda Ševčíková, Adéla Kalužová, and Zdeněk Žabokrtský. 2017. Identiﬁcation of Aspectual Pairs of Verbs
Derived by Suﬃxation in the Lexical Database DeriNet. In Proceedings of the Workshop on Resources and
Tools for Derivational Morphology. EDUCatt, Milan, Italy, pages 105–116.
Magda Ševčíková and Lukáš Kyjánek. in press. Introducing Semantic Labels into the DeriNet Network.
Jazykovedný časopis.
Magda Ševčíková and Zdeněk Žabokrtský. 2014. Word-Formation Network for Czech. In Proceedings of the 9th
International Conference on Language Resources and Evaluation. ELRA, Reykjavik, Iceland, pages 1087–1093.
Josef Šimandl, editor. 2016. Slovník aﬁxů užívaných v češtině. Karolinum, Prague, Czech Republic.