33
Keeping Czech in check: A corpus-based study of generalization in
translation1
Jana Kubáčková
Generalization and specification of lexical meaning are studied, both quantitatively
and qualitatively, as potential universals of translation on the basis of a modest
English-Czech corpus comprising monolingual, multilingual and parallel
subcorpora. Against the backdrop of recent research in this area, generalization and
specification are outlined from the viewpoint of semantics, lexicology, stylistics and
contrastive language typology, with a particular focus on the category of translation
universals and the employment of corpus methodology. Methods and tools are
tailored to the needs of the analysis and the two contrasting concepts are
operationalized. The results obtained confirmed a weak overall tendency to
generalization as well as occurrence of specification which may be explained as due
to the influence of several factors.
Introduction: Arrival of corpora and universals
Translation universals have been around for quite some time now. They acquired substantial
prominence in the mid-1990s when Mona Baker outlined the potential of corpus linguistics in
the study of
[...] universal features of translation, that is features which typically occur in translated text
rather than original utterances and which are not the result of interference from specific
linguistic systems (Baker 1993:243).
The introduction of the methods of corpus linguistics into descriptive translation studies
(DTS) was hailed as “a turning point in the history of the discipline” (Baker 1993: 235).
Indeed, in conjunction with corpus methodology, the study of translation universals has
yielded some very interesting, indeed impressive results in terms of the nature of translations
as well as research methodology (e.g. in Meta 43:4, Chesterman 2004). Yet I cannot avoid
the impression that - for all that has been written about them - translation universals remain
very elusive. After all, as Pym (2008) remarks, the degree of overlap among the various
translation universals (those studied so far) is such that we might as well classify them as subcategories
of Toury’s law of growing standardization, or interference. While this does not
precisely bring us “back to square one”, Pym’s brilliant argumentation exposes the patchiness
of our knowledge of translation universals and the need to understand translation universals
in mutual relations.
And despite the methodological breakthrough made possible by large electronic
corpora, such tools often remain difficult to harness. What Kenny wrote in 1998 still holds
today:
1
This article is based on an MA thesis defended at Charles University (Prague) in 2008.
34
As has been shown time and time again in corpus linguistics, a new resource can give
impetus to new research. The challenge is to know what questions to ask of a translationoriented
corpus, and how to ask them (Kenny 1998: 523).
In addition to the tricky nature of electronic tools and (largely) quantitative results, there is
also the problem of linguistic systems that are more or less different from English. This has
far-reaching implications for the use of tools originally tailored for English and for attempts
to adapt them to, let’s say, a Slavonic language, which turned out to be a problem in the
present study when it came to dealing with the translators’ tendency to generalize lexical
meaning.
Universals, or tendencies?
The starting point, and in many respects the cornerstone of the study is the theory developed
by Jiří Levý, the Czech translation scholar, who found that:
Experiments with translators have shown that, when offered a group of near-synonyms, they
exhibit a natural tendency to select from it the most generalised term, the least specific word
(Levý 2008: 52).
In his theoretical work dating back to the 1950s (e.g. Levý 1955),2
Levý addresses various
general phenomena that have an impact on translating and translation, e.g. the tendencies
towards generalization, stylistic levelling or ‘intellectualization’ - the latter concept
represents a kind of rationalization and partly overlaps with what is today referred to as the
explicitation universal).
In his approach to the process and product of translating, Levý proceeds from a
prescriptive hypothesis to description, based on his experiments and observations, and
towards explanation (e.g. in Levý 1971 a,b; Levý 2008: 47f.) and points to objective, but also
subjective, i.e. psychological, cognitive, and pragmatic factors that may influence the
outcome of the translation process. The three key factors pointed out are (a) the structure of
the translators’ linguistic memory and (b) their perception of their role as mediators between
the text and the reader, but also (c) the principle of least effort, the “minimax theory”:
[…] the translator selects from the range of alternatives the one which promises the
maximum effect for the minimum effort (Levý 2008: 62-63).
The translator not only takes into account the reader’s most likely expectations3
but, more
importantly, adopts a pragmatic approach to the process of translating, seeking (consciously
or subconsciously) to strike a balance between his or her own efforts and potential results,
looking for
2
Other key studies and monographs include Umění překladu (The Art of Translation, first published in 1963)
and studies in the anthology published in 1971 (for a recent translation into English see Králová - Jettmarová et
al. 2008).
3
Similarly to Gutt’s theory of relevance; Gutt (2000).
35
a sentence structure which broadly takes account of all the essential semantic and stylistic
features, although a more perfect version might be found following a protracted period of
experimentation and thought (Levý 2008: 63).4
However, opting for the solution that is readily available in the linguistic memory may easily
result in a translation that is “colourless, general and vague” (Levý 2008: 52). According to
Levý, good translators go
deeper than the first, second or third level of the lexicon, selecting, as far as possible, words
which contain all the semantic attributes of the source text (ibid: 52).
In his attempts to explain phenomena characterizing the process of translation and in his
inherently interdisciplinary approach (Levý 1971a: 148), he in many respects anticipates the
“most recent” trends in DTS and CTS. Sadly, having lived on the “wrong” side of the Iron
Curtain, he is still too much of an outsider in the English-speaking world of translation
studies.
In translation research, the position of generalization and specification is a rather
marginal one. Lexical generalization is sometimes understood as a feature of simplification
(Blum-Kulka – Levenston 1983, in Halverson 2003: 219; Klaudy 2003), while specification
is often seen as an aspect of explicitation (Leuwen-Zwart 1990: 90; Klaudy 1993, in Baker et
al. 1998; Øverås 1998). Leuwen-Zwart (1990: 93) and Munday (1998) suggest that
specification is more prevalent than generalization, thus questioning the universal character
of the latter.
In their classical Stylistique comparée du français et de l’anglais: méthode de
traduction (1958; English translation in 1995), Vinay and Darbelnet use the term
“generalization” to label a translation technique – not a universal tendency, but a conscious
strategy “in which a specific (or concrete) term is translated by a more general (or abstract)
term” (Vinay-Darbelnet 1995: 343).
In a similar vein, the concept of generalization (and, correspondingly, specification referred
to as “concretization”) was developed by Kinga Klaudy (1996, 2003), who classifies
these phenomena as (a) language-specific, (b) culture-specific and (c) translation-specific
(Klaudy 1996). However, by taking examples from widely different languages (Finno-Ugric
vs. Indo-European) and building on their well-known linguistic and stylistic preferences, she
cannot account for generalization or specification as translation universals, even though she
does include this category and describe it in terms similar to those used by Levý (2008: 62-
63):
[...] translators might be tempted to follow the line of least resistance, and if they cannot find a
precise equivalent in the TL, they will select a word with a more general meaning [...]. (Klaudy
2003: 9)5
However, most studies addressing potential universals such as normalization, simplification,
sanitization, Toury’s law of growing standardization etc., subsume the tendency to use vague,
4
In this, he is close to Pym (2008) and his notion of risk avoidance.
5
The page numbers in references to works by Klaudy refer only to pages of texts printed from email
attachments, not books. The texts have been kindly provided by Kinga Klaudy herself.
36
less specific vocabulary under the respective universal. For example, Laviosa, who focuses
on simplification and has made a significant contribution to the use of electronic corpora in
translation research, classifies the tendency to overuse high-frequency words6
(the “core
patterns of lexical use”) as one of the criteria defining simplification (e.g. Laviosa 2002: 58n,
2003: 158-9).
This brings out the issue of genetic inter-relatedness between groups of potential
translation universals, as highlighted by Pym (2008). Halverson (2003: 218n) suggests that
behind these cognate tendencies there is a common cause, a sort of “gravitational pull”
exercised by the most salient members of the semantic structure. Interestingly enough, in her
article grounded in cognitive science, she endorses the arguments of Levý who also speaks of
a “symptom of attraction exercised [...] by the best-known member of a group of synonyms”
(1983: 143, my translation JK).
This highlights the need to study potential translation universals in their mutual
relationships, horizontal as well as vertical. To use a rather crude example, the Czech
language has a predilection for semantically rich verbs. A lack of specific verbs introducing
direct speech in fiction translated from English into Czech may be seen as language-pairspecific
generalization, which therefore cannot be considered as a universal. However, at a
higher level of abstraction, it may be regarded as a feature of the unique items hypothesis
proposed by Tirkkonen-Condit (2004) and simultaneously an instance of negative
interference according to Toury (1995). Cases like these imply that objective conditions
(linguistic and stylistic norms etc.) and subjective ones (the translator’s linguistic memory,
experience etc.) tend to combine and we can only make a more or less precise guess as to
which of the two is more probable.7
Languages in contrast: the preliminaries
Any analysis of generalization involves the treatment of the issue of lexical meaning and
synonymy both in general and language-specific contrastive aspects. After all, it is actually
the existence of synonyms, differentiated by shades of notional, pragmatic or contextual
meaning that provides the paradigm from which the translator can choose:
A paradigm cannot of course be considered a set of equivalent elements but a set ordered
according to a variety of criteria (e.g. ‘shades of meaning’, ‘stylistic levels’ etc.), as otherwise no
choice would be possible. (Levý 2008: 51)
As regards stylistic levels, some formal aspects of synonyms can be distinguished as useful
for subsequent corpus analysis, e.g. Czech expressive synonyms can often be identified
thanks to certain suffixes and combinations of letters.
In Czech functional stylistics, the concept of synonymy is broader and more loosely
defined than in lexicology (Filipec 1961: 145, Bečka 1948: 63) and is therefore more
convenient for semantic analysis in translation. Levý (2008: 49-52), conceiving translation as
6
High frequency of occurrence is assumed, already by Levý, to accompany a vaguer semantic content.
7
There are of course many more factors to be taken into account, such as the translator’s attitude to stylistic
norms of the source and target cultures, the author’s style etc.
37
a decision-making process, speaks of near-synonyms; J.V. Bečka, a leading Czech stylist of
Levý’s era, concedes:
In stylistics, it is not only the word as such, but the choice between words that is at stake, [and
sometimes] we have to decide between words whose meanings are close but by no means
overlapping, i.e. between words that do not constitute true synonyms. (Bečka 1948: 63; my
translation JK).
In other words, stylistic synonymy is very rich but rather unstable, context- and functionsensitive,
sometimes verging on (co-)hyponymy. The reason is that in translation and any text
analysis we deal with parole, which represents a projection of the paradigmatic axis onto the
syntagmatic axis. Such a broad delimitation allows for the conception of synonymic chains
where the dominant member, the “centre of gravity”, tends to be the most frequent and
general one (Filipec 1961: 205); this brings us back to Levý and his arguments explaining
translators’ tendency to generalize.
Contrastive language typology is another crucial aspect of the preliminary analysis, accounting
for the principal differences between the source (English) and target (Czech) languages – it is all
the more important as such typological differences have far-reaching implications for corpus
research methodology. As Pym (2008) pointed out, comparable corpora represent an attempt to
get rid of the influence of the source language, but in themselves are insufficient since they
cannot account for interference and thus can lead to erroneous conclusions. This is one of the
stumbling blocks of comparable corpora as conceived by Baker (1995). True, to a certain extent,
the influence of linguistic systems can be harnessed using Jantunen’s method of three
comparable subcorpora (Jantunen 2004). In parallel corpora, where the source-target relations
can be observed more directly, the first step is to isolate systemic differences in order to identify
their influence. The next step in the present study was therefore the establishment of relevant
typological features and stylistic preferences of Czech and English with special focus on the
vocabulary and methodological implications.
The Czech scholar of English language and literature Vilém Mathesius draws a parallel
between language typology and the meaning of lexical units.
Roughly speaking, words in a language with a synthetic structure (such as Czech) usually have a
more definite meaning than words in a language with an analytical structure (such as English or
French) (Mathesius 1975: 18).
English is also classified among languages characterized by a high degree of polysemy
(Čermák 2004: 205). While synthetic languages, including Czech, usually use affixes to
create new words, English can often simply convert nouns to verbs etc., without changing the
form. In addition, English vocabulary, known for its tendency towards monosyllabism,
includes a significant proportion of homonyms (Vachek 1974:66).
This may have a significant impact on corpus research. For example, one cannot
directly compare the frequencies and counts of word types in a parallel Czech-English corpus
– one English type most probably stands for a number of context-dependent meanings, and
may represent several different parts of speech. Nor can the type/token ratio be used to
account for vocabulary richness in both languages – due to inflection, one Czech word can
38
occur in many cases with different endings, thus substantially increasing the number of types
in relation to tokens. Thus the resulting ratio would be much lower than that for English.
Another crucial aspect of typological differences with direct consequences for corpus
research methodology is the concept of “the word” itself.
The definition of the word varies from one language to another. For example in Czech and other
Slavonic languages of a predominantly synthetic type, the boundaries between words as opposed
to collocations, sentences and morphemes are drawn more clearly than in English, a
predominantly analytical language. (Filipec - Čermák 1985: 34, my translation JK).8
As Mathesius rightly points out, “there are borderline cases; besides independent words there
are words approaching affixes” (Mathesius 1975: 24) – English, in particular, often uses
apostrophes and hyphens, which can divide words as well as members of a compound. Czech
and English differ also in their approach to and usage of various types of compounds.
In his guide to the ParaConc corpus manager, Michael Barlow pays special attention
to the category of the word:
[...] the first definition of a word that comes to mind is a string of letters (and perhaps numbers)
surrounded by spaces. And with a little further thought, we would realise that we need to include
punctuation symbols, in addition to spaces, as possible delimiters of words. Hence, we can define
a word as a string of characters bounded by either spaces or punctuation (plus special computer
characters such as the carriage return) (Barlow 2003: 75).
ParaConc, for example, treats the apostrophe as a part of the word. However, by changing
search options, the apostrophe may be classified as a word delimiter. Similar precautions
apply for using WordSmith Tools.
Compound words are even more problematic, mainly due to the varying degree of
independence of hyphenated words. Moreover, a corpus manager cannot be expected to
capture all instances of compounds since “it is largely a matter of personal choice whether we
write match box, match-box or matchbox (Stubbs 2002: 31). Needless to say, phrasal verbs
such as give up, care for etc., which usually have one-word Czech equivalents, are “invisible”
for corpus managers. Finally, English uses many grammatical words (articles, auxiliary verbs
etc.) and expressions where the grammatical and semantic functions are distributed between
the members (to have a swim, to give a laugh, etc.). The whole – which is more than the sum
of the parts – is unrecognizable in a frequency list.
Useful information concerning systemic differences between Czech and English can
be gained from translated texts - as shown for example by Knittlová (2003), building on
examples from 1960s-1980s translations. Like Baker (1992), Knittlová addresses various
types of non-equivalence and speaks of generalization and specification as sub-categories of
partial equivalence. She considers specification to be the prevalent tendency in translations
from English into Czech and highlights the semantic richness of “multifaceted” Czech verbs:9
8
Filipec and Čermák refer to the article by Josef Vachek (1961) – Some Less Familiar Aspects of the Analytical
Trend of English. In Brno Studies in English 3, 9-78.
9
Here, to some extent, Knittlová cannot avoid the blurring of purely semantic and grammatical (or semigrammatical)
categories such as the aktionsart. The present study excludes consideration of Czech verbal aspect
and English tenses.
39
Again, this is related to the typological difference between the two languages, to the nominal
character of English and the rather verbal character of Czech (Knittlová 2003: 34, my translation
JK).
Knittlová adds that “Czech equivalents of the most frequent groups of English verbs are
semantically richer and more specific” (Knittlová 2003: 51, my translation JK).10
She (2003:
51-52) also suggests that although English has verbs of similar specificity they are used much
less frequently.
Linguistic typology also influences the way languages use markers of expressiveness:
In English, expressiveness tends to be concentrated in lexical units which carry solely expressive
connotational features and have a capacity to radiate, while in Czech texts expressiveness is
spread more evenly over a greater number of units that carry both denotational and connotational
features (Knittlová 2003: 106, my translation JK).11
As for generalization, most examples in Knittlová are illustrative of cultural differences
rather than of a phenomenon occurring during the process of translation.
To be sure, a good translator ought to be able to come to terms with the
incommensurable nature of language pairs – and the first step is to be aware of the problem
and the remedy. As Levý (1983: 70) points out, routine Czech translations from English make
insufficient use of diminutives and other means of expressing affection due to the typological
differences. However, in the complex decision-making process of translation in general, and
translation of fiction in particular, the issue of generalization vs. specification is only one of
many.
Hypotheses and operationalization
Generalization in translation is defined as a “conscious or subconscious semantic loss of one
or more specific semes (notional or pragmatic), in a lexical unit [...] in contrast to
corresponding units of the original or, quantitatively, to comparable original texts written in
the same language” (Kubáčková 2008: 65). Conversely, specification is defined as the
opposite tendency, i.e. as “conscious or subconscious semantic enrichment [...]”.
Bearing in mind that the boundaries between possible causes of these phenomena can
be blurred and using the categories presented by Levý (1983) and Klaudy (1996), instances of
generalization/specification are classified according to the following potentially independent
variables: (a) differences between language systems, (b) stylistic norms, (c) pragmatic factors
such as cultural knowledge and (d) as translation-inherent (universal tendencies, lack of time
or experience, unwillingness to look for a better solution, etc.).
In line with Levý, who considered the vague and “grey” style to result from the
tendency of translators to choose a more general word (1983: 137), the following set of
hypotheses was established:
10
These include the category of verbs introducing direct speech (verba dicendi) where the semantic richness of
Czech verbs is reflected in the prevalent stylistic norm requiring lexical variation.
11
E.g. certain endings and suffixes typical for spoken Czech.
40
I The vocabulary of a corpus of Czech texts translated from English will be more
general and deprived of semantic colour in comparison with original Czech texts.
II The vocabulary of a corpus of Czech texts translated from several languages will be
more general and deprived of semantic colour in comparison with original Czech texts.
III In translations, generalization will be more frequent than specification if we exclude
instances of obligatory specification, non-equivalence due to differences in language
typology and instances that can be accounted for by stylistic conventions.
IV The tendency to generalize will be observed in different translations of an identical
original.
In terms of observable phenomena, the following were considered as indirect indicators of
generalization:
1) A lower number of appellative autosemantic lemmas and a lower lemma-token ratio
will be found in the translation corpus compared to the comparable corpus of original texts;
2) The first 200; 500; 1000 appellative autosemantic lemmas respectively in the
frequency list of the translation corpus will cover a higher percentage of the corpus than the
same numbers of lemmas in the comparable corpus of original texts;
3) The first 200 appellative autosemantic types in the frequency list of the translation
corpus will cover a higher percentage of the corpus and include fewer lemmas than the same
number of types in the comparable corpus of original texts;
4) The number of lemmas with a frequency of 1 up to 10 will account for a smaller part
of the total number of lemmas in translations than in the comparable corpus of original texts;
5) The number of specific expressive lemmas produced by lexical derivation12
(also in
comparison to the total number of lemmas) will be lower than in the comparable corpus of
original texts;
6) The range of synonyms and near-synonyms will be less varied than in the comparable
corpus of original texts.
The criterion for directly observed indication of generalization was the following:
7) When compared to the original, a given passage of a translation will display more
instances of semantic loss (generalization) than of semantic specification. These instances
will not be directly relatable to typological differences between the languages in question or
the influence of target-language stylistic conventions. (Kubáčková 2008: 66-68)
12
Derivation is the typical procedure for word formation in Czech. Therefore, expressive endings adding
stylistic colour to the text do not readily suggest themselves to the translator from English; their absence may
result in the generalization of lexical meaning, which may endorse the Unique Items Hypothesis.
41
Analytical methods and procedures
The analysis was carried out on three levels, with three different types of corpora, starting
with largely quantitative observations and gradually increasing the proportion of qualitative
research. The selection of texts was guided by the principle of mainstream fiction, since this
was the type of material on which Levý had based his theory. At the same time, fiction, due
to its aesthetic function, could be expected to reveal instances of noticeable semantic loss or
enrichment. The aim was not only to verify Levý’s experimental data by using electronic
tools, but also to apply corpus-based methods on Czech texts and so contribute to their
refinement.
The first analytical level handled a monolingual comparable corpus (in terms of Laviosa
1997a: 292) consisting of three subcorpora of Czech fiction extracted from the SYN2005
corpus, which is part of the freely accessible Czech National Corpus (CNC)13
and 40% of
which comprises fiction texts. The CNC corpus manager Bonito was used to design the
subcorpora in line with the criteria of Jantunen’s three-phase comparative analysis (Jantunen
2004: 106f) to provide for the control of the influence of English as the source language. By
spotlighting interference, this method also helps uncover phenomena that are not the result of
the influence of the source language.
The building of the subcorpora had to tackle an imbalance in the book market also
encountered by Bernardini and Zanettin (2004) – most of the texts were translations from
English, Czech originals ranged second and translations from other languages came last.14
As
the aim was to create the largest subcorpora possible, the smallest subcorpus had to be taken
as the benchmark and consequently the size of the other two subcorpora had to be adjusted so
as to make them comparable. Three subcorpora were obtained with a total size of some 22
million tokens:
ORIG: 7 201 905 (i.e. Czech original fiction)
T-Engl: 7 207 238 (i.e. translations from English)
T-mix: 7 209 242 (i.e. translations from a mix of languages)
The selection criterion of “contemporary mainstream fiction” being rather vague, the
subcorpora were composed of texts published in the period 1960-2004, with the majority
published in the 1990s and later. T-mix consists of 37 translations from Germanic languages,
37 from Romance languages, 26 from Slavonic languages and 12 from non-Indo-European
languages (Finnish, Japanese, Hebrew and Yiddish), which gives quite a balanced mix.15
Comparability of the subcorpora is based on the criteria of their size, genre (prose),
period of publication and language. As the T-mix subcorpus was limited by the availability of
texts in the CNC16
the criteria could not have been further fine-tuned.
13
Accessible at http://ucnk.ff.cuni.cz/
14
It is worth pointing out that the sources for SYN2005 were selected on the basis of a wide-scale readership
survey. See http://ucnk.ff.cuni.cz/.
15
Cf. Laviosa (2002: 63) who mentions the disadvantage of having a large proportion of source languages from
one group. The wide choice offered by SYN2005 is probably due to the Czech translation tradition.
16
All available texts were used. For their list see Kubáčková (2008).
42
The Bonito manager made it possible to use lemmatization and tagging provided in
the SYN2005 texts.17
After retrieving all the tokens of each subcorpus (query .*.), the
negative filter (N-filter) was used to eliminate all lemmas starting with a capital letter (proper
names) and all punctuation marks, numbers and numerals and synsemantic parts of speech
including pronouns. Thus, allowing for tagging errors, sets of nouns, adjectives, verbs and
adverbs18
were obtained and frequency lists of lemmas produced for the calculation of the
lemma/token ratio for each subcorpus (Fig.1).
FIG. 1
ORIG T-Engl T-mix
difference
ORIG -
T-Engl
difference
ORIG -
T-mix
No. of tokens (size of the
subcorpora) 7201905 7207238 7209242
No. of appellative
autosemantic tokens 3483594 3431286 3473394
No. of appellative
autosemantic lemmas 95145 72256 68873 22889 26272
lemma/token ratio (%) 2,7312 2,1058 1,9829 0,6254 0,7483
The difference between the respective lemma/token ratios is in percentage points.
Since in Czech each lemma of an inflected word occurs as a number of types, there is a
significant disparity between the number of types and different lemmas. Therefore to capture
lexical diversity the lemma/token ratio had to be used instead of the usual type/token ratio.
The comparative study of the generalization indicators No. 1– 6 was based on the
frequencies19
of lemmas and affixes.
Interestingly enough, although originally smaller than the translation subcorpora,
ORIG turned out to contain the highest number of appellative autosemantic lemmas. The
differences between the lemma/token ratios of ORIG and T-Engl/T-mix respectively were not
statistically significant, but, together with most of the other indicators (% covered by the most
frequent lemmas/types, the numbers of low-frequency lemmas etc. as is evident for example
in Fig. 2-5), indicated that there was a difference between ORIG on the one hand and
translation subcorpora on the other, suggesting a greater lexical diversity in ORIG.
17
The risk of error must be allowed for, despite the fact that the methods for lemmatization and tagging of
SYN2005 represent a major step forward as compared to preceding corpora. For more information see
http://ucnk.ff.cuni.cz/ .
18
The boundaries between different parts of speech are not always clear-cut; The present approach is tailored to
the tools of electronic analysis.
19
[...] in a field like translation, the best, if not the only way to go about estimating “probabilities for terms in [...]
systems” is to proceed from “observed frequencies in [a] corpus” (Toury 2004: 20).
43
FIG. 2
No. of the most frequent lemmas (list head)
size of corpora (No.
of tokens) The first 200 The first 500 The first 1000
Subcorpus
(appellative,
autosemantic) sum % sum % sum %
ORIG 3483594 1261775 36,221 1656373 47,548 1981174 56,872
T-Engl 3431286 1326524 38,660 1732302 50,486 2066143 60,215
T-mix 3473394 1291135 37,172 1708494 49,188 2055030 59,165
FIG. 3
Subcorpus Size
No. of types
(list head)
No. of
lemmas (n)
Part of the subcorpus covered
by the first 200 types (p)
ORIG 3483594 200 137 780734 (22,412 %)
T-Engl 3431286 200 139 816267 (23,789 %)
T-mix 3473394 200 141 789657 (22,734 %)
FIG. 4
Subcorpus
Average frequency of the first
200 types (f = p/200)
Average frequency of the first
n lemmas (f = p/n)
ORIG 3903,670 5698,788
T-Engl 4081,335 5872,424
T-mix 3948,285 5600,404
FIG. 5
ORIG T-Engl T-mix
Total No. of lemmas 95145 72256 68873
No. of lemmas with a
frequency ≤ 10 72298 51801 48511
% 75,99 71,69 70,44
As for the usage of expressive affixes in original texts and translations (see e.g. Fig. 6), the
results were quite convincingly in favour of ORIG, suggesting that translators tend to neglect
the specific potential of Czech morphology. On the other hand there were hardly any
differences in the usage of synonyms. From the point of view of methodology, it may be
more worthwhile to focus on affixes as parts of words bearing only limited semantic
information than on words as such, e.g. synonyms, the occurrence of which appears to
depend much more on the texts in the corpora.
FIG. 6 Expressive suffix –isko (augmentative semantic value)
Subcorpus ORIG T-Engl T-mix
No. of expressive lemmas 28 7 9
Without proper names 25 7 9
Total No. of lemmas 95145 72256 68873
Of which expressive lemmas (%) 0,0263 0,00969 0,0131
44
The three-phase comparable analysis also indicated certain instances of interference from
English, but these were negligible against the backdrop of the overall tendency of translations
to use less varied vocabulary.
Admittedly, the differences between originals and translations were usually small. In
addition, we must allow for a number of limitations, such as the size and composition of the
corpora, lemmatization errors etc. However, the results repeatedly pointed to a less varied
vocabulary in both types of translation subcorpora.
The second level of analysis aimed at testing the third hypothesis – i.e. the prevalence of
generalization in translations with the exclusion of instances caused by systemic or stylistic
differences. It was based on a parallel corpus of five books of fiction and their translations
into Czech (Kubáčková 2008: 74). The originals were all published after 1950 and the
translations after 1989;20
the books were written by well-known authors and can be
considered mainstream fiction; the authors include both men and women from Great Britain,
the USA and Canada; each of the books was translated by a different person with Czech as
their mother tongue; each author and translator is represented only once.
The corpus was analysed with WordSmith and ParaConc. Three reference corpora
were used in addition – the British National Corpus (BNC), the frequency lists of the
American National Corpus (ANC), and a CNC reference corpus of original Czech fiction
(over 10 million tokens) extracted by the author of the present study.
To get a rough picture of the lexical variety of the English originals, their
standardized type/token ration per 1000 words was calculated in WordSmith and the results
were compared to the standardized type/token ration of original English fiction in BNC 1995
– a benchmark used by Zanettin (2000: 111). The English part of the corpus as a whole was
only slightly above the reference value of 44.44 (also calculated by WordSmith) and the
values for individual novels showed no extreme deviations that would indicate a peculiar
vocabulary usage.
In order to devise a method that would be as objective and as easy to replicate as
possible, a ParaConc frequency list of the original texts was produced first. Since the words
in the list head are likely to be translated into Czech in a more specific way due to systemic
language differences, the subsequent analysis focused on infrequent types:21
100 types were
selected which occurred only once in the list and less than 100 times in the BNC or the ANC.
Their meanings were checked in dictionaries in order to select semantically rich words. The
process of selection was carried out prior to the analysis of the translations so as to not to
distort the results by any subjective bias.
Subsequently the translations were analyzed in ParaConc and word pairs then
examined in the minimum context necessary. Not surprisingly, numerous English expressions
were “spread” over several units in translation, which would be unobservable in a purely
quantitative study solely relying on electronic analytical data.
Shifts in translation, based on Popovič’s typology (1974: 122f; 130) and lexical
stylistics, were identified with reference to a variety of dictionaries (monolingual, bilingual,
20
The year 1989 is considered a landmark which brought a major change into the social and economic context
of Czech translation.
21
These subcorpora were not lemmatized.
45
synonymic, etymological). There being no occurrences of generalization caused by pragmatic
differences between the readers of the originals and the translations in the 100 words chosen,
occurrences of generalization and specification were classified as (a) systemic (languagespecific),
(b) stylistic and (c) translational. Three more categories were needed to account for
the remaining cases: other types of shifts, zero equivalents (omission) and zero or negligible
shifts.22
The analysis of the 100 lexical units and their translations yielded a prevalence of
translational generalization:
FIG. 7
100 units systemic stylistic translational sum
generalization 11 0 26 37
other
shifts zero / negligible shifts
specification 2 1 7 10 9 44
In addition, shifts were observed within the context sentences – i.e. in other lexical units.
Here the occurrence of stylistic specification increased and prevailed over generalization.
However, after elimination of the systemic and stylistic types of specification, translational
generalization prevailed over specification:
FIG. 8
shifts in context
sentences systemic stylistic translational sum
other
shifts
generalization 1 0 22 23 3
specification 5 9 10 24
The results suggest a significant tendency towards generalization and, with respect to the
material analysed, confirm the third hypothesis. At the same time they contradict LeuvenZwart
(1990) and Munday (1998), who found a prevalence of specification. However, it is
possible that their material displayed a significant degree of systemic or stylistic specification
which was not treated separately from translational phenomena.
However, no shift, be it generalization, specification, or even a zero shift, should be a
priori qualified as negative, undesirable, or positive (Popovič 1974: 131). Generalization may
deprive the translation of some colour (such as in to marshal other ranks – odvést, i.e. to
“lead away”, Kubáčková 2008: 102), but specification can also have a negative effect by
offering an almost ready-made interpretation. Zero or negligible shifts may include both wellfitting
solutions as well as ill-fitting expressions. There are also instances where such a shift
is deliberate because appropriate from the aspect of a larger context, or may be introduced by
the editor. Such information is inaccessible, but such conditioning has to be accounted for as
a possible factor.
The third level represents a deep analysis of two translations of the novel Foundation and
Empire by Isaac Asimov (1952). Two translations of one original offer a unique possibility to
22
The classification of shifts into these necessarily rough categories is far from unambiguous since, as pointed
out above, there are no clear-cut boundaries between the causes underlying each choice made by the translator.
For complete results, see Kubáčková 2008.
46
focus on a limited number of variables – the personality of the translators, their idiolect,
experience and preferences, and the context in which the translations were produced.
Although the first Czech translation of Asimov dates back to 1970,23
it was the years
after 1989 that witnessed an outbreak of publishing frenzy. Between 1991 and 2006, the
newly emerging small publishing houses, driven very probably by demand from readers,
churned out at least two new Asimov translations almost every year - inevitably, with a
negative impact on the quality of the new translations. Translator Richard Podaný (2000)
mentions the Czech version of the novel Foundation by Jarmila Pravcová (1991) as one of
the books that inspired the establishment of translation anti-awards. In the same year,
Foundation and Empire was published, again in Pravcová’s translation. The publishing
house, AG Kult, became notorious for negligent editorial work. Podaný does not expressly
speak about Foundation and Empire, but the context, as a starting point of analysis, certainly
does not bode well. The second translation of Foundation and Empire was probably a
reaction to this “rush” period. Published in 2003, the new Czech version was produced by
Viktor Janiš, a young translator who has established a good reputation.
The analysis of the original confirmed rich vocabulary (with a type/token ratio per 1000
words of 45.6 compared to the above-mentioned benchmark of 44.44, Zanettin 2000: 111).
Next, WordSmith was used to search for keywords, i.e. the words identified as typical for a
single text (here regarded as a small corpus) in contrast with a larger corpus. The Keywords
tool was employed to compare the vocabulary of Foundation and Empire to the other four
English-language novels used in the previous analyses.24
Disregarding the names of characters and words related to the content of the novel
(planet, galaxy), the list of keywords featured many expressions related to speech: first and
second person pronouns, most common present tense verbs including their short forms and
words that could introduce or describe direct speech (shrugged, smiled, spoke, nodded,
replied, frowned, muttered, whispered etc; adverbs dryly, coldly, harshly, sombrely; and also
the interjection huh and nouns such as voice or speech). As these words had been chosen for
their relatively high frequency, it could be inferred that the novel built heavily on the
dialogue or direct speech and its varieties. There were complex structures qualifying speech
(with a crisp air of finality, with slow meaning, etc.), an unusually varied usage of verba
dicendi, a high degree of expressiveness in the speech of certain characters, as well as a
significant range of synonyms describing communication (to prate, to jabber, to babble etc.).
These features were established as dominant for the style of the original and therefore as the
focus of subsequent translation analysis.
In verba dicendi, the difference found between the two translations was striking.
Again, WordSmith’s Keywords were used, this time in a rather unusual way. Keywords are
normally employed to compare a text with a large reference corpus; however, identification
of keywords in two translations of one text may be a promising launch pad as both use certain
content-related expressions which will thus not appear in the keyword list. By definition, the
list will yield words that are “overused” by one of the translators, thus pointing to their
idiolect, approach, etc.
23
Information provided by the online catalogue of the Czech National Library.
24
A maximum of 1000 words was searched for. The minimum “key” frequency was set at 3, with p=0,001.
47
The list revealed a pronounced disparity in the usage of verba dicendi.25
The Czech
for [he] said – řekl – tops the list of keywords in Pravcová and the corresponding lemma
occurs 334 times in her translation.26
On the other hand, Janiš takes great pains to avoid what
he considers to be the obvious interference, and uses the lemma a mere twelve times.
However, in his translation, other verba dicendi are conspicuously frequent – they are often
rather formal or even bookish (opáčit, odtušit - similar, but not quite synonymous with retort
or riposte). Moreover, some of them seem to be overused in translations in general (as
detected in T-Engl and T-mix corpora), in contrast with original Czech texts.
Thus, while Pravcová features a high rate of interference of the English stylistic norm,
making the Czech dialogues rather stereotypical, Janiš takes care to respect domestic
conventions, but doesn’t always keep his own lexical predilections under control. By
overusing certain semantically rich verbs, he draws attention to them without need or
purpose. Besides, some expressive verbs occur in collocations where they do not fit.
Further analysis of verba dicendi and longer stretches of discourse revealed that
Janiš’s effort to use more varied and colourful vocabulary was also reflected in the higher
degree of expressiveness in dialogues. His method is certainly in line with the overall
tendency of the original, and a great improvement on the previous translation which
substantially deprived the dialogue of its original colour. In places, however, Janiš, carried
away by this tendency, disregards the context.
The two translations show diverging tendencies – one a tendency towards generality
and stylistic interference, and the other an inclination to (over)use of colourful and
semantically specific vocabulary with occasional losses due to the intention to be “different”.
Thus, generalization occurs in both translations, but cannot be said to be equally prevalent.
Social conditions and the policy of the publisher can influence this trend while the idiolect
and approach adopted by the translator can go against a “general” tendency, as shown in
Janiš’s effort to counter stylistic interference.
Conclusion
Generalization is observed as a weak but universal tendency of translated texts in
monolingual comparable corpora. In pairs of originals and translations, it was prevalent in
particular in semantically complex lexical units, as observed at the second-level analysis;
however, as seen in the comparison of two translations of one novel, generalization seems to
be largely dependent on the translator’s idiolect and ambitions as well as on the social
context.
The concept of a “universal tendency” follows the line of thought expressed for
example by Toury (2004), but what does it mean in practice? On the basis of the findings of
the present study, the following can be proposed as a tentative hypothesis: the universal
nature of generalization can be expected to reveal itself when large amounts of data are
compared by quantitative methods. However, the more one relies on qualitative analysis and
the closer one comes to the individual translator and his/her idiolect, experience and working
conditions in a particular social context, the more can one expect discrepancies in findings. In
25
The list of keywords was useful as it indicated the main tendencies. However, WordSmith seems to have
difficulty with treating texts in Czech. It was therefore necessary to verify and correct the data.
26
The original uses only a few more – 358 instances of said as introducing speech.
48
other words, although generalization may perhaps never be ruled out, it can be overridden by
contrary tendencies.27
Different types of corpora can thus be expected to yield different types
of results – small samples point to the differences between translator individualities while
larger corpora reveal the general pattern.
Finally, the results obtained suggest that with a flexible approach, the enormous
potential of corpus managers may be exploited in large corpora as well as in individual text
analyses. In the present article, particular attention is paid to the factors that must be taken
into account in a corpus-based analysis of two largely different languages. Contrastive
semantics, lexicology, stylistics and morpho-semantic typology provide the groundwork in
which the research methods are anchored. Care has been taken to present the necessary
methodological adjustments and adaptations needed to meet the challenge of studying a
synthetic language with tools designed for a language with an analytical structure. Electronic
corpora offer potential for innovation, as e.g. our keyword analysis suggests.
References
BAKER, Mona. 1992. In Other Words: A Coursebook on Translation. London and New
York: Routledge.
BAKER, Mona. 1993. Corpus Linguistics and Translation Studies: Implications and
Applications. In Baker, M., Francis, G., Tognini-Bonelli, E. (eds), Text and Technology. In
Honour of John Sinclair. Amsterdam: J. Benjamins, 233-250.
BAKER, Mona. 1995. Corpora in Translation Studies. An Overview and Some Suggestions
for Future Research. Target 7: 2, 223-243.
BARLOW, Michael. 2003. ParaConc: A Concordancer for Parallel Texts. (Draft 3/03).
[online] [cit. 1-9-2008]. Accessible at http://www.athel.com/paraconc.pdf
BEČKA, Josef V.1948. Úvod do české stylistiky. Praha: Knižnice Kruhu přátel českého
jazyka.
BERNARDINI, Silvia – Zanettin, Frederico. 2004. When Is a Universal Not a Universal? In
MAURANEN, A. - Kujamäki, P. (eds), Translation Universals. Do They Exist? Amsterdam:
J. Benjamins, 51-62.
BLUM-KULKA, S. – Levenston, E. 1983. Universals of lexical simplification. In Færch, C.
–Kasper, G. (eds.), Strategies in Interlanguage Communication. London: Longman, 119-139.
CHESTERMAN, Andrew. 2004. Beyond the Particular. In Mauranen, A. - Kujamäki, P.
(eds), Translation Universals. Do They Exist? Amwsterdam: J. Benjamins, 33-49.
27
Clearly, this is one of the challenges faced by translator training – to warn against undesirable tendencies,
while avoiding simplifying judgements.
49
ČERMÁK, František. 2004. Jazyk a jazykověda. Přehled a slovníky. Praha: Karolinum.
ČERMÁK, František – Kocek, Jan. 2008. Co je korpus? [online] [cit. 30-8-2008]. Accessible
at http://ucnk.ff.cuni.cz/co_je_korpus.html.
FILIPEC, Josef. 1961. Česká synonyma z hlediska stylistiky a lexikologie. Praha: Nakladatelství
Československé akademie věd.
FILIPEC, Josef – Čermák, František. 1985. Česká lexikologie. Praha: Academia.
HAJIČ, Jan. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech),
Vol. 1. Praha: Karolinum.
HAJIČ, Jan – Krbec, Pavel – Květoň, Pavel – Spoustová, Drahomíra – Votrubec, Jan. 2007.
The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing. ACL 2007,
Prague. Praha: Karolinum,. 67-74.
HALVERSON, Sandra. 1998. Translation Studies and Representative Corpora: Establishing
Links between Translation Corpora, Theoretical/Descriptive Categories and a Conception of
the Object of Study. Meta 43: 4, 494-514.
HALVERSON, Sandra. 2003. The Cognitive Basis of Translation Universals. Target 15: 2,
197-241.
JANTUNEN, Jarmo Harri. 2004. Untypical Patterns in Translation. In Mauranen, A. Kujamäki,
P. (eds), Translation Universals. Do They Exist? Amsterdam: J. Benjamins, 101-
124.
KENNY, Dorothy. 1998. Creatures of Habit? What Translators Usually Do with Words.
Meta 43: 4, 515-523.
KLAUDY, Kinga. 1996. Concretization and Generalization of Meaning in Translation. In
Thelen, M., Lewandowska-Tomaszczyk, B. (eds), Translation and Meaning, Part 3.
Maastricht: Hogeschool Maastricht, 140-152.
KLAUDY, Kinga. 2003. Languages in Translation. Lectures on the Theory, Teaching and
Practice of Translation. Budapest: Scholastica, 321-327.
KNITTLOVÁ, Dagmar. 2003. K teorii i praxi překladu. Olomouc: Univerzita palackého v
Olomouci.
KRÁLOVÁ, Jana – Jettmarová, Zuzana et al. 2008. Tradition versus Modernity. From the
Classic Period of the Prague School to Translation Studies at the Beginning of the 21st
Century. Praha: Karlova univerzita – TOGGA.
50
KUBÁČKOVÁ, Jana. 2008. Generalizace a specifikace lexikálního významu v překladu.
[MA Thesis]. Praha: Karlova Univerzita v Praze.
LAVIOSA, Sara. 1997a. How Comparable Can “Comparable Corpora” Be? Target 9: 2, 289-
319.
LAVIOSA, Sara. 1997b. Investigating Simplification in an English Comparable Corpus of
Newspaper Articles. In Klaudy, K. – Kohn, J. (eds), Transferre necesse est. Budapest:
Scholastica, 531-540.
LAVIOSA, Sara. 2002. Corpus-Based Translation Studies. Theory, Findings, Applications.
Amsterdam: Rodopi.
LAVIOSA, Sara. 2003. Corpus and simplification in translation. In S. Pertilli (ed.),
Translation Translation. Amsterdam: Rodopi, 153 - 162.
LEVÝ, Jiří. 1955. Překladatelský proces – jeho objektivní podmínky a psychologie. Slovo a
slovesnost 16, 65-87.
LEVÝ, Jiří. 1971a. Bude teorie překladu užitečná překladatelům? In Bude literární věda
exaktní vědou? Praha: Československý spisovatel, 147-157.
LEVÝ, Jiří. 1971b. Geneze a recepce literárního díla. In Bude literární věda exaktní vědou?
Praha: Československý spisovatel, 71-143.
LEVÝ, Jiří. 1983. Umění překladu. Praha: Panorama.
LEUWEN-ZWART, Kitty M., van. 1989. Translation and Original. Similarities and
Dissimilarities I, II. Target 1: 2, 151-183; 2: 1, 69-95.
MATHESIUS, Vilém. 1975. Obsahový rozbor současné angličtiny na základě obecně
lingvistickém. Praha: Nakladatelství Československé akademie věd.
MUNDAY, Jeremy. 1998. A Computer-Assisted Approach to the Analysis of Translation
Shifts. Meta 43: 4, 542-556. [online] [cit. 2008-08-14]. Accessible at
http://id.erudit.org/iderudit/003680ar
ØVERÅS, Linn. 1998. In Search of the Third Code: An Investigation of Norms in Literary
Translation. Meta 43: 4, 571-588.
PODANÝ, Richard. 2000. Koniášovská retrospektiva. In Interkom, 9-10-2000. [online] [cit.
2008-08-24]. Accessible at http://www.scifi.cz/ik/2000/20000908.htm
POPOVIČ, Anton. 1974. Teória umeleckého prekladu. Bratislava: Tatran.
51
PYM, Anthony. 2007. On Toury’s laws of how translators translate. [online] [cit. 2008-07-
05]. Accessible at http://www.tinet.org/~apym/
PYM, Anthony – Shlesinger, Miriam – Simeoni, Daniel. (eds). 2008. Beyond Descriptive
Translation Studies. Investigations in homage to Gideon Toury. Amsterdam: J. Benjamins.
SCOTT, Mike. 1998. WordSmith Tools Manual. Version 3.0. Oxford: Mike Scott & Oxford
University Press.
STUBBS, Michael. 2002. Words and Phrases. Corpus Studies of Lexical Semantics. London:
Blackwell.
TIRKKONEN-CONDIT, Sonja. 2004. Unique items – over- or under-represented in
translated language? In Mauranen, A. - Kujamäki, P. (eds), Translation Universals. Do They
Exist? Amsterdam: J. Benjamins, 177-184.
TOURY, Gideon. 1995. Descriptive Translation Studies and Beyond. Amsterdam: J.
Benjamins.
TOURY, Gideon. 2004. Probabilistic Explanations in Translation Studies. In Mauranen, A.
and Kujamäki, P., Translation Universals. Do They Exist? Amsterdam: J. Benjamins, 15-32.
VACHEK, Josef. 1974. Chapters from Modern English Lexicology and Stylistics. Praha:
Státní pedagogické nakladatelství.
VINAY, Jean-Paul; Darbelnet, Jean 1995. Comparative Stylistics of French and English: a
Methodology for Translation. Transl. and ed. by Juan C. Sager and M. J. Hamel. Amsterdam:
J. Benjamins.
ZANETTIN, Federico. 2000. Parallel Corpora in Translation Studies: Issues in Corpus
Design and Analysis. In Olohan, M. (ed.), Intercultural Faultiness. Manchester: St. Jerome,
105-118.
Jana Kubáčková
52
In SKASE Journal of Translation and Interpretation [online]. 2009, vol. 4, no. 1 [cit. 2009-09-07].
Available on web page <http://www.skase.sk/Volumes/JTI4/pdf_doc/04.pdf>. ISSN 1336-7811.