PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ, FI MU, Brno
December 16, 2020
Lexicography
• Introduction
• History
• Dictionaries and computers
Computational Lexicography
• Data representation
• Dictionary Writing Systems
Dictionary creation
• Lexical database
• Dictionary
• subfield of lexicology
• lexicography, lexikografie
► the activity or occupation of compiling dictionaries (Oxford d.)
► the editing or making of a dictionary (Merriam-Webster d.)
► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "Slovník národního jazyka náleží mezi první potřebnosti vzdělaného ..." ("A dictionary of the national language is among the first necessities of an educated ...")
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
► bilingual word lists: Sumerian - language of Ebla
• The Oxford English Dictionary (A New English Dictionary)
► 1857, Philological Society: R. C. Trench criticizes the deficiencies of existing English dictionaries
► 1879, James A. H. Murray appointed chief editor
► 1882-1928, published in 12 volumes, 15 487 pages, 240 000 entries
• Kancelář Slovníku jazyka českého (Office of the Dictionary of the Czech Language), 1911
► volunteers gathering supporting materials
► excerpts from novels, poems, technical books, journals
► Příruční slovník jazyka českého (Reference Dictionary of the Czech Language), 1935-1957
► 10 824 pages, 250 000 entries
► quotations by "unwanted authors" censored (Karel Čapek cited only as Lid. nov., i.e. Lidové noviny)
• Akademický slovník současné češtiny (Academic Dictionary of Contemporary Czech)
► 2005-2010, lexical database (Praled)
► 2012-2016, applied research
► planned size: 120-150 thousand entries
► finished letters A (2700 entries), B (3500), C+Č (3600), as of December 2020
► mainly electronic (web, mobile)
• The Oxford English Dictionary, 3rd Edition
► 2000-2037?, budget £34M
► "Every word in the Dictionary is being reviewed"
► periodical updates in batches, 4x/year
[Figure: OED3 Revision Progress]
Dictionaries and computers
• 1960s - computers come into use: lexicographers still write on paper, operators type the entries into a database; Brown Corpus
• 1978, Longman Dictionary of Contemporary English
► first dictionary with a limited defining vocabulary, checked automatically
► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
► contemporary corpus (Bank of English)
► 1987, Collins COBUILD English Language Dictionary
► first dictionary based on corpus data
► new definition style - full sentence
► e.g. "If a person, animal, or other living thing is killed, something or someone causes them to die."
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative
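The automatic check of a limited defining vocabulary mentioned for the 1978 Longman dictionary can be sketched roughly as follows. This is a minimal illustration, not the procedure Longman actually used: the tiny `DEFINING_VOCABULARY` set and the function name `words_outside_vocabulary` are hypothetical, and the sketch compares raw word forms without any lemmatization.

```python
import re

# Hypothetical, tiny defining vocabulary (assumption for illustration only;
# a real defining vocabulary is far larger).
DEFINING_VOCABULARY = {
    "a", "an", "animal", "causes", "die", "if", "is", "killed", "living",
    "or", "other", "person", "someone", "something", "them", "thing", "to",
}


def words_outside_vocabulary(definition, vocabulary):
    """Return the sorted word forms of a definition that are not in the vocabulary."""
    # Lowercase the text and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", definition.lower())
    return sorted(set(tokens) - vocabulary)


sample = ("If a person, animal, or other living thing is killed, "
          "something or someone causes them to die.")
print(words_outside_vocabulary(sample, DEFINING_VOCABULARY))
print(words_outside_vocabulary("A feline quadruped", DEFINING_VOCABULARY))
```

The full-sentence definition quoted above serves only as sample input; an empty result means every word form is covered by the vocabulary.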
• PB138 Modern Markup Languages
• XML (Extensible Markup Language) - markup (meta)language
• rules for well-formed documents - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements enclose content
• an element without content may be shortened to an empty-element tag
• attributes
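To make the notions of elements, empty elements and attributes concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The entry structure and the names `entry`, `headword`, `pos` and `definition` are invented for illustration and do not come from any particular dictionary standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry; element and attribute names are illustrative only.
ENTRY_XML = """
<entry id="kill-1">
  <headword>kill</headword>
  <pos value="verb"/>
  <definition>to cause someone or something to die</definition>
</entry>
"""

entry = ET.fromstring(ENTRY_XML)

# Element with content: the text sits between the start and end tags.
headword = entry.find("headword").text

# Element without content, written in the shortened form <pos .../>;
# its information is carried by an attribute instead.
pos = entry.find("pos").get("value")
pos_has_content = entry.find("pos").text is not None

# Attribute on the root element.
entry_id = entry.get("id")

print(entry_id, headword, pos, pos_has_content)
```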
Computational Lexicography
Structure and content description
• DTD (Document Type Definition)
► list of elements and attributes, and their relations
► no content checking
• XML Schema (XSD, XML Schema Definition)
► describes the structure and content of an XML document; the schema is itself an XML document
► elements, attributes, structure
► possibility to define custom content types (e.g. postal address)
► content checking (e.g. number range, regular expressions, allowed values)
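The kinds of content checks listed above (number range, regular expressions, allowed values) can be imitated in plain Python to show what a schema adds over a purely structural description. This is only a hand-written sketch, not an XSD validator: the rules, the `frequency` element and the function name `check_entry` are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# All constraints below are hypothetical examples of schema-style rules.
ALLOWED_POS = {"noun", "verb", "adjective"}   # allowed values
ID_PATTERN = re.compile(r"[a-z]+-\d+")        # regular expression, e.g. "kill-1"
FREQUENCY_RANGE = (1, 5)                      # number range, inclusive


def check_entry(entry):
    """Return a list of problems found in one <entry> element."""
    problems = []
    # Regular-expression check on the id attribute.
    if not ID_PATTERN.fullmatch(entry.get("id", "")):
        problems.append("bad id")
    # Allowed-values check on the part of speech.
    pos = entry.find("pos")
    if pos is None or pos.get("value") not in ALLOWED_POS:
        problems.append("bad pos")
    # Number-range check on the frequency band.
    freq = entry.findtext("frequency", default="")
    low, high = FREQUENCY_RANGE
    if not (freq.isdecimal() and low <= int(freq) <= high):
        problems.append("bad frequency")
    return problems


good = ET.fromstring('<entry id="kill-1"><pos value="verb"/><frequency>4</frequency></entry>')
bad = ET.fromstring('<entry id="Kill"><pos value="verbb"/><frequency>9</frequency></entry>')
print(check_entry(good))
print(check_entry(bad))
```

In a real project these rules would be declared in the schema and enforced by a validating parser rather than coded by hand.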
• XSLT - Extensible Stylesheet Language Transformations
• converts XML to another format
► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of XML document, recursive processing of the document
• (functional programming language)
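The idea of small templates applied recursively can be mimicked in Python, which may help before reading real XSLT. The sketch below is an analogue, not XSLT itself: each function plays the role of one template, and `render` dispatches on the element name. The entry structure, the HTML output and all function names are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_children(element):
    """Apply the matching template to every child element, in document order."""
    return "".join(render(child) for child in element)


def entry_template(element):
    return "<p>" + render_children(element) + "</p>"


def headword_template(element):
    return "<b>" + escape(element.text or "") + "</b> "


def definition_template(element):
    return "<span>" + escape(element.text or "") + "</span>"


def default_template(element):
    # Elements without a template of their own produce no markup themselves;
    # processing simply continues with their children.
    return render_children(element)


# One small template per element name (hypothetical names).
TEMPLATES = {
    "entry": entry_template,
    "headword": headword_template,
    "definition": definition_template,
}


def render(element):
    """Pick the template for this element and apply it."""
    return TEMPLATES.get(element.tag, default_template)(element)


source = ET.fromstring(
    '<entry><headword>kill</headword><pos value="verb"/>'
    "<definition>to cause to die</definition></entry>"
)
print(render(source))
```

Each template only describes its own element and hands the rest of the tree back to `render`, which is the same division of labour that the templates of an XSLT stylesheet rely on.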
