PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ (NLP Centre), FI MU, Brno
21 November 2018
Karel Pala, Adam Rambousek PA153 NLP Computational Lexicography 1/19
1. Lexicography
• Introduction
• History
• Dictionaries and computers
2. Computational Lexicography
• Data representation
• TEI
• LMF
• Dictionary Writing Systems
3. Dictionary creation
• Lexical database
• Dictionary
Lexicography
• PLIN035 Computational Lexicography
• subfield of lexicology
• lexicography (Czech: lexikografie)
► the activity or occupation of compiling dictionaries (Oxford d.)
► the editing or making of a dictionary (Merriam-Webster d.)
► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "A dictionary of the national language is among the first necessities of an educated person." (Slovník národního jazyka náleží mezi první potřebnosti vzdělaného člověka.)
History
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
► Sumerian - Eblaite language
• The Oxford English Dictionary (A New English Dictionary)
► 1857, Philological Society: R. C. Trench criticizes the existing dictionaries
► 1879, James A. H. Murray appointed chief editor
► 1882-1928, published in 12 volumes, 15,487 pages, 240,000 entries
History
• Kancelář Slovníku jazyka českého (Office of the Dictionary of the Czech Language), 1911
► volunteers gathering supporting materials
► excerpts from novels, poems, technical books, journals
► Příruční slovník jazyka českého (Reference Dictionary of the Czech Language), 1935-1957
► 10,824 pages, 250,000 entries
► quotes by "unwanted authors" censored (Karel Čapek credited as "Lid. nov.", i.e. Lidové noviny)
Future?
• Akademický slovník současné češtiny (Academic Dictionary of Contemporary Czech)
► 2005-2010, lexical database (PraLeD)
► 2012-2016, applied research
► planned 120,000-150,000 entries
► letter A (2,700 entries) finished in December 2017; B and C in 2018?
► mainly electronic (web, mobile)
• The Oxford English Dictionary 3rd Edition
► 2000-2037?, budget £34M
► "Every word in the Dictionary is being reviewed"
► periodic updates in batches, four times a year
Dictionaries and computers
• 1960s - computers come into use: lexicographers write on paper, operators type the text into a database; Brown Corpus
• 1978, Longman Dictionary of Contemporary English
► first with a limited defining vocabulary, checked automatically
► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
► contemporary corpus (Bank of English)
► 1987, Collins COBUILD English Language Dictionary
► 1st dictionary based on corpus data
► new definition style - full sentence
► If a person, animal, or other living thing is killed, something or someone causes them to die.
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative
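The automatic check of a limited defining vocabulary, mentioned above for the Longman dictionary, can be sketched roughly as follows. This is only an illustration: the word list is a tiny hypothetical sample (real defining vocabularies contain a couple of thousand words), and the function name is made up for this example. A real checker would also need lemmatization, which is left out here.

```python
import re

# Hypothetical, tiny defining vocabulary (illustration only).
DEFINING_VOCABULARY = {
    "a", "animal", "as", "four", "keep", "legs", "often",
    "people", "pet", "small", "that", "with",
}

def words_outside_vocabulary(definition, vocabulary=DEFINING_VOCABULARY):
    """Return the words of a definition that are not in the vocabulary."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    outside = []
    for token in tokens:
        # Report each offending word once, in order of first appearance.
        if token not in vocabulary and token not in outside:
            outside.append(token)
    return outside

print(words_outside_vocabulary(
    "a small animal with four legs that people often keep as a pet"))
print(words_outside_vocabulary("a small domesticated feline"))
```

The first definition uses only words from the list, so nothing is reported; the second one would be flagged for the editor because of "domesticated" and "feline".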
XML
• PB138 Modern Markup Languages
• eXtensible Markup Language - markup (meta)language
• rules for a well-formed document - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements: start tag, content, end tag
• elements without content may be shortened to a single empty-element tag
• attributes: name="value" pairs in the start tag
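A minimal sketch of these building blocks, parsed with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`entry`, `headword`, `pos`, `definition`, `id`, `value`) are invented for this illustration and do not come from any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry markup; names are illustrative only.
DOCUMENT = """
<entry id="lov-1">
  <headword>lov</headword>
  <pos value="noun"/>
  <definition>hunting of game and fish</definition>
</entry>
""".strip()

entry = ET.fromstring(DOCUMENT)

print(entry.tag, entry.attrib)          # element name and its attributes
print(entry.find("headword").text)      # element with text content
print(entry.find("pos").get("value"))   # empty element carrying an attribute
```

Because the markup is well-formed, any XML parser can read it without knowing what the tags mean; the meaning of `headword` or `pos` is entirely up to the user or the chosen standard.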
Structure and content description
• DTD (Document Type Definition)
► list of elements and attributes, and their relations
► no content checking
• XML Schema (XSD, XML Schema Definition)
► description of XML document structure and content, schema itself is XML document
► elements, attributes, structure
► possibility to define custom content types (e.g. postal address)
► content checking (e.g. number range, regular expressions, allowed values)
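The kinds of content checks listed for XML Schema (allowed values, regular expressions, number ranges) can be imitated by hand to see what they amount to. The sketch below is not XSD validation: Python's standard library has no schema validator, so the rules are written as plain Python, and all element names and limits are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content rules, loosely modelled on XSD restrictions.
ALLOWED_POS = {"noun", "verb", "adjective"}               # allowed values
HEADWORD_PATTERN = re.compile(r"[^\W\d_]+(-[^\W\d_]+)*")  # letters, optional hyphens
MIN_SENSE, MAX_SENSE = 1, 99                              # number range

def content_errors(entry):
    """Return a list of human-readable content problems in one entry."""
    errors = []
    headword = entry.findtext("headword", default="")
    if not HEADWORD_PATTERN.fullmatch(headword):
        errors.append(f"bad headword: {headword!r}")
    pos = entry.findtext("pos", default="")
    if pos not in ALLOWED_POS:
        errors.append(f"unknown part of speech: {pos!r}")
    for sense in entry.findall("sense"):
        number = sense.get("n", "")
        in_range = (re.fullmatch(r"[0-9]+", number) is not None
                    and MIN_SENSE <= int(number) <= MAX_SENSE)
        if not in_range:
            errors.append(f"sense number out of range: {number!r}")
    return errors

good = ET.fromstring(
    '<entry><headword>lov</headword><pos>noun</pos><sense n="1"/></entry>')
bad = ET.fromstring(
    '<entry><headword>l0v</headword><pos>nuon</pos><sense n="0"/></entry>')
print(content_errors(good))
print(content_errors(bad))
```

A DTD could only state that `headword`, `pos` and `sense` must be present; it could not reject the misspelled part of speech or the sense number 0, which is the gap XML Schema fills.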
Display
• XSLT - eXtensible Stylesheet Language Transformations
• converting XML to another format
► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of XML document, recursive processing of the document
• (functional programming language)
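The processing model described above (one small template per element type, applied recursively) can be imitated in plain Python. This is not XSLT itself, only a sketch of the idea; the entry markup, the `TEMPLATES` table and the helper names are all invented for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical entry markup; element names are illustrative only.
ENTRY = ET.fromstring(
    "<entry><headword>lov</headword><gram>m</gram>"
    "<sense><def>hunting of game and fish</def></sense>"
    "<sense><def>catch, prey</def></sense></entry>"
)

def render_text(node):
    """Fallback template: output the escaped text content of the node."""
    return escape(node.text or "")

def render_children(node):
    """Recursive step: render every child with the template for its tag."""
    return "".join(TEMPLATES.get(child.tag, render_text)(child)
                   for child in node)

# One small template per element type, as in an XSLT stylesheet.
TEMPLATES = {
    "entry": lambda n: f"<div class='entry'>{render_children(n)}</div>",
    "headword": lambda n: f"<b>{render_text(n)}</b> ",
    "gram": lambda n: f"<i>{render_text(n)}</i>",
    "sense": lambda n: f"<p>{render_children(n)}</p>",
    "def": render_text,
}

html = TEMPLATES[ENTRY.tag](ENTRY)
print(html)
```

Swapping the `TEMPLATES` table for another one (plain text, LaTeX) changes the output format without touching the stored entry, which is the reason for keeping content and display separate.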
[Figure: the entry "lov" as displayed in two dictionaries: Slovník spisovného jazyka českého (SSJČ), with three numbered senses and usage examples, and Slovník spisovné češtiny, with a shorter entry]
Storing
• XML database
• storing XML documents directly
• searching - XPath, XQuery
• e.g. eXist, BaseX, Sedna
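A small taste of XPath-style searching, using the limited XPath subset built into Python's `xml.etree.ElementTree`. The XML databases named above evaluate full XPath and XQuery over stored documents; this sketch only shows what such a query looks like. The mini-dictionary and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-dictionary; element names are illustrative only.
DICTIONARY = ET.fromstring("""
<dictionary>
  <entry><headword>lov</headword><pos>noun</pos></entry>
  <entry><headword>lovit</headword><pos>verb</pos></entry>
  <entry><headword>lovec</headword><pos>noun</pos></entry>
</dictionary>
""".strip())

# XPath: headwords of all entries whose <pos> child has the text "noun".
nouns = [hw.text
         for hw in DICTIONARY.findall("./entry[pos='noun']/headword")]
print(nouns)
```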
TEI
• Text Encoding Initiative, http://www.tei-c.org/
• TEI Guidelines (current version P5, published 2007)
• XML format for semantic description of text documents
• wide range of markup tags
• TEI Lite - smaller version, "90 % of the needs of 90 % of users"
• novels, poems, theatre plays, technical reference, dictionaries, corpora, alignment, text revisions, musical notation...
• tools - XSL transformations to LaTeX, docx, epub, HTML
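As an illustration of TEI markup for dictionaries, the sketch below reads a hand-written entry that uses a few elements from the TEI dictionaries module (`entry`, `form`, `orth`, `gramGrp`, `pos`, `gen`, `sense`, `def`) in the TEI namespace. The entry content is made up for the example, and a real TEI document would wrap it in the full `TEI`/`teiHeader`/`text` structure, which is omitted here.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, illustrative entry; not taken from a real dictionary file.
TEI_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form><orth>lov</orth></form>
  <gramGrp><pos>noun</pos><gen>m</gen></gramGrp>
  <sense n="1"><def>hunting of game and fish</def></sense>
  <sense n="2"><def>catch, prey</def></sense>
</entry>
""".strip()

entry = ET.fromstring(TEI_ENTRY)
headword = entry.findtext("tei:form/tei:orth", namespaces=TEI_NS)
pos = entry.findtext("tei:gramGrp/tei:pos", namespaces=TEI_NS)
definitions = [d.text for d in entry.findall("tei:sense/tei:def", TEI_NS)]

print(headword, pos, definitions)
```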
LMF
• Lexical Markup Framework, http://www.lexicalmarkupframework.org/
• ISO 24613:2008
• common model for lexical resources
• emphasis on machine processing and extensibility
• UML diagram for the lexicon
• core with basic information + extensions for various areas (morphology, syntax, semantics...)
[Figure: UML instance diagram of an LMF lexical resource]
• Lexical Resource
► Global Information: languageCoding = "ISO 639-3"
► Lexicon: language = "eng"
► Lexical Entry: partOfSpeech = "commonNoun"
► Lemma: writtenForm = "clergyman"
► Word Form: writtenForm = "clergyman", grammaticalNumber = "singular"
► Word Form: writtenForm = "clergymen", grammaticalNumber = "plural"
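The "clergyman" example from the diagram can be mirrored with a few Python dataclasses. This is a simplified sketch of the core idea (lexicon, lexical entry, lemma, word forms), not the normative LMF model: the class layout, field names and the `lookup` helper are assumptions made for this illustration.

```python
from dataclasses import dataclass, field

@dataclass
class WordForm:
    written_form: str
    grammatical_number: str

@dataclass
class LexicalEntry:
    lemma: str
    part_of_speech: str
    word_forms: list = field(default_factory=list)

@dataclass
class Lexicon:
    language: str
    entries: list = field(default_factory=list)

    def lookup(self, written_form):
        """Return lemmas of entries that have a word form with this spelling."""
        return [entry.lemma for entry in self.entries
                if any(form.written_form == written_form
                       for form in entry.word_forms)]

lexicon = Lexicon("eng", [
    LexicalEntry("clergyman", "commonNoun", [
        WordForm("clergyman", "singular"),
        WordForm("clergymen", "plural"),
    ]),
])

print(lexicon.lookup("clergymen"))
```

Extensions for syntax or semantics would be added as further classes attached to `LexicalEntry`, leaving this core untouched.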
Dictionary Writing Systems
• software application for dictionary creation (usually the full process)
• connected to other resources (corpora, analyzers...)
• often custom developed
• commercial (IDM DPS, iLex, TLex, ABBYY Lingvo Content)
• DEB (Dictionary Editor and Browser)
► platform to build dictionary applications
► client-server, core libraries, specialized modules
► DEBDict, DEBVisDic, Internetová jazyková příručka (Internet Language Reference Book), DEBWrite
► http://deb.fi.muni.cz
[Screenshot: editing window of a dictionary writing system, showing a headword list, the structure tree of the selected entry, and a formatted preview of French-English entries with usage examples]
Lexical database
• detailed structured database of language
► (recently) usage examples from corpus
► grammar
► valences, patterns
► language style, usage, region...
► word relations
• foundation for dictionaries and research
• PraLeD (Pražská lexikální databáze, Prague Lexical Database)
• DANTE (Database of ANalysed Texts of English)
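To make the idea of a structured lexical database concrete, here is a very small relational sketch using Python's built-in `sqlite3`. The table and column names are invented for this example and do not reflect the actual design of PraLeD or DANTE; the usage example is likewise invented rather than taken from a corpus.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE lexeme   (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT, style TEXT);
CREATE TABLE example  (lexeme_id INTEGER REFERENCES lexeme(id),
                       text TEXT, source TEXT);
CREATE TABLE relation (from_id INTEGER REFERENCES lexeme(id),
                       to_id   INTEGER REFERENCES lexeme(id), kind TEXT);
""")

db.executemany("INSERT INTO lexeme VALUES (?, ?, ?, ?)", [
    (1, "kill", "verb", "neutral"),
    (2, "die", "verb", "neutral"),
])
db.execute("INSERT INTO example VALUES (1, ?, ?)",
           ("The frost killed the plants.", "invented example"))
db.execute("INSERT INTO relation VALUES (1, 2, 'causes')")

# Word relations: join the relation table back to both lexemes.
rows = db.execute("""
    SELECT a.lemma, r.kind, b.lemma
    FROM relation r
    JOIN lexeme a ON a.id = r.from_id
    JOIN lexeme b ON b.id = r.to_id
""").fetchall()
print(rows)
```

A dictionary entry is then one possible view of such data: an editor selects and condenses what the database holds for a lexeme.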
Dictionary creation
• dictionary writing is expensive, laborious and time-consuming; competition among publishers
• B. T. Sue Atkins, Michael Rundell: The Oxford Guide to Practical Lexicography
[Figure: departments involved in developing a dictionary: Marketing Dept, Editorial Dept, IT Dept, Design Dept, Software House; labels include user profiles, extent/contents, styles and sample entries, print design]
Dictionary content
• macrostructure - entry list (+ preface, appendices...)
• heslo (sense 1) = lemma, entry term, headword (Czech: heslové slovo)
► noun singular, verb infinitive
► word parts, collocations
• heslo (sense 2) = entry (Czech: heslová stať)
• microstructure - structure of one entry in the dictionary
► checked by editing software
► easier orientation for the reader
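A rough idea of how editing software might check the microstructure: every entry must contain the same parts in the same order. The required order used here (`headword`, `pos`, one or more `sense`) and the function name are assumptions for this sketch, not the rules of any particular dictionary or tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical microstructure: required parts in a fixed order;
# <sense> may repeat.
EXPECTED_ORDER = ["headword", "pos", "sense"]

def microstructure_errors(entry):
    """Return a list of structural problems found in one entry."""
    errors = []
    tags = [child.tag for child in entry]
    for tag in EXPECTED_ORDER:
        if tag not in tags:
            errors.append(f"missing <{tag}>")
    known = [tag for tag in tags if tag in EXPECTED_ORDER]
    if known != sorted(known, key=EXPECTED_ORDER.index):
        errors.append("elements out of order")
    for tag in tags:
        if tag not in EXPECTED_ORDER:
            errors.append(f"unexpected <{tag}>")
    return errors

good = ET.fromstring(
    "<entry><headword>lov</headword><pos>noun</pos><sense/><sense/></entry>")
bad = ET.fromstring(
    "<entry><pos>noun</pos><headword>lov</headword></entry>")
print(microstructure_errors(good))
print(microstructure_errors(bad))
```

The same fixed order that the software enforces is what lets a reader find the part of speech or the first sense in the same place in every entry.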
Electronic dictionaries
• more information (CD, DVD, web)
► presentation space
• multimedia, searching, navigation, updates, external links
• datamining user information
► Dictionary.com, subsequent search: bastion, hiatus, enmity, decorous
• display information based on user profile
• connection with corpora - ordnet.dk, DWDS.de...
• combining resources, downloading data - Wordnik.com
• user-created content (90-9-1 rule) - Wiktionary, slovnik.zcu.cz...
• Macmillan - switch to digital only
• shift from products to services