PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ, FI MU, Brno
16 November 2015
Karel Pala, Adam Rambousek PA153 N LP Computational Lexicography 1/19
1 Lexicography
• Introduction
• History
• Dictionaries and computers
2 Computational Lexicography
• Data representation
• TEI
• LMF
• Dictionary Writing Systems
3 Dictionary creation
• Lexical database
• Dictionary
Lexicography
• PLIN035 Computational Lexicography
• subfield of lexicology
• lexicography (Czech: lexikografie)
► the activity or occupation of compiling dictionaries (Oxford d.)
► the editing or making of a dictionary (Merriam-Webster d.)
► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "A dictionary of the national language is among the first necessities of an educated person."
History
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
► Sumerian - Eblaite (the language of Ebla)
• The Oxford English Dictionary (A New English Dictionary)
► 1857, Philological Society, R. C. Trench criticizes the existing dictionaries
► 1879, James A. H. Murray appointed chief editor
► 1882-1928, published in 12 volumes, 15 487 pages, 240 000 entries
History
• Kancelář Slovníku jazyka českého (Office of the Dictionary of the Czech Language), 1911
► volunteers gathering supporting materials
► excerpts from novels, poems, technical books, journals
► Příruční slovník jazyka českého (Reference Dictionary of the Czech Language), 1935-1957
► 10 824 pages, 250 000 entries
► quotes by "unwanted authors" censored (Karel Čapek cited as "Lid. nov.", i.e. the newspaper Lidové noviny)
Future?
• Akademický slovník současné češtiny (Academic Dictionary of Contemporary Czech)
► 2005-2010, lexical database (PraLeD)
► 2012-2016, applied research
► planned size: 120-150 thousand entries
► letter A finished (2700 entries), to be published in December; B and C in 2017
► mainly electronic (web, mobile)
Dictionaries and computers
• 1960s - computers come into use: lexicographers still write on paper, operators type the text into a database; Brown Corpus
• 1978, Longman Dictionary of Contemporary English
► first dictionary with a limited defining vocabulary, checked automatically
► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
► contemporary corpus (Bank of English)
► 1987, Collins COBUILD English Language Dictionary
► first dictionary based on corpus data
► new definition style - a full sentence
► "If a person, animal, or other living thing is killed, something or someone causes them to die."
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative
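The automatic check of a limited defining vocabulary mentioned above can be sketched as follows. This is only an illustration: the word list is a tiny invented sample, the function name is hypothetical, and inflected forms are not handled.

```python
import re

# Hypothetical miniature defining vocabulary; a real one would be much
# larger and loaded from a file.
DEFINING_VOCABULARY = {
    "a", "an", "the", "of", "to", "or", "and", "that", "is", "for",
    "person", "animal", "thing", "living", "book", "gives", "list",
    "words", "explains", "their", "meanings",
}

def words_outside_vocabulary(definition, vocabulary=DEFINING_VOCABULARY):
    """Return the words of a definition that are not in the vocabulary."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return sorted({t for t in tokens if t not in vocabulary})

ok = words_outside_vocabulary(
    "a book that gives a list of words and explains their meanings")
bad = words_outside_vocabulary("a lexicographic compendium of words")
print(ok)   # no word outside the vocabulary
print(bad)  # words an editor would have to replace
```

An editing tool could run such a check on every definition and flag the offending words for the lexicographer.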
XML
• PB138 Modern Markup Languages
• eXtensible Markup Language - markup (meta)language
• rules for well-formed documents - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements: <element>content</element>
• elements without content may be shortened to <element/>
• attributes: <element attribute="value">content</element>
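As a small illustration of elements, empty elements and attributes, the following sketch parses an XML fragment with Python's standard `xml.etree.ElementTree` module. The element and attribute names are invented for this example and do not follow any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment: the element and attribute names are
# chosen for illustration only.
xml_text = """
<entry id="e1">
  <headword>lexicography</headword>
  <pos value="noun"/>
  <definition>the activity of compiling dictionaries</definition>
</entry>
"""

entry = ET.fromstring(xml_text)

headword = entry.find("headword").text   # element with text content
pos = entry.find("pos").get("value")     # empty element, data in an attribute
entry_id = entry.get("id")               # attribute on the root element

print(entry_id, headword, pos)
```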
Structure and content description
• DTD (Document Type Definition)
► list of elements and attributes, and their relations
► no content checking
• XML Schema (XSD, XML Schema Definition)
► describes the structure and content of an XML document; the schema itself is an XML document
► elements, attributes, structure
► possibility to define custom content types (e.g. postal address)
► content checking (e.g. number range, regular expressions, allowed values)
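The difference between checking structure only (as a DTD does) and also checking content (as XML Schema allows) can be imitated in plain Python. This is not a DTD or XSD processor: the standard library has no validating parser, and the rules, element names and sample values below are all invented for the sketch.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules for an <entry> element, invented for this sketch.
# Structure rule (DTD-like): which child elements may appear.
ALLOWED_CHILDREN = {"entry": {"headword", "pos", "year"}}

# Content rules (XSD-like): allowed values, number range, regular expression.
CONTENT_RULES = {
    "pos": lambda text: text in {"noun", "verb", "adjective"},
    "year": lambda text: text.isdigit() and 1000 <= int(text) <= 2100,
    "headword": lambda text: re.fullmatch(r"[a-z-]+", text) is not None,
}

def check_structure(entry):
    """Report child elements that are not allowed inside the element."""
    return [c.tag for c in entry if c.tag not in ALLOWED_CHILDREN[entry.tag]]

def check_content(entry):
    """Report child elements whose text violates a content rule."""
    return [c.tag for c in entry
            if c.tag in CONTENT_RULES and not CONTENT_RULES[c.tag](c.text or "")]

entry = ET.fromstring(
    "<entry><headword>lexicon</headword><pos>nuon</pos><year>1603</year></entry>"
)
structure_errors = check_structure(entry)
content_errors = check_content(entry)
print(structure_errors)  # structure is fine
print(content_errors)    # the misspelled part of speech is caught
```

The misspelled value "nuon" passes the structural check and is caught only by the content rule, which is the kind of error a DTD alone cannot detect.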
Display
• XSLT - eXtensible Stylesheet Language (Transformations)
• converting XML to another format
► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of XML document, recursive processing of the document
• (functional programming language)
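The idea of small templates applied recursively can be imitated in Python. The sketch below is not XSLT (the Python standard library has no XSLT processor); it only mirrors the processing model with one function per element name. The element names and the HTML output format are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

def apply_templates(node):
    """Process all children of a node and join their output."""
    return "".join(transform(child) for child in node)

# One small "template" per element name (names are hypothetical).
TEMPLATES = {
    "entry": lambda n: f'<div class="entry">{apply_templates(n)}</div>',
    "headword": lambda n: f'<b>{escape(n.text or "")}</b> ',
    "pos": lambda n: f'<i>{escape(n.text or "")}</i> ',
    "sense": lambda n: f'<span class="sense">{escape(n.text or "")}</span>',
}

def transform(node):
    """Use the template matching the element; otherwise just descend into
    its children (text outside templates is dropped in this sketch)."""
    template = TEMPLATES.get(node.tag, apply_templates)
    return template(node)

entry = ET.fromstring(
    "<entry><headword>lov</headword><gram><pos>m</pos></gram>"
    "<sense>hunting</sense></entry>"
)
html_output = transform(entry)
print(html_output)
```

The `gram` element has no template of its own, so processing simply continues with its children, similar in spirit to the default behaviour of an XSLT processor.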
[Figure: the entry "lov" (hunt, hunting) as displayed in two Czech dictionaries, Slovník spisovného jazyka českého (SSJČ) and Slovník spisovné češtiny]
Storing
• XML database
• storing XML documents directly
• searching - XPath, XQuery
• e.g. eXist, BaseX, Sedna
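A minimal sketch of XPath-style searching, using the limited XPath subset supported by Python's `xml.etree.ElementTree` rather than a native XML database such as those listed above. The mini-dictionary and its element names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-dictionary; element and attribute names are invented.
dictionary = ET.fromstring("""
<dictionary>
  <entry pos="noun"><hw>hunt</hw><sense>the act of hunting</sense></entry>
  <entry pos="verb"><hw>hunt</hw><sense>to chase animals</sense></entry>
  <entry pos="noun"><hw>catch</hw><sense>something caught</sense></entry>
</dictionary>
""")

# XPath: headwords of all noun entries
noun_headwords = [hw.text
                  for hw in dictionary.findall("./entry[@pos='noun']/hw")]

# XPath: all entries whose headword is "hunt"
hunt_entries = dictionary.findall("./entry[hw='hunt']")
hunt_pos = [e.get("pos") for e in hunt_entries]

print(noun_headwords)
print(hunt_pos)
```

A native XML database would evaluate full XPath or XQuery over stored documents; the queries here only show the flavour of path expressions with predicates.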
TEI
• Text Encoding Initiative, http://www.tei-c.org/
• TEI Guidelines (current version P5, published 2007)
• XML format for semantic description of text documents
• wide range of markup tags
• TEI Lite - smaller version, "90 % of the needs of 90 % of users"
• novels, poems, theatre plays, technical reference, dictionaries, corpora, alignment, text revisions, musical notation...
• tools - XSL transformations to LaTeX, docx, EPUB, HTML
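To give an idea of TEI markup for dictionaries, the sketch below reads a small entry encoded with elements from the TEI dictionary module (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`, `cit`, `quote`). The entry content itself is invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented entry for illustration, using elements of the TEI dictionary module.
tei_entry = ET.fromstring("""
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="hunt">
  <form><orth>hunt</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <sense n="1">
    <def>the activity of chasing wild animals</def>
    <cit type="example"><quote>a fox hunt</quote></cit>
  </sense>
</entry>
""")

orth = tei_entry.find("tei:form/tei:orth", TEI_NS).text
pos = tei_entry.find("tei:gramGrp/tei:pos", TEI_NS).text
definitions = [d.text for d in tei_entry.findall("tei:sense/tei:def", TEI_NS)]
examples = [q.text for q in
            tei_entry.findall(".//tei:cit[@type='example']/tei:quote", TEI_NS)]

print(orth, pos, definitions, examples)
```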
LMF
• Lexical Markup Framework, http://www.lexicalmarkupframework.org/
• ISO 24613:2008
• common model for lexical resources
• emphasis on machine processing and extensibility
• UML diagram for the lexicon
• core with basic information + extensions for various areas (morphology, syntax, semantics...)
: Lexical Resource
    : Global Information
        languageCoding = "ISO 639-3"
    : Lexicon
        language = "eng"
        : Lexical Entry
            partOfSpeech = "commonNoun"
            : Lemma
                writtenForm = "clergyman"
            : Word Form
                writtenForm = "clergyman"
                grammaticalNumber = "singular"
            : Word Form
                writtenForm = "clergymen"
                grammaticalNumber = "plural"
Dictionary Writing Systems
• software application for dictionary creation (usually the full process)
• connected to other resources (corpora, analyzers...)
• often custom developed
• commercial (IDM DPS, iLex, TLex, ABBYY Lingvo Content)
• DEB (Dictionary Editor and Browser)
► platform to build dictionary applications
► client-server, core libraries, specialized modules
► DEBDict, DEBVisDic, Internetová jazyková příručka, DEBWrite
► http://deb.fi.muni.cz
[Figure: screenshot of a dictionary writing system showing the headword list around "sans", the structured editing view of the entry, and the formatted preview of the French-English entry "sans" (prep., "without") with usage examples]
Lexical database
• detailed structured database of language
► (recently) usage examples from corpus
► grammar
► valences, patterns
► language style, usage, region...
► word relations
• foundation for dictionaries and research
• PraLeD (Pražská lexikální databáze, Prague Lexical Database)
• DANTE (Database of ANalysed Texts of English)
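A lexical database of this kind can be sketched with a few relational tables. The example below uses Python's built-in `sqlite3` module; the table layout, column names and data are invented for illustration and do not describe how PraLeD or DANTE are actually stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexeme (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT, style TEXT);
CREATE TABLE example (lexeme_id INTEGER REFERENCES lexeme(id),
                      text TEXT, source TEXT);
CREATE TABLE relation (from_id INTEGER REFERENCES lexeme(id),
                       to_id INTEGER REFERENCES lexeme(id), type TEXT);
""")

# Invented sample data: two lexemes, one usage example, one word relation.
conn.executemany("INSERT INTO lexeme VALUES (?, ?, ?, ?)", [
    (1, "hunt", "noun", "neutral"),
    (2, "chase", "noun", "neutral"),
])
conn.execute("INSERT INTO example VALUES (1, 'They went on a hunt.', 'invented')")
conn.execute("INSERT INTO relation VALUES (1, 2, 'synonym')")

# Query the word relations of one lemma.
synonyms = [row[0] for row in conn.execute("""
    SELECT l2.lemma FROM relation r
    JOIN lexeme l1 ON l1.id = r.from_id
    JOIN lexeme l2 ON l2.id = r.to_id
    WHERE l1.lemma = 'hunt' AND r.type = 'synonym'
""")]
print(synonyms)
```

A dictionary is then one possible view over such data: an editor selects and shortens the information that a particular user group needs.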
Dictionary creation
• dictionary writing is expensive, laborious and time-consuming, and publishers face competition
• B. T. Sue Atkins, Michael Rundell: The Oxford Guide to Practical Lexicography
[Figure: parties involved in developing a dictionary. Boxes: Marketing Dept, Editorial Dept, IT Dept, Design Dept, Software House. Labels: user profiles, extent/contents, styles & sample entries, print design, develop dictionary]
Dictionary content
• macrostructure - entry list (+ preface, appendices...)
• Czech "heslo", sense 1 = headword, lemma, entry term (Czech: heslové slovo)
► noun singular, verb infinitive
► word parts, collocations
• Czech "heslo", sense 2 = entry (Czech: heslová stať)
• microstructure - structure of one entry in the dictionary
► checked by editing software
► easier orientation for the reader
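How editing software might derive the macrostructure and check the microstructure can be sketched as follows. The required fields, the entry format and the sample entries are assumptions made for this illustration.

```python
# Hypothetical microstructure: the fields every entry is required to have.
MICROSTRUCTURE = ["headword", "pos", "senses"]

entries = [
    {"headword": "lexicon", "pos": "noun",
     "senses": ["the vocabulary of a language"]},
    {"headword": "dictionary", "pos": "noun",
     "senses": ["a book that lists words"]},
    {"headword": "compile", "senses": ["to put together"]},  # "pos" is missing
]

def microstructure_errors(entry):
    """Return the required fields that are missing or empty in one entry."""
    return [f for f in MICROSTRUCTURE if not entry.get(f)]

# Macrostructure: the alphabetically ordered list of headwords.
macrostructure = sorted(e["headword"] for e in entries)

# Microstructure check: entries that violate the required structure.
invalid = {e["headword"]: microstructure_errors(e)
           for e in entries if microstructure_errors(e)}

print(macrostructure)
print(invalid)
```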
Electronic dictionaries
• more information (CD, DVD, web)
► presentation space
• multimedia, searching, navigation, updates
• longer descriptions, links to further resources
• display information based on user profile
• connection with corpora - ordnet.dk, DWDS.de...
• combining resources, downloading data - Wordnik.com
• user-created content (90-9-1 rule: 90 % of users only read, 9 % contribute occasionally, 1 % create most of the content) - Wiktionary, slovnik.zcu.cz...
• Macmillan - switch to digital only
• OED3 - 2000 to 2037, periodic updates
• shift from products to services