PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ, FI MU, Brno
December 16, 2020
Karel Pala, Adam Rambousek PA153 N LP Computational Lexicography 1/19
• Lexicography
► Introduction
► History
► Dictionaries and computers
• Computational Lexicography
► Data representation
► TEI
► LMF
► Dictionary Writing Systems
• Dictionary creation
► Lexical database
► Dictionary
Lexicography
• PLIN035 Computational Lexicography
• subfield of lexicology
• lexicography, lexikografie
► the activity or occupation of compiling dictionaries (Oxford d.)
► the editing or making of a dictionary (Merriam-Webster d.)
► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "A dictionary of the national language is among the first necessities of an educated person."
History
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
► Sumerian - Eblaite (the language of Ebla)
• The Oxford English Dictionary (A New English Dictionary)
► 1857, Philological Society, R. C. Trench criticizes the existing dictionaries
► 1879, James A. H. Murray appointed chief editor
► 1884-1928, published in 12 volumes, 15 487 pages, 240 000 entries
History
• Kancelář Slovníku jazyka českého, 1911
► volunteers gathering supporting materials
► excerpts from novels, poems, technical books, journals
► Příruční slovník jazyka českého, 1935-1957
► 10 824 pages, 250 000 entries
► quotes by "unwanted authors" censored (e.g. Karel Čapek cited only as Lid. nov., i.e. Lidové noviny)
Future?
• Akademický slovník současné češtiny (Academic Dictionary of Contemporary Czech)
► 2005-2010, lexical database (Praled)
► 2012-2016, applied research
► planned size: 120-150 thousand entries
► finished as of December 2020: A (2700), B (3500), C+Č (3600)
► mainly electronic (web, mobile)
► slovnikcestiny.cz
• The Oxford English Dictionary, 3rd Edition
► 2000-2037?, budget £34M
► "Every word in the Dictionary is being reviewed"
► periodical updates in batches, 4x/year
[Figure: OED3 Revision Progress. Number of entries (axis up to 300,000) by year, 2005-2010; series: UNREV, REV, NEW]
Dictionaries and computers
• 1960s - computers come into use: lexicographers still write on paper, operators type the text into a database; Brown Corpus
• 1978, Longman Dictionary of Contemporary English
► first dictionary with a restricted defining vocabulary, checked automatically
► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
► contemporary corpus (Bank of English)
► 1987, Collins COBUILD English Language Dictionary
► 1st dictionary based on corpus data
► new definition style - full sentence
► If a person, animal, or other living thing is killed, something or someone causes them to die.
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative
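The automatic check of a restricted defining vocabulary mentioned for the Longman dictionary can be sketched in a few lines. This is only an illustration, not Longman's actual procedure: the vocabulary below is a hypothetical toy set (a real one would be far larger), the function name is made up, and inflected forms are not reduced to base forms.

```python
import re

# Hypothetical toy defining vocabulary, for illustration only.
DEFINING_VOCABULARY = {
    "a", "or", "to", "make", "cause", "die",
    "someone", "something", "person", "animal",
}


def words_outside_vocabulary(definition, vocabulary):
    """Return the words of `definition` that are not in `vocabulary`.

    Words are lower-cased and reported once, in order of first appearance.
    No lemmatisation is done, so "dies" would be reported even if "die"
    is in the vocabulary.
    """
    tokens = re.findall(r"[a-z]+", definition.lower())
    outside = []
    for token in tokens:
        if token not in vocabulary and token not in outside:
            outside.append(token)
    return outside


# A definition that stays inside the vocabulary yields an empty list.
print(words_outside_vocabulary("to make someone or something die",
                               DEFINING_VOCABULARY))
# A definition with other words yields the words an editor should review.
print(words_outside_vocabulary("to terminate the life of an animal",
                               DEFINING_VOCABULARY))
```

A production check would also need lemmatisation and a policy for proper nouns and cross-referenced headwords.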
XML
• PB138 Modern Markup Languages
• eXtensible Markup Language - markup (meta)language
• rules for properly formatted document - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements enclose content
• elements without content may be shortened to an empty-element tag
• attributes
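To make elements, empty elements and attributes concrete, here is a minimal sketch that parses a small dictionary entry with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`entry`, `headword`, `pos`, `sense`, `definition`, `id`, `value`, `n`) are invented for this illustration and do not follow TEI, LMF or any other standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry markup: <headword> is an element with content,
# <pos .../> is an element without content in its shortened form,
# and id, value and n are attributes.
ENTRY_XML = """
<entry id="kill-1">
  <headword>kill</headword>
  <pos value="verb"/>
  <sense n="1">
    <definition>to make someone or something die</definition>
  </sense>
</entry>
"""

entry = ET.fromstring(ENTRY_XML)

headword = entry.find("headword").text          # element content
pos = entry.find("pos").get("value")            # attribute of an empty element
sense_numbers = [s.get("n") for s in entry.findall("sense")]

print(entry.get("id"), headword, pos, sense_numbers)
```

Because the document is well-formed, any XML parser can read it without knowing the tag set in advance, which is what makes the format convenient for exchanging dictionary data.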
Structure and content description
• DTD (Document Type Definition)
► list of elements and attributes, and their relations
► no content checking
• XML Schema (XSD, XML Schema Definition)
► description of XML document structure and content, schema itself is XML document
► elements, attributes, structure
► possibility to define custom content types (e.g. postal address)
► content checking (e.g. number range, regular expressions, allowed values)
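The three kinds of content checking listed for XML Schema (number range, regular expression, allowed values) can be imitated in plain Python to show what such constraints catch. This sketch is not an XSD validator and does not use a schema file. It hand-codes the checks for a hypothetical entry format, and all names and limits in it are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed constraints for a hypothetical entry format.
ALLOWED_POS = {"noun", "verb", "adjective", "adverb"}      # allowed values
ENTRY_ID_PATTERN = re.compile(r"[a-z]+-[0-9]+")            # regular expression
SENSE_NUMBER_RANGE = range(1, 100)                         # number range 1..99


def check_entry(entry):
    """Return a list of human-readable problems found in one <entry> element."""
    problems = []

    if not ENTRY_ID_PATTERN.fullmatch(entry.get("id", "")):
        problems.append("id does not match the pattern")

    pos = entry.find("pos")
    if pos is None or pos.get("value") not in ALLOWED_POS:
        problems.append("pos value is not allowed")

    for sense in entry.findall("sense"):
        n = sense.get("n", "")
        if not (re.fullmatch(r"[0-9]+", n) and int(n) in SENSE_NUMBER_RANGE):
            problems.append("sense number out of range: %r" % n)

    return problems


good = ET.fromstring('<entry id="kill-1"><pos value="verb"/><sense n="1"/></entry>')
bad = ET.fromstring('<entry id="Kill"><pos value="vrb"/><sense n="0"/></entry>')

print(check_entry(good))
print(check_entry(bad))
```

With a real schema the same rules would be declared once in the XSD and enforced by a validating parser, so editors and tools share a single definition of a valid entry.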
Display
• XSLT - eXtensible Stylesheet Language (Transformations)
• converting XML to another format
► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of XML document, recursive processing of the document
• (functional programming language)
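The idea of small templates applied recursively can be imitated in Python. The sketch below is not XSLT. It only mirrors the processing model, with one function per element name and a fallback for everything else, and it converts a hypothetical entry to HTML. All element names and the HTML layout are assumptions, and text that follows a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from html import escape


def apply_templates(node):
    """Pick the template registered for this element name and run it."""
    template = TEMPLATES.get(node.tag, default_template)
    return template(node)


def process_children(node):
    """Recursively transform all child elements and join the results."""
    return "".join(apply_templates(child) for child in node)


def default_template(node):
    # Fallback for elements without their own template:
    # output the text and continue with the children.
    return escape(node.text or "") + process_children(node)


def entry_template(node):
    return '<div class="entry">' + process_children(node) + "</div>"


def headword_template(node):
    return "<b>" + escape(node.text or "") + "</b>"


def definition_template(node):
    return " <i>" + escape(node.text or "") + "</i>"


# One small template per element name (hypothetical tag set).
TEMPLATES = {
    "entry": entry_template,
    "headword": headword_template,
    "definition": definition_template,
}


def transform(xml_text):
    return apply_templates(ET.fromstring(xml_text))


print(transform(
    "<entry><headword>kill</headword>"
    "<sense><definition>to make someone die</definition></sense></entry>"
))
```

Each template only knows how to render its own element and delegates the rest, which is why the same set of templates works for entries of any depth. A real stylesheet would express the same rules declaratively and be run by an XSLT processor.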
[Figure: XML dictionary data displayed in different forms: the entry "lov" from SSJČ (Slovník spisovného jazyka českého) rendered as formatted text, and French-English entries around "sans" shown both as an editing tree and as formatted dictionary text]