IB047 Introduction to Corpus Linguistics and Computer Lexicography

Faculty of Informatics
Spring 2008
Extent and Intensity
2/0. 2 credit(s) (plus extra credits for completion). Recommended Type of Completion: zk (examination). Other types of completion: k (colloquium), z (credit).
Teacher(s)
prof. PhDr. Karel Pala, CSc. (lecturer)
doc. Mgr. Pavel Rychlý, Ph.D. (lecturer)
Guaranteed by
prof. Ing. Václav Přenosil, CSc.
Department of Machine Learning and Data Processing – Faculty of Informatics
Contact Person: prof. PhDr. Karel Pala, CSc.
Timetable
Wed 8:00–9:50 B410
Course Enrolment Limitations
The course is also offered to students of fields other than those with which it is directly associated.
Course objectives
A basic introduction to the fields of corpus linguistics and computational lexicography. Students will study types of corpora and how corpora are built and used, especially for the purpose of building dictionaries.
Syllabus
  • Information technologies and language (text) corpora. Beginning of corpus linguistics, purpose of corpora.
  • Corpus data, corpus types and their standardization, SGML, XML, TEI, CES. Annotated corpora, tagging on various levels: structural tagging, grammatical tagging -- POS, lemmata, word forms. Syntactic tagging, treebanks, skeleton analysis. Parallel corpora, alignment programmes. Tools for automatic and semi-automatic annotation, disambiguation.
  • Building corpora, maintenance. Corpus tools: corpus manager. Concordance programmes. Queries, regular expressions and their use. Statistical programmes, absolute and relative frequencies, MI and T-score. Work with corpus attributes and tags.
  • Working with corpora -- CNC, SUSANNE, Prague Dependency Treebank. Words, constructions, collocations.
  • Computational lexicography, lexicology.
  • Description of meanings (semantic features).
  • Types of computer dictionaries. Lexicography standards.
  • Data for dictionary building -- corpora.
  • Lexicographic software tools. Lemmatizers.
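As an illustration of the statistical topics listed above (absolute and relative frequencies, MI and T-score), the following is a minimal sketch in Python. It assumes the commonly used formulations MI = log2(f_xy * N / (f_x * f_y)) and T-score = (f_xy - f_x * f_y / N) / sqrt(f_xy); the function names and the counts in the usage lines are hypothetical and are not taken from the course materials or from any particular corpus manager.

```python
import math


def relative_frequency(count, corpus_size, per=1_000_000):
    """Convert an absolute frequency into a frequency per `per` tokens."""
    return count * per / corpus_size


def mi_score(f_xy, f_x, f_y, n):
    """Mutual information of a word pair.

    f_xy: co-occurrence count of the pair
    f_x, f_y: absolute frequencies of the two words
    n: corpus size in tokens
    """
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """T-score: observed minus expected co-occurrences, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


# Invented toy counts, chosen only so the arithmetic is easy to follow.
print(mi_score(8, 16, 32, 1024))   # 4.0
print(t_score(16, 32, 32, 1024))   # 3.75
```

MI tends to favour rare but exclusive pairs, while T-score favours frequent pairs, which is why the two measures are usually inspected side by side when looking for collocations.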
Literature
  • SAMPSON, Geoffrey. English for the computer: the SUSANNE corpus and analytic scheme. Oxford: Clarendon Press, 1995, ix, 499 p. ISBN 0198240236.
  • RYCHLÝ, Pavel. Korpusové manažery a jejich efektivní implementace. Brno, 2000, xiv, 128 p.
  • Computational lexicography for natural language processing. Edited by Ted Briscoe and Bran Boguraev. London: Longman, 1989, xiv, 310 p. ISBN 0-470-21187-3.
  • SAMPSON, Geoffrey. Empirical linguistics. London: Continuum, 2001, viii, 226 p. ISBN 0-8264-4883-6.
  • Corpus processing for lexical acquisition. Edited by Bran Boguraev and James Pustejovsky. Cambridge: Bradford Book, 1996, xi, 245 p. ISBN 0-262-02392-X.
Language of instruction
Czech
Further Comments
The course is taught annually.
The course is also listed under the following terms: Spring 2003, Spring 2004, Spring 2005, Spring 2006, Spring 2007, Spring 2009, Spring 2010, Spring 2011, Spring 2012, Spring 2013, Spring 2014, Spring 2015, Spring 2016, Spring 2017, Spring 2018, Spring 2019, Spring 2020, Spring 2021, Spring 2022, Spring 2023, Spring 2024, Spring 2025.
  • Permalink: https://is.muni.cz/course/fi/spring2008/IB047