FF:PLIN057 Automatic processing of text - Course Information
PLIN057 Automatic processing of text
Faculty of Arts
Autumn 2022
- Extent and Intensity
- 0/2/0. 4 credit(s). Type of Completion: z (credit).
- Teacher(s)
- prof. Radek Čech, Ph.D. (lecturer)
- Guaranteed by
- prof. Radek Čech, Ph.D.
Department of Czech Language – Faculty of Arts
Contact Person: Jaroslava Vybíralová
Supplier department: Department of Czech Language – Faculty of Arts
- Timetable
- Wed 14:00–15:40 G13
- Prerequisites
- None.
- Course Enrolment Limitations
- The course is also offered to students of fields other than those it is directly associated with.
The capacity limit for the course is 20 student(s).
Current registration and enrolment status: enrolled: 2/20, only registered: 0/20, only registered with preference (fields directly associated with the programme): 0/20
- Fields of study / plans the course is directly associated with
- there are 13 fields of study the course is directly associated with
- Course objectives
- In this course, students will learn the basic skills necessary for automatic text processing in Python. They will learn how to analyze text, extract the necessary information (especially frequency characteristics), test hypotheses, and process this information appropriately according to statistical research standards.
The course is primarily intended for students who have no experience with this topic.
- Learning outcomes
- After the course the student will be familiar with the problems of text processing and will be able to:
- search texts
- create dictionaries
- analyse texts with regard to their lexical diversity (vocabulary richness)
- use regular expressions
- visualise text characteristics
- statistically test differences between texts.
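The first outcomes listed above (searching texts, creating dictionaries, using regular expressions) can be illustrated with a minimal sketch in Python. It is not taken from the course materials: the function names `tokenize`, `frequency_list` and `concordance`, the sample sentence, and the simple `\w+` tokenization rule are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (runs of word characters).

    Assumption: a plain regex split is enough for illustration; real texts
    usually need a more careful tokenizer.
    """
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Return (word, absolute frequency, relative frequency), most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, n / total) for word, n in counts.most_common()]

def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, used only to demonstrate the functions.
sample = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample)
freq = frequency_list(tokens)
kwic = concordance(tokens, "sat")
```

Here `freq[0]` is the most frequent word with its absolute and relative frequency, and `kwic` holds one context line per occurrence of the keyword.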
- Syllabus
- Basics in Python - variable types, basic functions.
- Tokenization, creating a dictionary, frequency list, relative frequencies, ordered dictionaries, stop list, creating a frequency list of autosemantic (content) words.
- Searching for words in text, creating concordance lines.
- Regular expressions.
- Lexical diversity: TTR, TTR from segment/segments of text, MATTR, proportion of hapax legomena, entropy.
- Word length: mean, median, mode, SD, length distribution, visualization (barplot, boxplot)
- Statistical testing.
- UDPipe - automatic data annotation
- Searching by 2 or more attributes.
- Comparison of POS proportions, syntactic functions - chi-squared test.
- Sentence length, clause length. Measurement of readability.
- Cluster analysis
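The lexical diversity measures named in the syllabus (TTR, MATTR, proportion of hapax legomena, entropy) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the course's own implementation: the function names and the window size are hypothetical, the hapax proportion is computed here relative to the number of word types (it can also be defined relative to tokens), and entropy is the Shannon entropy of the word frequency distribution in bits.

```python
import math
from collections import Counter

def ttr(tokens):
    """Type-token ratio: number of distinct words divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=5):
    """Moving-average TTR: mean TTR over all overlapping windows of `window` tokens.

    Assumption: if the text is shorter than the window, fall back to plain TTR.
    """
    if len(tokens) < window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

def hapax_proportion(tokens):
    """Share of word types that occur exactly once (assumed: relative to types)."""
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)

def entropy(tokens):
    """Shannon entropy (in bits) of the relative word frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical toy input, used only to demonstrate the functions.
toy = "a b a c".split()
results = {
    "ttr": ttr(toy),
    "mattr": mattr(toy, window=3),
    "hapax": hapax_proportion(toy),
    "entropy": entropy(toy),
}
```

Unlike plain TTR, MATTR averages over windows of fixed size, which makes values comparable between texts of different lengths.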
- Literature
- recommended literature
- Manual pages of the individual utilities.
- BRANDEJS, Michal. UNIX - Linux : praktický průvodce. 1st ed. Prague: Grada, 1996, 340 pp. ISBN 8071691704.
- Teaching methods
- teaching, practicing, discussion
- Assessment methods
- Credit is awarded for attendance, active participation, and passing a test.
- Language of instruction
- Czech
- Further Comments
- Permalink: https://is.muni.cz/course/phil/autumn2022/PLIN057