PLIN057 Automatic processing of text

Faculty of Arts
Autumn 2023
Extent and Intensity
0/2/0. 4 credit(s). Type of Completion: z (credit).
Teacher(s)
prof. Radek Čech, Ph.D. (lecturer)
RNDr. Zuzana Nevěřilová, Ph.D. (lecturer)
Mgr. Hana Žižková, Ph.D. (lecturer)
Guaranteed by
prof. Radek Čech, Ph.D.
Department of Czech Language – Faculty of Arts
Contact Person: Jaroslava Vybíralová
Supplier department: Department of Czech Language – Faculty of Arts
Timetable
Wed 10:00–11:40 G13, except Wed 15. 11.
Prerequisites
None.
Course Enrolment Limitations
The course is also open to students from fields other than those it is directly associated with.
The capacity limit for the course is 20 student(s).
Current registration and enrolment status: enrolled: 10/20, only registered: 0/20, only registered with preference (fields directly associated with the programme): 0/20
Course objectives
In this course, students will learn the basic skills necessary for automatic text processing in Python. They will learn how to analyze text, extract the necessary information (especially frequency characteristics), test hypotheses, and process this information appropriately according to statistical research standards.
The course is primarily intended for students who have no experience with this topic.
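As a rough illustration of the frequency characteristics mentioned above, the following minimal Python sketch tokenizes a short text and builds a frequency list with absolute and relative frequencies. It is not taken from the course materials: the function names `tokenize` and `frequency_list`, the sample sentence, and the simple regular-expression tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of word characters as tokens.

    Assumption: a simple regex is good enough here; real tokenizers
    handle punctuation, hyphens, and abbreviations more carefully.
    """
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Return (word, absolute frequency, relative frequency) tuples,
    ordered from the most to the least frequent word."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, n / total) for word, n in counts.most_common()]

# Hypothetical sample text, used only for demonstration.
sample = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample)

# Print the three most frequent words with their frequencies.
for word, n, rel in frequency_list(tokens)[:3]:
    print(f"{word}\t{n}\t{rel:.2f}")
```

`Counter.most_common()` already returns the words sorted by descending frequency, so no separate sorting step is needed in this sketch.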
Learning outcomes
After the course the student will be familiar with the problems of text processing and will be able to:
  • search texts
  • create dictionaries
  • analyse texts with regard to their lexical diversity (vocabulary richness)
  • use regular expressions
  • visualise text characteristics
  • statistically test differences between texts.
Syllabus
    • Basics in Python - variable types, basic functions.
    • Tokenization, creating a dictionary, frequency list, relative frequencies, ordered dictionaries, stop list, creating a frequency list of autosemantic (content) words.
    • Searching for words in text, creating concordance lines.
    • Regular expressions.
    • Lexical diversity: TTR, TTR from segment/segments of text, MATTR, hapax legomenon proportions, entropy.
    • Word length: mean, median, mode, SD, length distribution, visualization (barplot, boxplot)
    • Statistical testing.
    • UDPipe - automatic data annotation
    • Searching by 2 or more attributes.
    • Comparison of POS proportions, syntactic functions - chi-squared test.
    • Sentence length, clause length. Measurement of readability.
    • Cluster analysis
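
The lexical diversity measures named in the syllabus can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the course's own implementation: the function names, the window size of 5, and the sample text are hypothetical, and the hapax proportion is computed here relative to the number of types (it can also be defined relative to the number of tokens).

```python
import math
from collections import Counter

def ttr(tokens):
    """Type-token ratio: number of distinct words divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=5):
    """Moving-average TTR: mean TTR over all overlapping windows of a fixed size.

    Assumption: if the text is shorter than the window, fall back to plain TTR.
    """
    if len(tokens) < window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

def hapax_proportion(tokens):
    """Share of types that occur exactly once (hapax legomena)."""
    counts = Counter(tokens)
    return sum(1 for n in counts.values() if n == 1) / len(counts)

def entropy(tokens):
    """Shannon entropy, in bits, of the word frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical, already tokenized sample text.
tokens = "the cat sat on the mat the dog sat too".split()

print("TTR:", ttr(tokens))
print("MATTR (window 5):", mattr(tokens))
print("Hapax proportion:", hapax_proportion(tokens))
print("Entropy (bits):", entropy(tokens))
```

Plain TTR falls as a text gets longer, which is why window-based variants such as MATTR are commonly used when texts of different lengths are compared. All four functions assume a non-empty token list.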
    Literature
      recommended literature
    • Manual pages of the individual utilities.
    • BRANDEJS, Michal. UNIX - Linux: praktický průvodce. 1st ed. Prague: Grada, 1996. 340 pp. ISBN 8071691704.
    Teaching methods
    teaching, practicing, discussion
    Assessment methods
    Credit is awarded for attendance, active participation, and passing the test.
    Language of instruction
    Czech
    Further comments (probably available only in Czech)
    Information on course enrolment limitations: The course is not suitable for first-year students.
    The course is also listed under the following terms: Spring 2018, Spring 2019, Autumn 2022, Spring 2025.
    • Permalink: https://is.muni.cz/course/phil/autumn2023/PLIN057