PLIN080 Fundamentals of language data processing for machine learning in the humanities

Faculty of Arts
Autumn 2024
Extent and Intensity
0/2/0. 4 credit(s). Type of Completion: z (credit).
Teacher(s)
prof. Radek Čech, Ph.D. (lecturer)
Mgr. Helena Medková (lecturer)
Guaranteed by
prof. Radek Čech, Ph.D.
Department of Czech Language – Faculty of Arts
Contact Person: Bc. Silvie Hulewicz, DiS.
Supplier department: Department of Czech Language – Faculty of Arts
Timetable
Mon 16:00–17:40 G13, except Mon 18. 11. to Sun 24. 11.
Prerequisites (in Czech)
FAKULTA(FF) && FORMA(P)
Course Enrolment Limitations
The course is also open to students from fields other than those it is directly associated with.
The capacity limit for the course is 20 student(s).
Current registration and enrolment status: enrolled: 14/20, only registered: 0/20, only registered with preference (fields directly associated with the programme): 0/20
Course objectives
The course is intended for students of linguistics and computational linguistics with little or no prior knowledge of machine learning who want to gain practical skills for machine learning projects. Students will learn the fundamentals of computational natural language processing, with an emphasis on creating training and test sets for machine learning applications in linguistic research.
Learning outcomes
Students will gain practical experience in collecting data with the corpus manager Sketch Engine, creating training and test data sets, and modifying and manipulating data using Python and selected libraries (pandas, re, NLTK, scikit-learn, Matplotlib, etc.).
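As a rough illustration of what creating a training/test data set can involve, the following minimal sketch cleans a handful of sentences, removes exact duplicates, and splits the rest at random. It uses only the Python standard library; the function names, the sample sentences, the 80/20 ratio, and the seed are illustrative assumptions, not course material.

```python
import random
import re


def clean_sentence(text):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(sentences, test_ratio=0.2, seed=42):
    """Clean, deduplicate, shuffle, and split sentences into (train, test).

    test_ratio and seed are hypothetical defaults chosen for this sketch.
    """
    cleaned = (clean_sentence(s) for s in sentences)
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    unique = [s for s in dict.fromkeys(cleaned) if s]
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(unique)
    # Keep at least one sentence in the test set.
    n_test = max(1, round(len(unique) * test_ratio))
    return unique[n_test:], unique[:n_test]


# Made-up sample data: one duplicate that differs only in whitespace.
raw_sentences = [
    "The cat sat on the mat.",
    "The  cat sat on the mat. ",
    "Corpora are collections of texts.",
    "Annotation requires clear guidelines.",
    "Tokenization splits text into units.",
    "Lemmas group inflected word forms.",
]

train, test = split_sentences(raw_sentences)
print(len(train), "training sentences;", len(test), "test sentences")
```

Deduplicating before the split matters: if the same sentence lands in both sets, the test score overestimates how well a model generalizes.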
Syllabus
  • 1. Introduction: overview of machine learning methods, first exercise in Google Colab.
  • 2. Data set types: data sets according to learning tasks, research objectives in linguistics, and data set creation.
  • 3. Data preprocessing: data cleaning, duplicate removal, tokenization, lemmatization, morphological analysis, and syntactic analysis (UDPipe, Majka, Desamb tools).
  • 4. Data annotation: annotation scheme, measuring inter-annotator agreement.
  • 5. Linguistic data analysis: data set statistics and visualization in graphs.
  • 6. Supervised machine learning: training a language model for a classification task, model evaluation, and cross-validation.
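Inter-annotator agreement (item 4 above) is often reported as Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The syllabus does not say which measure the course uses, so the sketch below is only one common option; the function name and the "pos"/"neg" labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of both annotators'
    # relative frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up annotations: the annotators disagree on items 2 and 7 (0-based).
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "pos", "neg", "pos", "neg", "pos", "pos", "pos", "neg"]
annotator_b[7] = "neg"  # restore agreement on item 7
annotator_b[0] = "neg"  # and introduce a disagreement on item 0 instead

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

In this example the annotators agree on 8 of 10 items and each uses both labels equally often, so the chance agreement is 0.5 and kappa comes out at about 0.6.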
Literature
  • GÉRON, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems. Third edition. Beijing: O'Reilly, 2022, xxv, 834. ISBN 9781098125974.
Teaching methods
Seminar, computer practice (Google Colaboratory tool), independent work, consultation. The lessons will be online.
Assessment methods
Submission of an annotated data set of 500 sentences, regular homework submission, and active participation in class.
Language of instruction
Czech
Further comments (probably available only in Czech)
The course is taught annually.
Teacher's information
The course is structured to alternate between instruction and independent student work.
The course is also listed under the following terms: Autumn 2023.
  • Permalink: https://is.muni.cz/course/phil/autumn2024/PLIN080