FF:PLIN080 Intro. to quant. ling.

PLIN080 Creating data set

Faculty of Arts
Autumn 2023

Extent and Intensity

0/2/0. 4 credit(s). Type of Completion: z (credit).

Teacher(s)

prof. Radek Čech, Ph.D. (lecturer)
Mgr. Helena Medková (lecturer)

Guaranteed by

prof. Radek Čech, Ph.D.
Department of Czech Language – Faculty of Arts
Contact Person: Jaroslava Vybíralová
Supplier department: Department of Czech Language – Faculty of Arts

Timetable

Mon 16:00–17:40 G13, except Mon 13. 11.

Course Enrolment Limitations

The course is also open to students from fields other than those it is directly associated with.
The capacity limit for the course is 20 students.
Current registration and enrolment status: enrolled: 2/20, only registered: 0/20, only registered with preference (fields directly associated with the programme): 0/20

fields of study / plans the course is directly associated with

Computational Linguistics (programme FF, B-PLIN_) (3)

Course objectives

The course is intended for students of linguistics and computational linguistics with little or no prior knowledge of machine learning who want to gain practical skills for machine learning projects. Students will learn the fundamentals of computational natural language processing, with an emphasis on creating training and test sets for machine learning applications in linguistic research.

Learning outcomes

Students will gain practical experience in collecting data with the corpus manager Sketch Engine, creating training and test data sets, and cleaning, manipulating, and visualizing data in Python with selected libraries (pandas, re, NLTK, scikit-learn, Matplotlib, etc.).
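As an illustration of the kind of data preparation described above, the following is a minimal sketch that cleans sentences, removes duplicates, and splits the result into training and test sets. It uses only the Python standard library; the function names, the sample sentences, the 80/20 ratio, and the fixed seed are assumptions made for this example and are not taken from the course materials.

```python
import random
import re


def clean_sentence(text):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()


def build_split(sentences, test_ratio=0.2, seed=42):
    """Clean, deduplicate, shuffle, and split sentences into (train, test)."""
    cleaned = [clean_sentence(s) for s in sentences]
    # dict.fromkeys keeps the first occurrence of each sentence, in order;
    # sentences that are empty after cleaning are dropped.
    unique = list(dict.fromkeys(s for s in cleaned if s))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique)
    n_test = round(len(unique) * test_ratio)
    return unique[n_test:], unique[:n_test]


# Hypothetical raw data with a duplicate and a blank line.
raw = [
    "Kočka  spí na gauči. ",
    "Kočka spí na gauči.",
    "Pes štěká.",
    "   ",
    "Prší už\ttřetí den.",
    "Vlak má zpoždění.",
    "Dnes je pondělí.",
]

train, test = build_split(raw)
print(len(train), "training sentences,", len(test), "test sentences")
```

Deduplicating before the split matters: if the same sentence ended up in both sets, the test set would no longer measure performance on unseen data.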

Syllabus

1. Introduction: assignment overview, introduction to machine learning methods.
2. Data set types: data sets according to learning tasks, research objectives in linguistics, and data set creation.
3. Data preprocessing: data cleaning, duplicate removal, tokenization, lemmatization, morphological analysis, and syntactic analysis (UDPipe, Majka, Desamb tools).
4. Data annotation: annotation scheme, inter-annotator agreement measurement.
5. Linguistic data analysis: data set statistics and visualization in graphs.
6. Machine learning: supervised and unsupervised learning, training a language model for the classification task, model evaluation, and cross-validation.
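To make the inter-annotator agreement measurement in item 4 concrete, here is a minimal sketch of Cohen's kappa for two annotators. The syllabus does not say which agreement measure the course uses, so Cohen's kappa is an assumption chosen because it is a common one. The function name and the labels are hypothetical. With observed agreement p_o and chance agreement p_e, kappa is (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of the two annotators'
    # relative frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten sentences by two annotators.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.6
```

In this example the annotators agree on 8 of 10 items (p_o = 0.8) and each uses both labels equally often (p_e = 0.5), so kappa is about 0.6. Raw agreement of 80% overstates reliability because half of it would be expected by chance.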

Literature

GÉRON, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems. Third edition. Beijing: O'Reilly, 2022, xxv, 834 pp. ISBN 9781098125974.

Teaching methods

Seminar, computer practice (Google Colaboratory tool), independent work, consultation.

Assessment methods

To receive credit, students must deliver two well-annotated data sets of 1,000 sentences each. The evaluation criteria include timely submission of homework and active class participation.

Language of instruction

Czech

Further comments (probably available only in Czech)

The course is taught annually.

Teacher's information

The course is structured to alternate between instruction and independent student work.

The course is also listed under the following terms: Autumn 2024, Autumn 2025.

Permalink: https://is.muni.cz/course/phil/autumn2023/PLIN080
