Rap corpus

The RapCor corpus is one of the smaller, topic-specific corpora of the French language. It is being developed at the Institute of Romance Languages and Literatures of the Faculty of Arts of Masaryk University in Brno under the leadership of doc. PhDr. Alena Polická, Ph.D.

It is a corpus of spoken French in rap songs, created for socio-lexical research purposes. The specific character of rap lyrics gives broader insight into substandard French, in particular into the dynamics of generational and ethno-socio-geographical word formation and of neology in relation to lexicography. The corpus can also serve those interested in modern poetics or sociolinguistics (especially in relation to multiethnic suburbs).

Introduction: what is a corpus?

The word corpus refers to a collection of texts under study. With the growth of computing power, however, the term is increasingly used to mean an electronic corpus, i.e. a collection of computer-stored and processed texts (or transcripts of audio recordings) used for linguistic research. Because results can be retrieved and evaluated easily, it is possible to obtain much more reliable information and statistics than in the era of card catalogues.

Electronic language corpora began to emerge together with the development of computer technology in the last decades of the 20th century. Today, there are a number of small and large corpora for most of the world's languages; the largest aim to describe an entire national language and contain several hundred million word forms. For the Czech language, for example, the Institute of the Czech National Corpus at the Faculty of Arts of Charles University in Prague is actively building the Czech National Corpus (ČNK), made up of several subcorpora of written and spoken texts. For the French language, the largest corpus is Frantext, a corpus of mainly literary texts, conceived at the University of Nancy. There are also a number of smaller corpora; among these we mention only corpora of spoken French, e.g. ESLO or CLAPI.

RapCor corpus

RapCor has been under development since 2009, originally within the postdoctoral project of the Grant Agency of the Czech Republic "Expressivity in Youth Slang on the Background of the Search for Self and Group Identity" (GP405/09/P307). The source material is collected and given its first edit in cooperation with students of French, who obtain the lyrics of selected French rap songs either from fan transcriptions available on the Internet or (currently the preferred option) directly from the original lyrics printed on CD covers, where these are included.

The texts are then checked against the audio recording and any discrepancies are corrected, so that the result is a faithful transcription of the rapped text. Using the TreeTagger program, the texts are automatically segmented into individual words, which are lemmatized (converted to a base form, the so-called lemma) and supplemented with tags for grammatical categories. Because neologisms and substandard expressions are very frequent, the result is refined manually, and the automatic assignment of grammatical categories is also checked. Substandard expressions are marked in two ways: from a lexicographic point of view, according to the usage label in the reference dictionary (e.g. colloquial or vulgar words), and from a word-formation point of view.
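The manual refinement step can be illustrated with a minimal sketch. It assumes the tagger output is in TreeTagger's usual three-column, tab-separated form (word form, tag, lemma), where a lemma the tagger could not determine appears as `<unknown>`. The sample lines and the helper names (`parse_treetagger_output`, `needs_manual_review`) are invented for illustration and do not come from the RapCor tooling.

```python
from typing import List, NamedTuple

# TreeTagger prints this placeholder when it cannot determine a lemma.
UNKNOWN_LEMMA = "<unknown>"


class Token(NamedTuple):
    form: str   # word form as it appears in the lyrics
    tag: str    # tag for the grammatical category
    lemma: str  # base form, or "<unknown>"


def parse_treetagger_output(text: str) -> List[Token]:
    """Parse tab-separated 'form<TAB>tag<TAB>lemma' lines into tokens."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tokens.append(Token(*parts))
    return tokens


def needs_manual_review(token: Token) -> bool:
    """Flag tokens whose lemma the tagger could not determine."""
    return token.lemma == UNKNOWN_LEMMA


# Invented sample, not actual tagger output for any RapCor text.
SAMPLE = (
    "je\tPRO:PER\tje\n"
    "kiffe\tVER:pres\t<unknown>\n"
    "le\tDET:ART\tle\n"
    "rap\tNOM\trap\n"
)

tokens = parse_treetagger_output(SAMPLE)
to_review = [t.form for t in tokens if needs_manual_review(t)]
print(to_review)
```

Such a filter only finds the tokens the tagger itself gave up on; tokens that received a wrong lemma or tag would still have to be caught by the manual check described above.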

Le Petit Robert électronique is used as the reference dictionary, from which the labels for substandard usage are also taken. The same dictionary serves as the exclusion dictionary for identifying neologisms and omitted lexemes: a lexeme that is not recorded in it is treated as a candidate. Le Petit Larousse électronique serves as the exclusion dictionary for proper names.
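The idea of using a dictionary as a filter can be sketched as a simple set difference: lemmas that do not appear among the dictionary's headwords become candidates for neologisms or omitted lexemes. This is only an illustration under stated assumptions; the headword set below is a toy list that does not reflect the actual contents of Le Petit Robert, and the function name `candidate_neologisms` is hypothetical.

```python
from typing import Iterable, List, Set


def candidate_neologisms(lemmas: Iterable[str], headwords: Set[str]) -> List[str]:
    """Return lemmas absent from the headword set, once each, in order seen."""
    seen = set()
    candidates = []
    for lemma in lemmas:
        key = lemma.lower()  # compare case-insensitively
        if key not in headwords and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates


# Toy data, invented for illustration only.
REFERENCE_HEADWORDS = {"je", "le", "rap", "aimer"}
SONG_LEMMAS = ["je", "kiffer", "le", "rap", "Kiffer", "aimer"]

print(candidate_neologisms(SONG_LEMMAS, REFERENCE_HEADWORDS))
```

The sketch lowercases every lemma and therefore ignores proper names, which the text above says are checked against a separate dictionary; the candidates it returns would still need manual verification.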

Finally, the annotated table of morphosyntactic tags, lemmas and other information (on the structure of the song and on the performer of the song or of its individual parts) is converted into an HTML file and supplied with metadata from the associated database. This is possible thanks to the technical help of M.Sc. Marko Stehlík from the Computing Center of the Faculty of Informatics of Masaryk University (CVT FI MU). Current and older files of all processed texts can then be downloaded from our Repository and imported into the lexicometric program TXM.

The latest version can also be used in the Sketch Engine client application (a corpus manager and text analysis tool; a licensed resource, accessible free of charge to students of the Faculty of Arts, FF MU). Its co-author is doc. M.Sc. Pavel Rychlý, Ph.D. from the Department of Machine Learning and Data Processing FI MU. The oldest version of the corpus used his earlier tools (the corpus manager Manatee and the client application Bonito).