The C-ORAL-BRASIL Project studies spontaneous speech in Brazilian Portuguese by compiling a corpus comparable to those envisioned by the C-ORAL-ROM Project.
The Project is coordinated by Tommaso Raso and Heliana Mello of the Universidade Federal de Minas Gerais, Brazil, and has received funding from the Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the Universidade Federal de Minas Gerais (UFMG) and Banco Santander.
Corpus specifications summary
The C-ORAL-BRASIL corpus targets 200 texts and 300,000 words, divided into a formal half (under development) and an informal half (completed).
The informal half is made up of two domains: private/family (80%) and public (20%). Each domain is textually divided into monologues (1/3), dialogues (1/3) and conversations (1/3).
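The proportions above can be turned into approximate word targets for each cell of the informal half. The following is only a minimal sketch: it assumes that "half" means half of the 300,000 target words and that the percentages apply to word counts rather than to numbers of texts, neither of which is stated explicitly here. All names in the code are hypothetical and chosen for illustration.

```python
from fractions import Fraction

TOTAL_WORDS = 300_000             # overall target stated for the corpus
INFORMAL_SHARE = Fraction(1, 2)   # assumption: "half" refers to word count

# Shares of the informal half, as given in the description.
DOMAINS = {"private/family": Fraction(4, 5), "public": Fraction(1, 5)}
TEXT_TYPES = {
    "monologue": Fraction(1, 3),
    "dialogue": Fraction(1, 3),
    "conversation": Fraction(1, 3),
}


def informal_targets(total_words=TOTAL_WORDS):
    """Return approximate word targets per (domain, text type) cell."""
    informal_words = total_words * INFORMAL_SHARE
    return {
        (domain, text_type): int(informal_words * d_share * t_share)
        for domain, d_share in DOMAINS.items()
        for text_type, t_share in TEXT_TYPES.items()
    }


targets = informal_targets()
for (domain, text_type), words in targets.items():
    print(f"{domain:15s} {text_type:13s} {words:>7,d}")
```

Under these assumptions each private/family text type would receive about 40,000 words and each public text type about 10,000 words, for a total of 150,000 words in the informal half.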
The primary goal of the corpus architecture is to represent diaphasic variation in Brazilian speech, with special attention to the Mineiro diatopy (in particular the metropolitan area of Belo Horizonte, MG). Recordings therefore try to cover the widest possible range of situations. Representing diastratic variation is the secondary goal of the project.
Texts (1,500 words on average) are segmented into utterances and tonal units to allow the study of illocutions and information structure within the framework of the Language into Act Theory*. This theory was developed by Emanuela Cresti, director of the LABLITA laboratory at the University of Florence (Italy).
The major compilation phases are:
Recordings with high-accuracy wireless equipment;
Transcriptions undertaken by expert transcribers according to quasi-orthographic criteria (aiming at preserving ongoing speech phenomena representing grammaticalization and lexicalization) and the segmentation criteria mentioned above;
Second revision during text-to-speech alignment through WinPitch Pro software by Philippe Martin;
Morphosyntactic tagging through PALAVRAS by Eckhard Bick. This parser was specially trained to process this corpus, which was preprocessed using the R computing environment;
Informational tagging of a minicorpus (20 texts and 30,000 words), based on the Language into Act Theory*.
CRESTI, E. Corpus di Italiano parlato. v. 1. Firenze: Accademia della Crusca, 2000.
CRESTI, E.; MONEGLIA, M. Informational patterning theory and the corpus-based description of spoken language: The compositionality issue in the topic-comment pattern. In: MONEGLIA, M.; PANUNZI, A. (Eds.). Bootstrapping Information from Corpora in a Cross-Linguistic Perspective. Firenze: FUP, 2010. p. 13-45.