Project description

The C-ORAL-BRASIL project, coordinated by Prof. Tommaso Raso and Prof. Heliana Mello (Federal University of Minas Gerais State), is dedicated to the study of Brazilian Portuguese spontaneous speech through the compilation of spoken corpora comparable to those of the C-ORAL-ROM project.

Funding Institutions:

FAPEMIG Fundação de Amparo à Pesquisa do Estado de Minas Gerais (Minas Gerais State Research Foundation);

CNPq Conselho Nacional de Desenvolvimento Científico e Tecnológico (National Council for Scientific and Technological Development);

UFMG Universidade Federal de Minas Gerais (Federal University of Minas Gerais State);

Banco Santander.

Corpus specifications

The C-ORAL-BRASIL corpus aims at approximately 200 texts and 300,000 words, divided into a Formal section (to be published) and an Informal section (published in 2012).

The Informal section is organized into Private/Family (80%) and Public (20%) contexts, each divided into monological (1/3) and dialogical (dialogues and conversations: 2/3) interaction types.
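
As a rough illustration of this design, the sketch below computes how a target word count would be split across the four cells (context by interaction type). The total of 150,000 words is a hypothetical figure chosen only to make the arithmetic concrete, and the names allocate_words, CONTEXT_SHARES and INTERACTION_SHARES are illustrative; none of them come from the project itself.

```python
from fractions import Fraction

# Proportions stated in the corpus design.
CONTEXT_SHARES = {"private_family": Fraction(4, 5), "public": Fraction(1, 5)}
INTERACTION_SHARES = {"monological": Fraction(1, 3), "dialogical": Fraction(2, 3)}


def allocate_words(total_words):
    """Return target word counts per (context, interaction type) cell."""
    return {
        (context, interaction): total_words * c_share * i_share
        for context, c_share in CONTEXT_SHARES.items()
        for interaction, i_share in INTERACTION_SHARES.items()
    }


# Hypothetical total, used only to make the arithmetic concrete.
targets = allocate_words(150_000)
for cell, words in targets.items():
    print(cell, int(words))
```

With these proportions, the Private/Family dialogical cell is the largest (8/15 of the section) and the Public monological cell the smallest (1/15).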

The primary goal of the corpus architecture is to represent diaphasic variation in Brazilian spontaneous speech, focusing on the Minas Gerais State diatopy (mainly its capital city, Belo Horizonte, and its metropolitan area). The recordings were intended to document the greatest possible variety of communicative situations. Representing diastratic variation was a secondary goal.

The transcripts of the recordings (1,500 words on average) are segmented into utterances and tone units to allow the study of illocutions and information structure, following the theoretical framework of the Language into Act Theory*. This theory was developed by Emanuela Cresti, director of the LABLITA laboratory at the University of Florence (Italy).
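
A minimal sketch of how such a two-level segmentation could be read programmatically is given below. It assumes that a terminal break closing an utterance is written as "//" and a non-terminal break between tone units as "/"; this markup is an assumption of the example, not something stated in this description, and the sketch ignores any other annotation symbols. The function name segment and the sample line are made up for illustration and are not taken from the corpus.

```python
def segment(transcript):
    """Split a transcript string into utterances, each a list of tone units."""
    utterances = []
    # Assumption: "//" is a terminal break that closes an utterance.
    for chunk in transcript.split("//"):
        # Assumption: a single "/" is a non-terminal break between tone units.
        units = [unit.strip() for unit in chunk.split("/") if unit.strip()]
        if units:
            utterances.append(units)
    return utterances


# Made-up example line, not taken from the corpus.
sample = "yesterday / we went there // it was good //"
utterances = segment(sample)
for number, units in enumerate(utterances, start=1):
    print(number, units)
```

Keeping each utterance as a list of tone units preserves both levels of segmentation, so counts per utterance or per tone unit can be derived from the same structure.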

The major compilation phases are:

  1. Recording with high-accuracy wireless equipment;

  2. Transcriptions undertaken by expert transcribers according to quasi-orthographic criteria (aiming to preserve ongoing speech phenomena such as grammaticalization and lexicalization) and prosodic segmentation criteria;

  3. Revision of the transcriptions;

  4. Second revision of the transcriptions during the text-to-speech alignment process (using the WinPitch Pro software by Philippe Martin);

  5. Morpho-syntactic tagging with PALAVRAS by Eckhard Bick. This parser was specially trained to process this corpus, which was preprocessed in the R computational environment;

  6. Informational tagging of a minicorpus (20 texts and 30,000 words), based on the Language into Act Theory*.


CRESTI, E. Corpus di Italiano parlato. v. 1. Firenze: Accademia della Crusca, 2000. 

CRESTI, E.; MONEGLIA, M. Informational patterning theory and the corpus-based description of spoken language: The compositionality issue in the topic-comment pattern. In: M. Moneglia; A. Panunzi (Eds.); Bootstrapping Information from Corpora in a Cross-Linguistic Perspective. Firenze: FUP, 2010. p.13-45.

MONEGLIA, M.; RASO, T. Notes on the Language into Act Theory. In: T. Raso; H. Mello (Eds.); Spoken corpora and linguistic studies. Amsterdam/Philadelphia: John Benjamins, 2014. p.468-489.