The study of Illocution and Information Patterning: the C-ORAL-BRASIL Corpus (Tommaso Raso – UFMG, CNPq, FAPEMIG)
This talk has two objectives:
1. To present the C-ORAL-BRASIL Corpus;
2. To introduce the theory that inspired its format and that allows for the study of illocutions and information structure.
The C-ORAL-BRASIL (Raso-Mello 2009 and 2010) is a spontaneous speech corpus of Brazilian Portuguese (BP), compiled following the same parameters adopted by the C-ORAL-ROM corpora (Cresti-Moneglia 2005). It comprises at least 300,000 words and 200 texts and is divided into an informal half (already completed) and a second half made up of formal and telephone interactions. In the informal half, 80% of the texts are familiar/private and 20% are public. Each domain is made up in equal parts of dialogic, monologic and conversational texts. Each text has 1,500 words on average. The main objective of the corpus is to represent diaphasic variation (seeking the greatest possible situational variation) within the Mineiro diatopy; diastratic variation is also represented.
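As a rough check on these figures, the planned composition can be worked out from the numbers above. The sketch below assumes exactly 200 texts (the abstract says "at least") and assumes that the two halves are equal in number of texts; the variable names are illustrative only.

```python
# Illustrative arithmetic only. Assumptions: exactly 200 texts, split evenly
# between the informal half and the formal/telephone half.
TOTAL_TEXTS = 200
AVG_WORDS_PER_TEXT = 1_500

total_words = TOTAL_TEXTS * AVG_WORDS_PER_TEXT        # 200 * 1,500
informal_texts = TOTAL_TEXTS // 2                      # one half of the texts
informal_words = informal_texts * AVG_WORDS_PER_TEXT

# Within the informal half: 80% familiar/private, 20% public.
private_texts = informal_texts * 80 // 100
public_texts = informal_texts - private_texts

print(total_words, informal_texts, informal_words, private_texts, public_texts)
```

Under these assumptions the average text length is consistent with the stated minimum of 300,000 words.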
The corpus is segmented into utterances and tone units. An utterance is the smallest pragmatically interpretable unit, regardless of its syntactic composition, and corresponds to an action. Speech is therefore segmented into utterances, which are identified by a prosodic break perceived as terminal. An utterance may be realized as a single tone unit or may be internally segmented into more than one tone unit by prosodic breaks perceived as non-terminal. The segmentation was validated between the first and second revisions through a Kappa test involving three segmenters. The test yielded an overall score of 0.86, and 0.91 for terminal breaks (Raso-Mittmann 2009).
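The two-level segmentation can be illustrated with a minimal sketch. It assumes that a terminal break is written `//` and a non-terminal break `/` in the transcription; the abstract does not state the symbols, so they are assumptions here, and the sample line is an invented English example rather than corpus data.

```python
TERMINAL = "//"      # assumed symbol for a break perceived as terminal
NON_TERMINAL = "/"   # assumed symbol for a break perceived as non-terminal


def segment(transcript: str) -> list[list[str]]:
    """Split a transcript into utterances, each a list of tone units."""
    utterances = []
    # Terminal breaks close utterances, so split on them first.
    for chunk in transcript.split(TERMINAL):
        # Non-terminal breaks separate tone units inside one utterance.
        units = [u.strip() for u in chunk.split(NON_TERMINAL) if u.strip()]
        if units:
            utterances.append(units)
    return utterances


# Invented example, not taken from the corpus.
example = "as for the corpus / it is already transcribed // good //"
for tone_units in segment(example):
    print(len(tone_units), tone_units)
```

The first utterance in the example contains two tone units and the second contains one, which mirrors the distinction between utterances realized in several tone units and utterances realized in a single one.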
The corpus transcription follows the CHILDES-CLAN format as implemented for prosodic annotation (Moneglia-Cresti 1997). The transcription is orthographically based, with several modifications added to account for speech phenomena undergoing lexicalization and grammaticalization (Mello-Raso 2009). Among these are: lack of plural marking, verbal paradigm reduction, subject personal pronoun cliticization, serialized verb forms, apheresis, apocope, articulated prepositions, contractions, and many others. Recording these phenomena in the transcription allows for quantitative studies of the restructuring of the system. In addition, the segmental transcription is currently being validated. We have been using the parser PALAVRAS (Bick 2000), to which several rules have been added and adapted so that it can handle both the analytical domain under study (utterances and tone units rather than sentences and phrases) and the non-standard orthographic representations. A first trial indicated an inaccuracy rate below 5%.
The corpus architecture and segmentation allow for the study of BP according to the Information Patterning Theory (Cresti 2000). The theory is based on the correspondence between an utterance and an action unit and, in principle, between a tone unit and an information unit. Each utterance therefore corresponds to an illocution, which is carried by the only mandatory unit, the Comment (COM). Thus an utterance may be made up of a single information unit, the COM, or of a COM plus one or more other tone-information units. The first case is a simple utterance and the second a complex utterance.
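The distinction between simple and complex utterances can be sketched as a small data structure. This is an illustration only: the `InfoUnit` class and the `classify` function are hypothetical names, not part of the corpus tools, and the check covers only the basic case of exactly one COM per utterance.

```python
from dataclasses import dataclass


@dataclass
class InfoUnit:
    tag: str   # information-unit tag, e.g. "COM" or "TOP"
    text: str  # words of the corresponding tone unit


def classify(units: list[InfoUnit]) -> str:
    """Return 'simple' or 'complex' for an utterance with one COM."""
    coms = [u for u in units if u.tag == "COM"]
    if len(coms) != 1:
        # The COM is the only mandatory unit in the basic pattern.
        raise ValueError("expected exactly one COM in the utterance")
    return "simple" if len(units) == 1 else "complex"


# Invented examples, not corpus data.
print(classify([InfoUnit("COM", "good")]))
print(classify([InfoUnit("TOP", "as for the corpus"),
                InfoUnit("COM", "it is already transcribed")]))
```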
The units, which have been identified over decades of studies on spontaneous speech corpora, have diverse characteristics and are defined on the basis of functional, prosodic and distributional principles. The first group of units constitutes the utterance text. The COM carries the illocutionary force and is autonomously interpretable, which gives the utterance its autonomy; it has a functional focus, its prosodic profile varies with the illocution it carries, and its distribution is free. The Topic (TOP) provides the semantic delimitation for the COM, establishing the scope within which the illocutionary force applies; its prosodic profile has a rightward focus and it is always placed to the left of the COM. The Comment Appendix (APC) and the Topic Appendix (APT) are integration units for their respective head units; their profile carries no focus and they are always placed to the right of their head unit.
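The distributional constraints on the textual units lend themselves to a simple consistency check over a sequence of tags. The sketch below is hypothetical and deliberately loose: it assumes one COM per utterance and tests only relative order (left or right of the head unit), not adjacency, since the text does not say whether an appendix must immediately follow its head.

```python
def check_textual_order(tags: list[str]) -> list[str]:
    """Return the distributional violations found in one utterance."""
    if "COM" not in tags:
        return ["no COM: the utterance lacks its mandatory unit"]
    com = tags.index("COM")
    problems = []
    for i, tag in enumerate(tags):
        if tag == "TOP" and i > com:
            # TOP is always to the left of the COM.
            problems.append(f"TOP at position {i} is to the right of COM")
        elif tag == "APC" and i < com:
            # APC is always to the right of its head, the COM.
            problems.append(f"APC at position {i} is to the left of COM")
        elif tag == "APT" and "TOP" not in tags[:i]:
            # APT is always to the right of its head, the TOP.
            problems.append(f"APT at position {i} has no TOP to its left")
    return problems


print(check_textual_order(["TOP", "APT", "COM", "APC"]))
print(check_textual_order(["COM", "TOP"]))
```

The first call reports no violations; the second reports the misplaced TOP.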
The second group of units is not part of the utterance text itself but intervenes upon it. The Parenthetical (PAR) comments on the utterance text, frequently modalizing it in part or in whole; prosodically it is level, with a lower F0 and a higher speech rate; distributionally, it can occur anywhere except in utterance-initial position, including inside other textual units. The Locutive Introducer (INT) signals that whatever follows it forms a domain, suspending the utterance deixis; it frequently introduces metaillocutions such as reported speech and exemplifications; its prosodic profile varies but is usually descending, its speech rate is very high, and there is a strong F0 contrast with what follows; distributionally, it precedes the units it introduces.
The third group is made up of dialogic supports. These units are directed towards the interlocutor and manage the communication process. They mark the opening of the channel (Phatic – PHA), the beginning of an utterance that contrasts in some way with what precedes it (Incipitary – INP), continuity with what precedes (Discursive Connector – DCT), the interlocutor and social cohesion with them (Allocutive – ALL), emotional support for the speech act (Expressive – EXP), the intention to induce the interlocutor to do something or to give something up (Conative – CNT), and the taking of time (Time Taking – TMT). Each of these units has a dedicated prosodic profile and an obligatory or preferential position.
In some cases the one-to-one correspondence between COM and utterance is lost. This happens in some strongly conventionalized rhetorical patterns in which the sum of two utterances, separated by a non-terminal break, is performed and interpreted holistically. These are Multiple Comments (CMM), and they occur in listings, comparisons, supports, necessary relations and other patterns. The correspondence is also lost in a different circumstance: when the actional character of interactive speech is reduced and the textual-semantic pole becomes more prominent, typically in monologic and formal speech. In this case, larger entities bounded by terminal breaks may be formed. These are Stanzas (Cresti 2009), in which several bound Comments (COB) succeed one another, sometimes establishing subpatterns with other units. COBs are homogeneous and have a weakened illocution.
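The three situations described so far (a single COM, a Multiple Comment pattern, a Stanza) can be told apart from the tags found between two terminal breaks. The function below is a hypothetical sketch under simplifying assumptions: any sequence containing a COB is treated as a Stanza, and two or more CMM tags are treated as one Multiple Comment pattern.

```python
def classify_sequence(tags: list[str]) -> str:
    """Label the tag sequence found between two terminal breaks."""
    if "COB" in tags:
        # Bound Comments succeed one another inside a Stanza.
        return "stanza"
    if tags.count("CMM") >= 2:
        # Comments joined by a non-terminal break and read holistically.
        return "multiple-comment pattern"
    if tags.count("COM") == 1:
        return "utterance"
    return "unclassified"


print(classify_sequence(["TOP", "COM"]))
print(classify_sequence(["CMM", "CMM"]))
print(classify_sequence(["COB", "COB", "COB"]))
```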
BICK, E. The Parsing System Palavras – Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press, 2000.
CRESTI, E. Corpus di Italiano parlato. Firenze: Accademia della Crusca. Vol 1. 2000.
CRESTI, E. La Stanza: un'unità di costruzione testuale del parlato. Atti del X Congresso della Società di Internazionale di Filologia Italiana, 2009, pp. 713-732.
CRESTI, E.; MONEGLIA, M. (eds) C-Oral-Rom: Integrated Reference Corpora For Spoken Romance Languages. Amsterdam: John Benjamins. 2005.
MELLO, H.; RASO, T. Para a transcrição da fala espontânea: o caso do C-ORAL-BRASIL. Revista Portuguesa de Humanidades, 2009, pp. 301-325.
MONEGLIA, M.; CRESTI, E. L'intonazione e i criteri di trascrizione del parlato adulto e infantile. In: Bortolini, U.; Pizzuto, E. (eds) Il Progetto CHILDES Italia. Pisa: Del Cerro, 1997, pp. 57-90.
RASO, T.; MELLO, H. Parâmetros de compilação de um corpus oral: o caso do C-ORAL-BRASIL. In: Veredas, 2009, pp. 20-35.
RASO, T.; MELLO, H. The C-ORAL-BRASIL corpus. In: Moneglia, M.; Panunzi, A. (eds) Bootstrapping Information from Corpora in a Cross Linguistic Perspective. Firenze University Press, 2010.
RASO, T.; MITTMANN, M. Validação estatística dos critérios de segmentação da fala espontânea no corpus C-ORAL-BRASIL. Revista de Estudos Linguísticos, 2009.