The Study of Illocution and Information Patterning: the C-ORAL-BRASIL Corpus
(Tommaso Raso – UFMG, CNPq, FAPEMIG)
This talk has two objectives:
1. To present the C-ORAL-BRASIL Corpus;
2. To introduce the theory that
inspired its format and which allows for the study of illocutions and
The C-ORAL-BRASIL (Raso-Mello 2009 and 2010) is a Brazilian
Portuguese (BP) spontaneous speech corpus compiled according to the same
parameters as the C-ORAL-ROM corpora (Cresti-Moneglia 2005). It comprises
at least 300,000 words and 200 texts and is divided into an informal half
(already completed) and a second half made up of formal and telephone
interactions. The informal half consists of 80% familiar/private texts and
20% public texts. Each domain is made up in equal parts of dialogic,
monologic and conversational texts. Each text has 1,500 words on average.
The main objective of the corpus is to represent diaphasic variation
(seeking the greatest possible situational variation) within the Mineiro
diatopy; diastratic variation is also represented.
The corpus is segmented into utterances and
tone units. An utterance is the smallest pragmatically interpretable unit,
regardless of its syntactic composition, and corresponds to an action. Speech
is therefore segmented into utterances, which are recognized by being bounded
by a prosodic break perceived as terminal. An utterance may consist of a
single tone unit or be internally segmented into more than one unit by
prosodic breaks perceived as non-terminal. The segmentation was validated
between the first and the second revision through a Kappa test with three
segmenters. The test yielded an overall score of 0.86, and 0.91 for terminal
breaks (Raso-Mittmann 2009).
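To make the kind of agreement measure mentioned above concrete, the following is a minimal sketch of a kappa computation for three segmenters. It assumes Fleiss' kappa, which is one common choice for more than two raters; the abstract itself only speaks of "a Kappa test". The labels ("T" for a terminal break, "N" for a non-terminal break, "0" for no break) and the toy judgements are invented for illustration and are not data from the validation study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i] holds the labels given to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Share of all judgements that fall into each category.
    p_cat = {
        c: sum(cnt[c] for cnt in counts) / (n_items * n_raters)
        for c in categories
    }
    # Observed agreement for each item, then averaged over items.
    per_item = [
        (sum(cnt[c] ** 2 for c in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    observed = sum(per_item) / n_items
    # Agreement expected by chance.
    expected = sum(p ** 2 for p in p_cat.values())
    return (observed - expected) / (1 - expected)


# Toy data: six candidate boundary positions, three segmenters each.
judgements = [
    ["T", "T", "T"],
    ["N", "N", "N"],
    ["0", "0", "0"],
    ["0", "0", "0"],
    ["N", "N", "0"],
    ["T", "T", "T"],
]

print(round(fleiss_kappa(judgements), 3))
```

The sketch does not handle the degenerate case in which every judgement falls into a single category (chance agreement is then 1 and the ratio is undefined).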
The corpus transcription follows the
CHILDES-CLAN format extended for prosodic annotation (Moneglia-Cresti 1997).
The transcription is orthographically based, with several modifications
that account for speech phenomena undergoing lexicalization and
grammaticalization (Mello-Raso 2009). Among these are: lack of plural
marking, verbal paradigm reduction, subject personal pronoun cliticization,
serialized verb forms, apheresis, apocope, articulated prepositions,
contractions, and many others. Registering these phenomena in the
transcription allows for quantitative studies of the restructuring of the
system. In addition, the segmental transcription is currently being
validated. We have been using the parser PALAVRAS (Bick 2000), which has
undergone several rule additions and adaptations so that it can handle the
analytical domain under study (utterances and tone units rather than
sentences and phrases) and the non-standard orthographic representations.
A first trial indicated an error rate below 5%.
The corpus architecture and segmentation allow
for the study of BP according to the Information Patterning Theory (Cresti
2000). The theory is based on the correspondence between an utterance and an
action unit and, in principle, on that between a tone unit and an information
unit. Each utterance therefore corresponds to an illocution, carried by
the only mandatory unit, the Comment (COM). Thus an utterance may be made up
of only one information unit, COM, or of COM and one or more other
tone-information units. In the first case the utterance is simple; in the
second, it is complex.
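The distinction between simple and complex utterances can be sketched in a few lines of code. The representation used here, a list of (text, tag) pairs with one pair per tone-information unit, is a hypothetical one chosen for illustration and is not the corpus annotation format; the example utterances are invented. The sketch covers only the basic case described above, in which each utterance has a COM.

```python
def classify_utterance(units):
    """Classify an utterance as 'simple' or 'complex'.

    units: list of (text, tag) pairs, one per tone-information unit
    (hypothetical representation, not the corpus format).
    """
    tags = [tag for _, tag in units]
    if "COM" not in tags:
        # COM is the only mandatory unit in the basic case.
        raise ValueError("utterance has no Comment (COM)")
    return "simple" if len(units) == 1 else "complex"


# Invented examples.
one_unit = [("it's raining", "COM")]
two_units = [("tomorrow", "TOP"), ("we leave early", "COM")]

print(classify_utterance(one_unit))
print(classify_utterance(two_units))
```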
The units identified over
decades of studies on spontaneous speech corpora have diverse characteristics
and are defined on functional, prosodic and distributional grounds. The
first group of units constitutes the utterance text. COM carries the
illocutionary force; it is autonomously interpretable and therefore gives
the utterance its autonomy; it has a functional focus, its prosodic profile
varies with the illocution it carries, and its distribution is free. The
Topic (TOP) works as a semantic delimitation for the COM, establishing the
scope within which the illocutionary force applies; its prosodic profile
has a rightward focus, and it always occurs to the left of the COM. The
Comment Appendix (APC) and the Topic Appendix (APT) work as integration
units for their respective head units; their profile does not carry focus,
and they always occur to the right of their head unit.
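The distributional statements above (TOP to the left of COM, appendices to the right of their head unit) lend themselves to a mechanical check. The sketch below is an illustration under simplifying assumptions: an utterance is given as a plain sequence of unit tags, it contains exactly one COM, and "to the right of the head" is read as "somewhere after it" rather than "immediately after it". The function name and the message wording are hypothetical.

```python
def check_distribution(tags):
    """Return a list of distributional problems found in a tag sequence."""
    problems = []
    if tags.count("COM") != 1:
        # This sketch only handles utterances with exactly one COM.
        problems.append("expected exactly one COM")
        return problems
    com = tags.index("COM")
    for i, tag in enumerate(tags):
        if tag == "TOP" and i > com:
            problems.append(f"TOP at position {i} follows COM")
        elif tag == "APC" and i < com:
            problems.append(f"APC at position {i} precedes COM")
        elif tag == "APT" and "TOP" not in tags[:i]:
            problems.append(f"APT at position {i} has no TOP to its left")
    return problems


print(check_distribution(["TOP", "APT", "COM", "APC"]))
print(check_distribution(["COM", "TOP"]))
```

A well-formed sequence yields an empty list; each violation adds one message, so the result can be used directly in corpus-wide consistency reports.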
The second group of units is not part of the
utterance text itself but intervenes upon it. The Parenthetical (PAR)
comments on the utterance text, frequently modalizing it in part or in
whole; prosodically it is level, with a lower F0 and a higher speech rate;
distributionally, it can occur in any position except utterance-initial,
including inside other textual units. The Locutive Introducer (INT)
signals that whatever follows it makes up a domain, suspending the utterance
deixis; it frequently introduces metaillocutions such as reported speech and
exemplifications; its prosodic profile varies but is usually descending,
the speech rate is very high and there is a strong F0 contrast with what
follows; distributionally, it precedes the units that it introduces.
The third group is made up of dialogic
supports. These units are directed towards the interlocutor and manage the
communication process. They mark the opening of the channel (Phatic – PHA),
the beginning of an utterance with some contrast with its predecessor
(Incipitary – INP), the continuity of what precedes (Discursive Connector –
DCT), the identification of the interlocutor and social cohesion with him
(Allocutive – ALL), an emotional support to the speech act (Expressive –
EXP), the intention to induce the interlocutor to do something or to give
something up (Conative – CNT), and time taking (Time Taking – TMT).
Each one of these units has a dedicated prosodic profile and an obligatory or
In some cases the biunivocal correspondence
between COM and utterance is lost. This happens with some strongly
conventionalized rhetorical patterns in which the sum of two utterances is
performed and interpreted holistically, i.e., the two are separated only by
a non-terminal break. These are Multiple Comments (CMM), and they occur in
listings, comparisons, supports, necessary relations and other patterns.
Biunivocity is also lost in a different circumstance: when the actional
character of interactive speech is reduced and the textual-semantic pole
becomes more prominent, typically in monologic and formal speech. In this
case, larger entities bounded by terminal breaks may be formed. These are
Stanzas (Cresti 2009), in which several bound Comments (COB) succeed one
another, sometimes establishing
subpatterns with other units. COBs are homogenous and weakened in their
BICK, E. The Parsing System Palavras – Automatic
Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus:
Aarhus University Press, 2000.
CRESTI, E. Corpus
di Italiano parlato. Firenze: Accademia della Crusca. Vol 1. 2000.
CRESTI, E. La Stanza: un'unità di
costruzione testuale del parlato. Atti
del X Congresso della Società Internazionale di Filologia Italiana, 2009,
CRESTI, E.; MONEGLIA, M. (eds) C-Oral-Rom: Integrated Reference Corpora
For Spoken Romance Languages. Amsterdam:
John Benjamins. 2005.
MELLO, H.; RASO, T. Para a transcrição da
fala espontânea: o caso do C-ORAL-BRASIL. Revista Portuguesa de Humanidades.
2009, pp. 301-325.
MONEGLIA, M.; CRESTI,
E. L'intonazione e i criteri di trascrizione del parlato adulto e infantile.
In: Bortolini, U. – Pizzuto, E. Il
Progetto CHILDES Italia. Pisa: Del Cerro, 1997, pp. 57-90.
RASO, T.; MELLO, H. Parâmetros
de compilação de um corpus oral: o caso do C-ORAL-BRASIL. Veredas, 2009, p.
RASO, T.; MELLO, H. The C-ORAL-BRASIL corpus. In: Moneglia, M.-Panunzi, A.,
Information from Corpora in a Cross Linguistic Perspective. Firenze University
RASO, T.; MITTMANN, M. Validação estatística dos critérios de
segmentação da fala espontânea no corpus C-ORAL-BRASIL. Revista de Estudos