An XML model for multi-layer representation of spoken language
(Alessandro Panunzi - UniFi)
The annotation of spoken language is a complex task that requires a multi-layer representation for the different levels of analysis.
This talk presents a database prototype designed to represent transcripts of spontaneous spoken language with three levels of annotation: prosodic boundaries, information structure, and morphosyntactic (PoS) tagging.
The database relies on the theoretical model of the Informational Patterning Theory (Teoria dell’Articolazione dell’informazione; see Cresti 2000; Moneglia and Cresti 2006; Scarano 2009).
The data come from the Italian section of the C-ORAL-ROM Corpus (Cresti and Moneglia 2005), whose transcripts are annotated with prosodic boundaries. These boundaries are relevant for two reasons: a) they identify the utterances; b) they parse each utterance internally into tonal units.
Utterances are the reference unit of analysis for spoken language and are identified by the terminal prosodic breaks that segment the speech flow. Within an utterance, non-terminal breaks may occur, structuring it into a sequence of tonal units that constitute a prosodic pattern (see Cresti and Moneglia 2010).
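The two-step segmentation described above can be sketched in a few lines of Python. This is only an illustration, assuming a plain-text transcript in which `//` marks a terminal break and `/` a non-terminal break; the function name `segment` and the input format are hypothetical and not part of the database described here.

```python
def segment(transcript: str) -> list[list[str]]:
    """Split a transcript into utterances, each a list of tonal units.

    Assumption: '//' marks a terminal break and '/' a non-terminal break.
    """
    utterances = []
    # Terminal breaks close utterances; skip empty pieces left by a final '//'.
    for chunk in transcript.split("//"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Non-terminal breaks delimit the tonal units inside one utterance.
        units = [u.strip() for u in chunk.split("/") if u.strip()]
        utterances.append(units)
    return utterances


example = "guarda chi c' è / nonna // "
print(segment(example))
```

Each inner list corresponds to one utterance, and each string in it to one tonal unit of its prosodic pattern.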
Since the Informational Patterning Theory posits a systematic correspondence between the prosodic pattern and the information pattern of the utterance, the second level of annotation assigns each tonal unit a tag for its information value, according to the following tagset:
ALL: Allocutive
APC: Appendix of Comment
APT: Appendix of Topic
CMM: Multiple Comment
CNT: Conative
COB: Bound Comment
COM: Comment
DCT: Dialogic Connector
EMP: Interrupted Unit
EXP: Expressive
INP: Incipit
INT: Locutive Introducer
PAR: Parenthesis
PHA: Phatic
SCA: Scanning Unit
TMT: Time Taking
TOP: Topic
TPL: Topic List
UNC: Unclassified
This annotation was produced with the WinPitch alignment software interface, through prosodic analysis of the sound data.
The third level of annotation, i.e. the PoS tagging, was produced automatically with the TreeTagger software developed by the Institute for Computational Linguistics of the University of Stuttgart.
The whole annotation was automatically converted to XML and loaded into a database. The resource runs on the eXist engine, an open source database management system that stores data according to the XML data model and features index-based XPath/XQuery processing.
The following sample XML document shows all the annotation levels for a single turn consisting of one utterance and two prosodic/information units:
<turn speak="EDO">
  <term_seq num="1" type="utt" proj_ill="unknown">
    <tone_unit inf="COM" ill="none">
      <word lemma="guardare" pos="VER:fin">guarda</word>
      <word lemma="chi" pos="WH">chi</word>
      <word lemma="c'" pos="ADV">c'</word>
      <word lemma="essere" pos="VER:fin">è</word>
      <break type="nonterminal">/</break>
    </tone_unit>
    <tone_unit inf="ALL">
      <word lemma="nonna" pos="NOUN">nonna</word>
      <break type="terminal">//</break>
    </tone_unit>
  </term_seq>
</turn>
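As an illustration of how the three annotation levels could be merged into a document of this shape, the following Python sketch builds the same element structure with the standard `xml.etree.ElementTree` module. The in-memory `turn` dictionary and the `turn_to_xml` function are hypothetical, the actual conversion procedure is not described here, and the `proj_ill` and `ill` attributes of the sample are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of one annotated turn: each tonal unit carries
# its information tag, its (form, lemma, pos) tokens and its closing break.
turn = {
    "speaker": "EDO",
    "units": [
        {"inf": "COM",
         "tokens": [("guarda", "guardare", "VER:fin"), ("chi", "chi", "WH"),
                    ("c'", "c'", "ADV"), ("è", "essere", "VER:fin")],
         "break": "nonterminal"},
        {"inf": "ALL",
         "tokens": [("nonna", "nonna", "NOUN")],
         "break": "terminal"},
    ],
}

BREAK_SYMBOL = {"nonterminal": "/", "terminal": "//"}


def turn_to_xml(turn: dict) -> ET.Element:
    """Build a <turn> element; a terminal break closes the current utterance."""
    root = ET.Element("turn", speak=turn["speaker"])
    seq = None
    num = 0
    for unit in turn["units"]:
        if seq is None:  # open a new utterance (terminated sequence)
            num += 1
            seq = ET.SubElement(root, "term_seq", num=str(num), type="utt")
        tone_unit = ET.SubElement(seq, "tone_unit", inf=unit["inf"])
        for form, lemma, pos in unit["tokens"]:
            word = ET.SubElement(tone_unit, "word", lemma=lemma, pos=pos)
            word.text = form
        brk = ET.SubElement(tone_unit, "break", type=unit["break"])
        brk.text = BREAK_SYMBOL[unit["break"]]
        if unit["break"] == "terminal":
            seq = None
    return root


root = turn_to_xml(turn)
print(ET.tostring(root, encoding="unicode"))
```

Nesting the words inside tonal units, and the tonal units inside utterances, is what keeps the prosodic, informational and morphosyntactic levels addressable within a single tree.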
The architecture of the database allows cross-level queries on the data. Quantitative data will be presented on the information structure of the utterances and on the morphosyntactic and lexical content of each type of information unit.
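A cross-level query of this kind can be illustrated on the sample turn. The sketch below uses Python's standard `xml.etree.ElementTree` module and its limited XPath support rather than XQuery on eXist, so it is a stand-in for the idea and not the queries actually run on the database; the function name `pos_by_information_unit` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<turn speak="EDO">
  <term_seq num="1" type="utt" proj_ill="unknown">
    <tone_unit inf="COM" ill="none">
      <word lemma="guardare" pos="VER:fin">guarda</word>
      <word lemma="chi" pos="WH">chi</word>
      <word lemma="c'" pos="ADV">c'</word>
      <word lemma="essere" pos="VER:fin">è</word>
      <break type="nonterminal">/</break>
    </tone_unit>
    <tone_unit inf="ALL">
      <word lemma="nonna" pos="NOUN">nonna</word>
      <break type="terminal">//</break>
    </tone_unit>
  </term_seq>
</turn>"""


def pos_by_information_unit(root: ET.Element) -> dict[str, Counter]:
    """Count PoS tags separately for each information-unit type."""
    counts: dict[str, Counter] = {}
    for unit in root.iter("tone_unit"):
        counter = counts.setdefault(unit.get("inf"), Counter())
        counter.update(word.get("pos") for word in unit.findall("word"))
    return counts


root = ET.fromstring(SAMPLE)
counts = pos_by_information_unit(root)
for inf, counter in counts.items():
    print(inf, dict(counter))
# COM {'VER:fin': 2, 'WH': 1, 'ADV': 1}
# ALL {'NOUN': 1}

# Lexical content of one unit type: the lemmas found in Comment units.
comment_lemmas = [w.get("lemma")
                  for w in root.findall(".//tone_unit[@inf='COM']/word")]
print(comment_lemmas)
```

The query combines the information level (the `inf` attribute) with the morphosyntactic and lexical levels (`pos` and `lemma`), which is possible because the words are nested inside the tonal units.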
Future work on the database includes:
- the inclusion of audio data (direct access to the sound), aligned utterance by utterance with the transcripts;
- the inclusion of a further level of annotation, i.e. the illocutionary value of each utterance, as envisaged by the Language into Act Theory (Teoria della lingua in atto, Cresti 2000).
References
Cresti, E. 2000. Corpus di italiano
parlato, 2 voll., CD-ROM. Firenze: Accademia della Crusca.
Cresti, E. and M. Moneglia (eds). 2005. C-ORAL-ROM. Integrated reference corpora for spoken Romance languages, DVD + vol. Amsterdam: Benjamins.
Cresti, E. and M. Moneglia. 2010. Informational patterning theory and the corpus-based description of spoken language. The compositionality issue in the topic-comment pattern. In M. Moneglia and A. Panunzi (eds), Bootstrapping Information from Corpora in a Cross-Linguistic Perspective. Firenze: FUP.
eXist.
http://exist.sourceforge.net/
Moneglia, M. and E. Cresti. 2006. C-ORAL-ROM. Prosodic boundaries for spontaneous speech analysis. In Y. Kawaguchi, S. Zaima and T. Takagaki (eds), Spoken Language Corpus and Linguistic Informatics. Amsterdam: Benjamins, -114.
Scarano, A. 2009. The prosodic annotation of C-ORAL-ROM and the structure of information in spoken language. In L. Mereu (ed.), Information Structure and its Interfaces. Berlin and New York: Mouton de Gruyter, 51-74.
TreeTagger. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
WinPitch. http://www.winpitch.com/
XML.
http://www.w3.org/XML/