An XML model for multi-layer representation of spoken language (Alessandro Panunzi - UniFi)

Annotating spoken language is a complex task that requires a multi-layer representation covering the different levels of analysis.
This presentation introduces a database prototype designed to represent transcripts of spontaneous speech with three levels of annotation: prosodic boundaries, information structure, and morphosyntactic (PoS) tagging.
The theoretical model underlying the database is the Informational Patterning Theory (Teoria dell’Articolazione dell’informazione; see Cresti 2000; Moneglia and Cresti 2006; Scarano 2009).
The data come from the Italian section of the C-ORAL-ROM corpus (Cresti and Moneglia 2005), whose transcripts are annotated with prosodic boundaries. These boundaries serve two purposes: a) identifying the utterances; b) parsing each utterance internally into tonal units.
Utterances are the reference unit of analysis for spoken language and are identified by the terminal prosodic breaks that segment the speech flow. Within an utterance, non-terminal breaks may occur; they divide the utterance into a sequence of tonal units that constitutes its prosodic pattern (see Cresti and Moneglia 2010).
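The two-step segmentation described above can be illustrated with a minimal Python sketch. It assumes that a terminal break is written "//" and a non-terminal break "/", as in the XML sample in this abstract; the function name `segment` is hypothetical, and any further break symbols the full transcription format may use are ignored here.

```python
def segment(transcript: str) -> list[list[str]]:
    """Split a transcript into utterances, each a list of tonal units.

    Assumption: '//' marks a terminal break (end of utterance) and
    '/' marks a non-terminal break (end of a tonal unit).
    """
    utterances = []
    # Step 1: terminal breaks delimit the utterances.
    for chunk in transcript.split("//"):
        # Step 2: the remaining single slashes are non-terminal breaks.
        units = [unit.strip() for unit in chunk.split("/") if unit.strip()]
        if units:
            utterances.append(units)
    return utterances


sample = "guarda chi c' è / nonna //"
utterances = segment(sample)
# utterances == [["guarda chi c' è", "nonna"]]
# i.e. one utterance whose prosodic pattern consists of two tonal units
```

Note that trailing material without a closing terminal break is still returned as an utterance in this sketch; a real pipeline would probably want to flag it instead.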
Since the Informational Patterning Theory posits a systematic correspondence between the prosodic pattern and the information pattern of the utterance, the second level of annotation assigns each tonal unit a tag for its information value, according to the following tagset:
 
ALL:  Allocutive
APC:  Appendix of Comment
APT:  Appendix of Topic
CMM:  Multiple Comment
CNT:  Conative
COB:  Bound Comment
COM:  Comment
DCT:  Dialogic Connector
EMP:  Interrupted Unit
EXP:  Expressive
INP:  Incipit
INT:  Locutive Introducer
PAR:  Parenthesis
PHA:  Phatic
SCA:  Scanning Unit
TMT:  Time Taking
TOP:  Topic
TPL:  Topic List
UNC:  Unclassified
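
As an illustration of how the tagset could be used to check annotated data, the Python sketch below stores the tags listed above in a dictionary and reports any information tag that is not in the list. The names `TAGSET` and `unknown_tags` are hypothetical; the abstract does not describe how the project itself validates tags.

```python
# Information tagset, copied from the list above.
TAGSET = {
    "ALL": "Allocutive",
    "APC": "Appendix of Comment",
    "APT": "Appendix of Topic",
    "CMM": "Multiple Comment",
    "CNT": "Conative",
    "COB": "Bound Comment",
    "COM": "Comment",
    "DCT": "Dialogic Connector",
    "EMP": "Interrupted Unit",
    "EXP": "Expressive",
    "INP": "Incipit",
    "INT": "Locutive Introducer",
    "PAR": "Parenthesis",
    "PHA": "Phatic",
    "SCA": "Scanning Unit",
    "TMT": "Time Taking",
    "TOP": "Topic",
    "TPL": "Topic List",
    "UNC": "Unclassified",
}


def unknown_tags(tags: list[str]) -> list[str]:
    """Return the tags that are not part of the tagset, in input order."""
    return [tag for tag in tags if tag not in TAGSET]


# "XYZ" is a deliberately invalid tag used for illustration.
problems = unknown_tags(["COM", "ALL", "XYZ"])
# problems == ["XYZ"]
```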
 
This annotation was produced with the WinPitch alignment software, through prosodic analysis of the audio data.
The third level of annotation, PoS tagging, was performed automatically with the TreeTagger software developed by the Institute for Computational Linguistics of the University of Stuttgart.
The complete annotation was automatically converted to XML and loaded into a database. The resource runs on eXist, an open-source database management system that stores data according to the XML data model and features index-based XPath/XQuery processing.
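The conversion step can be pictured with the following Python sketch, which builds an XML tree from an in-memory representation of one annotated turn. The element and attribute names follow the sample document in this abstract; the dictionary layout and the function `turn_to_xml` are assumptions made for illustration and do not reproduce the project's actual converter. The `ill` and `proj_ill` attributes are left out because the abstract does not explain how they are filled.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of one annotated turn: each utterance is a
# list of tonal units, and each word is a (form, lemma, pos) triple.
turn = {
    "speaker": "EDO",
    "utterances": [
        [
            {
                "inf": "COM",
                "words": [
                    ("guarda", "guardare", "VER:fin"),
                    ("chi", "chi", "WH"),
                    ("c'", "c'", "ADV"),
                    ("è", "essere", "VER:fin"),
                ],
                "break": ("nonterminal", "/"),
            },
            {
                "inf": "ALL",
                "words": [("nonna", "nonna", "NOUN")],
                "break": ("terminal", "//"),
            },
        ]
    ],
}


def turn_to_xml(turn: dict) -> ET.Element:
    """Build a <turn> element carrying the three annotation levels."""
    root = ET.Element("turn", {"speak": turn["speaker"]})
    for num, units in enumerate(turn["utterances"], start=1):
        # One <term_seq> per utterance (sequence closed by a terminal break).
        seq = ET.SubElement(root, "term_seq", {"num": str(num), "type": "utt"})
        for unit in units:
            # Information level: the tag of the tonal unit.
            tone_unit = ET.SubElement(seq, "tone_unit", {"inf": unit["inf"]})
            # Morphosyntactic level: lemma and PoS on every word.
            for form, lemma, pos in unit["words"]:
                word = ET.SubElement(tone_unit, "word", {"lemma": lemma, "pos": pos})
                word.text = form
            # Prosodic level: the break that closes the tonal unit.
            break_type, symbol = unit["break"]
            brk = ET.SubElement(tone_unit, "break", {"type": break_type})
            brk.text = symbol
    return root


root = turn_to_xml(turn)
xml_text = ET.tostring(root, encoding="unicode")  # serialized document as a string
```

Nesting the words inside tonal units, and tonal units inside utterances, keeps all three levels in a single tree, which is what allows one query to span them.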
A sample XML document follows. It contains all annotation levels for a single turn consisting of one utterance and two prosodic/information units:

<turn speak="EDO">

      <term_seq num="1" type="utt" proj_ill="unknown">
            <tone_unit inf="COM" ill="none">
                  <word lemma="guardare" pos="VER:fin">guarda</word>
                  <word lemma="chi" pos="WH">chi</word>
                  <word lemma="c'" pos="ADV">c'</word>
                  <word lemma="essere" pos="VER:fin">è</word>
                  <break type="nonterminal">/</break>
            </tone_unit>
            <tone_unit inf="ALL">
                  <word lemma="nonna" pos="NOUN">nonna</word>
                  <break type="terminal">//</break>
            </tone_unit>
      </term_seq>
</turn>
 
The database architecture supports cross-level queries on the data. We will present quantitative data on the information structure of the utterances and on the morphosyntactic and lexical content of each type of information unit.
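
To give an idea of what a cross-level query looks like, the Python sketch below runs three counts over the sample turn using the standard library's ElementTree and its limited XPath support. It is only a stand-in: the database itself is queried through XPath/XQuery on eXist, and the actual queries are not shown in this abstract. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The sample turn from this abstract, embedded so the sketch is self-contained.
SAMPLE = """<turn speak="EDO">
  <term_seq num="1" type="utt" proj_ill="unknown">
    <tone_unit inf="COM" ill="none">
      <word lemma="guardare" pos="VER:fin">guarda</word>
      <word lemma="chi" pos="WH">chi</word>
      <word lemma="c'" pos="ADV">c'</word>
      <word lemma="essere" pos="VER:fin">è</word>
      <break type="nonterminal">/</break>
    </tone_unit>
    <tone_unit inf="ALL">
      <word lemma="nonna" pos="NOUN">nonna</word>
      <break type="terminal">//</break>
    </tone_unit>
  </term_seq>
</turn>"""

root = ET.fromstring(SAMPLE)

# Information level: the sequence of unit tags in each utterance.
patterns = Counter(
    tuple(tu.get("inf") for tu in seq.findall("tone_unit"))
    for seq in root.iter("term_seq")
)

# Information x PoS: how often each PoS tag occurs in each unit type.
pos_by_unit = Counter(
    (tu.get("inf"), word.get("pos"))
    for tu in root.iter("tone_unit")
    for word in tu.findall("word")
)

# Information x lexicon: the lemmas that occur in Comment units.
comment_lemmas = [
    word.get("lemma")
    for word in root.findall(".//tone_unit[@inf='COM']/word")
]

# patterns == Counter({("COM", "ALL"): 1})
# pos_by_unit[("COM", "VER:fin")] == 2
# comment_lemmas == ["guardare", "chi", "c'", "essere"]
```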
Future work on the database will include:
- the inclusion of audio data (direct access to the sound), aligned utterance by utterance with the transcripts;
- the inclusion of a further level of annotation, i.e. the illocutionary value of each utterance, as defined by the Language into Act Theory (Teoria della lingua in atto, Cresti 2000).
 
References
Cresti, E. 2000. Corpus di italiano parlato, 2 vols., CD-ROM. Firenze: Accademia della Crusca.
Cresti, E. and M. Moneglia (eds). 2005. C-ORAL-ROM. Integrated reference corpora for spoken Romance languages, DVD + vol. Amsterdam: Benjamins.
Cresti, E. and M. Moneglia. 2010. Informational patterning theory and the corpus-based description of spoken language. The compositionality issue in the topic-comment pattern. In M. Moneglia and A. Panunzi (eds), Bootstrapping Information from Corpora in a Cross-Linguistic Perspective. Firenze: FUP.
eXist. http://exist.sourceforge.net/
Moneglia, M. and E. Cresti. 2006. C-ORAL-ROM. Prosodic boundaries for spontaneous speech analysis. In Y. Kawaguchi, S. Zaima and T. Takagaki (eds), Spoken Language Corpus and Linguistic Informatics. Amsterdam: Benjamins, -114.
Scarano, A. 2009. The prosodic annotation of C-ORAL-ROM and the structure of information in spoken language. In L. Mereu (ed.), Information Structure and its Interfaces. Berlin and New York: Mouton de Gruyter, 51-74.
TreeTagger. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
WinPitch. http://www.winpitch.com/
XML. http://www.w3.org/XML/