The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

Chanier, Thierry ; Poudat, Celine ; et al.

In: JLCL - Journal for Language Technology and Computational Linguistics, GSCL (Gesellschaft für Sprachtechnologie und Computerlinguistik), 2014, 29 (2), pp. 1-30. 〈http://www.jlcl.org/2014_Heft2/Heft2-2014.pdf〉 ; HAL (2014)

Final version for the Special Issue of JLCL (Journal for Language Technology and Computational Linguistics, http://jlcl.org/), "Building and Annotating Corpora of Computer-Mediated Discourse: Issues and Challenges at the Interface of Corpus and Computational Linguistics" (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel); International audience.

The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres in which French is the main language of interaction. It assembles interactions stemming from networks such as the Internet or telecommunication networks, and covers monomodal and multimodal as well as synchronous and asynchronous communication. Corpora are assembled in a standard format, that of the TEI (Text Encoding Initiative). This implies extending the TEI model of text, through a European endeavor, to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant that we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to the decisions made concerning which annotations to apply to which units, and describe the processing pipeline for adding them. All CoMeRe corpora are verified through a staged quality control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the OpenData perspective.

Title:	The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres
Author / Contributor:	Chanier, Thierry ; Poudat, Celine ; Sagot, Benoit ; Antoniadis, Georges ; Wigham, Ciara ; Hriba, Linda ; Longhi, Julien ; Seddah, Djame ; Laboratoire de Recherche sur le Langage (LRL) ; Université Blaise Pascal - Clermont-Ferrand 2 (UBP) ; Lexiques, Dictionnaires, Informatique (LDI) ; Université Paris 13 (UP13)-Université de Cergy Pontoise (UCP) ; Université Paris-Seine-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS) ; Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing (ALPAGE) ; Paris-Rocquencourt, Inria ; Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris Diderot - Paris 7 (UPD7) ; LInguistique et DIdactique des Langues Étrangères et Maternelles (LIDILEM) ; Université Stendhal - Grenoble 3-Université Grenoble Alpes (UGA) ; Interactions, Corpus, Apprentissages, Représentations (ICAR) ; École normale supérieure - Lyon (ENS Lyon)-Université Lumière - Lyon 2 (UL2)-INRP-Ecole Normale Supérieure Lettres et Sciences Humaines (ENS LSH)-Centre National de la Recherche Scientifique (CNRS) ; Centre de recherche textes et francophonies (CRTF) ; Université de Cergy Pontoise (UCP) ; Université Paris-Seine ; Institut des Sciences Humaines Appliquées (ISHA) ; Université Paris-Sorbonne (UP4) ; The authors thank the LABEX ASLAN (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the 'Investissements d'Avenir' programme (ANR-11-IDEX-0007) of the French State, managed by the Agence Nationale de la Recherche (ANR).
Link:	View record in OpenAIRE (full text)
Source:	JLCL - Journal for Language Technology and Computational Linguistics, GSCL (Gesellschaft für Sprachtechnologie und Computerlinguistik), 2014, 29 (2), pp. 1-30. 〈http://www.jlcl.org/2014_Heft2/Heft2-2014.pdf〉 ; HAL (2014)
Publication:	HAL CCSD, 2014
Media type:	unknown
ISSN:	0175-1336 (print) ; 2190-6858 (print)
Keywords:	CMC ; corpus ; Computer Mediated Communication ; CoMeRe ; [SHS.LANGUE] Humanities and Social Sciences/Linguistics
Other:	Indexed in: OpenAIRE ; Language: English ; File description: application/pdf ; Rights: OPEN
