Corpus

Metadata

CORDIAL-SIN is based on a geographically representative body of selected excerpts of spontaneous and semi-directed speech, drawn from CLUL’s rich collection of recorded speech (about 4,500 hours of recordings) obtained in more than 200 locations throughout Portugal as part of the following linguistic atlas projects:

The data collected in these projects were gathered between 1974 and 2004 in rural and fishing communities, from speakers with the sociological profile of traditional dialect informants: typically elderly, with little or no formal education, and born and raised in the place of interview.

CORDIAL-SIN is around 650,000 words long, the result of transcribing approximately 68 hours of fieldwork interviews carried out in the 42 localities or micro-regions shown on the map below:

  1. VPA – Vila Praia de Âncora (Viana do Castelo)
  2. CTL – Castro Laboreiro (Viana do Castelo)
  3. PFT – Perafita (Vila Real)
  4. AAL – Cast.Vide, Porto da Esp., S. Salv. Aramenha, Sapeira, Alpalhão, Nisa (Portalegre)
  5. PAL – Porches, Alte (Faro)
  6. CLC – Câmara de Lobos, Caniçal (Funchal)
  7. PST – Camacha, Tanque (Funchal)
  8. MST – Monsanto (Castelo Branco)
  9. FLF – Fajãzinha (Horta)
  10. MIG – Ponta Garça (Ponta Delgada)
  11. OUT – Outeiro (Bragança)
  12. CBV – Cabeço de Vide (Portalegre)
  13. MIN – Arcos de Valdevez, Bade, S. Lourenço da Montaria (Viana do Castelo)
  14. FIG – Figueiró da Serra (Guarda)
  15. ALV – Alvor (Faro)
  16. SRP – Serpa (Beja)
  17. LVR – Lavre (Évora)
  18. ALC – Alcochete (Setúbal)
  19. COV – Covo (Aveiro)
  20. PIC – Bandeiras, Cais do Pico (Horta)
  21. PVC – Porto de Vacas (Coimbra)
  22. EXB – Enxara do Bispo (Lisboa)
  23. TRC – Fontinhas (Angra do Heroísmo)
  24. MTM – Moita do Martinho (Leiria)
  25. LAR – Larinha (Bragança)
  26. LUZ – Luzianes (Beja)
  27. FIS – Fiscal (Braga)
  28. GIA – Gião (Porto)
  29. STJ – Santa Justa (Santarém)
  30. UNS – Unhais da Serra (Castelo Branco)
  31. VPC – Vila Pouca do Campo (Coimbra)
  32. GRJ – Granjal (Viseu)
  33. CRV – Corvo (Horta)
  34. GRC – Graciosa (Angra do Heroísmo)
  35. MLD – Melides (Setúbal)
  36. STA – Santo André (Vila Real)
  37. MTV – Montalvo (Santarém)
  38. CLH – Calheta (Angra do Heroísmo)
  39. CPT – Carrapatelo (Évora)
  40. AJT – Aljustrel (Beja)
  41. STE – Santo Espírito (Ponta Delgada)
  42. CDR – Cedros (Horta)

Transcription and annotation

The transcription and text markup of CORDIAL-SIN followed the guidelines established for the CORAL corpus (Corpus de Diálogo Etiquetado; see Transcription conventions). The transcription took a conservative approach and marks up general spoken-language phenomena such as pauses, overlapping speech, hesitations, repetitions, rephrased segments, false starts, truncated words, unclear productions, and phonetic and morphophonological variants. This layer of transcription is of particular interest to studies focused on the common strategies of spoken discourse.

The linguistic annotation was carried out on an edited version of the original transcript, which includes only orthographic transcriptions of full sentences or phrasal fragments (usually unfinished sentences) that can be syntactically analysed and parsed. This ‘normalized’ version therefore contains neither repetitions nor phrasal fragments abandoned as a result of reformulation, postponement of production, or hesitation.

The CORDIAL-SIN morphosyntactic annotation system has a CLAWS format and consists of an adaptation (revision/extension) of the system developed for Portuguese by the Tycho Brahe project team. It combines POS tags with sub-tags, mostly inflectional, allowing for a very detailed annotation of the lexical units in the corpus (see Morphosyntactic Annotation Manual). The similarity between the tagsets adopted in the two projects made it possible for CORDIAL-SIN’s morphosyntactic annotation to be implemented using the probabilistic tagger developed by Marcelo Finger (and improved by Fabio Natanael Kepler and Marcelo Finger) for the Tycho Brahe corpus.
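
As a rough illustration of how a tag that combines a POS label with sub-tags can be taken apart, the Python sketch below assumes tokens written as `word/TAG` with sub-tags joined by hyphens. The token format, the helper names and the sample tags are all assumptions made for this example; the Morphosyntactic Annotation Manual remains the reference for the actual tagset.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    word: str
    pos: str          # main part-of-speech label
    subtags: tuple    # remaining (mostly inflectional) sub-tags


def parse_token(token: str) -> TaggedToken:
    """Split one `word/TAG` token (assumed format) into its parts."""
    # Split on the last slash so that words containing "/" stay intact.
    word, _, tag = token.rpartition("/")
    if not word or not tag:
        raise ValueError(f"not a word/TAG token: {token!r}")
    pos, *subtags = tag.split("-")
    return TaggedToken(word, pos, tuple(subtags))


def parse_line(line: str) -> list:
    """Parse a whitespace-separated line of tagged tokens."""
    return [parse_token(tok) for tok in line.split()]


# Hypothetical sample line; the tags are illustrative only.
sample = "as/D-F-P casas/N-P eram/SR-D grandes/ADJ-G-P"
tokens = parse_line(sample)
for tok in tokens:
    print(tok.word, tok.pos, tok.subtags)
```

Keeping the main label apart from the sub-tags makes it easy to search at either level of detail, for instance all determiners regardless of gender and number.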

The syntactic annotation of CORDIAL-SIN adopts the system originally conceived for the Penn Parsed Corpora of Historical English – a rich constituency-based annotation system that marks constituent boundaries, phrase and clause dependencies, sentence types, grammatical relations, discourse functions, some null categories, and certain transformational relations. The adaptation of the original representation schema to the requirements of Portuguese annotation was developed in close collaboration with the Tycho Brahe and Penn Corpora project teams (see Syntactic Annotation Manual).
The annotation was produced automatically from the part-of-speech-tagged texts with the ParsPort tool, a rule-based parser developed by Catarina Magro that operates through CorpusSearch (by Beth Randall). The parser’s output was manually revised using Annotald, a graphical editing interface by Jana Beck, Aaron Ecay and Anton Ingason.
The resulting treebank contains 177,596 parse trees, whose configurations can be searched systematically and exhaustively.
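
To give a sense of what a constituency-based, bracketed parse tree allows, the following Python sketch reads one labelled bracketing and retrieves the constituents carrying a given label. The sample tree and the function names are invented for illustration and are not taken from the corpus; real queries would normally be run with the project's own search tools.

```python
def tokenize(text):
    """Break a bracketed tree into parentheses, labels and words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one bracketed tree into nested (label, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def find(tree, label):
    """Yield every subtree whose label matches exactly."""
    if tree[0] == label:
        yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from find(child, label)


def words(tree):
    """Collect the terminal words of a subtree, left to right."""
    out = []
    for child in tree[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


# Hypothetical tree in Penn-style bracketing.
sample = "(IP-MAT (NP-SBJ (D as) (N casas)) (SR-D eram) (ADJ-G-P grandes))"
tree = parse(tokenize(sample))
subjects = [" ".join(words(t)) for t in find(tree, "NP-SBJ")]
print(subjects)
```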

These four levels of transcription/annotation, which once existed independently, are now integrated in a digital edition in XML format, hosted on TEITOK and complying with the standards defined by the Text Encoding Initiative. In this edition of the CORDIAL-SIN corpus, prepared as part of the Synapse project, all the data (transcription, textual mark-up, POS annotation and lemma), as well as the audio recording and the metadata for each dialect survey, are stored in full-fledged XML files (see XML-TEI Edition Manual). Syntactic annotation is stored as standoff; the PSDX files are aligned with and linked to the corresponding XML files, making it possible to cross-reference syntactic searches with metadata.
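
The general idea of standoff annotation can be sketched as follows: the syntactic layer does not repeat the words but points to token identifiers in the text file, so a syntactic hit can be resolved back to the words and to the metadata of the text. The element and attribute names in this Python example (`tok`, `node`, `leaf`, `tokid`, `locality`) are invented for illustration and should not be read as the actual TEITOK or PSDX schema, which is described in the XML-TEI Edition Manual.

```python
import xml.etree.ElementTree as ET

# Toy text file: tokens with identifiers, plus one metadata attribute.
# All element and attribute names here are hypothetical.
TEXT_XML = """
<text locality="VPA">
  <tok id="w-1" lemma="o">as</tok>
  <tok id="w-2" lemma="casa">casas</tok>
  <tok id="w-3" lemma="ser">eram</tok>
</text>
"""

# Toy standoff file: syntactic nodes that only reference token ids.
STANDOFF_XML = """
<forest>
  <node label="NP-SBJ">
    <leaf tokid="w-1"/>
    <leaf tokid="w-2"/>
  </node>
  <node label="SR-D">
    <leaf tokid="w-3"/>
  </node>
</forest>
"""


def index_tokens(text_root):
    """Map each token id to its element in the text file."""
    return {tok.get("id"): tok for tok in text_root.iter("tok")}


def resolve(standoff_root, tokens, label):
    """Return the word forms covered by each standoff node with `label`."""
    hits = []
    for node in standoff_root.iter("node"):
        if node.get("label") == label:
            hits.append([tokens[leaf.get("tokid")].text
                         for leaf in node.iter("leaf")])
    return hits


text_root = ET.fromstring(TEXT_XML)
standoff_root = ET.fromstring(STANDOFF_XML)
tokens = index_tokens(text_root)

subject_words = resolve(standoff_root, tokens, "NP-SBJ")
locality = text_root.get("locality")  # metadata reached via the linked text
print(locality, subject_words)
```

Because the link runs through token identifiers, the text file can carry the transcription, lemma and metadata while the syntactic layer is kept in a separate file.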

Project guidelines and documentation:

Browsing and searching

To view, query and download the corpus, visit the following pages:

Reference

Martins, A. M. (coord.) [1999-2022]. CORDIAL-SIN: Corpus Dialetal para o Estudo da Sintaxe / Syntax-oriented Corpus of Portuguese Dialects. Lisboa, Centro de Linguística da Universidade de Lisboa. URL: https://www.clul.ulisboa.pt/projeto/cordial-sin-corpus-dialectal-para-o-estudo-da-sintaxe


CORDIAL-SIN by Centro de Linguística da Universidade de Lisboa is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
