Data Formats: representing annotations

This section defines the formalisms that project participants use to annotate and exchange resources. Deliverable 1.1 of the project describes the proposed formalisms in detail. The following files are distributed:

  • transread_v1-2.xsd: the XML Schema, which defines the annotation formalism.
  • transread_v1-2.dtd: the DTD of the annotation scheme, which describes its structure.

We also provide a sample text annotated according to this scheme. The sample is based on an excerpt from the novel "The Last of the Mohicans" by James Fenimore Cooper and consists of three files:

  • sample_Mohicans_en.xhtml: the first two chapters of the English version of the novel, taken from Project Gutenberg.
  • sample_Mohicans_fr.xhtml: the first two chapters of the French version of the novel, taken from Project Gutenberg.
  • sample_Mohicans_annot.xml: the annotation file linking the previous two documents. The file contains alignments at various levels, morpho-syntactic annotations, word sense disambiguation information, etc.

These files can be downloaded here.
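
As a first look at an annotation file such as sample_Mohicans_annot.xml, it can help to list which element types it contains and how often each occurs. The following is a minimal sketch using only the Python standard library. It makes no assumption about the schema itself: the helper names (local_name, count_elements) and the small demo document are hypothetical and are not taken from transread_v1-2.xsd.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def count_elements(root):
    """Return a Counter mapping element names to their number of occurrences."""
    return Counter(local_name(el.tag) for el in root.iter())


# Hypothetical stand-in document; its element names are NOT those of the
# project's annotation schema and serve only to make the sketch runnable.
DEMO = """<annotations>
  <link id="a1"/>
  <link id="a2"/>
  <note>demo</note>
</annotations>"""

if __name__ == "__main__":
    counts = count_elements(ET.fromstring(DEMO))
    for name, n in counts.most_common():
        print(name, n)
```

To inspect the distributed sample instead of the demo string, the root can be obtained with ET.parse("sample_Mohicans_annot.xml").getroot(), assuming the file has been downloaded to the working directory.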

Parallel Corpora

This section lists the parallel corpora created and manually annotated during the project. These corpora are distributed under the CC-BY licence. If you use them in your research, please cite:

  author = {Xu, Yong and Yvon, François},
  title = {Novel annotation schemes for sentential and sub-sentential alignments of bi-texts},
  booktitle = {Proceedings of 10th Language Resources and Evaluation Conference},
  year = 2016,
  series = {LREC'16},
  address = {Portorož (Slovenia)}

Thanks to supplementary funding from the French Ministère de la Culture / Direction Générale à la langue française et aux langues de France, additional manual subsentential alignments were collected for 5 French/English short stories using a "divisive" alignment scheme. These texts were also automatically annotated with rich syntactic information, for a total of about 12K bilingual segments. This extension led us to slightly modify the alignment format (see the documentation in the archive below). Download the complete alignment files and the related documentation; see the README file for details.

The bilingual reader


The digital roll

  • Watch the Digital Roll Simulator

  • You can download the d-roll simulator for MacOS X here: Reading with a Digital Roll