Place: Stockholm University, Department of Linguistics, Södra huset C
Organisation Committee: Sofia Gustafson-Capková, Yvonne Samuelsson, Martin Volk (Coordinator)
Page last modified on: Tuesday, 19 September 2006
The combined research on treebanks and parallel corpora has recently led to
parallel treebanks. A parallel treebank consists of syntactically annotated
sentences in two or more languages, taken from translated (i.e. parallel)
documents. In addition, the syntax trees of two corresponding sentences are
aligned on a sub-sentential level, that is, on the word, phrase and clause
level. Parallel treebanks can be used
as training or evaluation corpora for word and phrase alignment, as input for
example-based
machine translation (EBMT), as training corpora for transfer rules, or for
translation studies.
This Symposium brings together researchers who work on the automatic annotation and exploitation of parallel corpora. Their focus ranges from Machine Translation, Alignment, Parsing and Grammar Extraction to Translation Studies, but they are all interested in how best to represent and take advantage of the information provided by parallel texts. They use statistical and linguistic methods to process natural language corpora and to extract interesting knowledge from them.
The Symposium is open to the general public.
| Thursday, 21 September 2006 | |||
| 14.00-14.15 | Martin Volk (Stockholm): Introduction | Room C307 | |
| 14.15-15.35 | Session 1: (2 presentations) | Stefan Evert and Bettina Schrader (Osnabrück): Parallel Treebank Word Alignment Evaluation [Abstract] Mary Hearne (Dublin): Exploiting linguistically-annotated parallel corpora for translation [Abstract] | |
| 15.35-16.00 | Coffee Break | ||
| 16.00-17.20 | Session 2: (2 presentations) | Eckhard Bick (ISK, University of Southern Denmark): Automatic Syntactic and Dependency Annotation as a Tool for Deeper Alignment [Abstract] Joakim Nivre (Växjö and Uppsala): Multilingual Dependency Parsing and Parallel Treebanks [Abstract] | |
| 17.20-17.40 | Short Break | ||
| 17.40-18.20 | Session 3: (1 presentation) | Jonas Kuhn (Saarbrücken and Potsdam): Multilingual parallel corpora as training data for grammar inference [Abstract] | |
| 19.30- | Symposium Dinner (at Restaurant Pelikan, Blekingegatan, Södermalm) | ||
| Friday, 22 September 2006 | |||
| 09.15-10.35 | Session 4: (2 presentations) | Christophe Chenon (Grenoble): TransTree, a formalism to capture nested correspondences at sub-sentential level [Abstract] Jan Hajic (Prague): Syntax meets Semantics (in the family of Prague Dependency Treebanks) [Abstract] | Room C307 |
| 10.35-11.00 | Coffee Break | ||
| 11.00-12.20 | Session 5: (2 presentations) | Silvia Hansen-Schirra (Saarbrücken): The CroCo Corpus: towards a parallel treebank for translation studies and practice [Abstract] Yvonne Samuelsson, Sofia Gustafson-Capková, and Martin Volk (Stockholm): Experiences from building an English-German-Swedish Parallel Treebank [Abstract] | |
| 12.20-12.45 | Closing remarks | ||
| 13.00 | Lunch (Faculty Club) | ||
| 14.29-15.10 | Archipelago Excursion: Bus from Stockholm University to Vaxholm | ||
| 16.15-17.10 | Boat from Vaxholm to Stockholm City | ||
| Optional: Walk through the old city "Gamla Stan" | |||
Note: Each presentation is given 40 minutes including discussion.
Title: Parallel Treebank Word Alignment Evaluation
Authors: Stefan Evert and Bettina Schrader (Osnabrück)
Abstract: Evaluating word alignment quality is generally considered to be an important but also very difficult task, as the sheer number of projects, shared tasks, workshops and articles devoted to the topic indicates. However, until recently, it seemed that a standard methodology had been established in the field: the automatic alignment is evaluated against a gold standard, i.e. a parallel text aligned by at least one human annotator, and the alignments made by the annotator are marked as either possible or sure, depending on how much confidence the annotator had in the decision. Often, one of the languages used is English, and the annotator knows the Blinker annotation guidelines (Melamed, 1998). As the evaluation metric, the alignment error rate (AER, Och & Ney, 2000) is used. In short, all the ingredients of a good evaluation - a well-defined metric, manual gold annotation, and clear guidelines - are in place.
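To make the metric concrete, here is a minimal sketch of AER as it is commonly defined: with predicted links A, sure gold links S and possible gold links P (S being a subset of P), AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The function name, the representation of links as (source index, target index) pairs and the toy data are assumptions made for illustration; they are not taken from the talk.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S treated as a subset of P.

    Links are (source_index, target_index) pairs; this representation is
    an assumption made for the example.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # every sure link also counts as possible
    if not a and not s:
        return 0.0  # nothing predicted and nothing required
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy gold standard (hypothetical data): two sure links, two possible links.
sure = {(0, 0), (1, 1)}
possible = {(2, 1), (2, 2)}
predicted = {(0, 0), (1, 1), (2, 2), (3, 3)}

aer = alignment_error_rate(predicted, sure, possible)
print(round(aer, 3))  # 0.167
```

Note that a predicted link that is only "possible" lowers the error without being required, which is one reason the sure/possible distinction influences the score so strongly.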
However, the situation has changed recently: at the joint COLING/ACL conference this year, alignment error rate has been shown to give misleading results (Fraser & Marcu, 2006), and the authors cautioned against using gold standards that distinguish between sure and possible alignments. In other words, they have discredited the gold standards and methodologies on which virtually all recent evaluation results have been based. In order to remedy this situation and establish sound evaluation standards, the following issues will have to be explored:
In our talk, we are going to address these questions. We will give examples of problematic cases, suggest solutions, and report on open questions based on our own efforts at creating a gold standard: We have manually aligned 242 German-English sentence pairs at the word level, following a modified version of the Blinker guidelines. All sentence pairs were independently aligned by two annotators (both native speakers of German with excellent knowledge of English), and disagreements were resolved by discussion. The annotations were identical in roughly 80% of all cases. However, most of the remaining 20% proved to be idioms and other multiword expressions and were hence difficult to annotate. In some cases, we noticed that aligning at the word level is not always sufficient and that alignments at the phrase level are necessary.
Finally, we suggest using parallel treebanks to evaluate alignment quality thoroughly. Firstly, treebanks are usually carefully annotated with well-devised annotation guidelines and annotation procedures, and hence allow existing gold standards to be replaced with comparatively better, and possibly larger, data. Secondly, treebanks also contain monolingual linguistic annotation that makes it possible to explore the strengths and deficiencies of an alignment program in greater detail.
References:
Title: Exploiting linguistically-annotated parallel corpora for translation
Author: Mary Hearne (Dublin)
Abstract: Data-Oriented Translation (DOT) is a data-driven model of translation which exploits linguistic annotations associated with the training sentence pairs. It is founded on an approach to data-driven syntactic analysis (Data-Oriented Parsing, or DOP) and, consequently, linguistic knowledge is fundamental to the model - the examples themselves, the linguistic annotations and the statistical inferences drawn from the data all play an equally important role in the translation process.
In this talk, I will focus mainly on the Tree-DOT model. This model assumes that each training sentence pair is annotated with
I will discuss the impact of the assumed annotations on the translation process, describing the linguistic and probabilistic generalisations we extract. I will present some of our empirical findings where the links between tree nodes were inserted manually.
We have developed a sub-sentential alignment algorithm to automatically induce the links between nodes in the source and target trees. I will present the algorithm, and empirical findings for the Tree-DOT model where automatically-induced alignments are used.
Finally, DOT models can be defined over linguistic formalisms other than phrase-structure trees. I will describe one such model which assumes Lexical Functional Grammar representations.
Title: Automatic Syntactic and Dependency Annotation as a Tool for Deeper Alignment
Author: Eckhard Bick (ISK, University of Southern Denmark)
Abstract: Alignment in parallel corpora is usually achieved with the help of statistical tools, which are very efficient at the sentence level, but somewhat error-prone in word alignment, due to problems with many-to-one and many-to-many correspondences, discontinuities, morphological variation etc. At the same time, for many applications, it would be desirable to match not words or even chains of adjacent words, but chunks with some syntactic or semantic substance. Thus, in a parallel treebank, constituents can be matched instead of words, ideally based not only on chunking, but also on form-and-function equivalence. However, constituent treebanks are usually created or at least revised by hand, limiting their size and creating sparse data problems for some applications. As an alternative, aiming at the automatic creation of parallel treebanks, dependency annotation can provide more or less the same structural information as constituent grammar, while at the same time being more robust as an annotation method, and closer to simple word/token alignment - promising robustness also in terms of the alignment algorithm. By aligning head tokens and letting daughters and deeper descendants follow their head, constituent alignment is still achieved - in an implicit way.
Of course, a prerequisite for the dependency-based alignment of a large body of parallel text is the existence of automatic, reliable and - not least - compatible dependency parsers. The VISL project at the Institute of Language and Communication (http://beta.visl.sdu.dk) has developed a cross-language set of grammatical form and function categories, and implemented it in small teaching treebanks for 27 languages, while at the same time embarking on the creation of automatic parsers and larger annotated corpora (http://corp.hum.sdu.dk) for a smaller number of "research" languages (8 Romance and Germanic languages). For 4 languages (da, pt, en, fr), dependency parsers are under development as add-ons for the existing Constraint Grammar parsers.
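The idea of implicit constituent alignment can be sketched as follows. This is an illustration only, not the VISL implementation: the head-array encoding (heads[i] is the index of the head of token i, -1 for the root), the function names and the toy Danish-English pair are all assumptions. Given links between head tokens, each link is expanded to the token sets dominated by the two heads.

```python
def subtree(heads, node):
    """Return the sorted token indices dominated by `node`, the node included.

    heads[i] is the index of the head of token i, or -1 for the root
    (an encoding assumed for this example).
    """
    members = {node}
    changed = True
    while changed:  # keep adding daughters until nothing new is found
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return sorted(members)


def implicit_constituent_alignment(src_heads, tgt_heads, head_links):
    """Expand each aligned head pair to the pair of subtrees below the heads."""
    return {
        (s, t): (subtree(src_heads, s), subtree(tgt_heads, t))
        for s, t in head_links
    }


# Hypothetical toy pair: Danish "manden sover" / English "the man sleeps".
da_heads = [1, -1]      # manden -> sover, sover = root
en_heads = [1, 2, -1]   # the -> man, man -> sleeps, sleeps = root
head_links = [(0, 1), (1, 2)]  # manden ~ man, sover ~ sleeps

result = implicit_constituent_alignment(da_heads, en_heads, head_links)
for link, (da_span, en_span) in sorted(result.items()):
    print(link, da_span, en_span)
```

In the toy pair, linking "manden" to "man" yields the constituent correspondence "manden" ~ "the man" without the determiner ever being aligned explicitly.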
My Symposium talk will present this approach using the Danish DanGram parser and Arboretum treebanks as examples, and show how the dependency annotation can be used for the Danish-English alignment of a sample of the Europarl corpus (work in progress). Apart from dependency relations, CG syntactic tags (subject, object, postnominal, auxiliary argument etc.) are also used, as well as the bilingual lexicon from a CG- and dependency-based Machine Translation system. Ultimately, alignment results are expected to provide feedback and enrichment to the MT lexicon, in particular with regard to multi-word expressions and phrase translation memory. The output of the MT system, incidentally, creates parallel treebanks in its own right, of sorts at least, from monolingual data - as a byproduct, since translation and transformation are based on source language dependency trees. While not representing authentic data on the target language side, such an "MT treebank" could still be used for machine learning, or a frequency based structural comparison of monolingual English data with human- vs. machine-translated "Danish" English, pointing out strengths and weaknesses of different methods.
Title: Multilingual Dependency Parsing and Parallel Treebanks
Author: Joakim Nivre (Växjö and Uppsala)
Abstract: Multilingual dependency parsing, in the sense of applying the same dependency parser to multiple languages, has seen a recent surge of interest, boosted in particular by the shared task at CoNLL 2006. In this talk, I will present a classifier-based approach to dependency parsing, which has been shown to result in accurate parsing for a wide range of languages, often with fairly limited amounts of training data, and which is implemented in the freely available MaltParser system. After presenting the underlying parsing and learning methodology and reviewing recent experimental results, I will discuss the role that such a system could play in the development of a truly parallel (dependency) treebank.
Title: Multilingual parallel corpora as training data for grammar inference
Author: Jonas Kuhn (Saarbrücken and Potsdam)
Abstract: In this talk, I will give an overview of the ongoing PTOLEMAIOS project, which explores the use of multilingual parallel text as a basis for grammar inference/induction and for weakly supervised learning/bootstrapping of grammars. A main focus is on inference of monolingual grammars -- here, the parallel text in the other languages is "just" used as a source of implicit structural and semantic information. Technically, the cross-language linking is performed by a standard statistical word alignment, which we try to exploit in various grammar learning algorithms.
Our approach is related to the "annotation projection" idea (from work by Hwa, Resnik and others); however, we do not exploit a mature NLP tool (such as a parser) for one of the languages, projecting its analysis to a new language as the (pseudo-)annotation for supervised learning. Instead we explore how unsupervised language learning is affected by the addition of streams of unanalyzed translation data, hoping to gather some insight into formal questions of grammar learnability. Although superficially, our grammar inference architecture is quite different from the language acquisition scenario of a human learner, one can argue that at a higher level of abstraction, the available streams of additional parallel text data (with their own inherent, but initially unknown structuring principles) bear some resemblance to the streams of additional perception data, as they are available to the human learner in parallel to the primary language data. Therefore, an exploration of the role that parallel text data can play in grammar inference may contribute to our understanding of the cognitive language acquisition problem.
Title: TransTree, a formalism to capture nested correspondences at sub-sentential level
Author: Christophe Chenon (Grenoble)
Abstract: Computer-aided translation environments based on translation memories use text segments (typically whole sentences) delineated and aligned thanks to the translator's expertise, and do not perform any linguistically motivated analysis. The goal of the TransTree formalism is to capture nested correspondences between sub-segments of bilingual texts.
These complex correspondences, called amphigrams, make up a tree structure that is easily expressed in XML. With a simple shallow transformation, a dynamic visualization of several levels of correspondences between sub-segments can be obtained. The computation of a TransTree structure is based on the comparison of binary trees produced with statistical methods on aligned segments. We use an index that we call "secability" to produce such binary trees. The comparison is supported at the typographical word level by atomic, statistically motivated correspondences.
Several implementation options have been investigated to refine the TransTree structure computed on one bisegment. Some reorganization of atomic correspondences can be achieved with a view to optimizing the overall tree structure, and abstract translation patterns can be computed with clustering techniques over examples found in the corpus. We will detail some of these options in our presentation.
Title: Syntax meets Semantics (in the family of Prague Dependency Treebanks)
Author: Jan Hajic (Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic)
Abstract: The Prague Dependency Treebank project is aimed at a linguistically complex, multi-tier annotation of naturally occurring sentences in natural language. There are four tiers at present: the basic token tier (level 0), and the morphological, surface syntax, and semantics (called "tectogrammatics") tiers. The syntactic and tectogrammatic tiers are based on the dependency representation principle. So far, the project has produced three corpora: the Czech-language-only Prague Dependency Treebank, the Prague Czech-English Dependency Treebank and the Prague Arabic Dependency Treebank. In the talk, the principles of the Prague Dependency Treebank linguistic annotation scheme will be presented, focusing on the highest (or "deepest") tier (the tectogrammatical one, where the syntactic annotation is complemented with several semantic phenomena). Some technical details will also be discussed, and if time allows, some annotation tools and annotated data will be demonstrated.
Title: The CroCo Corpus: towards a parallel treebank for translation studies and practice
Author: Silvia Hansen-Schirra (Saarbrücken)
Abstract: In this talk a multiply annotated and aligned corpus and its
use in translation studies and practice will be presented. The research
described here is part of the CroCo project funded by the German Research
Foundation. The CroCo Corpus consists of English originals, their German
translations as well as German originals and their English translations. Both
translation directions are represented in eight registers. Altogether the
corpus comprises one million words. The corpus is tokenized and annotated for
part-of-speech, morphology, lemmas, phrasal categories and grammatical
functions. Furthermore, alignment layers for words, clauses and sentences as
well as a mapping of the grammatical functions are provided. The XCES-conformant
XML files of the corpus also include a header with meta-information on each
text.
On the basis of the CroCo Corpus, it will be shown how linguistic properties of
translations, such as explicitation or simplification, can be investigated
empirically. Moreover, it will be demonstrated how such a resource can be used as
a translation memory, exploiting the linguistic enrichment of the corpus for
translation practice and teaching.
Title: Experiences from building an English-German-Swedish Parallel Treebank
Authors: Yvonne Samuelsson, Sofia Gustafson-Capková, and Martin Volk (Stockholm)
Abstract: We have built an English-German-Swedish parallel treebank. All sentences were annotated with Parts-of-Speech and constituent structure trees and manually checked. After double-checking by a second annotator and automatic consistency checking, the monolingual trees were aligned on both the word and the phrase level.
In this talk we will report on the experiences in this project. We will explain our steps in building the parallel treebank and describe our alignment tool. We will discuss the choice of
and how they affected the project. And we will further report on how we work towards a high quality resource with completeness and consistency checking of the treebanks and the alignment.
Our parallel treebank contains the first two chapters of Jostein Gaarder's novel "Sofie's World" with about 500 sentences in English, German and Swedish. In addition, it contains 500 sentences from economy texts (a quarterly report by a multinational company as well as part of a bank's annual report) in the three languages. The trees were aligned between English and German and between English and Swedish. The third alignment, between German and Swedish, was automatically derived.
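The abstract does not say how the German-Swedish alignment was derived. One plausible way, shown here purely as an assumption, is to compose the two manual alignments through their shared English side: a German node and a Swedish node are linked whenever both are linked to the same English node. The function name and the node identifiers below are hypothetical.

```python
def compose_via_pivot(en_de, en_sv):
    """Derive German-Swedish links from English-German and English-Swedish links.

    Each alignment is a set of (english_node, other_node) pairs. Composition
    through English is an assumed method, not the one confirmed by the authors.
    """
    de_by_en = {}
    for en, de in en_de:
        de_by_en.setdefault(en, set()).add(de)

    derived = set()
    for en, sv in en_sv:
        # Link every German node that shares this English pivot node.
        for de in de_by_en.get(en, ()):
            derived.add((de, sv))
    return derived


# Hypothetical node identifiers for one sentence triple.
en_de = {("en_NP1", "de_NP1"), ("en_VP2", "de_VP3")}
en_sv = {("en_NP1", "sv_NP1"), ("en_S0", "sv_S0")}

de_sv = compose_via_pivot(en_de, en_sv)
print(sorted(de_sv))  # [('de_NP1', 'sv_NP1')]
```

Under this scheme a German-Swedish link exists only where the English node is aligned in both directions, so nodes such as `de_VP3` and `sv_S0` in the example remain unaligned.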