Computing Reviews
Constructing corpora for the development and evaluation of paraphrase systems
Cohn T., Callison-Burch C., Lapata M. Computational Linguistics 34(4): 597-614, 2008. Type: Article
Date Reviewed: Sep 14 2009

Automatic paraphrasing, which provides a means to check the semantic equivalence of syntactically different statements drawn from different sources, improves plagiarism detection, semantic news aggregation, and even information integration for topic-focused summarization tasks. Cohn, Callison-Burch, and Lapata do an excellent job of reviewing the field’s corpora with respect to the requirements of automatic paraphrase research. The introduction effectively explains why automatic paraphrase research is relevant and touches on how automatic systems work. According to the authors, the field lacks a formal, standard definition of paraphrase; they therefore define paraphrasing in terms of word alignment and semantic equivalence, with the hope that this definition will serve as a gold standard in the field.

The section on corpus creation and annotation explains the logic behind how the corpus was derived and why tradeoffs were made that add consistency toward standardization: “Our corpus was compiled from three data sources[:] ... the Multiple-Translation Chinese (MTC) corpus, Jules Verne’s Twenty Thousand Leagues Under the Sea novel[,] ... and the Microsoft Research (MSR) paraphrase corpus.” The authors use automatic word alignment between the parent corpora, corrected by humans. This section is a good introduction to the criteria needed for the automatic evaluation of a system against a corpus, and to how human annotators can be used to cover those criteria. It complements Section 3, “Human Agreement,” which provides an extensive explanation of how to use inter-coder agreement in checking a corpus’s annotations, including what modifications to consider when sampling for its analysis.
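As a rough illustration of the kind of data involved (not the authors’ actual tooling or data format), a word-aligned paraphrase pair can be represented as two token sequences plus a set of index links, with a hypothetical automatic aligner’s output corrected by human add/remove edits:

```python
# Illustrative sketch only: the sentences, links, and correction sets below
# are invented, not taken from the MTC/Verne/MSR data described in the paper.

src = "he passed away last night".split()
tgt = "he died yesterday evening".split()

# Hypothetical automatic aligner output: each link is (src index, tgt index).
auto_links = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)}

# Human correction pass: remove a spurious link, add a missed one.
removed = {(4, 2)}   # "night" was wrongly linked to "yesterday"
added = {(4, 3)}     # "night" actually aligns with "evening"

gold_links = (auto_links - removed) | added

for i, j in sorted(gold_links):
    print(f"{src[i]} <-> {tgt[j]}")
```

Storing alignments as sets of index pairs makes the human correction step a pair of cheap set operations, which is one reason alignment-based annotation schemes scale better than free-form paraphrase judgments.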

Section 3 starts by introducing the standard measures of precision, recall, and F1, and then adds a chance-corrected agreement measure based on the kappa statistic. Next, it delineates the differences in annotator choices between word-based and phrase-based agreement. The authors then adequately explain the necessary divergences from the standard measures used in similar areas, such as machine translation, and derive their new measures formula by formula. Finally, the section explains how the differences in agreement relate to the characteristics of the three parent corpora. This is probably the most crucial section, as it sheds light on how human annotations are affected by the syntactic composition of the three diversely created parent corpora and how these effects carry over to the authors’ corpus.
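To make these measures concrete, here is a minimal sketch (not the paper’s exact alignment-specific formulation) of precision, recall, and F1 over sets of alignment links, together with Cohen’s-kappa-style chance-corrected agreement between two annotators; the annotator labels are invented for illustration:

```python
# Sketch of the standard measures discussed in Section 3; the example
# annotations below are hypothetical, not drawn from the paper's data.

def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 of predicted alignment links vs. gold."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # correctly predicted links
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Two annotators judging ten candidate word pairs (1 = aligned, 0 = not).
ann1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
ann2 = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(ann1, ann2), 3))
```

The chance correction matters here because alignment judgments are heavily skewed toward “not aligned,” so raw percent agreement overstates how well annotators actually agree.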

The experiments seem to support the use of their corpus both for paraphrase modeling--lexical and structural, as shown by the grammar rules that can be derived--and as an evaluation set for judging automatic systems. However, only a limited number of human annotators were available--four in total--and there are problems with the liberal way the necessary probability distributions for predicted annotator edits were estimated in the chance-corrected agreement measure. This makes their corpus susceptible to the same errors they claim to factor out when dealing with human annotators: attention span, background, and annotator training. It is important to note that annotator agreement was measured using two annotators on a sampling of subsets derived from the parent corpora. Lastly, as previously mentioned, an automatic system was first used to align the corpus, and its output was then corrected by hand. This system, which was also trained on the largest of the three parent corpora, was then used in the evaluation of the authors’ work and compared with a co-training system that was not properly parametrized for the corpus and, of course, performed poorly; yet, it is suggested that a hypothetical paraphrase extractor based on automatic word alignment outperforms a co-training approach.

Despite the aforementioned issue of proper evaluation, the early part of the study serves as a precursor for later research on creating a human-annotated paraphrase corpus. Its greatest contribution is the detailed analysis of the considerations involved in creating the corpus and working with human annotators. Although I cannot recommend this work as a worthy standard for the field, it is good learning material for experienced computational linguists who are familiar enough with corpus creation to improve upon the methodology’s weak points: proper and even training of the automatic systems used in creating the corpus; questionable probability distribution estimation; more human annotators, to help alleviate the aforementioned human factors in corpus building; and a proper, unbiased evaluation of the final corpus.

Reviewer: Quinsulon L. Israel | Review #: CR137299 (1005-0518)
Language Generation (I.2.7 ... )
Linguistics (J.5 ... )
Other reviews under "Language Generation":

A phonemic transcription program for Polish. Chomyszyn J. International Journal of Man-Machine Studies 25(3): 271-293, 1986. Type: Article. Date reviewed: Aug 1 1988.
Speech synthesis and the rhythm of English. Isard S., Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1985. Type: Book (9789780131638419). Date reviewed: Jun 1 1988.
Knowledge-intensive natural language generation. Jacobs P. Artificial Intelligence 33(3): 325-378, 1987. Type: Article. Date reviewed: Dec 1 1988.
