Computing Reviews
Syntax-based collocation extraction
Seretan V., Springer-Verlag New York, Inc., New York, NY, 2010. 212 pp. Type: Book (978-9-400701-33-5)
Date Reviewed: Sep 29 2011

This relatively short book is of particular interest for the case it makes for syntactic parsing in the extraction of word types from texts. It gives a very clear and painstakingly documented account of two experiments in extracting collocations, or associated words, from digital texts in English and three other languages: French, Italian, and Spanish.

Collocations first appear in the linguistic literature in the writings of the mid-20th century British linguist J. R. Firth. He noted that certain combinations of words typically occur together, so knowing which combinations are typical of a language is an important part of speaking and writing it. As later writers have pointed out, the words in a collocation are related syntactically and semantically. While the combination is transparent grammatically, it is not fully predictable. Examples of English collocations include adjective-noun pairs like “petty attempt,” verb-noun (object) pairs like “face dilemma,” and noun (subject)-verb pairs like “river flow.” An example of a nonadjacent word pair is found in the sentence “It is a very pressing issue(i) which Mr. Sacrédeus is addressing (e(i)),” which contains the collocation “address issue,” but not in the normal adjacent verb-noun order.

Nonnative speakers of a language have to learn collocations; there are numerous collocation dictionaries, mainly for English and French, but also for other languages such as Italian and Russian. Collocations also present a problem for machine translation, as the literal translation of a collocation is not the idiomatic way of expressing the meaning in the other language. Some examples are (French) “effectuer visite” (“accomplish, bring about a visit”), equivalent to the English “pay a visit,” or (English) “bridge gap,” equivalent to the French “combler lacune” (“fill a lacuna”), or “combler fossé” (“fill a ditch”).

Both experiments use part-of-speech (POS) tagging and lemmatization, and contrast two methods of extracting collocations. The more common and simpler method uses a sliding window on POS-tagged texts to pick out adjacent words that may be collocations. The innovation reported in these experiments is the use of a “deep” parser (the FIPS multilingual parser), which has two components: a modified transformational grammar that is able to cross-reference words separated in the text and link them together, and an attribute-value matrix from Lexical Functional Grammar that allows cross-referencing of semantically related words, even if they are not adjacent.

The corpus for the first experiment was a portion of the French text from the Hansard Canadian Parliament deliberations. The second experiment used a much larger multilingual corpus including French, English, Italian, and Spanish. The baseline experiment used a sliding window on the corpus texts, which underwent several steps of processing. Lemmas (the base forms shared by the inflected variants of a word) were selected and filtered by part of speech to eliminate function words, auxiliaries, and so on, as well as words separated by punctuation. Candidate pairs were then identified and filtered to exclude POS combinations that do not have a syntactic relation, and the remaining candidates were ranked by how likely they were to be collocations.
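
The sliding-window baseline described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure from the book: the tag names ("N", "V", "A", "PUNCT"), the set of allowed POS patterns, the default window size, and the function name window_candidates are all hypothetical, and the input is assumed to be already lemmatized and POS-tagged.

```python
from collections import Counter

# Hypothetical tag set and POS patterns; the review does not list the real ones.
CONTENT_POS = {"N", "V", "A"}
ALLOWED_PATTERNS = {("A", "N"), ("N", "V"), ("V", "N"), ("N", "N")}


def window_candidates(sentence, window=5):
    """Yield candidate lemma pairs from one POS-tagged, lemmatized sentence.

    `sentence` is a list of (lemma, pos) tuples. Function words are dropped,
    pairs never cross punctuation, and each content word is paired with at
    most `window - 1` content words that follow it in the same segment.
    """
    # Split the sentence at punctuation, keeping only content words.
    segments = [[]]
    for lemma, pos in sentence:
        if pos == "PUNCT":
            segments.append([])
        elif pos in CONTENT_POS:
            segments[-1].append((lemma, pos))

    for segment in segments:
        for i, (lemma1, pos1) in enumerate(segment):
            for lemma2, pos2 in segment[i + 1 : i + window]:
                # Keep only POS combinations that could be syntactically related.
                if (pos1, pos2) in ALLOWED_PATTERNS:
                    yield (lemma1, lemma2)


# Toy input for "The river flows, and he faces a dilemma."
sentence = [
    ("the", "DET"), ("river", "N"), ("flow", "V"), (",", "PUNCT"),
    ("and", "CONJ"), ("he", "PRON"), ("face", "V"),
    ("a", "DET"), ("dilemma", "N"),
]
pair_counts = Counter(window_candidates(sentence))
print(pair_counts)
```

Note that the window here only sees linear order: it would pair “river” with “flow” because they are close together, not because one is the subject of the other, which is the weakness the parsing-based method is meant to address.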

The two parsing-based experiments proceeded in stages. Each sentence or subconstituent was analyzed by the FIPS parser, and then candidates for collocations were identified. The candidates were checked against morphological criteria and for a specific syntactic relation, such as the modifier relation of adjective and noun, the relation of verb and object, and so on. The candidates were then ranked by how likely they were to be collocations. In both the parsing and sliding window experiments, sets of words in a possible collocation were manually evaluated by trained linguists who were native speakers of the language in question. In both experiments, when the rankings produced by each method were compared with the judgments of these speakers, the deep parsing method proved more successful and more consistently accurate than the sliding window with only POS tagging.
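
To make the contrast with the window method concrete, the candidate selection step can be sketched as a filter over parser output. The sketch below assumes the parse has already been reduced to head-dependent triples; the Dependency record, the relation names, and the POS tags are hypothetical and are not the FIPS output format, which the review does not describe.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    """One head-dependent link from a (hypothetical) parser output."""
    head: str            # lemma of the head word
    head_pos: str
    relation: str        # e.g. "object", "subject", "modifier"
    dependent: str       # lemma of the dependent word
    dependent_pos: str


# Hypothetical mapping from (head POS, relation, dependent POS) to a pattern name.
TARGET_RELATIONS = {
    ("V", "object", "N"): "verb-object",
    ("V", "subject", "N"): "subject-verb",
    ("N", "modifier", "A"): "adjective-noun",
}


def syntactic_candidates(dependencies):
    """Yield (pattern, head, dependent) for links in a targeted relation."""
    for dep in dependencies:
        key = (dep.head_pos, dep.relation, dep.dependent_pos)
        pattern = TARGET_RELATIONS.get(key)
        if pattern is not None:
            yield (pattern, dep.head, dep.dependent)


# Hand-written links for "It is a very pressing issue which Mr. Sacrédeus
# is addressing"; a deep parser is assumed to have resolved the relative
# clause so that "issue" is linked to "address" as its object.
parse = [
    Dependency("address", "V", "object", "issue", "N"),
    Dependency("address", "V", "subject", "Sacrédeus", "PROPN"),
    Dependency("issue", "N", "modifier", "pressing", "A"),
    Dependency("pressing", "A", "modifier", "very", "ADV"),
]
candidates = list(syntactic_candidates(parse))
print(candidates)
```

Because the pair is read off the syntactic link rather than the surface order, “address issue” is recovered even though the noun precedes the verb and several words intervene.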

The background of this research work is described fully. Extensive discussion of other parsing methods, statistical tests that have been used as association measures for collocation extraction, and very detailed appendices are included. The appendices contain lists of collocation dictionaries, definitions of collocations, common association measures, the statistical results of the two experiments, and detailed lists of word pairs identified in French, English, Italian, and Spanish, comparing the parsing and window methods, as well as the judgments of the two linguist judges who rated the combinations as collocations or other kinds of combination, for each language.
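
As an illustration of what an association measure computes, the sketch below implements the log-likelihood ratio over a 2x2 contingency table, one measure commonly used for ranking candidate pairs. The review does not say which measures the book relies on, so the choice of this measure, the function name, and the example counts are assumptions made purely for illustration.

```python
import math


def log_likelihood_ratio(a, b, c, d):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of pair counts.

    a: pairs with both word1 and word2
    b: pairs with word1 but a different second word
    c: pairs with a different first word but word2
    d: pairs with neither word
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))

    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


# Made-up counts: a pair that co-occurs far more often than chance predicts
# scores higher than a pair whose words are distributed independently.
strong = log_likelihood_ratio(30, 20, 10, 9940)
independent = log_likelihood_ratio(10, 10, 10, 10)
print(strong > independent)
```

Candidates would then be sorted by score in descending order, so that the pairs least likely to co-occur by chance appear at the top of the list given to the human judges.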

This book makes an interesting contribution to meeting the challenges posed by the properties inherent to natural language that have made computational approaches difficult.

Reviewer:  Alice Davison Review #: CR139477 (1204-0358)
Natural Language Processing (I.2.7)
Linguistics (J.5 ...)

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®